Life and Death: Predicting Who Survives on the Titanic

Isaac Chan
5 min read · May 17, 2020

The Titanic dataset is probably the most dissected, most viewed and most analysed dataset on Kaggle. It is relatively small, easy to interpret and gives great hands-on experience with Exploratory Data Analysis (EDA) and Machine Learning.

In this post, I would like to share my experience with the EDA. Another post should cover more on Data Imputation and Predictions.

Source: Mfame.Guru

Missing Data?

One of the great aspects of this dataset is that some data is missing, which makes it more realistic. Three features have missing values: the Age of passengers, the Cabin they stayed in and the port where they Embarked.

The Cabin feature had the highest proportion of missing data, at 77.1%, while Age had 19.9% and Embarked had 0.2%.
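A minimal sketch of how such proportions could be computed with pandas. The column names are assumed to follow Kaggle's Titanic file, the `missing_share` helper is hypothetical, and the toy values are made up for illustration (they are not the Titanic data).

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first."""
    return (df.isna().mean() * 100).round(1).sort_values(ascending=False)


# Made-up toy frame; with the real data this would be the loaded Kaggle file.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "S"],
})

shares = missing_share(toy)
print(shares)
```

Taking the mean of the boolean `isna()` mask gives the fraction of missing rows directly, which avoids dividing counts by the number of rows by hand.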

Number of People Who Survived on The Titanic

As we can see, most passengers did not survive the Titanic.

Age of Passengers

We can see that the ages of passengers are actually quite diverse. The bulk of them are aged 20 to 40 years, with some very elderly and very young passengers too.

Age of Passengers and Class

Here, we can see that first class passengers were usually the oldest. They had a higher median age, and the class also included passengers who were much older than those in the other classes. This is most likely because older passengers could afford a more expensive trip, or it could be due to different purposes of travelling.
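The comparison behind a plot like this can be sketched with a simple group-by. The `Pclass` and `Age` column names are assumed from Kaggle's Titanic file, and the ages below are invented purely to show the mechanics.

```python
import pandas as pd

# Made-up ages for two passengers per class (not the real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Age": [38.0, 54.0, 27.0, 35.0, 22.0, 20.0],
})

# Median and maximum age within each passenger class.
age_by_class = toy.groupby("Pclass")["Age"].agg(["median", "max"])
print(age_by_class)
```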

Gender and Class of Passenger

Interestingly, passenger class 3 is dominated by males. We can also observe that overall, there are more males than females. The data never really tells us why this is so. I did some research on the Titanic and found out that many of these passengers were emigrating. Perhaps many of these young men were hoping to look for a better life abroad?

Age and Survival

I think this visualisation does show us some interesting insights. First, older passengers seem to make up a larger proportion of those who survived than of those who did not. This could be because more of the older passengers stayed in First Class cabins, which had higher survival rates.

Second, a sizeable proportion of those who survived were 10 years old or younger. This could be due to the efforts to save young children first before the ship sank.

Third, the ages of those who survived seem to be more normally distributed. Both curves take a somewhat normal shape, with the Survived curve having a lower peak and fatter tails, while the Did Not Survive curve seems more concentrated towards the left, probably due to the higher concentration of passengers in their 20s and 30s.

Survival and Fare

First, the fares seem to be quite skewed. Most fares are below 50, but there are some obvious outliers. Even after applying a log transformation to make the distribution more “normal”, it still seemed quite skewed.
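As a rough sketch of this step, a log transformation can be applied with NumPy and the skewness compared before and after. Using `np.log1p` (the log of 1 + x) is an assumption made here so that any zero fares stay defined; the fare values are made up.

```python
import numpy as np
import pandas as pd

# Made-up fares with one extreme value (not the real data).
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 71.3, 512.3], name="Fare")

# log(1 + x) compresses the long right tail.
log_fares = np.log1p(fares)

# Sample skewness before and after the transformation.
print(fares.skew(), log_fares.skew())
```

In this toy example the skewness drops after the transformation but stays positive, which is the same pattern described above.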

Second, a larger share of those who survived had paid higher fares. Conversely, a higher proportion of those who did not survive had paid very low fares.

Lastly, the point-biserial correlation between fare and survival was 0.23. This implies a positive relationship between fare and survival, although not a strong one.
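The point-biserial correlation is the Pearson correlation between a binary (0/1) variable and a continuous one. The sketch below computes it from its textbook formula and checks it against pandas' default Pearson correlation. The `point_biserial` helper is hypothetical and the toy values are invented; SciPy's `scipy.stats.pointbiserialr` is a library alternative.

```python
import numpy as np
import pandas as pd


def point_biserial(binary: pd.Series, values: pd.Series) -> float:
    """Point-biserial r: scaled difference in group means (population std)."""
    g1 = values[binary == 1]
    g0 = values[binary == 0]
    n1, n0, n = len(g1), len(g0), len(values)
    return (g1.mean() - g0.mean()) / values.std(ddof=0) * np.sqrt(n1 * n0 / n**2)


# Made-up toy data (not the real Titanic values).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 0, 1, 0],
    "Fare": [7.25, 8.05, 7.9, 71.3, 26.0, 13.0, 53.1, 8.4],
})

r_formula = point_biserial(toy["Survived"], toy["Fare"])
r_pearson = toy["Survived"].corr(toy["Fare"])  # Pearson by default
print(r_formula, r_pearson)
```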

Removing Outliers For Fares

To handle the outliers, I decided to set an upper bound on the fare at the 75th percentile plus 1.5 times the Inter-Quartile Range (IQR), and a lower bound at the 25th percentile minus 1.5 times the IQR.
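The bounds described above can be sketched as follows. The `iqr_bounds` helper is a hypothetical name and the fares are made up. Dropping rows outside the bounds is one option; capping them at the bounds would be another, and which one was used here is not stated.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds: Q1 - k * IQR and Q3 + k * IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up fares with one obvious outlier (not the real data).
fares = pd.Series([5.0, 10.0, 15.0, 20.0, 25.0, 30.0, 35.0, 40.0, 500.0])

lower, upper = iqr_bounds(fares)

# Keep only the fares that fall inside the bounds.
kept = fares[fares.between(lower, upper)]
print(lower, upper, len(kept))
```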

Fare and Class

This seems to confirm what we already knew, that fares are higher for First-Class passengers. Conversely, fares are much lower for Third-Class passengers.

Also, we can see that the spread of fares from the median to the 75th percentile is much wider than that from the 25th percentile to the median. This reflects the skewness of the fare distribution that we saw earlier.

Missing Data For Cabin

Some simple data analysis shows that a significant proportion of passengers in Class 2 (91%) and Class 3 (98%) have missing data for which cabin they were staying in, whereas only 23% of Class 1 passengers had missing cabin data.
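One way such a breakdown could be produced is to group the missing-value mask by passenger class. This is only a sketch: the `Pclass` and `Cabin` column names are assumed from Kaggle's Titanic file and the rows are made up.

```python
import pandas as pd

# Made-up toy rows (not the real data).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 2, 2, 3, 3],
    "Cabin": ["C85", "C123", None, "E46", None, "F33", None, None],
})

# Percentage of missing Cabin values within each class.
missing_by_class = toy["Cabin"].isna().groupby(toy["Pclass"]).mean() * 100
print(missing_by_class)
```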

Survival and Gender

Interestingly, most men on the Titanic did not survive: only 18% of them did. Conversely, 70% of women survived! Again, this could be due to women and children being evacuated from the ship first.
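Because survival is coded as 0 or 1, the mean within each group is the survival rate. A small sketch with made-up rows, assuming the Kaggle column names `Sex` and `Survived`:

```python
import pandas as pd

# Made-up toy rows (not the real data).
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female", "female"],
    "Survived": [0, 0, 0, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group = share of survivors in that group.
survival_rate = toy.groupby("Sex")["Survived"].mean()
print(survival_rate)
```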

Relationship Between Age and Fare

Since both of these variables are continuous, and I wasn’t sure whether the relationship was linear, I decided to use the Spearman correlation coefficient to test the relationship, which yielded a result of 0.08. The scatter plot also suggests that any direct relationship is weak. However, when we plotted Age against Passenger Class, we could see that older passengers occupied First Class more, signifying that there is a relationship between fare and age, although it’s more indirect.
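Spearman's coefficient is the Pearson correlation of the ranks, which is why it does not require a linear relationship. The sketch below computes it that way on made-up values; dropping rows with a missing Age first is an assumption, as the handling of missing ages is not described here. pandas also offers `DataFrame.corr(method="spearman")` as a built-in route.

```python
import numpy as np
import pandas as pd

# Made-up values: Fare rises with Age, but not linearly (not the real data).
toy = pd.DataFrame({
    "Age": [2.0, 20.0, 30.0, 40.0, 60.0, np.nan],
    "Fare": [5.0, 8.0, 9.0, 30.0, 500.0, 7.0],
})

# Assumption: rows with a missing Age are dropped before ranking.
pair = toy[["Age", "Fare"]].dropna()

# Spearman = Pearson correlation computed on the ranks.
rho = pair["Age"].rank().corr(pair["Fare"].rank())
pearson = pair["Age"].corr(pair["Fare"])
print(rho, pearson)
```

In this toy case the ranks agree perfectly, so Spearman is 1 while Pearson is lower, illustrating how the two can differ when a relationship is monotonic but not linear.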

Conclusion

This article explored the relationships between variables, and we picked up some interesting trends along the way. In the next one, I hope to perform data imputation and run some Machine Learning models :)

Isaac Chan

An NLP Data Scientist always seeking to improve his skills :)