Titanic- Machine Learning From Disaster

The sinking of the Titanic in 1912 is one of the best-known maritime disasters in history. Beyond the tragedy itself, the passenger dataset associated with the Titanic has become a standard resource for people learning data science and machine learning. In this tutorial, we explore why the dataset is such a useful learning platform and work through data preparation, feature engineering, and predictive modeling.

It is considered a beginner-level dataset, so we will work through it at a basic level.

Dataset Description

The training dataset serves as the foundation for constructing your machine-learning models. Within this training dataset, the outcomes for each passenger, commonly referred to as the "ground truth," are provided. Your model's construction relies on various attributes, or "features," such as the passengers' gender and class. Additionally, the possibility of feature engineering exists, allowing for the creation of novel attributes.

Data Glossary

| Feature | Explanation | Key Code |
|---|---|---|
| Survival | Survival | 0 = Not survived, 1 = Survived |
| Pclass | Ticket class | 1 = 1st class, 2 = 2nd class, 3 = 3rd class |
| sex | Gender | |
| Age | Age in years | |
| sibsp | Number of siblings/spouses on Titanic | |
| parch | Number of parents/children on Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Feature Elaboration

  • pclass: This variable serves as a proxy for socio-economic status (SES):
      • 1st = Upper class
      • 2nd = Middle class
      • 3rd = Lower class
  • age: If the age is less than 1, it is expressed as a fraction. If the age is estimated, it appears as xx.5.
  • sibsp: The dataset defines family relationships as follows:
      • Sibling = brother, sister, stepbrother, stepsister
      • Spouse = husband, wife (mistresses and fiancés were not considered)
  • parch: The dataset defines family relationships as follows:
      • Parent = mother, father
      • Child = daughter, son, stepdaughter, stepson
      • Certain children traveled only with a nanny; hence parch=0 for them.

Challenge

The challenge posed by the Titanic dataset is to predict which passengers survived the disaster based on various features such as their age, gender, class, and more. This challenge falls under the umbrella of supervised learning, as we have labeled data for the training set and aim to predict the labels for the test set.

Code Implementation

  • Importing Libraries
  • Reading the Dataset
  • Analyzing the Data

First, we will look at which features the dataset contains.
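
As a minimal sketch of the import, read, and inspect steps: the snippet below loads a few sample rows laid out like the training file and lists the column names. The inline CSV and the file name train.csv mentioned in the comment are assumptions made for illustration; with the real data you would read the file directly.

```python
import io
import pandas as pd

# A few sample rows in the train.csv column layout, inlined so the sketch runs
# on its own. With the real file you would call pd.read_csv("train.csv")
# instead (the path is an assumption).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
"""

train_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# List the feature names available in the dataset.
print(train_df.columns.values)
```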

The dataset has quite a few attributes. Next, we need to classify these features as either categorical or numerical.

Categorical Features: These values classify the samples into groups of similar samples. Categorical features can be nominal or ordinal. Knowing this helps in choosing suitable plots for visualization, among other purposes.

Categorical: Survived, Sex, and Embarked. Ordinal: Pclass.

Numerical Features: These values vary across different instances. Numerical features encompass discrete, continuous, or time-series-based values. This assists in determining suitable plots for visualization, among other uses.

Continuous: Age, Fare. Discrete: SibSp, Parch.

Mixed data types refer to the inclusion of both numeric and alphanumeric data within a single feature. These instances are potential candidates for corrective measures toward achieving the desired objective. We do have mixed data types in our dataset.

Here, the ticket information comprises a combination of numeric and alphanumeric data types, while the cabin data consists solely of alphanumeric characters.

Analyzing errors and typos within features can be more challenging in extensive datasets. However, examining a subset of samples from a smaller dataset might readily reveal the features that need rectification.

Here, if you look carefully, the "Name" attribute could potentially include inaccuracies or typographical errors, as various formats are employed to represent names. These formats encompass titles, parentheses, and quotation marks, often utilized for alternate or abbreviated names.

In the training dataset, the Cabin, Age, and Embarked features contain null values, in that order from most to fewest.

There are seven features with integer or float data types, and this count reduces to six in the case of the test dataset. Additionally, five features are represented as strings or objects.
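
Counts like these can be obtained with a couple of pandas calls. The sketch below uses a handful of made-up rows (not the real data) to show how null counts and column types might be inspected.

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration, with gaps in Age, Cabin, and Embarked.
df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Count nulls per column, most incomplete first.
null_counts = df.isnull().sum().sort_values(ascending=False)
print(null_counts)

# Separate numeric columns from string/object columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```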

Let's look at the distribution of the numerical features:

  • The dataset comprises a total of 891 samples, which corresponds to approximately 40% of the actual passenger count on the Titanic (2,224).
  • The feature "Survived" is categorical, taking values of 0 or 1.
  • Approximately 38% of the samples represent passengers who survived, somewhat above the actual survival rate of around 32%.
  • A significant majority of passengers (>75%) did not have parents or children accompanying them.
  • Nearly 30% of passengers had siblings and/or a spouse with them on the Titanic.
  • Fares varied considerably, with a small proportion of passengers (<1%) paying as much as $512.
  • Very few passengers (<1%) were elderly, within the age range of 65-80.

Let's look at the distribution of the categorical features:

  • Names are unique across the dataset: the total count and the number of unique entries are both 891.
  • The "Sex" variable presents two distinct values, with approximately 65% of the entries categorized as male, making "male" the most frequent value with a count of 577 out of 891.
  • Cabin values exhibit a noticeable presence of duplicates across multiple samples, implying that certain cabins were shared among multiple passengers.
  • The "Embarked" variable encompasses three possible values. The majority of passengers used the "S" port, which emerged as the predominant choice.
  • The "Ticket" feature demonstrates a relatively high proportion (22%) of duplicate values, resulting in a total of 681 unique entries out of 891.

Laying Down Assumptions

We will establish hypotheses derived from the data analysis conducted thus far. We might also seek to further substantiate these hypotheses before making necessary decisions. The following are the assumptions:

  • Our aim is to assess the degree of correlation between each feature and survival. This initial examination will be followed by a comparison with model-derived correlations at a later stage.
  • We should consider filling in missing data for the Age feature, as it appears to be linked to survival.
  • Completing the Embarked feature could be worthwhile as it might correlate with survival or another significant feature.
  • Due to its high duplicate ratio (22%), we might exclude the Ticket feature from our analysis, as it likely lacks a correlation with survival.
  • Given its significant incompleteness and numerous null values in both training and test datasets, we may opt to discard the Cabin feature.
  • Since PassengerId doesn't seem to contribute to survival, it could be omitted from the training dataset.
  • The Name feature is relatively unconventional and may not directly impact survival; hence, it's a candidate for removal.
  • To gauge the overall family size on board, we might craft a new feature called Family based on Parch and SibSp.
  • Extracting the Title from the Name feature could provide additional insight after engineering.
  • Transforming the Age feature into ordinal categories through the creation of Age bands is a potential strategy.
  • A Fare range feature could be generated if it aids in our analysis.
  • Females (Sex=female) exhibited higher survival rates.
  • Children (Age<?) had better survival odds.
  • Passengers in the upper class (Pclass=1) were more likely to survive.

Pivoting Features

We can swiftly assess the correlations between our features by creating pivot tables that cross-reference different features. However, this analysis is currently limited to features without any missing values. Additionally, it's prudent to perform this analysis specifically for features that fall into categorical (Sex), ordinal (Pclass), or discrete (SibSp, Parch) types.
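
A minimal sketch of such a pivot, assuming a DataFrame with Survived plus the feature of interest. The rows below are made up for illustration and the helper name survival_by is hypothetical.

```python
import pandas as pd

# Toy rows, made up for illustration only.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "male", "female", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

def survival_by(frame, feature):
    """Mean of Survived per value of `feature`, highest rate first."""
    return (frame[[feature, "Survived"]]
            .groupby(feature, as_index=False)
            .mean()
            .sort_values(by="Survived", ascending=False))

by_class = survival_by(df, "Pclass")
by_sex = survival_by(df, "Sex")
print(by_class)
print(by_sex)
```

Because Survived is coded as 0 or 1, the mean of each group is that group's survival rate.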

  • Pclass: A notable correlation (>0.5) is evident between Pclass=1 and Survived. We opt to incorporate this feature into our model.
  • Sex: Our initial observation holds true, with a remarkably high survival rate of 74% among females.
  • SibSp and Parch: Some values of these features show no correlation with survival. It might be prudent to derive a new feature, or a combination of features, from them.

Visualising Data

We can now proceed to validate certain assumptions by utilizing visualizations to analyze the data further.

Correlation Between Numerical Features

We will initiate our analysis by comprehending the correlations between numerical features and our desired outcome (Survived).

Upon careful examination of the data, the following observations were noted:

  • Infants (Age <= 4) had a high survival rate.
  • The oldest passenger (Age = 80) survived.
  • A considerable number of passengers aged 15-25 did not survive.
  • The majority of passengers fell within the 15-35 age range.

Consequently, the subsequent decisions were taken:

  • Age should be integrated as a feature during model training.
  • Efforts should be made to fill in missing values for the Age feature.
  • Passengers' ages should be categorized into specific groups.

Correlation Between Numerical and Ordinal Features

We have the ability to merge several features to detect correlations through a singular visualization. This approach is applicable to both numerical and categorical features that possess numeric values.

Upon careful examination of the data, the following observations were noted:

  • Pclass=3 accommodated the largest number of passengers; however, the majority did not survive.
  • Infants traveling in Pclass=2 and Pclass=3 predominantly survived.
  • Most passengers in Pclass=1 emerged as survivors.
  • Pclass exhibited variations in the distribution of passengers' ages.

In light of these observations, the subsequent decision was taken:

  • Pclass should be incorporated as a feature during model training.

Correlation Between Categorical Features

Next, we will establish correlations between categorical attributes and our objective for the solution.

Upon careful examination of the data, the following observations were noted:

  • Female passengers had a notably higher survival rate than male passengers.
  • An exception occurred for Embarked=C, where male passengers had the higher survival rate. This may reflect a correlation between Pclass and Embarked, and in turn between Pclass and Survived, rather than a direct correlation between Embarked and Survived.
  • Among male passengers, the survival rate was higher for those in Pclass=3 than for those in Pclass=2 at the C and Q ports.
  • The ports of embarkation showed varying survival rates, particularly for Pclass=3 and among male passengers.

Consequently, the subsequent decisions were taken:

  • Incorporate the Sex feature into the model training.
  • Address and integrate the Embarked feature, ensuring its inclusion in the model training.

Correlation Between Categorical and Numerical Features

We might also consider examining potential correlations between categorical features (with non-numeric values) and numeric features. For instance, we can explore the correlation between Embarked (a categorical non-numeric feature), Sex (another categorical non-numeric feature), Fare (a numeric continuous feature), and Survived (a categorical numeric feature).

Upon careful examination of the data, the following observations were noted:

  • Passengers who paid higher fares had a higher survival rate, which supports our assumption about creating fare range categories.
  • Survival rates appear to correlate with the port of embarkation, which confirms our earlier assumptions about correlating features with survival and completing the Embarked feature.

Consequently, the subsequent decisions were taken:

  • It's advisable to group the Fare feature into bands for model training.

Data Wrangling

Data wrangling, also known as data munging or data preprocessing, is the process of cleaning, transforming, and organizing raw data into a more structured and usable format for analysis. It involves various tasks such as handling missing values, removing duplicates, formatting data, and merging data from different sources. Data wrangling ensures that the data is accurate, consistent, and ready for further analysis, making the overall data analysis process smoother and more effective.

Dropping Features

Dropping features leaves us with fewer columns to process. This speeds up model training and simplifies the analysis.

We intend to eliminate the Cabin and Ticket features.

Creating New Features

We aim to examine whether the Name feature can be manipulated to extract titles and then assess the connection between titles and survival rates. This analysis will be conducted before we decide to remove the Name and PassengerId features.

Now, we will use a regular expression to extract the Title feature. The pattern (\w+\.) matches the first word in the Name feature that ends with a dot character. Because the pattern has a single capture group, passing expand=False returns a Series rather than a DataFrame.
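
The extraction can be sketched as follows on a few sample names. Stripping the trailing dot with str.rstrip is an extra step assumed here for convenience, since the pattern as written captures the dot as well.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# (\w+\.) captures the first run of word characters that ends in a dot.
# With one capture group and expand=False, the result is a Series.
titles = names.str.extract(r"(\w+\.)", expand=False).str.rstrip(".")
print(titles.tolist())
```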

Upon careful examination of the data, the following observations were noted:

  • Many titles effectively group Age ranges. For instance, the Master title corresponds to an Age mean of 5 years.
  • Survival rates slightly differ across various Title Age groups.
  • Specific titles exhibit higher survival rates (e.g., Mme, Lady, Sir) or lower survival rates (e.g., Don, Rev, Jonkheer).

Consequently, the subsequent decisions were taken:

  • We opt to keep the newly derived Title feature for training our model.

We have the option to substitute numerous titles with a more common name or categorize them as "Rare."

We can convert the categorical titles to ordinal.
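
A minimal sketch of both steps, grouping uncommon titles and then mapping titles to integers. The specific groupings and the integer codes below are assumptions chosen for illustration, not values confirmed by the text.

```python
import pandas as pd

titles = pd.Series(["Mr", "Mlle", "Mme", "Ms", "Master", "Rev", "Don", "Miss", "Mrs"])

# Assumed groupings: merge spelling variants into the common title, then
# label anything outside the common set as "Rare".
titles = titles.replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})
common = ["Mr", "Miss", "Mrs", "Master"]
titles = titles.where(titles.isin(common), "Rare")

# Assumed integer codes; 0 is reserved for any title that fails to map.
title_codes = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
title_ordinal = titles.map(title_codes).fillna(0).astype(int)
print(title_ordinal.tolist())
```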

We can now confidently remove the Name feature from the datasets. Additionally, there is no requirement for the PassengerId feature in the dataset.

Converting Categorical Feature

We can proceed with transforming features that hold string values into numerical ones, as most model algorithms necessitate this format. This conversion is also essential for accomplishing our feature completion objective. To initiate this process, we will transform the Sex feature into a new feature named Gender, where female corresponds to 1 and male to 0.
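
The mapping described above could look like this on a toy column:

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# female -> 1, male -> 0, stored in a new Gender column.
df["Gender"] = df["Sex"].map({"female": 1, "male": 0}).astype(int)
print(df)
```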

Continuous Features

Now, we need to commence the estimation and filling of features that have missing or null values. Let's initiate this process with the Age feature.

There are three approaches we can consider to fill in missing values for a numerical continuous feature:

  • A straightforward method is to generate random numbers within one standard deviation of the mean.
  • A more accurate way to estimate missing values is to use other correlated features. In our case, we notice a correlation between Age, Gender, and Pclass. We can estimate Age using the median values of Age across combinations of Pclass and Gender: the median Age for Pclass=1 and Gender=0, for Pclass=1 and Gender=1, and so on.
  • Combine methods 1 and 2: instead of using the median, generate random numbers within one standard deviation of the mean for each combination of Pclass and Gender.

Both methods 1 and 3 can introduce random fluctuations into our models, leading to varying outcomes in multiple runs. Therefore, we will opt for method 2 as our preferred choice.

We'll initiate the process by creating an empty array to hold estimated Age values, which will be determined by combinations of Pclass and Gender.

Next, we'll loop through Gender (0 or 1) and Pclass (1, 2, 3) to compute estimated Age values for the six possible combinations.
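
The loop can be sketched as below. The rows are made up for illustration; combinations with no known ages are skipped, which is a choice made for this sketch.

```python
import numpy as np
import pandas as pd

# Toy rows, made up for illustration, with two missing ages.
df = pd.DataFrame({
    "Gender": [0, 0, 0, 1, 1, 1, 0, 1],
    "Pclass": [1, 1, 1, 3, 3, 3, 1, 3],
    "Age":    [40.0, 50.0, np.nan, 20.0, 24.0, np.nan, 60.0, 22.0],
})

# guess_ages[g, p - 1] holds the median Age for Gender=g and Pclass=p.
guess_ages = np.zeros((2, 3))

for gender in (0, 1):
    for pclass in (1, 2, 3):
        group = (df["Gender"] == gender) & (df["Pclass"] == pclass)
        known = df.loc[group, "Age"].dropna()
        if known.empty:
            continue  # no known ages for this combination
        guess_ages[gender, pclass - 1] = known.median()
        # Fill only the missing ages within this combination.
        df.loc[group & df["Age"].isnull(), "Age"] = guess_ages[gender, pclass - 1]

print(df)
```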

We'll establish age groups and assess their correlations with Survived.

We will substitute Age values with ordinal numbers corresponding to these age groups.
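
One way to sketch the banding and the ordinal substitution is with pd.cut. The band edges below (16-year steps up to 80) are an assumption chosen for illustration, and the rows are made up.

```python
import pandas as pd

# Toy rows, made up for illustration.
df = pd.DataFrame({
    "Age":      [4, 15, 22, 30, 40, 47, 55, 70],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Assumed band edges; each band is open on the left and closed on the right.
edges = [0, 16, 32, 48, 64, 80]
df["AgeBand"] = pd.cut(df["Age"], bins=edges)

# Survival rate per age band.
print(df.groupby("AgeBand", observed=True)["Survived"].mean())

# Replace Age with the ordinal index of its band (0 = youngest band).
df["Age"] = pd.cut(df["Age"], bins=edges, labels=False)
print(df["Age"].tolist())
```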

We can now remove the AgeBand feature.

New Feature Through Existing Features

We have the option to generate a fresh attribute called FamilySize by amalgamating Parch and SibSp. This would then allow us to remove Parch and SibSp from our datasets.

We have the opportunity to produce an additional attribute referred to as IsAlone.
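
A minimal sketch of both derived features. Counting the passenger themselves in FamilySize (the +1) is an assumption of this sketch.

```python
import pandas as pd

df = pd.DataFrame({"SibSp": [1, 0, 3, 0], "Parch": [0, 0, 2, 1]})

# Assumption: the passenger counts as a family member, hence the +1.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# IsAlone is 1 when no other family member is on board.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
print(df)
```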

We should drop the Parch, SibSp, and FamilySize attributes and instead consider the IsAlone feature.

We can also generate a synthetic attribute by combining Pclass and Age.
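
As a short sketch, assuming Age already holds the ordinal band index and using the column name Age*Class:

```python
import pandas as pd

# Age is assumed to be the ordinal age-band index at this point.
df = pd.DataFrame({"Age": [0, 1, 2, 4], "Pclass": [3, 1, 2, 3]})

# Product of the two ordinal features.
df["Age*Class"] = df["Age"] * df["Pclass"]
print(df)
```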

Categorical Feature

The Embarked feature is represented by S, Q, and C values, denoting the port of embarkation. Our dataset contains two instances with missing values in this feature. We can conveniently replace these gaps with the most frequent occurrence.

Categorical to Numerical

We can proceed by transforming the EmbarkedFill feature into a fresh numeric feature called port.
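
Filling the gaps with the most frequent port and then converting ports to numbers could be sketched as follows. The integer codes and the column name Port are assumptions for illustration, and the rows are made up.

```python
import pandas as pd

# Toy column, made up for illustration, with two missing ports.
df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# Most frequent port among the known values.
freq_port = df["Embarked"].dropna().mode()[0]
df["Embarked"] = df["Embarked"].fillna(freq_port)

# Assumed integer codes for the three ports.
df["Port"] = df["Embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int)
print(df)
```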

We can now create FareBand.

Transform the Fare feature into ordinal values using the FareBand categories.
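
One possible sketch uses pd.qcut to build equal-frequency bands. Using four bands is an assumption of this sketch, and the fares are made up.

```python
import pandas as pd

# Toy fares, made up for illustration.
df = pd.DataFrame({"Fare": [7.25, 7.92, 8.05, 13.0, 26.0, 31.0, 53.1, 71.28]})

# Four equal-frequency bands (quartiles); the count of four is an assumption.
df["FareBand"] = pd.qcut(df["Fare"], 4)
print(df["FareBand"].value_counts().sort_index())

# Replace Fare with the ordinal index of its band, then drop the helper column.
df["Fare"] = pd.qcut(df["Fare"], 4, labels=False)
df = df.drop(columns="FareBand")
print(df["Fare"].tolist())
```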

Modeling

Now we are ready to train a model and predict the required solution. There are more than 60 predictive modeling algorithms to choose from, so we narrow the selection by considering the type of problem and the solution requirements. Ours is a classification and regression problem: we want to identify the relationship between the output (whether a passenger survived or not) and other variables or features (such as Gender, Age, and Port of embarkation). It is also supervised learning, because we train our model on a labeled dataset. With these two criteria, supervised learning plus classification and regression, we can narrow our choice to a few models. These include:

  • Logistic Regression
  • KNN (k-Nearest Neighbors)
  • Support Vector Machines
  • Naive Bayes classifier
  • Decision Tree
  • Random Forest
  • Perceptron
  • Artificial neural network
  • RVM (Relevance Vector Machine)

Logistic Regression

Logistic Regression is a useful model to run early in the workflow. It measures the relationship between a categorical dependent variable (the target) and one or more independent variables (features) by estimating probabilities with a logistic function, which is the cumulative logistic distribution.

We can employ Logistic Regression to validate our assumptions and decisions regarding feature creation and completion objectives. This can be achieved by calculating the coefficients of the features in the decision function.

Positive coefficients amplify the log-odds of the response (and subsequently boost the probability), whereas negative coefficients diminish the log-odds of the response (and thereby lower the probability).

  • The highest positive coefficient is associated with Sex, indicating that as the value of Sex increases (from male: 0 to female: 1), the likelihood of Survived=1 increases significantly.
  • Conversely, as Pclass value increases, the likelihood of Survived=1 decreases notably.
  • This underscores the significance of Age*Class as a valuable artificial feature to model, given its second highest negative correlation with Survived.
  • Similarly, Title also demonstrates a strong positive correlation as the second highest.
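
To make the effect of the coefficient signs concrete, here is a small numeric sketch of the logistic function. The coefficients and intercept are hypothetical values picked for illustration; they are not fitted to the Titanic data.

```python
import numpy as np

def survival_probability(features, coefficients, intercept):
    """Logistic model: probability = sigmoid(intercept + features . coefficients)."""
    log_odds = intercept + np.dot(features, coefficients)
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical coefficients for [Sex, Pclass]; not fitted to the real data.
coefficients = np.array([2.2, -0.75])
intercept = 0.0

male_third = survival_probability(np.array([0, 3]), coefficients, intercept)
female_third = survival_probability(np.array([1, 3]), coefficients, intercept)
female_first = survival_probability(np.array([1, 1]), coefficients, intercept)
print(male_third, female_third, female_first)
```

With a positive coefficient on Sex, moving from 0 to 1 raises the probability; with a negative coefficient on Pclass, moving from 3 to 1 raises it as well.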

SVM

Support Vector Machines (SVMs) are models in supervised learning equipped with corresponding learning algorithms that examine data for classification and regression analysis. When provided with a collection of training samples, each labeled as part of either of two classes, an SVM training algorithm constructs a model that assigns new test samples to one of these two categories. This characteristic classifies SVM as a non-probabilistic binary linear classifier.

K-Nearest Neighbour

In pattern recognition, the k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors: it is assigned to the class most common among its k nearest neighbors (where k is a positive integer, typically small). When k equals 1, the sample is simply assigned to the class of its single nearest neighbor.
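
The voting rule can be sketched from scratch in a few lines. This is an illustrative implementation with Euclidean distance on made-up 2-D points, not the library classifier; the function name knn_predict is hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(train_X, train_y, query, k=3):
    """Return the most common label among the k training points nearest to query."""
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(int(train_y[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D points in two well-separated groups, made up for illustration.
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
train_y = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict(train_X, train_y, np.array([0.5, 0.5]), k=3)
near_far_group = knn_predict(train_X, train_y, np.array([5.5, 5.5]), k=1)
print(near_origin, near_far_group)
```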

Naive Bayes

Naive Bayes classifiers belong to a category of straightforward probabilistic classifiers that utilize Bayes' theorem while assuming strong (naive) independence among the features. These classifiers are known for their excellent scalability, as they necessitate a number of parameters that grow linearly with the count of variables (features) in a given learning scenario.

Perceptron

The perceptron is an algorithm for supervised learning of binary classifiers, that is, functions that decide whether an input, represented by a vector of numbers, belongs to a particular class. It is a type of linear classifier: it makes its predictions with a linear predictor function that combines a set of weights with the feature vector. The algorithm also supports online learning, because it processes the training samples one at a time.

Linear SVC

Stochastic Gradient Descent

Decision Tree

This model employs a decision tree as a predictive tool that establishes a connection between features (depicted as branches of the tree) and conclusions regarding the target value (represented by tree leaves). When the target variable has a finite range of values, the tree structures are referred to as classification trees. In these structures, class labels are found on the leaves, while branches represent combinations of features that correspond to those class labels. On the other hand, if the target variable can have continuous values, usually real numbers, the resulting decision trees are known as regression trees.

Random Forest

Random Forests are one of the most popular techniques. They are an ensemble learning method for classification and regression that builds a large number of decision trees (n_estimators=100) during training and then outputs the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.

The model confidence score is the highest among the models evaluated so far.

Model Evaluation

We can now assess and compare the performance of all our models to determine the optimal one for our task. Although Decision Tree and Random Forest yield identical scores, we opt for Random Forest due to its capability to mitigate the tendency of decision trees to excessively adapt to their training data, a phenomenon known as overfitting.
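
A comparison of this kind could be sketched with scikit-learn as follows. The data is a made-up grid rather than the Titanic features, and scoring on the training data itself is an assumption about how the scores were computed; a held-out validation set would give a fairer picture of overfitting.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data made up for illustration: two ordinal features on a 4x4 grid,
# with the label set by a simple rule.
X = np.array([[a, b] for a in range(4) for b in range(4)])
y = (X[:, 0] + X[:, 1] > 3).astype(int)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Accuracy on the training data itself, as a percentage.
scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = round(model.score(X, y) * 100, 2)

# Print the models from best to worst score.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(name, score)
```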

We therefore choose Random Forest over Decision Tree, because it is less prone to overfitting.

Conclusion

The Titanic dataset has evolved from being a historical record of a tragic event to a valuable tool for learning and practicing machine-learning techniques. It provides a hands-on opportunity to explore data preprocessing, feature engineering, exploratory data analysis, model building, and evaluation. Aspiring data scientists and machine learning enthusiasts can gain a deeper understanding of the intricacies and challenges involved in real-world data analysis through the Titanic- Machine Learning From Disaster competition.





