Case Study: Exploratory Data Analysis of the Titanic Dataset#
To see how these ideas come together, we now explore a real dataset. The Titanic dataset records information about each passenger and whether they survived the disaster.
Load Dataset#
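import matplotlib.pyplot as plt  # used later for plot titles
import seaborn as sns

# Load seaborn's example Titanic dataset
df = sns.load_dataset("titanic")
df.head()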
Structure#
df.shape
df.info()
Key variables include:
survival status (`survived`, coded 0 or 1)
passenger class (`pclass`, `class`)
age (`age`)
fare (`fare`)
gender (`sex`)
Missing Values#
df.isnull().sum()
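Raw counts are easier to judge next to the share of rows they represent. The sketch below is one way to tabulate both; the helper name `missing_summary` is an illustrative choice, not part of pandas or seaborn.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns with gaps."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (100 * counts / len(frame)).round(1),
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)
```

Calling `missing_summary(df)` on the frame loaded earlier should list columns such as `age` and `deck`, which helps decide whether to drop, impute, or simply note the gaps.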
Survival Count#
sns.countplot(data=df, x="survived")
Survival by Gender#
sns.countplot(data=df, x="sex", hue="survived")
Observation: Women survived at a much higher rate than men.
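A count plot shows group sizes, but the groups are not the same size, so rates are easier to compare than bar heights. Because `survived` is coded 0/1, its mean within a group is that group's survival rate. A minimal sketch, where `survival_rate_by` is a hypothetical helper name:

```python
import pandas as pd


def survival_rate_by(frame: pd.DataFrame, column: str) -> pd.Series:
    """Share of survivors within each level of `column`.

    Assumes `survived` is coded 0 (died) / 1 (survived), so the mean is a rate.
    """
    # observed=True skips unused levels when `column` is categorical.
    return frame.groupby(column, observed=True)["survived"].mean()
```

With the Titanic frame, `survival_rate_by(df, "sex")` returns one rate per sex.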
Survival by Passenger Class#
sns.countplot(data=df, x="class", hue="survived")
plt.title("Survival by Class")
plt.show()
Observation: Higher-class passengers appear more likely to survive.
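To put numbers on what the plot suggests, the counts can be converted to row proportions with a cross-tabulation. This is a sketch under the assumption that `survived` holds 0/1 values; `survival_table` is an illustrative name.

```python
import pandas as pd


def survival_table(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Cross-tabulate `column` against `survived`, as row proportions."""
    # normalize="index" divides each row by its total, so every row sums to 1.
    return pd.crosstab(frame[column], frame["survived"], normalize="index")
```

For example, `survival_table(df, "class")` gives, for each class, the fraction who died (column `0`) and the fraction who survived (column `1`).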
Age Distribution#
df["age"].hist(bins=30)
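Observation: Ages range from infants to elderly passengers. The histogram silently omits passengers whose age is missing, so check the missing-value counts before interpreting it.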
Fare vs Survival#
sns.boxplot(data=df, x="survived", y="fare")
Observation: Survivors tend to have paid higher fares.
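Fares are strongly right-skewed, so a few very expensive tickets can dominate a mean. One hedged way to back up the boxplot is to compare medians alongside means for each group; `fare_summary` below is a hypothetical helper.

```python
import pandas as pd


def fare_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Median, mean, and count of `fare` for each value of `survived`."""
    return frame.groupby("survived")["fare"].agg(["median", "mean", "count"])
```

If the median for `survived == 1` is clearly above the median for `survived == 0` in `fare_summary(df)`, the difference is not driven only by outliers.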
Multivariate View#
sns.pairplot(df[["age","fare","survived"]].dropna(), hue="survived")
This allows simultaneous comparison of multiple relationships.
What We Learned from the Case Study#
From simple visual exploration, we discovered:
survival differs by gender
survival differs by passenger class
fare relates to survival
age varies widely
These patterns suggest hypotheses, but they do not prove causation.
Further statistical analysis would be required to confirm them.
Interpretation: EDA reveals patterns but does NOT prove causation.
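As one possible first step toward that further analysis, a Pearson chi-square statistic measures how far a table of observed counts is from what independence would predict. The sketch below computes only the statistic and its degrees of freedom, without a continuity correction; turning it into a p-value needs the chi-squared distribution from a statistics library. The function name `chi_square_statistic` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def chi_square_statistic(frame: pd.DataFrame, row: str, col: str):
    """Return (statistic, degrees_of_freedom) for two categorical columns."""
    observed = pd.crosstab(frame[row], frame[col]).to_numpy(dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    # Expected counts if the two variables were independent.
    expected = row_totals @ col_totals / observed.sum()
    statistic = np.sum((observed - expected) ** 2 / expected)
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    return float(statistic), dof
```

A large value from `chi_square_statistic(df, "sex", "survived")` relative to its degrees of freedom would indicate an association, but an association is still not evidence of a causal effect.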
Mini Exercises: Try It Yourself#
Which passenger class had the highest survival rate?
Were higher fares associated with higher survival?
Plot the age distribution by survival.
Check the correlation between fare and age.
Identify one unexpected pattern.
BONUS: Visual Cheat Sheet#
| Goal | Visualization |
|---|---|
| Distribution | Histogram |
| Outliers | Boxplot |
| Counts | Bar chart |
| Relationship | Scatter plot |
| Correlation | Heatmap |
| Multivariable | Pairplot |