Case Study: Exploratory Data Analysis of the Titanic Dataset#

To see how these ideas come together, we now explore a real dataset. The Titanic dataset contains information about passengers and whether they survived the disaster.

Load Dataset#

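import matplotlib.pyplot as plt  # needed later for plt.title() and plt.show()
import seaborn as sns

# Load the Titanic dataset that ships with seaborn's example data
df = sns.load_dataset("titanic")
df.head()  # preview the first five rows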

Structure#

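print(df.shape)  # (rows, columns); print so it shows even when not the last line of a cell
df.info()        # column names, dtypes, and non-null counts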

Key variables include:

  • survival status (column: survived)

  • passenger class (columns: pclass, class)

  • age (column: age)

  • fare (column: fare)

  • gender (column: sex)

Missing Values#

df.isnull().sum()
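Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` (it is not part of pandas or seaborn). So that the block runs on its own, it uses a few invented rows; in the notebook you would pass the `df` loaded above instead.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages, largest first.

    Hypothetical helper for illustration; columns without any
    missing values are left out of the result.
    """
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("n_missing", ascending=False)


# Toy rows invented for this example (not real Titanic records).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "deck": [None, "C", None, None],
})
print(missing_summary(toy))
```

A column that is mostly empty is usually a candidate for dropping, while one with a modest share of gaps may be worth imputing; the percentages make that distinction visible.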

Survival Count#

sns.countplot(data=df, x="survived")

Survival by Gender#

sns.countplot(data=df, x="sex", hue="survived")

Observation: Female passengers survived at a much higher rate than male passengers.
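A count plot compares raw counts, which can mislead when the groups differ in size. A survival rate per group removes that effect. The sketch below assumes a hypothetical helper `survival_rate` and a handful of invented rows so that it is self-contained; with the real data you would call it on `df`.

```python
import pandas as pd


def survival_rate(frame: pd.DataFrame, by: str) -> pd.Series:
    """Share of survivors within each level of the column `by`.

    Works because `survived` is coded 0/1, so the group mean
    equals the proportion of survivors.
    """
    return frame.groupby(by, observed=True)["survived"].mean()


# Toy rows invented for this example (not real Titanic records).
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "male", "female"],
    "survived": [0, 1, 1, 0, 1, 0],
})
rates = survival_rate(toy, "sex")
print(rates)
```

The same helper can be reused for any grouping column, for example `survival_rate(df, "class")`.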

Survival by Passenger Class#

sns.countplot(data=df, x="class", hue="survived")
plt.title("Survival by Class")
plt.show()

Observation: Higher-class passengers appear more likely to survive.
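To put numbers behind a grouped count plot, a row-normalized cross-tabulation shows the proportion of each outcome within every class. This is a small sketch on invented rows; the column names mirror those in `df`, but the values are made up for illustration.

```python
import pandas as pd

# Toy rows invented for this example (not real Titanic records).
toy = pd.DataFrame({
    "class": ["First", "First", "Second", "Third", "Third", "Third"],
    "survived": [1, 1, 1, 0, 0, 1],
})

# normalize="index" divides each row by its row total, so every
# class row sums to 1 and the cells are within-class proportions.
table = pd.crosstab(toy["class"], toy["survived"], normalize="index")
print(table.round(2))
```

Reading across a row answers "of the passengers in this class, what fraction survived?", which is the comparison the plot suggests visually.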

Age Distribution#

df["age"].hist(bins=30)

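Observation: Ages vary widely, from infants to elderly passengers. The histogram leaves out rows where age is missing, so it describes only passengers with a recorded age.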

Fare vs Survival#

sns.boxplot(data=df, x="survived", y="fare")

Observation: Survivors tend to have paid higher fares.
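The box plot can be backed up with the statistics it draws. The sketch below computes the quartiles of fare per survival group on a few invented rows; the renamed columns `q1`, `median`, and `q3` are labels chosen for this example.

```python
import pandas as pd

# Toy rows invented for this example (not real Titanic records).
toy = pd.DataFrame({
    "survived": [0, 0, 0, 1, 1, 1],
    "fare": [7.25, 8.05, 13.00, 26.00, 71.28, 53.10],
})

# These quartiles are the statistics a box plot draws:
# the box edges (25% and 75%) and the centre line (median).
fare_stats = (
    toy.groupby("survived")["fare"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
fare_stats.columns = ["q1", "median", "q3"]
print(fare_stats)
```

Medians are a safer summary than means here because fares are typically right-skewed, and a few very expensive tickets can pull the mean upward.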

Multivariate View#

sns.pairplot(df[["age","fare","survived"]].dropna(), hue="survived")

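The pair plot shows each variable's distribution on the diagonal and pairwise scatter plots elsewhere, colored by survival, so several relationships can be compared at once.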

What We Learned from the Case Study#

From simple visual exploration, we discovered:

  • survival differs by gender

  • survival differs by passenger class

  • fare relates to survival

  • age varies widely

These patterns suggest hypotheses, but they do not prove causation.

Further statistical analysis would be required to confirm them.

Interpretation: EDA reveals patterns but does not prove causation.
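As one illustration of what a follow-up analysis might look like, the sketch below runs a permutation test on the difference in survival rate between two groups. It is not the only option (a chi-square test is another common choice), the function name `permutation_test` is hypothetical, and the labels are invented so the block runs without the dataset.

```python
import numpy as np


def permutation_test(outcome, group, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means.

    `outcome` is a 0/1 array (e.g. survived) and `group` a boolean
    array marking one of two groups. Returns the observed difference
    and the estimated share of shuffles with a gap at least as large.
    """
    outcome = np.asarray(outcome, dtype=float)
    group = np.asarray(group, dtype=bool)
    observed = outcome[group].mean() - outcome[~group].mean()

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(group)  # break any real link
        diff = outcome[shuffled].mean() - outcome[~shuffled].mean()
        if abs(diff) >= abs(observed) - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Toy labels invented for this example (not real Titanic records).
survived = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0]
is_female = [True] * 6 + [False] * 6
diff, p = permutation_test(survived, is_female)
print(f"difference in survival rate: {diff:.2f}, p is about {p:.3f}")
```

A small p-value indicates that the gap is unlikely to arise from random group labels alone. It still describes an association: it says nothing about why the groups differ.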

Mini Exercises: Try It Yourself#

  1. Which passenger class had the highest survival rate?

  2. Were higher fares associated with higher survival?

  3. Plot the age distribution by survival.

  4. Check the correlation between fare and age.

  5. Identify one unexpected pattern.

BONUS: Visual Cheat Sheet#

| Goal          | Visualization |
|---------------|---------------|
| Distribution  | Histogram     |
| Outliers      | Boxplot       |
| Counts        | Bar chart     |
| Relationship  | Scatter plot  |
| Correlation   | Heatmap       |
| Multivariable | Pairplot      |