Summarizing Variables: Categorical vs Numerical

Exploratory data analysis (EDA) becomes much easier when we pair interpretation with quick, repeatable code. In this section, we will summarize two kinds of variables:

Categorical variables (group labels)
Numerical variables (measured quantities)

Throughout, assume your dataset is loaded in a pandas DataFrame called df.

A quick setup#

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
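
If you want to run the snippets in this section without loading a file, a small made-up DataFrame is enough. The column names and values below are invented purely for illustration and are not part of any real dataset; replace them with your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: columns and values are made up for illustration only.
df = pd.DataFrame({
    "department": ["sales", "sales", "hr", "it", "sales", None, "it", "sales"],
    "age": [25, 31, 47, 38, 29, 52, np.nan, 33],
    "income": [32000, 41000, 58000, 62000, 39000, 250000, 45000, 40000],
})

# Column names, dtypes, and non-null counts at a glance.
df.info()
```

The toy data deliberately includes a missing category, a missing number, and one very large income so that the later summaries have something to show.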

Summarizing Categorical Variables#

Categorical variables describe membership in groups (e.g., sex, class, department). A good first question is: Which categories are most common?

Frequency table (counts)#

col = "your_categorical_column"
df[col].value_counts(dropna=False)

Relative frequency (proportions)#

df[col].value_counts(normalize=True, dropna=False)
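
Counts and proportions are often easier to read side by side. The sketch below builds one small table from a made-up Series named s (an assumption for illustration); in practice you would use df[col] instead.

```python
import pandas as pd

# Made-up example column; replace with df[col] from your own data.
s = pd.Series(["a", "b", "a", None, "a", "b", "c", "a"], name="group")

# Counts per category, keeping missing values as their own row.
summary = s.value_counts(dropna=False).to_frame("count")

# Proportions derived from the same counts, so the two columns always agree.
summary["proportion"] = (summary["count"] / summary["count"].sum()).round(3)

print(summary)
```

Deriving the proportions from the counts, rather than calling value_counts twice, keeps both columns on the same index and in the same order.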

Bar chart#

col = "your_categorical_column"

counts = df[col].value_counts(dropna=False)
counts.plot(kind="bar")
plt.xlabel(col)
plt.ylabel("Count")
plt.title(f"Counts of {col}")
plt.show()

Interpretation tip:
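A dominant category may indicate class imbalance. This matters for modeling, where a model can look accurate by always predicting the majority, and for fairness, where small groups may be poorly represented.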
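
One way to put a number on imbalance is to look at the share of the most common category and at the ratio between the largest and smallest categories. This is only a sketch on made-up labels; there is no universal cutoff for "imbalanced", so treat the numbers as a prompt for judgment rather than a rule.

```python
import pandas as pd

# Made-up labels; replace with df[col] from your own data.
s = pd.Series(["no"] * 18 + ["yes"] * 2, name="churned")

# value_counts sorts from most to least common.
props = s.value_counts(normalize=True)

majority_share = props.iloc[0]                    # share of the most common category
imbalance_ratio = props.iloc[0] / props.iloc[-1]  # most common vs least common

print(f"Most common: {props.index[0]!r} ({majority_share:.0%})")
print(f"Imbalance ratio (largest / smallest): {imbalance_ratio:.1f}")
```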

Summarizing Numerical Variables#

Numerical variables represent quantities (e.g., age, income, fare).
A good first question is: What is typical, and how much do values vary?

Summary statistics#

col = "your_numeric_column"
df[col].describe()

Quick quantiles (often more informative than the mean)#

df[col].quantile([0.0, 0.25, 0.5, 0.75, 1.0])

Interpretation tip:
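If the mean and median are far apart, the distribution may be skewed: a mean well above the median suggests a long right tail, and a mean well below it suggests a long left tail.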
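
The comparison can be made directly in code. The sketch below uses a made-up Series with one very large value; pandas' skew() gives a sample skewness, where positive values point to a longer right tail.

```python
import pandas as pd

# Made-up values: one very large observation pulls the mean upward.
s = pd.Series([20, 22, 25, 27, 30, 32, 35, 200], name="income_k")

mean, median = s.mean(), s.median()
skewness = s.skew()  # sample skewness; positive suggests a longer right tail

print(f"mean={mean:.1f}, median={median:.1f}, skewness={skewness:.2f}")
```

Here the mean sits well above the median, which matches the positive skewness.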

Visualizing Distributions and Detecting Outliers#

Numbers summarize. Visuals reveal shape.

Histogram (distribution shape)#

Histograms reveal the overall shape of the data, something summary statistics alone cannot show.

col = "your_numeric_column"

df[col].hist(bins=30)
plt.xlabel(col)
plt.ylabel("Frequency")
plt.title(f"Histogram of {col}")
plt.show()

What to look for:

Skewness (long tail on one side)
Multiple peaks (possible subgroups)
Unexpected spikes (data entry issues)

Box plot (outliers)#

col = "your_numeric_column"

sns.boxplot(x=df[col])
plt.title(f"Boxplot of {col}")
plt.show()

If a distribution is strongly right-skewed and its values are non-negative, consider a log transformation before modeling.
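
A minimal sketch of such a transformation, assuming the values are non-negative: np.log1p computes log(1 + x), so zeros are allowed, while negative values are not. The Series below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up non-negative, right-skewed values; replace with df[col].
s = pd.Series([0, 1, 2, 3, 5, 8, 13, 400], name="fare")

# log1p(x) = log(1 + x): safe for zeros, undefined for values below -1.
log_s = np.log1p(s)

print("skewness before:", round(s.skew(), 2))
print("skewness after: ", round(log_s.skew(), 2))
```

Whether the transformed variable is easier to model depends on the method you use afterwards, so it is worth re-plotting the histogram after transforming.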

Reminder:
Outliers are not always “bad.” They may be errors, or they may be rare but meaningful cases.
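
To list the points a box plot would draw as outliers, you can apply the usual 1.5 × IQR rule yourself (1.5 is seaborn's default whisker length). The data below is made up, and the rule is a convention for flagging values to inspect, not a test that they are wrong.

```python
import pandas as pd

# Made-up values with one unusually large observation; replace with df[col].
s = pd.Series([12, 14, 15, 15, 16, 18, 19, 21, 95], name="fare")

# Interquartile range: the spread of the middle 50% of the data.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged for inspection.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]

print(f"bounds: [{lower}, {upper}]")
print(outliers)
```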

A useful helper: identify categorical vs numerical columns#

When datasets are large, it helps to quickly separate column types.

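# Numeric dtypes (integers and floats) go in one list; everything else
# (strings, categories, booleans, datetimes) lands in the other.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

numeric_cols, categorical_cols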

BONUS: Quick EDA summary loop (fast and practical)#

This prints a short summary for each column and makes simple plots.

for col in df.columns:
    print("\n" + "=" * 60)
    print("Column:", col)
    print("Type:", df[col].dtype)
    print("Missing:", df[col].isna().sum())

    # Booleans count as numeric in pandas, but they summarize better as categories.
    is_bool = pd.api.types.is_bool_dtype(df[col])
    if pd.api.types.is_numeric_dtype(df[col]) and not is_bool:
        print(df[col].describe())
        df[col].hist(bins=30)
        plt.title(f"Histogram of {col}")
        plt.show()

        sns.boxplot(x=df[col])
        plt.title(f"Boxplot of {col}")
        plt.show()
    else:
        # Compute the top-10 counts once and reuse them for the table and the plot.
        top_counts = df[col].value_counts(dropna=False).head(10)
        print(top_counts)
        top_counts.plot(kind="bar")
        plt.title(f"Top categories of {col}")
        plt.show()
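
When a dataset has many columns, a compact table can be easier to scan than a long stream of printouts and plots. The helper below is a hypothetical sketch (the name summarize_columns and the toy data are assumptions, not part of pandas): it returns one row per column with its dtype, missing count, and number of distinct values.

```python
import numpy as np
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, unique count, numeric flag."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
        "is_numeric": pd.Series(
            {c: pd.api.types.is_numeric_dtype(df[c]) for c in df.columns}
        ),
    })

# Made-up data for illustration; replace with your own DataFrame.
toy = pd.DataFrame({
    "city": ["a", "b", None],
    "age": [1.0, np.nan, 3.0],
})

summary = summarize_columns(toy)
print(summary)
```

A table like this is a reasonable first pass for deciding which columns deserve a closer look with the plots above.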

Mini “Try It” Questions (for students)#

Pick one categorical column. Which category is most common? Is it heavily imbalanced?
Pick one numeric column. Is it skewed? How can you tell from the histogram?
Does the boxplot show outliers? What might explain them?