Handling Missing Data and Data Quality Issues#

Exploratory analysis is not only about discovering patterns. It is also about identifying problems. Real-world data is rarely perfect. Missing values, duplicates, inconsistent formats, and unexpected values are common. If these issues are ignored, they can distort conclusions and weaken models.

Before modeling, we must understand the quality of our data.

Identifying Missing Values#

The first step is to measure how much data is missing.

df.isna().sum()

This shows the number of missing values in each column.

To view proportions:

df.isna().mean()
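The two calls above can be combined into a single summary table. The following is a minimal sketch on a small made-up DataFrame; the column names and values are hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, np.nan],
    "city": ["Lagos", "Lima", None, "Oslo"],
})

# Combine counts and percentages, sorted so the worst columns come first.
missing_summary = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": df.isna().mean() * 100,
}).sort_values("pct_missing", ascending=False)

print(missing_summary)
```

Sorting by percentage makes it easier to see which columns deserve attention first.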

High missingness may require:

  • Removing rows

  • Removing columns

  • Imputation (covered in a later topic)

  • Investigating the source

Missing data is not always random. Understanding why values are missing is often as important as the missingness itself.
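One simple way to probe whether missingness is related to another variable is to compare the share of missing values across groups. The sketch below assumes a hypothetical survey with `employment` and `income` columns; it is one possible check, not a formal test of the missingness mechanism.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data: income is missing more often in one group.
df = pd.DataFrame({
    "employment": ["employed", "employed", "unemployed",
                   "unemployed", "employed", "unemployed"],
    "income": [52000, 48000, np.nan, np.nan, 61000, 15000],
})

# Share of missing income values within each employment group.
missing_by_group = df["income"].isna().groupby(df["employment"]).mean()

print(missing_by_group)
```

A large gap between groups suggests the values are not missing at random, which is worth investigating before dropping or filling anything.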

Handling Missing Values#

Common approaches include:

  1. Dropping rows

df = df.dropna()

  2. Filling with summary statistics

df["col"] = df["col"].fillna(df["col"].mean())

  3. Filling categorical values

df["category"] = df["category"].fillna("Unknown")

The choice depends on:

  • proportion missing

  • importance of variable

  • modeling goals
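As a rough illustration of how these factors can drive the choice, the sketch below drops very sparse columns and fills the remaining ones by type. The function name `handle_missing` and the `drop_threshold` cutoff are hypothetical, and the median is swapped in for the mean because it is less sensitive to outliers; the right rule depends on the variable and the modeling goal.

```python
import numpy as np
import pandas as pd

def handle_missing(df, drop_threshold=0.5):
    """Drop very sparse columns, then fill the rest by column type.

    drop_threshold is a hypothetical cutoff, not a recommended value.
    """
    # Keep only columns whose share of missing values is within the cutoff.
    out = df.loc[:, df.isna().mean() <= drop_threshold].copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Numeric column: fill with the median of the observed values.
            out[col] = out[col].fillna(out[col].median())
        else:
            # Non-numeric column: mark missing entries explicitly.
            out[col] = out[col].fillna("Unknown")
    return out

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "city": ["Lagos", None, "Lima", "Oslo"],
    "notes": [None, None, None, "call back"],
})

clean = handle_missing(df)
print(clean)
```

Here `notes` is dropped because three of its four values are missing, while `age` and `city` are filled.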

Detecting Duplicates#

Count the rows that exactly repeat an earlier row:

df.duplicated().sum()

Remove duplicates if necessary:

df = df.drop_duplicates()

Duplicates can inflate sample size artificially.
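Duplicates are not always exact copies of a whole row. The sketch below, on hypothetical customer records, uses the `subset` and `keep` parameters of `duplicated` and `drop_duplicates` to find rows that repeat on key columns only. Which columns identify a record is an assumption that depends on the dataset.

```python
import pandas as pd

# Hypothetical customer records; customer_id 2 appears twice.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com",
              "b@example.com", "c@example.com"],
    "signup_source": ["web", "web", "app", "web"],
})

# No row is an exact copy of another, so the default check finds nothing.
n_exact = df.duplicated().sum()

# keep=False marks every member of a duplicate group, for inspection.
dupes = df[df.duplicated(subset=["customer_id", "email"], keep=False)]
print(dupes)

# After reviewing them, keep the first record per customer.
deduped = df.drop_duplicates(subset=["customer_id", "email"], keep="first")
```

Inspecting the duplicate groups before dropping them helps confirm that they are true repeats rather than legitimate separate records.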

Checking Data Types and Converting if Necessary#

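Sometimes numeric data is stored as text. Inspect the column types with df.dtypes and convert if necessary. With errors="coerce", values that cannot be parsed become NaN instead of raising an error: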

df["col"] = pd.to_numeric(df["col"], errors="coerce")
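Because errors="coerce" silently turns unparseable entries into NaN, it can be worth listing what failed to convert before overwriting the column. A minimal sketch, assuming a hypothetical text column with a few bad entries:

```python
import pandas as pd

# Hypothetical column where numbers were read in as text.
df = pd.DataFrame({"col": ["10", "20.5", "N/A", "thirty", None]})

converted = pd.to_numeric(df["col"], errors="coerce")

# Values that were present before conversion but are NaN afterwards
# are the ones that failed to parse.
failed = df.loc[converted.isna() & df["col"].notna(), "col"]
print(failed.tolist())

# Overwrite the column only after reviewing the failures.
df["col"] = converted
```

The failed entries often reveal placeholder codes or formatting issues that can be cleaned up rather than discarded.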

Sanity Checks: Look for impossible values:

  • Negative ages

  • Future dates

  • Extreme outliers

EDA includes validating whether values make logical sense.
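Checks like these can be written as boolean masks. The sketch below uses hypothetical `age`, `visit_date`, and `income` columns, and flags outliers with the 1.5 × IQR rule, which is one common convention among several rather than a universal definition.

```python
import pandas as pd

# Hypothetical records with a few deliberately bad values.
df = pd.DataFrame({
    "age": [34, -2, 51, 29, 40, 38],
    "visit_date": pd.to_datetime([
        "2021-03-01", "2020-11-15", "2100-01-01",
        "2019-07-30", "2022-05-05", "2021-09-09",
    ]),
    "income": [42000, 39000, 45000, 41000, 40000, 900000],
})

negative_age = df["age"] < 0
future_date = df["visit_date"] > pd.Timestamp.now()

# Flag values far outside the interquartile range (1.5 * IQR rule).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
income_outlier = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# Rows that fail at least one check, kept for review rather than deleted.
suspect = df[negative_age | future_date | income_outlier]
print(suspect)
```

Flagged rows are candidates for review. An extreme value may be a genuine observation, so removing it automatically is not always appropriate.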

Data Transformation During EDA#

Sometimes variables need reshaping to reveal patterns.

# Log transformation (for skewed data; requires strictly positive values)
import numpy as np
import pandas as pd

df["log_income"] = np.log(df["income"])

# Creating new features (BMI assumes weight in kg and height in m)
df["bmi"] = df["weight"] / (df["height"] ** 2)

# Converting dates
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year

Feature engineering (later topic) often begins during exploration.
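To see what a log transformation does to a skewed variable, the skewness can be compared before and after. The sketch below uses hypothetical, strictly positive incomes; if a column contains zeros, `np.log1p`, which computes log(1 + x), is a common alternative to `np.log`.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed incomes (all strictly positive).
df = pd.DataFrame({
    "income": [18000, 22000, 25000, 30000, 32000, 40000, 55000, 250000],
})

df["log_income"] = np.log(df["income"])

# Sample skewness of each column; values closer to 0 are more symmetric.
skew_before = df["income"].skew()
skew_after = df["log_income"].skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

In this example the transformed column is still right-skewed but less so than the raw incomes, since the single large value is pulled closer to the rest.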

Why Data Quality Matters: Models assume structure in the data. If the data is flawed, the model will learn flawed patterns.

Exploratory Data Analysis is not only about discovering insights; it is also about building trust in the data.