BONUS: EDA Example for Common Tasks
This bonus example demonstrates several common operations performed during Exploratory Data Analysis (EDA). We begin by inspecting the dataset using df.dtypes, count(), and isna().sum() to understand data types and identify missing values. We also loop through columns and use pd.isna() for row-level checks.
Next, we parse date strings with pd.to_datetime() and format them with .dt.strftime(). We clean and analyze text with the vectorized string methods .str.contains(), .str.split(), .str.replace(), and .str.capitalize(), and pick out digit-only tokens with Python's str.isdigit().
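As a small, self-contained sketch of the date and string steps: the sample values below are made up for illustration and are not the dataset used in the full example. It assumes that unparseable dates should become missing values rather than raise an error, which is what errors="coerce" does, and it uses the vectorized .str.isdigit() as a column-wide alternative to checking tokens one by one.

```python
import pandas as pd

# Hypothetical raw values: one valid date, one malformed string, one missing.
raw = pd.Series(["03/05/2026", "not a date", None])

# errors="coerce" converts anything that does not match the format to NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")
formatted = parsed.dt.strftime("%Y-%m-%d")
print(formatted.tolist())

# Vectorized digit check; fill missing values first so the result is all booleans.
tokens = pd.Series(["100", "job", "7", None])
is_numeric = tokens.fillna("").str.isdigit()
print(is_numeric.tolist())
```

Filling missing values with an empty string before the check keeps the result a plain boolean mask, which can be used directly for filtering.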
We further demonstrate how to access specific rows with .iloc[], check whether data is empty using .empty, and combine datasets with pd.merge().
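One way these three pieces can work together is to check a merge for keys that found no match. The sketch below is illustrative only: the frames people and states and the unmatched city "Chicago" are hypothetical and are not part of the full example. indicator=True and validate="many_to_one" are standard pd.merge() options; the first adds a _merge column recording where each row came from, and the second raises an error if the right-hand keys are not unique.

```python
import pandas as pd

# Hypothetical data: "Chicago" has no entry in the lookup table.
people = pd.DataFrame({
    "name": ["alice", "bob", "carol"],
    "city": ["New York", "Boston", "Chicago"],
})
states = pd.DataFrame({
    "city": ["New York", "Boston"],
    "state": ["NY", "MA"],
})

# Left join that records match status and checks that each city appears once in states.
merged = pd.merge(people, states, on="city", how="left",
                  indicator=True, validate="many_to_one")

# Rows present only in the left frame did not find a matching city.
unmatched = merged[merged["_merge"] == "left_only"]

if not unmatched.empty:
    # .iloc[0] is safe here because we just confirmed there is at least one row.
    print("First unmatched row:\n", unmatched.iloc[0])
```

Checking .empty before calling .iloc[0] avoids an IndexError when every row matches.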
Together, these operations illustrate essential inspection, cleaning, and transformation steps that prepare data for meaningful analysis and modeling.
import pandas as pd
# ----------------------------
# Toy data (so you can run this)
# ----------------------------
df = pd.DataFrame({
    "name": ["alice", "bob", None],
    "age": [22, None, 30],
    "city": ["New York", "Boston", "New York"],
    "date_str": ["03/05/2026", "03/06/2026", None],
    "comment": ["good job 100", "needs improvement", "ok 7"]
})
df_extra = pd.DataFrame({
    "city": ["New York", "Boston"],
    "state": ["NY", "MA"]
})
# ----------------------------
# Quick checks
# ----------------------------
print("dtypes:\n", df.dtypes, "\n") # df.dtypes
print("count (non-missing):\n", df.count(), "\n") # count()
print("missing per column:\n", df.isna().sum(), "\n") # isna().sum()
# ----------------------------
# Example loop over columns + pd.isna
# ----------------------------
for col in df.columns:
    missing = df[col].isna().sum()
    dtype = df[col].dtype
    print(f"{col:8} | dtype={dtype} | missing={missing}")
# Example: row-wise check with pd.isna
for i, val in enumerate(df["age"]):
    if pd.isna(val): # if pd.isna(...)
        print(f"Row {i}: age is missing")
# ----------------------------
# Date parsing + formatting
# ----------------------------
df["date"] = pd.to_datetime(df["date_str"], format="%m/%d/%Y", errors="coerce")
df["date_fmt"] = df["date"].dt.strftime("%Y-%m-%d") # .strftime
print("\nFormatted dates:\n", df[["date_str", "date_fmt"]], "\n")
# ----------------------------
# age.empty (Series is rarely empty; shown for completeness)
# ----------------------------
age = df["age"].dropna()
print("age.empty:", age.empty) # age.empty
# ----------------------------
# String operations: contains, split, replace, capitalize
# ----------------------------
# .str.contains
mask_ny = df["city"].fillna("").str.contains("New", case=False)
print("\nRows where city contains 'New':\n", df[mask_ny], "\n")
# .split() / .split(' ')
df["words"] = df["comment"].fillna("").str.split(" ") # .split(' ')
print("Tokenized comments:\n", df[["comment", "words"]], "\n")
# words.index( ... ) + .append + .isdigit
tokens = df.loc[0, "words"] # e.g., ["good","job","100"]; this is the list stored in df, not a copy
idx_job = tokens.index("job") # words.index(...)
tokens.append("extra") # .append (this also changes df.loc[0, "words"])
only_digits = [w for w in tokens if w.isdigit()] # .isdigit()
print("Example tokens:", tokens)
print("Index of 'job':", idx_job)
print("Digit-only tokens:", only_digits, "\n")
# .replace + .capitalize
df["name_clean"] = df["name"].fillna("unknown").str.replace("_", " ", regex=False).str.capitalize()
print("Clean names:\n", df[["name", "name_clean"]], "\n")
# ----------------------------
# pd.merge (join extra info)
# ----------------------------
df_merged = pd.merge(df, df_extra, on="city", how="left") # pd.merge
print("Merged:\n", df_merged[["city", "state"]], "\n")
# ----------------------------
# .iloc[0] (first row)
# ----------------------------
first_row = df_merged.iloc[0] # .iloc[0]
print("First row (iloc[0]):\n", first_row)
dtypes:
name str
age float64
city str
date_str str
comment str
dtype: object
count (non-missing):
name 2
age 2
city 3
date_str 2
comment 3
dtype: int64
missing per column:
name 1
age 1
city 0
date_str 1
comment 0
dtype: int64
name | dtype=str | missing=1
age | dtype=float64 | missing=1
city | dtype=str | missing=0
date_str | dtype=str | missing=1
comment | dtype=str | missing=0
Row 1: age is missing
Formatted dates:
date_str date_fmt
0 03/05/2026 2026-03-05
1 03/06/2026 2026-03-06
2 NaN NaN
age.empty: False
Rows where city contains 'New':
name age city date_str comment date date_fmt
0 alice 22.0 New York 03/05/2026 good job 100 2026-03-05 2026-03-05
2 NaN 30.0 New York NaN ok 7 NaT NaN
Tokenized comments:
comment words
0 good job 100 [good, job, 100]
1 needs improvement [needs, improvement]
2 ok 7 [ok, 7]
Example tokens: ['good', 'job', '100', 'extra']
Index of 'job': 1
Digit-only tokens: ['100']
Clean names:
name name_clean
0 alice Alice
1 bob Bob
2 NaN Unknown
Merged:
city state
0 New York NY
1 Boston MA
2 New York NY
First row (iloc[0]):
name alice
age 22.0
city New York
date_str 03/05/2026
comment good job 100
date 2026-03-05 00:00:00
date_fmt 2026-03-05
words [good, job, 100, extra]
name_clean Alice
state NY
Name: 0, dtype: object
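Note that words in the first row above ends with "extra": the tokens list taken from df.loc[0, "words"] is the same object stored in the DataFrame, so appending to it also changed the data. If that side effect is not wanted, one option is to copy the list first. The sketch below uses a small stand-in DataFrame built only for illustration.

```python
import pandas as pd

# Stand-in data with a column of Python lists, similar to the "words" column.
df = pd.DataFrame({"words": [["good", "job", "100"], ["ok", "7"]]})

# list(...) makes a shallow copy, so edits to tokens do not touch the DataFrame.
tokens = list(df.loc[0, "words"])
tokens.append("extra")

print("tokens:", tokens)
print("stored in df:", df.loc[0, "words"])
```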