BONUS: EDA Example for Common Tasks

This bonus example walks through several common exploratory data analysis (EDA) operations. We begin by inspecting the dataset with df.dtypes, df.count(), and df.isna().sum() to see each column's data type, non-missing count, and missing count. We then loop over the columns to print a per-column summary and use pd.isna() to check individual values row by row.
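
As a rough sketch of how these checks can be gathered into one table, the snippet below builds a per-column summary and lists the rows that contain any missing value. The frame `toy` and the names `summary` and `rows_with_missing` are made up for illustration; they are not part of the example that follows.

```python
import pandas as pd

# Hypothetical toy data, used only for this sketch
toy = pd.DataFrame({
    "name": ["alice", "bob", None],
    "age": [22, None, 30],
})

# One row per column: dtype, non-missing count, missing count
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_missing": toy.count(),
    "missing": toy.isna().sum(),
})
print(summary)

# Row-level check: index labels of rows with at least one missing value
rows_with_missing = toy.index[toy.isna().any(axis=1)].tolist()
print(rows_with_missing)  # [1, 2]
```

Collecting the three checks in one DataFrame makes it easier to sort or filter columns by how many values they are missing.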

Next, we parse date strings with pd.to_datetime() and reformat them with .dt.strftime(). We work with text columns through the .str accessor (.str.contains(), .str.split(), .str.replace(), and .str.capitalize()) and pick out digit-only tokens with Python's str.isdigit().
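
One detail worth seeing in isolation is what errors="coerce" does with values that do not match the format. The sketch below uses hypothetical Series (`raw`, `codes`) rather than the example's data: unparseable strings become NaT instead of raising, and .str.isdigit() flags only strings made up entirely of digits.

```python
import pandas as pd

# Hypothetical inputs: one valid date, one invalid string, one missing value
raw = pd.Series(["03/05/2026", "not a date", None])

# errors="coerce" turns anything that does not match the format into NaT
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Values that were present but could not be parsed
n_failed = int(parsed.isna().sum() - raw.isna().sum())
print("failed to parse:", n_failed)  # failed to parse: 1

# Reformat the parsed dates; missing dates stay missing
iso = parsed.dt.strftime("%Y-%m-%d")
print(iso.iloc[0])  # 2026-03-05

# Digit-only detection on a text column
codes = pd.Series(["100", "7a", "ok"])
is_numeric = codes.str.isdigit()
print(is_numeric.tolist())  # [True, False, False]
```

Comparing the missing counts before and after parsing is a simple way to notice silently coerced values.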

We further demonstrate how to access specific rows with .iloc[], check whether data is empty using .empty, and combine datasets with pd.merge().
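
A left merge keeps every row of the left table even when no match exists, so it can help to check for unmatched keys afterwards. The following is a minimal sketch under assumed data: `people`, `states`, and the city "Chicago" are invented for illustration. It uses the indicator=True option of pd.merge() together with .empty and .iloc[0].

```python
import pandas as pd

# Hypothetical tables; "Chicago" deliberately has no match in `states`
people = pd.DataFrame({
    "name": ["alice", "bob", "cara"],
    "city": ["New York", "Boston", "Chicago"],
})
states = pd.DataFrame({
    "city": ["New York", "Boston"],
    "state": ["NY", "MA"],
})

# indicator=True adds a "_merge" column recording where each row came from
merged = pd.merge(people, states, on="city", how="left", indicator=True)

# Rows found only in the left table have no state attached
unmatched = merged[merged["_merge"] == "left_only"]
if unmatched.empty:
    print("Every city was matched")
else:
    print("No state for:", unmatched["city"].tolist())  # No state for: ['Chicago']

# Position-based access to the first merged row
first = merged.iloc[0]
print(first["state"])  # NY
```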

Together, these operations illustrate essential inspection, cleaning, and transformation steps that prepare data for meaningful analysis and modeling.

import pandas as pd

# ----------------------------
# Toy data (so you can run this)
# ----------------------------
df = pd.DataFrame({
    "name": ["alice", "bob", None],
    "age": [22, None, 30],
    "city": ["New York", "Boston", "New York"],
    "date_str": ["03/05/2026", "03/06/2026", None],
    "comment": ["good job 100", "needs improvement", "ok 7"]
})

df_extra = pd.DataFrame({
    "city": ["New York", "Boston"],
    "state": ["NY", "MA"]
})

# ----------------------------
# Quick checks
# ----------------------------
print("dtypes:\n", df.dtypes, "\n")           # df.dtypes
print("count (non-missing):\n", df.count(), "\n")  # count()
print("missing per column:\n", df.isna().sum(), "\n")  # isna().sum()

# ----------------------------
# Example loop over columns + pd.isna
# ----------------------------
for col in df.columns:
    missing = df[col].isna().sum()
    dtype = df[col].dtype
    print(f"{col:8} | dtype={dtype} | missing={missing}")

# Example: row-wise check with pd.isna
for i, val in enumerate(df["age"]):
    if pd.isna(val):  # if pd.isna(...)
        print(f"Row {i}: age is missing")

# ----------------------------
# Date parsing + formatting
# ----------------------------
df["date"] = pd.to_datetime(df["date_str"], format="%m/%d/%Y", errors="coerce")
df["date_fmt"] = df["date"].dt.strftime("%Y-%m-%d")  # .strftime
print("\nFormatted dates:\n", df[["date_str", "date_fmt"]], "\n")

# ----------------------------
# age.empty (Series is rarely empty; shown for completeness)
# ----------------------------
age = df["age"].dropna()
print("age.empty:", age.empty)  # age.empty

# ----------------------------
# String operations: contains, split, replace, capitalize
# ----------------------------
# .str.contains
mask_ny = df["city"].fillna("").str.contains("New", case=False)
print("\nRows where city contains 'New':\n", df[mask_ny], "\n")

# .split() / .split(' ')
df["words"] = df["comment"].fillna("").str.split(" ")  # .split(' ')
print("Tokenized comments:\n", df[["comment", "words"]], "\n")

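# words.index( ... ) + .append + .isdigit
tokens = df.loc[0, "words"]           # e.g., ["good","job","100"]; this is the same list object stored in df
idx_job = tokens.index("job")         # words.index(...)
tokens.append("extra")                # .append mutates in place, so df.loc[0, "words"] also gains "extra"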
only_digits = [w for w in tokens if w.isdigit()]  # .isdigit()
print("Example tokens:", tokens)
print("Index of 'job':", idx_job)
print("Digit-only tokens:", only_digits, "\n")

# .replace + .capitalize
df["name_clean"] = df["name"].fillna("unknown").str.replace("_", " ", regex=False).str.capitalize()
print("Clean names:\n", df[["name", "name_clean"]], "\n")

# ----------------------------
# pd.merge (join extra info)
# ----------------------------
df_merged = pd.merge(df, df_extra, on="city", how="left")  # pd.merge
print("Merged:\n", df_merged[["city", "state"]], "\n")

# ----------------------------
# .iloc[0] (first row)
# ----------------------------
first_row = df_merged.iloc[0]  # .iloc[0]
print("First row (iloc[0]):\n", first_row)

dtypes:
 name            str
age         float64
city            str
date_str        str
comment         str
dtype: object 

count (non-missing):
 name        2
age         2
city        3
date_str    2
comment     3
dtype: int64 

missing per column:
 name        1
age         1
city        0
date_str    1
comment     0
dtype: int64 

name     | dtype=str | missing=1
age      | dtype=float64 | missing=1
city     | dtype=str | missing=0
date_str | dtype=str | missing=1
comment  | dtype=str | missing=0
Row 1: age is missing

Formatted dates:
      date_str    date_fmt
0  03/05/2026  2026-03-05
1  03/06/2026  2026-03-06
2         NaN         NaN 

age.empty: False

Rows where city contains 'New':
     name   age      city    date_str       comment       date    date_fmt
0  alice  22.0  New York  03/05/2026  good job 100 2026-03-05  2026-03-05
2    NaN  30.0  New York         NaN          ok 7        NaT         NaN 

Tokenized comments:
              comment                 words
0       good job 100      [good, job, 100]
1  needs improvement  [needs, improvement]
2               ok 7               [ok, 7] 

Example tokens: ['good', 'job', '100', 'extra']
Index of 'job': 1
Digit-only tokens: ['100'] 

Clean names:
     name name_clean
0  alice      Alice
1    bob        Bob
2    NaN    Unknown 

Merged:
        city state
0  New York    NY
1    Boston    MA
2  New York    NY 

First row (iloc[0]):
 name                            alice
age                              22.0
city                         New York
date_str                   03/05/2026
comment                  good job 100
date              2026-03-05 00:00:00
date_fmt                   2026-03-05
words         [good, job, 100, extra]
name_clean                      Alice
state                              NY
Name: 0, dtype: object
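
Note that the words entry in the first row above includes 'extra': tokens refers to the same list that is stored in the DataFrame, so tokens.append("extra") changed the DataFrame as well. If that side effect is not wanted, one possible approach is to copy the list first. The sketch below uses a hypothetical one-row frame `toy` to show the difference.

```python
import pandas as pd

# Hypothetical one-row frame, used only for this sketch
toy = pd.DataFrame({"comment": ["good job 100"]})
toy["words"] = toy["comment"].str.split(" ")

# Copy the list so later changes do not touch the DataFrame
tokens = list(toy.loc[0, "words"])
tokens.append("extra")

print(tokens)              # ['good', 'job', '100', 'extra']
print(toy.loc[0, "words"])  # ['good', 'job', '100']
```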