Grouping Data with groupby#

While filtering and aggregation let us summarize a subset of data, the groupby() method lets us compute statistics across categories.
This is the classic split–apply–combine process:

  1. Split data into groups based on one or more columns.

  2. Apply an aggregation function to each group.

  3. Combine results into a new DataFrame or Series.


Basic Syntax#

The basic pattern is df.groupby('column_name')['target_column'].aggregation_function(), where:

  • groupby('column_name') → splits the data into groups.

  • ['target_column'] → selects the column to aggregate.

  • .aggregation_function() → applies functions like mean(), sum(), count().

Example: Salary by Department#

import pandas as pd

# Sample dataset
data = {
    'Department': ['HR','HR','IT','IT','Finance','Finance'],
    'Employee': ['Alice','Bob','Charlie','David','Eva','Frank'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df = pd.DataFrame(data)
print(df)

# Average salary per department
df.groupby('Department')['Salary'].mean()
  Department Employee  Salary
0         HR    Alice   50000
1         HR      Bob   52000
2         IT  Charlie   60000
3         IT    David   62000
4    Finance      Eva   58000
5    Finance    Frank   60000
Department
Finance    59000.0
HR         51000.0
IT         61000.0
Name: Salary, dtype: float64
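
You are not limited to a single statistic: the agg() method applies several aggregation functions in one call. A minimal sketch, reusing the df defined above, which yields a table like the one shown:

# Apply mean, sum, and count to each department in one call
df.groupby('Department')['Salary'].agg(['mean', 'sum', 'count'])
               mean     sum  count
Department
Finance     59000.0  118000      2
HR          51000.0  102000      2
IT          61000.0  122000      2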

Grouping by Multiple Columns#

# Example dataset with Region added
data2 = {
    'Department': ['HR','HR','IT','IT','Finance','Finance'],
    'Region': ['East','West','East','West','East','West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df2 = pd.DataFrame(data2)

# Group by Department and Region
df2.groupby(['Department','Region'])['Salary'].mean()
Department  Region
Finance     East      58000.0
            West      60000.0
HR          East      50000.0
            West      52000.0
IT          East      60000.0
            West      62000.0
Name: Salary, dtype: float64
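
Grouping by several columns returns a Series with a MultiIndex. If you prefer a flat DataFrame with one row per group, reset_index() converts the index levels back into regular columns; a quick sketch using df2 from above:

# Convert the MultiIndex result back into ordinary columns
df2.groupby(['Department','Region'])['Salary'].mean().reset_index()
  Department Region   Salary
0    Finance   East  58000.0
1    Finance   West  60000.0
2         HR   East  50000.0
3         HR   West  52000.0
4         IT   East  60000.0
5         IT   West  62000.0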

Exporting Data in Pandas#

After processing your data in Pandas, you can save it to files in various formats:

| Format  | Function     | Key Parameters / Notes                                                          | Example Usage                                         |
|---------|--------------|---------------------------------------------------------------------------------|-------------------------------------------------------|
| CSV     | to_csv()     | index=False to skip row numbers, sep='\t' for tab-delimited                     | df.to_csv('data.csv', index=False)                    |
| Excel   | to_excel()   | sheet_name='Sheet1', requires openpyxl                                          | df.to_excel('data.xlsx', index=False)                 |
| JSON    | to_json()    | orient='records', lines=True for line-delimited JSON                            | df.to_json('data.json', orient='records', lines=True) |
| Pickle  | to_pickle()  | Python-specific, fast binary format                                             | df.to_pickle('data.pkl')                              |
| HTML    | to_html()    | Saves as an HTML table                                                          | df.to_html('data.html', index=False)                  |
| Parquet | to_parquet() | Efficient columnar format, great for big data; requires pyarrow or fastparquet | df.to_parquet('data.parquet', index=False)            |
Tip: Always choose the format based on your use case:

  • CSV → universal, easy sharing

  • Excel → spreadsheets

  • JSON → web APIs or NoSQL

  • Parquet → large datasets, high performance

# Save CSV
df.to_csv('output.csv', index=False)
print("CSV file 'output.csv' created successfully.")

# Save tab-separated CSV
df2.to_csv('output_tab.csv', sep='\t', index=False)
print("Tab-separated CSV file 'output_tab.csv' created successfully.")

# Save Excel
df.to_excel('output.xlsx', index=False, sheet_name='Sheet1')
print("Excel file 'output.xlsx' created successfully.")

# Save JSON
df2.to_json('output.json', orient='records', lines=True)
print("JSON file 'output.json' created successfully.")
CSV file 'output.csv' created successfully.
Tab-separated CSV file 'output_tab.csv' created successfully.
Excel file 'output.xlsx' created successfully.
JSON file 'output.json' created successfully.
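
The binary formats follow the same pattern. A short sketch for pickle and Parquet (note that to_parquet() needs either pyarrow or fastparquet installed):

# Save pickle (Python-specific, fast binary format)
df.to_pickle('output.pkl')

# Save Parquet (columnar format; requires pyarrow or fastparquet)
df.to_parquet('output.parquet', index=False)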

Exporting Pandas Data in Google Colab#

In Colab, you can save files directly to Google Drive. First, mount your Drive:

from google.colab import drive
drive.mount('/content/drive')  # Follow the link and paste the authorization code
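
Once Drive is mounted, any pandas export function can write into it by using a path under /content/drive. A minimal sketch, assuming the default MyDrive folder:

# Save a CSV directly into Google Drive
df.to_csv('/content/drive/MyDrive/output.csv', index=False)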

The pandas Ecosystem: How It Fits In#

Pandas does not exist in a vacuum. It is a central hub in the Python data science stack:

  • NumPy: Provides the foundational n-dimensional array object. Pandas DataFrames are built on top of NumPy arrays.

  • Matplotlib/Seaborn: Used for visualization. You can plot data directly from DataFrames and Series (see the sketch after this list).

  • Scikit-learn: The premier machine learning library. It accepts DataFrames and Series as inputs for model training.

  • Jupyter Notebooks: The ideal interactive environment for exploratory data analysis with pandas.
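
For instance, plotting straight from a DataFrame is a one-liner. A minimal sketch using the salary data from above (assumes matplotlib is installed):

import matplotlib.pyplot as plt

# Bar chart of average salary per department, plotted directly from the groupby result
df.groupby('Department')['Salary'].mean().plot(kind='bar', title='Average Salary by Department')
plt.show()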

When to Use Pandas (And When Not To)#

Use pandas when:#

  • Working with tabular data (like spreadsheets or database tables)

  • Cleaning and preprocessing data

  • Performing exploratory data analysis

  • Handling medium-sized datasets (up to a few gigabytes)

Consider alternatives when:#

  • You are working with very large datasets that don't fit in memory (consider Dask or Polars)

  • You need extremely high performance for numerical computations (consider NumPy directly)

  • You are working with unstructured data like images or raw text