# Grouping Data with groupby
While filtering and aggregation let us summarize a subset of data, the groupby()
method allows us to compute statistics across categories.
This is the classic split–apply–combine process:

1. Split the data into groups based on one or more columns.
2. Apply an aggregation function to each group.
3. Combine the results into a new DataFrame or Series.
## Basic Syntax
The general pattern is:

df.groupby('column_name')['target_column'].aggregation_function()

where:

- groupby('column_name') → splits the data into groups.
- ['target_column'] → selects the column to aggregate.
- .aggregation_function() → applies a function such as mean(), sum(), or count().
## Example: Salary by Department
import pandas as pd
# Sample dataset
data = {
'Department': ['HR','HR','IT','IT','Finance','Finance'],
'Employee': ['Alice','Bob','Charlie','David','Eva','Frank'],
'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df = pd.DataFrame(data)
print(df)
# Average salary per department
df.groupby('Department')['Salary'].mean()
Department Employee Salary
0 HR Alice 50000
1 HR Bob 52000
2 IT Charlie 60000
3 IT David 62000
4 Finance Eva 58000
5 Finance Frank 60000
Department
Finance 59000.0
HR 51000.0
IT 61000.0
Name: Salary, dtype: float64
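A single group can also be summarized with several statistics at once using the standard `.agg()` method. A minimal sketch, reusing the same sample salary data:

```python
import pandas as pd

df = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000],
})

# Apply several aggregation functions to each group in one call
summary = df.groupby('Department')['Salary'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```

This yields one row per department with a column per statistic, which is often more convenient than calling mean(), min(), etc. separately.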
## Grouping by Multiple Columns
# Example dataset with Region added
data2 = {
'Department': ['HR','HR','IT','IT','Finance','Finance'],
'Region': ['East','West','East','West','East','West'],
'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df2 = pd.DataFrame(data2)
# Group by Department and Region
df2.groupby(['Department','Region'])['Salary'].mean()
Department Region
Finance East 58000.0
West 60000.0
HR East 50000.0
West 52000.0
IT East 60000.0
West 62000.0
Name: Salary, dtype: float64
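The result above is a Series with a MultiIndex (Department, Region). Two common follow-ups, sketched with the same data, are `reset_index()` to flatten it back into regular columns and `unstack()` to pivot one index level into columns:

```python
import pandas as pd

df2 = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000],
})

grouped = df2.groupby(['Department', 'Region'])['Salary'].mean()

# Flatten the MultiIndex back into ordinary columns
flat = grouped.reset_index()
print(flat)

# Pivot the Region level into columns: one row per Department
wide = grouped.unstack('Region')
print(wide)
```

The flat form is handy for exporting or merging; the wide form reads like a pivot table.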
# Exporting Data in Pandas
After processing your data in Pandas, you can save it to files in various formats:
| Format | Function | Key Parameters / Notes | Example Usage |
|---|---|---|---|
| CSV | `to_csv()` | `sep` sets the delimiter; `index=False` drops the row index | `df.to_csv('out.csv', index=False)` |
| Excel | `to_excel()` | `sheet_name` names the sheet; needs an engine such as openpyxl | `df.to_excel('out.xlsx', index=False)` |
| JSON | `to_json()` | `orient` controls the JSON layout (e.g. `'records'`) | `df.to_json('out.json', orient='records')` |
| Pickle | `to_pickle()` | Python-specific, fast binary format | `df.to_pickle('out.pkl')` |
| HTML | `to_html()` | Saves as an HTML table | `df.to_html('out.html')` |
| Parquet | `to_parquet()` | Efficient columnar format, great for big data; needs pyarrow or fastparquet | `df.to_parquet('out.parquet')` |
Tip: always choose the format based on your use case:

- CSV → universal, easy sharing
- Excel → spreadsheets
- JSON → web APIs or NoSQL
- Parquet → large datasets, high performance
# Save CSV
df.to_csv('output.csv', index=False)
print("CSV file 'output.csv' created successfully.")
# Save tab-separated CSV
df2.to_csv('output_tab.csv', sep='\t', index=False)
print("Tab-separated CSV file 'output_tab.csv' created successfully.")
# Save Excel
df.to_excel('output.xlsx', index=False, sheet_name='Sheet1')
print("Excel file 'output.xlsx' created successfully.")
# Save JSON
df2.to_json('output.json', orient='records', lines=True)
print("JSON file 'output.json' created successfully.")
CSV file 'output.csv' created successfully.
Tab-separated CSV file 'output_tab.csv' created successfully.
Excel file 'output.xlsx' created successfully.
JSON file 'output.json' created successfully.
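The table above also lists Pickle and HTML, which the examples do not cover. A quick sketch (file names here are placeholders):

```python
import pandas as pd

df = pd.DataFrame({'Department': ['HR', 'IT'], 'Salary': [51000, 61000]})

# Pickle: fast binary round-trip, Python-specific
df.to_pickle('output.pkl')
restored = pd.read_pickle('output.pkl')
print(restored.equals(df))  # True: the round-trip preserves values and dtypes

# HTML: render the frame as an HTML <table>
df.to_html('output.html', index=False)
```

Pickle is convenient for caching intermediate results between Python sessions, but it is not portable to other tools, so prefer CSV or Parquet for sharing.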
## Exporting Pandas Data in Google Colab
In Colab, you can save files directly to Google Drive. First, mount your Drive:
from google.colab import drive
drive.mount('/content/drive') # Follow the link and paste the authorization code
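Once Drive is mounted, it appears under /content/drive/MyDrive, and the usual export functions work on paths inside it. A sketch (the file name is a placeholder, and the fallback to the current directory is only so the same code also runs outside Colab):

```python
import os
import pandas as pd

df = pd.DataFrame({'Department': ['HR', 'IT'], 'Salary': [51000, 61000]})

# In Colab, the mounted Drive lives under this directory;
# fall back to the current directory when Drive is not mounted.
drive_dir = '/content/drive/MyDrive'
out_dir = drive_dir if os.path.isdir(drive_dir) else '.'
out_path = os.path.join(out_dir, 'salaries.csv')

df.to_csv(out_path, index=False)
print(f"Saved to {out_path}")
```

Files written under /content (outside Drive) disappear when the Colab runtime is recycled, so saving to Drive is the way to keep results.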
# The pandas Ecosystem: How It Fits In
Pandas does not exist in a vacuum. It is a central hub in the Python data science stack:

- NumPy: provides the foundational n-dimensional array object. Pandas DataFrames are built on top of NumPy arrays.
- Matplotlib/Seaborn: used for visualization. You can plot data directly from DataFrames and Series.
- Scikit-learn: the premier machine learning library. It accepts DataFrames and Series as inputs for model training.
- Jupyter Notebooks: the ideal interactive environment for exploratory data analysis with pandas.
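As a small illustration of the Matplotlib point, DataFrames expose a `.plot()` method that draws with Matplotlib directly. A sketch, assuming matplotlib is installed (the data and file name are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'Department': ['Finance', 'HR', 'IT'],
    'AvgSalary': [59000.0, 51000.0, 61000.0],
})

# DataFrame.plot() returns a Matplotlib Axes object
ax = df.plot(kind='bar', x='Department', y='AvgSalary', legend=False)
ax.set_ylabel('Average salary')
plt.tight_layout()
plt.savefig('salary_by_department.png')
```

Because `.plot()` returns an ordinary Matplotlib Axes, any Matplotlib customization (labels, titles, limits) applies afterward.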
# When to Use Pandas (And When Not To)

## Use pandas when:

- Working with tabular data (like spreadsheets or database tables)
- Cleaning and preprocessing data
- Performing exploratory data analysis
- Handling medium-sized datasets (up to a few gigabytes)
## Consider alternatives when:

- Working with very large datasets that don't fit in memory
- You need extremely high performance for numerical computations (consider NumPy directly)
- Working with unstructured data like images or text