Applying Aggregation Functions Directly to a DataFrame#
One of the strengths of Pandas is that you can apply statistical and aggregation methods directly to a DataFrame or Series. These methods summarize data and provide insights without needing extra loops or manual calculations.
Common Aggregation Methods#
Here are some of the most commonly used methods:
| Method | Description | Works On |
|---|---|---|
| `sum()` | Returns the sum of values | DataFrame / Series |
| `mean()` | Returns the average (mean) value | DataFrame / Series |
| `count()` | Counts non-null values | DataFrame / Series |
| `min()` | Returns the minimum value | DataFrame / Series |
| `max()` | Returns the maximum value | DataFrame / Series |
| `std()` | Returns the standard deviation | DataFrame / Series |
| `var()` | Returns the variance | DataFrame / Series |
| `describe()` | Generates summary statistics (count, mean, std, min, quartiles, max) | DataFrame / Series |
Example: Aggregating a Series#

```python
import pandas as pd

# Salary data
salaries = pd.Series([50000, 60000, 55000, 65000, 70000])

print("Sum:", salaries.sum())
print("Mean:", salaries.mean())
print("Max:", salaries.max())
print("Std Dev:", salaries.std())
```

```text
Sum: 300000
Mean: 60000.0
Max: 70000
Std Dev: 7905.694150420948
```
Each method is applied directly to the Series, returning a single value.
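The remaining methods from the table work the same way. As a quick sketch on the same salaries Series:

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 55000, 65000, 70000])

print("Count:", salaries.count())    # number of non-null values
print("Min:", salaries.min())
print("Variance:", salaries.var())   # sample variance (ddof=1 by default)
print(salaries.describe())           # count, mean, std, min, quartiles, max
```

Note that `describe()` is itself a Series, so you can index into it (e.g. `salaries.describe()['50%']` for the median).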
Example: Aggregating a DataFrame#

```python
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000]
}
df = pd.DataFrame(data)

print(df.sum(numeric_only=True))   # Sum of numeric columns
print(df.mean(numeric_only=True))  # Mean of numeric columns
print(df.describe())
```

```text
Age           82
Salary    165000
dtype: int64
Age          27.333333
Salary    55000.000000
dtype: float64
             Age   Salary
count   3.000000      3.0
mean   27.333333  55000.0
std     3.055050   5000.0
min    24.000000  50000.0
25%    26.000000  52500.0
50%    28.000000  55000.0
75%    29.000000  57500.0
max    30.000000  60000.0
```
Notice how the `numeric_only=True` flag excludes non-numeric columns (like `Name`) from the calculation, while `describe()` skips them by default.
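If you prefer to drop text columns explicitly instead of passing `numeric_only=True` to every call, `select_dtypes` is a common alternative. A small sketch using the same DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000]
})

# Keep only the numeric columns, then aggregate as usual
numeric_df = df.select_dtypes(include='number')
print(numeric_df.mean())
```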
More Advanced: Filtering Data and Applying Statistical Functions#
We can combine row filtering with aggregation functions to analyze subsets of a DataFrame.
The general syntax is:
```python
df[df['column_name'] <condition>]['target_column'].function()
```

where:

- `df[...]` → filters the rows that meet the condition
- `['target_column']` → selects the column to aggregate
- `.function()` → applies the aggregation function
```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 35, 28, 40],
    'Salary': [50000, 66000, 55000, 70000]
}
df = pd.DataFrame(data)

# Average salary of employees older than 30
print(df[df['Age'] > 30]['Salary'].mean())

# Maximum salary for employees younger than 30
print(df[df['Age'] < 30]['Salary'].max())

# Count employees with salary above 60,000
print(df[df['Salary'] > 60000]['Name'].count())

# Standard deviation of salary for people aged 25–40
print(df[(df['Age'] >= 25) & (df['Age'] <= 40)]['Salary'].std())
```

```text
68000.0
55000
2
7767.45346515403
```
So the syntax pattern is:
```python
df[condition]['column'].aggregation()
```
| Expression | Meaning |
|---|---|
| `df[df['Age'] > 30]['Salary'].mean()` | Mean of Salary where Age > 30 |
| `df[df['Salary'] > 60000]['Name'].count()` | Count of employees with Salary > 60,000 |
| `df[(df['Age'] >= 25) & (df['Age'] <= 40)]['Salary'].std()` | Standard deviation of Salary for 25–40 year olds |
This pattern allows you to filter data first, then aggregate only on the rows that meet your condition.
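Note that `df[condition]['column']` is chained indexing; it works fine for reading, but the equivalent `.loc[row_condition, column]` form is generally preferred in pandas. A sketch with the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 35, 28, 40],
    'Salary': [50000, 66000, 55000, 70000]
})

# Select rows and the target column in a single .loc call, then aggregate
avg_salary = df.loc[df['Age'] > 30, 'Salary'].mean()
print(avg_salary)  # 68000.0
```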
Grouping Data with groupby#
While filtering + aggregation lets us summarize a subset of data, the groupby() method allows us to compute statistics across categories.
This is the classic split–apply–combine process:
1. Split the data into groups based on one or more columns.
2. Apply an aggregation function to each group.
3. Combine the results into a new DataFrame or Series.
Basic Syntax#
```python
df.groupby('column_name')['target_column'].aggregation_function()
```

where:

- `groupby('column_name')` → splits the data into groups.
- `['target_column']` → selects the column to aggregate.
- `.aggregation_function()` → applies functions like `mean()`, `sum()`, `count()`.
Example: Salary by Department#
```python
import pandas as pd

# Sample dataset
data = {
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df = pd.DataFrame(data)
print(df)

# Average salary per department
print(df.groupby('Department')['Salary'].mean())
```

```text
  Department Employee  Salary
0         HR    Alice   50000
1         HR      Bob   52000
2         IT  Charlie   60000
3         IT    David   62000
4    Finance      Eva   58000
5    Finance    Frank   60000
Department
Finance    59000.0
HR         51000.0
IT         61000.0
Name: Salary, dtype: float64
```
Grouping by Multiple Columns#
```python
# Example dataset with Region added
data2 = {
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df2 = pd.DataFrame(data2)

# Group by Department and Region
print(df2.groupby(['Department', 'Region'])['Salary'].mean())
```

```text
Department  Region
Finance     East      58000.0
            West      60000.0
HR          East      50000.0
            West      52000.0
IT          East      60000.0
            West      62000.0
Name: Salary, dtype: float64
```
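To compute several statistics per group in one pass, `groupby` can be combined with `agg()`. A sketch on the same dataset:

```python
import pandas as pd

df2 = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
})

# One row per department, one column per statistic
summary = df2.groupby('Department')['Salary'].agg(['mean', 'min', 'max'])
print(summary)
```

The result is a DataFrame indexed by `Department`, with `mean`, `min`, and `max` as columns.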
Exporting Data in Pandas#
After processing your data in Pandas, you can save it to files in various formats:
| Format | Function | Key Parameters / Notes | Example Usage |
|---|---|---|---|
| CSV | `to_csv()` | `sep` sets the delimiter; `index=False` skips the row index | `df.to_csv('output.csv', index=False)` |
| Excel | `to_excel()` | `sheet_name`, `index`; needs an engine such as `openpyxl` | `df.to_excel('output.xlsx', index=False)` |
| JSON | `to_json()` | `orient` controls the JSON layout (e.g. `'records'`) | `df.to_json('output.json', orient='records')` |
| Pickle | `to_pickle()` | Python-specific, fast binary format | `df.to_pickle('output.pkl')` |
| HTML | `to_html()` | Saves as an HTML table | `df.to_html('output.html')` |
| Parquet | `to_parquet()` | Efficient columnar format, great for big data | `df.to_parquet('output.parquet')` |
Tip: Always choose the format based on your use case:
- CSV → universal, easy sharing
- Excel → spreadsheets
- JSON → web APIs or NoSQL
- Parquet → large datasets, high performance
```python
# Save CSV
df.to_csv('output.csv', index=False)
print("CSV file 'output.csv' created successfully.")

# Save tab-separated CSV
df2.to_csv('output_tab.csv', sep='\t', index=False)
print("Tab-separated CSV file 'output_tab.csv' created successfully.")

# Save Excel
df.to_excel('output.xlsx', index=False, sheet_name='Sheet1')
print("Excel file 'output.xlsx' created successfully.")

# Save JSON
df2.to_json('output.json', orient='records', lines=True)
print("JSON file 'output.json' created successfully.")
```

```text
CSV file 'output.csv' created successfully.
Tab-separated CSV file 'output_tab.csv' created successfully.
Excel file 'output.xlsx' created successfully.
JSON file 'output.json' created successfully.
```
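Most `to_*` writers have a matching `read_*` reader, so you can verify an export by loading it back. A minimal round-trip sketch (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})
df.to_csv('output.csv', index=False)

# Read the file back and confirm the data survived the round trip
restored = pd.read_csv('output.csv')
print(restored.equals(df))  # True
```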