Applying Aggregation Functions Directly to a DataFrame#

One of the strengths of Pandas is that you can apply statistical and aggregation methods directly to a DataFrame or Series. These methods summarize data and provide insights without needing extra loops or manual calculations.

Common Aggregation Methods#

Here are some of the most commonly used methods:

| Method | Description | Works On |
|--------|-------------|----------|
| .sum() | Returns the sum of values | DataFrame / Series |
| .mean() | Returns the average (mean) value | DataFrame / Series |
| .count() | Counts non-null values | DataFrame / Series |
| .min() | Returns the minimum value | DataFrame / Series |
| .max() | Returns the maximum value | DataFrame / Series |
| .std() | Returns the standard deviation | DataFrame / Series |
| .var() | Returns the variance | DataFrame / Series |
| .describe() | Generates summary statistics (count, mean, std, min, quartiles, max) | DataFrame / Series |

Example: Aggregating a Series

import pandas as pd

# Salary data
salaries = pd.Series([50000, 60000, 55000, 65000, 70000])

print("Sum:", salaries.sum())
print("Mean:", salaries.mean())
print("Max:", salaries.max())
print("Std Dev:", salaries.std())
Sum: 300000
Mean: 60000.0
Max: 70000
Std Dev: 7905.694150420948

Each method is applied directly to the Series, returning a single value.
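The table above also lists .describe(), which was not shown in the example. As a quick sketch, it bundles several of these statistics into a single call on the same Series:

```python
import pandas as pd

salaries = pd.Series([50000, 60000, 55000, 65000, 70000])

# .describe() returns count, mean, std, min, quartiles, and max
# as one summary Series
summary = salaries.describe()
print(summary)
```

The result is itself a Series, so individual statistics can be pulled out by label, e.g. summary['mean'].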

Example: Aggregating a DataFrame

data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000]
}
df = pd.DataFrame(data)

print(df.sum(numeric_only=True))   # Sum of numeric columns
print(df.mean(numeric_only=True))  # Mean of numeric columns
print(df.describe())
Age           82
Salary    165000
dtype: int64
Age          27.333333
Salary    55000.000000
dtype: float64
             Age   Salary
count   3.000000      3.0
mean   27.333333  55000.0
std     3.055050   5000.0
min    24.000000  50000.0
25%    26.000000  52500.0
50%    28.000000  55000.0
75%    29.000000  57500.0
max    30.000000  60000.0

Notice that passing numeric_only=True makes .sum() and .mean() skip non-numeric columns (like "Name"), and .describe() summarizes only the numeric columns by default.
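If you prefer to be explicit rather than rely on the numeric_only flag, one option is to select the numeric columns first with select_dtypes. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [24, 30, 28],
    'Salary': [50000, 60000, 55000]
})

# select_dtypes keeps only numeric columns, so the aggregation
# no longer needs a numeric_only flag
numeric_df = df.select_dtypes(include='number')
print(numeric_df.mean())
```

This makes the intent visible in the code: you aggregate exactly the columns you selected.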

More Advanced: Filtering Data & Applying Statistical Functions#

We can combine row filtering with aggregation functions to analyze subsets of a DataFrame.

The general syntax is:

df[df['column_name'] > value]['target_column'].function()

where:
(here > stands for any comparison operator)

  • df[…] → filters the rows that meet the condition

  • [‘target_column’] → selects the column to aggregate

  • .function() → applies the aggregation function

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 35, 28, 40],
    'Salary': [50000, 66000, 55000, 70000]
}
df = pd.DataFrame(data)

# Average salary of employees older than 30
avg_salary = df[df['Age'] > 30]['Salary'].mean()
print("Average salary (Age > 30):", avg_salary)

# Maximum salary for employees younger than 30
print("Max salary (Age < 30):", df[df['Age'] < 30]['Salary'].max())

# Count employees with salary above 60,000
print("Employees earning > 60,000:", df[df['Salary'] > 60000]['Name'].count())

# Standard deviation of salary for people aged 25–40
print("Salary std (Age 25–40):", df[(df['Age'] >= 25) & (df['Age'] <= 40)]['Salary'].std())
Average salary (Age > 30): 68000.0
Max salary (Age < 30): 55000
Employees earning > 60,000: 2
Salary std (Age 25–40): 7767.45346515403

So the syntax pattern is:

df[condition]['column'].aggregation()

| Expression | Meaning |
|------------|---------|
| df[df['Age'] > 30]['Salary'].mean() | Mean of Salary where Age > 30 |
| df[df['Salary'] > 60000]['Name'].count() | Count of employees with Salary > 60,000 |
| df[(df['Age'] >= 25) & (df['Age'] <= 40)]['Salary'].std() | Standard deviation of Salary for ages 25–40 |

This pattern allows you to filter data first, then aggregate only on the rows that meet your condition.
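The same filter-then-aggregate idea can also be written with .loc, which performs the row filter and the column selection in a single indexing step. A small sketch using the dataset above:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 35, 28, 40],
    'Salary': [50000, 66000, 55000, 70000]
})

# .loc[row_condition, column] filters rows and selects the
# target column in one step, then the aggregation is applied
avg_salary = df.loc[df['Age'] > 30, 'Salary'].mean()
print(avg_salary)  # 68000.0
```

Both spellings give the same result; .loc avoids the intermediate DataFrame created by chained indexing.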

Grouping Data with groupby#

While filtering + aggregation lets us summarize a subset of data, the groupby() method allows us to compute statistics across categories.
This is the classic split–apply–combine process:

  1. Split data into groups based on one or more columns.

  2. Apply an aggregation function to each group.

  3. Combine results into a new DataFrame or Series.


Basic Syntax#

df.groupby('column_name')['target_column'].aggregation_function()

where:

  • groupby('column_name') → splits the data into groups.

  • ['target_column'] → selects the column to aggregate.

  • .aggregation_function() → applies functions like mean(), sum(), count().

Example: Salary by Department

import pandas as pd

# Sample dataset
data = {
    'Department': ['HR','HR','IT','IT','Finance','Finance'],
    'Employee': ['Alice','Bob','Charlie','David','Eva','Frank'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df = pd.DataFrame(data)
print(df)

# Average salary per department
df.groupby('Department')['Salary'].mean()
  Department Employee  Salary
0         HR    Alice   50000
1         HR      Bob   52000
2         IT  Charlie   60000
3         IT    David   62000
4    Finance      Eva   58000
5    Finance    Frank   60000
Department
Finance    59000.0
HR         51000.0
IT         61000.0
Name: Salary, dtype: float64

Grouping by Multiple Columns#

# Example dataset with Region added
data2 = {
    'Department': ['HR','HR','IT','IT','Finance','Finance'],
    'Region': ['East','West','East','West','East','West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
}
df2 = pd.DataFrame(data2)

# Group by Department and Region
df2.groupby(['Department','Region'])['Salary'].mean()
Department  Region
Finance     East      58000.0
            West      60000.0
HR          East      50000.0
            West      52000.0
IT          East      60000.0
            West      62000.0
Name: Salary, dtype: float64
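The two-level result above can be easier to read as a table. As an optional extra step, .unstack() pivots the inner index level (Region) into columns:

```python
import pandas as pd

df2 = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Finance', 'Finance'],
    'Region': ['East', 'West', 'East', 'West', 'East', 'West'],
    'Salary': [50000, 52000, 60000, 62000, 58000, 60000]
})

# .unstack() turns the two-level Series into a DataFrame with
# Department as rows and Region as columns
table = df2.groupby(['Department', 'Region'])['Salary'].mean().unstack()
print(table)
```

Each cell now holds the mean salary for one Department/Region pair.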

Exporting Data in Pandas#

After processing your data in Pandas, you can save it to files in various formats:

| Format | Function | Key Parameters / Notes | Example Usage |
|--------|----------|------------------------|---------------|
| CSV | to_csv() | index=False to omit the index column, sep='\t' for tab-delimited | df.to_csv('data.csv', index=False) |
| Excel | to_excel() | sheet_name='Sheet1'; requires openpyxl | df.to_excel('data.xlsx', index=False) |
| JSON | to_json() | orient='records', lines=True for line-delimited JSON | df.to_json('data.json', orient='records', lines=True) |
| Pickle | to_pickle() | Python-specific, fast binary format | df.to_pickle('data.pkl') |
| HTML | to_html() | Saves as an HTML table | df.to_html('data.html', index=False) |
| Parquet | to_parquet() | Efficient columnar format, great for big data | df.to_parquet('data.parquet', index=False) |

Tip: Always choose the format based on your use case:

  • CSV → universal, easy sharing

  • Excel → spreadsheets

  • JSON → web APIs or NoSQL

  • Parquet → large datasets, high performance

# Save CSV
df.to_csv('output.csv', index=False)
print("CSV file 'output.csv' created successfully.")

# Save tab-separated CSV
df2.to_csv('output_tab.csv', sep='\t', index=False)
print("Tab-separated CSV file 'output_tab.csv' created successfully.")

# Save Excel
df.to_excel('output.xlsx', index=False, sheet_name='Sheet1')
print("Excel file 'output.xlsx' created successfully.")

# Save JSON
df2.to_json('output.json', orient='records', lines=True)
print("JSON file 'output.json' created successfully.")
CSV file 'output.csv' created successfully.
Tab-separated CSV file 'output_tab.csv' created successfully.
Excel file 'output.xlsx' created successfully.
JSON file 'output.json' created successfully.
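A useful habit after exporting is to read the file back and confirm the data round-trips cleanly. A small sketch using an in-memory buffer instead of a file on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Salary': [50000, 60000]
})

# Write CSV to an in-memory buffer, then read it back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# .equals() checks that values and dtypes survived the round trip
print(restored.equals(df))
```

The same check works with real files: export with to_csv(), re-import with pd.read_csv(), and compare.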