Adding a New Column to a DataFrame#
A new column can be added to a pandas DataFrame by assigning a value, list, or Series to a new column name. A list must contain exactly as many elements as the DataFrame has rows. A Series is aligned on its index labels rather than by position, so any row without a matching label receives NaN. A single (scalar) value is broadcast to every row.
import pandas as pd
df = pd.DataFrame({
"Name": ["Alice", "Bob", "Charlie"],
"Math": [90, 85, 95]
})
print(df)
# Add a new column with a list
df["English"] = [88, 92, 80]
print(df)
# Add a new column with a single value
df["Pass"] = True
print(df)
Name Math
0 Alice 90
1 Bob 85
2 Charlie 95
Name Math English
0 Alice 90 88
1 Bob 85 92
2 Charlie 95 80
Name Math English Pass
0 Alice 90 88 True
1 Bob 85 92 True
2 Charlie 95 80 True
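The difference between assigning a list and assigning a Series is easiest to see side by side. The following is a minimal sketch; the column names Science and History and their values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Math": [90, 85, 95],
})

# A Series is matched to rows by index label, not by position.
# Labels 2 and 0 exist in df; label 1 has no match and becomes NaN.
science = pd.Series([70, 75], index=[2, 0])
df["Science"] = science
print(df)

# A list has no index, so its length must equal the number of rows.
try:
    df["History"] = [60, 65]  # only 2 values for 3 rows
except ValueError as err:
    length_error = str(err)
    print("ValueError:", length_error)
```

Because of the missing value, the Science column is stored as floating point even though the assigned values were integers.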
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 30, 28],
'Salary': [50000, 60000, 55000]
}
df = pd.DataFrame(data)
print(df)
# Increase all salaries by 10%
df['Salary'] = df['Salary'] * 1.10
# Add 5 years to everyone’s age
df['Age'] = df['Age'] + 5
print(df)
Name Age Salary
0 Alice 24 50000
1 Bob 30 60000
2 Charlie 28 55000
Name Age Salary
0 Alice 29 55000.0
1 Bob 35 66000.0
2 Charlie 33 60500.0
Arithmetic Between Columns#
You can also perform arithmetic between two or more columns to create new features.
# Create a new column 'Income_per_Age'
df['Income_per_Age'] = df['Salary'] / df['Age']
print(df)
Name Age Salary Income_per_Age
0 Alice 29 55000.0 1896.551724
1 Bob 35 66000.0 1885.714286
2 Charlie 33 60500.0 1833.333333
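Derived columns can also be created without modifying the original DataFrame by using assign(). This is a sketch under the assumption that you want to keep df intact; Monthly_Salary and Above_Average are illustrative column names that do not appear elsewhere in this section.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [29, 35, 33],
    "Salary": [55000.0, 66000.0, 60500.0],
})

# assign() returns a new DataFrame; df itself is left unchanged.
# A callable receives the DataFrame being built, so it can use its columns.
enriched = df.assign(
    Income_per_Age=df["Salary"] / df["Age"],
    Monthly_Salary=lambda d: d["Salary"] / 12,
    Above_Average=lambda d: d["Salary"] > d["Salary"].mean(),
)
print(enriched)
```

Direct assignment with df['col'] = ... is shorter for one-off changes, while assign() is convenient when several steps are chained together.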
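Applying Built-in Pandas/NumPy Functions#
A pandas Series provides statistical methods such as mean() and std(), and NumPy functions such as np.sqrt() can be applied to a Series directly, returning a new Series with the same index.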
import numpy as np
# Calculate average salary
print(df['Salary'].mean())
# Standard deviation of Age
print(df['Age'].std())
# Apply numpy square root
print(np.sqrt(df['Age']))
60500.0
3.0550504633038935
0 5.385165
1 5.916080
2 5.744563
Name: Age, dtype: float64
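Several statistics can be computed in one call with agg(). The sketch below rebuilds the Age and Salary values from the example above and also shows one detail worth knowing: the pandas std() method uses the sample standard deviation (ddof=1) by default, whereas np.std() on a plain NumPy array uses the population formula (ddof=0).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [29, 35, 33],
    "Salary": [55000.0, 66000.0, 60500.0],
})

# One row per statistic, one column per original column
summary = df.agg(["mean", "std", "min", "max"])
print(summary)

# pandas defaults to the sample standard deviation (ddof=1) ...
sample_std = df["Age"].std()
# ... while NumPy on a raw array defaults to the population one (ddof=0)
population_std = np.std(df["Age"].to_numpy())
print(sample_std, population_std)

# NumPy ufuncs applied to a Series return a Series with the same index
roots = np.sqrt(df["Age"])
print(roots)
```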
Applying Functions with apply()#
Sometimes you need custom transformations. The apply() method lets you apply a function to each element of a column (Series) or to each row or column of a DataFrame. Pass axis=1 to process the DataFrame row by row.
# Apply to a Series
df['Age_squared'] = df['Age'].apply(lambda x: x**2)
# Apply to DataFrame across rows
df['Total'] = df[['Age','Salary']].apply(lambda row: row['Age'] + row['Salary'], axis=1)
print(df)
Name Age Salary Income_per_Age Age_squared Total
0 Alice 29 55000.0 1896.551724 841 55029.0
1 Bob 35 66000.0 1885.714286 1225 66035.0
2 Charlie 33 60500.0 1833.333333 1089 60533.0
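Note that pandas can also apply a function elementwise to a whole DataFrame with applymap() (renamed to DataFrame.map() in pandas 2.1, where applymap() is deprecated) and to a single column with Series.map(), but these are not covered in this course.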
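For simple arithmetic like the two apply() calls above, the same result can usually be obtained with plain column operations, which avoid calling a Python function once per element or row. The comparison below is a sketch on the same Age and Salary values; apply() remains the right tool when the logic cannot be written as column arithmetic.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [29, 35, 33],
    "Salary": [55000.0, 66000.0, 60500.0],
})

# Element by element with apply() versus one vectorized expression
squared_apply = df["Age"].apply(lambda x: x**2)
squared_vectorized = df["Age"] ** 2

# Row by row with apply(axis=1) versus adding the two columns directly
total_apply = df.apply(lambda row: row["Age"] + row["Salary"], axis=1)
total_vectorized = df["Age"] + df["Salary"]

print((squared_apply == squared_vectorized).all())
print((total_apply == total_vectorized).all())
```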
Filtering Data in Pandas#
Once you know how to select columns and rows, the next step is learning how to filter data. Filtering helps you focus on only the relevant part of your dataset, whether that means removing unnecessary columns, isolating rows that meet certain conditions, or preparing features for modeling.
Filtering Columns#
Column filtering is about selecting only the columns you need or dropping the ones you don’t. This reduces memory usage and keeps your DataFrame manageable.
# Select a single column
df['Age']
# Select multiple columns
df[['Name', 'Age']]
| | Name | Age |
|---|---|---|
| 0 | Alice | 29 |
| 1 | Bob | 35 |
| 2 | Charlie | 33 |
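Single and double brackets return different types, and pandas offers a couple of helpers for selecting columns by data type or by name pattern. The following sketch uses a three-column DataFrame like the one in this section.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [29, 35, 33],
    "Salary": [55000.0, 66000.0, 60500.0],
})

age_series = df["Age"]    # single brackets -> Series
age_frame = df[["Age"]]   # list of names  -> DataFrame with one column
print(type(age_series).__name__, type(age_frame).__name__)

# Keep only numeric columns (drops the text column Name)
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())

# Keep columns whose name contains the substring "Sal"
salary_like = df.filter(like="Sal")
print(salary_like.columns.tolist())
```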
Dropping Unused Columns#
# Drop the 'Age_squared' column
df = df.drop(columns=['Age_squared'])
print(df)
Name Age Salary Income_per_Age Total
0 Alice 29 55000.0 1896.551724 55029.0
1 Bob 35 66000.0 1885.714286 66035.0
2 Charlie 33 60500.0 1833.333333 60533.0
This is especially useful when preparing data for machine learning, where only selected features are required.
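Two behaviours of drop() are worth seeing in code: it returns a new DataFrame rather than changing the original, and it raises a KeyError for a column that does not exist unless errors="ignore" is passed. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [29, 35, 33],
    "Age_squared": [841, 1225, 1089],
})

# drop() returns a new DataFrame; df keeps all of its columns
trimmed = df.drop(columns=["Age_squared"])
print(df.columns.tolist())
print(trimmed.columns.tolist())

# Dropping a missing column raises KeyError by default
try:
    trimmed.drop(columns=["Age_squared"])
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)

# errors="ignore" silently skips labels that are not present
still_trimmed = trimmed.drop(columns=["Age_squared"], errors="ignore")
```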
Filtering Rows (using Boolean Indexing)#
Row filtering is usually done with Boolean indexing, where you apply a condition and return only the rows where that condition is true.
# Filter rows where Age > 30
df[df['Age'] > 30]
| | Name | Age | Salary | Income_per_Age | Total |
|---|---|---|---|---|---|
| 1 | Bob | 35 | 66000.0 | 1885.714286 | 66035.0 |
| 2 | Charlie | 33 | 60500.0 | 1833.333333 | 60533.0 |
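It can help to look at the condition on its own: it is a Series of True/False values, one per row, and indexing with it keeps the rows marked True. The sketch below also shows .loc, which accepts the same mask together with a column selection.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [29, 35, 33],
    "Salary": [55000.0, 66000.0, 60500.0],
})

# The condition evaluates to a Boolean Series aligned with df's index
mask = df["Age"] > 30
print(mask)

# Rows where the mask is True are kept; original index labels are preserved
older = df[mask]
print(older)

# .loc filters rows and selects columns in one step
older_names = df.loc[mask, ["Name", "Age"]]
print(older_names)
```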
Combining Multiple Conditions#
You can combine conditions using & (and) or | (or).
# Filter rows where Age > 30 AND Salary > 60000
df[(df['Age'] > 30) & (df['Salary'] > 60000)]
| | Name | Age | Salary | Income_per_Age | Total |
|---|---|---|---|---|---|
| 1 | Bob | 35 | 66000.0 | 1885.714286 | 66035.0 |
| 2 | Charlie | 33 | 60500.0 | 1833.333333 | 60533.0 |
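Remember to wrap each condition in parentheses: & and | bind more tightly than comparison operators such as >, so an unparenthesized expression is evaluated in the wrong order. Use & and | rather than the Python keywords and and or, which do not work elementwise on a Series.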
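The sketch below shows the or (|) and not (~) operators on the same data, and what happens when the parentheses are left out. The thresholds are arbitrary values chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [29, 35, 33],
    "Salary": [55000.0, 66000.0, 60500.0],
})

# OR: keep rows that satisfy at least one condition
either = df[(df["Age"] > 34) | (df["Salary"] < 56000)]
print(either)

# NOT: ~ inverts a Boolean mask
not_older = df[~(df["Age"] > 30)]
print(not_older)

# Without parentheses, Python evaluates 30 & df["Salary"] first,
# which fails instead of producing the intended filter.
try:
    df[df["Age"] > 30 & df["Salary"] > 60000]
    failed = False
except (TypeError, ValueError):
    failed = True
print("Unparenthesized expression failed:", failed)
```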
Filtering Strings#
You can filter rows where a text column contains a specific substring using the .str accessor.
# Filter rows where Name contains "Bob"
df[df['Name'].str.contains("Bob")]
| | Name | Age | Salary | Income_per_Age | Total |
|---|---|---|---|---|---|
| 1 | Bob | 35 | 66000.0 | 1885.714286 | 66035.0 |
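By default str.contains() is case-sensitive, and a missing value in the text column produces a missing value in the mask, which cannot be used for filtering. The sketch below adds a lowercase name and a missing name (both made up for illustration) to show the case and na parameters.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "bobby", None],
    "Age": [29, 35, 33, 41, 50],
})

# case=False ignores letter case; na=False treats missing names as "no match"
bob_like = df[df["Name"].str.contains("bob", case=False, na=False)]
print(bob_like)

# Other string tests work the same way, for example startswith()
starts_with_ch = df[df["Name"].str.startswith("Ch", na=False)]
print(starts_with_ch)
```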
Unique Values and Counting#
Sometimes you want to check how many unique values a column has, or count how often each appears.
# Unique names
print(df['Name'].unique())
# Count frequency of each name
print(df['Name'].value_counts())
['Alice' 'Bob' 'Charlie']
Name
Alice 1
Bob 1
Charlie 1
Name: count, dtype: int64
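Every name in the example above is unique, so the counts are all 1. Counting becomes more informative on a column with repeated values. The sketch below uses a hypothetical Department column that is not part of the DataFrame in this section.

```python
import pandas as pd

# Hypothetical column with repeated values
departments = pd.Series(
    ["Sales", "IT", "Sales", "HR", "Sales", "IT"],
    name="Department",
)

# Number of distinct values
n_unique = departments.nunique()
print(n_unique)

# Frequency of each value, sorted from most to least common
counts = departments.value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts
shares = departments.value_counts(normalize=True)
print(shares)
```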