Types of Feature Engineering#
Feature engineering generally involves three major types of operations:
- Feature Creation
- Feature Transformation
- Feature Selection
Each type plays a different role in improving the quality of input data for machine learning models.
1. Feature Creation#
Feature creation involves generating new variables from existing data. These new variables may capture relationships that the original features do not directly represent.
Examples:
| Original Features | Engineered Feature |
|---|---|
| height, weight | BMI = weight / height² |
| purchase_count, visits | average_purchase_value |
| date | day_of_week, month, is_weekend |
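For instance, the date-derived features in the last row can be computed with pandas' `.dt` accessor. A minimal sketch, using made-up dates purely for illustration:

```python
import pandas as pd

# Hypothetical order dates, purely for illustration
df = pd.DataFrame({"date": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-08"])})

# Derive calendar features from the raw date column
df["day_of_week"] = df["date"].dt.dayofweek        # Monday = 0 ... Sunday = 6
df["month"] = df["date"].dt.month
df["is_weekend"] = df["date"].dt.dayofweek >= 5    # Saturday or Sunday
```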
Example: Housing Dataset#
Original features:
| Size_sqft | Bedrooms | Price |
|---|---|---|
| 1500 | 3 | 450000 |
Engineered features:
- price_per_sqft
- bedrooms_per_sqft
- house_age
These engineered features may better capture housing patterns.
```python
import pandas as pd

df = pd.DataFrame({
    "size_sqft": [1500, 1800, 1200],
    "price": [450000, 520000, 350000]
})
df["price_per_sqft"] = df["price"] / df["size_sqft"]
df
```
|   | size_sqft | price | price_per_sqft |
|---|---|---|---|
| 0 | 1500 | 450000 | 300.000000 |
| 1 | 1800 | 520000 | 288.888889 |
| 2 | 1200 | 350000 | 291.666667 |
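The other two engineered features follow the same pattern. A sketch, assuming hypothetical `bedrooms` and `year_built` columns that are not in the DataFrame above:

```python
import pandas as pd

# Hypothetical bedrooms and year_built values for the same three houses
df = pd.DataFrame({
    "size_sqft": [1500, 1800, 1200],
    "bedrooms": [3, 4, 2],
    "year_built": [2005, 1998, 2012]
})

df["bedrooms_per_sqft"] = df["bedrooms"] / df["size_sqft"]
df["house_age"] = 2024 - df["year_built"]  # relative to an assumed current year
```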
Remember: feature creation often relies on domain knowledge to design useful variables.
Types of Feature Creation#
Common creation techniques include:
- Polynomial Features
- Interaction Features
1a. Polynomial Features#
Polynomial features are created by raising existing features to a power. They allow models to capture nonlinear relationships between variables.
For example, suppose we are predicting house prices using the size of a house.
| Size (sqft) | Price |
|---|---|
| 1000 | 200000 |
| 1500 | 300000 |
| 2000 | 450000 |
The relationship between size and price may not be perfectly linear. To capture nonlinear patterns, we can create polynomial features:
Example:#
| Size | Size² |
|---|---|
| 1000 | 1,000,000 |
| 1500 | 2,250,000 |
| 2000 | 4,000,000 |
These new variables allow models to learn curved relationships.
Example Python Code#
```python
from sklearn.preprocessing import PolynomialFeatures
import pandas as pd

df = pd.DataFrame({
    "size": [1000, 1500, 2000]
})
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df)
pd.DataFrame(poly_features, columns=["size", "size_squared"])
```
Polynomial features are commonly used with models such as:
- Linear Regression
- Logistic Regression
They allow simple models to capture complex patterns.
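As a sketch of how this works end to end, the toy prices above can be fit with an ordinary `LinearRegression` once size is expanded into polynomial terms; the model stays linear in its inputs but becomes quadratic in the original size:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Sizes and prices from the table above
X = np.array([[1000], [1500], [2000]])
y = np.array([200000, 300000, 450000])

# Expand size into [size, size^2], then fit a plain linear model
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_poly, y)

predictions = model.predict(X_poly)
```

With three points and three parameters the fitted quadratic passes exactly through the data; on realistic datasets it would only approximate it.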
1b. Interaction Features#
Interaction features are created by multiplying two or more features together. They capture relationships where the combined effect of multiple variables matters.
Example#
Suppose we want to predict house prices using:
- house size
- neighborhood quality score
| Size | Neighborhood Score |
|---|---|
| 1500 | 8 |
| 1500 | 4 |
Two houses may have the same size, but if one is in a better neighborhood, the price may be higher. An interaction feature can capture this relationship.
Example Table#
| Size | Neighborhood | Size × Neighborhood |
|---|---|---|
| 1500 | 8 | 12000 |
| 1500 | 4 | 6000 |
This feature helps the model understand that price depends on both variables together.
Example Python Code#
```python
# Assumes df already contains "size" and "neighborhood_score" columns
df["size_neighborhood_interaction"] = df["size"] * df["neighborhood_score"]
```
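A self-contained version of the same idea, rebuilding the two hypothetical houses from the table above:

```python
import pandas as pd

# The two hypothetical houses from the table above
df = pd.DataFrame({
    "size": [1500, 1500],
    "neighborhood_score": [8, 4]
})

# Same size, different neighborhood: the product tells them apart
df["size_neighborhood_interaction"] = df["size"] * df["neighborhood_score"]
```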
Why These Features Matter#
Polynomial and interaction features help models capture complex relationships in data.
- Polynomial features capture nonlinear patterns.
- Interaction features capture relationships between multiple variables.
These techniques are widely used in:
- regression models
- recommendation systems
- predictive analytics