Introduction to “Regression”#

Regression is one of the most fundamental techniques in data analysis and machine learning. At its core, regression is about understanding relationships between variables and using those relationships to make predictions.

In many real-world problems, we are not just interested in classification (yes/no), but in predicting a continuous value. For example:

  • What will be the price of a house given its size?

  • How will sales change with advertising budget?

  • What temperature can we expect tomorrow?

These types of problems naturally lead to regression models.

In regression, we typically define:

  • Dependent variable (target): the value we want to predict

  • Independent variables (features): the inputs used to make predictions
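As a concrete sketch of this split (using made-up house data, not data from any real dataset), the features and target might look like:

```python
# Hypothetical house data.
# sizes (square meters) are the independent variable (feature);
# prices (in thousands) are the dependent variable (target).
sizes = [50, 80, 120, 160, 200]      # features: the inputs
prices = [150, 220, 330, 410, 520]   # target: the value we want to predict

# Each (feature, target) pair is one observation.
observations = list(zip(sizes, prices))
print(observations[0])  # (50, 150)
```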

A More Intuitive Way to Think About Regression#

Before jumping into formulas, it helps to think about regression in a more natural way.

Imagine you are looking at data points plotted on a graph; perhaps house sizes on the x-axis and prices on the y-axis. The points are scattered, but you can visually sense a trend: larger houses tend to cost more.

Regression is essentially the process of drawing a line (or curve) that best captures this trend. It is not about perfectly matching every single point. Instead, it is about capturing the overall pattern so that we can:

  • Understand how variables are related

  • Make reasonable predictions for new, unseen data

In that sense, regression is both:

  • A predictive tool, and

  • An explanatory tool
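To make this intuition concrete, here is a minimal sketch (pure Python, made-up house data) that fits a trend line using the closed-form least-squares formulas for simple linear regression and then predicts a new, unseen point:

```python
# Made-up data: house sizes (x) and prices (y).
x = [50, 80, 120, 160, 200]
y = [150, 220, 330, 410, 520]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Closed-form least-squares slope and intercept:
# slope = sum((x_i - mean_x) * (y_i - mean_y)) / sum((x_i - mean_x)^2)
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) \
        / sum((xi - mean_x) ** 2 for xi in x)
intercept = mean_y - slope * mean_x

# Predictive use: estimate the price of an unseen 100 m^2 house.
predicted = intercept + slope * 100

# Explanatory use: the slope tells us how price changes per extra m^2.
print(round(slope, 3), round(intercept, 3), round(predicted, 1))
```

Notice how the same fitted line serves both roles from the list above: the slope *explains* the relationship, while `intercept + slope * x` *predicts* new values.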

Another Perspective: Best Fit Line#

Another way to think about regression is through the idea of a best fit line.

Illustration of How Linear Regression Models Relationships Using a Straight Line. Source: Dede, Medium.com

Here, the goal is to draw a line that minimizes the overall error between actual data points and predicted values. This line may not pass through every point, but it balances all of them in the best possible way.
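The "error" for a single point is its residual: the actual value minus the value the line predicts. A minimal sketch with made-up numbers and a hypothetical candidate line (price = 2 × size + 50, chosen only for illustration):

```python
x = [50, 80, 120, 160, 200]
y = [150, 220, 330, 410, 520]

def predict(xi):
    # A hypothetical candidate line: price = 2 * size + 50
    return 2 * xi + 50

# Residual = actual value minus predicted value, one per data point.
residuals = [yi - predict(xi) for xi, yi in zip(x, y)]
print(residuals)  # [0, 10, 40, 40, 70]
```

A better line would make these residuals collectively smaller; it need not drive any single one of them to zero.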

Goal of Regression: Best-Fit Line#

The central goal of regression is to find a function that best represents the relationship between input variables (features) and the target variable. In linear regression, this function takes the form of a straight line.

But what does “best” actually mean?

The best-fit line is the line that most closely follows the overall pattern in the data. It does not try to pass through every data point. Instead, it balances all points by minimizing the overall prediction error.

A good regression line should:

  • Fit the data well without overreacting to noise

  • Minimize prediction errors across all data points

  • Capture the underlying relationship between variables

This is an important shift in thinking:

We are not trying to eliminate error completely; we are trying to minimize it in a principled way.
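Minimizing error "in a principled way" usually means collapsing all the individual errors into a single summary number, such as the mean squared error (MSE), and then choosing the line that makes that number smallest. A minimal sketch with made-up predictions:

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

actual = [150, 220, 330, 410, 520]
line_a = [140, 230, 320, 420, 510]   # predictions close to the data
line_b = [100, 150, 250, 300, 400]   # predictions far from the data

print(mse(actual, line_a))  # 100.0
print(mse(actual, line_b))  # 8060.0 -> much worse fit
```

Squaring the residuals penalizes large misses much more heavily than small ones, which is one reason MSE is such a common choice of error measure.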

Visual Intuition#

If you imagine multiple possible lines passing through the same dataset:

  • Some lines will be too steep

  • Some will be too flat

  • Some will miss most points


Left: Best-fit regression line capturing the overall trend. Right: Residuals showing the difference between actual and predicted values.

The best-fit line is the one that balances all deviations (errors) in the most optimal way.

Comparing Different Fits#

Not all lines fit the data equally well. Some lines capture the trend better, while others produce large errors.

Different lines produce different total errors. The best-fit line is the one that minimizes the overall residual error.

From the diagram above, notice that each line produces a different total error:

  • Blue line → smallest error → best fit

  • Green line → moderate error

  • Red line → very large error → poor fit
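The comparison above can be sketched numerically. The data and the three candidate lines below are made up to mirror the blue/green/red idea; the slopes and intercepts are illustrative, not fitted values from any real figure:

```python
x = [50, 80, 120, 160, 200]
y = [150, 220, 330, 410, 520]

def total_squared_error(slope, intercept):
    # Sum of squared residuals for the line y = slope * x + intercept.
    return sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))

# Hypothetical candidate lines, analogous to blue / green / red.
candidates = {
    "blue (good fit)":  (2.45, 27.0),
    "green (moderate)": (2.0, 50.0),
    "red (poor fit)":   (0.5, 200.0),
}

# Rank the lines from smallest to largest total error.
for name, (m, b) in sorted(candidates.items(),
                           key=lambda kv: total_squared_error(*kv[1])):
    print(f"{name}: error = {total_squared_error(m, b):.0f}")
```

Running this ranks blue first, green second, and red last, matching the intuition in the list above.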

This motivates the need for a formal way to measure error: a cost function. It leads us directly to the concepts of residuals, the loss function, and the cost function (MSE), which we will cover later in this chapter.

Why Regression Matters#

Regression appears everywhere, often in ways we do not immediately notice:

  • In economics, to understand how income affects spending

  • In healthcare, to study how dosage impacts recovery

  • In business, to forecast future demand

  • In social sciences, to analyze relationships between variables

What makes regression powerful is that it provides quantitative insight. It does not just say “these are related” — it tells us how strongly they are related and in what direction.