Example: Recommender Systems — Complete Walkthrough#
Content-Based · User-Based CF · Item-Based CF#
What you’ll build: Three recommender systems from scratch, with every calculation shown as a printed table you can trace step by step.
Dataset: 5 users (Alice, Bob, Carol, Dave, Eve) rate 5 movies on a scale of 1–5.
A ? means they haven’t seen it yet; that’s what we want to predict.
| User | Inception | Interstellar | The Notebook | Alien | Titanic |
|---|---|---|---|---|---|
| Alice | 5 | 4 | 2 | ? | ? |
| Bob | 5 | 4 | 1 | 5 | 2 |
| Carol | 1 | 2 | 5 | 2 | 5 |
| Dave | 4 | 5 | 2 | 4 | 1 |
| Eve | 2 | 1 | 5 | 3 | 4 |
1. Content-Based Recommender Systems#
Step 1: The Rating Matrix (Alice’s History)
| User | Inception | Interstellar | The Notebook | Alien | Titanic |
|---|---|---|---|---|---|
| Alice | 5 | 4 | 2 | ? | ? |
| Bob | 5 | 4 | 1 | 5 | 2 |
| Carol | 1 | 2 | 5 | 2 | 5 |
Notice that Alice rated Inception (5) and Interstellar (4) highly (both Sci-fi), and rated The Notebook (2), a Romance, low. We need to predict her ratings for Alien and Titanic.
Step 2: Build Alice’s Genre Preference Profile (User Profile)
Average rating per genre across movies she has seen:
| Genre | Movies seen | Avg rating |
|---|---|---|
| Sci-fi | Inception (5), Interstellar (4) | (5 + 4) / 2 = 4.5 |
| Action | Inception (5) | 5 / 1 = 5.0 |
| Romance | The Notebook (2) | 2 / 1 = 2.0 |
Detailed Calculation:
Sci-fi: Alice watched Inception and Interstellar → (5 + 4) / 2 = 4.5
Action: Alice watched Inception → 5 / 1 = 5.0
Romance: Alice watched The Notebook → 2 / 1 = 2.0
Alice’s profile: likes Sci-fi (4.5) and Action (5.0); dislikes Romance (2.0).
Step 3: Score Unseen Movies
General Formula: the predicted score for a movie is the average of Alice’s genre preferences over that movie’s genres:
Score(movie) = Σ pref(g) over the movie’s genres g / (number of genres)
| Movie | Genres | Predicted score | Calculation | Recommend? |
|---|---|---|---|---|
| Alien | Action, Sci-fi | 4.75 | (5.0 + 4.5) / 2 | Yes |
| Titanic | Romance | 2.0 | 2.0 / 1 | No |
Step-by-Step Calculation:
Alien (2 genres: Action + Sci-fi): Score(Alien) = (5.0 + 4.5) / 2 = 4.75
Titanic (1 genre: Romance): Score(Titanic) = 2.0 / 1 = 2.0
Final Recommendation: Recommend “Alien” because its genres closely match Alice’s demonstrated preferences.
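The two steps above can be sketched in a few lines of Python. This is a minimal sketch, not a library implementation; the genre labels are taken from the tables in this section, and all variable names are illustrative.

```python
# Genre labels from the walkthrough tables above.
genres = {
    "Inception": ["Sci-fi", "Action"],
    "Interstellar": ["Sci-fi"],
    "The Notebook": ["Romance"],
    "Alien": ["Action", "Sci-fi"],
    "Titanic": ["Romance"],
}
alice = {"Inception": 5, "Interstellar": 4, "The Notebook": 2}

# Step 2: average rating per genre over the movies Alice has seen.
profile = {}
for genre in {g for gs in genres.values() for g in gs}:
    seen = [r for movie, r in alice.items() if genre in genres[movie]]
    if seen:
        profile[genre] = sum(seen) / len(seen)

# Step 3: score an unseen movie as the mean of its genre preferences.
def score(movie):
    prefs = [profile[g] for g in genres[movie]]
    return sum(prefs) / len(prefs)

print(score("Alien"))    # 4.75
print(score("Titanic"))  # 2.0
```

Running this reproduces the table: the profile comes out as Sci-fi 4.5, Action 5.0, Romance 2.0, and Alien scores 4.75 while Titanic scores 2.0.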
2. User-Based Collaborative Filtering Recommender Systems#
Step 1: User-Item Rating Matrix (? = unseen)
| User/Item (Movie) | Inception | Interstellar | The Notebook | Alien | Titanic |
|---|---|---|---|---|---|
| Alice | 5 | 4 | 2 | ? | ? |
| Bob | 5 | 4 | 1 | 5 | 2 |
| Carol | 1 | 2 | 5 | 2 | 5 |
| Dave | 4 | 5 | 2 | 4 | 1 |
| Eve | 2 | 1 | 5 | 3 | 4 |
We use the 3 movies that Alice and other users have both rated (Inception, Interstellar, and The Notebook) to compute similarity. Then we use ratings from the most similar users to predict Alice’s missing ratings.
Step 2: Compute Similarity Between Alice and Each User
Similarity is computed over the co-rated items only (Inception, Interstellar, The Notebook), using either cosine similarity or Pearson correlation; we then select the top-k most similar users to predict Alice’s missing ratings.
Alice’s rating vector over the co-rated movies: Alice = [5, 4, 2]
Option A: Cosine Similarity#
Example: Alice vs Bob#
cos(Alice, Bob) = (5×5 + 4×4 + 2×1) / (√(5² + 4² + 2²) × √(5² + 4² + 1²)) = 43 / (√45 × √42) ≈ 0.99
Cosine Similarity Results#
| User | Their ratings | Alice’s ratings | Similarity |
|---|---|---|---|
| Bob | 5 / 4 / 1 | 5 / 4 / 2 | 0.99 (very similar) |
| Dave | 4 / 5 / 2 | 5 / 4 / 2 | 0.98 (similar) |
| Carol | 1 / 2 / 5 | 5 / 4 / 2 | 0.63 |
| Eve | 2 / 1 / 5 | 5 / 4 / 2 | 0.65 |
Bob and Dave are Alice’s nearest neighbors.
Option B: Pearson Correlation#
Alice’s mean:
Example: Alice vs Bob#
Bob’s mean:
Centered vectors:
Pearson Similarity Results#
| User | Ratings (Inception / Interstellar / Notebook) | Similarity |
|---|---|---|
| Bob | 5 / 4 / 1 | 0.99 (very similar) |
| Dave | 4 / 5 / 2 | 0.79 (similar) |
| Carol | 1 / 2 / 5 | -1.00 (very different) |
| Eve | 2 / 1 / 5 | -0.84 (very different) |
Key Insight:
Cosine similarity → measures angle (raw rating patterns)
Pearson correlation → measures similarity after removing user bias (mean-centered)
For recommendation systems, Pearson is often preferred because it handles users with different rating scales.
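Both similarity measures can be sketched directly from their definitions; Pearson is simply cosine applied to mean-centered vectors. This is a minimal sketch with illustrative names, computed over the three co-rated movies.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation = cosine of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

alice = [5, 4, 2]  # Inception, Interstellar, The Notebook
for name, v in [("Bob", [5, 4, 1]), ("Dave", [4, 5, 2]),
                ("Carol", [1, 2, 5]), ("Eve", [2, 1, 5])]:
    print(f"{name}: cosine={cosine(alice, v):.3f}  pearson={pearson(alice, v):.3f}")
```

The printed values round to the tables above (e.g., Bob’s Pearson comes out at 0.996, which the walkthrough rounds to 0.99, and Carol’s at -0.996, shown as -1.00).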
Step 3: Choose Top-k Neighbors
We use the top-k most similar users to predict Alice’s missing ratings. Since we use k = 2 here, we select the top 2 most similar users:
| Neighbor | Similarity |
|---|---|
| Bob | 0.99 |
| Dave | 0.98 |
Step 4: Predict Alice’s Missing Ratings
We use the weighted-average formula, summing over the top-k neighbors u:
Prediction(Alice, movie) = Σ sim(Alice, u) × rating(u, movie) / Σ sim(Alice, u)
| Unseen Movie | Similar Users Used | Prediction Formula | Score |
|---|---|---|---|
| Alien | Bob (sim 0.99, rated 5), Dave (sim 0.98, rated 4) | (0.99×5 + 0.98×4) / (0.99 + 0.98) | 4.50 |
| Titanic | Bob (sim 0.99, rated 2), Dave (sim 0.98, rated 1) | (0.99×2 + 0.98×1) / (0.99 + 0.98) | 1.50 |
Step-by-Step Calculation#
Alien: Predict Alice’s Rating for Alien#
Bob rated Alien = 5, Dave rated Alien = 4
Prediction = (0.99×5 + 0.98×4) / (0.99 + 0.98) = 8.87 / 1.97 ≈ 4.50
Titanic: Predict Alice’s Rating for Titanic#
Bob rated Titanic = 2, Dave rated Titanic = 1
Prediction = (0.99×2 + 0.98×1) / (0.99 + 0.98) = 2.96 / 1.97 ≈ 1.50
Final Recommendation#
| Movie | Predicted Rating | Recommend? |
|---|---|---|
| Alien | 4.50 | Yes |
| Titanic | 1.50 | No |
Recommendation: Recommend Alien (predicted 4.50), because Alice’s nearest neighbors, Bob and Dave, both rated Alien highly.
Remember: In user-based collaborative filtering, we rely entirely on similar users’ behavior; we do not use genre or content information.
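The prediction step above can be sketched as follows. This is a minimal sketch assuming the cosine similarities already computed (Bob 0.99, Dave 0.98); the dictionary names are illustrative.

```python
# Top-k = 2 neighbors and their cosine similarities to Alice.
neighbors = {"Bob": 0.99, "Dave": 0.98}
ratings = {"Bob": {"Alien": 5, "Titanic": 2},
           "Dave": {"Alien": 4, "Titanic": 1}}

def predict(movie):
    # Weighted average of neighbors' ratings, weighted by similarity.
    num = sum(sim * ratings[u][movie] for u, sim in neighbors.items())
    den = sum(neighbors.values())
    return num / den

print(round(predict("Alien"), 2))    # 4.5
print(round(predict("Titanic"), 2))  # 1.5
```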
3. Item-Based Collaborative Filtering Recommender Systems#
Item-based CF flips the question: instead of asking “who has similar taste to Alice?”, it asks “which movies tend to get rated similarly by the same people?”
Idea: Instead of finding similar users, we find similar items (movies).
We predict Alice’s ratings by finding movies similar to those she already rated, and using her past ratings to estimate new ones.
If two movies are rated similarly by many users → they are similar
We predict a user’s rating based on movies they already liked
Step 1: Represent Movies as Vectors
Instead of comparing users, we compare how movies are rated across users.
| Movie (Item) / User | Alice | Bob | Carol | Dave | Eve |
|---|---|---|---|---|---|
| Inception | 5 | 5 | 1 | 4 | 2 |
| Interstellar | 4 | 4 | 2 | 5 | 1 |
| The Notebook | 2 | 1 | 5 | 2 | 5 |
| Alien | ? | 5 | 2 | 4 | 3 |
| Titanic | ? | 2 | 5 | 1 | 4 |
Here, we have transposed the matrix: rows are now movies, columns are users. Why? Instead of comparing users, we compare how movies were rated across all users.
Here, as you can see, each movie is represented using ratings from all users:
Inception = [5, 5, 1, 4, 2]
Interstellar = [4, 4, 2, 5, 1]
The Notebook = [2, 1, 5, 2, 5]
Alien = [?, 5, 2, 4, 3] (Here, Alice has not rated Alien) → ignore Alice → [5, 2, 4, 3]
Titanic = [?, 2, 5, 1, 4] → ignore Alice → [2, 5, 1, 4]
We ignore Alice’s missing values (?) when computing similarity.
Why Do We Ignore “?” (Missing Ratings) When Building These Vectors?#
When computing similarity between two movies, we must use only users who have rated both movies.
Missing values (“?”) mean the user has not seen or rated the movie, so we do not know their preference. Including them would introduce incorrect or undefined values in the calculation.
Key Idea: Similarity is computed only on co-rated users (users who rated both items). You can’t compare two movies based on a user who hasn’t watched one of them.
Example#
We want to compute similarity between Inception and Alien.
Original vectors:
Inception = [5, 5, 1, 4, 2]
Alien = [?, 5, 2, 4, 3]
Here, Alice has not rated Alien. Problem:
“?” is unknown → we cannot multiply it or compute a distance with it
Including it leads to an invalid similarity
So we remove Alice and use only:
Inception = [5, 1, 4, 2]
Alien = [5, 2, 4, 3]
(users: Bob, Carol, Dave, Eve)
Now both vectors have:
Same length
Only known values
Valid comparison
Summary#
Ignore missing values (“?”)
Use only overlapping users
Ensures fair and correct similarity computation
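The masking described above can be sketched in a couple of lines; here `None` plays the role of “?” in the table, and the variable names are illustrative.

```python
inception = [5, 5, 1, 4, 2]     # Alice, Bob, Carol, Dave, Eve
alien     = [None, 5, 2, 4, 3]  # Alice has not rated Alien

# Keep only positions where BOTH movies have a rating (co-rated users).
pairs = [(a, b) for a, b in zip(inception, alien)
         if a is not None and b is not None]
x = [a for a, _ in pairs]
y = [b for _, b in pairs]
print(x, y)  # [5, 1, 4, 2] [5, 2, 4, 3]
```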
Step 2: Compute Item-Item Similarity
So, we compute similarity between movies using only co-rated users (ignore “?”).
From Step 1, we have the movie vectors (example: Inception vs Alien).
Using users: Bob, Carol, Dave, Eve
Inception = [5, 1, 4, 2]
Alien = [5, 2, 4, 3]
So, we compare each movie’s rating vector using only users who rated both movies (Bob, Carol, Dave, Eve).
Similarity Table#
| Movie Pair | Raters | Cosine Similarity | Pearson Similarity | Interpretation |
|---|---|---|---|---|
| Inception vs Alien | Bob (5,5), Carol (1,2), Dave (4,4), Eve (2,3) | 0.983 | 0.99 | Very similar audiences |
| Interstellar vs Alien | Bob (4,5), Carol (2,2), Dave (5,4), Eve (1,3) | 0.943 | 0.71 | Similar audiences |
| Inception vs Titanic | Bob (5,2), Carol (1,5), Dave (4,1), Eve (2,4) | 0.587 | -0.90 | Opposite preferences by Pearson |
| The Notebook vs Titanic | Bob (1,2), Carol (5,5), Dave (2,1), Eve (5,4) | 0.974 | 0.89 | Very similar audiences |
Step-by-Step Example Calculation (Inception vs Alien)#
Option A: Cosine Similarity#
cos = (5×5 + 1×2 + 4×4 + 2×3) / (√(25 + 1 + 16 + 4) × √(25 + 4 + 16 + 9)) = 49 / (√46 × √54) ≈ 0.98
Option B: Pearson Correlation#
Step 1: Compute Means: Inception mean = (5 + 1 + 4 + 2) / 4 = 3.0; Alien mean = (5 + 2 + 4 + 3) / 4 = 3.5
Step 2: Centered Vectors: Inception = [2, -2, 1, -1]; Alien = [1.5, -1.5, 0.5, -0.5]
Step 3: Compute Pearson = (2×1.5 + (-2)×(-1.5) + 1×0.5 + (-1)×(-0.5)) / (√10 × √5) = 7 / 7.07 ≈ 0.99
Method Comparison#
| Method | Similarity |
|---|---|
| Cosine | 0.98 |
| Pearson | 0.99 |
Key Insight#
Cosine similarity → compares raw rating patterns
Pearson correlation → compares rating patterns after removing bias (mean-centered)
Pearson is often better when users have different rating scales.
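The Inception-vs-Alien row of the similarity table can be reproduced with the same two functions as before, now applied to item vectors restricted to the co-rated users (Bob, Carol, Dave, Eve). A minimal sketch with illustrative names:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

# Ratings by Bob, Carol, Dave, Eve (Alice's "?" already removed).
inception = [5, 1, 4, 2]
alien = [5, 2, 4, 3]

print(round(cosine(inception, alien), 3))   # 0.983
print(round(pearson(inception, alien), 2))  # 0.99
```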
Step 3: Predict Alice’s Ratings (Item-Based)
We use movies Alice has already rated as anchors:
Inception (5)
Interstellar (4)
The Notebook (2)
Top-N Selection (i.e., N = 2)#
For each target movie, we select the top-2 most similar movies among the movies Alice has already rated. Here, we use N = 2.
Using cosine similarity, the selected neighbors are:
For Alien → Inception (0.98), Interstellar (0.94)
For Titanic → The Notebook (0.97), Inception (0.59)
(Neighbors are selected using cosine similarity for prediction. Pearson correlation can also be used; however, negative similarities should be ignored. An example is provided below.)
General Formula#
Prediction(Alice, m) = Σ sim(m, j) × rating(Alice, j) / Σ sim(m, j), summed over the top-N movies j most similar to m that Alice has rated.
Predict Alien#
Prediction(Alien) = (0.98×5 + 0.94×4) / (0.98 + 0.94) = 8.66 / 1.92 ≈ 4.51
Predict Titanic#
Prediction(Titanic) = (0.97×2 + 0.59×5) / (0.97 + 0.59) = 4.89 / 1.56 ≈ 3.13
Final Results#
| Movie | Predicted Score | Recommend? |
|---|---|---|
| Alien | 4.51 | Yes |
| Titanic | 3.13 | Maybe / weaker |
(OPTION B) Using Pearson Prediction by Ignoring Negative Similarities#
When using Pearson correlation, negative similarity means opposite preference. So for prediction, we ignore negative similarities and use only positive similarities.
Predict Alien using Pearson#
Positive similarities:
sim(Alien, Inception) = 0.99
sim(Alien, Interstellar) = 0.71
Prediction(Alien) = (0.99×5 + 0.71×4) / (0.99 + 0.71) = 7.79 / 1.70 ≈ 4.58
Predict Titanic using Pearson#
Positive similarity:
sim(Titanic, The Notebook) = 0.89
sim(Titanic, Inception) = -0.90 → ignored
Prediction(Titanic) = (0.89×2) / 0.89 = 2.00
Pearson Results After Ignoring Negative Similarities#
| Movie | Pearson Score | Recommend? |
|---|---|---|
| Alien | 4.58 | Yes |
| Titanic | 2.00 | No |
Interpretation#
Using Pearson, Titanic receives a low predicted score because its strongest positive match is The Notebook, which Alice rated low.
Final Recommendation#
Recommendation: Recommend Alien (predicted 4.51) because it is highly similar to movies Alice already rated highly.
Intuition: If Alice liked movies similar to Alien, she will likely like Alien as well.
Remember: Item-based uses movie similarity, not users. Since movies don’t change, their similarity stays consistent. This allows us to compute similarities once and reuse them, making the system efficient.
We use Top-N similar items (N = 2)
Predictions are based on weighted averages of similar movies
Item-based filtering leverages movie similarity, not user similarity
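The item-based prediction step can be sketched the same way as the user-based one, with movie similarities as weights. A minimal sketch assuming the cosine similarities from the table above; the dictionary names are illustrative.

```python
alice = {"Inception": 5, "Interstellar": 4, "The Notebook": 2}

# Top-N = 2 most similar rated movies per target (cosine similarities).
neighbors = {
    "Alien":   {"Inception": 0.98, "Interstellar": 0.94},
    "Titanic": {"The Notebook": 0.97, "Inception": 0.59},
}

def predict(movie):
    # Weighted average of Alice's ratings for the similar movies.
    sims = neighbors[movie]
    num = sum(sim * alice[m] for m, sim in sims.items())
    return num / sum(sims.values())

print(round(predict("Alien"), 2))  # 4.51
print(round(predict("Titanic"), 2))
```

Titanic comes out at about 3.13; the exact second decimal depends on how far the similarity values were rounded before averaging.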
Here’s the core intuition from all three methods: all three predicted that Alice should watch Alien, but for completely different reasons.
Content-based said “Alien is Sci-fi, and Alice likes Sci-fi.”
User-based said “Bob and Dave love Alien, and they think just like Alice.”
Item-based said “Alien gets rated by the same people who rated Inception highly.”
The practical reason real systems like Amazon and Netflix prefer item-based CF is scalability. With 100 million users, you’d need to compare every pair of users (on the order of 5 quadrillion pairs). But movies are far fewer, their similarity scores are stable, and you can pre-compute them once overnight and reuse them for every user lookup in milliseconds.