Example: Recommender Systems — Complete Walkthrough#

Content-Based · User-Based CF · Item-Based CF#

What you’ll build: Three recommender systems from scratch, with every calculation shown as a printed table you can trace step by step.


Dataset: 5 users (Alice, Bob, Carol, Dave, Eve) rate 5 movies on a scale of 1–5.
A ? means they haven’t seen it yet; that’s what we want to predict.

| User  | Inception | Interstellar | The Notebook | Alien | Titanic |
|-------|-----------|--------------|--------------|-------|---------|
| Alice | 5         | 4            | 2            | ?     | ?       |
| Bob   | 5         | 4            | 1            | 5     | 2       |
| Carol | 1         | 2            | 5            | 2     | 5       |
| Dave  | 4         | 5            | 2            | 4     | 1       |
| Eve   | 2         | 1            | 5            | 3     | 4       |

1. Content-Based Recommender Systems#

Step 1: The Rating Matrix (Alice’s History)

| User  | Inception (Action, Sci-fi) | Interstellar (Sci-fi) | The Notebook (Romance) | Alien (Action, Sci-fi) | Titanic (Romance) |
|-------|----------------------------|-----------------------|------------------------|------------------------|-------------------|
| Alice | 5                          | 4                     | 2                      | ?                      | ?                 |
| Bob   | 5                          | 4                     | 1                      | 5                      | 2                 |
| Carol | 1                          | 2                     | 5                      | 2                      | 5                 |

Notice that Alice rated the Sci-fi movies Inception (5) and Interstellar (4) highly, and rated the Romance movie The Notebook (2) low. We need to predict her ratings for Alien and Titanic.

Step 2: Build Alice’s Genre Preference Profile (User Profile)

Average rating per genre across movies she has seen:

| Genre   | Movies seen                  | Avg rating        |
|---------|------------------------------|-------------------|
| Sci-fi  | Inception (5), Interstellar (4) | (5 + 4) / 2 = 4.5 |
| Action  | Inception (5)                | 5 / 1 = 5.0       |
| Romance | The Notebook (2)             | 2 / 1 = 2.0       |

Detailed Calculation:

  • Sci-fi: Alice watched Inception and Interstellar → (5 + 4) / 2 = 4.5

  • Action: Alice watched Inception → 5 / 1 = 5.0

  • Romance: Alice watched The Notebook → 2 / 1 = 2.0

Alice’s profile: likes Sci-fi (4.5) and Action (5.0); dislikes Romance (2.0).

Step 3: Score Unseen Movies

General Formula: Predicted score for a movie:

\[ \text{Score(movie)} = \frac{\sum (\text{user's average rating for each genre in the movie})}{\text{number of genres in the movie}} \]

| Movie   | Genres         | Predicted score | Calculation     | Recommend? |
|---------|----------------|-----------------|-----------------|------------|
| Alien   | Action, Sci-fi | 4.75            | (5.0 + 4.5) / 2 | Yes        |
| Titanic | Romance        | 2.0             | 2.0 / 1         | No         |

Step-by-Step Calculation:

  • Alien (2 genres: Action, Sci-fi): Score(Alien) = (5.0 + 4.5) / 2 = 4.75

  • Titanic (1 genre: Romance): Score(Titanic) = 2.0 / 1 = 2.0

Final Recommendation: Recommend “Alien” because its genres closely match Alice’s demonstrated preferences.
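The three steps above can be reproduced in a short, self-contained Python sketch (the variable and function names here are our own, not part of any library):

```python
# Content-based scoring: build Alice's genre profile from her ratings,
# then score unseen movies by averaging her per-genre preferences.
genres = {
    "Inception": ["Action", "Sci-fi"],
    "Interstellar": ["Sci-fi"],
    "The Notebook": ["Romance"],
    "Alien": ["Action", "Sci-fi"],
    "Titanic": ["Romance"],
}
alice_ratings = {"Inception": 5, "Interstellar": 4, "The Notebook": 2}

# Step 2: average Alice's rating per genre over the movies she has seen
per_genre = {}
for movie, rating in alice_ratings.items():
    for g in genres[movie]:
        per_genre.setdefault(g, []).append(rating)
profile = {g: sum(rs) / len(rs) for g, rs in per_genre.items()}

# Step 3: score an unseen movie as the mean of its genres' profile values
def score(movie):
    return sum(profile[g] for g in genres[movie]) / len(genres[movie])

print(profile)                           # {'Action': 5.0, 'Sci-fi': 4.5, 'Romance': 2.0}
print(score("Alien"), score("Titanic"))  # 4.75 2.0
```

Running this reproduces the hand calculation exactly: Alien scores 4.75 and Titanic scores 2.0.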

2. User-Based Collaborative Filtering Recommender Systems#

Step 1: User-Item Rating Matrix where (? = unseen)

| User / Item (Movie) | Inception | Interstellar | The Notebook | Alien | Titanic |
|---------------------|-----------|--------------|--------------|-------|---------|
| Alice               | 5         | 4            | 2            | ?     | ?       |
| Bob                 | 5         | 4            | 1            | 5     | 2       |
| Carol               | 1         | 2            | 5            | 2     | 5       |
| Dave                | 4         | 5            | 2            | 4     | 1       |
| Eve                 | 2         | 1            | 5            | 3     | 4       |

We use the 3 movies that Alice and other users have both rated (Inception, Interstellar, and The Notebook) to compute similarity. Then we use ratings from the most similar users to predict Alice’s missing ratings.

Step 2: Compute Similarity Between Alice and Each User

We compute similarity using only the movies that both Alice and other users have rated (overlapping / co-rated items): Inception, Interstellar, The Notebook.

  • Similarity is computed using cosine similarity or Pearson correlation

  • Then, we select the top-k most similar users to predict Alice’s missing ratings.

Alice’s rating vector:

\[ Alice = [5, 4, 2] \]

Option A: Cosine Similarity#

\[ \text{similarity}(A,B) = \frac{A \cdot B}{||A|| \times ||B||} \]

Example: Alice vs Bob#

\[ \frac{(5×5 + 4×4 + 2×1)}{\sqrt{5^2+4^2+2^2} \cdot \sqrt{5^2+4^2+1^2}} = \frac{43}{\sqrt{45} \cdot \sqrt{42}} \approx 0.989 \]

Cosine Similarity Results:#

| User  | Their ratings (Inception / Interstellar / Notebook) | Alice’s ratings | Similarity          |
|-------|-----------------------------------------------------|-----------------|---------------------|
| Bob   | 5 / 4 / 1                                           | 5 / 4 / 2       | 0.99 (very similar) |
| Dave  | 4 / 5 / 2                                           | 5 / 4 / 2       | 0.98 (similar)      |
| Carol | 1 / 2 / 5                                           | 5 / 4 / 2       | 0.63                |
| Eve   | 2 / 1 / 5                                           | 5 / 4 / 2       | 0.65                |

Bob and Dave are Alice’s nearest neighbors.
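The cosine computations above can be checked with a few lines of plain Python (a sketch; the `cosine` helper is our own, not a library call):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

# Vectors over the co-rated movies only: Inception, Interstellar, The Notebook
alice = [5, 4, 2]
others = {"Bob": [5, 4, 1], "Dave": [4, 5, 2], "Carol": [1, 2, 5], "Eve": [2, 1, 5]}
sims = {name: round(cosine(alice, v), 2) for name, v in others.items()}
print(sims)  # {'Bob': 0.99, 'Dave': 0.98, 'Carol': 0.63, 'Eve': 0.65}
```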

Option B: Pearson Correlation#

\[ \text{Pearson}(A,B) = \frac{\sum (A_i - \bar{A})(B_i - \bar{B})} {\sqrt{\sum (A_i - \bar{A})^2} \sqrt{\sum (B_i - \bar{B})^2}} \]

Alice’s mean:

\[ \bar{A} = \frac{5 + 4 + 2}{3} = 3.67 \]

Example: Alice vs Bob#

Bob’s mean:

\[ \bar{B} = \frac{5 + 4 + 1}{3} = 3.33 \]

Centered vectors:

\[ Alice = [1.33, 0.33, -1.67] \]
\[ Bob = [1.67, 0.67, -2.33] \]
\[ \text{Pearson}(Alice, Bob) \approx 0.996 \]

Pearson Similarity Results#

| User  | Ratings (I / IS / N) | Similarity            |
|-------|----------------------|-----------------------|
| Bob   | 5 / 4 / 1            | 1.00 (very similar)   |
| Dave  | 4 / 5 / 2            | 0.79 (similar)        |
| Carol | 1 / 2 / 5            | -1.00 (very different) |
| Eve   | 2 / 1 / 5            | -0.84 (very different) |


Key Insight:

  • Cosine similarity → measures angle (raw rating patterns)

  • Pearson correlation → measures similarity after removing user bias (mean-centered)

For recommendation systems, Pearson is often preferred because it handles users with different rating scales.
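Pearson correlation is just cosine similarity applied to mean-centered vectors, which the sketch below makes explicit (our own helper; note that exact values may differ in the last digit from hand-rounded intermediate steps, e.g. Pearson(Alice, Bob) is about 0.996):

```python
import math

def pearson(a, b):
    """Pearson correlation = cosine similarity of the mean-centered vectors."""
    ca = [x - sum(a) / len(a) for x in a]
    cb = [y - sum(b) / len(b) for y in b]
    dot = sum(x * y for x, y in zip(ca, cb))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(ca) * norm(cb))

alice = [5, 4, 2]
others = {"Bob": [5, 4, 1], "Dave": [4, 5, 2], "Carol": [1, 2, 5], "Eve": [2, 1, 5]}
for name, v in others.items():
    print(name, round(pearson(alice, v), 2))  # Bob 1.0, Dave 0.79, Carol -1.0, Eve -0.84
```

Notice how mean-centering flips Carol and Eve from mildly positive cosine values (0.63, 0.65) to strongly negative correlations: once each user's average is removed, their tastes are clearly opposite to Alice's.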


Step 3: Choose Top-k Neighbors

We compute similarity using the movies Alice and the other users have both rated. Then we use the top-k most similar users to predict Alice’s missing ratings.

Since we use:

\[ k = 2 \]

we select the top 2 most similar users:

\[ \text{Nearest neighbors} = \{Bob, Dave\} \]

| Neighbor | Similarity (cosine) |
|----------|---------------------|
| Bob      | 0.99                |
| Dave     | 0.98                |


Step 4: Predict Alice’s Missing Ratings

We use the weighted average formula:

\[ \text{Predicted rating} = \frac{\sum(\text{similarity} \times \text{neighbor rating})}{\sum(\text{similarity})} \]

| Unseen Movie | Similar Users Used                            | Prediction Formula                  | Score |
|--------------|-----------------------------------------------|-------------------------------------|-------|
| Alien        | Bob (sim 0.99, rated 5), Dave (sim 0.98, rated 4) | (0.99×5 + 0.98×4) / (0.99 + 0.98) | 4.50  |
| Titanic      | Bob (sim 0.99, rated 2), Dave (sim 0.98, rated 1) | (0.99×2 + 0.98×1) / (0.99 + 0.98) | 1.50  |


Step-by-Step Calculation#

Alien: Predict Alice’s Rating for Alien#

Bob rated Alien = 5
Dave rated Alien = 4

\[ \text{Predicted rating for Alien} = \frac{(0.99 \times 5) + (0.98 \times 4)}{0.99 + 0.98} = \frac{4.95 + 3.92}{1.97} = 4.50 \]

Titanic: Predict Alice’s Rating for Titanic#

Bob rated Titanic = 2
Dave rated Titanic = 1

\[ \text{Predicted rating for Titanic} = \frac{(0.99 \times 2) + (0.98 \times 1)}{0.99 + 0.98} = \frac{1.98 + 0.98}{1.97} = 1.50 \]
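The weighted-average formula is a one-liner in Python; this sketch (our own `predict` helper) reproduces both predictions from the similarities and ratings above:

```python
def predict(neighbors):
    """Similarity-weighted average of the neighbors' ratings."""
    return sum(s * r for s, r in neighbors) / sum(s for s, _ in neighbors)

# (similarity, rating) pairs for Bob and Dave, from the tables above
alien = predict([(0.99, 5), (0.98, 4)])
titanic = predict([(0.99, 2), (0.98, 1)])
print(round(alien, 2), round(titanic, 2))  # 4.5 1.5
```

Dividing by the sum of similarities keeps the prediction on the original 1–5 rating scale.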

Final Recommendation#

| Movie   | Predicted Rating | Recommend? |
|---------|------------------|------------|
| Alien   | 4.50             | Yes        |
| Titanic | 1.50             | No         |

Recommendation: Recommend Alien (predicted 4.50), because Alice’s nearest neighbors, Bob and Dave, both rated Alien highly.

Remember: In user-based collaborative filtering, we rely entirely on similar users’ behavior; we do not use genre or content information.

3. Item-Based Collaborative Filtering Recommender Systems#

Item-based CF flips the question: instead of asking “who has similar taste to Alice?”, it asks “which movies tend to get rated similarly by the same people?”

Idea: Instead of finding similar users, we find similar items (movies).

We predict Alice’s ratings by finding movies similar to those she already rated, and using her past ratings to estimate new ones.

  • If two movies are rated similarly by many users → they are similar

  • We predict a user’s rating based on movies they already liked

Step 1: Represent Movies as Vectors

Instead of comparing users, we compare how movies are rated across users.

| Movie (Item) / User | Alice | Bob | Carol | Dave | Eve |
|---------------------|-------|-----|-------|------|-----|
| Inception           | 5     | 5   | 1     | 4    | 2   |
| Interstellar        | 4     | 4   | 2     | 5    | 1   |
| The Notebook        | 2     | 1   | 5     | 2    | 5   |
| Alien               | ?     | 5   | 2     | 4    | 3   |
| Titanic             | ?     | 2   | 5     | 1    | 4   |

Here, we have transposed the matrix: rows are now movies, columns are users. Why? Instead of comparing users, we compare how movies were rated across all users.

Here, as you can see, each movie is represented using ratings from all users:

  • Inception = [5, 5, 1, 4, 2]

  • Interstellar = [4, 4, 2, 5, 1]

  • The Notebook = [2, 1, 5, 2, 5]

  • Alien = [?, 5, 2, 4, 3] (Here, Alice has not rated Alien) → ignore Alice → [5, 2, 4, 3]

  • Titanic = [?, 2, 5, 1, 4] → ignore Alice → [2, 5, 1, 4]

We ignore Alice’s missing values (?) when computing similarity.


Why Do We Ignore “?” (Missing Ratings) When Building These Vectors?#

When computing similarity between two movies, we must use only users who have rated both movies.

Missing values (“?”) mean the user has not seen or rated the movie, so we do not know their preference. Including them would introduce incorrect or undefined values in the calculation.

Key Idea: Similarity is computed only on co-rated users (users who rated both items). You can’t compare two movies based on a user who hasn’t watched one of them.

Example#

We want to compute similarity between Inception and Alien.

Original vectors:

  • Inception = [5, 5, 1, 4, 2]

  • Alien = [?, 5, 2, 4, 3]

Here, Alice has not rated Alien. Problem:

  • “?” is unknown → cannot multiply or compute distance

  • Leads to invalid similarity

So, we remove Alice’s entry and use only:

  • Inception = [5, 1, 4, 2]

  • Alien = [5, 2, 4, 3]

(users: Bob, Carol, Dave, Eve)

Now both vectors have:

  • Same length

  • Only known values

  • Valid comparison

Summary#

  • Ignore missing values (“?”)

  • Use only overlapping users

  • Ensures fair and correct similarity computation
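Dropping missing ratings to get co-rated vectors can be sketched as follows (the `corated` helper is our own; `None` stands in for “?”):

```python
def corated(u, v):
    """Keep only the positions where both movies have a known rating."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]

# Ratings in user order Alice, Bob, Carol, Dave, Eve; None marks a "?"
inception = [5, 5, 1, 4, 2]
alien = [None, 5, 2, 4, 3]
print(corated(inception, alien))  # ([5, 1, 4, 2], [5, 2, 4, 3])
```

The result matches the vectors used in the worked example: Alice’s position is dropped from both movies, leaving equal-length vectors over Bob, Carol, Dave, and Eve.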


Step 2: Compute Item-Item Similarity

So, we compute similarity between movies using only co-rated users (ignore “?”).

From Step 1, we have the co-rated movie vectors (example: Inception vs Alien).

Using users: Bob, Carol, Dave, Eve

  • Inception = [5, 1, 4, 2]

  • Alien = [5, 2, 4, 3]

So, we compare each movie’s rating vector using only users who rated both movies (Bob, Carol, Dave, Eve).


Similarity Table#

| Movie Pair              | Raters                                        | Cosine Similarity | Pearson Similarity | Interpretation                  |
|-------------------------|-----------------------------------------------|-------------------|--------------------|---------------------------------|
| Inception vs Alien      | Bob (5,5), Carol (1,2), Dave (4,4), Eve (2,3) | 0.983             | 0.99               | Very similar audiences          |
| Interstellar vs Alien   | Bob (4,5), Carol (2,2), Dave (5,4), Eve (1,3) | 0.943             | 0.71               | Similar audiences               |
| Inception vs Titanic    | Bob (5,2), Carol (1,5), Dave (4,1), Eve (2,4) | 0.587             | -0.90              | Opposite preferences by Pearson |
| The Notebook vs Titanic | Bob (1,2), Carol (5,5), Dave (2,1), Eve (5,4) | 0.974             | 0.89               | Very similar audiences          |


Step-by-Step Example Calculation (Inception vs Alien)#

Option A: Cosine Similarity#

\[ \text{Cosine}(i,j) = \frac{i \cdot j}{||i|| \times ||j||} \]

Calculation

\[ \frac{(5\times5 + 1\times2 + 4\times4 + 2\times3)} {\sqrt{5^2+1^2+4^2+2^2} \cdot \sqrt{5^2+2^2+4^2+3^2}} \]
\[ = \frac{25 + 2 + 16 + 6}{\sqrt{46} \cdot \sqrt{54}} = \frac{49}{\sqrt{46} \cdot \sqrt{54}} \approx 0.98 \]

Option B: Pearson Correlation#

\[ \text{Pearson}(i,j) = \frac{\sum (i_k - \bar{i})(j_k - \bar{j})} {\sqrt{\sum (i_k - \bar{i})^2} \cdot \sqrt{\sum (j_k - \bar{j})^2}} \]

Step 1: Compute Means

\[ \bar{i} = \frac{5 + 1 + 4 + 2}{4} = 3 \]
\[ \bar{j} = \frac{5 + 2 + 4 + 3}{4} = 3.5 \]

Step 2: Centered Vectors

\[ i - \bar{i} = [2, -2, 1, -1] \]
\[ j - \bar{j} = [1.5, -1.5, 0.5, -0.5] \]

Step 3: Compute Pearson

\[ \frac{(2\times1.5) + (-2\times-1.5) + (1\times0.5) + (-1\times-0.5)} {\sqrt{(2^2+(-2)^2+1^2+(-1)^2)} \cdot \sqrt{(1.5^2+(-1.5)^2+0.5^2+(-0.5)^2)}} \]
\[ = \frac{3 + 3 + 0.5 + 0.5}{\sqrt{10} \cdot \sqrt{5}} = \frac{7}{\sqrt{10} \cdot \sqrt{5}} \approx 0.99 \]

Method Comparison#

| Method  | Similarity |
|---------|------------|
| Cosine  | 0.98       |
| Pearson | 0.99       |

Key Insight#

  • Cosine similarity → compares raw rating patterns

  • Pearson correlation → compares rating patterns after removing bias (mean-centered)

Pearson is often better when users have different rating scales.
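The item-item numbers in the similarity table can be checked with the same two measures applied to the co-rated vectors (a sketch with our own helpers, not library calls):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = lambda v: math.sqrt(sum(x * x for x in v))
    return dot / (norm(a) * norm(b))

def pearson(a, b):
    """Pearson correlation = cosine of the mean-centered vectors."""
    ca = [x - sum(a) / len(a) for x in a]
    cb = [y - sum(b) / len(b) for y in b]
    return cosine(ca, cb)

# Co-rated vectors over Bob, Carol, Dave, Eve (Alice's "?" dropped)
inception, alien = [5, 1, 4, 2], [5, 2, 4, 3]
print(round(cosine(inception, alien), 3))   # 0.983
print(round(pearson(inception, alien), 2))  # 0.99
```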


Step 3: Predict Alice’s Ratings (Item-Based)

We use movies Alice has already rated as anchors:

  • Inception (5)

  • Interstellar (4)

  • The Notebook (2)

Top-N Selection (N = 2)#

For each target movie, we select the top-2 most similar movies among the movies Alice has already rated.

Here, we use:

\[ N = 2 \]

Using cosine similarity, the selected neighbors are:

  • For Alien → Inception (0.98), Interstellar (0.94)

  • For Titanic → The Notebook (0.97), Inception (0.59)

(Neighbors are selected using cosine similarity for prediction. Pearson correlation can also be used; however, negative similarities should be ignored. An example is provided below.)

General Formula#

\[ \text{Score}(i) = \frac{\sum (\text{sim}(i,j) \times r_{Alice,j})}{\sum \text{sim}(i,j)} \]

Predict Alien#

\[ \frac{(0.98 \times 5) + (0.94 \times 4)}{0.98 + 0.94} = \frac{4.90 + 3.76}{1.92} = \frac{8.66}{1.92} \approx 4.51 \]

Predict Titanic#

\[ \frac{(0.97 \times 2) + (0.59 \times 5)}{0.97 + 0.59} = \frac{1.94 + 2.95}{1.56} = \frac{4.89}{1.56} \approx 3.13 \]
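The same weighted-average helper from the user-based section works here; only the neighbors change from similar users to similar movies (a sketch; note that with these two-decimal similarities, the Titanic prediction rounds to 3.13):

```python
def predict(neighbors):
    """Similarity-weighted average of Alice's ratings of the neighbor movies."""
    return sum(s * r for s, r in neighbors) / sum(s for s, _ in neighbors)

# (cosine similarity, Alice's rating) for each target's top-2 neighbor movies
alien = predict([(0.98, 5), (0.94, 4)])    # Inception, Interstellar
titanic = predict([(0.97, 2), (0.59, 5)])  # The Notebook, Inception
print(round(alien, 2), round(titanic, 2))  # 4.51 3.13
```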

Final Results#

| Movie   | Predicted Score | Recommend?     |
|---------|-----------------|----------------|
| Alien   | 4.51            | Yes            |
| Titanic | 3.13            | Maybe / weaker |


Option B: Prediction Using Pearson, Ignoring Negative Similarities#

When using Pearson correlation, negative similarity means opposite preference. So for prediction, we ignore negative similarities and use only positive similarities.


Predict Alien using Pearson#

Positive similarities:

  • sim(Alien, Inception) = 0.99

  • sim(Alien, Interstellar) = 0.71

\[ \frac{(0.99 \times 5) + (0.71 \times 4)}{0.99 + 0.71} = \frac{4.95 + 2.84}{1.70} = \frac{7.79}{1.70} \approx 4.58 \]

Predict Titanic using Pearson#

Positive similarity:

  • sim(Titanic, The Notebook) = 0.89

  • sim(Titanic, Inception) = -0.90 → ignored

\[ \frac{0.89 \times 2}{0.89} = 2.00 \]
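Filtering out negative similarities before taking the weighted average can be sketched as follows (our own `predict_positive` helper):

```python
def predict_positive(neighbors):
    """Weighted average over neighbors with positive Pearson similarity only."""
    pos = [(s, r) for s, r in neighbors if s > 0]
    return sum(s * r for s, r in pos) / sum(s for s, _ in pos)

alien = predict_positive([(0.99, 5), (0.71, 4)])
titanic = predict_positive([(0.89, 2), (-0.90, 5)])  # the -0.90 pair is dropped
print(round(alien, 2), round(titanic, 2))  # 4.58 2.0
```

For Titanic, only The Notebook survives the filter, so the prediction collapses to Alice’s rating of The Notebook (2.0).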

Pearson Results After Ignoring Negative Similarities#

| Movie   | Pearson Score | Recommend? |
|---------|---------------|------------|
| Alien   | 4.58          | Yes        |
| Titanic | 2.00          | No         |


Interpretation#

Using Pearson, Titanic receives a low predicted score because its strongest positive match is The Notebook, which Alice rated low.


Final Recommendation#

Recommendation: Recommend Alien (predicted 4.51) because it is highly similar to movies Alice already rated highly.

Intuition: If Alice liked movies similar to Alien, she will likely like Alien as well.

Remember: Item-based uses movie similarity, not users. Since movies don’t change, their similarity stays consistent. This allows us to compute similarities once and reuse them, making the system efficient.

  • We use Top-N similar items (N = 2)

  • Predictions are based on weighted averages of similar movies

  • Item-based filtering leverages movie similarity, not user similarity

Here’s the core intuition from all three methods: all three predicted Alice should watch Alien, but for completely different reasons.

  • Content-based said “Alien is Sci-fi, and Alice likes Sci-fi.”

  • User-based said “Bob and Dave love Alien, and they think just like Alice.”

  • Item-based said “Alien gets rated by the same people who rated Inception highly.”

The practical reason why real systems like Amazon and Netflix prefer item-based CF is scalability. With 100 million users, you’d need to compare every pair of users (about 5 quadrillion comparisons). But movies are far fewer, their similarity scores are stable, and you can pre-compute them once overnight and reuse them for every user lookup in milliseconds.