Skimming through recommendation systems

I’m skimming through some background on recommendation systems - here is what I learned.

Source: Machine Learning System Design Interview by Ali Aminian and Alex Xu

What is the ML objective?

| Objective | Comment |
| --- | --- |
| Maximize # of clicks | Con: clickbait |
| Maximize time spent on each recommended item | Con: may be short based on the content of the recommended item |
| Maximize total # of relevant items, based on user interactions | Best: strongest signal, because it is based on user interactions |

Strengths of collaborative filtering (CF) vs. content-based filtering

| Feature | CF | Content-based |
| --- | --- | --- |
| Can handle new items (the "cold start problem") | No | Yes |
| Does not need item features (which can be expensive to compute) | Yes | No |
| Can handle niche interests (few users share the preference) | No | Yes |

Best: hybrid filtering:

  • Sequential vs. parallel: run CF first and feed its results to content-based filtering (or vice versa) = sequential, or run the two in parallel and merge the results
  • Most popular: sequential hybrid filtering (a minimal sketch follows below)
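
A minimal sketch of the sequential variant, with random scores standing in for trained CF and content-based models (all sizes and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_items = 1000

# Stand-ins for trained models: one relevance score per item from each.
cf_scores = rng.random(n_items)       # collaborative filtering scores
content_scores = rng.random(n_items)  # content-based scores

# Stage 1: CF narrows the full catalog down to k candidates.
k, n = 100, 10
candidates = np.argsort(cf_scores)[::-1][:k]

# Stage 2: content-based filtering re-ranks only those candidates.
order = np.argsort(content_scores[candidates])[::-1]
top_n = candidates[order][:n]
print(top_n)
```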


Embedding text features

  • Use a lightweight model like CBOW for short text
  • Use a heavier model like BERT for more advanced queries


Example feature → embedding mappings (sketched below):

  • Tags: CBOW
  • Search history: BERT embedding for each search, then average
  • Liked items: item ID → embedding learned during training → average
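
A rough sketch of how these per-feature embeddings could be built and combined into one user vector; the embedding tables are random stand-ins for CBOW / BERT / learned ID embeddings, and the dimension is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32  # embedding dimension (assumed)

# Random stand-ins; in practice: CBOW word vectors, BERT, trained ID embeddings.
word_emb = {w: rng.normal(size=d) for w in ["funny", "cats", "diy"]}
item_emb = rng.normal(size=(1000, d))  # item ID embeddings learned in training

def embed_tags(tags):
    # CBOW-style: average the word vectors of the tags.
    return np.mean([word_emb[t] for t in tags], axis=0)

def embed_search_history(query_embs):
    # One (e.g. BERT) embedding per past query, then average.
    return np.mean(query_embs, axis=0)

def embed_liked_items(item_ids):
    # Item ID -> embedding learned during training -> average.
    return item_emb[item_ids].mean(axis=0)

user_vec = np.concatenate([
    embed_tags(["funny", "cats"]),
    embed_search_history(rng.normal(size=(5, d))),  # 5 dummy query embeddings
    embed_liked_items([3, 42, 7]),
])
print(user_vec.shape)  # (96,)
```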

Approach #1: Matrix factorization

Based on the feedback matrix:

  • Users' opinions about items
  • Entries = positive (+1) or not observed (0)


Feedback matrix (users × items) → decompose into (users × embedding) × (embedding × items)

  • Initially, the decomposition is random
  • Train with a loss function

Loss function

Intuition about the loss function:

  1. Idea #1: MSE over <user, item> pairs which are observed (+1 entries). This is bad because it does not penalize errors in the unobserved pairs (0 entries).
  2. Idea #2: MSE over <user, item> pairs, both observed and unobserved. This is bad because it is dominated by the unobserved pairs, since the feedback matrix is sparse.
  3. Idea #3: **Best:** weighted sum over observed and unobserved pairs, with a lower weight on the unobserved entries (sketched below)
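
A small numpy sketch of Idea #3, assuming a binary feedback matrix and a weight w0 < 1 on the unobserved entries (the sizes and the w0 value are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 50, 80, 8
A = (rng.random((n_users, n_items)) < 0.05).astype(float)  # sparse +1/0 feedback

U = rng.normal(scale=0.1, size=(n_users, d))  # user embeddings, random init
V = rng.normal(scale=0.1, size=(n_items, d))  # item embeddings, random init

def weighted_mf_loss(A, U, V, w0=0.1):
    """Idea #3: squared error over all pairs; unobserved entries get weight w0 < 1."""
    pred = U @ V.T                 # predicted feedback matrix
    err2 = (A - pred) ** 2
    observed = A > 0
    return err2[observed].sum() + w0 * err2[~observed].sum()

print(weighted_mf_loss(A, U, V))
```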


Optimizing the loss

First approach: SGD

Better approach: WALS = weighted alternating least squares

  1. Fix one matrix → solve for the other
  2. Swap and repeat

Pro: converges faster than SGD
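
A compact WALS sketch under the same setup as the loss example above; each half-step solves a weighted least-squares problem in closed form (the regularization and w0 values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 50, 80, 8
A = (rng.random((n_users, n_items)) < 0.05).astype(float)
U = rng.normal(scale=0.1, size=(n_users, d))
V = rng.normal(scale=0.1, size=(n_items, d))
w0, reg = 0.1, 1e-2  # unobserved-entry weight and L2 regularization (assumed)

def solve_side(X, A, w0, reg, d):
    """With X held fixed, solve the weighted least-squares problem row by row."""
    rows = []
    for a in A:                              # one user's (or item's) feedback row
        w = np.where(a > 0, 1.0, w0)         # weight 1 if observed, else w0
        G = X.T @ (X * w[:, None]) + reg * np.eye(d)
        rows.append(np.linalg.solve(G, X.T @ (w * a)))
    return np.array(rows)

for _ in range(10):
    U = solve_side(V, A, w0, reg, d)    # fix V, update U
    V = solve_side(U, A.T, w0, reg, d)  # fix U, update V

scores = U @ V.T  # prediction: dot product of learned embeddings
```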


Prediction

Use the dot product of the learned user and item embeddings to score each <user, item> pair.

Limitations of Approach #1

Cold start problem: it can be difficult to serve new users.

E.g., since predictions are based purely on user interactions, new users do not have enough interactions yet.

Approach #2: two-tower neural network

Similarity = dot product of the two towers' output embeddings → train with cross-entropy loss.


Training data: construct positive and negative <user, item> pairs


Loss: cross-entropy over these pairs (a training sketch follows below)
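
A minimal two-tower sketch in PyTorch: each tower maps raw features to an embedding, the dot product of the two embeddings is the logit, and binary cross-entropy is applied to positive/negative pairs (all sizes and the random batch are placeholders):

```python
import torch
import torch.nn as nn

d_user, d_item, d_emb = 16, 24, 32  # feature/embedding sizes (assumed)

def make_tower(d_in, d_out):
    return nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, d_out))

user_tower = make_tower(d_user, d_emb)
item_tower = make_tower(d_item, d_emb)

# Dummy batch: positive pairs (label 1) and sampled negative pairs (label 0).
n = 256
user_x = torch.randn(n, d_user)
item_x = torch.randn(n, d_item)
labels = torch.randint(0, 2, (n,)).float()

params = list(user_tower.parameters()) + list(item_tower.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy on the dot-product logit

for _ in range(5):
    logits = (user_tower(user_x) * item_tower(item_x)).sum(dim=-1)  # dot product
    loss = loss_fn(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```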

Inference time

Finding the top items is a nearest neighbor (NN) search problem ⇒ use an approximate nearest neighbor (ANN) algorithm
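
At scale, libraries like faiss or ScaNN perform the approximate search; the snippet below shows only the exact brute-force top-k that ANN methods approximate:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_items, k = 32, 100_000, 10
item_emb = rng.normal(size=(n_items, d)).astype(np.float32)
user_emb = rng.normal(size=d).astype(np.float32)

# Exact nearest neighbor search by dot product: O(n_items * d) per query.
scores = item_emb @ user_emb
top_k = np.argpartition(scores, -k)[-k:]        # unordered top-k indices
top_k = top_k[np.argsort(scores[top_k])[::-1]]  # sort by score, descending
print(top_k)
```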


Limitations of Approach #2

Slower to train and slower at inference than Approach #1.


Offline evaluation metrics

  • Precision@k = (# of relevant items among the top k recommendations) / k
  • mAP / nDCG (computed in the sketch after this list):
    • mAP for binary relevance scores
    • nDCG otherwise
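
Small reference implementations of these metrics for a single ranked list (mAP is the mean of the per-user AP below; the nDCG here uses the linear-gain variant):

```python
import numpy as np

def precision_at_k(rels, k):
    """rels: binary relevance (1 = relevant) of the ranked recommendations."""
    return np.sum(rels[:k]) / k

def average_precision(rels):
    """AP for binary relevance; mAP = mean of AP across users/queries."""
    rels = np.asarray(rels)
    hits = np.where(rels == 1)[0]
    if len(hits) == 0:
        return 0.0
    return float(np.mean([rels[: i + 1].sum() / (i + 1) for i in hits]))

def ndcg_at_k(rels, k):
    """nDCG for graded relevance scores (linear-gain variant)."""
    rels = np.asarray(rels, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rels) + 2))
    ideal = np.sum(np.sort(rels)[::-1] * discounts)
    return float(np.sum(rels * discounts) / ideal) if ideal > 0 else 0.0

print(precision_at_k([1, 0, 1, 1, 0], k=5))  # 0.6
print(average_precision([1, 0, 1, 1, 0]))    # mean of 1/1, 2/3, 3/4 ≈ 0.81
print(ndcg_at_k([3, 2, 0, 1], k=4))
```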

Other evaluation metrics

  • Diversity = how dissimilar the recommended items are to each other → we want more diversity in recommendations! (one way to compute it is sketched below)
  • Click-through rate = # clicked / # recommended; but susceptible to clickbait
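
One common way to quantify diversity (an assumption, not necessarily the book's exact definition): average pairwise cosine dissimilarity of the recommended items' embeddings.

```python
import numpy as np

def diversity(item_embs):
    """Average pairwise cosine dissimilarity of a recommendation list."""
    X = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    sim = X @ X.T                       # pairwise cosine similarities
    mask = ~np.eye(len(X), dtype=bool)  # ignore self-similarity
    return 1.0 - sim[mask].mean()       # higher = more diverse

rng = np.random.default_rng(0)
print(diversity(rng.normal(size=(10, 32))))  # 10 recommended items, 32-dim
```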

Model architecture

  1. Generate candidates
    1. Use a two-tower network because it (1) can handle new users and (2) does not require item features
    2. Use multiple networks trained on different objectives: popular, trending, relevant to the user's location
  2. Scoring
    1. Use content-based filtering for accuracy
    2. Score each candidate item
    3. Output ranked list
  3. Re-ranking (business logic; a filter sketch follows this list)
    1. Privacy
    2. Bias
    3. Duplicates
    4. Misinformation
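
A toy sketch of the re-ranking stage: pure business rules applied after scoring. The seen/blocked sets are hypothetical stand-ins for upstream privacy and misinformation filters:

```python
def rerank(ranked_items, seen_ids, blocked_ids):
    """Apply business rules after scoring: drop duplicates and blocked items."""
    out, emitted = [], set()
    for item in ranked_items:
        if item in emitted:                          # duplicates
            continue
        if item in seen_ids or item in blocked_ids:  # privacy / misinformation
            continue
        emitted.add(item)
        out.append(item)
    return out

print(rerank([3, 1, 3, 7, 2], seen_ids={1}, blocked_ids={7}))  # [3, 2]
```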

Challenges & solutions

  • Serving speed → use multistage models (e.g. YouTube)
  • New users = cold start problem → use the two-tower network
  • New items → randomly include them in recommendations to collect click data (a simple heuristic is sketched below)
  • How to address seasonality (dependence on time of day and season)?
  • How to incorporate negative feedback, e.g. dislikes?
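
One simple exploration heuristic for the new-item bullet above (my illustration, not necessarily the book's method): occasionally swap a new item into the ranked list so it can accumulate click data. The rate eps is an assumption.

```python
import random

def inject_new_items(ranked, new_items, eps=0.1, seed=0):
    """With probability eps per slot, replace the item with a random new item
    so that new items can accumulate click data (simple exploration)."""
    rng = random.Random(seed)
    out = list(ranked)
    for i in range(len(out)):
        if new_items and rng.random() < eps:
            out[i] = rng.choice(new_items)
    return out

print(inject_new_items(list(range(10)), new_items=[101, 102, 103]))
```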


Source: Machine Learning System Design Interview by Ali Aminian and Alex Xu


Oliver K. Ernst
August 30, 2023