# Skimming through recommendation systems

I’m skimming through some background on recommendation systems - here is what I learned.

Source: *Machine Learning System Design Interview* by Ali Aminian and Alex Xu

## What is the ML objective?

| Objective | Comment |
| --- | --- |
| Maximize # clicks | Con: encourages clickbait |
| Maximize time spent on each recommended item | Con: time may be short simply because the recommended content itself is short |
| Maximize total # relevant items based on user interactions | Best: strongest signal, because it is grounded in user interactions |

## Strengths of these approaches

| Feature | CF | Content-based |
| --- | --- | --- |
| Can handle new items (no "cold start problem") | No | Yes |
| Does not need item features (which can be expensive to compute) | Yes | No |
| Can handle niche interests (few users share the preference) | No | Yes |

Best: hybrid filtering, which combines both:

• Sequential vs. parallel: either run one method first and feed its output to the other (CF → content-based, or vice versa) = sequential, or run the two in parallel and merge their results
• Most popular in practice: sequential hybrid filtering
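A minimal sketch of sequential hybrid filtering, assuming CF runs first to propose candidates and a content-based scorer reranks them. All scores and IDs below are made up for illustration:

```python
# Hypothetical sketch: stage 1 (CF) proposes candidates,
# stage 2 (content-based) reranks them.

def cf_candidates(user_id, cf_scores, k=3):
    """Stage 1: top-k items by collaborative-filtering score."""
    ranked = sorted(cf_scores[user_id], key=cf_scores[user_id].get, reverse=True)
    return ranked[:k]

def content_rerank(candidates, content_scores):
    """Stage 2: rerank CF candidates by content-based score."""
    return sorted(candidates, key=lambda item: content_scores[item], reverse=True)

# Toy scores (made up for illustration)
cf_scores = {"u1": {"a": 0.9, "b": 0.7, "c": 0.4, "d": 0.1}}
content_scores = {"a": 0.2, "b": 0.8, "c": 0.6, "d": 0.9}

cands = cf_candidates("u1", cf_scores, k=3)   # ["a", "b", "c"]
recs = content_rerank(cands, content_scores)  # ["b", "c", "a"]
```

Note item "d" never reaches stage 2, even though its content score is highest — that pruning is the point (and the risk) of the sequential design.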

## Encoding

• Use a lightweight model like CBOW for short text
• Use a heavier model like BERT for longer or more complex queries

Examples:

• Tags: CBOW
• Search history: BERT for each search, then average
• Liked items: Item ID → Embedding learned during training → average
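The "encode each item, then average" pattern above reduces to an element-wise mean over embedding vectors. A small sketch, with toy lists standing in for real encoder outputs:

```python
# Hedged sketch: averaging per-item embeddings into one history vector,
# as in "BERT for each search, then average". The vectors here are toy
# values; a real system would get them from an encoder.

def average_embeddings(vectors):
    """Element-wise mean of equal-length embedding vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]

search_embs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # stand-in "BERT" outputs
avg = average_embeddings(search_embs)  # [3.0, 4.0]
```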

## Approach #1: Matrix factorization

Based on a user-item feedback matrix:

• Entries = positive interaction (+1) or not observed (0)

### Decomposition

Matrix = (users × items) → decompose into (users × embedding dim) × (embedding dim × items)

• Initially: decomposition is random
• Train with loss function
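The setup can be sketched in a few lines: two randomly initialized embedding matrices whose dot product approximates the feedback matrix. Dimensions below are toy values:

```python
import random

# Sketch of the setup: a (users x items) feedback matrix is approximated
# by U (users x d) times V^T (d x items), both initialized randomly.
random.seed(0)
n_users, n_items, d = 4, 5, 2

U = [[random.random() for _ in range(d)] for _ in range(n_users)]
V = [[random.random() for _ in range(d)] for _ in range(n_items)]

def predict(u, i):
    """Predicted score = dot product of user and item embeddings."""
    return sum(U[u][k] * V[i][k] for k in range(d))
```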

### Loss function

1. Idea #1: MSE over <user, item> pairs that are observed (+1 entries). Bad because it does not penalize errors on the unobserved pairs (0 entries).
2. Idea #2: MSE over all <user, item> pairs, observed + unobserved. Bad because the loss is dominated by the unobserved pairs, since the feedback matrix is sparse.
3. Idea #3: **Best:** weighted MSE over observed + unobserved pairs, with the unobserved pairs down-weighted.
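Idea #3 can be sketched directly: full squared error on observed entries, down-weighted squared error on unobserved ones. The weight `w0` is a hypothetical hyperparameter:

```python
# Sketch of the weighted loss (idea #3): observed (+1) entries count
# fully, unobserved (0) entries are down-weighted by w0.

def weighted_mf_loss(A, pred, w0=0.1):
    loss = 0.0
    for u, row in enumerate(A):
        for i, a in enumerate(row):
            err2 = (a - pred[u][i]) ** 2
            loss += err2 if a == 1 else w0 * err2
    return loss

A = [[1, 0], [0, 1]]            # toy feedback matrix
pred = [[0.9, 0.2], [0.1, 0.8]]
# observed errors: 0.01 + 0.04 = 0.05
# unobserved errors: 0.1 * (0.04 + 0.01) = 0.005
```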

### Optimization

First approach: SGD

Better approach: WALS = weighted alternating least squares

1. Fix one matrix → optimize the other
2. Switch

Pro: converges faster than SGD
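The alternating pattern can be sketched for embedding dimension 1, where the per-row least-squares solve has a closed form. This is an unweighted simplification; real WALS also applies the entry weights from the loss above:

```python
# Hedged sketch of alternating least squares with embedding dim 1:
# fix V and solve for U row by row, then fix U and solve for V.

def als_step(A, U, V):
    for u in range(len(U)):
        num = sum(A[u][i] * V[i] for i in range(len(V)))
        den = sum(v * v for v in V)
        U[u] = num / den
    for i in range(len(V)):
        num = sum(A[u][i] * U[u] for u in range(len(U)))
        den = sum(u_k * u_k for u_k in U)
        V[i] = num / den

A = [[1, 0], [0, 1]]      # toy feedback matrix
U, V = [0.5, 0.5], [0.5, 0.5]
for _ in range(10):
    als_step(A, U, V)     # converges to U = [1, 1], V = [0.5, 0.5]
```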

### Inference

Score each item for a user with the dot product of the learned embeddings, then recommend the top-scoring items
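A sketch of inference, with toy vectors standing in for trained embeddings:

```python
# Sketch: score every item by the dot product of learned embeddings,
# recommend the top-k. Embeddings are toy stand-ins for trained ones.

def recommend(user_emb, item_embs, k=2):
    scores = {i: sum(a * b for a, b in zip(user_emb, emb))
              for i, emb in item_embs.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

user = [1.0, 0.0]
items = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.7, 0.3]}
recommend(user, items)  # ["a", "c"]
```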

### Limitations of Approach #1

Cold start problem: difficult for new users

E.g. if recommendations are based purely on user interactions ⇒ new users have too few interactions to learn a good embedding

## Approach #2: two tower NN

Similarity = dot product of the two towers' output embeddings → train with cross-entropy loss.

### Dataset

Construct labeled positive and negative <user, item> pairs

• Train with cross-entropy loss over these pairs
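A minimal sketch of the idea, assuming a trivial one-layer "tower" in place of a real network; all weights, features, and labels are toy values:

```python
import math

# Hedged sketch of the two-tower setup: each tower maps raw features to
# an embedding, similarity is the dot product, and training minimizes
# cross-entropy over positive/negative <user, item> pairs.

def tower(features, weights):
    """A one-layer stand-in for a real tower network."""
    return [sum(f * w for f, w in zip(features, col)) for col in weights]

def bce(pairs, user_w, item_w):
    """Mean cross-entropy over labeled <user, item> pairs."""
    loss = 0.0
    for user_f, item_f, label in pairs:
        score = sum(a * b for a, b in zip(tower(user_f, user_w),
                                          tower(item_f, item_w)))
        p = 1.0 / (1.0 + math.exp(-score))           # sigmoid of similarity
        loss -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return loss / len(pairs)

pairs = [([1.0], [1.0], 1),    # positive pair
         ([1.0], [-1.0], 0)]   # negative pair
bce(pairs, user_w=[[1.0]], item_w=[[1.0]])  # ≈ 0.313
```

In training, the tower weights would be updated by gradient descent to push this loss down; that step is omitted here.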

### Inference time

Finding the top items for a user is a nearest neighbor search problem ⇒ use an approximate nearest neighbor (ANN) algorithm

### Limitations

Slower to train and slower at inference than matrix factorization

## Evaluation

• Precision@k = # relevant items in the top k / k
• mAP / nDCG:
  • mAP for binary relevance scores
  • nDCG otherwise (graded relevance)
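A sketch of the binary-relevance metrics above: precision@k, and average precision (AP; mAP is AP averaged over users):

```python
# Sketch of ranking metrics over a binary-relevance ranked list.

def precision_at_k(relevant, ranked, k):
    """Fraction of the top-k recommendations that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(relevant, ranked):
    """Mean of precision@rank at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for idx, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / idx
    return total / len(relevant) if relevant else 0.0

relevant = {"a", "c"}
ranked = ["a", "b", "c", "d"]
precision_at_k(relevant, ranked, 2)  # 0.5
average_precision(relevant, ranked)  # (1/1 + 2/3) / 2 ≈ 0.833
```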

## Other evaluation metrics

• Diversity = how dissimilar the recommended items are to each other → want more diversity in recommendations!
• Click rate = # clicked / # recommended; but: susceptible to clickbait
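One way to make the diversity metric concrete (an assumption, not the book's definition) is the average pairwise cosine dissimilarity over the recommended items' embeddings:

```python
import math

# Hypothetical diversity score: average pairwise (1 - cosine similarity)
# over the recommended items' embedding vectors.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def diversity(embs):
    pairs = [(i, j) for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sum(1 - cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)

identical = [[1.0, 0.0], [1.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
diversity(identical)   # 0.0 (no diversity)
diversity(orthogonal)  # 1.0 (maximally dissimilar)
```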

## Model architecture

1. Generate candidates
   1. Use a two-tower network because it (1) can handle new users and (2) does not require item features
   2. Use multiple networks trained on different goals: popular, trending, relevant to location
2. Scoring
   1. Use content-based filtering for accuracy
   2. Score each candidate item
3. Output ranked list, handling:
   1. Privacy
   2. Bias
   3. Duplicates
   4. Misinformation

## Challenges & solutions

• Serving speed → use multistage models (e.g. YouTube)
• New users = cold start problem → use two towers
• New items → randomly include to collect click rate
• How to address seasonality (dependencies on times/seasons)?
• How to include negative feedback e.g. dislikes?


Oliver K. Ernst
August 30, 2023