A quick dive into search system metrics

A quick breakdown of some search system metrics.

MRR = Mean reciprocal rank

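For each query, take the rank of the first relevant item, then average the reciprocal of that rank over all queries:

\[MRR = \frac{1}{m} \sum_{i=1}^m \frac{1}{\text{rank}_i}\]

where \(m\) is the number of queries and \(\text{rank}_i\) is the rank of the first relevant item for query \(i\).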

Pro: easy to interpret
Con: only the first relevant result is considered; any relevant results after it are ignored.
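As a minimal sketch of the averaging above, assuming each query's results are given as a list of 0/1 relevance flags in ranked order. The function name `mean_reciprocal_rank` is made up for illustration, and scoring a query with no relevant result as 0 is a common convention rather than something stated above.

```python
def mean_reciprocal_rank(relevance_lists):
    """relevance_lists: one list of 0/1 flags per query, in ranked order."""
    total = 0.0
    for flags in relevance_lists:
        # Find the 1-based rank of the first relevant item.
        # Assumption: a query with no relevant item contributes 0.
        for rank, rel in enumerate(flags, start=1):
            if rel:
                total += 1.0 / rank
                break
    return total / len(relevance_lists)


queries = [
    [0, 1, 0],  # first relevant item at rank 2 -> 1/2
    [1, 0, 0],  # first relevant item at rank 1 -> 1
    [0, 0, 0],  # no relevant item -> 0
]
print(mean_reciprocal_rank(queries))  # (1/2 + 1 + 0) / 3 = 0.5
```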

Recall@k

\[\frac{\text{\# of relevant items among top-k items in output list}}{\text{total relevant items}}\]

Con: a query can have very many relevant items, which makes the denominator large, so recall@k stays small even when the top-k results are all relevant.
Con: does not consider ranking of results within k.
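A small sketch of recall@k, assuming results and relevance judgments are given as item IDs. The name `recall_at_k` and the sample IDs are hypothetical, and returning 0 when a query has no relevant items is an assumption made to avoid dividing by zero.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # assumption: undefined case is reported as 0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / len(relevant)


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}
print(recall_at_k(ranked, relevant, k=2))  # 1 of the 3 relevant items is in the top 2
print(recall_at_k(ranked, relevant, k=4))  # 2 of the 3 relevant items are in the top 4
```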

Prec@k

\[\frac{\text{\# of relevant items in top-k}}{k}\]

Con: does not consider ranking of results within k.
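A small sketch of precision@k under the same kind of assumptions (item IDs as input; `precision_at_k` and the sample IDs are hypothetical). The last two calls illustrate the drawback noted above: reordering items inside the top-k does not change the score.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    relevant = set(relevant_ids)
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / k


relevant = {"d1", "d2", "d9"}
print(precision_at_k(["d3", "d1", "d7", "d2"], relevant, k=2))  # 1 hit in top 2 -> 0.5
# Swapping the first two results gives the same value: order within k is ignored.
print(precision_at_k(["d1", "d3", "d7", "d2"], relevant, k=2))  # still 0.5
```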

mAP

Mean average precision.

First, what is average precision (AP)?

AP averages Prec@i over the ranks i at which a relevant item appears:

\[AP@k = \frac{1}{\text{total \# of relevant items}} \sum_{i=1}^k \text{Prec@i} \times \text{Rel@i}\]

Rel@i is the relevance of the item at rank i, typically 0 or 1, where 1 indicates a relevant item. The denominator is the total number of relevant items for the query in the dataset, the same as in recall.

Second: what is the mean average precision (mAP)? It is the average taken across many queries.

\[mAP = \frac{1}{N} \sum_{i=1}^N AP_i\]

Pro: mAP accounts for both precision and the ranking order of results.
Con: works best for binary relevance, i.e. an item either is or is not relevant.
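The two formulas above can be sketched as follows, assuming binary relevance flags in ranked order and a known count of relevant items per query. The function names and the example data are made up for illustration.

```python
def average_precision(flags, total_relevant):
    """flags: 0/1 relevance in ranked order; total_relevant: relevant items for the query."""
    if total_relevant == 0:
        return 0.0  # assumption: undefined case is reported as 0
    hits = 0
    score = 0.0
    for i, rel in enumerate(flags, start=1):
        if rel:
            hits += 1
            score += hits / i  # Prec@i, counted only where Rel@i = 1
    return score / total_relevant


def mean_average_precision(queries):
    """queries: list of (flags, total_relevant) pairs, one per query."""
    return sum(average_precision(f, n) for f, n in queries) / len(queries)


queries = [
    ([1, 0, 1], 2),  # AP = (1/1 + 2/3) / 2 = 5/6
    ([0, 1, 0], 1),  # AP = (1/2) / 1 = 1/2
]
print(mean_average_precision(queries))  # (5/6 + 1/2) / 2 = 2/3
```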

DCG = discounted cumulative gain

\[DCG@p =\sum_{i=1}^p \frac{\text{rel}_i}{\log_2(i+1)}\]

Essentially, we sum relevance scores from the top of the list to the bottom, discounting each score more heavily the further down it appears.

Note that the logarithmic discount started out as a heuristic. From Wikipedia:

“Previously there was no theoretically sound justification for using a logarithmic reduction factor[3] other than the fact that it produces a smooth reduction. But Wang et al. (2013)[2] gave theoretical guarantee for using the logarithmic reduction factor in Normalized DCG (NDCG). The authors show that for every pair of substantially different ranking functions, the NDCG can decide which one is better in a consistent manner.”

Con: not normalized, which makes scores hard to compare across datasets.
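A minimal sketch of the DCG sum, assuming graded relevance scores are given in ranked order. The name `dcg_at_p` and the example scores are hypothetical.

```python
import math


def dcg_at_p(relevances, p):
    """Sum rel_i / log2(i + 1) over the top-p positions (i is 1-based)."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:p], start=1)
    )


# The item at rank 1 is not discounted (log2(2) = 1); lower ranks count for less.
print(dcg_at_p([3, 2, 0, 1], p=4))  # 3/1 + 2/log2(3) + 0 + 1/log2(5), roughly 4.69
# Moving the most relevant item to the bottom lowers the score.
print(dcg_at_p([1, 2, 0, 3], p=4))
```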

nDCG

Normalized DCG:

\[\text{nDCG}@p = \frac{\text{DCG}@p}{\text{IDCG}@p}\]

where IDCG is the DCG of the ideal ranking, i.e. the ranking that places the most relevant items first.

Pro: considers ranking order of results.
Pro: normalization makes it easy to compare across datasets.
Con: harder to interpret than precision or recall. While nDCG = 1 indicates a perfect ranking, lower scores have no direct interpretation.
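A sketch of the normalization, with one simplifying assumption stated loudly: the ideal ranking is built by sorting the relevance scores of the returned list itself. In practice the ideal ranking would usually be built from all judged items for the query. The function names are made up for illustration.

```python
import math


def dcg_at_p(relevances, p):
    """Sum rel_i / log2(i + 1) over the top-p positions (i is 1-based)."""
    return sum(
        rel / math.log2(i + 1)
        for i, rel in enumerate(relevances[:p], start=1)
    )


def ndcg_at_p(relevances, p):
    """DCG@p divided by the DCG@p of the ideal (descending) ordering."""
    # Assumption: the ideal ordering is derived from the returned items only.
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_p(ideal, p)
    return dcg_at_p(relevances, p) / idcg if idcg > 0 else 0.0


print(ndcg_at_p([3, 2, 1, 0], p=4))  # already ideal -> 1.0
print(ndcg_at_p([3, 2, 0, 1], p=4))  # last two items swapped -> slightly below 1
```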

When to use nDCG vs mAP?

mAP is better for binary relevance, i.e. an item either is or is not relevant.

nDCG is better for graded relevance, e.g. an item can be somewhat relevant.

Other evaluation metrics

Other good things to measure in search systems:

Click-through rate (CTR)

\[CTR = \frac{\text{\# clicked results}}{\text{total \# of suggested results}}\]

Average time spent on suggested search results

Conclusion

There are many metrics to measure search systems. The most important thing is to pick a metric that is relevant to your use case.