Created on 2022-06-22

The idea behind F1-score

What is F1-score?

F1-score is a statistic used to assess the performance of a classification algorithm. It ranges from 0 (bad) to 1 (amazing). When our dataset has a large class imbalance, F1-score is preferred over accuracy. Let’s understand why that is. To do so, we need to recall two other related statistics, namely precision and recall.

Precision

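Precision is the fraction of examples predicted as positive that really are positive. It drops when our classifier misclassifies negative examples as positive. We call these False Positives (FP), while correctly predicted positive examples are True Positives (TP). Precision is defined as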

\[precision = \frac{TP}{TP+FP}.\]

Recall

Recall is the fraction of truly positive samples that our classifier manages to find. It drops when the classifier mistakes positive samples for negative. We call these False Negatives (FN). Recall is defined as

\[recall = \frac{TP}{TP+FN}.\]
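To make the two definitions concrete, here is a minimal Python sketch that computes both from raw counts. The function name `precision_recall`, the example counts, and the choice to return 0 when a denominator is empty are assumptions made for illustration only.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) computed from confusion-matrix counts."""
    # Returning 0.0 for an empty denominator is a convention assumed here.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Made-up counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall = precision_recall(tp=8, fp=2, fn=8)
print(precision, recall)  # 0.8 0.5
```

With these counts the classifier is rarely wrong when it says "positive" (precision 0.8), but it only finds half of the actual positives (recall 0.5).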

In addition to these two metrics, let’s also define and understand another seemingly unrelated concept, the harmonic mean.

Harmonic Mean

The harmonic mean is an alternative centrality measure to the arithmetic or the geometric mean. To understand it, let’s look at its definition. Given a sample of positive real numbers $x_1, x_2, \dots, x_n$, the harmonic mean is defined as

\[\bar{x}_H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}.\]

In words, the harmonic mean is the size of the sample divided by the sum of the reciprocals of our observations. So what are the consequences of this definition? It is fairly easy to see that if $x_i$ is large, its reciprocal, and hence its contribution to the denominator of $\bar{x}_H$, will be small, so $\bar{x}_H$ will stay relatively large. However, if $x_i$ is small, its reciprocal will be large, so the denominator grows and $\bar{x}_H$ shrinks. Therefore we can conclude that the harmonic mean implicitly weighs the small observations as more influential.
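To see this effect in numbers, here is a small Python sketch comparing the arithmetic and the harmonic mean on two made-up samples. The helper name `harmonic` and the sample values are assumptions for illustration; the standard library's `statistics.harmonic_mean` is only used as a cross-check.

```python
from statistics import harmonic_mean, mean


def harmonic(xs):
    """Harmonic mean: sample size divided by the sum of reciprocals."""
    return len(xs) / sum(1 / x for x in xs)


balanced = [4, 4]  # both observations equal
skewed = [1, 7]    # same arithmetic mean, but one small observation

print(mean(balanced), harmonic(balanced))  # both means are 4
print(mean(skewed), harmonic(skewed))      # arithmetic mean 4, harmonic mean about 1.75

# Cross-check the hand-written helper against the standard library.
print(abs(harmonic(skewed) - harmonic_mean(skewed)) < 1e-9)
```

Both samples have the same arithmetic mean, yet the single small observation in the second sample pulls its harmonic mean down to roughly 1.75.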

F1-score is a harmonic mean

That’s right, F1-score is just the harmonic mean of precision and recall. That is

\[F1 = \frac{2}{\frac{1}{precision} + \frac{1}{recall}}.\]
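As a side note, the same quantity can be written without nested fractions. Multiplying the numerator and the denominator by $precision \cdot recall$, and then substituting the definitions of precision and recall given earlier, we get

\[F1 = \frac{2 \cdot precision \cdot recall}{precision + recall} = \frac{2TP}{2TP + FP + FN}.\]

The last form follows because $\frac{1}{precision} + \frac{1}{recall} = \frac{TP+FP}{TP} + \frac{TP+FN}{TP} = \frac{2TP+FP+FN}{TP}$. It shows that False Positives and False Negatives are penalised equally.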

Hence, F1-score is a centrality measure of precision and recall that is biased towards 0 if one of the two values is disproportionately smaller than the other. So your classifier will only achieve a high F1-score if it is good at both precision and recall. If it sucks at one or the other, your F1-score will reflect this.
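A quick Python sketch illustrates this behaviour. The function name `f1_score` and the example values are assumptions chosen for illustration, as is the convention of returning 0 when either input is 0 (the harmonic mean itself is undefined there).

```python
def f1_score(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    # Assumed convention: if either value is 0, report an F1-score of 0.
    if precision == 0 or recall == 0:
        return 0.0
    return 2 / (1 / precision + 1 / recall)


good_at_both = f1_score(0.8, 0.8)    # equal inputs give back the same value, about 0.8
bad_at_one = f1_score(0.95, 0.05)    # about 0.095, although the arithmetic mean is 0.5

print(good_at_both, bad_at_one)
```

The second classifier has an arithmetic mean of precision and recall of 0.5, but its F1-score stays close to the smaller of the two values.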

Why is this useful for imbalanced datasets? On imbalanced datasets your classifier often defaults to just classifying everything as the more prevalent class (let’s say the negative one). If the imbalance is severe, you would find your accuracy to be quite good. However, in this case accuracy is a misleading metric, as it doesn’t show that your classifier actually defaulted to labeling all the datapoints as negative. Your recall, on the other hand, is 0, since not a single positive example was found, and so is your F1-score (precision is formally 0/0 here and is conventionally taken to be 0). In such cases F1-score comes to the rescue!
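Here is a minimal Python sketch of such a situation, under explicitly assumed conditions: a made-up dataset with 5 positive and 95 negative examples, where the rare class is the positive one, and a classifier that labels everything as the prevalent negative class. The function name `accuracy_and_f1` and the convention of treating undefined ratios as 0 are assumptions as well.

```python
def accuracy_and_f1(y_true, y_pred):
    """Compute accuracy and F1-score for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: undefined ratios (empty denominators) count as 0.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision == 0 or recall == 0:
        return accuracy, 0.0
    return accuracy, 2 / (1 / precision + 1 / recall)


y_true = [1] * 5 + [0] * 95  # severe imbalance: only 5% positives
y_pred = [0] * 100           # classifier always predicts the prevalent class

accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, f1)  # 0.95 0.0
```

Accuracy looks great at 0.95, while the F1-score of 0 exposes that the classifier never identifies a single positive example.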