Why does Kaggle use Log-loss?

If you're not familiar with Kaggle, it's an organization dedicated to data science competitions to both provide ways for companies to potentially do analytics at less cost, as well as to identify talented data scientists.

Competitions are scored using a variety of functions, and the most common for binary classification tasks with confidence is something called log-loss, which is essentially \(\sum_{i=1}^{n} \log(p_i)\), where \(p_i\) is your model's claimed confidence for test data point \(i\)'s correct label. Why does Kaggle use this scoring function? Here I'll follow Terry Tao's argument.

Ideally what we'd like is a scoring function \(f(x)\) that yields the maximum expected score precisely when the claimed confidence \(x_i\) in the correct label for \(i\) is actually what the submitter believes is the true probability (or frequency) of that outcome. This means that we want \[L(x)=p\cdot f(x) + (1-p)\cdot f(1-x)\] for fixed \(p\) to be maximized when \(x=p\). Differentiating, this means \[L'(x) = p\cdot f'(x) - (1-p)\cdot f'(1-x) = 0\] when \(x=p\), hence \(p\cdot f'(p) = (1-p)\cdot f'(1-p)\) for all \(p\). This will be satisfied by any admissible \(f(x)\) with \(x\cdot f'(x)\) symmetric around \(x=\frac{1}{2}\), but if we extend our analysis to multinomial outcomes we get the stronger conclusion that in fact \(x\cdot f'(x) = c_0\) for some constant \(c_0\). This in turn implies \(f(x)=c_0\cdot \log(x)+c_1\). If we want \(f(1/2)=0\) and \(f(1)=1\), we end up with \(f(x)={\log}_2(2x)\) and the expected score is \[L(x)=x\cdot {\log}_2(2x) + (1-x)\cdot {\log}_2(2(1-x)).\]

The Angry Statistician

Search This Blog

Probability and Cumulative Dice Sums

Why does Kaggle use Log-loss?

Comments

Post a Comment

Popular posts from this blog

Mining the First 3.5 Million California Unclaimed Property Records

Mixed Models in R - Bigger, Faster, Stronger

A Bayes' Solution to Monty Hall