If you're not familiar with Kaggle, it's an organization that hosts data science competitions, both to give companies a potentially cheaper way to get analytics done and to identify talented data scientists.
Competitions are scored using a variety of functions, and the most common for binary classification tasks with confidence is something called log-loss, which is essentially \(-\frac{1}{n}\sum_{i=1}^{n} \log(p_i)\), where \(p_i\) is your model's claimed confidence for test data point \(i\)'s correct label; minimizing it is the same as maximizing the average log-probability assigned to the correct labels. Why does Kaggle use this scoring function? Here I'll follow Terry Tao's argument.
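As a concrete illustration, here is a minimal Python sketch of that computation (the function name and the clipping value are my own choices, and Kaggle's actual implementation may differ in such details):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the predicted probabilities.

    y_true: list of 0/1 labels
    p_pred: list of predicted probabilities for the positive class
    eps: clip value to avoid taking log(0)
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)        # clip to the open interval (0, 1)
        p_correct = p if y == 1 else 1 - p   # probability assigned to the true label
        total += math.log(p_correct)
    return -total / len(y_true)

# A confident correct prediction scores near 0; a confident wrong one is heavily penalized.
print(log_loss([1, 0, 1], [0.9, 0.1, 0.8]))  # ≈ 0.145
```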
Ideally what we'd like is a scoring function \(f(x)\) that yields the maximum expected score precisely when the claimed confidence \(x_i\) in the correct label for \(i\) is actually what the submitter believes is the true probability (or frequency) of that outcome. This means that we want \[L(x)=p\cdot f(x) + (1-p)\cdot f(1-x)\] for fixed \(p\) to be maximized when \(x=p\). Differentiating, this means \[L'(x) = p\cdot f'(x) - (1-p)\cdot f'(1-x) = 0\] when \(x=p\), hence \(p\cdot f'(p) = (1-p)\cdot f'(1-p)\) for all \(p\). This will be satisfied by any admissible \(f(x)\) with \(x\cdot f'(x)\) symmetric around \(x=\frac{1}{2}\), but if we extend our analysis to multinomial outcomes we get the stronger conclusion that in fact \(x\cdot f'(x) = c_0\) for some constant \(c_0\). This in turn implies \(f(x)=c_0\cdot \log(x)+c_1\). If we want \(f(1/2)=0\) and \(f(1)=1\), we end up with \(f(x)={\log}_2(2x)\), and the maximum expected score, attained by honestly reporting the believed probability \(p\), is \[L(p)=p\cdot {\log}_2(2p) + (1-p)\cdot {\log}_2(2(1-p)).\]
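To see that this scoring rule really does reward honest reporting, here is a small numerical check (my own sketch, not from Tao's argument): for a few true probabilities \(p\), it searches a grid of claimed confidences \(x\) for the one maximizing the expected score \(p\cdot f(x) + (1-p)\cdot f(1-x)\), and the maximizer lands at \(x \approx p\) each time.

```python
import math

def f(x):
    # Scoring function f(x) = log2(2x), normalized so f(1/2) = 0 and f(1) = 1.
    return math.log2(2 * x)

def expected_score(x, p):
    # Expected score when the true probability is p and the claimed confidence is x.
    return p * f(x) + (1 - p) * f(1 - x)

# For several true probabilities p, the best claimed confidence on a grid is x ≈ p.
xs = [i / 1000 for i in range(1, 1000)]
for p in (0.2, 0.5, 0.7, 0.9):
    best_x = max(xs, key=lambda x: expected_score(x, p))
    print(f"p = {p:.1f}  ->  best claimed confidence ≈ {best_x:.3f}")
```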