

Showing posts from April, 2013

Why does Kaggle use Log-loss?

Lunchtime Sports Science: Introducing tanh5

As I mentioned in a previous article on ratings systems, the log5 estimate for participant 1 beating participant 2 given respective success probabilities \( p_1, p_2 \) is \begin{align}
p &= \frac{p_1 q_2}{p_1 q_2+q_1 p_2}\\
&= \frac{p_1/q_1}{p_1/q_1+p_2/q_2}\\
\frac{p}{q} &= \frac{p_1}{q_1} \cdot \frac{q_2}{p_2}
\end{align} where \( q_1=1-p_1, q_2=1-p_2, q=1-p \).
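
As a quick numerical check of the log5 formula, here is a minimal Python sketch. The function names `log5` and `odds` and the sample probabilities 0.6 and 0.4 are illustrative assumptions, not taken from the original article; the code assumes all probabilities lie strictly between 0 and 1.

```python
def log5(p1, p2):
    """Log5 estimate that participant 1 beats participant 2.

    p1, p2 are the participants' success probabilities,
    assumed to lie strictly between 0 and 1.
    """
    q1, q2 = 1.0 - p1, 1.0 - p2
    return (p1 * q2) / (p1 * q2 + q1 * p2)


def odds(p):
    """Convert a probability p (0 < p < 1) to odds p / (1 - p)."""
    return p / (1.0 - p)


# Hypothetical inputs chosen only for illustration.
p = log5(0.6, 0.4)

# The odds form of log5: p/q = (p1/q1) * (q2/p2).
assert abs(odds(p) - odds(0.6) / odds(0.4)) < 1e-12

print(round(p, 4))  # 0.36 / 0.52, about 0.6923
```

Note that when the opponent is exactly average (`p2 = 0.5`), the estimate reduces to `p1`, which is the sanity check you would expect.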

Where does this come from? Assume that each participant played average opposition. In a Bradley-Terry setting, this means \begin{align}
p_1 &= \frac{R_1}{R_1 + 1}\\
p_2 &= \frac{R_2}{R_2 + 1},
\end{align} where \( R_1 \) and \( R_2 \) are the (latent) Bradley-Terry ratings; the \( 1 \) in the denominators is an estimate for the average rating of the participants they've played en route to achieving their respective success probabilities.
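
To spell out the algebra (a worked step using the same symbols as above, together with the standard Bradley-Terry win probability \( R_1/(R_1+R_2) \)): solving each equation for its rating shows that the rating is just the odds, and substituting recovers log5. \begin{align}
R_1 &= \frac{p_1}{q_1}, \qquad R_2 = \frac{p_2}{q_2}\\
p &= \frac{R_1}{R_1+R_2} = \frac{p_1/q_1}{p_1/q_1+p_2/q_2} = \frac{p_1 q_2}{p_1 q_2+q_1 p_2}.
\end{align}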
In a Bradley-Terry setting, it's true that the product of the ratings in the entire pool is taken to equal 1. But participants don't play themselves! T…

Lunchtime Sports Science: Fitting a Bradley-Terry Model

Power rankings are game rankings that also allow you to estimate the likely outcome if two opponents were to face each other. One of the simplest of these models is known as the Bradley-Terry-Luce model (or commonly, Bradley-Terry). The idea is that each player \( i \) is assumed to have an unknown rating \( R_i \). If players \( i \) and \( j \) compete, the model gives the probability that \( i \) wins as \[ \frac{R_i}{R_i + R_j}. \] This model is very popular for hockey and other games; one commonly seen version is called KRACH.
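
To make the model concrete, here is a minimal Python sketch of the win probability together with a toy fit. The fit uses a simple iterative (minorization-maximization) update, which is a swapped-in illustration and not the method of any particular package. The function names, the `wins` matrix layout, and the two-player data are all assumptions made for this example. The sketch assumes every player has at least one win and one loss and that the schedule connects all players; otherwise the ratings do not converge to finite positive values.

```python
import math


def bt_win_prob(r_i, r_j):
    """Bradley-Terry probability that a player rated r_i beats one rated r_j."""
    return r_i / (r_i + r_j)


def fit_bradley_terry(wins, n_iter=200):
    """Estimate ratings from a win matrix with a basic MM iteration.

    wins[i][j] is the number of times player i beat player j.
    Assumes every player has at least one win and one loss.
    Ratings are rescaled so that their product equals 1.
    """
    n = len(wins)
    ratings = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Games between i and j, weighted by the current ratings.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (ratings[i] + ratings[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom)
        # Normalize so the geometric mean (hence the product) is 1.
        log_mean = sum(math.log(r) for r in updated) / n
        ratings = [r / math.exp(log_mean) for r in updated]
    return ratings


# Hypothetical data: player 0 beat player 1 three times and lost once.
toy_wins = [[0, 3], [1, 0]]
toy_ratings = fit_bradley_terry(toy_wins)
print(bt_win_prob(toy_ratings[0], toy_ratings[1]))
```

With only two players the fitted win probability simply matches the observed win fraction (three wins in four games); the iteration becomes more interesting once the schedule is unbalanced across many teams.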

Let's fit a Bradley-Terry model to the current season of NCAA D1 men's hockey. The Frozen Four starts on Thursday, April 11, so you'll get to see how well your predictions do.

You'll need to have R installed. Once R is installed, install the "BradleyTerry2" package that's freely available for R (thanks to Heather Turner and David Firth). To do this, start R and run the following command; you'll have to…

Lunchtime Sports Science: Cracking a New Sport

This is the first of what will be a series of relatively short pieces on sports analytics. I'll be using a variety of sports for examples, including both team sports and single-player sports, and I'll also make my code available through my GitHub account.

Here are my recommended tools. If you're unfamiliar with some of these, don't worry. You'll pick them up as you go along, and they form a powerful suite that will keep you on the cutting edge even as a professional data scientist.
Hardware - Ideally you want at least 4GB of RAM for larger data sets, but you'll be able to do high-level analysis with almost any modern computing hardware.
Linux operating system - You can certainly do top-notch data analysis using any operating system, but Linux is an excellent (and free) working environment. There are a variety of ways to install and use Linux, but I'd recommend Ubuntu's Windows installer. This will allow you to easily install Ubuntu alongside Windows, and…