Skip to main content


Showing posts from 2015

Why does Kaggle use Log-loss?

An Enormous Number of Kilograms

For years the kilogram has been defined with respect to a platinum and iridium cylinder, but this is now no longer the case. Here's a puzzle about kilograms that's easy to state and understand, but the answer is very, very surprising.

I've always had a fascination with really large numbers. First 100 when I was really little, and as I got older and more sophisticated numbers like a googol and the smallest number that satisfies the conditions of the Archimedes cattle problem.

When I was an undergraduate I interviewed for a summer internship with an insurance company as an actuarial student. They gave me the following puzzle - what's the smallest number that when you move the last digit to the front it multiplies by 2? I calculated for a little while, then said "This can't be right, my answer has 18 digits!". It turns out that the smallest solution does, indeed, have 18 digits.

We can see this by letting our \((n+1)\)-digit number \( x = 10 m + a\), where \…

Solving a Math Puzzle using Physics

The following math problem, which appeared on a Scottish maths paper, has been making the internet rounds.

The first two parts require students to interpret the meaning of the components of the formula \(T(x) = 5 \sqrt{36+x^2} + 4(20-x) \), and the final "challenge" component involves finding the minimum of \( T(x) \) over \( 0 \leq x \leq 20 \). Usually this would require a differentiation, but if you know Snell's law you can write down the solution almost immediately. People normally think of Snell's law in the context of light and optics, but it's really a statement about least time across media permitting different velocities.

One way to phrase Snell's law is that least travel time is achieved when \[ \frac{\sin{\theta_1}}{\sin{\theta_2}} = \frac{v_1}{v_2},\] where \( \theta_1, \theta_2\) are the angles to the normal and \(v_1, v_2\) are the travel velocities in the two media.

In our puzzle the crocodile has an implied travel velocity of 1/5 in the water …

Mixed Models in R - Bigger, Faster, Stronger

When you start doing more advanced sports analytics you'll eventually starting working with what are known as hierarchical, nested or mixed effects models. These are models that contain both fixed and random effects. There are multiple ways of defining fixed vs random random effects, but one way I find particularly useful is that random effects are being "predicted" rather than "estimated", and this in turn involves some "shrinkage" towards the mean.

Here's some R code for NCAA ice hockey power rankings using a nested Poisson model (which can be found in my hockey GitHub repository):
model <- gs ~ year+field+d_div+o_div+game_length+(1|offense)+(1|defense)+(1|game_id) fit <- glmer(model, data=g, verbose=TRUE, family=poisson(link=log) ) The fixed effects are year, field (home/away/neutral), d_div (NCAA division of the defense), o_div (NCAA division of the offense) and game_length (number of overtime periods); off…

Elo's Rating System as a Forgetful Logistic Model

Elo's rating system became famous from its use in chess, but it and variations are now used in sports like the NFL to eSports like League of Legends. It also was infamously used on various "Hot or Not" type websites, as shown in this scene from the movie "Social Network":

Of course, there's a mistake in the formula in the movie!
What is the Elo rating system? As originally proposed, it presumes that if two players A and B have ratings \(R_A\) and \(R_B\), then the expected score of player A is \[\frac{1}{1+10^{\frac{R_B-R_A}{400}}}.\] Furthermore, if A has a current rating of \(R_A\) and plays some more games, then the updated rating \({R_A}'\) is given by \({R_A}' = R_A + K(S_A-E_A)\), where \(K\) is an adjustment factor, \(S_A\) is the number of points scored by A and \(E_A\) was the expected number of points scored by A based on the rating \(R_A\).
Now, the expected score formula given above has the same form as a logistic regression model. What'…

Power Rankings: Looking at a Very Simple Method

One of the simplest and most common power ranking models is known as the Bradley-Terry-Luce model, which is equivalent to other famous models such the logistic model and the Elo rating system. I'll be referring to "teams" here, but of course the same ideas apply to any two-participant game.

Let me clarify what I mean when I use the term "power ranking". A power ranking supplies not only a ranking of teams, but also provides numbers that may be used to estimate the probabilities of various outcomes were two particular teams to play a match.

In the BTL power ranking system we assume the teams have some latent (hidden/unknown) "strength" \(R_i\), and that the probability of \(i\) beating \(j\) is \( \frac{R_i}{R_i+R_j} \). Note that each \(R_i\) is assumed to be strictly positive. Where does this model structure come from?

Here are three reasonable constraints for a power ranking model:
 If \(R_i\) and \(R_j\) have equal strength, the probability of one b…

Getting Started Doing Baseball Analysis without Coding

There's lot of confusion about how best to get started doing baseball analysis. It doesn't have to be difficult! You can start doing it right away, even if you don't know anything about R, Python, Ruby, SQL or machine learning (most GMs can't code). Learning these and other tools makes it easier and faster to do analysis, but they're only part of the process of constructing a well-reasoned argument. They're important, of course, because they can turn 2 months of hard work into 10 minutes of typing. Even if you don't like mathematics, statistics, coding or databases, they're mundane necessities that can make your life much easier and your analysis more powerful.

Here are two example problems. You don't have to do these specifically, but they illustrate the general idea. Write up your solutions, then publish them for other people to make some (hopefully) helpful comments and suggestions. This can be on a blog or through a versioning control platform l…

Some Potentially Useful SQL Resources

Who Controls the Pace in Basketball, Offense or Defense?

During a recent chat with basketball analyst Seth Partnow, he mentioned a topic that came up during a discussion at the recent MIT Sloan Sports Analytics Conference. Who  controls the pace of a game more, the offense or defense? And what is the percentage of pace responsibility for each side? The analysts came up with a rough consensus opinion, but is there a way to answer this question analytically? I came up with one approach that examines the variations in possession times, but it suddenly occurred to me that this question could also be answered immediately by looking at the offense-defense asymmetry of the home court advantage.
As you can see in the R output of my NCAA team model code in one of my public basketball repositories, the offense at home scores points at a rate about \( e^{0.0302} = 1.031 \) times the rate on a neutral court, everything else the same. Likewise, the defense at home allows points at a rate about \( e^{-0.0165} = 0.984\) times the rate on a neutral court; …

Baseball's Billion Dollar Equation

In 1999 Voros McCracken infamously speculated about the amount of control the pitcher had over balls put in play. Not so much, as it turned out, and DIPS was born. It's tough to put a value on something like DIPS, but if an MLB team had developed and exploited it for several years, it could potentially have been worth hundreds of millions of dollars. Likewise, catcher framing could easily have been worth hundreds of millions.

How about a billion dollar equation? Sure, look at the baseball draft. An 8th round draft pick like Paul Goldschmidt could net you a $200M surplus. And then there's Chase Headley, Matt Carpenter, Brandon Belt, Jason Kipnis and Matt Adams. The commonality? All college position players easily identified as likely major leaguers purely through statistical analysis. You can also do statistical analysis for college pitchers, of course, but ideally you'd also want velocities. These are frequently available through public sources, but you may have to put the…

A Very Rough Guide to Getting Started in Data Science: Part II, The Big Picture

Data science to a beginner seems completely overwhelming. Not only are there huge numbers of programming languages, packages and algorithms, but even managing your data is an entire area itself. Some examples are the languages R, Python, Ruby, Perl, Julia, Mathematica, MATLAB/Octave; packages SAS, STATA, SPSS; algorithms linear regression, logistic regression, nested model, neural nets, support vector machines, linear discriminant analysis and deep learning.
For managing your data some people use Excel, or a relational database like MySQL or PostgreSQL. And where do things like big data, NoSQL and Hadoop fit in? And what's gradient descent and why is it important? But perhaps the most difficult part of all is that you actually need to know and understand statistics, too.

It does seem overwhelming, but there's a simple key idea - data science is using data to answer a question. Even if you're only sketching a graph using a stick and a sandbox, you're still doing data sc…

More Measles: Vaccination Rates and School Funding

I took a look at California's personal belief exemption rate (PBE) for kindergarten vaccinations in Part I. California also provides poverty information for public schools through the Free or Reduced Price Meals data sets, both of which conveniently include California's school codes. Cleaned versions of these data sets and my R code are in my vaccination GitHub.

We can use the school code as a key to join these two data sets. But remember, the FRPM data set only includes data about public schools, so we'll have to retain the private school data for PBEs by doing what's called a left outer join. This still performs a join on the school code key, but if any school codes included in the left data don't have corresponding entries in the right data set we still retain them. The missing values for the right data set in this case are set to NULL.

We can perform a left outer join in R by using "merge" with the option "all.x=TRUE". I'll start by look…

Mere Measles: A Look at Vaccination Rates in California, Part I

California is currently at the epicenter of a measles outbreak, a disease that was considered all but eradicated in the US as of a few years ago. Measles is a nasty disease; it's easily transmitted and at its worst can cause measles encephalitis, leading to brain damage or even death.

The increasing problem in the US with the measles, whooping cough and other nearly-eradicated diseases stems from a liberal personal belief exemption policy in California and other states. This wasn't a major problem until Andrew Wakefieldfamously and fraudulently tied autism to the MMR vaccine. This has led to thousands of unnecessary deaths as well as needless misery for thousands more. I myself caught a case of the whooping cough in San Diego a few years ago as a consequence of Wakefield's fraud. I've had several MMR vaccines over my life, but adults may still only be partially immune; this is yet another reason why a healthy level of herd immunity is so critical to maintain.

PBE rates…

Touring Waldo; Overfitting Waldo; Scanning Waldo; Waldo, Waldo, Waldo

Randal Olson has written a nice article on finding Waldo - Here’s Waldo: Computing the optimal search strategy for finding Waldo. Randal presents a variety of machine learning methods to find very good search paths among the 68 known locations of Waldo. Of course, there's no need for an approximation; modern algorithms can optimize tiny problems like these exactly.

One approach would be to treat this as a traveling salesman problem with Euclidean distances as edge weights, but you'll need to add a dummy node that has edge weight 0 to every node. Once you have the optimal tour, delete the dummy node and you have your optimal Hamiltonian path.

I haven't coded in the dummy node yet, but here's the Waldo problem as a traveling salesman problem using TSPLIB format.

The Condorde software package optimizes this in a fraction of a second:

I'll be updating this article to graphically show you the results for the optimal Hamiltonian path. There are also many additional quest…

Short Notes: Get CUDA and gputools Running on Ubuntu 14.10

Here's a basic guide for getting CUDA 7.0 and the R package gputools running perfectly under Ubuntu 14.10. It's not difficult, but there are a few issues and this will be helpful to have in a single place.

If you're running Ubuntu 14.10, I'd recommend installing CUDA 7.0. NVIDIA has a 7.0 Debian package specifically for 14.10; this wasn't the case for CUDA 6.5, which only had a Debian package for 14.04.

To get access to CUDA 7.0, you'll first need to register as a CUDA developer.

Join The CUDA Registered Developer Program

Once you have access, navigate to the CUDA 7.0 download page and get the Debian package.

CUDA 7.0 Release Candidate Downloads

You'll either need to be running the NVIDIA 340 or 346 drivers. If you're having trouble upgrading, I'd suggest adding the xorg-edgers PPA.

Once your NVIDIA driver is set, install the CUDA 7.0 Debian package you've downloaded. Don't forget to remove any previously installed CUDA packages or repositories…

A Very Rough Guide to Getting Started in Data Science: Part I, MOOCs

Data science is a very hot, perhaps the hottest, field right now. Sports analytics has been my primary area of interest, and it's a field that has seen amazing growth in the last decade. It's no surprise that the most common question I'm asked is about becoming a data scientist. This will be a first set of rough notes attempting to answer this question from my own personal perspective. Keep in mind that this is only my opinion and there are many different ways to do data science and become a data scientist.

Data science is using data to answer a question. This could be doing something as simple as making a plot by hand, or using Excel to take the average of a set of numbers. The important parts of this process are knowing which questions to ask, deciding what information you'd need to answer it, picking a method that takes this data and produces results relevant to your question and, most importantly, how to properly interpret these results so you can be …

How Unfair are the NFL's Overtime Rules?

In 2010 the NFL amended its overtime rules, and in 2012 extended these to all regular season games. Previously, overtime was handled by sudden death - the first team to score won. The team winning a coin flip can elect to kick or receive (they invariably receive, as they should).

Assuming the game ends in the first overtime, the team with the first possession wins under the following scenarios:
scores a touchdown on the first drivekicks a field goal on the first drive; other team fails to score on the second driveboth teams kick a field goal on the first and second drives; win in sudden deathdoesn't score on the first drive; defensive score during second driveneither team scores on first or second drives; win in sudden death Under this overtime procedure, roughly how often should be expect the team winning the coin flip to win the game?
For an average team the empirical probabilities of the above events per drive are: \(\mathrm{defensiveTD} = \mathrm{Pr}(\text{defensive touchdown}) …

A Short Note on Automatic Algorithm Optimization via Fast Matrix Exponentiation

Alexander Borzunov has written an interesting article about his Python code that uses fast matrix exponentiation to automatically optimize certain algorithms. It's definitely a recommended read.

In his article, Alexander mentions that it's difficult to directly derive a matrix exponentiation algorithm for recursively-defined sequences such as \[
F_n = \begin{cases}
0, & n = 0\\
1, & n = 1\\
1, & n = 2\\
7(2F_{n-1} + 3F_{n-2})+4F_{n-3}+5n+6, & n \geq 3
\] While it's true that it's not entirely simple, there is a relatively straightforward way to do this that's worth knowing.

The only difficultly is due to the term \(5n+6\), but we can eliminate it by setting
\(F_n = G_n + an+b\), then solving for appropriate values of \(a, b\).

Substituting and grouping terms we have \[
G_n + an+b = 7(2G_{n-1} + 3G_{n-2})+4G_{n-3} + 39an-68a+39b+5n+6,
\] and equating powers of \(n\) we need to solve the equations \[
a &= 39a+5,\\
b &…

Young Alan Turing and the Arctangent

With the release of the new film "The Imitation Game", I decided to read the biography this excellent film was based on - Alan Turing: The Enigma. In it, the author Andrew Hodges relates the story that the 15-year-old Alan Turing derived the Maclaurin series for the \(\arctan\) function, i.e. \[\arctan(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \ldots\] This is trivial using calculus, but it's explicitly stated that young Alan Turing neither knew nor used calculus. How would you derived such a series without calculus?

This is a tricky problem, and I'd suggest first tackling the much easier problem of deriving the Maclaurin series for \(\exp(x)\) from the relation \( \exp(2x) = \exp(x)\cdot \exp(x)\). This is an underconstrained relation, so you'll need to assume \(c_0 = 1, c_1 = 1\).

Getting back to \(\arctan\), you could start with the half-angle formula for the tangent: \[\tan(2x) = \frac{2\tan(x)}{1-{\tan}^2(x)}.\] Now use the Weierstrass-like su…