Thursday, October 15, 2015

An Enormous Number of Kilograms

For years the kilogram has been defined with respect to a platinum and iridium cylinder, but this is now no longer the case. Here's a puzzle about kilograms that's easy to state and understand, but the answer is very, very surprising.

I've always had a fascination with really large numbers. First 100 when I was really little, and as I got older and more sophisticated numbers like a googol and the smallest number that satisfies the conditions of the Archimedes cattle problem.

When I was an undergraduate I interviewed for a summer internship with an insurance company as an actuarial student. They gave me the following puzzle - what's the smallest number that when you move the last digit to the front it multiplies by 2? I calculated for a little while, then said "This can't be right, my answer has 18 digits!". It turns out that the smallest solution does, indeed, have 18 digits.

We can see this by letting our \((n+1)\)-digit number \( x = 10 m + a\), where \(m\) is an \(n\)-digit number and \(0\leq a < 10\). Moving \(a\) to the front we get \(y = 10^n a + m\), and our requirement is \(y = 2x\). This gives: \begin{align} 20 m + 2 &= 10^n a + m \\ 19 m &= a(10^n-2) \\ m &= \frac{2a(5\cdot 10^{n-1} - 1)}{19} \end{align} The smallest \(m\), if one exists, requires \(a,n\) such that 19 divides \(5\cdot 10^{n-1}-1\) (as 19 can't divide \(2 a\)) and the result has \(n\)-digits. It's easy to check that the smallest value of \(n\) that satisfies the first condition is \(n=17\). To get the smallest solution we try \(a=1\), but this yields a value with only 16 digits. Setting \(a=2\), however, yields the 17-digit \(m = 10526315789473684\). The smallest solution to our puzzle is therefore the 18-digit number \(105263157894736842\); that's surprisingly large.

Numbers with this type of property are known as parasitic numbers.

Later, I wondered if there were numbers with the slightly different, but equally interesting property, that moving the last digit to the front converted ("autoconverts") it from a value under one unit of measurement to an equivalent value under a different unit of measurement.

The first one I tried was moving the last digit to the front converts from Celsius to Fahrenheit. This is a fun puzzle that eventually made its way into the New York Times. The smallest such value is 275 C, which exactly equals 527 F. What's the next smallest temperature?

How about moving the first digit to the end? We'll need to use the little-known fact that, legally, a pound is exactly equal to 0.45359237 kilograms. Given this, does there exist a number such that moving the first digit to the end converts from pounds to kilograms? The answer is yes, but the smallest solution has 108,437,840 digits! The solution is similar to the above, but as it's computationally more involved I've written Sage code to solve it, which you can find in my GitHub puzzles repository.

The smallest number that autoconverts from gallons to liters, incidentally, is even bigger at 382,614,539 digits!

Saturday, October 10, 2015

Solving a Math Puzzle using Physics

The following math problem, which appeared on a Scottish maths paper, has been making the internet rounds.

The first two parts require students to interpret the meaning of the components of the formula \(T(x) = 5 \sqrt{36+x^2} + 4(20-x) \), and the final "challenge" component involves finding the minimum of \( T(x) \) over \( 0 \leq x \leq 20 \). Usually this would require a differentiation, but if you know Snell's law you can write down the solution almost immediately. People normally think of Snell's law in the context of light and optics, but it's really a statement about least time across media permitting different velocities.

One way to phrase Snell's law is that least travel time is achieved when \[ \frac{\sin{\theta_1}}{\sin{\theta_2}} = \frac{v_1}{v_2},\] where \( \theta_1, \theta_2\) are the angles to the normal and \(v_1, v_2\) are the travel velocities in the two media.

In our puzzle the crocodile has an implied travel velocity of 1/5 in the water and 1/4 on land. Furthermore, the crocodile travels along the riverbank once it hits land, so \( \theta_2 = 90^{\circ}\) and \(\sin{\theta_2} = 1\). Snell's law now says that the path of least time satisfies \[ \sin{\theta_1} = \frac{x}{\sqrt{36+x^2}} = \frac{4}{5},\] giving us \( 25x^2 = 16x^2 + 24^2\). Solving, \( 3^2 x^2 = 24^2, x^2 = 8^2\) and the solution is \(x = 8\).

Sunday, October 4, 2015

Mixed Models in R - Bigger, Faster, Stronger

When you start doing more advanced sports analytics you'll eventually starting working with what are known as hierarchical, nested or mixed effects models. These are models that contain both fixed and random effects. There are multiple ways of defining fixed vs random random effects, but one way I find particularly useful is that random effects are being "predicted" rather than "estimated", and this in turn involves some "shrinkage" towards the mean.

Here's some R code for NCAA ice hockey power rankings using a nested Poisson model (which can be found in my hockey GitHub repository):
model <- gs ~ year+field+d_div+o_div+game_length+(1|offense)+(1|defense)+(1|game_id)
fit <- glmer(model,
The fixed effects are year, field (home/away/neutral), d_div (NCAA division of the defense), o_div (NCAA division of the offense) and game_length (number of overtime periods); offense (strength of offense), defense (strength of defense) and game_id are all random effects. The reason for modeling team offenses and defenses as random vs fixed effects is that I view them as random samples from the same distribution. As mentioned above, this results in statistical shrinkage or "regression to the mean" for these values, which is particularly useful for partially completed seasons.

One of the problems with large mixed models is that they can be very slow to fit. For example, the model above takes several hours on a 12-core workstation, which makes it very difficult to test model changes and tweaks. Is there any way to speed up the fitting process? Certainly! One way is to add two options to the above code:

fit <- glmer(model,
      control=glmerControl(optimizer = "nloptwrap")
What do these do? Model fitting is an optimization process. Part of that process involves the estimation of particular integrals, which can be very slow; the option "nAGQ=0" tells glmer to ignore estimating those integrals. For some models this has minimal impact on parameter estimates, and this NCAA hockey model is one of those. The second option tells glmer to fit using the "nloptwrap" optimizer (there are several other optimizers available, too), which tends to be faster than the default optimization method.

The impact can be rather startling. With the default options the above model takes about 3 hours to fit. Add these two options, and the model fitting takes 30 seconds with minimal impact on the parameter estimates, or approximately 400 times faster.

Thursday, October 1, 2015

Elo's Rating System as a Forgetful Logistic Model

Elo's rating system became famous from its use in chess, but it and variations are now used in sports like the NFL to eSports like League of Legends. It also was infamously used on various "Hot or Not" type websites, as shown in this scene from the movie "Social Network":

Of course, there's a mistake in the formula in the movie!

What is the Elo rating system? As originally proposed, it presumes that if two players A and B have ratings \(R_A\) and \(R_B\), then the expected score of player A is \[\frac{1}{1+10^{\frac{R_B-R_A}{400}}}.\] Furthermore, if A has a current rating of \(R_A\) and plays some more games, then the updated rating \({R_A}'\) is given by \({R_A}' = R_A + K(S_A-E_A)\), where \(K\) is an adjustment factor, \(S_A\) is the number of points scored by A and \(E_A\) was the expected number of points scored by A based on the rating \(R_A\).

Now, the expected score formula given above has the same form as a logistic regression model. What's the connection between the two? One answer is that Elo's rating system is a type of online version of a logistic model. An online algorithm is an algorithm that only sees each piece of data once. As applied to a statistical model, it's a model with parameter estimates that are updated as new data comes in, but not refitting on the entire data set. It can also be considered a memoryless model; it has "forgotten" the old data and only knows the current parameter estimates. The appeal of such models is that they're extremely efficient, can operate on enormous data sets and parameter estimates can be updated in real-time.

Okay, let's say we have a "forgetful" logistic model. Can we derive an updating rule, and does it look like Elo's? I'm going to give one possible derivation under the simplifying assumption that games are won or lost, with no ties.

We don't know how many games A and B had previously played, so let's assume they both had previously played \(n\) games and have just played \(m\) additional games between them, with A scoring \(S_A\) points. That means they've both played \(n+m\) games, but we're just going to forget this again, so let's adjust everything so that they end up with \(n\) games. One way to do this is to normalize \(n\) and \(m\) so that they sum to \(n\), thus \(n\) becomes \(n\frac{n}{n+m}\) and \(m\) becomes \(m\frac{n}{n+m}\).

We're now assuming they had each played \(n\frac{n}{n+m}\) games in the past, have just played \(m\frac{n}{n+m}\) additional games and A scored \(S_A \frac{n}{n+m}\) points (it has to be adjusted, too!) in those games.

Again, we're memoryless, so we don't know how strong the opponents were that each had played in the past, so we're going to assume that they had both played opponents that were as strong as themselves and had won half and lost half of those games. After all, people generally prefer to play competitive games.

Define \(d\) by \({R_A}' = R_A + d\) and let \(c = R_A - R_B\); we also require that \({R_B}' = R_B - d\). The log-likelihood \(L\) of A having scored \(S_A \frac{n}{n+m}\) points is \[ \frac{-2 n^2}{n+m}\log(1+e^{-d}) -\frac{n^2}{n+m}d-\frac{m n}{n+m}\log(1+e^{-c-2d}) - \frac{(m-S_A)n(c+2d)}{n+m}.\] Factoring out the constant term \(n/(n+m)\) simplifies this to \[ L = -2 n\log(1+e^{-d}) - n d - m \log(1+e^{-c-2d}) - (m-S_A)(c+2d).\] Taking the partial derivative of \(L\) with respect to \(d\) we get
\frac{\partial L}{\partial d} &= 2n \frac{e^{-d}}{1+e^{-d}} -n + 2m \frac{e^{-c-2d}}{1+e^{-c-2d}}-2(m-S_A) \\
&= -n\frac{1-e^{-d}}{1+e^{-d}} + 2 S_A - 2m\frac{1}{1+e^{-c-2d}} \\
&= -n\tanh(d/2) + 2 S_A - 2m\frac{1}{1+e^{-c-2d}}.
\end{align} What is \( m\frac{1}{1+e^{-c-2d}} \)? This is actually just \( {E_A}' \), the expected score for A when playing B for \(m\) games, but assuming the updated ratings for both players. Finally, setting \(\frac{\partial L}{\partial d} = 0\), we get \[ n\tanh(d/2) = 2(S_A - {E_A}')\] and hence \[ \tanh(d/2) = \frac{2}{n} (S_A - {E_A}').\] Assuming \(n\) is large relative to \(S_A - {E_A}'\), we have \( \tanh(d/2) \approx d/2\) and \( {E_A}' \approx E_A \). This is Elo's updating rule in the form \[ d = \frac{4}{n} (S_A - E_A ).\] If we rescale with the constant \( \sigma \), the updating rule becomes \[ d = \frac{4\sigma }{n} (S_A - E_A ).\] We also now see that the adjustment factor \( K = \frac{4\sigma }{n}\).

Saturday, July 11, 2015

Power Rankings: Looking at a Very Simple Method

One of the simplest and most common power ranking models is known as the Bradley-Terry-Luce model, which is equivalent to other famous models such the logistic model and the Elo rating system. I'll be referring to "teams" here, but of course the same ideas apply to any two-participant game.

Let me clarify what I mean when I use the term "power ranking". A power ranking supplies not only a ranking of teams, but also provides numbers that may be used to estimate the probabilities of various outcomes were two particular teams to play a match.

In the BTL power ranking system we assume the teams have some latent (hidden/unknown) "strength" \(R_i\), and that the probability of \(i\) beating \(j\) is \( \frac{R_i}{R_i+R_j} \). Note that each \(R_i\) is assumed to be strictly positive. Where does this model structure come from?

Here are three reasonable constraints for a power ranking model:
  1.  If \(R_i\) and \(R_j\) have equal strength, the probability of one beating the other should be \( \frac{1}{2}\).
  2. As the strength of one team strictly approaches 0 (infinitely weak) with the other team fixed, the probability of the other team winning strictly increases to 1.
  3. As the strength of one team strictly approaches 1 (infinitely strong) with the other team fixed, the probability of the other team winning strictly decreases to 0.
Note that our model structure satisfies all three constraints. Can you think of other simple model structures that satisfy all three constraints?

Given this model and a set of teams and match results, how can we estimate the \(R_i\). The maximum-likelihood estimators are the set of \( R_i \) that maximizes the probability of the observed outcomes actually happening. For any given match this probability of team \( i \) beating team \( j \) is \( \frac{R_i}{R_i+R_j} \), so the overall probability of the observed outcomes of the matches \( M \) occurring is \[ \mathcal{L} = \prod_{m\in M} \frac{R_{w(m)}}{R_{w(m)}+R_{l(m)}},\] where \( w(m) \) is then winner and \( l(m) \) the loser of match \( m \). We can transform this into a sum by taking logarithms; \[ \log\left( \mathcal{L} \right) = \log\left(R_{w(m)}\right) - \log\left(R_{w(m)}+R_{l(m)}\right).\] Before going further, let's make a useful reparameterization by setting \( e^{r_i} = R_i \); this makes sense as we're requiring the \( R_i \) to be strictly positive. We then get \[ \log\left( \mathcal{L} \right) = r_{w(m)} - \log\left(e^{r_{w(m)}}+e^{r_{l(m)}}\right).\] Taking partial derivatives we get \begin{eqnarray*}
\frac{\partial \log\left( \mathcal{L} \right)}{\partial r_i} &=& \sum_{w(m)=i} 1 - \frac{e^{r_{w(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_{l(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}}\\
&=& \sum_{w(m)=i} 1 - \frac{e^{r_i}}{e^{r_i}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_i}}{e^{r_{w(m)}}+e^{r_i}}\\
\end{eqnarray*} But this is just the number of actual wins minus the expected wins! Thus, the maximum likelihood estimators for the \( r_i \) satisfy \( O_i = E_i \) for all teams \( i \), where \( O_i \) is the actual (observed) number of wins for team \( i \), and \( E_i \) is the expected number of wins for team \( i \) based on our model. That's a nice property!

If you'd like to experiment with some actual data, and to see that the resulting fit does indeed satisfy this property, here's an example BTL model using NCAA men's ice hockey scores. You can, of course, actually use this property to iteratively solve for the MLE estimators \( R_i \). Note that you'll have to fix one of the \( R_i \) to be a particular value (or add some other constraint), as the model probabilities are invariant with respect to multiplication of the \( R_i \) by the same positive scalar.

Getting Started Doing Baseball Analysis without Coding

There's lot of confusion about how best to get started doing baseball analysis. It doesn't have to be difficult! You can start doing it right away, even if you don't know anything about R, Python, Ruby, SQL or machine learning (most GMs can't code). Learning these and other tools makes it easier and faster to do analysis, but they're only part of the process of constructing a well-reasoned argument. They're important, of course, because they can turn 2 months of hard work into 10 minutes of typing. Even if you don't like mathematics, statistics, coding or databases, they're mundane necessities that can make your life much easier and your analysis more powerful.

Here are two example problems. You don't have to do these specifically, but they illustrate the general idea. Write up your solutions, then publish them for other people to make some (hopefully) helpful comments and suggestions. This can be on a blog or through a versioning control platform like GitHub (which is also great for versioning any code or data your use). Try to write well! A great argument, but poorly written and poorly presented isn't going to be very convincing. Once it's finished, review and revise, review and revise, review and revise. When a team you follow makes a move, treat it as a puzzle for you to solve. Why did they do it, and was it a good idea?
  1. Pick a recent baseball trade. For example, the Padres traded catcher Yasmani Grandal for Dodgers outfielder Matt Kemp. It's never that simple of course; the Padres aren't paying all of Matt Kemp's salary. Find out what the salary obligations were for each club in this trade. Using your favorite public projection system, where were the projected surplus values for each player at the time of the trade? Of course, there were other players involved in that trade, too. What were the expected surplus values of those players? From the perspective of surplus values, who won or lost this trade? Finally, why do you think each team made this trade, especially considering that they were division rivals? Do you think one or both teams made any mistakes in reasoning; if so, what were they, and did the other team take advantage of those mistakes?
  2. Pick any MLB team and review the draft picks they made in the 2015 draft for the first 10 rounds. Do you notice any trends or changes from the 2014 draft? Do these picks agree or disagree with the various public pre-draft player rankings? Which picks were designed to save money to help sign other picks? Identify those tough signs. Was the team actually able to sign them, and were the picks to save money still reasonably good picks? Do you best to identify which picks you thought were good and bad, write them down in a notebook with your reasoning, then check back in 6 months and a year. Was your reasoning correct? If not, what were your mistakes and how can you avoid making them in the future?