# The Angry Statistician

## Tuesday, February 9, 2016

### When is a Lead Safe in the NBA?

I'll present a simple derivation. Start by observing there are about 50 scoring groups per team per game (scoring groups include all baskets and free throws that occur at the same time), with each scoring group worth about two points. Assume scoring events by team are Poisson distributed with parameter \(\lambda = \frac{50\cdot t}{48\cdot 60} = \frac{t}{57.6}\). Using a normal approximation, the difference of these two distributions is normal with mean 0 and variance \(\sqrt{2}\lambda\), giving a standard deviation of \(0.1863\sqrt{t}\).

Using this approximation, what is a 90% safe lead? A 90% tail is 1.28 standard deviations, \(1.28\cdot 0.1863\sqrt{t} = 0.2385\sqrt{t}\) scoring groups. As a scoring group is about two points, this means a 90% safe lead, assuming a jump ball, is about \(0.477\sqrt{t}\) points (Clauset et. al. obtained \(0.4602\sqrt{t}\)). For example, a safe lead at halftime is approximately \(0.477 \sqrt{24\cdot 60} = 18.1\) points.

Next - adjustments for possession arrow and shot clock time; validity of approximation; adjusting for team strengths.

## Thursday, October 15, 2015

### An Enormous Number of Kilograms

I've always had a fascination with really large numbers. First 100 when I was really little, and as I got older and more sophisticated numbers like a googol and the smallest number that satisfies the conditions of the Archimedes cattle problem.

When I was an undergraduate I interviewed for a summer internship with an insurance company as an actuarial student. They gave me the following puzzle - what's the smallest number that when you move the last digit to the front it multiplies by 2? I calculated for a little while, then said "This can't be right, my answer has 18 digits!". It turns out that the smallest solution does, indeed, have 18 digits.

We can see this by letting our \((n+1)\)-digit number \( x = 10 m + a\), where \(m\) is an \(n\)-digit number and \(0\leq a < 10\). Moving \(a\) to the front we get \(y = 10^n a + m\), and our requirement is \(y = 2x\). This gives: \begin{align} 20 m + 2 &= 10^n a + m \\ 19 m &= a(10^n-2) \\ m &= \frac{2a(5\cdot 10^{n-1} - 1)}{19} \end{align} The smallest \(m\), if one exists, requires \(a,n\) such that 19 divides \(5\cdot 10^{n-1}-1\) (as 19 can't divide \(2 a\)) and the result has \(n\)-digits. It's easy to check that the smallest value of \(n\) that satisfies the first condition is \(n=17\). To get the smallest solution we try \(a=1\), but this yields a value with only 16 digits. Setting \(a=2\), however, yields the 17-digit \(m = 10526315789473684\). The smallest solution to our puzzle is therefore the 18-digit number \(105263157894736842\); that's surprisingly large.

Numbers with this type of property are known as parasitic numbers.

Later, I wondered if there were numbers with the slightly different, but equally interesting property, that moving the last digit to the front converted ("autoconverts") it from a value under one unit of measurement to an equivalent value under a different unit of measurement.

The first one I tried was moving the last digit to the front converts from Celsius to Fahrenheit. This is a fun puzzle that eventually made its way into the New York Times. The smallest such value is 275 C, which exactly equals 527 F. What's the next smallest temperature?

How about moving the first digit to the end? We'll need to use the little-known fact that, legally, a pound is exactly equal to 0.45359237 kilograms. Given this, does there exist a number such that moving the first digit to the end converts from pounds to kilograms? The answer is yes, but the smallest solution has 108,437,840 digits! The solution is similar to the above, but as it's computationally more involved I've written Sage code to solve it, which you can find in my GitHub puzzles repository.

The smallest number that autoconverts from gallons to liters, incidentally, is even bigger at 382,614,539 digits!

## Saturday, October 10, 2015

### Solving a Math Puzzle using Physics

The first two parts require students to interpret the meaning of the components of the formula \(T(x) = 5 \sqrt{36+x^2} + 4(20-x) \), and the final "challenge" component involves finding the minimum of \( T(x) \) over \( 0 \leq x \leq 20 \). Usually this would require a differentiation, but if you know Snell's law you can write down the solution almost immediately. People normally think of Snell's law in the context of light and optics, but it's really a statement about least time across media permitting different velocities.

One way to phrase Snell's law is that least travel time is achieved when \[ \frac{\sin{\theta_1}}{\sin{\theta_2}} = \frac{v_1}{v_2},\] where \( \theta_1, \theta_2\) are the angles to the normal and \(v_1, v_2\) are the travel velocities in the two media.

In our puzzle the crocodile has an implied travel velocity of 1/5 in the water and 1/4 on land. Furthermore, the crocodile travels along the riverbank once it hits land, so \( \theta_2 = 90^{\circ}\) and \(\sin{\theta_2} = 1\). Snell's law now says that the path of least time satisfies \[ \sin{\theta_1} = \frac{x}{\sqrt{36+x^2}} = \frac{4}{5},\] giving us \( 25x^2 = 16x^2 + 24^2\). Solving, \( 3^2 x^2 = 24^2, x^2 = 8^2\) and the solution is \(x = 8\).

## Sunday, October 4, 2015

### Mixed Models in R - Bigger, Faster, Stronger

Here's some R code for NCAA ice hockey power rankings using a nested Poisson model (which can be found in my hockey GitHub repository):

model <- gs ~ year+field+d_div+o_div+game_length+(1|offense)+(1|defense)+(1|game_id) fit <- glmer(model, data=g, verbose=TRUE, family=poisson(link=log) )The fixed effects are

**year**,

**field**(home/away/neutral),

**d_div**(NCAA division of the defense),

**o_div**(NCAA division of the offense) and

**game_length**(number of overtime periods);

**offense**(strength of offense),

**defense**(strength of defense) and

**game_id**are all random effects. The reason for modeling team offenses and defenses as random vs fixed effects is that I view them as random samples from the same distribution. As mentioned above, this results in statistical shrinkage or "regression to the mean" for these values, which is particularly useful for partially completed seasons.

One of the problems with large mixed models is that they can be very slow to fit. For example, the model above takes several hours on a 12-core workstation, which makes it very difficult to test model changes and tweaks. Is there any way to speed up the fitting process? Certainly! One way is to add two options to the above code:

fit <- glmer(model, data=g, verbose=TRUE, family=poisson(link=log), nAGQ=0, control=glmerControl(optimizer = "nloptwrap") )What do these do? Model fitting is an optimization process. Part of that process involves the estimation of particular integrals, which can be very slow; the option "nAGQ=0" tells glmer to ignore estimating those integrals. For some models this has minimal impact on parameter estimates, and this NCAA hockey model is one of those. The second option tells glmer to fit using the "nloptwrap" optimizer (there are several other optimizers available, too), which tends to be faster than the default optimization method.

The impact can be rather startling. With the default options the above model takes about 3 hours to fit. Add these two options, and the model fitting takes 30 seconds with minimal impact on the parameter estimates, or approximately 400 times faster.

## Thursday, October 1, 2015

### Elo's Rating System as a Forgetful Logistic Model

\begin{align}

\frac{\partial L}{\partial d} &= 2n \frac{e^{-d}}{1+e^{-d}} -n + 2m \frac{e^{-c-2d}}{1+e^{-c-2d}}-2(m-S_A) \\

&= -n\frac{1-e^{-d}}{1+e^{-d}} + 2 S_A - 2m\frac{1}{1+e^{-c-2d}} \\

&= -n\tanh(d/2) + 2 S_A - 2m\frac{1}{1+e^{-c-2d}}.

\end{align} What is \( m\frac{1}{1+e^{-c-2d}} \)? This is actually just \( {E_A}' \), the expected score for A when playing B for \(m\) games, but assuming the updated ratings for both players. Finally, setting \(\frac{\partial L}{\partial d} = 0\), we get \[ n\tanh(d/2) = 2(S_A - {E_A}')\] and hence \[ \tanh(d/2) = \frac{2}{n} (S_A - {E_A}').\] Assuming \(n\) is large relative to \(S_A - {E_A}'\), we have \( \tanh(d/2) \approx d/2\) and \( {E_A}' \approx E_A \). This is Elo's updating rule in the form \[ d = \frac{4}{n} (S_A - E_A ).\] If we rescale with the constant \( \sigma \), the updating rule becomes \[ d = \frac{4\sigma }{n} (S_A - E_A ).\] We also now see that the adjustment factor \( K = \frac{4\sigma }{n}\).

## Saturday, July 11, 2015

### Power Rankings: Looking at a Very Simple Method

Let me clarify what I mean when I use the term "power ranking". A power ranking supplies not only a ranking of teams, but also provides numbers that may be used to estimate the probabilities of various outcomes were two particular teams to play a match.

In the BTL power ranking system we assume the teams have some latent (hidden/unknown) "strength" \(R_i\), and that the probability of \(i\) beating \(j\) is \( \frac{R_i}{R_i+R_j} \). Note that each \(R_i\) is assumed to be strictly positive. Where does this model structure come from?

Here are three reasonable constraints for a power ranking model:

- If \(R_i\) and \(R_j\) have equal strength, the probability of one beating the other should be \( \frac{1}{2}\).
- As the strength of one team strictly approaches 0 (infinitely weak) with the other team fixed, the probability of the other team winning strictly increases to 1.
- As the strength of one team strictly approaches 1 (infinitely strong) with the other team fixed, the probability of the other team winning strictly decreases to 0.

\frac{\partial \log\left( \mathcal{L} \right)}{\partial r_i} &=& \sum_{w(m)=i} 1 - \frac{e^{r_{w(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_{l(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}}\\

&=& \sum_{w(m)=i} 1 - \frac{e^{r_i}}{e^{r_i}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_i}}{e^{r_{w(m)}}+e^{r_i}}\\

&=&0.

\end{eqnarray*} But this is just the number of actual wins minus the expected wins! Thus, the maximum likelihood estimators for the \( r_i \) satisfy \( O_i = E_i \) for all teams \( i \), where \( O_i \) is the actual (observed) number of wins for team \( i \), and \( E_i \) is the expected number of wins for team \( i \) based on our model. That's a nice property!

If you'd like to experiment with some actual data, and to see that the resulting fit does indeed satisfy this property, here's an example BTL model using NCAA men's ice hockey scores. You can, of course, actually use this property to iteratively solve for the MLE estimators \( R_i \). Note that you'll have to fix one of the \( R_i \) to be a particular value (or add some other constraint), as the model probabilities are invariant with respect to multiplication of the \( R_i \) by the same positive scalar.

### Getting Started Doing Baseball Analysis without Coding

Here are two example problems. You don't have to do these specifically, but they illustrate the general idea. Write up your solutions, then publish them for other people to make some (hopefully) helpful comments and suggestions. This can be on a blog or through a versioning control platform like GitHub (which is also great for versioning any code or data your use). Try to write well! A great argument, but poorly written and poorly presented isn't going to be very convincing. Once it's finished, review and revise, review and revise, review and revise. When a team you follow makes a move, treat it as a puzzle for you to solve. Why did they do it, and was it a good idea?

- Pick a recent baseball trade. For example, the Padres traded catcher Yasmani Grandal for Dodgers outfielder Matt Kemp. It's never that simple of course; the Padres aren't paying all of Matt Kemp's salary. Find out what the salary obligations were for each club in this trade. Using your favorite public projection system, where were the projected surplus values for each player at the time of the trade? Of course, there were other players involved in that trade, too. What were the expected surplus values of those players? From the perspective of surplus values, who won or lost this trade? Finally, why do you think each team made this trade, especially considering that they were division rivals? Do you think one or both teams made any mistakes in reasoning; if so, what were they, and did the other team take advantage of those mistakes?
- Pick any MLB team and review the draft picks they made in the 2015 draft for the first 10 rounds. Do you notice any trends or changes from the 2014 draft? Do these picks agree or disagree with the various public pre-draft player rankings? Which picks were designed to save money to help sign other picks? Identify those tough signs. Was the team actually able to sign them, and were the picks to save money still reasonably good picks? Do you best to identify which picks you thought were good and bad, write them down in a notebook with your reasoning, then check back in 6 months and a year. Was your reasoning correct? If not, what were your mistakes and how can you avoid making them in the future?