# The Angry Statistician

## Monday, May 2, 2016

### Behind the Speadsheet

It was really an impossible task. Professional sports at every level are filled with highly accomplished and competitive athletes, with real lives and real egos. Now imagine walking in one day and suddenly trying to convince them that they should be doing things differently. Who do you think you are?

I was one of the analysts who helped Ben and Sam in this quest, and I wanted to write some thoughts down from my own perspective, not as one of the main characters, but as someone more behind the scenes. These are some very short initial thoughts only, but I'd like to followup with some more ideas on where things went wrong from my perspective, and also how independent league teams can better identify roster talent from some non-traditional sources.

My focus was on attempting to identify talent overlooked in the MLB draft. This is extremely challenging; there are 30 teams, 40 standards rounds plus other picks. Furthermore, among those players left, many sign as amateur free agents post-draft. You're left with players from lower divisions, very small schools, 23-year-old seniors, bad bodies, soft tossers, poor defenders, etc. But, still, there may be players who aren't good MLB prospects, but who could still perform well as part of an independent league team.

Looking at top framing college catchers was a bust; this is a premium defensive position and very little is overlooked.

Among the undrafted senior hitters and pitchers there were several potential prospects, many of whom you'll read about in the book. The most important fact to keep in mind is that these are real people with real lives, real families and real hopes and dreams, and playing independent ball isn't nearly lucrative enough to pay the bills. Harsh reality will limit your pool even more, and those who choose to pursue it will face the additional stress of financial strain.

That being said, was Ben and Sam's experiment a success? You'll have to read the book, but absolutely, some talent was found.

## Tuesday, February 9, 2016

### When is a Lead Safe in the NBA?

I'll present a simple derivation. Start by observing there are about 50 scoring groups per team per game (scoring groups include all baskets and free throws that occur at the same time), with each scoring group worth about two points. Assume scoring events by team are Poisson distributed with parameter \(\lambda = \frac{50\cdot t}{48\cdot 60} = \frac{t}{57.6}\). Using a normal approximation, the difference of these two distributions is normal with mean 0 and variance \(\sqrt{2}\lambda\), giving a standard deviation of \(0.1863\sqrt{t}\).

Using this approximation, what is a 90% safe lead? A 90% tail is 1.28 standard deviations, \(1.28\cdot 0.1863\sqrt{t} = 0.2385\sqrt{t}\) scoring groups. As a scoring group is about two points, this means a 90% safe lead, assuming a jump ball, is about \(0.477\sqrt{t}\) points (Clauset et. al. obtained \(0.4602\sqrt{t}\)). For example, a safe lead at halftime is approximately \(0.477 \sqrt{24\cdot 60} = 18.1\) points.

Next - adjustments for possession arrow and shot clock time; validity of approximation; adjusting for team strengths.

## Thursday, October 15, 2015

### An Enormous Number of Kilograms

I've always had a fascination with really large numbers. First 100 when I was really little, and as I got older and more sophisticated numbers like a googol and the smallest number that satisfies the conditions of the Archimedes cattle problem.

When I was an undergraduate I interviewed for a summer internship with an insurance company as an actuarial student. They gave me the following puzzle - what's the smallest number that when you move the last digit to the front it multiplies by 2? I calculated for a little while, then said "This can't be right, my answer has 18 digits!". It turns out that the smallest solution does, indeed, have 18 digits.

We can see this by letting our \((n+1)\)-digit number \( x = 10 m + a\), where \(m\) is an \(n\)-digit number and \(0\leq a < 10\). Moving \(a\) to the front we get \(y = 10^n a + m\), and our requirement is \(y = 2x\). This gives: \begin{align} 20 m + 2 &= 10^n a + m \\ 19 m &= a(10^n-2) \\ m &= \frac{2a(5\cdot 10^{n-1} - 1)}{19} \end{align} The smallest \(m\), if one exists, requires \(a,n\) such that 19 divides \(5\cdot 10^{n-1}-1\) (as 19 can't divide \(2 a\)) and the result has \(n\)-digits. It's easy to check that the smallest value of \(n\) that satisfies the first condition is \(n=17\). To get the smallest solution we try \(a=1\), but this yields a value with only 16 digits. Setting \(a=2\), however, yields the 17-digit \(m = 10526315789473684\). The smallest solution to our puzzle is therefore the 18-digit number \(105263157894736842\); that's surprisingly large.

Numbers with this type of property are known as parasitic numbers.

Later, I wondered if there were numbers with the slightly different, but equally interesting property, that moving the last digit to the front converted ("autoconverts") it from a value under one unit of measurement to an equivalent value under a different unit of measurement.

The first one I tried was moving the last digit to the front converts from Celsius to Fahrenheit. This is a fun puzzle that eventually made its way into the New York Times. The smallest such value is 275 C, which exactly equals 527 F. What's the next smallest temperature?

How about moving the first digit to the end? We'll need to use the little-known fact that, legally, a pound is exactly equal to 0.45359237 kilograms. Given this, does there exist a number such that moving the first digit to the end converts from pounds to kilograms? The answer is yes, but the smallest solution has 108,437,840 digits! The solution is similar to the above, but as it's computationally more involved I've written Sage code to solve it, which you can find in my GitHub puzzles repository.

The smallest number that autoconverts from gallons to liters, incidentally, is even bigger at 382,614,539 digits!

## Saturday, October 10, 2015

### Solving a Math Puzzle using Physics

The first two parts require students to interpret the meaning of the components of the formula \(T(x) = 5 \sqrt{36+x^2} + 4(20-x) \), and the final "challenge" component involves finding the minimum of \( T(x) \) over \( 0 \leq x \leq 20 \). Usually this would require a differentiation, but if you know Snell's law you can write down the solution almost immediately. People normally think of Snell's law in the context of light and optics, but it's really a statement about least time across media permitting different velocities.

One way to phrase Snell's law is that least travel time is achieved when \[ \frac{\sin{\theta_1}}{\sin{\theta_2}} = \frac{v_1}{v_2},\] where \( \theta_1, \theta_2\) are the angles to the normal and \(v_1, v_2\) are the travel velocities in the two media.

In our puzzle the crocodile has an implied travel velocity of 1/5 in the water and 1/4 on land. Furthermore, the crocodile travels along the riverbank once it hits land, so \( \theta_2 = 90^{\circ}\) and \(\sin{\theta_2} = 1\). Snell's law now says that the path of least time satisfies \[ \sin{\theta_1} = \frac{x}{\sqrt{36+x^2}} = \frac{4}{5},\] giving us \( 25x^2 = 16x^2 + 24^2\). Solving, \( 3^2 x^2 = 24^2, x^2 = 8^2\) and the solution is \(x = 8\).

## Sunday, October 4, 2015

### Mixed Models in R - Bigger, Faster, Stronger

Here's some R code for NCAA ice hockey power rankings using a nested Poisson model (which can be found in my hockey GitHub repository):

model <- gs ~ year+field+d_div+o_div+game_length+(1|offense)+(1|defense)+(1|game_id) fit <- glmer(model, data=g, verbose=TRUE, family=poisson(link=log) )The fixed effects are

**year**,

**field**(home/away/neutral),

**d_div**(NCAA division of the defense),

**o_div**(NCAA division of the offense) and

**game_length**(number of overtime periods);

**offense**(strength of offense),

**defense**(strength of defense) and

**game_id**are all random effects. The reason for modeling team offenses and defenses as random vs fixed effects is that I view them as random samples from the same distribution. As mentioned above, this results in statistical shrinkage or "regression to the mean" for these values, which is particularly useful for partially completed seasons.

One of the problems with large mixed models is that they can be very slow to fit. For example, the model above takes several hours on a 12-core workstation, which makes it very difficult to test model changes and tweaks. Is there any way to speed up the fitting process? Certainly! One way is to add two options to the above code:

fit <- glmer(model, data=g, verbose=TRUE, family=poisson(link=log), nAGQ=0, control=glmerControl(optimizer = "nloptwrap") )What do these do? Model fitting is an optimization process. Part of that process involves the estimation of particular integrals, which can be very slow; the option "nAGQ=0" tells glmer to ignore estimating those integrals. For some models this has minimal impact on parameter estimates, and this NCAA hockey model is one of those. The second option tells glmer to fit using the "nloptwrap" optimizer (there are several other optimizers available, too), which tends to be faster than the default optimization method.

The impact can be rather startling. With the default options the above model takes about 3 hours to fit. Add these two options, and the model fitting takes 30 seconds with minimal impact on the parameter estimates, or approximately 400 times faster.

## Thursday, October 1, 2015

### Elo's Rating System as a Forgetful Logistic Model

\begin{align}

\frac{\partial L}{\partial d} &= 2n \frac{e^{-d}}{1+e^{-d}} -n + 2m \frac{e^{-c-2d}}{1+e^{-c-2d}}-2(m-S_A) \\

&= -n\frac{1-e^{-d}}{1+e^{-d}} + 2 S_A - 2m\frac{1}{1+e^{-c-2d}} \\

&= -n\tanh(d/2) + 2 S_A - 2m\frac{1}{1+e^{-c-2d}}.

\end{align} What is \( m\frac{1}{1+e^{-c-2d}} \)? This is actually just \( {E_A}' \), the expected score for A when playing B for \(m\) games, but assuming the updated ratings for both players. Finally, setting \(\frac{\partial L}{\partial d} = 0\), we get \[ n\tanh(d/2) = 2(S_A - {E_A}')\] and hence \[ \tanh(d/2) = \frac{2}{n} (S_A - {E_A}').\] Assuming \(n\) is large relative to \(S_A - {E_A}'\), we have \( \tanh(d/2) \approx d/2\) and \( {E_A}' \approx E_A \). This is Elo's updating rule in the form \[ d = \frac{4}{n} (S_A - E_A ).\] If we rescale with the constant \( \sigma \), the updating rule becomes \[ d = \frac{4\sigma }{n} (S_A - E_A ).\] We also now see that the adjustment factor \( K = \frac{4\sigma }{n}\).

## Saturday, July 11, 2015

### Power Rankings: Looking at a Very Simple Method

Let me clarify what I mean when I use the term "power ranking". A power ranking supplies not only a ranking of teams, but also provides numbers that may be used to estimate the probabilities of various outcomes were two particular teams to play a match.

In the BTL power ranking system we assume the teams have some latent (hidden/unknown) "strength" \(R_i\), and that the probability of \(i\) beating \(j\) is \( \frac{R_i}{R_i+R_j} \). Note that each \(R_i\) is assumed to be strictly positive. Where does this model structure come from?

Here are three reasonable constraints for a power ranking model:

- If \(R_i\) and \(R_j\) have equal strength, the probability of one beating the other should be \( \frac{1}{2}\).
- As the strength of one team strictly approaches 0 (infinitely weak) with the other team fixed, the probability of the other team winning strictly increases to 1.
- As the strength of one team strictly approaches 1 (infinitely strong) with the other team fixed, the probability of the other team winning strictly decreases to 0.

\frac{\partial \log\left( \mathcal{L} \right)}{\partial r_i} &=& \sum_{w(m)=i} 1 - \frac{e^{r_{w(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_{l(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}}\\

&=& \sum_{w(m)=i} 1 - \frac{e^{r_i}}{e^{r_i}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_i}}{e^{r_{w(m)}}+e^{r_i}}\\

&=&0.

\end{eqnarray*} But this is just the number of actual wins minus the expected wins! Thus, the maximum likelihood estimators for the \( r_i \) satisfy \( O_i = E_i \) for all teams \( i \), where \( O_i \) is the actual (observed) number of wins for team \( i \), and \( E_i \) is the expected number of wins for team \( i \) based on our model. That's a nice property!

If you'd like to experiment with some actual data, and to see that the resulting fit does indeed satisfy this property, here's an example BTL model using NCAA men's ice hockey scores. You can, of course, actually use this property to iteratively solve for the MLE estimators \( R_i \). Note that you'll have to fix one of the \( R_i \) to be a particular value (or add some other constraint), as the model probabilities are invariant with respect to multiplication of the \( R_i \) by the same positive scalar.