The Angry Statistician

Probability and Cumulative Dice Sums

2021-09-10T23:53:00.003-07:00

Let a die be labeled with increasing positive integers $a_1,\ldots , a_n$, and let the probability of getting $a_i$ be $p_i>0$. We start at 0 and roll the die, adding whatever number we get to the current total. If ${\rm Pr}(N)$ is the probability that at some point we achieve the sum $N$, then $\lim_{N \to \infty} {\rm Pr}(N)$ exists and equals $1/\rm{E}(X)$ iff $(a_1, \ldots, a_n) = 1$.

The direction $\implies$ is obvious. Now, if the limit exists it must equal $1/{\rm E}(X)$ by Chebyshev's inequality, so we only need to show that the limit exists assuming that $(a_1, \ldots, a_n) = 1$.

We have the recursive relationship \[{\rm Pr}(N) = p_1 {\rm Pr}(N-a_1) + \ldots + p_n {\rm Pr}(N-a_n);\] the characteristic polynomial is therefore \[f(x) = x^{a_n} - \left(p_1 x^{(a_n-a_1)} + \ldots + p_n\right).\]

This clearly has the root $x=1$. Next note \[ f'(1) = a_n - \sum_{i=1}^{n} p_i a_n + \sum_{i=1}^{n} p_i a_i = \rm{E}(X) > 0 ,\] hence this root is also unique.

I'll now show that all other roots have absolute value $< 1$.

We have \[ x^{a_n} - \left( p_1 x^{(a_n-a_1)} + \ldots + p_n \right) = 0 \iff x^{a_1} = p_1 + \ldots + \frac{p_n}{x^{(a_n-a_1)}} .\]

If $|x|>1$ then \[ |x^{a_1}| = \left|p_1 + \ldots + \frac{p_n}{x^{(a_n-a_1)}}\right| \leq |p_1| + \ldots + \left|\frac{p_n}{x^{(a_n-a_1)}}\right| \leq p_1 + \ldots + p_n = 1;\] contradiction.

If $|x|=1$ \[ |x^{a_1}| = \left|p_1 + \ldots + \frac{p_n}{x^{(a_n-a_1)}}\right| < 1\] unless $ x^{(a_1 - a_i)} = 1 $ for all $ i \implies x^d = 1 $ where $ d = (a_1 - a_2, \ldots, a_1 - a_n) $.

Assume $x^d = 1$; then \[ x^{a_1} = p_1 + \ldots + \frac{p_n}{x^{(a_n-a_1)}} = p_1 + \ldots + p_n = 1 \implies x^{a_1} = 1 \implies x^e = 1\] where $ e = (d, a_1) = (a_1, \ldots, a_n) = 1$ by assumption, hence $ x^1 = 1 \implies x = 1 $ is the only root of $ f(x) $ such that $ |x| \geq 1 \implies \lim_{N \to \infty} {\rm Pr}(N) $ exists.

Gambling to Optimize Expected Median Bankroll

2019-09-04T02:14:00.002-07:00

Gambling to optimize your expected bankroll mean is extremely risky, as you wager your entire bankroll for any favorable gamble, making ruin almost inevitable. But what if, instead, we gambled not to maximize the expected bankroll mean, but the expected bankroll median?

Let the probability of winning a favorable bet be $p$, and the net odds be $b$. That is, if we wager $1$ unit and win, we get back $b$ units (in addition to our wager). Assume our betting strategy is to wager some fraction $f$ of our bankroll, hence $0 \leq f \leq 1$. By our assumption, our betting strategy is invariant with respect to the actual size of our bankroll, and so if we were to repeat this gamble $n$ times with the same $p$ and $b$, the strategy wouldn't change. It follows we may assume an initial bankroll of size $1$.

Let $ q = 1-p $. Now, after $n$ such gambles our bankroll would have a binomial distribution with probability mass function \[ \Pr(k,n,p) = \binom{n}{k} p^k q^{n-k}, \] where $k$ is the number of wins, $n-k$ the number of losses. Note the median occurs at $ k=n p $, corresponding to a bankroll of \[ \left(1+f\cdot b\right)^{n p} \left(1-f\right)^{n q} .\] Now, maximizing this value is equivalent to maximizing its $\log$, which is \[ n p \log\left(1+f\cdot b\right) + n q\log\left(1-f\right) .\] But this is maximized when \[ p \log\left(1+f\cdot b\right) + q\log\left(1-f\right)\] is maximized, and this is precisely the condition for a Kelly optimal bet! It follows that if we gamble to optimize our expected median, this is equivalent to Kelly optimal betting, and hence maximizing expected log wealth.

With a little more work, we can show that the same conclusion holds if we gamble to optimize any expected quantile $x$, with $ 0 < x < 1$. Maximizing the expected quantile $ 0 $ corresponds to "riskless" gambling, i.e. only gambling when there's no chance of a loss. Maximizing the expected quantile $ 1 $ corresponds to maximizing the expected bankroll mean, which we can refer to as the "reckless" strategy. Thus, under our assumptions, there are only three quantile maximization strategies - riskless, Kelly and reckless.

An Island of Liars is an Ensemble of Experts

2018-04-08T16:59:00.000-07:00

In my previous post I looked at how a group of experts may be combined into a single, more powerful, classifier which I call NaiveBoost after the related AdaBoost. I'll illustrate how it can be used with a few examples.

As before, we're faced with making a binary decision, which we can view as an unknown label $ L \in \{ +1, -1 \}$. Furthermore, the prior distribution on $ L $ is assumed to be uniform. Let our experts' independent probabilities be $ p_1 = 0.8, p_2 = 0.7, p_3 = 0.6$ and $p_4 = 0.5$. Our combined NaiveBoost classifier is \[ C(S) = \sum_i \frac{L_i}{2}\log{\left( \frac{p_i}{1-p_i}\right)},\] where $ S = \{ L_i \} $. A few things to note are that $ \log{\left( \frac{p_i}{1-p_i}\right)} $ is $ {\rm logit}( p_i )$, and an expert with $ p = 0.5 $ contributes 0 to our classifier. This latter observation is what we'd expect, as $ p = 0.5 $ is random guessing. Also, experts with probabilities $ p_i $ and $ p_j $ such that $ p_i = 1 - p_j $ are equivalent if we replace the stated label $ L_j $ with $ -L_j $.

Ignoring the last (uninformative) expert, we end up with the combined classifier \[ C(S) = \frac{L_1}{2} \log\left(4\right) + \frac{L_1}{2} \log\left(\frac{7}{3}\right) + \frac{L_3}{2} \log\left( \frac{3}{2} \right).\] If the overall value is positive, the classifier's label is $ L = +1$; if it's negative, the classifier's label is $ L = -1 $. Note the base of the logarithm doesn't matter and we could also ignore the factor of $ \frac{1}{2} $, as these don't change the sign of $ C(S) $. However, the factor of $ \frac{1}{2} $ must be left in if we want the ability to properly recover the actual combined probability via normalization.

Now say $ L_1 = -1, L_2 = +1, L_3 = +1 $. What's our decision? Doing the math, we get $ C(S) = \frac{1}{2} \log{ \left( \frac{7}{8} \right) } $, and as $ 7 < 8 $, $ C(S) < 0 $ and our combined classifier says $ L = -1$. If we wanted to recover the probability, note \[ \exp\left( \frac{1}{2} \log \left( \frac{7}{8} \right) \right) = {\left( \frac{7}{8} \right)}^{1/2},\] hence our classifier states \[ {\rm Pr}(L = +1 | S ) = \frac{ {\left( \frac{7}{8} \right)}^{1/2} }{ {\left( \frac{7}{8} \right)}^{1/2} + {\left( \frac{7}{8} \right)}^{-1/2} } = \frac{ \frac{7}{8} }{ \frac{7}{8} + 1 } = \frac{7}{15}, \] and of course $ {\rm Pr}(L = -1 | S ) = \frac{8}{15}$.

As a second example, consider the @CutTheKnotMath puzzle of two liars on an island. Here we have A and B, each of which lies with probability 2/3 and tells the truth with probability 1/3. A makes a statement and B confirms that it's true. What's the probability that A's statement is actually truthful? We can solve this in a complicated way by observing that this is equivalent to an ensemble of experts, where $ L \in \{ +1, -1 \} $, the prior on $ L $ is uniform and $ L_1 = L_2 = +1$. The probability that $ L = +1 $ is precisely the probability that A is telling the truth.

Following the first example, \[ L_{+1} = \frac{L_1}{2}\log\left( \frac{1}{2} \right) + \frac{L_2}{2}\log\left( \frac{1}{2} \right) = \log\left( \frac{1}{2} \right).\] Continuing as before, we get \[ {\rm Pr}(L = +1 | S ) = \frac{ \frac{1}{2} }{ \frac{1}{2} + 2 } = \frac{1}{5}.\]

Combining Expert Opinions: NaiveBoost

2018-04-07T22:18:00.000-07:00

In many situations we're faced with multiple expert opinions. How should we combine them together into one opinion, hopefully better than any single opinion? I'll demonstrate the derivation of a classifier I'll call NaiveBoost.

We'll start with a simple situation, and later gradually introduce more complexity. Let each expert state a yes or no opinion in response to a yes/no question (binary classifiers), each expert be independent of the other experts and assume expert $i$ is correct with probability $p_i$. We'll also assume that the prior distribution on whether the correct answer is yes or no to be uniform, i.e. each occurs with probability 0.5.

Label a "yes" as +1, and "no" as -1. We ask our question, which has some unknown +1/-1 answer $L$, and get back a set of responses (labels) $S = \{L_i \}$, where $L_i$ is the response from expert $i$. Observe we have \[ \Pr(S | L=+1) = \prod_{i} {p_i}^{\frac{L_i+1}{2}} \cdot {(1-p_i)}^\frac{-L_i+1}{2}\] and also \[ \Pr(S | L=-1) = \prod_{i} {(1-p_i)}^{\frac{L_i+1}{2}} \cdot {p_i}^\frac{-L_i+1}{2}. \] As $ \Pr(L=+1 | S) = \frac{\Pr(S | L=+1)\cdot \Pr(L=+1)}{\Pr(S)}$ and $ \Pr(L=-1 | S) = \frac{\Pr(S | L=-1)\cdot \Pr(L=-1)}{\Pr(S)}$, and given our assumption that $ \Pr(L=+1) = \Pr(L=-1) $, we need only compute $ \Pr(S | L=+1) $, $ \Pr(S | L=-1) $ and normalize.

We'll now take logs and derive a form similar to AdaBoost. Note for $ L_{+1} = \log\left( \Pr(S | L=+1) \right) $ this gives us \[ L_{+1} = \sum_i \frac{L_i+1}{2}\log{(p_i)} + \frac{-L_i+1}{2}\log{(1-p_i)}.\] Rearranging, we get \[ L_{+1} = \sum_i \frac{L_i}{2}\log{\left( \frac{p_i}{1-p_i}\right)} + \frac{1}{2}\log{\left( p_i(1-p_i)\right)}.\] Similarly, for $ L_{-1} = \log\left( \Pr(S | L=-1) \right) $ we get \[ L_{-1} = \sum_i -\frac{L_i}{2}\log{\left( \frac{p_i}{1-p_i}\right)} + \frac{1}{2}\log{\left( p_i(1-p_i)\right)}.\] Note that each of these includes the same terms $ \sum_i \frac{1}{2}\log{\left( p_i(1-p_i)\right)}$. Upon exponentiation these would multiply $ \Pr(S | L=+1) $ and $ \Pr(S | L=-1) $ by the same factor, so we can ignore them as to recover the probabilities we'll need to normalize anyway. Thus we end up with a linear classifier with the AdaBoost form \[ C(S) = \sum_i \frac{L_i}{2}\log{\left( \frac{p_i}{1-p_i}\right)}. \] If $ C(S) $ is positive, the classifier's label is +1; if $ C(S) $ is negative, the classifier's label is -1. Furthermore, we may recover the classifier's probabilities by exponentiating and normalizing.

Simplified Multinomial Kelly

2018-04-01T00:14:00.001-07:00

Here's a simplified version for optimal Kelly bets when you have multiple outcomes (e.g. horse races).

The Smoczynski & Tomkins algorithm, which is explained here (or in the original paper):

https://en.wikipedia.org/wiki/Kelly_criterion#Multiple_horses

Let's say there's a wager that, for every $1 you bet, will return a profit of $b if you win. Let the probability of winning be $p$, and losing be $q=1-p$.

The original Kelly criterion says to wager only if $b\cdot p-q > 0$ (the expected value is positive), and in this case to wager a fraction $ \frac{b\cdot p-q}{b} $ of your bankroll.

But in a horse race, how do you decide which set of outcomes are favorable to bet on? It's tricky, because these wagers are mutually exclusive i.e. you can win at most one.

It turns out there's a simple and intuitive method to find which bets are favorable:

1) Look at $ b\cdot p-q$ for every horse.
2) Pick any horse for which $ b\cdot p-q > 0$ and mark "bet".
3) Adjust the probabilities for the remaining horses by dividing all win probabilities by $ \frac{1}{1-p} $ so they add up to 1 again ("renormalize").
4) Repeat!

That's it.

This should be substantially easier to understand than the exposition in Smoczynski & Tomkins.

The intuitive reasoning for why this should work is that you only need betting on a horse to be conditionally favorable assuming the other horses you've bet on don't win. That is, it must be a positive hedge.

Notes on Setting up a Titan V under Ubuntu 17.04

2017-12-24T00:27:00.001-08:00

I recently purchased a Titan V GPU to use for machine and deep learning, and in the process of installing the latest Nvidia driver's hosed my Ubuntu 16.04 install. I was overdue for a fresh install of Linux, anyway, so I decided to upgrade some of my drives at the same time. Here are some of my notes for the process I went through to get the Titan V working perfectly with TensorFlow 1.5 under Ubuntu 17.04.

Old install:
Ubuntu 16.04
EVGA GeForce GTX Titan SuperClocked 6GB
2TB Seagate NAS HDD
+ additional drives

New install:
Ubuntu 17.04
Titan V 12GB
/ partition on a 250GB Samsung 840 Pro SSD (had an extra around)
/home partition on a new 1TB Crucial MX500 SSD
New WD Blue 4TB HDD
+ additional drives

You'll need to install Linux in legacy mode, not UEFI, in order to use Nvidia's proprietary drivers for the Titan V. Note that Linux will cheerfully boot in UEFI mode, but will not load any proprietary drivers (including Nvidia's). You'll need proprietary drivers for TensorFlow.

You may also need to disable fast boot.

Keep a wired mouse handy, as your wireless mouse may decide to stop working until Linux is installed and updated. This occurred with my Logitech MX Master.

Create an Ubuntu 17.04 live install USB - https://help.ubuntu.com/community/Installation/FromUSBStick

Boot from your live Ubuntu USB with the BIOS in legacy mode.

I selected the Samsung 850 Pro as my / and the Crucial MX500 as /home. You'll need to "create" and "add" if they're unformatted.

Allow 3rd party/proprietary drivers.

Install!

Reboot, login to default Ubuntu (logging in to Unity may hang; it did on my system). The Titan V is not configured yet.

Update Ubuntu:
sudo apt-get update
sudo apt-get upgrade

I recommend installing the latest kernel.
apt-get install linux-generic linux-headers-generic linux-image-generic

I recommend installing the KDE/Plasma desktop, as I could not get the Unity desktop to work.
apt-get install plasma-desktop dolphin konsole

Download and install Nvidia's 387.34_1.0-1 driver - http://www.nvidia.com/download/driverResults.aspx/128019/en-us
sudo dpkg -i nvidia-driver-local-repo-ubuntu1704-387.34_1.0-1_amd64.deb

The Titan V is still not configured, but should be after the next step.

Download and install CUDA 9.0 - https://developer.nvidia.com/cuda-90-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1704&target_type=deblocal
sudo dpkg -i uda-repo-ubuntu1704-9-0-local_9.0.176-1_amd64.deb
apt-get install cuda-9-0

Add the following two lines to your .bash_profile:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda

Reboot; the Titan V should be working and configured.

I recommend installing git.

apt-get install git

You'll need to install pip for Python for TensorFlow.

apt-get install python-pip (for Python 2)
apt-get install python3-pip (for Python 3)

We'll need cuDNN for TensorFlow.

Download cuDNN v7.0.5 (Dec 5, 2017), for CUDA 9.0

You'll need to create a (free) Nvidia developer account.

https://developer.nvidia.com/rdp/cudnn-download

This is how I installed cuDNN.
tar xzf cudnn-9.0-linux-x64-v7.tgz
cd cuda
sudo cp NVIDIA_SLA_cuDNN_Support.txt /usr/local/cuda-9.0
sudo cp include/cudnn.h /usr/local/cuda-9.0/targets/x86_64-linux/include
sudo cp lib64/libcudnn_static.a /usr/local/cuda-9.0/targets/x86_64-linux/lib
sudo cp lib64/libcudnn.so.7.0.5 /usr/local/cuda-9.0/targets/x86_64-linux/lib
cd /usr/local/cuda-9.0/targets/x86_64-linux/lib
sudo ln -s libcudnn.so.7.0.5 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
cd

Let's install TensorFlow! We'll want the nightly.

sudo -H pip install tf-nightly-gpu (Python 2)
sudo -H pip3 install tf-nightly-gpu (Python 3)

TensorFlow should now be working!

$ python (or python3)
...
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> print(sess.run(a + b))
42
>>>

See also - https://www.tensorflow.org/versions/r0.12/get_started/os_setup

If you also want to install the latest Julia.
git clone https://github.com/JuliaLang/julia
cd julia
sudo apt-get install m4 cmake gfortran clang libopenblas-base libopenblas-dev
make -j 4

Enjoy!

Solving IMO 1989 #6 using Probability and Expectation

2017-11-26T18:18:00.000-08:00

IMO 1989 #6: A permutation $\{x_1, x_2, \ldots , x_m\}$ of the set $\{1, 2, \ldots , 2n\}$, where $n$ is a positive integer, is said to have property $P$ if $ | x_i - x_{i+1} | = n$ for at least one $i$ in $\{1, 2, ... , 2n-1\}$. Show that for each $n$ there are more permutations with property $P$ than without.

Solution: We first observe that the expected number of pairs $\{i, i+1\}$ for which $ | x_i - x_{i+1} | = n$ is $E = 1$. To see this note if $j$, $ 1 \leq j \leq n$, appears in position $1$ or $2n$ it's adjacent to one number, otherwise two. Thus the probability it's adjacent to its partner $j+n$ in a random permutation is \[\begin{equation}
\eqalign{
e_j &= \frac{2}{2n}\cdot \frac{1}{2n-1} + \frac{2n-2}{2n}\cdot \frac{2}{2n-1} \\
&= \frac{2(2n-1)}{2n(2n-1)} \\
&= \frac{1}{n}.
}
\end{equation}\] By linearity of expectation we overall have the expected number of $j$ adjacent to its partner $j+n$ is $\sum_{j=1}^{n} e_j = n\cdot\frac{1}{n} = 1$.

More is true. By the same argument, if we remove any partner pair $\{j,j+n\}$, the expected number of partner pairs in a random permutation of the remaining integers is still 1. This is the critical observation.

Conditional on the partner pair $\{j,j+n\}$ appearing in a random permutation, what is the expected number of partner pairs $e$? Observe that if $n>1$ it must be less than 2, since as before the expected number of partner pairs ignoring $\{j,j+n\}$ is 1, and the probability the $\{j,j+n\}$ pair where they appear has separated another partner pair is greater than 0.

Putting this together, if $n=1$ the property $P$ obviously holds. For $n>1$, note the expected number of partner pairs $E = p\cdot e$, where $p$ is the probability that a random permutation has property $P$ and $e$ is as before. But we already know $E=1$, and by the previous argument if $n>1$ we have $e<2$, hence $ 1 = p\cdot e < 2p $ and we conclude $ p > \frac{1}{2}$.

Poisson Games and Sudden-Death Overtime

2017-05-13T20:02:00.002-07:00

Let's say we have a game that can be reasonably modeled as two independent Poisson processes with team $i$ having parameter $\lambda_i$. If one team wins in regulation with team $i$ scoring $n_i$, then it's well-known we have the MLE estimate $\hat{\lambda_i}=n_i$. But what if the game ends in a tie in regulation with each team scoring $n$ goals and we have sudden-death overtime? How does this affect the MLE estimate for the winning and losing teams?

Assuming without loss of generality that team $1$ is the winner in sudden-death overtime. As we have two independent Poisson processes, the probability of this occurring is $\frac{\lambda_1}{\lambda_1 + \lambda_2}$. Thus, the overall likelihood we'd like to maximize is \[L = e^{-\lambda_1} \frac{{\lambda_1}^n}{n!} e^{-\lambda_2} \frac{{\lambda_2}^n}{n!} \frac{\lambda_1}{\lambda_1 + \lambda_2}.\] Letting $l = \log{L}$ we get \[l = -{\lambda_1} + n \log{\lambda_1} - {\lambda_2} + n \log{\lambda_2} - 2 \log{n!} + \log{\lambda_1}-\log({\lambda_1 + \lambda_2}).\] This gives \[\begin{equation}
\eqalign{
\frac{\partial l}{\partial \lambda_1} &= -1+\frac{n}{\lambda_1}+\frac{1}{\lambda_1}+\frac{1}{\lambda_1 + \lambda_2}\\
\frac{\partial l}{\partial \lambda_2} &= -1+\frac{n}{\lambda_2}+\frac{1}{\lambda_1 + \lambda_2}.
}
\end{equation}\] Setting both partials equal to $0$ and solving, we get \[\begin{equation}
\eqalign{
(n-\hat{\lambda_1})(\hat{\lambda_1}+\hat{\lambda_2})+\hat{\lambda_2} &= 0\\
(n-\hat{\lambda_2})(\hat{\lambda_1}+\hat{\lambda_2})-\hat{\lambda_2} &= 0,
}
\end{equation}\] and so \[\begin{equation}
\eqalign{
\hat{\lambda_1} &= (n+1) \frac{2n}{2n+1}\\
\hat{\lambda_2} &= n \frac{2n}{2n+1}.
}
\end{equation}\] For example, if both teams score $3$ goals in regulation and team $1$ wins in sudden-death overtime, our MLE estimates are $\hat{\lambda_1} = 3\frac{3}{7}, \hat{\lambda_2} = 2\frac{4}{7}$.

Intuitively this makes sense, because $2n$ goals were scored in regulation time, hence we "expect" that the overtime goal occurred around a fraction $\frac{1}{2n}$ of regulation, so team $1$ scored $n+1$ goals in about $\frac{2n+1}{2n}$ regulation periods and team $2$ scored $n$ goals in about $\frac{2n+1}{2n}$ regulation periods. The standard Poisson process MLE estimates here coincide with the estimates we derived above.

Does this work in practice? Yes! I tested it on my NCAA men's lacrosse model, and it increased the out-of-sample testing accuracy by 0.5%. Surprisingly large for such a small change!

Why does Kaggle use Log-loss?

2017-04-01T21:59:00.000-07:00

If you're not familiar with Kaggle, it's an organization dedicated to data science competitions to both provide ways for companies to potentially do analytics at less cost, as well as to identify talented data scientists.

Competitions are scored using a variety of functions, and the most common for binary classification tasks with confidence is something called log-loss, which is essentially $\sum_{i=1}^{n} \log(p_i)$, where $p_i$ is your model's claimed confidence for test data point $i$'s correct label. Why does Kaggle use this scoring function? Here I'll follow Terry Tao's argument.

Ideally what we'd like is a scoring function $f(x)$ that yields the maximum expected score precisely when the claimed confidence $x_i$ in the correct label for $i$ is actually what the submitter believes is the true probability (or frequency) of that outcome. This means that we want \[L(x)=p\cdot f(x) + (1-p)\cdot f(1-x)\] for fixed $p$ to be maximized when $x=p$. Differentiating, this means \[L'(x) = p\cdot f'(x) - (1-p)\cdot f'(1-x) = 0\] when $x=p$, hence $p\cdot f'(p) = (1-p)\cdot f'(1-p)$ for all $p$. This will be satisfied by any admissible $f(x)$ with $x\cdot f'(x)$ symmetric around $x=\frac{1}{2}$, but if we extend our analysis to multinomial outcomes we get the stronger conclusion that in fact $x\cdot f'(x) = c_0$ for some constant $c_0$. This in turn implies $f(x)=c_0\cdot \log(x)+c_1$. If we want $f(1/2)=0$ and $f(1)=1$, we end up with $f(x)={\log}_2(2x)$ and the expected score is \[L(x)=x\cdot {\log}_2(2x) + (1-x)\cdot {\log}_2(2(1-x)).\]

The Kelly Criterion and a Sure Thing

2017-03-31T19:43:00.000-07:00

The Kelly Criterion is an alternative to standard utility theory, which seeks to maximize expected utility. Instead, the Kelly Criterion seeks to maximize expected growth. That is, if we start out with an initial bankroll $B_0$, we seek to maximize $\mathrm{E}[g(t)]$, where $B_t = B_0\cdot e^{g(t)}$.
As a simple example, consider the following choice. We can have a sure $3000, or we can take the gamble of a $\frac{4}{5}$ chance of $4000 and a $\frac{1}{5}$ chance of $0. What does Kelly say?
Assume we have a current bankroll of $B_0$. After the first choice we have $B_1 = B_0+3000$, which we can write as \[\mathrm{E}[g(1)] = \log\left(\frac{B_0+3000}{B_0}\right);\]for the second choice we have \[\mathrm{E}[g(1)] = \frac{4}{5} \log\left(\frac{B_0+4000}{B_0}\right).\]And so we want to compare $\log\left(\frac{B_0+3000}{B_0}\right)$ and $\frac{4}{5} \log\left(\frac{B_0+4000}{B_0}\right)$.
Exponentiating, we're looking for the positive root of \[{\left({B_0+3000}\right)}^5 - {B_0}\cdot {\left({B_0+4000}\right)}^4=0.\]Wolfram Alpha now tells us that we should go with the sure thing if $B_0 < $4942.92$, and take the gamble otherwise.

Prime Divisors of $3^{32}-2^{32}$

2017-03-31T18:33:00.000-07:00

Find four prime divisors < 100 for $3^{32}-2^{32}$.
Source: British Math Olympiad, 2006.

This factors nicely as $3^{32}-2^{32} = \left(3^{16}+2^{16}\right)\left(3^{16}-2^{16}\right)$, and we can continue factoring in this way to get \[3^{32}-2^{32} = \left(3^{16}+2^{16}\right)\left(3^8+2^8\right)\left(3^4+2^4\right)\left(3^2+2^2\right)\left(3^2-2^2\right).\]The final three terms are $5, 13, 97$, so we have three of the four required primes. For another prime divisor, consider $3^{16}-2^{16}$. By Fermat's Little Theorem $a^{16}-1\equiv 0 \bmod 17$ for all $a$ with $(a,17)=1$, and so it follows that $3^{16}-2^{16}\equiv 0 \bmod 17$, and we therefore have $17$ as a fourth such prime divisor.
Alternatively, note $ \left(\dfrac{3}{17}\right)=-1, \left(\dfrac{2}{17}\right)=1$, hence by Euler's Criterion $3^8\equiv -1 \bmod 17$ and $2^8\equiv 1 \bmod 17$, giving $3^8+2^8\equiv 0\bmod 17$.

Highest Powers of 3 and $\left(1+\sqrt{2}\right)^n$

2017-03-30T17:48:00.000-07:00

Let $\left(1+\sqrt{2}\right)^{2012}=a+b\sqrt{2}$, where $a$ and $b$ are integers. What is the greatest common divisor of $b$ and $81$?
Source: 2011-2012 SDML High School 2a, problem 15.

Let $(1+\sqrt{2})^n = a_n + b_n \sqrt{2}$. I've thought about this some more, and there's a nice way to describe the highest power of $3$ that divides $b_n$. This is probably outside of the scope of the intended solution, however.

First note that $(1-\sqrt{2})^n = a_n - b_n \sqrt{2}$, and so from $(1+\sqrt{2})(1-\sqrt{2})=-1$ we get $(1+\sqrt{2})^n (1-\sqrt{2})^n = {(-1)}^n$. This gives \[{a_n}^2 - 2 {b_n}^2 = {(-1)}^n.\] Now define the highest power of a prime $p$ that divides $n$ to be $\operatorname{\nu}_p(n)$.
From cubing and using the above result it's straightforward to prove that if $\operatorname{\nu}_3(b_n) = k > 0$ then $\operatorname{\nu}_3(b_{3n}) = k+1$.
Note $(1+\sqrt{2})^4 = 17 + 12\sqrt{2} \equiv -1+3\sqrt{2} \pmod{3^2}$. Cubing and using the first formula as before, we can in fact show that \[(1+\sqrt{2})^{4\cdot 3^n} \equiv -1 + 3^{n+1}\sqrt{2} \pmod{3^{n+2}},\] and squaring we also have \[(1+\sqrt{2})^{8\cdot 3^n} \equiv 1 + 3^{n+1}\sqrt{2} \pmod{3^{n+2}}.\] Now assume $\operatorname{\nu}_3(b_m) = k, \operatorname{\nu}_3(b_n) = l$ and $k\neq l$. From the top formula if $3 | b_i$ then $3 \not{|} a_i$, and it follows that \[\operatorname{\nu}_3(b_{m+n}) = \min(k,l).\]Putting this all together, write $n = 4\cdot m +k$, where $0\leq k <4$. If $k\neq 0$, then $\operatorname{\nu}_3(b_{n}) = 0$. If $k=0$, let the base-3 expansion of $m$ be $a_i \cdot 3^i + \ldots + a_0$. Then \[\operatorname{\nu}_3(b_{n}) = \min_{a_j \neq 0} j+1 .\]
For $n=2012$, we have $2012 = 4\cdot 503 = 4\cdot(2\cdot 3^5 + 3^2 + 2\cdot 3 + 2)$ and so $\operatorname{\nu}_3(b_{2012})=1$. We don't actually need to compute the entire base-3 expansion for 503, of course; we only need to observe that it's not divisible by 3.

For $n=2016$, we have $2016 = 4\cdot 504 = 4\cdot(2\cdot 3^5 + 2\cdot 3^2)$ and so $\operatorname{\nu}_3(b_{2016})=3$.

Sum of Two Odd Composite Numbers

2017-03-30T12:24:00.000-07:00

What is the largest even integer that cannot be written as the sum of two odd composite numbers? Source: AIME 1984, problem 14.

Note $24 = 3\cdot 3 + 3\cdot 5$, and so if $2k$ has a representation as the sum of even multiples of 3 and 5, say $2k = e_3\cdot 3 + e_5\cdot 5$, we get a representation of $2k+24$ as a sum of odd composites via $2k+24 = (3+e_3)\cdot 3 + (5+e_5)\cdot 5$. But by the Frobenius coin problem every number $k > 3\cdot 5 -3-5 = 7$ has such a representation, hence every number $2k > 14$ has a representation as the sum of even multiples of 3 and 5. Thus every number $n > 24+14=38$ has a representation as the sum of odd composites. Checking, we see that $\boxed{38}$ has no representation as a sum of odd composites.

What's the Value of a Win?

2016-06-18T23:30:00.000-07:00

In a previous entry I demonstrated one simple way to estimate an exponent for the Pythagorean win expectation. Another nice consequence of a Pythagorean win expectation formula is that it also makes it simple to estimate the run value of a win in baseball, the point value of a win in basketball, the goal value of a win in hockey etc.

Let our Pythagorean win expectation formula be \[ w=\frac{P^e}{P^e+1},\] where $w$ is the win fraction expectation, $P$ is runs/allowed (or similar) and $e$ is the Pythagorean exponent. How do we get an estimate for the run value of a win? The expected number of games won in a season with $g$ games is \[W = g\cdot w = g\cdot \frac{P^e}{P^e+1},\] so for one estimate we only need to compute the value of the partial derivative $\frac{\partial W}{\partial P}$ at $P=1$. Note that \[ W = g\left( 1-\frac{1}{P^e+1}\right), \] and so \[ \frac{\partial W}{\partial P} = g\frac{eP^{e-1}}{(P^e+1)^2}\] and it follows \[ \frac{\partial W}{\partial P}(P=1) = \frac{ge}{4}.\] Our estimate for the run value of a win now follows by setting \[\frac{\Delta W}{\Delta P} = \frac{ge}{4} \] giving \[ \Delta W = 1 = \frac{ge}{4} \Delta P.\] What is $\Delta P$? Well $P = R/A$, where $R$ is runs scored over the season and $A$ is runs allowed over the season. We're assuming this is a league average team and asking how many more runs they'd need to score to win an additional game, so $A$ is actually fixed at $L$, the league average number of runs scored (or allowed). This gives us \[1 = \frac{ge}{4} \Delta P = \frac{ge\Delta R}{4L}.\] Now $L/g = l$, the league average runs per game, so we arrive at the estimate \[\Delta R = \frac{4l}{e}.\] In the specific case of MLB, we have $e = 1.8$ and $l = 4.3$, giving that a win is approximately $\Delta R = 9.56$ runs.

Bill James originally used the exponent $e=2$; in this case the formula simplifies to $\Delta R = 2l$, i.e. we get the particularly simple result that a win is equal to approximately twice the average number of runs scored per game.

Applying this estimate to the NBA, a win is approximately $ \Delta R = \frac{4\cdot 101}{16.4} = 24.6$ points. Similarly, we get the estimates for a win of 4.5 goals for the NHL and 5.1 goals for the Premier League.

A Simple Estimate for Pythagorean Exponents

2016-06-08T16:05:00.000-07:00

Given the number of runs scored and runs allowed by a baseball team, what's a good estimate for that team's win fraction? Bill James famously came up with what he called the "Pythagorean expectation" \[w = \frac{R^2}{R^2 + A^2},\] which can also be written as \[w = \frac{{(R/A)}^2}{{(R/A)}^2 + 1}.\] More generally, if team $i$ scores $R_i$ and allows $A_i$ runs, the Pythagorean estimate for the probability of team $1$ beating team $2$ is \[w = \frac{{(R_1/A_1)}^2}{{(R_1/A_1)}^2 + (R_2/A_2)^2}.\] We can see that the estimate of the team's win fraction is a consequence of this, as an average team would by definition have $R_2 = A_2$. Now, there's nothing magical about the exponent being 2; it's a coincidence, and in fact is not even the "best" exponent. But what's a good way to estimate the exponent? Note the structural similarity of this win probability estimator and the Bradley-Terry estimator \[ w = \frac{P_1}{P_1+P_2}.\] Here the $P_i$ are what we could call the "Bradley-Terry power" of the team. This immediately suggests one way to estimate the expectation model's exponent - fit a Bradley-Terry model, then fit the log-linear regression $\log(P_i)$ vs $\log(R_i/A_i)$. The slope of this regression will be one estimate for the expectation exponent.

How well does this work? I get 1.727 for MLB in 2014. The R code and data files for MLB and other sports may be found in my GitHub repo.

Behind the Speadsheet

2016-05-02T22:40:00.000-07:00

In the book "The Only Rule Is It Has to Work: Our Wild Experiment Building a New Kind of Baseball Team", Ben Lindbergh and Sam Miller recount a grand adventure to take command of an independent league baseball team, with the vision of trying every idea, sane or crazy, in an attempt to achieve a winning edge. Five infielders, four outfielders, defensive shifts, optimizing lineups - everything.

It was really an impossible task. Professional sports at every level are filled with highly accomplished and competitive athletes, with real lives and real egos. Now imagine walking in one day and suddenly trying to convince them that they should be doing things differently. Who do you think you are?

I was one of the analysts who helped Ben and Sam in this quest, and I wanted to write some thoughts down from my own perspective, not as one of the main characters, but as someone more behind the scenes. These are some very short initial thoughts only, but I'd like to followup with some more ideas on where things went wrong from my perspective, and also how independent league teams can better identify roster talent from some non-traditional sources.

My focus was on attempting to identify talent overlooked in the MLB draft. This is extremely challenging; there are 30 teams, 40 standards rounds plus other picks. Furthermore, among those players left, many sign as amateur free agents post-draft. You're left with players from lower divisions, very small schools, 23-year-old seniors, bad bodies, soft tossers, poor defenders, etc. But, still, there may be players who aren't good MLB prospects, but who could still perform well as part of an independent league team.

Looking at top framing college catchers was a bust; this is a premium defensive position and very little is overlooked.

Among the undrafted senior hitters and pitchers there were several potential prospects, many of whom you'll read about in the book. The most important fact to keep in mind is that these are real people with real lives, real families and real hopes and dreams, and playing independent ball isn't nearly lucrative enough to pay the bills. Harsh reality will limit your pool even more, and those who choose to pursue it will face the additional stress of financial strain.

That being said, was Ben and Sam's experiment a success? You'll have to read the book, but absolutely, some talent was found.

When is a Lead Safe in the NBA?

2016-02-09T20:16:00.000-08:00

Assuming two NBA teams of equal strength with $t$ seconds remaining, what is a safe lead at a prescribed confidence level? Bill James has a safe lead formula for NCAA basketball, and the topic has been addressed by other researchers at various levels of complexity, e.g. Clauset, Kogan and Redner.

I'll present a simple derivation. Start by observing there are about 50 scoring groups per team per game (scoring groups include all baskets and free throws that occur at the same time), with each scoring group worth about two points. Assume scoring events by team are Poisson distributed with parameter $\lambda = \frac{50\cdot t}{48\cdot 60} = \frac{t}{57.6}$. Using a normal approximation, the difference of these two distributions is normal with mean 0 and variance $\sqrt{2}\lambda$, giving a standard deviation of $0.1863\sqrt{t}$.

Using this approximation, what is a 90% safe lead? A 90% tail is 1.28 standard deviations, $1.28\cdot 0.1863\sqrt{t} = 0.2385\sqrt{t}$ scoring groups. As a scoring group is about two points, this means a 90% safe lead, assuming a jump ball, is about $0.477\sqrt{t}$ points (Clauset et. al. obtained $0.4602\sqrt{t}$). For example, a safe lead at halftime is approximately $0.477 \sqrt{24\cdot 60} = 18.1$ points.

Next - adjustments for possession arrow and shot clock time; validity of approximation; adjusting for team strengths.

An Enormous Number of Kilograms

2015-10-15T01:48:00.002-07:00

For years the kilogram has been defined with respect to a platinum and iridium cylinder, but this is now no longer the case. Here's a puzzle about kilograms that's easy to state and understand, but the answer is very, very surprising.

I've always had a fascination with really large numbers. First 100 when I was really little, and as I got older and more sophisticated numbers like a googol and the smallest number that satisfies the conditions of the Archimedes cattle problem.

When I was an undergraduate I interviewed for a summer internship with an insurance company as an actuarial student. They gave me the following puzzle - what's the smallest number that when you move the last digit to the front it multiplies by 2? I calculated for a little while, then said "This can't be right, my answer has 18 digits!". It turns out that the smallest solution does, indeed, have 18 digits.

We can see this by letting our $(n+1)$-digit number $ x = 10 m + a$, where $m$ is an $n$-digit number and $0\leq a < 10$. Moving $a$ to the front we get $y = 10^n a + m$, and our requirement is $y = 2x$. This gives: \begin{align} 20 m + 2 &= 10^n a + m \\ 19 m &= a(10^n-2) \\ m &= \frac{2a(5\cdot 10^{n-1} - 1)}{19} \end{align} The smallest $m$, if one exists, requires $a,n$ such that 19 divides $5\cdot 10^{n-1}-1$ (as 19 can't divide $2 a$) and the result has $n$-digits. It's easy to check that the smallest value of $n$ that satisfies the first condition is $n=17$. To get the smallest solution we try $a=1$, but this yields a value with only 16 digits. Setting $a=2$, however, yields the 17-digit $m = 10526315789473684$. The smallest solution to our puzzle is therefore the 18-digit number $105263157894736842$; that's surprisingly large.

Numbers with this type of property are known as parasitic numbers.

Later, I wondered if there were numbers with the slightly different, but equally interesting property, that moving the last digit to the front converted ("autoconverts") it from a value under one unit of measurement to an equivalent value under a different unit of measurement.

The first one I tried was moving the last digit to the front converts from Celsius to Fahrenheit. This is a fun puzzle that eventually made its way into the New York Times. The smallest such value is 275 C, which exactly equals 527 F. What's the next smallest temperature?

How about moving the first digit to the end? We'll need to use the little-known fact that, legally, a pound is exactly equal to 0.45359237 kilograms. Given this, does there exist a number such that moving the first digit to the end converts from pounds to kilograms? The answer is yes, but the smallest solution has 108,437,840 digits! The solution is similar to the above, but as it's computationally more involved I've written Sage code to solve it, which you can find in my GitHub puzzles repository.

The smallest number that autoconverts from gallons to liters, incidentally, is even bigger at 382,614,539 digits!

Solving a Math Puzzle using Physics

2015-10-10T17:25:00.000-07:00

The following math problem, which appeared on a Scottish maths paper, has been making the internet rounds.

The first two parts require students to interpret the meaning of the components of the formula $T(x) = 5 \sqrt{36+x^2} + 4(20-x) $, and the final "challenge" component involves finding the minimum of $ T(x) $ over $ 0 \leq x \leq 20 $. Usually this would require a differentiation, but if you know Snell's law you can write down the solution almost immediately. People normally think of Snell's law in the context of light and optics, but it's really a statement about least time across media permitting different velocities.

One way to phrase Snell's law is that least travel time is achieved when \[ \frac{\sin{\theta_1}}{\sin{\theta_2}} = \frac{v_1}{v_2},\] where $ \theta_1, \theta_2$ are the angles to the normal and $v_1, v_2$ are the travel velocities in the two media.

In our puzzle the crocodile has an implied travel velocity of 1/5 in the water and 1/4 on land. Furthermore, the crocodile travels along the riverbank once it hits land, so $ \theta_2 = 90^{\circ}$ and $\sin{\theta_2} = 1$. Snell's law now says that the path of least time satisfies \[ \sin{\theta_1} = \frac{x}{\sqrt{36+x^2}} = \frac{4}{5},\] giving us $ 25x^2 = 16x^2 + 24^2$. Solving, $ 3^2 x^2 = 24^2, x^2 = 8^2$ and the solution is $x = 8$.

Mixed Models in R - Bigger, Faster, Stronger

2015-10-04T19:31:00.000-07:00

When you start doing more advanced sports analytics you'll eventually starting working with what are known as hierarchical, nested or mixed effects models. These are models that contain both fixed and random effects. There are multiple ways of defining fixed vs random random effects, but one way I find particularly useful is that random effects are being "predicted" rather than "estimated", and this in turn involves some "shrinkage" towards the mean.

Here's some R code for NCAA ice hockey power rankings using a nested Poisson model (which can be found in my hockey GitHub repository):

model <- gs ~ year+field+d_div+o_div+game_length+(1|offense)+(1|defense)+(1|game_id)
fit <- glmer(model,
             data=g,
      verbose=TRUE,
      family=poisson(link=log)
      )

The fixed effects are year, field (home/away/neutral), d_div (NCAA division of the defense), o_div (NCAA division of the offense) and game_length (number of overtime periods); offense (strength of offense), defense (strength of defense) and game_id are all random effects. The reason for modeling team offenses and defenses as random vs fixed effects is that I view them as random samples from the same distribution. As mentioned above, this results in statistical shrinkage or "regression to the mean" for these values, which is particularly useful for partially completed seasons.

One of the problems with large mixed models is that they can be very slow to fit. For example, the model above takes several hours on a 12-core workstation, which makes it very difficult to test model changes and tweaks. Is there any way to speed up the fitting process? Certainly! One way is to add two options to the above code:

fit <- glmer(model,
             data=g,
      verbose=TRUE,
      family=poisson(link=log),
      nAGQ=0,
      control=glmerControl(optimizer = "nloptwrap")
      )

What do these do? Model fitting is an optimization process. Part of that process involves the estimation of particular integrals, which can be very slow; the option "nAGQ=0" tells glmer to ignore estimating those integrals. For some models this has minimal impact on parameter estimates, and this NCAA hockey model is one of those. The second option tells glmer to fit using the "nloptwrap" optimizer (there are several other optimizers available, too), which tends to be faster than the default optimization method.

The impact can be rather startling. With the default options the above model takes about 3 hours to fit. Add these two options, and the model fitting takes 30 seconds with minimal impact on the parameter estimates, or approximately 400 times faster.

Elo's Rating System as a Forgetful Logistic Model

2015-10-01T23:14:00.000-07:00

Elo's rating system became famous from its use in chess, but it and variations are now used in sports like the NFL to eSports like League of Legends. It also was infamously used on various "Hot or Not" type websites, as shown in this scene from the movie "Social Network":

Of course, there's a mistake in the formula in the movie!

What is the Elo rating system? As originally proposed, it presumes that if two players A and B have ratings $R_A$ and $R_B$, then the expected score of player A is \[\frac{1}{1+10^{\frac{R_B-R_A}{400}}}.\] Furthermore, if A has a current rating of $R_A$ and plays some more games, then the updated rating ${R_A}'$ is given by ${R_A}' = R_A + K(S_A-E_A)$, where $K$ is an adjustment factor, $S_A$ is the number of points scored by A and $E_A$ was the expected number of points scored by A based on the rating $R_A$.

Now, the expected score formula given above has the same form as a logistic regression model. What's the connection between the two? One answer is that Elo's rating system is a type of online version of a logistic model. An online algorithm is an algorithm that only sees each piece of data once. As applied to a statistical model, it's a model with parameter estimates that are updated as new data comes in, but not refitting on the entire data set. It can also be considered a memoryless model; it has "forgotten" the old data and only knows the current parameter estimates. The appeal of such models is that they're extremely efficient, can operate on enormous data sets and parameter estimates can be updated in real-time.

Okay, let's say we have a "forgetful" logistic model. Can we derive an updating rule, and does it look like Elo's? I'm going to give one possible derivation under the simplifying assumption that games are won or lost, with no ties.

We don't know how many games A and B had previously played, so let's assume they both had previously played $n$ games and have just played $m$ additional games between them, with A scoring $S_A$ points. That means they've both played $n+m$ games, but we're just going to forget this again, so let's adjust everything so that they end up with $n$ games. One way to do this is to normalize $n$ and $m$ so that they sum to $n$, thus $n$ becomes $n\frac{n}{n+m}$ and $m$ becomes $m\frac{n}{n+m}$.

We're now assuming they had each played $n\frac{n}{n+m}$ games in the past, have just played $m\frac{n}{n+m}$ additional games and A scored $S_A \frac{n}{n+m}$ points (it has to be adjusted, too!) in those games.

Again, we're memoryless, so we don't know how strong the opponents were that each had played in the past, so we're going to assume that they had both played opponents that were as strong as themselves and had won half and lost half of those games. After all, people generally prefer to play competitive games.

Define $d$ by ${R_A}' = R_A + d$ and let $c = R_A - R_B$; we also require that ${R_B}' = R_B - d$. The log-likelihood $L$ of A having scored $S_A \frac{n}{n+m}$ points is \[ \frac{-2 n^2}{n+m}\log(1+e^{-d}) -\frac{n^2}{n+m}d-\frac{m n}{n+m}\log(1+e^{-c-2d}) - \frac{(m-S_A)n(c+2d)}{n+m}.\] Factoring out the constant term $n/(n+m)$ simplifies this to \[ L = -2 n\log(1+e^{-d}) - n d - m \log(1+e^{-c-2d}) - (m-S_A)(c+2d).\] Taking the partial derivative of $L$ with respect to $d$ we get
\begin{align}
\frac{\partial L}{\partial d} &= 2n \frac{e^{-d}}{1+e^{-d}} -n + 2m \frac{e^{-c-2d}}{1+e^{-c-2d}}-2(m-S_A) \\
&= -n\frac{1-e^{-d}}{1+e^{-d}} + 2 S_A - 2m\frac{1}{1+e^{-c-2d}} \\
&= -n\tanh(d/2) + 2 S_A - 2m\frac{1}{1+e^{-c-2d}}.
\end{align} What is $ m\frac{1}{1+e^{-c-2d}} $? This is actually just $ {E_A}' $, the expected score for A when playing B for $m$ games, but assuming the updated ratings for both players. Finally, setting $\frac{\partial L}{\partial d} = 0$, we get \[ n\tanh(d/2) = 2(S_A - {E_A}')\] and hence \[ \tanh(d/2) = \frac{2}{n} (S_A - {E_A}').\] Assuming $n$ is large relative to $S_A - {E_A}'$, we have $ \tanh(d/2) \approx d/2$ and $ {E_A}' \approx E_A $. This is Elo's updating rule in the form \[ d = \frac{4}{n} (S_A - E_A ).\] If we rescale with the constant $ \sigma $, the updating rule becomes \[ d = \frac{4\sigma }{n} (S_A - E_A ).\] We also now see that the adjustment factor $ K = \frac{4\sigma }{n}$.

Power Rankings: Looking at a Very Simple Method

2015-07-11T19:54:00.000-07:00

One of the simplest and most common power ranking models is known as the Bradley-Terry-Luce model, which is equivalent to other famous models such the logistic model and the Elo rating system. I'll be referring to "teams" here, but of course the same ideas apply to any two-participant game.

Let me clarify what I mean when I use the term "power ranking". A power ranking supplies not only a ranking of teams, but also provides numbers that may be used to estimate the probabilities of various outcomes were two particular teams to play a match.

In the BTL power ranking system we assume the teams have some latent (hidden/unknown) "strength" $R_i$, and that the probability of $i$ beating $j$ is $ \frac{R_i}{R_i+R_j} $. Note that each $R_i$ is assumed to be strictly positive. Where does this model structure come from?

Here are three reasonable constraints for a power ranking model:

If $R_i$ and $R_j$ have equal strength, the probability of one beating the other should be $ \frac{1}{2}$.
As the strength of one team strictly approaches 0 (infinitely weak) with the other team fixed, the probability of the other team winning strictly increases to 1.
As the strength of one team strictly approaches 1 (infinitely strong) with the other team fixed, the probability of the other team winning strictly decreases to 0.

Note that our model structure satisfies all three constraints. Can you think of other simple model structures that satisfy all three constraints?

Given this model and a set of teams and match results, how can we estimate the $R_i$. The maximum-likelihood estimators are the set of $ R_i $ that maximizes the probability of the observed outcomes actually happening. For any given match this probability of team $ i $ beating team $ j $ is $ \frac{R_i}{R_i+R_j} $, so the overall probability of the observed outcomes of the matches $ M $ occurring is \[ \mathcal{L} = \prod_{m\in M} \frac{R_{w(m)}}{R_{w(m)}+R_{l(m)}},\] where $ w(m) $ is then winner and $ l(m) $ the loser of match $ m $. We can transform this into a sum by taking logarithms; \[ \log\left( \mathcal{L} \right) = \log\left(R_{w(m)}\right) - \log\left(R_{w(m)}+R_{l(m)}\right).\] Before going further, let's make a useful reparameterization by setting $ e^{r_i} = R_i $; this makes sense as we're requiring the $ R_i $ to be strictly positive. We then get \[ \log\left( \mathcal{L} \right) = r_{w(m)} - \log\left(e^{r_{w(m)}}+e^{r_{l(m)}}\right).\] Taking partial derivatives we get \begin{eqnarray*}
\frac{\partial \log\left( \mathcal{L} \right)}{\partial r_i} &=& \sum_{w(m)=i} 1 - \frac{e^{r_{w(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_{l(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}}\\
&=& \sum_{w(m)=i} 1 - \frac{e^{r_i}}{e^{r_i}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_i}}{e^{r_{w(m)}}+e^{r_i}}\\
&=&0.
\end{eqnarray*} But this is just the number of actual wins minus the expected wins! Thus, the maximum likelihood estimators for the $ r_i $ satisfy $ O_i = E_i $ for all teams $ i $, where $ O_i $ is the actual (observed) number of wins for team $ i $, and $ E_i $ is the expected number of wins for team $ i $ based on our model. That's a nice property!

If you'd like to experiment with some actual data, and to see that the resulting fit does indeed satisfy this property, here's an example BTL model using NCAA men's ice hockey scores. You can, of course, actually use this property to iteratively solve for the MLE estimators $ R_i $. Note that you'll have to fix one of the $ R_i $ to be a particular value (or add some other constraint), as the model probabilities are invariant with respect to multiplication of the $ R_i $ by the same positive scalar.

Getting Started Doing Baseball Analysis without Coding

2015-07-11T15:00:00.001-07:00

There's lot of confusion about how best to get started doing baseball analysis. It doesn't have to be difficult! You can start doing it right away, even if you don't know anything about R, Python, Ruby, SQL or machine learning (most GMs can't code). Learning these and other tools makes it easier and faster to do analysis, but they're only part of the process of constructing a well-reasoned argument. They're important, of course, because they can turn 2 months of hard work into 10 minutes of typing. Even if you don't like mathematics, statistics, coding or databases, they're mundane necessities that can make your life much easier and your analysis more powerful.

Here are two example problems. You don't have to do these specifically, but they illustrate the general idea. Write up your solutions, then publish them for other people to make some (hopefully) helpful comments and suggestions. This can be on a blog or through a versioning control platform like GitHub (which is also great for versioning any code or data your use). Try to write well! A great argument, but poorly written and poorly presented isn't going to be very convincing. Once it's finished, review and revise, review and revise, review and revise. When a team you follow makes a move, treat it as a puzzle for you to solve. Why did they do it, and was it a good idea?

Pick a recent baseball trade. For example, the Padres traded catcher Yasmani Grandal for Dodgers outfielder Matt Kemp. It's never that simple of course; the Padres aren't paying all of Matt Kemp's salary. Find out what the salary obligations were for each club in this trade. Using your favorite public projection system, where were the projected surplus values for each player at the time of the trade? Of course, there were other players involved in that trade, too. What were the expected surplus values of those players? From the perspective of surplus values, who won or lost this trade? Finally, why do you think each team made this trade, especially considering that they were division rivals? Do you think one or both teams made any mistakes in reasoning; if so, what were they, and did the other team take advantage of those mistakes?
Pick any MLB team and review the draft picks they made in the 2015 draft for the first 10 rounds. Do you notice any trends or changes from the 2014 draft? Do these picks agree or disagree with the various public pre-draft player rankings? Which picks were designed to save money to help sign other picks? Identify those tough signs. Was the team actually able to sign them, and were the picks to save money still reasonably good picks? Do you best to identify which picks you thought were good and bad, write them down in a notebook with your reasoning, then check back in 6 months and a year. Was your reasoning correct? If not, what were your mistakes and how can you avoid making them in the future?

Some Potentially Useful SQL Resources

2015-03-08T17:27:00.001-07:00

Some potentially useful SQL resources - explanations, visualizations, exercises, games, classes.

Who Controls the Pace in Basketball, Offense or Defense?

2015-03-03T00:56:00.002-08:00

During a recent chat with basketball analyst Seth Partnow, he mentioned a topic that came up during a discussion at the recent MIT Sloan Sports Analytics Conference. Who controls the pace of a game more, the offense or defense? And what is the percentage of pace responsibility for each side? The analysts came up with a rough consensus opinion, but is there a way to answer this question analytically? I came up with one approach that examines the variations in possession times, but it suddenly occurred to me that this question could also be answered immediately by looking at the offense-defense asymmetry of the home court advantage.

As you can see in the R output of my NCAA team model code in one of my public basketball repositories, the offense at home scores points at a rate about $ e^{0.0302} = 1.031 $ times the rate on a neutral court, everything else the same. Likewise, the defense at home allows points at a rate about $ e^{-0.0165} = 0.984$ times the rate on a neutral court; in both cases the neutral court rate is the reference level. Notice the geometric asymmetry; $ 1.031\cdot 0.984 = 1.015 > 1$. The implication is that the offense is responsible for about the fraction \[ \frac{(1.031-1)}{(1.031-1)+(1-0.984)} = 0.66 \] of the scoring pace. That is, offensive controls 2/3 of the pace, defense 1/3 of the pace. The consensus opinion the analysts came up with at Sloan? It was 2/3 offense, 1/3 defense! It's nice when things work out, isn't it?

I've used NCAA basketball because there are plenty of neutral court games; to examine the NBA directly we'll have to use a more sophisticated (but perhaps less beautiful) approach involving the variation in possession times. I'll do that next, and I'll also show you how to apply this new information to make better game predictions. Finally, there's a nice connection to some recent work on inferring causality that I'll outline.