Wednesday, January 21, 2015

A Very Rough Guide to Getting Started in Data Science: Part I, MOOCs

Introduction


Data science is a very hot, perhaps the hottest, field right now. Sports analytics has been my primary area of interest, and it's a field that has seen amazing growth in the last decade. It's no surprise that the most common question I'm asked is about becoming a data scientist. This will be a first set of rough notes attempting to answer this question from my own personal perspective. Keep in mind that this is only my opinion and there are many different ways to do data science and become a data scientist.

Data science is using data to answer a question. This could be doing something as simple as making a plot by hand, or using Excel to take the average of a set of numbers. The important parts of this process are knowing which questions to ask, deciding what information you'd need to answer it, picking a method that takes this data and produces results relevant to your question and, most importantly, how to properly interpret these results so you can be confident that they actually answer your question.

Knowing the questions requires some domain expertise, either yours or someone else's. Unless you're a data science researcher, data science is a tool you apply to another domain.

If you have the data you feel should answer your question, you're in luck. Frequently you'll have to go out and collect the data yourself, e.g. scraping from the web. Even if you already have the data, it's common to have to process the data to remove bad data, correct errors and put it into a form better suited for analysis. A popular tool for this phase of data analysis is a scripting language; typically something like Python, Perl or Ruby. These are high-level programming languages that very good at web work as well as manipulating data.

If you're dealing with a large amount of data, you'll find that it's convenient to store it in a structured way that makes it easier to access, manipulate and update in the future. This will typically be a relational database of some type, such as PostgreSQL, MySQL or SQL Server. These all use the programming language SQL.

Methodology and interpretation are the most difficult, broadest and most important parts of data science. You'll see methodology referenced as statistical learning, machine learning, artificial intelligence and data mining; these can be covered in statistics, computer science, engineering or other classes. Interpretation is traditionally the domain of statistics, but this is always taught together with methodology.

You can start learning much of this material freely and easily with MOOCs. Here's an initial list.

MOOCs

Data Science Basics

Johns Hopkins: The Data Scientist’s Toolbox. Overview of version control, markdown, git, GitHub, R, and RStudio. Started January 5, 2015. Coursera.

Johns Hopkins: R Programming. R-based. Started January 5, 2015. Coursera.

Scripting Languages

Intro to Computer Science. Python-based. Take anytime. Udacity; videos and exercises are free.


Programming Foundations with Python. Python-based. Take anytime. Udacity; videos and exercises are free.



MIT: Introduction to Computer Science and Programming Using Python. Python-based. Class started January 9, 2015. edX.



Databases and SQL

Stanford: Introduction to Databases. XML, JSON, SQL; uses SQLite for SQL. Self-paced. Coursera.


Machine Learning

Stanford: Machine Learning. Octave-based. Class started January 19, 2015. Coursera.


Stanford: Statistical Learning. R-based. Class started January 19, 2015. Stanford OpenEdX.



Sunday, January 18, 2015

How Unfair are the NFL's Overtime Rules?

In 2010 the NFL amended its overtime rules, and in 2012 extended these to all regular season games. Previously, overtime was handled by sudden death - the first team to score won. The team winning a coin flip can elect to kick or receive (they invariably receive, as they should).

Assuming the game ends in the first overtime, the team with the first possession wins under the following scenarios:
  1. scores a touchdown on the first drive
  2. kicks a field goal on the first drive; other team fails to score on the second drive
  3. both teams kick a field goal on the first and second drives; win in sudden death
  4. doesn't score on the first drive; defensive score during second drive
  5. neither team scores on first or second drives; win in sudden death
Under this overtime procedure, roughly how often should be expect the team winning the coin flip to win the game?

For an average team the empirical probabilities of the above events per drive are:
  • \(\mathrm{defensiveTD} = \mathrm{Pr}(\text{defensive touchdown}) = 0.02\)
  • \(\mathrm{safety} = \mathrm{Pr}(\text{safety}) = 0.001\)
  • \(\mathrm{fieldGoal} = \mathrm{Pr}(\text{field goal}) = 0.118\)
  • \(\mathrm{offensiveTD} = \mathrm{Pr}(\text{offensive touchdown}) = 0.200\)
We'll also use the following:
  • \(\mathrm{defensiveScore} = \mathrm{Pr}(\text{any defensive score}) = \mathrm{defensiveTD} + \mathrm{safety}\)
  • \(\mathrm{offensiveScore} = \mathrm{Pr}(\text{any offensive score}) = \mathrm{fieldGoal} + \mathrm{offensiveTD}\)
  • \(\mathrm{noOFscore} = \mathrm{Pr}(\text{no offensive score}) = 1 - \mathrm{offensiveScore}\)
  • \(\mathrm{noScore} = \mathrm{Pr}(\text{no score}) = 1 - \mathrm{offensiveScore} - \mathrm{defensiveScore}\)
  • \(\mathrm{sdWin} = \mathrm{Pr}(\text{driving team winning under sudden death rules})\)
Then the probabilities of the above numbered outcomes is approximately:
  1. \(\mathrm{offensiveTD}\)
  2. \(\mathrm{fieldGoal}\times \mathrm{noOFscore}\)
  3. \(\mathrm{fieldGoal}\times \mathrm{fieldGoal}\times \mathrm{sdWin}\)
  4. \(\mathrm{noScore}\times \mathrm{defensiveScore}\)
  5. \(\mathrm{noScore}\times \mathrm{noScore}\times \mathrm{sdWin}\)
The last thing we need to work out is \(\text{sdWin}\). There are three ways for the team with possession to win:
  1. any offensive score on the first drive
  2. no offensive score; any defensive score on the second drive
  3. neither team scores on the first or second possessions; we're back to our original state
These three scenarios have values:
  1. \(\mathrm{offensiveScore}\)
  2. \(\mathrm{noOFscore}\times \mathrm{defensiveScore}\)
  3. \(\mathrm{noScore}\times \mathrm{noScore}\times \mathrm{sdWin}\)
Doing the math, we get that \begin{align*}
\mathrm{sdWin} &= \mathrm{offensiveScore} + \mathrm{noOFscore}\times \mathrm{defensiveScore} + {\mathrm{noScore}}^2\times\mathrm{sdWin};\\
\mathrm{sdWin} &=\frac{(\mathrm{offensiveScore} + \mathrm{noOFscore}\times \mathrm{defensiveScore})}{(1-{\mathrm{noScore}}^2)}.
\end{align*}
Putting it all together we get \[
\text{win} = \mathrm{offensiveTD} + \mathrm{fieldGoal}\times \mathrm{noOFscore} + \mathrm{noScore}\times \mathrm{defensiveScore}\\
+ (\mathrm{fieldGoal}^2+ \mathrm{noScore}^2)\times \mathrm{sdWin}.\]
Plugging in our empirical values, we finally arrive at \[\mathrm{Pr}(\text{win coin flip, win game}) = 0.560.\] For comparison, under the original sudden death rules, \[\mathrm{Pr}(\text{win coin flip, win game}) = 0.589.\] So the NFL overtime rules are still ridiculously unfair in favor of the winner of the coin flip, but not as ridiculously unfair as they were under the original sudden death rules.

How do these numerical results compare to actual outcomes? Under the current overtime rules, there have been 51 overtime games. In 27 of these the team winning the coin toss won the game, in 21 the team losing the coin toss won the game and there have been 3 ties. That puts \(\mathrm{win} = \frac{27}{48} = 0.5625\) for games not ending in ties. Close enough!

If you'd like to tweak the probabilities for each event to see how the resulting probability for the winner of the coin flip changes, I have a simple Python script here.

Sunday, January 11, 2015

A Short Note on Automatic Algorithm Optimization via Fast Matrix Exponentiation

Alexander Borzunov has written an interesting article about his Python code that uses fast matrix exponentiation to automatically optimize certain algorithms. It's definitely a recommended read.

In his article, Alexander mentions that it's difficult to directly derive a matrix exponentiation algorithm for recursively-defined sequences such as \[
F_n = \begin{cases}
0, & n = 0\\
1, & n = 1\\
1, & n = 2\\
7(2F_{n-1} + 3F_{n-2})+4F_{n-3}+5n+6, & n \geq 3
\end{cases}
\] While it's true that it's not entirely simple, there is a relatively straightforward way to do this that's worth knowing.

The only difficultly is due to the term \(5n+6\), but we can eliminate it by setting
\(F_n = G_n + an+b\), then solving for appropriate values of \(a, b\).

Substituting and grouping terms we have \[
G_n + an+b = 7(2G_{n-1} + 3G_{n-2})+4G_{n-3} + 39an-68a+39b+5n+6,
\] and equating powers of \(n\) we need to solve the equations \[
\begin{align*}
a &= 39a+5,\\
b &= -68a+39b+6.
\end{align*}
\] Setting \(a = -\frac{5}{38}, b = -\frac{142}{361}\) we end up with the homogeneous system \[
G_n = \begin{cases}
\frac{142}{361}, & n = 0\\
\frac{379}{722}, & n = 1\\
\frac{237}{361}, & n = 2\\
7(2G_{n-1} + 3G_{n-2})+4G_{n-3}, & n \geq 3
\end{cases}
\] This is now easily cast into matrix exponentiation form using the standard method.

Wednesday, January 7, 2015

Young Alan Turing and the Arctangent

With the release of the new film "The Imitation Game", I decided to read the biography this excellent film was based on - Alan M. Turing: Centenary Edition. In it, the author Andrew Hodges relates the story that the 15-year-old Alan Turing derived the Maclaurin series for the \(\arctan\) function, i.e. \[\arctan(x) = x - \frac{x^3}{3} + \frac{x^5}{5} - \frac{x^7}{7} + \ldots\] This is trivial using calculus, but it's explicitly stated that young Alan Turing neither knew nor used calculus. How would you derived such a series without calculus?

This is a tricky problem, and I'd suggest first tackling the much easier problem of deriving the Maclaurin series for \(\exp(x)\) from the relation \( \exp(2x) = \exp(x)\cdot \exp(x)\). This is an undercontrained relation, so you'll need to assume \(c_0 = 1, c_1 = 1\).

Getting back to \(\arctan\), you could start with the half-angle formula for the tangent: \[\tan(2x) = \frac{2\tan(x)}{1-{\tan}^2(x)}.\] Now use the Weierstrass-like substitution \(x = \arctan(t)\) to get \[\tan(2\arctan(t)) = \frac{2t}{1-t^2}.\] The right-hand side can be expanded in the usual geometric series fashion to get \[\tan(2\arctan(t)) = 2t\cdot (1+t^2+t^4+\ldots).\]
Finally, take the \(\arctan\) of both sides and assume we have the series expansion \(\arctan(x) = c_1 x + c_3 x^3 + c_5 x^5 + \ldots\). Note that we may ignore the terms with even powers of \(x\) as \(\arctan(x)\) is an odd function.

This gives us the setting \[2\arctan(t) = \arctan(2t\cdot (1+t^2+t^4+\ldots))\] and expanding as a power series \[2(c_1 t + c_3 t^3 + \ldots) = c_1 (2t\cdot (1+t^2+\ldots) + c_3 (2t\cdot (1+t^2+\ldots))^3 + \ldots\]
The next step is to line up powers of \(t\) on both sides and set up a system of simultaneous equations. There's some algebra and combinatorics involved, but we end up with the system of equations \[c_{2i+1} = \sum_{j=0}^{i} c_{2j+1}\cdot 2^{2j} \cdot {{i+j} \choose {2j}}.\] Note that this system is underconstrained due to our functional relationship being satisfied by any multiple of the \(\arctan\) function. We'll assume that \(c_1 = 1\), but note that this follows from the classical (non-calculus) limit \( \lim_{x\to 0} \frac{\sin(x)}{x} = 1\).

The first few relations are \begin{align*}
c_1 &= c_1 \\
c_3 &= c_1 + 4\cdot c_3 \\
c_5 &= c_1 + 12\cdot c_3 + 16\cdot c_5 \\
c_7 &= c_1 + 24\cdot c_3 + 80\cdot c_5 + 64\cdot c_7
\end{align*}
Assuming \(c_1 = 1\) as above we quickly calculate \( c_3 = -\frac{1}{3}, c_5 = \frac{1}{5}, c_7 = -\frac{1}{7}\), with the pattern being obvious.

That \(c_{2i+1} = \frac{(-1)^i}{2i+1}\) can be verified by Wolfram Alpha:

Wolfram Alpha

An obvious question is whether or not there's a simple demonstration of this; in particular, one that a young Alan Turing may have found. This I don't know (yet).

Saturday, October 11, 2014

A Strange Recursive Relation, Automatic

Hofstadter mentions the following recursive relation in his great book "Gödel, Escher, Bach": \[
\begin{align}
g(0) &= 0;\\
g(n) &= n-g(g(n-1)).
\end{align}
\] I claim that \( g(n) = \left\lfloor \phi\cdot (n+1) \right\rfloor \), where \( \phi = \frac{-1+\sqrt{5}}{2}\), and I'll show this using a technique that makes proving many identities of this type nearly automatic.

Let \( \phi\cdot n = \left\lfloor \phi\cdot n \right\rfloor + e\), where \( 0 < e < 1\) as \( \phi \) is irrational, nor can \(e = 1-\phi\), and note that \(\phi\) satisfies \( {\phi}^2 + \phi - 1 = 0\). Some algebra gives \[
\begin{align}
n-\left\lfloor \left( \left\lfloor \phi\cdot n \right\rfloor + 1 \right) \cdot \phi \right\rfloor
&= n-\left\lfloor \left( n\cdot \phi - e + 1 \right) \cdot \phi \right\rfloor \\
&= n-\left\lfloor n\cdot {\phi}^2 - e\cdot \phi + \phi \right\rfloor \\
&= n-\left\lfloor n\cdot \left(1-\phi\right) - e\cdot \phi + \phi \right\rfloor \\
&= n-n-\left\lfloor -n\cdot \phi - e\cdot \phi + \phi \right\rfloor \\
&= -\left\lfloor -n\cdot \phi - e\cdot \phi + \phi \right\rfloor \\
&= -\left\lfloor -n \cdot \phi + e - e - e\cdot \phi + \phi \right\rfloor \\
&= \left\lfloor \phi\cdot n \right\rfloor -\left\lfloor - e - e\cdot \phi + \phi \right\rfloor.
\end{align}
\]
Now if \[
\begin{align}
0 < e < 1-\phi &\implies 0 < - e - e\cdot \phi + \phi < \phi;\\
1-\phi < e < 1 &\implies -1 < - e - e\cdot \phi + \phi < 0.
\end{align}
\]
This implies \[ n-\left\lfloor \left( \left\lfloor \phi\cdot n \right\rfloor + 1 \right) \cdot \phi \right\rfloor = \left\lfloor \phi\cdot (n+1) \right\rfloor .\] Since \( \left\lfloor \phi\cdot (0+1) \right\rfloor = 0\), we're done.

The point of the algebra was to move all terms involving \(n\) out, and then checking to see how the remaining term varied with \( e\). A simple idea, but very useful.

Sunday, October 5, 2014

Closed Under Means

Here's a nice little problem from the 13th All Soviet Union Mathematics Olympiad.
Given a set of real numbers \(S\) containing 0 and 1 that's closed under finite means, show that it contains all rational numbers in the interval \(\left[0,1\right]\).
This isn't a difficult problem, but there's a particularly nice solution.

First observe that if \(x\in S\) then both \(\frac{x}{4}\) and \(\frac{3x}{4}\) are in \(S\); average \(\{0,x\}\) to get \(\frac{x}{2}\), average \(\{0, \frac{x}{2}\}\) to get \(\frac{x}{4}\), average \(\{\frac{x}{2}, x\}\) to get \(\frac{3x}{4}\).

We could show any rational number \(\frac{m}{n}\) with \(1\leq m < n\) is in \(S\) if we had \(n\) distinct elements from \(S\) that summed to \(m\). Let's exhibit one.

Start with an array with \(m\) 1s on the left, \(n-m\) 0s on the right. Repeatedly replace adjacent \(x,y\) values with \(\frac{3(x+y)}{4}, \frac{(x+y)}{4}\), where either \(x=1,y\neq1\) or \(x\neq 0, y=0\), until there is one 0 and one 1 left. We can do this in exactly \(n-2\) steps, each resulting number is guaranteed to be in \(S\) by the above note, and each number is guaranteed to be distinct by construction!

Examples:

\(\frac{1}{3}: \left[1,0,0\right] \to \left[\frac{3}{4},\frac{1}{4},0\right] \)

\(\frac{2}{5}: \left[1,1,0,0,0\right] \to \left[1,\frac{3}{4},\frac{1}{4},0,0\right] \to \left[1,\frac{3}{4},\frac{3}{16},\frac{1}{16},0\right] \)

Thursday, August 28, 2014

Learn One Weird Trick And Easily Solve The Product-Sum Problem

A tribute to Martin Gardner.

For which sets of positive integers does the sum equal the product? For example, when does \( x_1 + x_2 = x_1\cdot x_2\)? It's easy to see that this is only true when \(x_1 = x_2 = 2\).

In the general case our equality is \(\sum_{i=1}^{n} x_i= \prod_{i=1}^{n} x_i \). We can rearrange terms to give \[x_1+x_2+\sum_{i=3}^{n} x_i= x_1\cdot x_2\cdot \prod_{i=3}^{n} x_i,\] and this in turn factors nicely to give us \[\left( x_1\cdot \prod_{x=3}^{n} x_i - 1\right)\cdot \left( x_2\cdot \prod_{x=3}^{n} x_i - 1\right) = \left( \prod_{x=3}^{n} x_i \right)\cdot \left(\sum_{x=3}^{n} x_i \right) + 1.\] How does this help? Consider the case \(n=5\), and without loss of generality assume \(x_i \ge x_{i+1}\) for all \(i\). If \(x_3=x_4=x_5=1\) our factorized equation becomes \[(x_1-1)\cdot(x_2-1)=4,\] with the obvious solutions \( x_1=5, x_2=2; x_1=3, x_2=3\). The only remaining case to consider is \(x_3=2\), as any other case forces  \( \sum_{i=1}^{n} x_i < \prod_{i=1}^{n} x_i \). For this case our equality becomes \[\left( 2 x_1 - 1\right)\cdot \left( 2 x_2 - 1\right) = 9.\] This gives us the remaining solution \(x_1 = x_2 = x_3 = 2\) with the other \(x_i = 1\).

This is quite efficient. For example, for the case \(n=100\) the only equations we need to consider are \begin{aligned} (x_1-1)\cdot(x_2-1) &= 99\\
                         (2 x_1-1)\cdot(2 x_2-1) &= 199\\
                         (3 x_1-1)\cdot(3 x_2-1) &= 301\\
                         (4 x_1-1)\cdot(4 x_2-1) &= 405\\
                         (4 x_1-1)\cdot(4 x_2-1) &= 401\\
                         (6 x_1-1)\cdot(6 x_2-1) &= 607\\
                         (9 x_1-1)\cdot(9 x_2-1) &= 919\\
                         (8 x_1-1)\cdot(8 x_2-1) &= 809\\
                         (12 x_1-1)\cdot(12 x_2-1) &= 1225\\
                         (16 x_1-1)\cdot(16 x_2-1) &= 1633.
\end{aligned} Now 199, 401, 607, 919, 809 are all prime ruling them out immediately, and 301 and 1633 don't have factors of the right form. The other equations yield the five solutions \( (100,2), (34,4), (12,10), (7,4,4), (3,3,2,2,2) \) with the other \(x_i = 1\).

For \(n = 1000 \) you'd need to consider 52 cases, but most of these are eliminated immediately. I get the six solutions \( (1000,2), (334,4), (112,10), (38,28), (67,4,4), (16,4,4,4) \).

Have fun!