tag:blogger.com,1999:blog-9149402429183581490Sun, 01 Oct 2017 09:28:07 +0000The Angry Statisticianhttp://angrystatistician.blogspot.com/noreply@blogger.com (Christopher Long)Blogger72125tag:blogger.com,1999:blog-9149402429183581490.post-1943698674891911964Sun, 14 May 2017 03:02:00 +00002017-05-13T21:32:42.221-07:00Poisson Games and Sudden-Death OvertimeLet's say we have a game that can be reasonably modeled as two independent Poisson processes with team \(i\) having parameter \(\lambda_i\). If one team wins in regulation with team \(i\) scoring \(n_i\), then it's well-known we have the MLE estimate \(\hat{\lambda_i}=n_i\). But what if the game ends in a tie in regulation with each team scoring \(n\) goals and we have sudden-death overtime? How does this affect the MLE estimate for the winning and losing teams?<br /><br />Assuming without loss of generality that team \(1\) is the winner in sudden-death overtime. As we have two independent Poisson processes, the probability of this occurring is \(\frac{\lambda_1}{\lambda_1 + \lambda_2}\). Thus, the overall likelihood we'd like to maximize is \[L = e^{-\lambda_1} \frac{{\lambda_1}^n}{n!} e^{-\lambda_2} \frac{{\lambda_2}^n}{n!} \frac{\lambda_1}{\lambda_1 + \lambda_2}.\] Letting \(l = \log{L}\) we get \[l = -{\lambda_1} + n \log{\lambda_1} - {\lambda_2} + n \log{\lambda_2} - 2 \log{n!} + \log{\lambda_1}-\log({\lambda_1 + \lambda_2}).\] This gives \[\begin{equation}<br />\eqalign{<br />\frac{\partial l}{\partial \lambda_1} &= -1+\frac{n}{\lambda_1}+\frac{1}{\lambda_1}+\frac{1}{\lambda_1 + \lambda_2}\\<br />\frac{\partial l}{\partial \lambda_2} &= -1+\frac{n}{\lambda_2}+\frac{1}{\lambda_1 + \lambda_2}.<br />}<br />\end{equation}\] Setting both partials equal to \(0\) and solving, we get \[\begin{equation}<br />\eqalign{<br />(n-\hat{\lambda_1})(\hat{\lambda_1}+\hat{\lambda_2})+\hat{\lambda_2} &= 0\\<br />(n-\hat{\lambda_2})(\hat{\lambda_1}+\hat{\lambda_2})-\hat{\lambda_2} &= 0,<br />}<br />\end{equation}\] and so \[\begin{equation}<br />\eqalign{<br />\hat{\lambda_1} &= (n+1) \frac{2n}{2n+1}\\<br />\hat{\lambda_2} &= n \frac{2n}{2n+1}.<br />}<br />\end{equation}\] For example, if both teams score \(3\) goals in regulation and team \(1\) wins in sudden-death overtime, our MLE estimates are \(\hat{\lambda_1} = 3\frac{3}{7}, \hat{\lambda_2} = 2\frac{4}{7}\).<br /><br />Intuitively this makes sense, because \(2n\) goals were scored in regulation time, hence we "expect" that the overtime goal occurred around a fraction \(\frac{1}{2n}\) of regulation, so team \(1\) scored \(n+1\) goals in about \(\frac{2n+1}{2n}\) regulation periods and team \(2\) scored \(n\) goals in about \(\frac{2n+1}{2n}\) regulation periods. The standard Poisson process MLE estimates here coincide with the estimates we derived above.<br /><br />Does this work in practice? Yes! I tested it on my NCAA men's lacrosse model, and it increased the out-of-sample testing accuracy by 0.5%. Surprisingly large for such a small change!http://angrystatistician.blogspot.com/2017/05/poisson-games-and-sudden-death-overtime.htmlnoreply@blogger.com (Christopher Long)1tag:blogger.com,1999:blog-9149402429183581490.post-7242256171179291742Sun, 02 Apr 2017 04:59:00 +00002017-04-01T22:09:18.663-07:00Why does Kaggle use Log-loss?If you're not familiar with <a href="https://www.kaggle.com/" target="_blank">Kaggle</a>, it's an organization dedicated to data science competitions to both provide ways for companies to potentially do analytics at less cost, as well as to identify talented data scientists.<br /><br />Competitions are scored using a variety of functions, and the most common for binary classification tasks with confidence is something called log-loss, which is essentially \(\sum_{i=1}^{n} p_i\cdot\log(p_i)\), where \(p_i\) is your model's claimed confidence for test data point \(i\)'s correct label. Why does Kaggle use this scoring function? Here I'll follow <a href="https://terrytao.wordpress.com/2016/06/01/how-to-assign-partial-credit-on-an-exam-of-true-false-questions/" target="_blank">Terry Tao's argument</a>.<br /><br />Ideally what we'd like is a scoring function \(f(x)\) that yields the maximum expected score precisely when the claimed confidence \(x_i\) in the correct label for \(i\) is actually what the submitter believes is the true probability (or frequency) of that outcome. This means that we want \[L(x)=p\cdot f(x) + (1-p)\cdot f(1-x)\] for fixed \(p\) to be maximized when \(x=p\). Differentiating, this means \[L'(x) = p\cdot f'(x) - (1-p)\cdot f'(1-x) = 0\] when \(x=p\), hence \(p\cdot f'(p) = (1-p)\cdot f'(1-p)\) for all \(p\). This will be satisfied by any admissible \(f(x)\) with \(x\cdot f'(x)\) symmetric around \(x=\frac{1}{2}\), but if we extend our analysis to multinomial outcomes we get the stronger conclusion that in fact \(x\cdot f'(x) = c_0\) for some constant \(c_0\). This in turn implies \(f(x)=c_0\cdot \log(x)+c_1\). If we want \(f(1/2)=0\) and \(f(1)=1\), we end up with \(f(x)={\log}_2(2x)\) and the expected score is \[L(x)=x\cdot {\log}_2(2x) + (1-x)\cdot {\log}_2(2(1-x)).\]http://angrystatistician.blogspot.com/2017/04/why-does-kaggle-use-log-loss.htmlnoreply@blogger.com (Christopher Long)3tag:blogger.com,1999:blog-9149402429183581490.post-513160205354158210Sat, 01 Apr 2017 02:43:00 +00002017-03-31T19:45:41.026-07:00The Kelly Criterion and a Sure ThingThe <a href="https://en.wikipedia.org/wiki/Kelly_criterion" target="_blank">Kelly Criterion</a> is an alternative to standard utility theory, which seeks to maximize expected utility. Instead, the Kelly Criterion seeks to maximize expected <i>growth</i>. That is, if we start out with an initial bankroll \(B_0\), we seek to maximize \(\mathrm{E}[g(t)]\), where \(B_t = B_0\cdot e^{g(t)}\).<br />As a simple example, consider the following choice. We can have a sure $3000, or we can take the gamble of a \(\frac{4}{5}\) chance of $4000 and a \(\frac{1}{5}\) chance of $0. What does Kelly say?<br />Assume we have a current bankroll of \(B_0\). After the first choice we have \(B_1 = B_0+3000\), which we can write as \[\mathrm{E}[g(1)] = \log\left(\frac{B_0+3000}{B_0}\right);\]for the second choice we have \[\mathrm{E}[g(1)] = \frac{4}{5} \log\left(\frac{B_0+4000}{B_0}\right).\]And so we want to compare \(\log\left(\frac{B_0+3000}{B_0}\right)\) and \(\frac{4}{5} \log\left(\frac{B_0+4000}{B_0}\right)\).<br />Exponentiating, we're looking for the positive root of \[{\left({B_0+3000}\right)}^5 - {B_0}\cdot {\left({B_0+4000}\right)}^4=0.\]<a href="https://www.wolframalpha.com/input/?i=solve+(b_0%2B3000)%5E5-(b_0)*(b_0%2B4000)%5E4%3D0" target="_blank">Wolfram Alpha</a> now tells us that we should go with the sure thing if \(B_0 < $4942.92\), and take the gamble otherwise.http://angrystatistician.blogspot.com/2017/03/the-kelly-criterion-and-sure-thing.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-7287905824139329105Sat, 01 Apr 2017 01:33:00 +00002017-03-31T18:39:12.164-07:00Prime Divisors of \(3^{32}-2^{32}\)Find four prime divisors < 100 for \(3^{32}-2^{32}\).<br />Source: British Math Olympiad, 2006.<br /><br />This factors nicely as \(3^{32}-2^{32} = \left(3^{16}+2^{16}\right)\left(3^{16}-2^{16}\right)\), and we can continue factoring in this way to get \[3^{32}-2^{32} = \left(3^{16}+2^{16}\right)\left(3^8+2^8\right)\left(3^4+2^4\right)\left(3^2+2^2\right)\left(3^2-2^2\right).\]The final three terms are \(5, 13, 97\), so we have three of the four required primes. For another prime divisor, consider \(3^{16}-2^{16}\). By <a href="https://en.wikipedia.org/wiki/Fermat%27s_little_theorem">Fermat's Little Theorem</a> \(a^{16}-1\equiv 0 \bmod 17\) for all \(a\) with \((a,17)=1\), and so it follows that \(3^{16}-2^{16}\equiv 0 \bmod 17\), and we therefore have \(17\) as a fourth such prime divisor.<br />Alternatively, note \( \left(\dfrac{3}{17}\right)=-1, \left(\dfrac{2}{17}\right)=1\), hence by <a href="https://en.wikipedia.org/wiki/Euler%27s_criterion">Euler's Criterion</a> \(3^8\equiv -1 \bmod 17\) and \(2^8\equiv 1 \bmod 17\), giving \(3^8+2^8\equiv 0\bmod 17\).http://angrystatistician.blogspot.com/2017/03/prime-divisors-of-332-232.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-909880114773204542Fri, 31 Mar 2017 00:48:00 +00002017-03-30T17:48:03.201-07:00Highest Powers of 3 and \(\left(1+\sqrt{2}\right)^n\)Let \(\left(1+\sqrt{2}\right)^{2012}=a+b\sqrt{2}\), where \(a\) and \(b\) are integers. What is the greatest common divisor of \(b\) and \(81\)?<br />Source: 2011-2012 SDML High School 2a, problem 15.<br /><br />Let \((1+\sqrt{2})^n = a_n + b_n \sqrt{2}\). I've thought about this some more, and there's a nice way to describe the highest power of \(3\) that divides \(b_n\). This is probably outside of the scope of the intended solution, however.<br /><br />First note that \((1-\sqrt{2})^n = a_n - b_n \sqrt{2}\), and so from \((1+\sqrt{2})(1-\sqrt{2})=-1\) we get \((1+\sqrt{2})^n (1-\sqrt{2})^n = {(-1)}^n\). This gives \[{a_n}^2 - 2 {b_n}^2 = {(-1)}^n.\] Now define the highest power of a prime \(p\) that divides \(n\) to be \(\operatorname{\nu}_p(n)\).<br />From cubing and using the above result it's straightforward to prove that if \(\operatorname{\nu}_3(b_n) = k > 0\) then \(\operatorname{\nu}_3(b_{3n}) = k+1\).<br />Note \((1+\sqrt{2})^4 = 17 + 12\sqrt{2} \equiv -1+3\sqrt{2} \pmod{3^2}\). Cubing and using the first formula as before, we can in fact show that \[(1+\sqrt{2})^{4\cdot 3^n} \equiv -1 + 3^{n+1}\sqrt{2} \pmod{3^{n+2}},\] and squaring we also have \[(1+\sqrt{2})^{8\cdot 3^n} \equiv 1 + 3^{n+1}\sqrt{2} \pmod{3^{n+2}}.\] Now assume \(\operatorname{\nu}_3(b_m) = k, \operatorname{\nu}_3(b_n) = l\) and \(k\neq l\). From the top formula if \(3 | b_i\) then \(3 \not{|} a_i\), and it follows that \[\operatorname{\nu}_3(b_{m+n}) = \min(k,l).\]Putting this all together, write \(n = 4\cdot m +k\), where \(0\leq k <4\). If \(k\neq 0\), then \(\operatorname{\nu}_3(b_{n}) = 0\). If \(k=0\), let the base-3 expansion of \(m\) be \(a_i \cdot 3^i + \ldots + a_0\). Then \[\operatorname{\nu}_3(b_{n}) = \min_{a_j \neq 0} j+1 .\]<br />For \(n=2012\), we have \(2012 = 4\cdot 503 = 4\cdot(2\cdot 3^5 + 3^2 + 2\cdot 3 + 2)\) and so \(\operatorname{\nu}_3(b_{2012})=1\). We don't actually need to compute the entire base-3 expansion for 503, of course; we only need to observe that it's not divisible by 3.<br /><br />For \(n=2016\), we have \(2016 = 4\cdot 504 = 4\cdot(2\cdot 3^5 + 2\cdot 3^2)\) and so \(\operatorname{\nu}_3(b_{2016})=3\).<br />http://angrystatistician.blogspot.com/2017/03/highest-powers-of-3-and-left1sqrt2rightn.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-7042197233498512178Thu, 30 Mar 2017 19:24:00 +00002017-03-30T12:28:18.843-07:00Sum of Two Odd Composite NumbersWhat is the largest even integer that cannot be written as the sum of two odd composite numbers? Source: <a href="https://artofproblemsolving.com/wiki/index.php/1984_AIME_Problems" target="_blank">AIME 1984</a>, problem 14.<br /><br />Note \(24 = 3\cdot 3 + 3\cdot 5\), and so if \(2k\) has a representation as the sum of even multiples of 3 and 5, say \(2k = e_3\cdot 3 + e_5\cdot 5\), we get a representation of \(2k+24\) as a sum of odd composites via \(2k+24 = (3+e_3)\cdot 3 + (5+e_5)\cdot 5\). But by the <a href="https://en.wikipedia.org/wiki/Coin_problem" target="_blank">Frobenius coin problem</a> every number \(k > 3\cdot 5 -3-5 = 7\) has such a representation, hence every number \(2k > 14\) has a representation as the sum of even multiples of 3 and 5. Thus every number \(n > 24+14=38\) has a representation as the sum of odd composites. Checking, we see that \(\boxed{38}\) has no representation as a sum of odd composites.http://angrystatistician.blogspot.com/2017/03/sum-of-two-odd-composite-numbers.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-4898052062903412786Sun, 19 Jun 2016 06:30:00 +00002016-06-19T03:54:04.610-07:00What's the Value of a Win?In a previous entry I demonstrated <a href="http://angrystatistician.blogspot.com/2016/06/a-simple-estimate-for-pythagorean.html">one simple way to estimate an exponent for the Pythagorean win expectation</a>. Another nice consequence of a Pythagorean win expectation formula is that it also makes it simple to estimate the run value of a win in baseball, the point value of a win in basketball, the goal value of a win in hockey etc.<br /><br />Let our Pythagorean win expectation formula be \[ w=\frac{P^e}{P^e+1},\] where \(w\) is the win fraction expectation, \(P\) is runs/allowed (or similar) and \(e\) is the Pythagorean exponent. How do we get an estimate for the run value of a win? The expected number of games won in a season with \(g\) games is \[W = g\cdot w = g\cdot \frac{P^e}{P^e+1},\] so for one estimate we only need to compute the value of the partial derivative \(\frac{\partial W}{\partial P}\) at \(P=1\). Note that \[ W = g\left( 1-\frac{1}{P^e+1}\right), \] and so \[ \frac{\partial W}{\partial P} = g\frac{eP^{e-1}}{(P^e+1)^2}\] and it follows \[ \frac{\partial W}{\partial P}(P=1) = \frac{ge}{4}.\] Our estimate for the run value of a win now follows by setting \[\frac{\Delta W}{\Delta P} = \frac{ge}{4} \] giving \[ \Delta W = 1 = \frac{ge}{4} \Delta P.\] What is \(\Delta P\)? Well \(P = R/A\), where \(R\) is runs scored over the season and \(A\) is runs allowed over the season. We're assuming this is a league average team and asking how many more runs they'd need to score to win an additional game, so \(A\) is actually fixed at \(L\), the league average number of runs scored (or allowed). This gives us \[1 = \frac{ge}{4} \Delta P = \frac{ge\Delta R}{4L}.\] Now \(L/g = l\), the league average runs per game, so we arrive at the estimate \[\Delta R = \frac{4l}{e}.\] In the specific case of MLB, we have \(e = 1.8\) and \(l = 4.3\), giving that a win is approximately \(\Delta R = 9.56\) runs.<br /><br />Bill James originally used the exponent \(e=2\); in this case the formula simplifies to \(\Delta R = 2l\), i.e. we get the particularly simple result that a win is equal to approximately twice the average number of runs scored per game.<br /><br />Applying this estimate to the NBA, a win is approximately \( \Delta R = \frac{4\cdot 101}{16.4} = 24.6\) points. Similarly, we get the estimates for a win of 4.5 goals for the NHL and 5.1 goals for the Premier League.http://angrystatistician.blogspot.com/2016/06/whats-value-of-win.htmlnoreply@blogger.com (Christopher Long)2tag:blogger.com,1999:blog-9149402429183581490.post-7009974848766785430Wed, 08 Jun 2016 23:05:00 +00002016-06-08T16:05:06.375-07:00A Simple Estimate for Pythagorean ExponentsGiven the number of runs scored and runs allowed by a baseball team, what's a good estimate for that team's win fraction? Bill James famously came up with what he called the "<a href="https://en.wikipedia.org/wiki/Pythagorean_expectation">Pythagorean expectation</a>" \[w = \frac{R^2}{R^2 + A^2},\] which can also be written as \[w = \frac{{(R/A)}^2}{{(R/A)}^2 + 1}.\] More generally, if team \(i\) scores \(R_i\) and allows \(A_i\) runs, the Pythagorean estimate for the probability of team \(1\) beating team \(2\) is \[w = \frac{{(R_1/A_1)}^2}{{(R_1/A_1)}^2 + (R_2/A_2)^2}.\] We can see that the estimate of the team's win fraction is a consequence of this, as an average team would by definition have \(R_2 = A_2\). Now, there's nothing magical about the exponent being 2; it's a coincidence, and in fact is not even the "best" exponent. But what's a good way to estimate the exponent? Note the structural similarity of this win probability estimator and the Bradley-Terry estimator \[ w = \frac{P_1}{P_1+P_2}.\] Here the \(P_i\) are what we could call the "Bradley-Terry power" of the team. This immediately suggests one way to estimate the expectation model's exponent - fit a Bradley-Terry model, then fit the log-linear regression \(\log(P_i)\) vs \(\log(R_i/A_i)\). The slope of this regression will be one estimate for the expectation exponent.<br /><br />How well does this work? I get <a href="https://github.com/octonion/lunchtime/blob/master/pythagorean/mlb_btl.txt">1.727 for MLB in 2014</a>. The R code and data files for MLB and other sports may be found in my <a href="https://github.com/octonion/lunchtime/tree/master/pythagorean">GitHub repo</a>.http://angrystatistician.blogspot.com/2016/06/a-simple-estimate-for-pythagorean.htmlnoreply@blogger.com (Christopher Long)1tag:blogger.com,1999:blog-9149402429183581490.post-3609641838826143688Tue, 03 May 2016 05:40:00 +00002016-05-02T22:40:14.420-07:00Behind the SpeadsheetIn the book <a href="http://www.amazon.com/Only-Rule-Has-Work-Experiment-ebook/dp/B016IBVN6Y">"The Only Rule Is It Has to Work: Our Wild Experiment Building a New Kind of Baseball Team"</a>, Ben Lindbergh and Sam Miller recount a grand adventure to take command of an independent league baseball team, with the vision of trying every idea, sane or crazy, in an attempt to achieve a winning edge. Five infielders, four outfielders, defensive shifts, optimizing lineups - everything.<br /><br />It was really an impossible task. Professional sports at every level are filled with highly accomplished and competitive athletes, with real lives and real egos. Now imagine walking in one day and suddenly trying to convince them that they should be doing things differently. Who do you think you are?<br /><br />I was one of the analysts who helped Ben and Sam in this quest, and I wanted to write some thoughts down from my own perspective, not as one of the main characters, but as someone more behind the scenes. These are some very short initial thoughts only, but I'd like to followup with some more ideas on where things went wrong from my perspective, and also how independent league teams can better identify roster talent from some non-traditional sources.<br /><br />My focus was on attempting to identify talent overlooked in the MLB draft. This is extremely challenging; there are 30 teams, 40 standards rounds plus other picks. Furthermore, among those players left, many sign as amateur free agents post-draft. You're left with players from lower divisions, very small schools, 23-year-old seniors, bad bodies, soft tossers, poor defenders, etc. But, still, there may be players who aren't good MLB prospects, but who could still perform well as part of an independent league team.<br /><br />Looking at top framing college catchers was a bust; this is a premium defensive position and very little is overlooked.<br /><br />Among the undrafted senior hitters and pitchers there were several potential prospects, many of whom you'll read about in the book. The most important fact to keep in mind is that these are real people with real lives, real families and real hopes and dreams, and playing independent ball isn't nearly lucrative enough to pay the bills. Harsh reality will limit your pool even more, and those who choose to pursue it will face the additional stress of financial strain.<br /><br />That being said, was Ben and Sam's experiment a success? You'll have to read the book, but absolutely, some talent was found.http://angrystatistician.blogspot.com/2016/05/behind-speadsheet.htmlnoreply@blogger.com (Christopher Long)2tag:blogger.com,1999:blog-9149402429183581490.post-5461723911114364192Wed, 10 Feb 2016 04:16:00 +00002016-02-09T20:16:05.287-08:00When is a Lead Safe in the NBA?Assuming two NBA teams of equal strength with \(t\) seconds remaining, what is a safe lead at a prescribed confidence level? Bill James has a <a href="http://www.slate.com/articles/sports/sports_nut/2008/03/the_lead_is_safe.html">safe lead formula for NCAA basketball</a>, and the topic has been addressed by other researchers at various levels of complexity, e.g. <a href="http://arxiv.org/abs/1503.03509">Clauset, Kogan and Redner</a>.<br /><br />I'll present a simple derivation. Start by observing there are about 50 scoring groups per team per game (scoring groups include all baskets and free throws that occur at the same time), with each scoring group worth about two points. Assume scoring events by team are Poisson distributed with parameter \(\lambda = \frac{50\cdot t}{48\cdot 60} = \frac{t}{57.6}\). Using a normal approximation, the difference of these two distributions is normal with mean 0 and variance \(\sqrt{2}\lambda\), giving a standard deviation of \(0.1863\sqrt{t}\).<br /><br />Using this approximation, what is a 90% safe lead? A 90% tail is 1.28 standard deviations, \(1.28\cdot 0.1863\sqrt{t} = 0.2385\sqrt{t}\) scoring groups. As a scoring group is about two points, this means a 90% safe lead, assuming a jump ball, is about \(0.477\sqrt{t}\) points (Clauset et. al. obtained \(0.4602\sqrt{t}\)). For example, a safe lead at halftime is approximately \(0.477 \sqrt{24\cdot 60} = 18.1\) points.<br /><br />Next - adjustments for possession arrow and shot clock time; validity of approximation; adjusting for team strengths.http://angrystatistician.blogspot.com/2016/02/when-is-lead-safe-in-nba.htmlnoreply@blogger.com (Christopher Long)1tag:blogger.com,1999:blog-9149402429183581490.post-2344669656897553681Thu, 15 Oct 2015 08:48:00 +00002015-10-15T03:40:15.810-07:00An Enormous Number of KilogramsFor years the kilogram has been defined with respect to a platinum and iridium cylinder, but this is now <a href="http://www.nature.com/news/kilogram-conflict-resolved-at-last-1.18550">no longer the case</a>. Here's a puzzle about kilograms that's easy to state and understand, but the answer is very, very surprising.<br /><br />I've always had a fascination with really large numbers. First 100 when I was really little, and as I got older and more sophisticated numbers like a <a href="https://en.wikipedia.org/wiki/Googol">googol</a> and the smallest number that satisfies the conditions of the <a href="https://en.wikipedia.org/wiki/Archimedes%27_cattle_problem">Archimedes cattle problem</a>.<br /><br />When I was an undergraduate I interviewed for a summer internship with an insurance company as an actuarial student. They gave me the following puzzle - what's the smallest number that when you move the last digit to the front it multiplies by 2? I calculated for a little while, then said "This can't be right, my answer has 18 digits!". It turns out that the smallest solution does, indeed, have 18 digits.<br /><br />We can see this by letting our \((n+1)\)-digit number \( x = 10 m + a\), where \(m\) is an \(n\)-digit number and \(0\leq a < 10\). Moving \(a\) to the front we get \(y = 10^n a + m\), and our requirement is \(y = 2x\). This gives: \begin{align} 20 m + 2 &= 10^n a + m \\ 19 m &= a(10^n-2) \\ m &= \frac{2a(5\cdot 10^{n-1} - 1)}{19} \end{align} The smallest \(m\), if one exists, requires \(a,n\) such that 19 divides \(5\cdot 10^{n-1}-1\) (as 19 can't divide \(2 a\)) and the result has \(n\)-digits. It's easy to check that the smallest value of \(n\) that satisfies the first condition is \(n=17\). To get the smallest solution we try \(a=1\), but this yields a value with only 16 digits. Setting \(a=2\), however, yields the 17-digit \(m = 10526315789473684\). The smallest solution to our puzzle is therefore the 18-digit number \(105263157894736842\); that's surprisingly large. <p\><br /><br />Numbers with this type of property are known as <a href="https://en.wikipedia.org/wiki/Parasitic_number">parasitic numbers</a>.<br /><br />Later, I wondered if there were numbers with the slightly different, but equally interesting property, that moving the last digit to the front converted ("autoconverts") it from a value under one unit of measurement to an equivalent value under a different unit of measurement.<br /><br />The first one I tried was moving the last digit to the front converts from Celsius to Fahrenheit. This is a fun puzzle that eventually made its way into the <a href="http://tierneylab.blogs.nytimes.com/2009/11/16/monday-puzzle-conversion-factors/">New York Times</a>. The smallest such value is 275 C, which exactly equals 527 F. What's the next smallest temperature?<br /><br />How about moving the first digit to the end? We'll need to use the little-known fact that, legally, a <a href="https://en.wikipedia.org/wiki/Pound_(mass)">pound is exactly equal to 0.45359237 kilograms</a>. Given this, does there exist a number such that moving the first digit to the end converts from pounds to kilograms? The answer is yes, but the smallest solution has 108,437,840 digits! The solution is similar to the above, but as it's computationally more involved I've written <a href="https://en.wikipedia.org/wiki/SageMath">Sage</a> code to solve it, which you can find in my <a href="https://github.com/octonion/puzzles/tree/master/parasitic">GitHub puzzles repository</a>.<br /><br />The smallest number that autoconverts from gallons to liters, incidentally, is even bigger at 382,614,539 digits!http://angrystatistician.blogspot.com/2015/10/an-enormous-number-of-kilograms.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-1759380959914215633Sun, 11 Oct 2015 00:25:00 +00002015-10-10T17:25:28.315-07:00Solving a Math Puzzle using PhysicsThe following math problem, which appeared on a Scottish maths paper, has been <a href="http://gizmodo.com/can-you-solve-the-math-problem-that-stumped-most-scotti-1735604246">making the internet rounds</a>.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-1Q63s68Tj0Q/VhmsVliOXmI/AAAAAAAADTE/tGn2BHhRuVQ/s1600/1466229790313874095.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-1Q63s68Tj0Q/VhmsVliOXmI/AAAAAAAADTE/tGn2BHhRuVQ/s400/1466229790313874095.png" /></a></div><br />The first two parts require students to interpret the meaning of the components of the formula \(T(x) = 5 \sqrt{36+x^2} + 4(20-x) \), and the final "challenge" component involves finding the minimum of \( T(x) \) over \( 0 \leq x \leq 20 \). Usually this would require a differentiation, but if you know <a href="https://en.wikipedia.org/wiki/Snell%27s_law">Snell's law</a> you can write down the solution almost immediately. People normally think of Snell's law in the context of light and optics, but it's really a statement about least time across media permitting different velocities.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-JDqJpTkDKdE/VhmmQcVJT8I/AAAAAAAADS0/po9hMnRXoJ8/s1600/Picture0.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-JDqJpTkDKdE/VhmmQcVJT8I/AAAAAAAADS0/po9hMnRXoJ8/s320/Picture0.jpg" /></a></div><br />One way to phrase Snell's law is that least travel time is achieved when \[ \frac{\sin{\theta_1}}{\sin{\theta_2}} = \frac{v_1}{v_2},\] where \( \theta_1, \theta_2\) are the angles to the normal and \(v_1, v_2\) are the travel velocities in the two media.<br /><br />In our puzzle the crocodile has an implied travel velocity of 1/5 in the water and 1/4 on land. Furthermore, the crocodile travels along the riverbank once it hits land, so \( \theta_2 = 90^{\circ}\) and \(\sin{\theta_2} = 1\). Snell's law now says that the path of least time satisfies \[ \sin{\theta_1} = \frac{x}{\sqrt{36+x^2}} = \frac{4}{5},\] giving us \( 25x^2 = 16x^2 + 24^2\). Solving, \( 3^2 x^2 = 24^2, x^2 = 8^2\) and the solution is \(x = 8\).http://angrystatistician.blogspot.com/2015/10/solving-math-puzzle-using-physics.htmlnoreply@blogger.com (Christopher Long)1tag:blogger.com,1999:blog-9149402429183581490.post-8224978452592947571Mon, 05 Oct 2015 02:31:00 +00002015-10-05T20:23:03.074-07:00Mixed Models in R - Bigger, Faster, StrongerWhen you start doing more advanced sports analytics you'll eventually starting working with what are known as <a href="https://en.wikipedia.org/wiki/Mixed_model">hierarchical, nested or mixed effects models</a>. These are models that contain both <a href="https://en.wikipedia.org/wiki/Fixed_effects_model">fixed</a> and <a href="https://en.wikipedia.org/wiki/Random_effects_model">random effects</a>. There are <a href="http://andrewgelman.com/2005/01/25/why_i_dont_use/">multiple ways of defining fixed vs random random effects</a>, but one way I find particularly useful is that random effects are being "predicted" rather than "estimated", and this in turn involves some "shrinkage" towards the mean.<br /><br />Here's some R code for NCAA ice hockey power rankings using a nested Poisson model (which can be found in my <a href="https://github.com/octonion/hockey" target="_blank">hockey GitHub repository</a>):<br /><pre>model <- gs ~ year+field+d_div+o_div+game_length+(1|offense)+(1|defense)+(1|game_id)<br />fit <- glmer(model,<br /> data=g,<br /> verbose=TRUE,<br /> family=poisson(link=log)<br /> )<br /></pre>The fixed effects are <b>year</b>, <b>field</b> (home/away/neutral), <b>d_div</b> (NCAA division of the defense), <b>o_div</b> (NCAA division of the offense) and <b>game_length</b> (number of overtime periods); <b>offense</b> (strength of offense), <b>defense</b> (strength of defense) and <b>game_id</b> are all random effects. The reason for modeling team offenses and defenses as random vs fixed effects is that I view them as random samples from the same distribution. As mentioned above, this results in <a href="https://en.wikipedia.org/wiki/Shrinkage_(statistics)">statistical shrinkage</a> or "regression to the mean" for these values, which is particularly useful for partially completed seasons. <p/>One of the problems with large mixed models is that they can be very slow to fit. For example, the model above takes several hours on a 12-core workstation, which makes it very difficult to test model changes and tweaks. Is there any way to speed up the fitting process? Certainly! One way is to add two options to the above code: <pre>fit <- glmer(model,<br /> data=g,<br /> verbose=TRUE,<br /> family=poisson(link=log),<br /> nAGQ=0,<br /> control=glmerControl(optimizer = "nloptwrap")<br /> )<br /></pre>What do these do? Model fitting is an optimization process. Part of that process involves the estimation of particular integrals, which can be very slow; the option "nAGQ=0" tells glmer to ignore estimating those integrals. For some models this has minimal impact on parameter estimates, and this NCAA hockey model is one of those. The second option tells glmer to fit using the "nloptwrap" optimizer (there are several other optimizers available, too), which tends to be faster than the default optimization method.<br /><br />The impact can be rather startling. With the default options the above model takes about 3 hours to fit. Add these two options, and the model fitting takes 30 seconds with minimal impact on the parameter estimates, or approximately 400 times faster.http://angrystatistician.blogspot.com/2015/10/mixed-models-in-r-bigger-faster-stronger.htmlnoreply@blogger.com (Christopher Long)4tag:blogger.com,1999:blog-9149402429183581490.post-7371474520122350280Fri, 02 Oct 2015 06:14:00 +00002015-10-02T14:18:15.530-07:00Elo's Rating System as a Forgetful Logistic Model<a href="https://en.wikipedia.org/wiki/Elo_rating_system" target="_blank">Elo's rating system</a> became famous from its use in chess, but it and variations are now used in <a href="http://fivethirtyeight.com/datalab/introducing-nfl-elo-ratings/" target="_blank">sports like the NFL</a> to <a href="http://leagueoflegends.wikia.com/wiki/Elo_rating_system" target="_blank">eSports like League of Legends</a>. It also was infamously used on various "Hot or Not" type websites, as shown in this scene from the movie "Social Network":<br /><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="YOUTUBE-iframe-video" data-thumbnail-src="https://i.ytimg.com/vi/BzZRr4KV59I/0.jpg" frameborder="0" height="266" src="https://www.youtube.com/embed/BzZRr4KV59I?feature=player_embedded" width="320"></iframe></div><div><br /></div><div>Of course, there's a mistake in the formula in the movie!</div><div><br /></div><div>What is the Elo rating system? As originally proposed, it presumes that if two players A and B have ratings \(R_A\) and \(R_B\), then the expected score of player A is \[\frac{1}{1+10^{\frac{R_B-R_A}{400}}}.\] Furthermore, if A has a current rating of \(R_A\) and plays some more games, then the updated rating \({R_A}'\) is given by \({R_A}' = R_A + K(S_A-E_A)\), where \(K\) is an adjustment factor, \(S_A\) is the number of points scored by A and \(E_A\) was the expected number of points scored by A based on the rating \(R_A\).</div><div><br /></div><div>Now, the expected score formula given above has the same form as a <a href="https://en.wikipedia.org/wiki/Logistic_regression" target="_blank">logistic regression model</a>. What's the connection between the two? One answer is that Elo's rating system is a type of <a href="https://en.wikipedia.org/wiki/Online_algorithm" target="_blank">online</a> version of a logistic model. An online algorithm is an algorithm that only sees each piece of data once. As applied to a statistical model, it's a model with parameter estimates that are updated as new data comes in, but not refitting on the entire data set. It can also be considered a <a href="https://en.wikipedia.org/wiki/Memorylessness" target="_blank">memoryless</a> model; it has "forgotten" the old data and only knows the current parameter estimates. The appeal of such models is that they're extremely efficient, can operate on enormous data sets and parameter estimates can be updated in real-time.</div><div><br /></div><div>Okay, let's say we have a "forgetful" logistic model. Can we derive an updating rule, and does it look like Elo's? I'm going to give one possible derivation under the simplifying assumption that games are won or lost, with no ties.</div><div><br /></div><div>We don't know how many games A and B had previously played, so let's assume they both had previously played \(n\) games and have just played \(m\) additional games between them, with A scoring \(S_A\) points. That means they've both played \(n+m\) games, but we're just going to forget this again, so let's adjust everything so that they end up with \(n\) games. One way to do this is to <a href="https://en.wikipedia.org/wiki/Normalization_(statistics)" target="_blank">normalize</a> \(n\) and \(m\) so that they sum to \(n\), thus \(n\) becomes \(n\frac{n}{n+m}\) and \(m\) becomes \(m\frac{n}{n+m}\).</div><div><br /></div><div>We're now assuming they had each played \(n\frac{n}{n+m}\) games in the past, have just played \(m\frac{n}{n+m}\) additional games and A scored \(S_A \frac{n}{n+m}\) points (it has to be adjusted, too!) in those games.</div><div><br /></div><div>Again, we're memoryless, so we don't know how strong the opponents were that each had played in the past, so we're going to assume that they had both played opponents that were as strong as themselves and had won half and lost half of those games. After all, people generally prefer to play competitive games.</div><div><br /></div>Define \(d\) by \({R_A}' = R_A + d\) and let \(c = R_A - R_B\); we also require that \({R_B}' = R_B - d\). The <a href="https://en.wikipedia.org/wiki/Likelihood_function#Log-likelihood" target="_blank">log-likelihood</a> \(L\) of A having scored \(S_A \frac{n}{n+m}\) points is \[ \frac{-2 n^2}{n+m}\log(1+e^{-d}) -\frac{n^2}{n+m}d-\frac{m n}{n+m}\log(1+e^{-c-2d}) - \frac{(m-S_A)n(c+2d)}{n+m}.\] Factoring out the constant term \(n/(n+m)\) simplifies this to \[ L = -2 n\log(1+e^{-d}) - n d - m \log(1+e^{-c-2d}) - (m-S_A)(c+2d).\] Taking the partial derivative of \(L\) with respect to \(d\) we get<br />\begin{align}<br />\frac{\partial L}{\partial d} &= 2n \frac{e^{-d}}{1+e^{-d}} -n + 2m \frac{e^{-c-2d}}{1+e^{-c-2d}}-2(m-S_A) \\<br />&= -n\frac{1-e^{-d}}{1+e^{-d}} + 2 S_A - 2m\frac{1}{1+e^{-c-2d}} \\<br />&= -n\tanh(d/2) + 2 S_A - 2m\frac{1}{1+e^{-c-2d}}.<br />\end{align} What is \( m\frac{1}{1+e^{-c-2d}} \)? This is actually just \( {E_A}' \), the expected score for A when playing B for \(m\) games, but assuming the updated ratings for both players. Finally, setting \(\frac{\partial L}{\partial d} = 0\), we get \[ n\tanh(d/2) = 2(S_A - {E_A}')\] and hence \[ \tanh(d/2) = \frac{2}{n} (S_A - {E_A}').\] Assuming \(n\) is large relative to \(S_A - {E_A}'\), we have \( \tanh(d/2) \approx d/2\) and \( {E_A}' \approx E_A \). This is Elo's updating rule in the form \[ d = \frac{4}{n} (S_A - E_A ).\] If we rescale with the constant \( \sigma \), the updating rule becomes \[ d = \frac{4\sigma }{n} (S_A - E_A ).\] We also now see that the adjustment factor \( K = \frac{4\sigma }{n}\).http://angrystatistician.blogspot.com/2015/10/elos-rating-system-as-forgetful.htmlnoreply@blogger.com (Christopher Long)2tag:blogger.com,1999:blog-9149402429183581490.post-3689344049108158170Sun, 12 Jul 2015 02:54:00 +00002015-07-11T20:08:13.781-07:00Power Rankings: Looking at a Very Simple MethodOne of the simplest and most common power ranking models is known as the <a href="https://en.wikipedia.org/wiki/Bradley%E2%80%93Terry_model" target="_blank">Bradley-Terry-Luce model</a>, which is <a href="http://angrystatistician.blogspot.com/2013/03/baseball-chess-psychology-and.html" target="_blank">equivalent to other famous models</a> such the <a href="https://en.wikipedia.org/wiki/Logistic_regression" target="_blank">logistic model</a> and the <a href="https://en.wikipedia.org/wiki/Elo_rating_system" target="_blank">Elo rating system</a>. I'll be referring to "teams" here, but of course the same ideas apply to any two-participant game.<br /><br />Let me clarify what I mean when I use the term "power ranking". A power ranking supplies not only a ranking of teams, but also provides numbers that may be used to estimate the probabilities of various outcomes were two particular teams to play a match.<br /><br />In the BTL power ranking system we assume the teams have some latent (hidden/unknown) "strength" \(R_i\), and that the probability of \(i\) beating \(j\) is \( \frac{R_i}{R_i+R_j} \). Note that each \(R_i\) is assumed to be strictly positive. Where does this model structure come from?<br /><br />Here are three reasonable constraints for a power ranking model:<br /><ol><li> If \(R_i\) and \(R_j\) have equal strength, the probability of one beating the other should be \( \frac{1}{2}\).</li><li>As the strength of one team strictly approaches 0 (infinitely weak) with the other team fixed, the probability of the other team winning strictly increases to 1.</li><li>As the strength of one team strictly approaches 1 (infinitely strong) with the other team fixed, the probability of the other team winning strictly decreases to 0.</li></ol><div>Note that our model structure satisfies all three constraints. Can you think of other simple model structures that satisfy all three constraints?</div><div><br /></div><div>Given this model and a set of teams and match results, how can we estimate the \(R_i\). The <a href="https://en.wikipedia.org/wiki/Maximum_likelihood" target="_blank">maximum-likelihood estimators</a> are the set of \( R_i \) that maximizes the probability of the observed outcomes actually happening. For any given match this probability of team \( i \) beating team \( j \) is \( \frac{R_i}{R_i+R_j} \), so the overall probability of the observed outcomes of the matches \( M \) occurring is \[ \mathcal{L} = \prod_{m\in M} \frac{R_{w(m)}}{R_{w(m)}+R_{l(m)}},\] where \( w(m) \) is then winner and \( l(m) \) the loser of match \( m \). We can transform this into a sum by taking logarithms; \[ \log\left( \mathcal{L} \right) = \log\left(R_{w(m)}\right) - \log\left(R_{w(m)}+R_{l(m)}\right).\] Before going further, let's make a useful reparameterization by setting \( e^{r_i} = R_i \); this makes sense as we're requiring the \( R_i \) to be strictly positive. We then get \[ \log\left( \mathcal{L} \right) = r_{w(m)} - \log\left(e^{r_{w(m)}}+e^{r_{l(m)}}\right).\] Taking partial derivatives we get \begin{eqnarray*}<br />\frac{\partial \log\left( \mathcal{L} \right)}{\partial r_i} &=& \sum_{w(m)=i} 1 - \frac{e^{r_{w(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_{l(m)}}}{e^{r_{w(m)}}+e^{r_{l(m)}}}\\<br />&=& \sum_{w(m)=i} 1 - \frac{e^{r_i}}{e^{r_i}+e^{r_{l(m)}}} + \sum_{l(m)=i} - \frac{e^{r_i}}{e^{r_{w(m)}}+e^{r_i}}\\<br />&=&0.<br />\end{eqnarray*} But this is just the number of actual wins minus the expected wins! Thus, the maximum likelihood estimators for the \( r_i \) satisfy \( O_i = E_i \) for all teams \( i \), where \( O_i \) is the actual (observed) number of wins for team \( i \), and \( E_i \) is the expected number of wins for team \( i \) based on our model. That's a nice property!<br /><br />If you'd like to experiment with some actual data, and to see that the resulting fit does indeed satisfy this property, here's an <a href="https://github.com/octonion/hockey/tree/master/lunchtime" target="_blank">example BTL model using NCAA men's ice hockey scores</a>. You can, of course, actually use this property to iteratively solve for the MLE estimators \( R_i \). Note that you'll have to fix one of the \( R_i \) to be a particular value (or add some other constraint), as the model probabilities are invariant with respect to multiplication of the \( R_i \) by the same positive scalar.</div>http://angrystatistician.blogspot.com/2015/07/power-rankings-looking-at-very-simple.htmlnoreply@blogger.com (Christopher Long)1tag:blogger.com,1999:blog-9149402429183581490.post-8665826341097287811Sat, 11 Jul 2015 22:00:00 +00002015-07-11T15:00:38.036-07:00Getting Started Doing Baseball Analysis without CodingThere's lot of confusion about how best to get started doing baseball analysis. <a href="http://www.dataphoric.com/2015/06/27/learn_data_science_the_hard_way/" target="_blank">It doesn't have to be difficult!</a> You can start doing it right away, even if you don't know anything about R, Python, Ruby, SQL or machine learning (most GMs can't code). Learning these and other tools makes it easier and faster to do analysis, but they're only part of the process of constructing a well-reasoned argument. They're important, of course, because they can turn 2 months of hard work into 10 minutes of typing. Even if you don't like mathematics, statistics, coding or databases, they're mundane necessities that can make your life much easier and your analysis more powerful.<br /><br />Here are two example problems. You don't have to do these specifically, but they illustrate the general idea. Write up your solutions, then publish them for other people to make some (hopefully) helpful comments and suggestions. This can be on a blog or through a versioning control platform like <a href="https://github.com/" target="_blank">GitHub</a> (which is also great for versioning any code or data your use). Try to write well! A great argument, but poorly written and poorly presented isn't going to be very convincing. Once it's finished, review and revise, review and revise, review and revise. When a team you follow makes a move, treat it as a puzzle for you to solve. Why did they do it, and was it a good idea?<br /><ol><li>Pick a recent baseball trade. For example, the Padres traded catcher Yasmani Grandal for Dodgers outfielder Matt Kemp. It's never that simple of course; the Padres aren't paying all of Matt Kemp's salary. Find out what the salary obligations were for each club in this trade. Using your favorite public projection system, where were the projected surplus values for each player at the time of the trade? Of course, there were <a href="http://www.si.com/mlb/2014/12/18/matt-kemp-arthritic-hips-dodgers-padres-trade" target="_blank">other players</a> involved in that trade, too. What were the expected surplus values of those players? From the perspective of surplus values, who won or lost this trade? Finally, why do you think each team made this trade, especially considering that they were division rivals? Do you think one or both teams made any mistakes in reasoning; if so, what were they, and did the other team take advantage of those mistakes?</li><li>Pick any MLB team and review the <a href="http://mlb.mlb.com/mlb/events/draft/y2015/drafttracker.jsp" target="_blank">draft picks they made in the 2015 draft</a> for the first 10 rounds. Do you notice any trends or <a href="http://mlb.mlb.com/mlb/events/draft/y2014/drafttracker.jsp" target="_blank">changes from the 2014 draft</a>? Do these picks agree or disagree with the various public pre-draft player rankings? Which picks were designed to save money to help sign other picks? Identify those tough signs. Was the team actually able to sign them, and were the picks to save money still reasonably good picks? Do you best to identify which picks you thought were good and bad, write them down in a notebook with your reasoning, then check back in 6 months and a year. Was your reasoning correct? If not, what were your mistakes and how can you avoid making them in the future?</li></ol>http://angrystatistician.blogspot.com/2015/07/getting-started-doing-baseball-analysis.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-8962656961286759513Mon, 09 Mar 2015 00:27:00 +00002015-03-08T17:27:27.572-07:00Some Potentially Useful SQL Resources<br />Some potentially useful SQL resources - explanations, visualizations, exercises, games, classes.<br /><ol><li><a href="http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/" target="_blank">A Visual Explanation of SQL Joins</a></li><li><a href="http://datamonkey.pro/" target="_blank">Datamonkey</a></li><li><a href="https://my.vanderbilt.edu/cs265/" target="_blank">Introduction to Database Management Systems</a></li><li><a href="http://wwwlgis.informatik.uni-kl.de/cms/courses/informationssysteme/sqlisland/" target="_blank">SQL Island Adventure Game</a></li><li><a href="http://pgexercises.com/" target="_blank">PostgreSQL Exercises</a></li><li><a href="http://www.padjo.org/" target="_blank">Public Affairs Data Journalism</a></li><li><a href="https://github.com/rhc2104/sqlteaching" target="_blank">SQL Teaching's GitHub repo (if you're curious)</a></li><li><a href="https://class.stanford.edu/courses/DB/2014/SelfPaced/about" target="_blank">Stanford's Self-Paced Database MOOC</a></li><li><a href="http://hackr.io/tutorials/sql" target="_blank">Hackr.io's SQL Section (good to check occasionally)</a></li><li><a href="http://www.sql-ex.ru/" target="_blank">Practical skills of SQL language</a></li><li><a href="http://www.sqlteaching.com/" target="_blank">SQL Teaching (learn SQL in your browser)</a></li><li><a href="http://sqlzoo.net/wiki/Main_Page" target="_blank">SQLZOO - Interactive SQL Tutorial</a></li><li><a href="https://schemaverse.com/" target="_blank">The Schemaverse: a space-based strategy game implemented entirely within a PostgreSQL database</a></li><li><a href="http://blog.treasuredata.com/blog/2014/12/05/learn-sql-by-calculating-customer-lifetime-value-part-1/" target="_blank">Treasure Data: Learn SQL by Calculating Customer Lifetime Value</a></li></ol>http://angrystatistician.blogspot.com/2015/03/some-potentially-useful-sql-resources.htmlnoreply@blogger.com (Christopher Long)2tag:blogger.com,1999:blog-9149402429183581490.post-7862524207431434181Tue, 03 Mar 2015 08:56:00 +00002015-03-03T00:56:53.621-08:00Who Controls the Pace in Basketball, Offense or Defense?During a recent chat with basketball analyst <a href="https://twitter.com/sethpartnow" target="_blank">Seth Partnow</a>, he mentioned a topic that came up during a discussion at the recent <a href="http://www.sloansportsconference.com/" target="_blank">MIT Sloan Sports Analytics Conference</a>. Who controls the pace of a game more, the offense or defense? And what is the percentage of pace responsibility for each side? The analysts came up with a rough consensus opinion, but is there a way to answer this question analytically? I came up with one approach that examines the variations in possession times, but it suddenly occurred to me that this question could also be answered immediately by looking at the offense-defense asymmetry of the home court advantage.<div><br /></div><div>As you can see in the <a href="https://github.com/octonion/basketball-m/blob/master/ncaa/diagnostics/lmer.txt" target="_blank">R output of my NCAA team model code</a> in one of my <a href="https://github.com/octonion/basketball-m" target="_blank">public basketball repositories</a>, the offense at home scores points at a rate about \( e^{0.0302} = 1.031 \) times the rate on a neutral court, everything else the same. Likewise, the defense at home allows points at a rate about \( e^{-0.0165} = 0.984\) times the rate on a neutral court; in both cases the neutral court rate is the reference level. Notice the geometric asymmetry; \( 1.031\cdot 0.984 = 1.015 > 1\). The implication is that the offense is responsible for about the fraction \[ \frac{(1.031-1)}{(1.031-1)+(1-0.984)} = 0.66 \] of the scoring pace. That is, offensive controls 2/3 of the pace, defense 1/3 of the pace. The consensus opinion the analysts came up with at Sloan? It was 2/3 offense, 1/3 defense! It's nice when things work out, isn't it?</div><div><br /></div><div>I've used NCAA basketball because there are plenty of neutral court games; to examine the NBA directly we'll have to use a more sophisticated (but perhaps <a href="http://en.wikipedia.org/wiki/Mathematical_beauty" target="_blank">less beautiful</a>) approach involving the variation in possession times. I'll do that next, and I'll also show you how to apply this new information to make better <a href="https://github.com/octonion/basketball-m/blob/master/ncaa/sos/predict_daily.txt" target="_blank">game predictions</a>. Finally, there's a nice connection to some recent work on <a href="http://qz.com/316826/mathematicians-have-finally-figured-out-how-to-tell-correlation-from-causation/" target="_blank">inferring causality</a> that I'll outline.</div>http://angrystatistician.blogspot.com/2015/03/who-controls-pace-in-basketball-offense.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-8170448064117049143Wed, 11 Feb 2015 22:17:00 +00002015-02-11T14:17:53.532-08:00Baseball's Billion Dollar EquationIn 1999 <a href="http://en.wikipedia.org/wiki/Voros_McCracken" target="_blank">Voros McCracken</a> infamously speculated about the amount of control the pitcher had over balls put in play. Not so much, as it turned out, and <a href="http://en.wikipedia.org/wiki/Defense_independent_pitching_statistics" target="_blank">DIPS</a> was born. It's tough to put a value on something like DIPS, but if an MLB team had developed and exploited it for several years, it could potentially have been worth hundreds of millions of dollars. Likewise, <a href="http://www.hardballtimes.com/the-little-technician-rene-rivera-craftsman-at-work/" target="_blank">catcher framing</a> could easily have been worth hundreds of millions.<br /><br />How about a billion dollar equation? Sure, look at the baseball draft. An 8th round draft pick like <a href="http://www.fangraphs.com/statss.aspx?playerid=9218" target="_blank">Paul Goldschmidt</a> could net you a $200M surplus. And then there's <a href="http://www.fangraphs.com/statss.aspx?playerid=4720" target="_blank">Chase Headley</a>, <a href="http://www.fangraphs.com/statss.aspx?playerid=8090" target="_blank">Matt Carpenter</a>, <a href="http://www.fangraphs.com/statss.aspx?playerid=10264" target="_blank">Brandon Belt</a>, <a href="http://www.fangraphs.com/statss.aspx?playerid=9776" target="_blank">Jason Kipnis</a> and <a href="http://www.fangraphs.com/statss.aspx?playerid=9393" target="_blank">Matt Adams</a>. The commonality? All college position players easily identified as likely major leaguers purely through statistical analysis. You can also do statistical analysis for college pitchers, of course, but ideally you'd also want velocities. These are frequently available through public sources, but you may have to put them together manually. We'll also find that GB/FB ratios are important.<br /><br />There's plenty of public data available. I've made yearly NCAA college baseball data available in my <a href="https://github.com/octonion/baseball-public" target="_blank">public baseball GitHub</a> account; it covers 2002-2014, which is plenty of data for analysis. Older years are also available, but only in PDF format. So you'll either have to enter the data manually, use a service or do some high-quality automated OCR. My repository also includes NCAA play-by-play data from several sources, which among other things is useful for building catcher framing and defensive estimates.<br /><br />Also publicly available, and will be available in my GitHub over the next several days:<br /><ol><li><a href="http://en.wikipedia.org/wiki/National_Association_of_Intercollegiate_Athletics" target="_blank">NAIA</a> - roughly NCAA D2 level</li><li><a href="http://en.wikipedia.org/wiki/National_Junior_College_Athletic_Association" target="_blank">NJCAA</a> - junior college, same rules as NCAA</li><li><a href="http://en.wikipedia.org/wiki/California_Community_College_Athletic_Association" target="_blank">CCCAA</a> - junior college</li><li><a href="http://en.wikipedia.org/wiki/Northwest_Athletic_Conference_(junior_college)" target="_blank">NWAACC</a> - junior college</li></ol><div>Prospects come out of the NAIA and NCAA D2/D3 divisions every year, and with the free agent market valuing a single win at around $7M you want to make sure you don't overlook any player with talent. With JUCO players you'd like to identify that sleeper before he transfers to an NCAA D1 and has a huge year. Later you'll also want to analyze:</div><div><ol><li>Summer leagues</li><li>Independent leagues</li></ol><div>We'll start by looking at what data is available and how to combine the data sets. There are always player transfers to identify, and NCAA teams frequently play interdivision games as well as NAIA teams. We'll want to build a predictive model that identifies the most talented players uniformly across all leagues, so this will be a boring but necessary step.</div></div>http://angrystatistician.blogspot.com/2015/02/baseballs-billion-dollar-equation.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-4885721382300786171Wed, 11 Feb 2015 16:48:00 +00002015-02-11T08:48:34.913-08:00A Very Rough Guide to Getting Started in Data Science: Part II, The Big PictureData science to a beginner seems completely overwhelming. Not only are there huge numbers of programming languages, packages and algorithms, but even managing your data is an entire area itself. Some examples are the languages R, Python, Ruby, Perl, Julia, Mathematica, MATLAB/Octave; packages SAS, STATA, SPSS; algorithms linear regression, logistic regression, nested model, neural nets, support vector machines, linear discriminant analysis and deep learning.<br />For managing your data some people use Excel, or a relational database like MySQL or PostgreSQL. And where do things like big data, NoSQL and Hadoop fit in? And what's gradient descent and why is it important? But perhaps the most difficult part of all is that you actually need to know and understand statistics, too.<br /><br />It does seem overwhelming, but there's a simple key idea - <b>data science is using data to answer a question</b>. Even if you're only sketching a graph using a stick and a sandbox, you're still doing data science. Your goal for data science should be to continually learn better, more powerful and more efficient ways to answer your questions. My general framework has been strongly influenced by <a href="http://en.wikipedia.org/wiki/George_P%C3%B3lya" target="_blank">George Pólya</a>'s wonderful book <a href="http://en.wikipedia.org/wiki/How_to_Solve_It" target="_blank">"How to Solve It"</a>. While it's directed at solving mathematical problems, his approach is helpful for solving problems in general.<br /><br />"How to Solve It" suggests the following steps when solving a mathematical problem:<br /><ol><li>First, you have to understand the problem.</li><li>After understanding, then make a plan.</li><li>Carry out the plan.</li><li>Review/extend. Look back on your work. How could it be better?</li></ol><div>Pólya goes into much greater detail for each step and provides some illustrative examples. It's not the final word on how to approach and solve mathematical problems, but it's very helpful and I highly recommend it. For data science, the analogous steps from my perspective would be:</div><ol><li>What questions do you want to answer?</li><li>What data would be helpful to answer these questions? How and where do you get this data?</li><li>Given the question you want to answer and the data you have, which approaches and models are likely to be useful? This can be <b>very confusing</b>. There are always tradeoffs - underfitting vs overfitting, <a href="http://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff" target="_blank">bias vs variance</a>, simplicity vs complexity, <a href="http://en.wikipedia.org/wiki/Prior_probability" target="_blank">information about where something came from vs what's it doing</a>. </li><li>Perform analysis/fit model.</li><li>How do you know if your model and analysis are good or bad, and how confident should you be in your predictions and conclusions? A very critical, but commonly treated lightly or even skipped entirely.</li><li>Given the results, what should you try next?</li></ol>Let's follow Pólya and do an illustrative example next.http://angrystatistician.blogspot.com/2015/02/a-very-rough-guide-to-getting-started.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-6914729618120541850Tue, 10 Feb 2015 23:18:00 +00002015-02-10T15:18:03.938-08:00More Measles: Vaccination Rates and School FundingI took a look at California's personal belief exemption rate (PBE) for kindergarten vaccinations in <a href="http://angrystatistician.blogspot.com/2015/02/mere-measles-look-at-vaccination-rates.html">Part I</a>. California also provides poverty information for public schools through the <a href="http://www.cde.ca.gov/ds/sd/sd/filessp.asp">Free or Reduced Price Meals data sets</a>, both of which conveniently include California's school codes. Cleaned versions of these data sets and my R code are in my <a href="https://github.com/octonion/vaccinations">vaccination GitHub</a>.<br /><br />We can use the school code as a key to join these two data sets. But remember, the FRPM data set only includes data about public schools, so we'll have to retain the private school data for PBEs by doing what's called a <a href="http://en.wikipedia.org/wiki/Join_%28SQL%29#Left_outer_join">left outer join</a>. This still performs a join on the school code key, but if any school codes included in the left data don't have corresponding entries in the right data set we still retain them. The missing values for the right data set in this case are set to <a href="http://en.wikipedia.org/wiki/Null_%28SQL%29">NULL</a>.<br /><br />We can perform a left outer join in R by using "merge" with the option "all.x=TRUE". I'll start by looking at how the PBE rate varies between charter, non-charter public and private schools, so we'll need to replace those missing values for funding source after our join. If the funding source is missing, it's a private school. The FRPM data also denotes non-charter public schools with funding type "", so I'll replace those with "aPublic" for convenience. For factors, R will by default set the reference level to be the level that comes first alphabetically.<br /><br /><script src="https://gist.github.com/octonion/546f5657e89cc72b367d.js"></script><br />Here's a subset of the output. The addition of the funding source is an improvement over the model that doesn't include it, and the estimates for the odds ratios for funding source is the highest for directly funded charter schools, followed by locally funded charter schools and private schools. Remember, public schools are the reference level, so for the public level \(\log(\text{odds ratio}) = 0\). Everything else being equal, our odds ratio estimates based on funding source would be: \begin{align*}<br />\mathrm{OR}_{\text{public}} &= e^{-3.820}\times e^{0} &= 0.022\\<br />\mathrm{OR}_{\text{private}} &= e^{-3.820}\times e^{0.752} &= 0.047\\<br />\mathrm{OR}_{\text{charter-local}} &= e^{-3.820}\times e^{1.049} &= 0.063\\<br />\mathrm{OR}_{\text{charter-direct}} &= e^{-3.820}\times e^{1.348} &= 0.085<br />\end{align*}<br />Converting to estimated PBE rates, we get: \begin{align*}<br />\mathrm{PBE}_{\text{public}} &= \frac{0.022}{1+0.022} &= 0.022\\<br />\mathrm{PBE}_{\text{private}} &= \frac{0.047}{1+0.047} &= 0.045\\<br />\mathrm{PBE}_{\text{charter-local}} &= \frac{0.063}{1+0.063} &= 0.059\\<br />\mathrm{PBE}_{\text{charter-direct}} &= \frac{0.085}{1+0.085} &= 0.078<br />\end{align*}<br /><script src="https://gist.github.com/octonion/52c4fb139ee83d40c519.js"></script>http://angrystatistician.blogspot.com/2015/02/more-measles-vaccination-rates-and.htmlnoreply@blogger.com (Christopher Long)0tag:blogger.com,1999:blog-9149402429183581490.post-555489984408974541Sun, 08 Feb 2015 03:43:00 +00002015-02-08T00:06:40.987-08:00Mere Measles: A Look at Vaccination Rates in California, Part ICalifornia is currently at the <a href="http://www.mercurynews.com/science/ci_27479456/california-measles-outbreak-state-cases-now-over-100">epicenter of a measles outbreak</a>, a disease that was considered all but eradicated in the US as of a few years ago. Measles is a nasty disease; it's easily transmitted and at its worst can cause measles encephalitis, leading to brain damage or even death.<br /><br />The increasing problem in the US with the measles, whooping cough and other nearly-eradicated diseases stems from a liberal <a href="http://www.shotsforschool.org/laws/exemptions/">personal belief exemption policy</a> in California and other states. This wasn't a major problem until <a href="http://en.wikipedia.org/wiki/Andrew_Wakefield">Andrew Wakefield</a> <a href="http://en.wikipedia.org/wiki/MMR_vaccine_controversy">famously and fraudulently tied autism to the MMR vaccine</a>. This has led to thousands of unnecessary deaths as well as needless misery for thousands more. I myself caught a case of the <a href="http://www.kpbs.org/news/2014/dec/22/record-number-whooping-cough-cases-san-diego-count/">whooping cough in San Diego</a> a few years ago as a consequence of Wakefield's fraud. I've had several MMR vaccines over my life, but adults may still only be partially immune; this is yet another reason why a healthy level of <a href="http://en.wikipedia.org/wiki/Herd_immunity">herd immunity</a> is so critical to maintain.<br /><br />PBE rates in California may be relatively low, but the problem is that parents who are likely to seek a PBE exemption for the MMR vaccine tend to cluster, making them susceptible to outbreaks of highly infectious diseases such as the measles. As we'll see they cluster in private schools, they cluster in particular cites and they cluster in particular counties.<br /><br />California makes vaccination and PBE (personal belief exemption) information available here (indexed by school code):<br /><br /><a href="http://www.cdph.ca.gov/programs/immunize/pages/immunizationlevels.aspx">Immunization Levels in Child Care and Schools</a><br /><br />California also makes student poverty information available here (also indexed by school code):<br /><br /><a href="http://www.cde.ca.gov/ds/sd/sd/filessp.asp">Student Poverty - FRPM Data</a><br /><br />This is typical government data - Excel spreadsheets, bits of non-data all over the place. I've cleaned up both data sets for 2013-2014 here:<br /><br /><a href="https://github.com/octonion/vaccinations/tree/master/data">California kindergarten and poverty data for 2013-2014</a><br /><br />Let's start with a basic nested model using only the vaccination data to examine the PBE rate for public vs private schools:<br /><br /><script src="https://gist.github.com/octonion/2d087e27f8199f0dd122.js"></script><br />We get these nice ANOVA results that indicates the public and private status of schools does indeed add to our PBE model.<br /><br /><script src="https://gist.github.com/octonion/30973d9c758d08de70d8.js"></script><br />What's the estimated impact of public vs private? This is a nice illustration of why the odds ratio is so easy to use in logistic models with logit links. Here our the fixed effect estimates:<br /><br /><script src="https://gist.github.com/octonion/7c061b8e0ed941bf618f.js"></script><br />To get PBE rate estimates for average public and private schools, take the exponential of the estimates to get the <a href="http://en.wikipedia.org/wiki/Odds_ratio">odds ratio</a> for the estimate. \begin{align*}<br />\mathrm{Intercept} &= e^{-3.113} = 0.044\\<br />\mathrm{public} &= e^{-0.606} = 0.546\\<br />\mathrm{private} &= e^{0} = 1.00<br />\end{align*}<br />The nice thing about odds ratios is that they simply multiply. What's our estimate for the odds ratio for the PBE rate of a public school? It's \(0.044\times 0.546 = 0.024\); for a private school it's \(0.044\times 1.00 = 0.044\). Translating to PBE rates we get \(\frac{0.024}{1+0.024} = 0.023\) for public schools and \(\frac{0.044}{1+0.044} = 0.042\) for private schools.<br /><br />Of course, the odds ratios for counties vary quite a bit, as do the odds ratios for cities within each county. You can see the \(\log({\mathrm{odds}})\) values for all of California's <a href="https://github.com/octonion/vaccinations/blob/master/counties.csv">counties here</a> and <a href="https://github.com/octonion/vaccinations/blob/master/cities.csv">cities here</a>.<br /><br />For another example, let's do a private school in Beverly Hills, which is in Los Angeles County. The odds ratio for Beverly Hills is \(3.875\); for Los Angeles County it's 0.625. Chaining as before, we get the overall estimate PBE odds ratio for a private school in Beverly Hills, Los Angeles County to be \(0.044\times 1.00\times 0.625\times 3.875 = 0.107\). Translated to an estimated PBE rate this is \(\frac{0.107}{1+0.107} = 0.096\). Thus, we'd expect about 10% of the children in a private Beverly Hills kindergarten to be unvaccinated due to personal belief exemptions. This is <a href="http://www.ovg.ox.ac.uk/herd-immunity">well below the desired herd immunity level for measles of 95%</a>.<br /><br />In the next part I'll show how to merge the information in California's vaccination data and poverty data to examine the role of other factors in the clustering of unvaccinated children. As an example, charter schools in California tend to have even higher rates of personal belief exemptions than private schools.http://angrystatistician.blogspot.com/2015/02/mere-measles-look-at-vaccination-rates.htmlnoreply@blogger.com (Christopher Long)2tag:blogger.com,1999:blog-9149402429183581490.post-8175766332155393834Sat, 07 Feb 2015 23:24:00 +00002015-02-07T15:24:11.769-08:00Touring Waldo; Overfitting Waldo; Scanning Waldo; Waldo, Waldo, WaldoRandal Olson has written a nice article on <a href="http://en.wikipedia.org/wiki/Where%27s_Wally%3F">finding Waldo</a> - <a href="http://www.randalolson.com/2015/02/03/heres-waldo-computing-the-optimal-search-strategy-for-finding-waldo/">Here’s Waldo: Computing the optimal search strategy for finding Waldo</a>. Randal presents a variety of machine learning methods to find very good search paths among the 68 known locations of Waldo. Of course, there's no need for an approximation; modern algorithms can optimize tiny problems like these exactly.<br /><br />One approach would be to treat this as a <a href="http://en.wikipedia.org/wiki/Travelling_salesman_problem">traveling salesman problem</a> with Euclidean distances as edge weights, but you'll need to add a dummy node that has edge weight 0 to every node. Once you have the optimal tour, delete the dummy node and you have your optimal <a href="http://en.wikipedia.org/wiki/Hamiltonian_path">Hamiltonian path</a>.<br /><br />I haven't coded in the dummy node yet, but here's the Waldo problem as a traveling salesman problem using <a href="http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/">TSPLIB format</a>.<br /><br /><script src="https://gist.github.com/octonion/7f0f70ab04805782bac1.js"></script><br />The <a href="http://www.math.uwaterloo.ca/tsp/concorde.html">Condorde software package</a> optimizes this in a fraction of a second:<br /><br /><script src="https://gist.github.com/octonion/95340a94ca840546ebdd.js"></script><br />I'll be updating this article to graphically show you the results for the optimal Hamiltonian path. There are also many additional questions I'll address. Do we really want to use this as our search path? We're obviously overfitting. Do we want to assume Waldo will <i>never</i> appear in a place he hasn't appeared before? When searching for Waldo we see an entire little area, not a point, so a realistic approach would be to develop a scanning algorithm that covers the entire image and accounts for our viewing point and posterior Waldo density. We can also jump where we're looking at from point to point quickly while not searching for Waldo, but scans are much slower.http://angrystatistician.blogspot.com/2015/02/touring-waldo-overfitting-waldo.htmlnoreply@blogger.com (Christopher Long)1tag:blogger.com,1999:blog-9149402429183581490.post-7021674754617646563Fri, 06 Feb 2015 21:05:00 +00002015-02-06T13:05:07.740-08:00Short Notes: Get CUDA and gputools Running on Ubuntu 14.10Here's a basic guide for getting CUDA 7.0 and the R package gputools running perfectly under Ubuntu 14.10. It's not difficult, but there are a few issues and this will be helpful to have in a single place.<br /><br />If you're running Ubuntu 14.10, I'd recommend installing CUDA 7.0. NVIDIA has a 7.0 Debian package specifically for 14.10; this wasn't the case for CUDA 6.5, which only had a Debian package for 14.04.<br /><br />To get access to CUDA 7.0, you'll first need to register as a CUDA developer.<br /><br /><a href="https://developer.nvidia.com/cuda-registered-developer-program">Join The CUDA Registered Developer Program</a><br /><br />Once you have access, navigate to the CUDA 7.0 download page and get the Debian package.<br /><br /><a href="https://developer.nvidia.com/rdp/cuda-70-rc-downloads">CUDA 7.0 Release Candidate Downloads</a><br /><br />You'll either need to be running the NVIDIA 340 or 346 drivers. If you're having trouble upgrading, I'd suggest adding the <a href="http://www.ubuntuupdates.org/ppa/xorg-edgers">xorg-edgers PPA</a>.<br /><br /><script src="https://gist.github.com/octonion/5239bd8c595033c567e1.js"></script>Once your NVIDIA driver is set, install the CUDA 7.0 Debian package you've downloaded. Don't forget to remove any previously installed CUDA packages or repositories.<br /><br /><script src="https://gist.github.com/octonion/018ac43ddb973d9846aa.js"></script>You'll need to add paths so everything knows where CUDA is installed. Append the following to the .bashrc in your home directory:<br /><br /><script src="https://gist.github.com/octonion/554912e766654444e63b.js"></script>Execute "source ~/.bashrc" for these changes to be applied. If you want to test your new CUDA install, make the samples provided by NVIDIA.<br /><br /><script src="https://gist.github.com/octonion/d9088cc3165df2201883.js"></script>I get the following output when running BlackScholes:<br /><br /><script src="https://gist.github.com/octonion/64c309b25f62c029dff7.js"></script>The next task is to install gputools for R. You can't unfortunately install the current package through R, as the source code contains references to CUDA architectures that are obsolete under CUDA 7.0. But that's easy to fix.<br /><br /><script src="https://gist.github.com/octonion/fb5dbac550983c018cb8.js"></script>Now do some editing in gputools/src/Makefile:<br /><br /><script src="https://gist.github.com/octonion/ed71024cfadb62b53b44.js"></script>Now build and install the patched gputools package while you're in the directory immediately above gputools:<br /><br /><script src="https://gist.github.com/octonion/f0028a3a6fb88dc2c9e3.js"></script>If you want to make the gputools packages available for all R users<br /><br /><script src="https://gist.github.com/octonion/e3953a0bbf1156e65cbf.js"></script>Keep in mind that they'll have to make the same environmental variable changes as above. Let's test it!<br /><br /><script src="https://gist.github.com/octonion/1aa3b572ebd7a6db68b8.js"></script>Running gives us:<br /><br /><script src="https://gist.github.com/octonion/902639d2db61d0f86dc1.js"></script>A nice 26-fold speedup. We're all set!http://angrystatistician.blogspot.com/2015/02/short-notes-get-cuda-and-gputools.htmlnoreply@blogger.com (Christopher Long)3tag:blogger.com,1999:blog-9149402429183581490.post-574142958622078177Thu, 22 Jan 2015 01:15:00 +00002015-01-21T17:24:21.668-08:00A Very Rough Guide to Getting Started in Data Science: Part I, MOOCs<h3>Introduction</h3><br />Data science is a very hot, perhaps the hottest, field right now. Sports analytics has been my primary area of interest, and it's a field that has seen amazing growth in the last decade. It's no surprise that the most common question I'm asked is about becoming a data scientist. This will be a first set of rough notes attempting to answer this question from my own personal perspective. Keep in mind that this is only my opinion and there are many different ways to do data science and become a data scientist.<br /><div><br /></div><div>Data science is using data to answer a question. This could be doing something as simple as making a plot by hand, or using Excel to take the average of a set of numbers. The important parts of this process are knowing which questions to ask, deciding what information you'd need to answer it, picking a method that takes this data and produces results relevant to your question and, most importantly, how to properly interpret these results so you can be confident that they actually answer your question.</div><div><br /></div><div>Knowing the questions requires some domain expertise, either yours or someone else's. Unless you're a data science researcher, data science is a tool you apply to another domain.</div><div><br /></div><div>If you have the data you feel should answer your question, you're in luck. Frequently you'll have to go out and collect the data yourself, e.g. scraping from the web. Even if you already have the data, it's common to have to process the data to remove bad data, correct errors and put it into a form better suited for analysis. A popular tool for this phase of data analysis is a scripting language; typically something like Python, Perl or Ruby. These are high-level programming languages that very good at web work as well as manipulating data.</div><div><br /></div><div>If you're dealing with a large amount of data, you'll find that it's convenient to store it in a structured way that makes it easier to access, manipulate and update in the future. This will typically be a relational database of some type, such as PostgreSQL, MySQL or SQL Server. These all use the programming language SQL.</div><div><br /></div><div>Methodology and interpretation are the most difficult, broadest and most important parts of data science. You'll see methodology referenced as statistical learning, machine learning, artificial intelligence and data mining; these can be covered in statistics, computer science, engineering or other classes. Interpretation is traditionally the domain of statistics, but this is always taught together with methodology.</div><div><br /></div><div>You can start learning much of this material freely and easily with MOOCs. Here's an initial list.<br /><br /></div><div><h3>MOOCs</h3></div><h4>Data Science Basics</h4><div><a href="https://www.coursera.org/course/datascitoolbox" target="_blank">Johns Hopkins: The Data Scientist’s Toolbox</a>. Overview of version control, markdown, git, GitHub, R, and RStudio. Started January 5, 2015. Coursera.</div><div><br /></div><div><a href="https://www.coursera.org/course/rprog" target="_blank">Johns Hopkins: R Programming</a>. R-based. Started January 5, 2015. Coursera.</div><div><br /></div><h4>Scripting Languages</h4><div><a href="https://www.udacity.com/course/cs101" target="_blank">Intro to Computer Science</a>. Python-based. Take anytime. Udacity; videos and exercises are free.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe class="youtube-player" type="text/html" width="640" height="385" src="http://www.youtube.com/embed/Pm_WAWZNbdA" allowfullscreen frameborder="0"></iframe></div><div><br /></div><div><a href="https://www.udacity.com/course/ud036" target="_blank">Programming Foundations with Python</a>. Python-based. Take anytime. Udacity; videos and exercises are free.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe class="youtube-player" type="text/html" width="640" height="385" src="http://www.youtube.com/embed/bwq2W6XUvyg" allowfullscreen frameborder="0"></iframe></div><div><br /></div><div><br /></div><div><a href="https://www.edx.org/course/introduction-computer-science-mitx-6-00-1x-0" target="_blank">MIT: Introduction to Computer Science and Programming Using Python</a>. Python-based. Class started January 9, 2015. edX.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe class="youtube-player" type="text/html" width="640" height="385" src="http://www.youtube.com/embed/ww2BdhILIio" allowfullscreen frameborder="0"></iframe></div><div><br /></div><div><br /></div><h4>Databases and SQL</h4><div><a href="https://www.coursera.org/course/db" target="_blank">Stanford: Introduction to Databases</a>. XML, JSON, SQL; uses SQLite for SQL. Self-paced. Coursera.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe class="youtube-player" type="text/html" width="640" height="385" src="http://www.youtube.com/embed/ShjrtAQmIVg" allowfullscreen frameborder="0"></iframe></div><div><br /></div><h4>Machine Learning</h4><div><a href="https://www.coursera.org/course/ml" target="_blank">Stanford: Machine Learning</a>. Octave-based. Class started January 19, 2015. Coursera.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe class="youtube-player" type="text/html" width="640" height="385" src="http://www.youtube.com/embed/qeHZOdmJvFU" allowfullscreen frameborder="0"></iframe></div><div><br /></div><div><a href="https://class.stanford.edu/courses/HumanitiesandScience/StatLearning/Winter2015/about" target="_blank">Stanford: Statistical Learning</a>. R-based. Class started January 19, 2015. Stanford OpenEdX.</div><div><br /></div><div class="separator" style="clear: both; text-align: center;"><iframe class="youtube-player" type="text/html" width="640" height="385" src="http://www.youtube.com/embed/St2-97n7atk" allowfullscreen frameborder="0"></iframe></div><div><br /></div><div><br /></div>http://angrystatistician.blogspot.com/2015/01/a-very-rough-guide-to-getting-started.htmlnoreply@blogger.com (Christopher Long)1