Skip to main content


Showing posts from February, 2015

Why does Kaggle use Log-loss?

Baseball's Billion Dollar Equation

In 1999 Voros McCracken infamously speculated about the amount of control the pitcher had over balls put in play. Not so much, as it turned out, and DIPS was born. It's tough to put a value on something like DIPS, but if an MLB team had developed and exploited it for several years, it could potentially have been worth hundreds of millions of dollars. Likewise, catcher framing could easily have been worth hundreds of millions.

How about a billion dollar equation? Sure, look at the baseball draft. An 8th round draft pick like Paul Goldschmidt could net you a $200M surplus. And then there's Chase Headley, Matt Carpenter, Brandon Belt, Jason Kipnis and Matt Adams. The commonality? All college position players easily identified as likely major leaguers purely through statistical analysis. You can also do statistical analysis for college pitchers, of course, but ideally you'd also want velocities. These are frequently available through public sources, but you may have to put the…

A Very Rough Guide to Getting Started in Data Science: Part II, The Big Picture

Data science to a beginner seems completely overwhelming. Not only are there huge numbers of programming languages, packages and algorithms, but even managing your data is an entire area itself. Some examples are the languages R, Python, Ruby, Perl, Julia, Mathematica, MATLAB/Octave; packages SAS, STATA, SPSS; algorithms linear regression, logistic regression, nested model, neural nets, support vector machines, linear discriminant analysis and deep learning.
For managing your data some people use Excel, or a relational database like MySQL or PostgreSQL. And where do things like big data, NoSQL and Hadoop fit in? And what's gradient descent and why is it important? But perhaps the most difficult part of all is that you actually need to know and understand statistics, too.

It does seem overwhelming, but there's a simple key idea - data science is using data to answer a question. Even if you're only sketching a graph using a stick and a sandbox, you're still doing data sc…

More Measles: Vaccination Rates and School Funding

I took a look at California's personal belief exemption rate (PBE) for kindergarten vaccinations in Part I. California also provides poverty information for public schools through the Free or Reduced Price Meals data sets, both of which conveniently include California's school codes. Cleaned versions of these data sets and my R code are in my vaccination GitHub.

We can use the school code as a key to join these two data sets. But remember, the FRPM data set only includes data about public schools, so we'll have to retain the private school data for PBEs by doing what's called a left outer join. This still performs a join on the school code key, but if any school codes included in the left data don't have corresponding entries in the right data set we still retain them. The missing values for the right data set in this case are set to NULL.

We can perform a left outer join in R by using "merge" with the option "all.x=TRUE". I'll start by look…

Mere Measles: A Look at Vaccination Rates in California, Part I

California is currently at the epicenter of a measles outbreak, a disease that was considered all but eradicated in the US as of a few years ago. Measles is a nasty disease; it's easily transmitted and at its worst can cause measles encephalitis, leading to brain damage or even death.

The increasing problem in the US with the measles, whooping cough and other nearly-eradicated diseases stems from a liberal personal belief exemption policy in California and other states. This wasn't a major problem until Andrew Wakefieldfamously and fraudulently tied autism to the MMR vaccine. This has led to thousands of unnecessary deaths as well as needless misery for thousands more. I myself caught a case of the whooping cough in San Diego a few years ago as a consequence of Wakefield's fraud. I've had several MMR vaccines over my life, but adults may still only be partially immune; this is yet another reason why a healthy level of herd immunity is so critical to maintain.

PBE rates…

Touring Waldo; Overfitting Waldo; Scanning Waldo; Waldo, Waldo, Waldo

Randal Olson has written a nice article on finding Waldo - Here’s Waldo: Computing the optimal search strategy for finding Waldo. Randal presents a variety of machine learning methods to find very good search paths among the 68 known locations of Waldo. Of course, there's no need for an approximation; modern algorithms can optimize tiny problems like these exactly.

One approach would be to treat this as a traveling salesman problem with Euclidean distances as edge weights, but you'll need to add a dummy node that has edge weight 0 to every node. Once you have the optimal tour, delete the dummy node and you have your optimal Hamiltonian path.

I haven't coded in the dummy node yet, but here's the Waldo problem as a traveling salesman problem using TSPLIB format.

The Condorde software package optimizes this in a fraction of a second:

I'll be updating this article to graphically show you the results for the optimal Hamiltonian path. There are also many additional quest…

Short Notes: Get CUDA and gputools Running on Ubuntu 14.10

Here's a basic guide for getting CUDA 7.0 and the R package gputools running perfectly under Ubuntu 14.10. It's not difficult, but there are a few issues and this will be helpful to have in a single place.

If you're running Ubuntu 14.10, I'd recommend installing CUDA 7.0. NVIDIA has a 7.0 Debian package specifically for 14.10; this wasn't the case for CUDA 6.5, which only had a Debian package for 14.04.

To get access to CUDA 7.0, you'll first need to register as a CUDA developer.

Join The CUDA Registered Developer Program

Once you have access, navigate to the CUDA 7.0 download page and get the Debian package.

CUDA 7.0 Release Candidate Downloads

You'll either need to be running the NVIDIA 340 or 346 drivers. If you're having trouble upgrading, I'd suggest adding the xorg-edgers PPA.

Once your NVIDIA driver is set, install the CUDA 7.0 Debian package you've downloaded. Don't forget to remove any previously installed CUDA packages or repositories…