
Probability and Cumulative Dice Sums

Touring Waldo; Overfitting Waldo; Scanning Waldo; Waldo, Waldo, Waldo

Randal Olson has written a nice article on finding Waldo: "Here’s Waldo: Computing the optimal search strategy for finding Waldo". Randal presents a variety of machine learning methods to find very good search paths among the 68 known locations of Waldo. Of course, there's no need for an approximation; modern algorithms can solve tiny problems like this one exactly.

One approach would be to treat this as a traveling salesman problem with Euclidean distances as edge weights. A tour is a closed loop, though, and a search path is open, so you'll need to add a dummy node that has edge weight 0 to every other node. The dummy node absorbs the closing edge of the loop at no cost. Once you have the optimal tour, delete the dummy node and you have your optimal Hamiltonian path.
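The dummy-node trick is easy to check on a toy instance. Here's a minimal sketch that uses a handful of made-up points (not Waldo's actual locations) and brute force in place of a real TSP solver; the function names are my own and purely illustrative. It compares the shortest open path found directly with the one recovered from the shortest tour through a zero-weight dummy node.

```python
import itertools
import math


def path_length(points, order):
    """Total Euclidean length of the open path visiting points in this order."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def shortest_path_brute_force(points):
    """Shortest Hamiltonian path, found by trying every ordering."""
    return min(itertools.permutations(range(len(points))),
               key=lambda order: path_length(points, order))


def shortest_path_via_dummy(points):
    """Shortest Hamiltonian path, recovered from the best tour with a dummy node."""
    n = len(points)
    dummy = n  # extra node with weight 0 to every real node

    def weight(a, b):
        if a == dummy or b == dummy:
            return 0.0
        return math.dist(points[a], points[b])

    best_tour, best_len = None, math.inf
    # A tour is cyclic, so fix the dummy node as the starting point.
    for perm in itertools.permutations(range(n)):
        tour = (dummy,) + perm
        length = sum(weight(tour[i], tour[(i + 1) % len(tour)])
                     for i in range(len(tour)))
        if length < best_len:
            best_tour, best_len = tour, length
    # Deleting the dummy node cuts the loop open and leaves the path.
    return list(best_tour[1:]), best_len


if __name__ == "__main__":
    toy_points = [(0, 0), (3, 0), (3, 4), (0, 4), (1, 1)]  # hypothetical data
    direct = shortest_path_brute_force(toy_points)
    via_dummy, tour_len = shortest_path_via_dummy(toy_points)
    print(path_length(toy_points, direct), tour_len)
```

Brute force is only workable for a few points; for the 68 Waldo locations you would hand the same weights to an exact solver instead.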

I haven't coded in the dummy node yet, but here's the Waldo problem as a traveling salesman problem using TSPLIB format.
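For reference, here's a rough sketch of how the dummy node could be written into a TSPLIB file. Coordinates alone can't express a node at distance 0 from everything, so this uses an explicit full weight matrix. The function name `to_tsplib_with_dummy`, the instance name, and the scale factor are my own assumptions, and the sample coordinates are placeholders rather than the real Waldo data. As far as I know, Concorde works with integer edge lengths, so the distances are scaled and rounded.

```python
import math


def to_tsplib_with_dummy(points, name="waldo_dummy", scale=100):
    """Return TSPLIB text with an explicit weight matrix and one extra dummy node.

    The dummy node is the last node and has weight 0 to every other node.
    Distances are multiplied by `scale` and rounded to integers.
    """
    n = len(points)
    rows = []
    for i in range(n + 1):
        row = []
        for j in range(n + 1):
            if i == n or j == n or i == j:
                row.append(0)  # dummy edges and the diagonal are zero
            else:
                row.append(round(scale * math.dist(points[i], points[j])))
        rows.append(" ".join(str(w) for w in row))
    header = [
        f"NAME: {name}",
        "TYPE: TSP",
        f"DIMENSION: {n + 1}",
        "EDGE_WEIGHT_TYPE: EXPLICIT",
        "EDGE_WEIGHT_FORMAT: FULL_MATRIX",
        "EDGE_WEIGHT_SECTION",
    ]
    return "\n".join(header + rows + ["EOF"]) + "\n"


if __name__ == "__main__":
    placeholder_points = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]  # not real data
    print(to_tsplib_with_dummy(placeholder_points), end="")
```

Rounding means the solver optimizes a slightly perturbed problem; a larger `scale` shrinks that error at the cost of bigger integers.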


The Concorde software package optimizes this in a fraction of a second:


I'll be updating this article to graphically show you the results for the optimal Hamiltonian path. There are also many additional questions I'll address. Do we really want to use this as our search path? We're obviously overfitting. Do we want to assume Waldo will never appear in a place he hasn't appeared before? When searching for Waldo we see an entire little area, not a point, so a realistic approach would be to develop a scanning algorithm that covers the entire image and accounts for our viewing point and posterior Waldo density. Our eyes can also jump quickly from point to point when we aren't actively searching for Waldo, but scanning is much slower.
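As a rough illustration of the density idea, here's a small sketch that smooths known locations into a grid of probabilities, so that areas near past sightings get weight even if Waldo never appeared at that exact spot. It uses a plain Gaussian kernel density estimate as a simple stand-in for a proper posterior density; the function name, cell size, and bandwidth are all assumptions of mine, and the points are made up.

```python
import math


def density_grid(points, width, height, cell, bandwidth):
    """Gaussian kernel density over a grid of cells, normalized to sum to 1.

    Returns a list of rows; grid[r][c] is the weight of the cell whose
    center is at ((c + 0.5) * cell, (r + 0.5) * cell).
    """
    cols = math.ceil(width / cell)
    rows = math.ceil(height / cell)
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cx, cy = (c + 0.5) * cell, (r + 0.5) * cell
            # Sum of unnormalized Gaussian kernels centered on each known point.
            row.append(sum(
                math.exp(-((cx - x) ** 2 + (cy - y) ** 2) / (2 * bandwidth ** 2))
                for x, y in points))
        grid.append(row)
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]


if __name__ == "__main__":
    made_up_points = [(1.0, 1.0), (1.2, 0.8), (9.0, 9.0)]  # hypothetical data
    for row in density_grid(made_up_points, 10, 10, cell=2, bandwidth=1.0):
        print(" ".join(f"{v:.3f}" for v in row))
```

A scanning strategy could then visit cells in an order that trades off this weight against the time it takes to move between them; choosing the bandwidth is where the overfitting question shows up.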

Comments

  1. Great/important stuff. Would love to see an efficient way to pull/clean CA's xls data!

