
Probability and Cumulative Dice Sums

Touring Waldo; Overfitting Waldo; Scanning Waldo; Waldo, Waldo, Waldo

Randal Olson has written a nice article on finding Waldo: Here’s Waldo: Computing the optimal search strategy for finding Waldo. Randal presents a variety of machine learning methods for finding very good search paths among the 68 known locations of Waldo. Of course, there's no need for an approximation here; modern algorithms can solve tiny instances like this exactly.

One approach is to treat this as a traveling salesman problem with Euclidean distances as edge weights, with one twist: add a dummy node that has edge weight 0 to every other node. Once you have the optimal tour, delete the dummy node, and the order of the remaining nodes is your optimal Hamiltonian path.
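A sketch of the dummy-node trick (brute force over permutations, so only practical for a handful of points; Concorde handles the real 68-node instance):

```python
# Sketch: optimal Hamiltonian path via TSP with a zero-weight dummy node.
import itertools
import math

def euclid(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def optimal_path(points):
    n = len(points)
    DUMMY = n  # extra node, distance 0 to every real node

    def dist(i, j):
        if DUMMY in (i, j):
            return 0.0
        return euclid(points[i], points[j])

    best_len, best_tour = float("inf"), None
    # Fix node 0 in front to remove the rotational symmetry of closed tours.
    for perm in itertools.permutations(range(1, n + 1)):
        tour = (0,) + perm
        length = sum(dist(tour[k], tour[(k + 1) % (n + 1)])
                     for k in range(n + 1))
        if length < best_len:
            best_len, best_tour = length, tour
    # Delete the dummy: cutting the tour there leaves the optimal path.
    i = best_tour.index(DUMMY)
    path = best_tour[i + 1:] + best_tour[:i]
    return list(path), best_len

# Toy example (made-up coordinates, not Waldo's).
pts = [(0, 0), (5, 0), (1, 1), (4, 1), (2, 3)]
path, length = optimal_path(pts)
```

The zero-weight edges mean the dummy node contributes nothing to the tour length, so the optimal tour through the dummy corresponds exactly to the shortest Hamiltonian path over the real nodes.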

I haven't coded in the dummy node yet, but here's the Waldo problem as a traveling salesman problem in TSPLIB format.

NAME : waldo68
COMMENT : 68 locations of Waldo
TYPE : TSP
DIMENSION : 68
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
0 0.625000000 7.708333333
1 4.944444444 6.569444444
2 5.430555556 6.402777778
3 5.902777778 6.083333333
4 5.430555556 5.444444444
5 4.791666667 5.444444444
6 5.305555556 5.125000000
7 3.666666667 5.291666667
8 6.069444444 4.638888889
9 3.513888889 4.319444444
10 2.388888889 4.486111111
11 1.763888889 4.333333333
12 2.083333333 3.847222222
13 1.125000000 2.569444444
14 0.805555556 2.250000000
15 1.444444444 2.097222222
16 0.638888889 1.777777778
17 0.958333333 1.277777778
18 0.958333333 0.972222222
19 2.083333333 0.805555556
20 2.388888889 2.729166667
21 2.166666667 2.250000000
22 2.486111111 2.250000000
23 3.347222222 2.569444444
24 3.513888889 2.083333333
25 3.041666667 1.916666667
26 4.305555556 2.888888889
27 5.263888889 2.729166667
28 5.750000000 2.250000000
29 5.597222222 1.930555556
30 4.625000000 1.277777778
31 5.263888889 0.819444444
32 5.416666667 0.333333333
33 7.666666667 7.361111111
34 6.861111111 6.736111111
35 8.458333333 6.722222222
36 10.37500000 7.194444444
37 12.44444444 7.694444444
38 10.37500000 6.430555556
39 11.16666667 6.402777778
40 11.33333333 6.083333333
41 9.888888889 5.777777778
42 9.250000000 5.291666667
43 9.958333333 4.791666667
44 9.097222222 4.652777778
45 7.944444444 5.305555556
46 7.763888889 5.000000000
47 8.000000000 4.791666667
48 7.333333333 5.138888889
49 6.694444444 4.972222222
50 7.166666667 4.500000000
51 7.486111111 4.006944444
52 11.01388889 4.819444444
53 11.80555556 4.652777778
54 12.27777778 4.805555556
55 9.416666667 3.694444444
56 9.250000000 3.222222222
57 11.48611111 3.541666667
58 11.95833333 3.541666667
59 7.986111111 2.430555556
60 6.694444444 1.625000000
61 12.12500000 2.909722222
62 11.51388889 2.756944444
63 11.30555556 2.555555556
64 10.72222222 1.930555556
65 11.16666667 1.791666667
66 12.44444444 1.625000000
67 12.12500000 1.291666667
EOF
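A minimal reader for the coordinate section above (a sketch; it assumes the layout shown, with whitespace-separated id/x/y lines between NODE_COORD_SECTION and EOF):

```python
def read_tsplib_coords(text):
    """Parse node id -> (x, y) from a TSPLIB NODE_COORD_SECTION."""
    coords = {}
    in_section = False
    for line in text.splitlines():
        line = line.strip()
        if line == "NODE_COORD_SECTION":
            in_section = True
            continue
        if line == "EOF":
            break
        if in_section and line:
            node, x, y = line.split()
            coords[int(node)] = (float(x), float(y))
    return coords

# A shortened sample of the instance above.
sample = """NAME : waldo68
TYPE : TSP
NODE_COORD_SECTION
0 0.625000000 7.708333333
1 4.944444444 6.569444444
EOF"""
coords = read_tsplib_coords(sample)
```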

The Concorde software package optimizes this in a fraction of a second:

68
0 7 5 1 2 6 4 8 48 46
45 42 43 55 44 47 50 51 49 3
34 33 35 36 41 38 37 39 40 52
54 53 58 57 61 66 67 65 64 63
62 56 59 60 28 29 31 32 30 24
25 19 17 18 15 16 14 13 20 22
21 23 26 27 9 10 11 12
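A tour like the one above is easy to sanity-check against the coordinates. Here's a sketch that scores a closed tour with exact Euclidean distances (TSPLIB's EUC_2D weights are technically rounded to the nearest integer); the demo uses only the first few nodes of the tour rather than all 68:

```python
import math

def tour_length(coords, tour):
    """Total length of a closed tour; coords maps node id -> (x, y)."""
    total = 0.0
    for a, b in zip(tour, tour[1:] + tour[:1]):
        ax, ay = coords[a]
        bx, by = coords[b]
        total += math.hypot(ax - bx, ay - by)
    return total

# First four nodes of the Concorde tour, with their coordinates
# from the .tsp file above; a real check would load all 68.
coords = {
    0: (0.625000000, 7.708333333),
    7: (3.666666667, 5.291666667),
    5: (4.791666667, 5.444444444),
    1: (4.944444444, 6.569444444),
}
length = tour_length(coords, [0, 7, 5, 1])
```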

I'll be updating this article to show the optimal Hamiltonian path graphically. There are also many additional questions to address. Do we really want to use this as our search path? We're obviously overfitting: do we want to assume Waldo will never appear in a place he hasn't appeared before? And when searching for Waldo we see an entire little area, not a point, so a realistic approach would be a scanning algorithm that covers the entire image and accounts for both our field of view and the posterior Waldo density. We can jump quickly from point to point while not scanning, but the scans themselves are much slower.

