## Tuesday, July 8, 2014

### Finding the Best Book Quotes: Power Ranking Goodreads

My friend Jordan Ellenberg wrote an article for the Wall Street Journal recently describing a metric to roughly measure the least read books, which he calls the Hawking Index (HI). As he describes it, take the page numbers of a book's five top highlights, average them, and divide by the number of pages in the whole book.
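As a quick illustration of the arithmetic, here is a minimal Python sketch of the index as described above. The function name and the sample page numbers are made up for the example; they are not taken from the article.

```python
def hawking_index(highlight_pages, total_pages):
    """Average page number of the top highlights, divided by the page count."""
    if not highlight_pages or total_pages <= 0:
        raise ValueError("need at least one highlight and a positive page count")
    average_page = sum(highlight_pages) / len(highlight_pages)
    return average_page / total_pages


# Hypothetical book: five top highlights, all early in a 300-page book.
print(hawking_index([10, 20, 30, 40, 50], 300))
```

A small value suggests the highlights cluster near the front of the book, which is the sense in which the index flags books that readers tend not to finish.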

On a discussion thread on Facebook, this led to a proposal from me for measuring the general quality of a quote. Assume a user has some level of discrimination $$D$$; the higher the value of $$D$$, the less likely they are to quote a passage. Now assume each passage has some measure of quality $$Q$$; the higher the value of $$Q$$, the more likely a passage is to be quoted. Let's try a classic Bradley-Terry-Luce model: if a user with discrimination $$D$$ quotes at least one passage from a particular work, the probability $$p$$ that they'll quote a given passage with quality $$Q$$ from that same work is roughly $p = \frac{Q}{Q+D}.$
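One short step, spelled out here because it makes the later model fit easier to follow: on the log-odds scale the formula above separates into a passage term and a user term. This is ordinary algebra on the stated formula rather than anything extra about the data.

```latex
\begin{align*}
p = \frac{Q}{Q+D}
  \;\Longrightarrow\;
  \frac{p}{1-p} &= \frac{Q}{D}, \\
  \log\frac{p}{1-p} &= \log Q - \log D .
\end{align*}
```

So a logistic-type model with one additive effect per quote and one per user estimates $$\log Q$$ and $$-\log D$$ respectively.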
Nice, but where can we find data to actually test this theory? It turns out an excellent source is Goodreads. Users are numbered sequentially starting with Goodreads founder Otis Chandler's ID of 1, and a user's associated quote page has a simple URL structure. Furthermore, every user's quote list has an associated RSS/Atom URL that provides an easy-to-parse XML file.

The steps:
1. Write a simple Go scraper that uses a goroutine to quickly and sequentially get user quote pages, grab the count of quotes shown on each page (truncated at 30), then grab the RSS/Atom URL. This is able to process about 2500 users per minute over a slow internet connection.
2. Write a simple Ruby program to fetch quote XML files for every user that has at least one quote.
3. Write a simple Ruby program to fetch the author URL and work URL for every quote cited by at least one user.
4. Load all this data into a PostgreSQL database.
5. Write a simple R program to pull the data from the database and fit our model. We assign a value for the outcome "quoted" of 1 if the user quoted a passage; we assign "quoted" the value 0 if the user quoted a passage from that work, but not that particular passage. Both user "discrimination" and quote "quality" are treated as factors and modeled as random effects; the "lmer" function from the R package "lme4" was used to fit the model.
If there is interest, I can make all of this code publicly available through my GitHub account. None of it is proprietary.
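The outcome coding in step 5 can be sketched as follows. This is a minimal Python illustration under my own assumptions, not the R program described above: the function name, argument names and the tiny sample data are all hypothetical, and the real data would come from the PostgreSQL tables.

```python
from collections import defaultdict


def build_outcomes(user_quotes, quote_work):
    """Return (user, quote, quoted) rows for the model.

    user_quotes: iterable of (user_id, quote_id) pairs (hypothetical layout).
    quote_work:  dict mapping quote_id -> work_id (hypothetical layout).
    A user contributes a row for every known quote from each work they
    quoted at least once: 1 if they quoted that passage, 0 otherwise.
    """
    quotes_by_work = defaultdict(set)
    for quote_id, work_id in quote_work.items():
        quotes_by_work[work_id].add(quote_id)

    quoted_by_user = defaultdict(set)
    for user_id, quote_id in user_quotes:
        quoted_by_user[user_id].add(quote_id)

    rows = []
    for user_id, quoted in quoted_by_user.items():
        works = {quote_work[q] for q in quoted}
        for work_id in sorted(works):
            for quote_id in sorted(quotes_by_work[work_id]):
                rows.append((user_id, quote_id, 1 if quote_id in quoted else 0))
    return rows


# Made-up sample: quotes "a" and "b" come from work W1, quote "c" from W2.
sample_pairs = [(1, "a"), (2, "b"), (2, "c")]
sample_works = {"a": "W1", "b": "W1", "c": "W2"}
for row in build_outcomes(sample_pairs, sample_works):
    print(row)
```

Users who never quote from a work contribute no rows for it, which matches the conditioning in the model: the probability applies only given that the user quoted at least one passage from that work.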

As a test I grabbed the data from 333760 users, of whom 10724 had at least one quote; there were 83160 total quotes and 32621 unique quotes. The model took about 3.5 minutes to fit using an optimized BLAS library and 4 hyperthreaded CPU cores.

The top 10 quotes on Goodreads, ranked by quality:
1. “No one can make you feel inferior without your consent.”
Eleanor Roosevelt, This is My Story
2. “Outside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.”
Groucho Marx, The Essential Groucho: Writings For By And About Groucho Marx
3. “Twenty years from now you will be more disappointed by the things that you didn't do than by the ones you did do. So throw off the bowlines. Sail away from the safe harbor. Catch the trade winds in your sails. Explore. Dream. Discover.”
H. Jackson Brown Jr., P.S. I Love You
4. “I am so clever that sometimes I don't understand a single word of what I am saying.”
Oscar Wilde, The Happy Prince and Other Stories
5. “Insanity is doing the same thing, over and over again, but expecting different results.”
Narcotics Anonymous, Narcotics Anonymous
6. “He has achieved success who has lived well, laughed often, and loved much; Who has enjoyed the trust of pure women, the respect of intelligent men and the love of little children; Who has filled his niche and accomplished his task; Who has never lacked appreciation of Earth's beauty or failed to express it; Who has left the world better than he found it, Whether an improved poppy, a perfect poem, or a rescued soul; Who has always looked for the best in others and given them the best he had; Whose life was an inspiration; Whose memory a benediction.”
Bessie Anderson Stanley, More Heart Throbs Volume Two in Prose and Verse Dear to the American People And by them contributed as a Supplement to the original \$10,000 Prize Book HEART THROBS
7. “The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”
Terry Pratchett, Diggers
8. “Love all, trust a few, do wrong to none.”
William Shakespeare, All's Well That Ends Well
9. “People are illogical, unreasonable, and self-centered.
Love them anyway.
If you do good, people will accuse you of selfish ulterior motives.
Do good anyway.
If you are successful, you will win false friends and true enemies.
Succeed anyway.
The good you do today will be forgotten tomorrow.
Do good anyway.
Honesty and frankness make you vulnerable.
Be honest and frank anyway.
The biggest men and women with the biggest ideas can be shot down by the smallest men and women with the smallest minds.
Think big anyway.
People favor underdogs but follow only top dogs.
Fight for a few underdogs anyway.
What you spend years building may be destroyed overnight.
Build anyway.
People really need help but may attack you if you do help them.
Help people anyway.
Give the world the best you have and you'll get kicked in the teeth.
Give the world the best you have anyway.”
Kent M. Keith, The Silent Revolution: Dynamic Leadership in the Student Council
10. “She is too fond of books, and it has turned her brain.”
Louisa May Alcott, Work: A Story of Experience

## Monday, April 7, 2014

### Five Free Student Tickets for the SaberSeminar in Boston (August 17-18, 2014)

Meredith Wills, Will Carroll and I are donating four student two-day tickets, including lunch, for the upcoming baseball analytics Saberseminar run by Dan Brooks. This is a wonderful event, and 100% of the proceeds are donated to the Jimmy Fund. You must be a current student. Meredith and I will be choosing four students by the end of this week, Sunday April 13, 2014.

• These tickets are for both days, August 17-18, 2014
• The event is in Boston, MA
• Lunch is included, but no other meals
• Transportation and lodging are not included

If you would like to be considered for a donated ticket, please send:
• Your full name (first and last)
• If you're outside of the Boston area, how will you be getting to the event?
• Your school affiliation and whether high school or college
• Do you see yourself working in baseball? For a team, as a journalist, or something else?

Please email the above information to me at sabermetrics@gmail.com.

Again, please do so by the end of the day on Sunday, April 13, 2014. Once the tickets are awarded they're gone.

## Monday, December 30, 2013

### A Stupid and Strange Way of Looking at Sports Power Ratings that could be Smart and Useful

As I've mentioned previously, a common method used in sports for estimating game outcomes, known as log5, can be written $p = \frac{p_1 q_2}{p_1 q_2+q_1 p_2}$ where $$p_i$$ is the fraction of games won by team $$i$$ and $$q_i$$ is the fraction of games lost by team $$i$$. We're assuming that there are no ties.

What's the easiest way to derive this estimate? Here's one argument. Assume team $$i$$ has a probability $$p_i$$ of beating an average team (a team that wins half its games). Now imagine that this means for any given game the team has some "strength" sampled from [0,1] with median $$p_i$$ and that the stronger team always wins. Thus, the probability that team 1 beats team 2 is $p = \int_0^1 \int_0^1 \! \mathrm{Pr}(p_1 > p_2) \, \mathrm{d} p_1 \mathrm{d} p_2 .$

This looks complicated, but with probability $$p_1$$ team 1 is stronger than an average team and with probability $$p_2$$ team 2 is stronger than an average team. From this perspective the log5 estimate is just the Bayesian probability that team 1 will be stronger than an average team while team 2 will be weaker than an average team, conditional on either team 1 being stronger than an average team and team 2 weaker than an average team, or team 1 weaker than an average team and team 2 stronger than an average team. In these cases it's unambiguous which team is stronger. The cases where both teams are stronger than an average team, or both are weaker (the ambiguous cases), are thus discarded.
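To make the conditioning argument concrete, here is a minimal Python sketch of the log5 estimate built directly from the two unambiguous cases. The function name and the example winning fractions are my own choices for illustration.

```python
from fractions import Fraction


def log5(p1, p2):
    """Estimated probability that team 1 beats team 2.

    p1, p2 are the fractions of games each team wins (no ties assumed).
    Undefined when both teams always win or both always lose, since the
    denominator is then zero.
    """
    q1, q2 = 1 - p1, 1 - p2
    one_stronger = p1 * q2   # team 1 above average, team 2 below
    two_stronger = q1 * p2   # team 1 below average, team 2 above
    return one_stronger / (one_stronger + two_stronger)


# Made-up example: a .600 team against a .400 team.
print(log5(Fraction(3, 5), Fraction(2, 5)))
# Against an average team the estimate reduces to the team's own record.
print(log5(Fraction(3, 5), Fraction(1, 2)))
```

Two sanity checks follow from the formula itself: `log5(p, 1/2)` returns `p`, and `log5(p1, p2) + log5(p2, p1)` equals 1.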

How could this be useful? Instead of ignoring the ambiguous outcomes when estimating the outcome probabilities under this "latent strength" model, we could instead determine which probability distributions best fit the outcome distributions for a given league! Furthermore, this allows us to cohesively put a power rating system into a Bayesian framework by assigning to each team a Bayesian prior strength distribution. These priors could either be uninformative or informative using e.g. preseason rankings.

## Thursday, November 7, 2013

### Building a Personal Supercomputer

It's time for a workstation upgrade; here's what I've assembled.

The massive case has plenty of space for additional drives to store the extreme amount of data that nearly every sport is now collecting together with the associated video footage. The case, power supply and motherboard allow up to 3 additional video cards to use as GPU units as your analytical needs demand (and budget can handle).
1. Cooler Master Cosmos II - Ultra Full Tower Computer Case

An absolutely massive case - plenty of room for an E-ATX motherboard, large power supply, multiple video cards and hard drives.

2. PC Power & Cooling 1200W Silencer MK III Series Modular Power Supply

An exceptionally large power supply, but it's platinum rated, and having plenty of capacity to spare allows it to operate nearly silently. It'll also handle any additional video cards you add later for GPU computing.

3. ASUS Rampage IV Black Edition LGA 2011 Extended ATX Intel Motherboard

Space for 4 video cards, 8 memory sticks (up to 64GB), superb quality and extreme tweakability.

4. Intel i7-4930K LGA 2011 64 Technology Extended Memory CPU

Ivy Bridge, 6 cores, exceptional ability to overclock.

5. (2) Corsair Dominator Platinum 32GB (4x8GB) DDR3 2133 MHz

Total of 64GB. Top quality; you'll need your RAM in matched sets of 4 to enable quad-channel.

6. EVGA GeForce GTX TITAN SuperClocked 6GB GDDR5 384bit

Top-of-the-line NVIDIA Kepler card for GPU computing. 2688 CUDA cores that reach 4800 GFLOPS single-precision and 1600 GFLOPS double-precision.

7. (2) Seagate NAS HDD 2TB SATA 6GB NCQ 64 MB Cache Bare Drive

Solid hard drive; you'll need 2 or more drives for a RAID 10 array.

8. Samsung Electronics 840 Pro Series 2.5-Inch 256 GB SATA 6GB/s Solid State Drive

Small, fast SSD for quick booting and anything bottlenecked by drive read/write times.

9. Silverstone Tek 3.5-inch to 2 x 2.5-Inch Bay Converter

Needed to adapt the SSD to the Cosmos II case.

10. Ubuntu Linux

Ubuntu Linux - what else?

## Sunday, September 15, 2013

### The Good, the Bad and the Weird: Duels and the Gentleman's Draw

As I mentioned in the previous article "The Good, the Bad and the Weird: Duels, Truels and Utility Functions", a classic probability puzzle involves a 3-way duel (called a "truel").
A, B and C are to fight a three-cornered pistol duel. All know that A's chance of hitting his target is 0.3, C's is 0.5, and B never misses. They are to fire at their choice of target in succession in the order A, B, C, cyclically (but a hit man loses further turns and is no longer shot at) until only one man is left unhit. What should A's strategy be?
There's a subtle issue involved in these types of problems in that we don't know how each participant values each outcome. If we allow duelists to deliberately miss there are $$2^3-1=7$$ possible outcomes; each person may or may not be shot and at least one person will not be shot. Even if deliberate missing isn't allowed, there are still 3 possible outcomes. A, for example, could conceivably value B winning more than C winning.

The classic solution concludes that with the given hit probabilities the optimal strategy for the first trueler is to deliberately miss. My contention is that this is an incomplete solution; for some sets of utility values this may not be the first trueler's optimal strategy. Let's examine duels using utility values as the first step towards addressing truels.

Let the two duelers have hit probabilities $$p_i$$ for $$i=1,2$$, and also let the values each assigns to a dueling win, loss or tie be $$W_i, L_i, T_i$$. I'm assuming these values are all known to both duelers and that $$W_i > T_i > L_i$$. Note that a strategy that optimizes expected utility is preserved under a positive affine transformation, so under the assumption that all values are finite, we may assume $$W_i=1$$ and $$L_i=0$$ for all $$i$$.

We'll declare the duel a "gentleman's draw" if both duelers deliberately miss in a single round. Let the expected utility value for dueler $$i$$ be $$U_i$$.

Trivially, if the optimal strategy for dueler 1 is trying for a hit, the optimal strategy for dueler 2 must be to also try for a hit. Conversely, the optimal strategy for dueler 1 can be to deliberately miss if and only if the optimal strategy for dueler 2 is also to deliberately miss. Assume the draw is taken if there's indifference (they are, after all, gentlemen).

If dueler 1 deliberately misses, dueler 2 has two choices. If he deliberately misses, it's a gentleman's draw and $$U_2 = T_2$$; if he tries for a hit, both duelers will subsequently try to hit each other and $$U_2 = p_2 + q_2 q_1 U_2$$, where $$q_i = 1 - p_i$$ is the probability that dueler $$i$$ misses. Solving, we get $U_2 = \frac{p_2}{1-q_1 q_2}.$ It's therefore optimal for dueler 2 to take the gentleman's draw if and only if $T_2 \geq \frac{p_2}{1-q_1 q_2}.$ As a consequence, dueler 1 will not deliberately miss if $T_2 < \frac{p_2}{1-q_1 q_2}.$

If dueler 1 tries for a hit, both will subsequently try to hit each other and we have $$U_1 = p_1 + q_1 q_2 U_1$$. Solving, we get $U_1 = \frac{p_1}{1-q_1 q_2}.$ Thus, the optimal strategy for dueler 1 is to deliberately miss if and only if $T_1 \geq \frac{p_1}{1-q_1 q_2}$ and $T_2 \geq \frac{p_2}{1-q_1 q_2}.$ These are, as expected, precisely the conditions under which dueler 2 will deliberately miss.

For example, if $$p_1=p_2=1/2$$, it'll be a gentleman's draw if and only if $$T_1,T_2 \geq 2/3$$. Paradoxically, both will fire under these hit probabilities if $$T_1 = 4/5$$ and $$T_2 = 1/2$$, even though this results in lower expected utility for both duelers than if they had agreed to a draw. This is a type of prisoner's dilemma.
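The thresholds and the example above can be checked with a few lines of Python. This is a minimal sketch using exact fractions; the function names are mine. The second dueler's utility when both fire is taken as $$1 - U_1$$, which is my own added step: it assumes $$W_i=1$$, $$L_i=0$$ and that at least one hit probability is positive, so that someone is eventually hit.

```python
from fractions import Fraction


def first_shooter_value(p_first, p_second):
    """Expected utility p / (1 - q1*q2) for whoever fires first when both aim to hit."""
    q_first, q_second = 1 - p_first, 1 - p_second
    return p_first / (1 - q_first * q_second)


def is_gentlemans_draw(p1, p2, t1, t2):
    """True when both tie values meet their thresholds, so both duelers miss on purpose."""
    return (t1 >= first_shooter_value(p1, p2)
            and t2 >= first_shooter_value(p2, p1))


def firing_utilities(p1, p2):
    """(U1, U2) when dueler 1 fires first and both try for a hit."""
    u1 = first_shooter_value(p1, p2)
    return u1, 1 - u1


half = Fraction(1, 2)
print(is_gentlemans_draw(half, half, Fraction(2, 3), Fraction(2, 3)))
print(is_gentlemans_draw(half, half, Fraction(4, 5), half))
print(firing_utilities(half, half))
```

With $$T_1 = 4/5$$ and $$T_2 = 1/2$$ the duel goes ahead, and both firing utilities come out below the corresponding tie values, which is the prisoner's dilemma described above.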

## Saturday, August 10, 2013

### The Good, the Bad and the Weird: Duels, Truels and Utility Functions

In the excellent (and highly recommended) book "Fifty Challenging Problems in Probability with Solutions", Frederick Mosteller poses "The Three-Cornered Duel":
A, B and C are to fight a three-cornered pistol duel. All know that A's chance of hitting his target is 0.3, C's is 0.5, and B never misses. They are to fire at their choice of target in succession in the order A, B, C, cyclically (but a hit man loses further turns and is no longer shot at) until only one man is left unhit. What should A's strategy be?
This is problem 20 in Mosteller's book, and it also appears (with an almost identical solution) in Larsen & Marx "An Introduction to Probability and Its Applications".

Mosteller's solution:
A is naturally not feeling cheery about this enterprise. Having the first shot he sees that, if he hits C, B will then surely hit him, and so he is not going to shoot at C. If he shoots at B and misses him, then B clearly shoots the more dangerous C first, and A gets one shot at B with probability 0.3 of succeeding. If he misses this time, the less said the better. On the other hand, suppose A hits B.  Then C and A shoot alternately until one hits. A's chance of winning is $$(.5)(.3) + (.5)^2(.7)(.3) + (.5)^3(.7)^2(.3) + \ldots$$ . Each term corresponds to a sequence of misses by both C and A ending with a final hit by A. Summing the geometric series we get ... $$3/13 < 3/10$$. Thus hitting B and finishing off with C has less probability of winning for A than just missing the first shot. So A fires his first shot into the ground and then tries to hit B with his next shot. C is out of luck.
Is this right? What if B were to follow A's example and fire into the ground? What if all three were to keep firing into the ground? That this type of an outcome isn't unreasonable for certain sets of shot accuracy probabilities can be illustrated by considering the case where A's accuracy is 0.98, B's accuracy is 1.0 and C's accuracy is 0.99. Mosteller's argument is equally applicable in this case, but if B shoots C after A deliberately misses he'll be shot by A with probability 0.98. Is that reasonable?
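The numbers in Mosteller's argument, and in the near-perfect-shot variant, are easy to reproduce. Here is a minimal Python sketch with exact fractions; the function names are mine, and the code only encodes Mosteller's assumed line of play (B never misses and shoots C after A's deliberate miss), not a full game-theoretic solution.

```python
from fractions import Fraction


def a_wins_by_missing_first(a):
    """A misses on purpose, B shoots C, then A gets exactly one shot at B."""
    return a


def a_wins_after_hitting_b(a, c):
    """A hits B; C and A then alternate, C first, until someone is hit.

    Sum of the geometric series (1-c)a + (1-c)^2(1-a)a + ...
    """
    return (1 - c) * a / (1 - (1 - c) * (1 - a))


# Mosteller's accuracies: A = 0.3, C = 0.5.
a, c = Fraction(3, 10), Fraction(1, 2)
print(a_wins_after_hitting_b(a, c), a_wins_by_missing_first(a))

# The variant above: A = 0.98, C = 0.99, B = 1.0.
a, c = Fraction(98, 100), Fraction(99, 100)
print(a_wins_after_hitting_b(a, c), a_wins_by_missing_first(a))
# B's chance of surviving A's next shot if B follows the same script.
print(1 - a_wins_by_missing_first(a))
```

In the variant, following the script leaves B alive only when A's next shot misses, which is why it seems fair to ask whether B would really play along.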

Under the assumption that deliberately missing is allowed, there are $$2^3-1=7$$ possible outcomes - each participant can be shot or not shot, and there must be at least one participant not shot. The lack of clarity for what the ideal strategies are for A, B and C in the general case arises from the utility of 2- or 3-way ties to each of the participants being undefined.

In the next article I'll analyze 2-way duels where deliberate missing is allowed by using such fully-defined utility functions. These results will be used in a third article on 3-way duels (truels); in particular, I'll re-examine Mosteller's solution.