Saturday, October 11, 2014

A Strange Recursive Relation, Automatic

Hofstadter mentions the following recursive relation in his great book "Gödel, Escher, Bach": \[
g(0) &= 0;\\
g(n) &= n-g(g(n-1)).
\] I claim that \( g(n) = \left\lfloor \phi\cdot (n+1) \right\rfloor \), where \( \phi = \frac{-1+\sqrt{5}}{2}\), and I'll show this using a technique that makes proving many identities of this type nearly automatic.

Let \( \phi\cdot n = \left\lfloor \phi\cdot n \right\rfloor + e\), where \( 0 < e < 1\) as \( \phi \) is irrational, nor can \(e = 1-\phi\), and note that \(\phi\) satisfies \( {\phi}^2 + \phi - 1 = 0\). Some algebra gives \[
n-\left\lfloor \left( \left\lfloor \phi\cdot n \right\rfloor + 1 \right) \cdot \phi \right\rfloor
&= n-\left\lfloor \left( n\cdot \phi - e + 1 \right) \cdot \phi \right\rfloor \\
&= n-\left\lfloor n\cdot {\phi}^2 - e\cdot \phi + \phi \right\rfloor \\
&= n-\left\lfloor n\cdot \left(1-\phi\right) - e\cdot \phi + \phi \right\rfloor \\
&= n-n-\left\lfloor -n\cdot \phi - e\cdot \phi + \phi \right\rfloor \\
&= -\left\lfloor -n\cdot \phi - e\cdot \phi + \phi \right\rfloor \\
&= -\left\lfloor -n \cdot \phi + e - e - e\cdot \phi + \phi \right\rfloor \\
&= \left\lfloor \phi\cdot n \right\rfloor -\left\lfloor - e - e\cdot \phi + \phi \right\rfloor.
Now if \[
0 < e < 1-\phi &\implies 0 < - e - e\cdot \phi + \phi < \phi;\\
1-\phi < e < 1 &\implies -1 < - e - e\cdot \phi + \phi < 0.
This implies \[ n-\left\lfloor \left( \left\lfloor \phi\cdot n \right\rfloor + 1 \right) \cdot \phi \right\rfloor = \left\lfloor \phi\cdot (n+1) \right\rfloor .\] Since \( \left\lfloor \phi\cdot (0+1) \right\rfloor = 0\), we're done.

The point of the algebra was to move all terms involving \(n\) out, and then checking to see how the remaining term varied with \( e\). A simple idea, but very useful.

Sunday, October 5, 2014

Closed Under Means

Here's a nice little problem from the 13th All Soviet Union Mathematics Olympiad.
Given a set of real numbers \(S\) containing 0 and 1 that's closed under finite means, show that it contains all rational numbers in the interval \(\left[0,1\right]\).
This isn't a difficult problem, but there's a particularly nice solution.

First observe that if \(x\in S\) then both \(\frac{x}{4}\) and \(\frac{3x}{4}\) are in \(S\); average \(\{0,x\}\) to get \(\frac{x}{2}\), average \(\{0, \frac{x}{2}\}\) to get \(\frac{x}{4}\), average \(\{\frac{x}{2}, x\}\) to get \(\frac{3x}{4}\).

We could show any rational number \(\frac{m}{n}\) with \(1\leq m < n\) is in \(S\) if we had \(n\) distinct elements from \(S\) that summed to \(m\). Let's exhibit one.

Start with an array with \(m\) 1s on the left, \(n-m\) 0s on the right. Repeatedly replace adjacent \(x,y\) values with \(\frac{3(x+y)}{4}, \frac{(x+y)}{4}\), where either \(x=1,y\neq1\) or \(x\neq 0, y=0\), until there is one 0 and one 1 left. We can do this in exactly \(n-2\) steps, each resulting number is guaranteed to be in \(S\) by the above note, and each number is guaranteed to be distinct by construction!


\(\frac{1}{3}: \left[1,0,0\right] \to \left[\frac{3}{4},\frac{1}{4},0\right] \)

\(\frac{2}{5}: \left[1,1,0,0,0\right] \to \left[1,\frac{3}{4},\frac{1}{4},0,0\right] \to \left[1,\frac{3}{4},\frac{3}{16},\frac{1}{16},0\right] \)

Thursday, August 28, 2014

Learn One Weird Trick And Easily Solve The Product-Sum Problem

A tribute to Martin Gardner.

For which sets of positive integers does the sum equal the product? For example, when does \( x_1 + x_2 = x_1\cdot x_2\)? It's easy to see that this is only true when \(x_1 = x_2 = 2\).

In the general case our equality is \(\sum_{i=1}^{n} x_i= \prod_{i=1}^{n} x_i \). We can rearrange terms to give \[x_1+x_2+\sum_{i=3}^{n} x_i= x_1\cdot x_2\cdot \prod_{i=3}^{n} x_i,\] and this in turn factors nicely to give us \[\left( x_1\cdot \prod_{x=3}^{n} x_i - 1\right)\cdot \left( x_2\cdot \prod_{x=3}^{n} x_i - 1\right) = \left( \prod_{x=3}^{n} x_i \right)\cdot \left(\sum_{x=3}^{n} x_i \right) + 1.\] How does this help? Consider the case \(n=5\), and without loss of generality assume \(x_i \ge x_{i+1}\) for all \(i\). If \(x_3=x_4=x_5=1\) our factorized equation becomes \[(x_1-1)\cdot(x_2-1)=4,\] with the obvious solutions \( x_1=5, x_2=2; x_1=3, x_2=3\). The only remaining case to consider is \(x_3=2\), as any other case forces  \( \sum_{i=1}^{n} x_i < \prod_{i=1}^{n} x_i \). For this case our equality becomes \[\left( 2 x_1 - 1\right)\cdot \left( 2 x_2 - 1\right) = 9.\] This gives us the remaining solution \(x_1 = x_2 = x_3 = 2\) with the other \(x_i = 1\).

This is quite efficient. For example, for the case \(n=100\) the only equations we need to consider are \begin{aligned} (x_1-1)\cdot(x_2-1) &= 99\\
                         (2 x_1-1)\cdot(2 x_2-1) &= 199\\
                         (3 x_1-1)\cdot(3 x_2-1) &= 301\\
                         (4 x_1-1)\cdot(4 x_2-1) &= 405\\
                         (4 x_1-1)\cdot(4 x_2-1) &= 401\\
                         (6 x_1-1)\cdot(6 x_2-1) &= 607\\
                         (9 x_1-1)\cdot(9 x_2-1) &= 919\\
                         (8 x_1-1)\cdot(8 x_2-1) &= 809\\
                         (12 x_1-1)\cdot(12 x_2-1) &= 1225\\
                         (16 x_1-1)\cdot(16 x_2-1) &= 1633.
\end{aligned} Now 199, 401, 607, 919, 809 are all prime ruling them out immediately, and 301 and 1633 don't have factors of the right form. The other equations yield the five solutions \( (100,2), (34,4), (12,10), (7,4,4), (3,3,2,2,2) \) with the other \(x_i = 1\).

For \(n = 1000 \) you'd need to consider 52 cases, but most of these are eliminated immediately. I get the six solutions \( (1000,2), (334,4), (112,10), (38,28), (67,4,4), (16,4,4,4) \).

Have fun!

Tuesday, July 8, 2014

Finding the Best Book Quotes: Power Ranking Goodreads

My friend Jordan Ellenberg wrote an article for the Wall Street Journal recently describing a metric to roughly measure the least read books, which he calls the Hawking Index (HI). As he describes it, take the page numbers of a book's five top highlights, average them, and divide by the number of pages in the whole book.

On a discussion thread on Facebook, this led to a proposal from me for measuring the general quality of a quote. Assume a user has some level of discrimination \(D\); the higher the value of \(D\), the more likely they are to quote a passage. Now assume each passage has some measure of quality \(Q\); the higher the value of \(Q\), the more likely a passage is to be quoted. Let's try a classic Bradley-Terry-Luce model - if a user with discrimination \(D\) quotes at least one passage from a particular work, the probability \(p\) that they'll quote a given passage with quality \(Q\) from that same work is roughly \[ p = \frac{Q}{Q+D}.\]
Nice, but where can we find data to actually test this theory? It turns out an excellent source is Goodreads. Users are numbered sequentially starting with Goodreads founder Otis Chandler's ID of 1; a user's associated quote page has a simple URL structure. Furthermore, every user's quote list has an associated RSS/Atom URL that provides an easy to parse XML file.

The steps:
  1. Write a simple Go scraper to uses a goroutine to quickly and sequentially get user quote pages, grab the count of quotes shown on that page (truncated at 30), then grab the RSS/Atom URL. This is able to process about 2500/users minutes over a slow internet connection.
  2. Write a simple Ruby program to fetch quote XML files for every user that has at least one quote.
  3. Write a simple Ruby program to fetch the author URL and work URL for every quote cited by at least one user.
  4. Load all this data into a PostgreSQL database.
  5. Write a simple R program to pull the data from the database and fit our model. We assign a value for the outcome "quoted" of 1 if the user quoted a passage; we assign "quoted" the value 0 if the user quote a passage from that work, but not that particular passage. Both user "discrimination" and quote "quality" are treated as factors and modeled as random effects; the R package "lmer" was used to fit the model.
If there is interest, I can make all of this code publicly available through my GitHub account. None of it is proprietary.

As a test I grabbed the data from 333760 users, 10724 had at least one quote, there were 83160 total quotes and 32621 unique quotes. The model took about 3.5 minutes to fit using an optimized BLAS library and 4 hyperthreaded CPU cores.

The top 10 quotes on Goodreads, ranked by quality:
  1. “No one can make you feel inferior without your consent.”
    Eleanor Roosevelt, This is My Story
  2. “Outside of a dog, a book is man's best friend. Inside of a dog it's too dark to read.”
    Groucho Marx, The Essential Groucho: Writings For By And About Groucho Marx
  3. “Twenty years from now you will be more disappointed by the things that you didn't do than by the ones you did do. So throw off the bowlines. Sail away from the safe harbor. Catch the trade winds in your sails. Explore. Dream. Discover.”
    H. Jackson Brown Jr., P.S. I Love You
  4. “I am so clever that sometimes I don't understand a single word of what I am saying.”
    Oscar Wilde, The Happy Prince and Other Stories
  5. “Insanity is doing the same thing, over and over again, but expecting different results.”
    Narcotics Anonymous, Narcotics Anonymous
  6. “He has achieved success who has lived well, laughed often, and loved much;Who has enjoyed the trust of pure women, the respect of intelligent men and the love of little children;Who has filled his niche and accomplished his task;Who has never lacked appreciation of Earth's beauty or failed to express it;Who has left the world better than he found it,Whether an improved poppy, a perfect poem, or a rescued soul;Who has always looked for the best in others and given them the best he had;Whose life was an inspiration;Whose memory a benediction.”
    Bessie Anderson Stanley, More Heart Throbs Volume Two in Prose and Verse Dear to the American People And by them contributed as a Supplement to the original $10,000 Prize Book HEART THROBS
  7. “The trouble with having an open mind, of course, is that people will insist on coming along and trying to put things in it.”
    Terry Pratchett, Diggers
  8. “Love all, trust a few, do wrong to none.”
    William Shakespeare, All's Well That Ends Well
  9. “The Paradoxical Commandments
    People are illogical, unreasonable, and self-centered.
    Love them anyway.
    If you do good, people will accuse you of selfish ulterior motives.
    Do good anyway.
    If you are successful, you will win false friends and true enemies.
    Succeed anyway.
    The good you do today will be forgotten tomorrow.
    Do good anyway.
    Honesty and frankness make you vulnerable.
    Be honest and frank anyway.
    The biggest men and women with the biggest ideas can be shot down by the smallest men and women with the smallest minds.
    Think big anyway.
    People favor underdogs but follow only top dogs.
    Fight for a few underdogs anyway.
    What you spend years building may be destroyed overnight.
    Build anyway.
    People really need help but may attack you if you do help them.
    Help people anyway.
    Give the world the best you have and you'll get kicked in the teeth.
    Give the world the best you have anyway.”
    Kent M. Keith, The Silent Revolution: Dynamic Leadership in the Student Council
  10. “She is too fond of books, and it has turned her brain.”
    Louisa May Alcott, Work: A Story of Experience

Monday, April 7, 2014

Five Free Student Tickets for the SaberSeminar in Boston (August 17-18, 2014)

Meredith Wills, Will Carroll and myself are donating four student two-day tickets, including lunch, for the upcoming baseball analytics Saberseminar run by Dan Brooks. This is a wonderful event, and 100% of the proceeds are donated to the Jimmy Fund. You must be a current student. Meredith and myself will by choosing four students by the end of this week, Sunday April 13, 2014.

Please note:
  • These tickets are for both days, August 17-18, 2014
  • The event is in Boston, MA
  • Lunch is included, but no other meals
  • Transportation and lodging are not included

If you would like to be considered for a donated ticket, please send:
  • Your full name (first and last)
  • If you're outside of the Boston area, how will you be getting to the event?
  • Your school affiliation and whether high school or college
  • Best contact email address (if different from reply-to address)
  • A little about your baseball interests, analytical or otherwise
  • Do you see yourself working in baseball? For a team, as a journalist, or something else?

Please email the above information to me at

Again, please do so by the end of the day on Sunday, April 13, 2014. Once the tickets are awarded they're gone.

Monday, December 30, 2013

A Stupid and Strange Way of Looking at Sports Power Ratings that could be Smart and Useful

As I've mentioned previously, a common method used in sports for estimating game outcomes known as log5 can be written \[p = \frac{p_1 q_2}{p_1 q_2+q_1 p_2}\] where \(p_i\) is the fraction of games won by team \(i\) and \(q_i\) is the fraction of games lost by team \(i\). We're assuming that there are no ties. What's the easiest way to derive this estimate? Here's one argument. Assume team \(i\) has a probability \(p_i\) of beating an average team (a team that wins half its games). Now imagine that this means for any given game the team has some "strength" sampled from [0,1] with median \(p_i\) and that the stronger team always wins. Thus, the probability that team 1 beats team 2 is \[ p = \int_0^1 \int_0^1 \! \mathrm{Pr}(p_1 > p_2) \, \mathrm{d} p_1 \mathrm{d} p_2 .\] This looks complicated, but but with probability \(p_1\) team 1 is stronger than an average team and with probability \(p_2\) team 2 is stronger than an average team. From this perspective the log5 estimate is just the Bayesian probability that team 1 will be stronger than an average team while team 2 will be weaker than an average team, conditional on either team 1 being stronger than an average team and team 2 weaker than an average team, or team 1 weaker than an average team and team 2 stronger than an average team. In these cases it's unambiguous which team is stronger. The cases where the strength of both teams is stronger or weaker than an average team (the ambiguous cases) are thus discarded.

How could this be useful? Instead of ignoring the ambiguous outcomes when estimating the outcome probabilities under this "latent strength" model, we could instead determine which probability distributions best fit the outcome distributions for a given league! Furthermore, this allows us to cohesively put a power rating system into a Bayesian framework by assigning to each team a Bayesian prior strength distribution. These priors could either be uninformative or informative using e.g. preseason rankings.

Thursday, November 7, 2013

Building a Personal Supercomputer

It's time for a workstation upgrade; here's what I've assembled.

The massive case has plenty of space for additional drives to store the extreme amount of data that nearly every sport is now collecting together with the associated video footage. The case, power supply and motherboard allow up to 3 additional video cards to use as GPU units as your analytical needs demand (and budget can handle).
  1. Cooler Master Cosmos II - Ultra Full Tower Computer Case

    An absolutely massive case - plenty of room for an E-ATX motherboard, large power supply, multiple video cards and hard drives.

  2. PC Power & Cooling 1200W Silencer MK III Series Modular Power Supply

    An exceptionally large power supply, but platinum rated and plenty of room to spare allows it to operate nearly silenty, plus it'll handle any additional video cards you'll add later for GPU computing.

  3. ASUS Rampage IV Black Edition LGA 2011 Extended ATX Intel Motherboard

    Space for 4 video cards, 8 memory sticks (up to 64GB), superb quality and extreme tweakability.

  4. Intel i7-4930K LGA 2011 64 Technology Extended Memory CPU

    Ivy Bridge, 6 cores, exceptional ability to overclock.

  5. (2) Corsair Dominator Platinum 32GB (4x8GB) DDR3 2133 MHz

    Total of 64GB. Top-quality, you'll need your RAM in matched sets of 4 to enable quad-channel.

  6. EVGA GeForce GTX TITAN SuperClocked 6GB GDDR5 384bit

     Top-of-the-line NVIDIA Kepler card for GPU computing. 2688 CUDA cores that reach 4800 TFLOPS single-precision and 1600 TFLOPS double-precision.

  7. (2) Seagate NAS HDD 2TB SATA 6GB NCQ 64 MB Cache Bare Drive

    Solid hard drive; you'll need 2 or more drives for a RAID 10 array.

  8. Samsung Electronics 840 Pro Series 2.5-Inch 256 GB SATA 6GB/s Solid State Drive

    Small, fast SSD for quick booting and anything bottlenecked by drive read/write times.

  9. Silverstone Tek 3.5-inch to 2 x 2.5-Inch Bay Converter

    Needed to adapt the SSD to the Cosmos II case.

  10. Ubuntu Linux

     Ubuntu Linux - what else?