Skip to main content


Showing posts from July, 2014

Why does Kaggle use Log-loss?

Finding the Best Book Quotes: Power Ranking Goodreads

My friend Jordan Ellenberg wrote an article for the Wall Street Journal recently describing a metric to roughly measure the least read books, which he calls the Hawking Index (HI). As he describes it, take the page numbers of a book's five top highlights, average them, and divide by the number of pages in the whole book.

On a discussion thread on Facebook, this led to a proposal from me for measuring the general quality of a quote. Assume a user has some level of discrimination \(D\); the higher the value of \(D\), the more likely they are to quote a passage. Now assume each passage has some measure of quality \(Q\); the higher the value of \(Q\), the more likely a passage is to be quoted. Let's try a classic Bradley-Terry-Luce model - if a user with discrimination \(D\) quotes at least one passage from a particular work, the probability \(p\) that they'll quote a given passage with quality \(Q\) from that same work is roughly \[ p = \frac{Q}{Q+D}.\]
Nice, but where can w…