Linkage

I’m heading off to Mexico in less than 12 hours for a week during which I hope to disconnect: no email, web, or phone. I guess I’ll miss the majority of the post-Bin Laden news cycle. In the meantime, here are some more links because I am too lazy to post content.

Speaking of 9/11, this is simply terrible.

An interview with George Saunders, one of my favorite authors.

Blackwell’s proof of Wald’s Identity, as told by Max.

Long pepper looks fascinating and tasty!

Can Voter ID Laws Be Administered in a Race-Neutral Manner? The short answer is no. The longer answer is a 30-page paper.

Frank has blogged about our trip last weekend to The 2nd 8th Annual Grilled Cheese Invitational. My arteries may never be the same again.

There are no more typewriter factories. This makes me especially sad, as I have a 1914 Underwood No. 5 that I love (and lug).

Quote of the day: the grip of darker powers

Philosophers seem singularly unable to put asunder the aleatory and the epistemological side of probability. This suggests that we are in the grip of darker powers than are admitted into the positivist ontology. Something about the concept of probability precludes the separation which, Carnap thought, was essential to further progress. What?

Ian Hacking, The Emergence of Probability, Cambridge: Cambridge UP, 1975.

more binomial MADness

I posted earlier about the mean absolute deviation (MAD) of a binomial variable S_n with parameters (n,p). Here’s a little follow-up with plots of \mathbb{E}|S_n - np| versus p for different values of n.
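
If you want to reproduce these plots, here is a rough sketch in Python. The grid of p values, the particular n shown, and the helper name binomial_mad are my own choices; the computation itself is just the definition of the MAD summed against the binomial pmf.

import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

def binomial_mad(n, p):
    """Exact E|S_n - np| for S_n ~ Binomial(n, p), summed over the pmf."""
    k = np.arange(n + 1)
    return np.sum(np.abs(k - n * p) * binom.pmf(k, n, p))

ps = np.linspace(0, 1, 1001)
for n in (10, 100, 1000):
    plt.figure()
    plt.plot(ps, [binomial_mad(n, p) for p in ps])
    plt.xlabel("p")
    plt.ylabel("E|S_n - np|")
    plt.title("n = %d" % n)
plt.show()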

The first is for n = 10. Looks beautifully scalloped, no? As we’d expect, the MAD is symmetric about p = 1/2 and monotonically increasing for the first half of the unit interval. Unfortunately, it’s clearly not concave (although it is piecewise concave), which means I have to do a bit more algebra later on.

When n = 100, the scallops turn into a finely serrated dome.

By the time you get to n = 1000, the thing might as well be concave for all that your eye can tell. But you would be deceived. Like a shark’s skin, the tiny denticles can abrade your proof, damaging it beyond repair.

Why do I care about this? If you take n samples from a Bernoulli variable with parameter p, then the (unnormalized) empirical distribution is (n - S_n, S_n). So \frac{1}{n} \mathbb{E}|S_n - np| is the expected total variation distance between the empirical distribution and its mean. More generally, the expected total variation distance for finite-alphabet distributions is a sum of such MAD terms.
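
To spell out that last claim for the Bernoulli case (a quick check, using the \frac{1}{2} L_1 convention for the total variation distance): writing \hat{P}_n = (1 - S_n/n, S_n/n) for the normalized empirical distribution and P = (1-p, p) for its mean,

\| \hat{P}_n - P \|_{\mathrm{TV}} = \frac{1}{2} \left( \left| \frac{S_n}{n} - p \right| + \left| \frac{n - S_n}{n} - (1-p) \right| \right) = \left| \frac{S_n}{n} - p \right| = \frac{1}{n} |S_n - np|,

so taking expectations gives exactly \frac{1}{n} \mathbb{E}|S_n - np|.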

Linkage

The fog in San Francisco (h/t Erin Rhode).

A general approach to privacy/utility tradeoffs, with metric spaces! A new preprint/note by Robert Kleinberg and Katrina Ligett.

Max breaks down how to use the divergence to get tight convergence rates for Markov chains.

The San Diego Asian Film Festival starts on Thursday!

Apparently China = SF Chinatown. Who knew? Maybe the fog confused them.

Mean absolute deviations for binomial random variables

Update: thanks to Yihong Wu for pointing out a typo in the statement of the result, which then took me months to get around to fixing.

I came across this paper in the Annals of Probability:

Mean absolute deviations of sample means and minimally concentrated binomials
Lutz Mattner

It contains the following cute lemma, which I didn’t know about before. Let S_n have a binomial distribution with parameters (n,p). Let b(n,k,p) = \mathbb{P}(S_n = k). The first two parts of the lemma are given below.

Lemma. We have the following:

  1. k_0 \in \arg\max_{k \in \{0,\ldots, n\}} b(n,k,p) if and only if k_0 \in \{0,\ldots, n\} and (n+1)p - 1 \le k_0 \le (n+1) p.
  2. (De Moivre’s mean absolute deviation equality) \sum_{k=0}^{n} |k - np| b(n,k,p) = 2 v (1-p) b(n,v,p) = 2 n p (1-p) \max_{k \in \{0,\ldots, n-1\}} b(n-1,k,p), where v is the unique integer satisfying np < v \le np + 1.
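
As a quick numerical sanity check on the second part, here is a small Python sketch; the function names and the test values of (n,p) are mine, and the brute-force sum is just the definition of the mean absolute deviation.

import numpy as np
from scipy.stats import binom

def mad_direct(n, p):
    """Brute force: sum_k |k - np| b(n,k,p)."""
    k = np.arange(n + 1)
    return np.sum(np.abs(k - n * p) * binom.pmf(k, n, p))

def mad_de_moivre(n, p):
    """De Moivre's form: 2 v (1-p) b(n,v,p), with np < v <= np + 1."""
    v = int(np.floor(n * p)) + 1  # the unique integer with np < v <= np + 1
    return 2 * v * (1 - p) * binom.pmf(v, n, p)

def mad_via_max(n, p):
    """Equivalent form: 2 n p (1-p) max_k b(n-1,k,p)."""
    k = np.arange(n)  # k in {0, ..., n-1}
    return 2 * n * p * (1 - p) * np.max(binom.pmf(k, n - 1, p))

for n, p in [(10, 0.3), (100, 0.47), (1000, 0.9)]:
    print(n, p, mad_direct(n, p), mad_de_moivre(n, p), mad_via_max(n, p))

All three should agree up to floating-point error.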

The second part, which was new to me (perhaps I’ve been too sheltered), is also in a lovely paper from 1991 by Persi Diaconis and Sandy Zabell: “Closed Form Summation for Classical Distributions: Variations on a Theme of De Moivre,” in Statistical Science. Note that the sum in the second part is nothing more than \mathbb{E}|S_n - np|. Using this result, De Moivre proved that \mathbb{E}|S_n/n - p| \to 0 as n \to \infty, which implied (to him) that

if after taking a great number of experiments, it should be perceived that the happenings and failings have been nearly in a certain proportion, such as of 2 to 1, it may safely be concluded that the probabilities of happening or failing at any one time assigned will be very near in that proportion, and that the greater the number of experiments has been, so much nearer the truth will the conjectures be that are derived from them.

Diaconis and Zabell show the origins of this lemma, which leads to De Moivre’s result on the normal approximation to the binomial distribution. As for the proof of the L_1 convergence, they call the proof in the p = 1/2 case “simple but clever, impressive if only because of the notational infirmities of the day.” De Moivre’s proof was in Latin, but you can read a translation in their paper. A simple proof for rational p was given by Todhunter in 1865.

For those with an interest in probability with a dash of history to go along, the paper is a fun read.

Lossless compression via the memoizer

Via Andrew Gelman comes a link to deplump, a new compression tool. It runs the data through a predictive model (like most lossless compressors), but:

Deplump compression technology is built on a probabilistic discrete sequence predictor called the sequence memoizer. The sequence memoizer has been demonstrated to be a very good predictor for discrete sequences. The advantage deplump demonstrates in comparison to other general purpose lossless compressors is largely attributable to the better guesses made by the sequence memoizer.

The paper on the sequence memoizer (by Wood et al.) appeared at ICML 2009, with follow-ups at DCC and ICML 2010. It uses as its probabilistic model a version of the Pitman-Yor process, which is a generalization of the “Chinese restaurant”/“stick-breaking” process. Philosophically, the idea seems to be this: since we don’t know the order of the Markov process which best models the data, we let the model order be “infinite” via the Pitman-Yor process and just infer the right parameters, hopefully avoiding overfitting while remaining efficient. The key challenge is that since the process can have infinite memory, the encoding seems to get hairy, which is why “memoization” becomes important. It seems that the particular parameterization of the PY process is important for reducing the number of parameters, but I didn’t have time to look at the paper in that much detail. Besides, I’m not as much of a source coding guy!
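
To give a flavor of the “unbounded context with back-off” idea (and only the flavor: this toy uses plain additive smoothing over all suffixes of the history, not the Pitman-Yor machinery of the actual sequence memoizer), here is a little Python sketch:

from collections import defaultdict
import math

class BackoffPredictor:
    """Toy unbounded-order predictor: blend counts for a context with the
    prediction from the next-shorter context, bottoming out at uniform."""
    def __init__(self, alphabet, alpha=1.0):
        self.alphabet = sorted(alphabet)
        self.alpha = alpha  # weight given to the backed-off prediction
        # context (a string) -> symbol -> count
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, context, symbol):
        if len(context) == 0:
            backoff = 1.0 / len(self.alphabet)        # uniform prior
        else:
            backoff = self.prob(context[1:], symbol)  # drop the oldest symbol
        c = self.counts[context]
        total = sum(c.values())
        return (c[symbol] + self.alpha * backoff) / (total + self.alpha)

    def update(self, context, symbol):
        # Record the symbol under every suffix of the context. (Doing this
        # kind of bookkeeping compactly is exactly where the real memoizer
        # is clever; here memory grows quadratically in the sequence length.)
        for i in range(len(context) + 1):
            self.counts[context[i:]][symbol] += 1

# Log-loss in bits is a proxy for the size an arithmetic coder driven by
# this model would achieve (assuming the alphabet is known up front).
seq = "abracadabraabracadabra"
model = BackoffPredictor(alphabet=set(seq))
bits = 0.0
for t, s in enumerate(seq):
    bits -= math.log2(model.prob(seq[:t], s))
    model.update(seq[:t], s)
print("%.1f bits for %d symbols" % (bits, len(seq)))

As far as I understand it, the real memoizer replaces this crude smoothing with a hierarchical Pitman-Yor prior tying the contexts together, and the “memoization” is about keeping the representation compact instead of storing every suffix as this toy does.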

I tried it out on Leo Breiman’s paper Statistical Modeling: The Two Cultures. Measured in bytes:

307458 Breiman01StatModel.pdf         original
271279 Breiman01StatModel.pdf.bz2     bzip2 (Burrows-Wheeler transform)
269646 Breiman01StatModel.pdf.gz      gzip
269943 Breiman01StatModel.pdf.zip     zip
266310 Breiman01StatModel.pdf.dpl     deplump

As promised, it is better than the alternatives (though not by much for this example).

What is interesting is that they don’t seem to cite much from the information theory literature. I’m not sure whether this is a case of two communities working on related problems while unaware of the connections, whether the problems are secretly not related, or whether information theorists mostly “gave up” on this problem (I doubt the last, but like I said, I’m not a source coding guy…).

Feller’s anti-Bayesianism

I hadn’t realized that Feller’s classic probability book contains the following dismissal of Bayesian statistics:

Unfortunately Bayes’ rule has been somewhat discredited by metaphysical applications of the type described above. In routine practice, this kind of argument can be dangerous. A quality control engineer is concerned with one particular machine and not with an infinite population of machines from which one was chosen at random. He has been advised to use Bayes’ rule on the grounds that it is logically acceptable and corresponds to our way of thinking. Plato used this type of argument to prove the existence of Atlantis, and philosophers used it to prove the absurdity of Newton’s mechanics. In our case it overlooks the circumstance that the engineer desires success and that he will do better by estimating and minimizing the sources of various types of errors in predicting and guessing. The modern method of statistical tests and estimation is less intuitive but more realistic. It may be not only defended but also applied. — W. Feller, 1950 (pp. 124-125 of the 1970 edition)

A few weeks ago, a little note on Feller’s anti-Bayesianism was posted to the arXiv. It’s a bit of an emotional read; a mathematical op-ed, if you will. However, it does present an interesting perspective on historical “received wisdom” in light of more modern approaches to statistics and Bayesian data analysis. As an example, take the methods from Michael Jordan’s talk at ISIT (video and slides are on the ITSOC website now!), where you can do some cross-validation to see that they are indeed producing the correct results on real data.

What I am missing (as an outsider to the internecine battles of statistics) is an even-handed explanation of what all the hullabaloo is about. Such an article probably exists, but I haven’t seen it…

David Blackwell has passed away

Via Inherent Uncertainty I learned that David Blackwell passed away on July 8th.

Prof. Blackwell’s original paper (with Leo Breiman and Aram Thomasian) on the arbitrarily varying channel was an inspiration to me, and he served on my thesis committee a scant 2 years ago.

I’ll always remember what he told me when I handed him a draft of my thesis. “The best thing about Bayesians is that they’re always right.”