Sampling one distribution out of another

I am reading this nice paper by Harsha et al. on The Communication Complexity of Correlation, and some of their results rely on the following cute lemma (slightly reworded).

Let P and Q be two distributions on a finite set \mathcal{X} such that the KL-divergence D(P \| Q) is finite. Then there exists a sampling procedure which, when given a sequence of samples x_1, x_2, \ldots drawn i.i.d. according to Q, will output (with probability 1) an index i^{\ast} such that the sample x_{i^{\ast}} is distributed according to P, and the expected encoding length of the index i^{\ast} is at most

D(P \| Q) + 2 \log( D(P \| Q) + 1 ) + O(1)

where the expectation is taken over the sample sequence and the internal randomness of the procedure.

So what is the procedure? It’s a rejection sampler that does the following. First set p_0(x) = 0 for all x \in \mathcal{X} and set p_0^{\ast} = 0 as well. Then for each i = 1, 2, \ldots do the following:

  1. \alpha_i(x) = \min\{ P(x) - p_{i-1}(x), \ (1 - p^{\ast}_{i-1}) Q(x) \}
  2. p_i(x) = p_{i-1}(x) + \alpha_i(x)
  3. p_i^{\ast} = \sum_{x} p_{i}(x)
  4. \beta_i = \alpha_i(x_i)/( (1 - p_{i-1}^{\ast}) Q(x_i) )
  5. Halt and output i with probability \beta_i; otherwise continue to step i+1.
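
Here is a minimal Python sketch of the procedure (the dict-based representation of the distributions, and passing the Q-draws in as a callable separate from the RNG used for the accept coins, are my own choices, not the paper's):

```python
import random

def greedy_rejection_sample(P, Q, draw_from_Q, rng):
    """Greedy rejection sampler sketch, after Harsha et al.

    P, Q: dicts mapping each x in a finite set to its probability,
    with P absolutely continuous with respect to Q. draw_from_Q()
    returns the next i.i.d. sample x_i ~ Q; rng supplies the accept
    coins. Returns (i_star, x) with x distributed according to P.
    """
    xs = list(Q)
    p = dict.fromkeys(xs, 0.0)   # p_{i-1}(x), initially p_0(x) = 0
    p_star = 0.0                 # p_{i-1}^* = sum_x p_{i-1}(x)
    i = 0
    while True:
        i += 1
        # step 1: alpha_i(x) = min( P(x) - p_{i-1}(x), (1 - p_{i-1}^*) Q(x) )
        alpha = {x: min(P.get(x, 0.0) - p[x], (1.0 - p_star) * Q[x])
                 for x in xs}
        x_i = draw_from_Q()
        # step 4: beta_i = alpha_i(x_i) / ( (1 - p_{i-1}^*) Q(x_i) )
        beta_i = alpha[x_i] / ((1.0 - p_star) * Q[x_i])
        # step 5: halt and output the index with probability beta_i
        if rng.random() < beta_i:
            return i, x_i
        # steps 2-3: p_i(x) = p_{i-1}(x) + alpha_i(x), p_i^* = sum_x p_i(x)
        for x in xs:
            p[x] += alpha[x]
        p_star = sum(p.values())
```

For example, with rng = random.Random(), calling greedy_rejection_sample({'a': 0.7, 'b': 0.3}, {'a': 0.5, 'b': 0.5}, lambda: rng.choices(['a', 'b'])[0], rng) returns an index and a sample distributed according to P. Keeping the Q-draws separate from the accept coins will pay off in the simulation application below, where the x_i come from shared randomness but the coins can stay private.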

Ok, so what is going on here? It turns out that \alpha_i(x) is tracking the probability that the procedure halts at time i with x_i = x (this is not entirely clear at first). Thus p_i(x) is the probability that we have halted by time i with sample x, and p_i^{\ast} is the probability that we have halted by time i. In particular, (1 - p_{i-1}^{\ast}) is the probability that the procedure gets to step i. Getting back to \alpha_i(x): the procedure reaches step i with probability (1 - p_{i-1}^{\ast}), draws x_i = x with probability Q(x), and then halts with probability \beta_i = \alpha_i(x) / ( (1 - p_{i-1}^{\ast}) Q(x) ). The product of these three factors is exactly \alpha_i(x), so indeed \alpha_i(x) is the probability that we halt at time i with x_i = x.

What we want then is that P(x) = \sum_{i=1}^{\infty} \alpha_i(x). So how should we define \alpha_i(x)? It's the minimum of two terms: to stop at time i with x_i = x, the procedure must first reach step i and draw x, so \alpha_i(x) can be at most (1 - p_{i-1}^{\ast}) Q(x). On the other hand, we don't want p_{i}(x) = \sum_{j=1}^{i} \alpha_j(x) to overshoot the target probability P(x), so \alpha_i(x) can also be at most P(x) - p_{i-1}(x). The authors say that in this sense the procedure is greedy, since at every step it pushes p_i(x) as close to P(x) as it can.

In order to show that the sampling is correct, i.e. that the output distribution is P, we want to show that p_i(x) \to P(x). This follows by induction. To get a bound on the description length of the output index, they have to show that

\mathbb{E}[ \log i^{\ast} ] \le D(P \| Q) + O(1).

This is an interesting little fact on its own. It comes out because, whenever \alpha_i(x) > 0,

i \le \frac{1}{1 - p_{i-1}^{\ast}} \cdot \frac{P(x)}{Q(x)} + 1

(if the procedure can still halt with x at time i, then at each earlier step j the minimum defining \alpha_j(x) must have been achieved by the second term, so P(x) \ge p_{i-1}(x) \ge (i-1)(1 - p_{i-1}^{\ast}) Q(x)), and then expanding out \mathbb{E}[ \log i^{\ast} ] = \sum_{x} \sum_{i} \alpha_i(x) \cdot \log i.

One example application of this lemma in the paper is the following simulation problem. Suppose Alice and Bob share an unlimited amount of common randomness. Alice has some X drawn according to P(x) and wants Bob to generate a Y according to the conditional distribution P(y | x), so that ideally (X,Y) has joint distribution P(x) P(y | x). Using the common randomness, they both generate the same sequence y_1, y_2, \ldots of i.i.d. samples from the marginal distribution P(y). Alice then runs the sampling scheme above with target distribution P(y | x) and proposal distribution P(y), and sends the resulting index i^{\ast} to Bob, who outputs the corresponding y_{i^{\ast}}. How many bits does Alice have to send on average? It's at most

\mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) + 2 \log( D( P(y|x) \| P(y) ) + 1) + O(1) ]

which, after some love from Jensen's inequality (the expectation of the divergence is exactly the mutual information I(X;Y), and \log is concave), turns out to be upper bounded by

I(X;Y) + 2 \log( I(X;Y) + 1 ) + O(1).

Pretty cute, eh?
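
To make the protocol concrete, here's a sketch reusing greedy_rejection_sample from above (seeding identical RNGs to realize the common y-sequence is my own device, as are all the names here):

```python
import random

def alice_send(x, P_cond, P_marg, shared_seed, private_seed=0):
    """Alice's side: run the sampler with target P(y|x) and proposal P(y).

    The sequence y_1, y_2, ... comes from an RNG seeded with the shared
    randomness, so Bob can regenerate it; the accept coins come from
    Alice's private RNG. Returns the index i* that Alice encodes and sends.
    """
    shared = random.Random(shared_seed)
    ys, weights = zip(*P_marg.items())
    draw = lambda: shared.choices(ys, weights=weights)[0]
    return greedy_rejection_sample(P_cond[x], P_marg, draw,
                                   random.Random(private_seed))[0]

def bob_receive(i_star, P_marg, shared_seed):
    """Bob's side: replay the common sequence y_1, y_2, ... and output y_{i*}."""
    shared = random.Random(shared_seed)
    ys, weights = zip(*P_marg.items())
    y = None
    for _ in range(i_star):
        y = shared.choices(ys, weights=weights)[0]
    return y
```

Since Alice consumes exactly i^{\ast} draws from the shared stream before halting, Bob's replay lands on the same y_{i^{\ast}} that Alice accepted, so (X, Y) comes out (up to floating point) with joint distribution P(x) P(y|x).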

Linkage (and a puzzle)

I saw Scott’s talk today on some complexity results related to his and Alex Arkhipov’s work on linear optics. I missed the main seminar but I saw the theory talk, which was on how hard it is to approximate the permanent of a matrix X whose entries X_{ij} are drawn i.i.d. complex circularly-symmetric Gaussian \mathcal{CN}(0,1). In the course of computing the fourth moment of the permanent, he gave the following cute result as a puzzle. Given a permutation \sigma of length n, let c(\sigma) be the number of cycles in \sigma. Suppose \sigma is drawn uniformly from the set of all permutations. Show that

\mathbb{E}[ 2^{c(\sigma)}] = n + 1.

At least I think that’s the statement.
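
Here's a quick brute-force check of the identity for small n (my own sanity check, not from the talk):

```python
import math
from itertools import permutations

def num_cycles(perm):
    """Count the cycles of a permutation given in one-line notation."""
    seen, cycles = set(), 0
    for start in range(len(perm)):
        if start not in seen:
            cycles += 1
            j = start
            while j not in seen:
                seen.add(j)
                j = perm[j]
    return cycles

for n in range(1, 7):
    avg = sum(2 ** num_cycles(p) for p in permutations(range(n))) / math.factorial(n)
    print(n, avg)   # prints n alongside n + 1, e.g. "3 4.0"
```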

In other news…

  • Ken Ono has announced (with others) an algebraic formula for partition numbers. Very exciting!
  • Cosma thinks that New Yorker article is risible, and after talking to a number of people about it, I realized that the writing is what's risible (at first pass I had skimmed to the part I thought was worth reporting in the popular (or elitist) press, namely the bias towards positive results). Andrew Gelman points out that he has written about this before, but I think the venue was the more interesting part here. What is risible about the writing is that it starts out in this “ZOMG OUR SCIENCE POWERZ ARE FAAAAAAADINNNNNNGGGGGGG” mode and then goes on to say slightly more reasonable things. It’s worthy of the worst of Malcolm Gladwell.
  • Tofu is complicated.
  • The 90-second Newbery contest.

Partha Niyogi has passed away

Partha Niyogi passed away after battling brain cancer (via Suresh).

Readers of this blog might be familiar with some of the work he did on Laplacian Eigenmaps and other topics in machine learning, especially manifold learning. I read the Laplacian eigenmaps paper my first year in grad school, as a holdover from my undergrad research in machine learning.

ITW 2010: finishing up

Blogging conferences seems to have gone by the wayside for me, but some quick takes on things from ITW. It was a more coding-focused conference so there were fewer talks of active research interest to me, but I did get to catch up and talk shop with a few people, so that was nice.

Tali Kaufman (who seems to have no homepage) gave a plenary on “Local computation in codes,” which looked at fast (sublinear-time) algorithms for testing whether a binary vector is a codeword, and fast ways to correct single-bit errors. In particular, she was looking at these properties in the context of LDPC codes. It’s a nice place where information theory and CS theory look at the same object but with different intents.

Ueli Maurer gave a great talk on “A Cryptographic Theory of System Indistinguishability,” which started out kind of slow and then ended up at a breakneck speed. This was a way of thinking about crypto from the perspective of systems being able to simulate each other.

Rudolf Ahlswede talked a bit about strong converses and weak converses for channels with a rather generic “time structure.” Essentially this boils down to a difference between lim inf and lim sup, and he gave a rather short proof showing that capacities exist under very mild conditions and that the additivity of capacity (e.g. for two parallel channels) may hold in some settings but not others.

There were lots of other good talks that I enjoyed but I didn’t take particularly good notes this time (I blame the jet lag), so I don’t have much to say here. Tomorrow is the start of Allerton, so I might take better notes for that.

some slightly more recent reads

Suspended in Language (Jim Ottaviani and Leland Purvis) — a graphic novel about Niels Bohr, his life, his theories, and the birth of modern physics. This was a great read and wonderful introduction for those with a scientific bent but perhaps less physics background (me in a nutshell).

Logicomix (Apostolos Doxiadis and Christos Papadimitriou) — continuing with the intellectual comic book trend, this was a semi-fictionalized history of the foundations of mathematics from the perspective of Bertrand Russell. There’s a lot going on in the book, which tries to examine the connections between logic and madness, maps versus reality, and Russell versus Wittgenstein. I very much enjoyed the beginning of the book, but it sort of rushed into the ending: I wanted more about Gödel!

Botany of Desire (Michael Pollan) — this is a lyrically written book about the relationship between people and plants. Pollan goes through four case studies: the apple, the tulip, marijuana, and the potato, describing how each plant satisfies human desires and how humans have shaped the course of its evolution. The writing in this book is beautiful, but his favorite words seem to be Apollonian, Dionysian, and chthonic, which lends some of the text an almost 19th-century feeling. His dissection of the issues with GMO farming and Monsanto in the potato chapter is great, but I wish it were more accessible to the average reader. Ah well, it’s a book for elites, and a very pretty book at that.

Interracial Intimacy: The Regulation of Race and Romance (Rachel F. Moran) — this was a slightly more legalistic and policy-oriented analysis of how interracial relationships were regulated by the state in the United States. Unlike Kennedy’s book, it has a fair bit more about non-black-white relationships, and highlights the differences faced by different ethnic groups. Also unlike Kennedy’s book, it is not aggressively arguing a particular agenda: Kennedy was building up an argument against race-matching in adoption, while Moran is a little more circumspect and seems (at least to my mind) more attuned to the dangers of being prescriptivist. It’s definitely a dry read, but I found it quite informative.

giving no credit where it is not due

Luca pointed to a paper by Chierichetti, Lattanzi, and Panconesi, which has an amusing comment in the last section (I don’t want to spoil it).

The paper itself is interesting, of course. Conductance often appears in bounds on mixing times for Markov chains, but the rumor spreading problem is a bit different than the consensus problems that I have studied in the past. A nice quote from the introduction:

Our long term goal is to characterize a set of necessary and/or sufficient conditions for rumour spreading to be fast in a given network. In this work, we provide a very general sufficient condition — high conductance. Our main motivation comes from the study of social networks. Loosely stated, we are looking after a theorem of the form “Rumour spreading is fast in social networks”. Our result is a good step in this direction because there are reasons to believe that social networks have high conductance.