I am reading this nice paper by Harsha et al. on *The Communication Complexity of Correlation* and some of their results rely on the following cute lemma (slightly reworded).

Let $P$ and $Q$ be two distributions on a finite set such that the KL divergence $D(P \| Q)$ is finite. Then there exists a sampling procedure which, when given a sequence of samples $x_1, x_2, \ldots$ drawn i.i.d. according to $Q$, will output (with probability 1) an index $i^*$ such that the sample $x_{i^*}$ is distributed according to the distribution $P$ and the expected encoding length of the index $i^*$ is at most

$$D(P \| Q) + 2 \log\big( D(P \| Q) + 1 \big) + O(1),$$

where the expectation is taken over the sample sequence and the internal randomness of the procedure.

So what is the procedure? It’s a rejection sampler that does the following. First set $p_0^*(x) = 0$ for all $x$ and set $p_0^* = 0$ as well. Then for each $i = 1, 2, \ldots$ do the following:

- Set $\alpha_i(x) = \min\big( P(x) - p_{i-1}^*(x),\ (1 - p_{i-1}^*)\, Q(x) \big)$ for all $x$.
- Set $p_i^*(x) = p_{i-1}^*(x) + \alpha_i(x)$ and $p_i^* = \sum_x p_i^*(x)$.
- Output $i^* = i$ with probability $\beta_i = \dfrac{\alpha_i(x_i)}{(1 - p_{i-1}^*)\, Q(x_i)}$; otherwise move on to $i + 1$.
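To make the bookkeeping concrete, here is the procedure in Python (my own sketch, not the authors' code). `P` and `Q` are dicts of probabilities, `p_star` tracks $p_{i-1}^*(x)$, and acceptance follows the greedy rule $\alpha_i(x) = \min\big(P(x) - p_{i-1}^*(x),\ (1-p_{i-1}^*)Q(x)\big)$:

```python
import random

def greedy_rejection_sampler(P, Q, rng):
    """Given i.i.d. samples from Q, output an index i* with x_{i*} ~ P.

    P and Q are dicts mapping outcomes to probabilities, with Q(x) > 0
    wherever P(x) > 0.  Returns the pair (i*, x_{i*}).
    """
    support = list(Q)
    weights = [Q[x] for x in support]
    p_star = {x: 0.0 for x in Q}  # p_{i-1}^*(x): prob. of halting before i with x
    p_total = 0.0                 # p_{i-1}^* = sum over x of p_{i-1}^*(x)
    i = 0
    while True:
        i += 1
        x = rng.choices(support, weights)[0]  # x_i ~ Q
        # alpha_i(x): greedy increment, capped so that p_i^*(x) <= P(x)
        # and the acceptance probability below stays <= 1
        alpha = {y: min(P.get(y, 0.0) - p_star[y], (1 - p_total) * Q[y])
                 for y in Q}
        if rng.random() < alpha[x] / ((1 - p_total) * Q[x]):
            return i, x           # halt: output index i, sample x_i
        # p_i^*(x) = p_{i-1}^*(x) + alpha_i(x)
        for y in Q:
            p_star[y] += alpha[y]
        p_total = sum(p_star.values())
```

For example, with `P = {'a': 0.8, 'b': 0.2}` and `Q` uniform on `{'a', 'b'}`, the returned samples follow `P` even though every draw `x_i` came from `Q`.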

Ok, so what is going on here? It turns out that $p_i^*(x)$ is tracking the probability that the procedure halts at some time $j \le i$ with $x_j = x$ (this is not entirely clear at first). Thus $p_{i-1}^*(x)$ is the probability that we halted before time $i$ and the sample is $x$, and $\alpha_i(x) = p_i^*(x) - p_{i-1}^*(x)$ is the probability that we halt at time $i$ with $x$. Thus $1 - p_{i-1}^*$ is the probability that the procedure gets to step $i$. Getting back to $\beta_i$, we see that $x_i = x$ with probability $Q(x)$ and we halt with probability $\alpha_i(x) / \big((1 - p_{i-1}^*) Q(x)\big)$, so indeed, $(1 - p_{i-1}^*) \cdot Q(x) \cdot \beta_i = \alpha_i(x)$ is the probability that we halt at time $i$ with $x_i = x$.

What we want then is that $p_i^*(x) \to P(x)$ as $i \to \infty$. So how should we define $\alpha_i(x)$? It’s the minimum of two terms: in order to stop at time $i$ with a legal probability $\beta_i \le 1$, the increment $\alpha_i(x)$ is at most $(1 - p_{i-1}^*)\, Q(x)$. However, we don’t want to let $p_i^*(x)$ exceed the target probability $P(x)$, so $\alpha_i(x)$ must also be at most $P(x) - p_{i-1}^*(x)$. The authors say that in this sense the procedure is greedy, since it tries to get $p_i^*(x)$ as close to $P(x)$ as it can.
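As a quick sanity check on this greedy accounting (toy distributions of my own choosing), we can iterate the update $p_i^*(x) = p_{i-1}^*(x) + \alpha_i(x)$ deterministically, with no sampling at all, and watch $p_i^*(x)$ climb to $P(x)$:

```python
# Iterate the greedy update deterministically: p_star[x] plays the role of
# p_i^*(x) and should increase to P(x), which is exactly the correctness claim.
P = {'a': 0.7, 'b': 0.2, 'c': 0.1}
Q = {'a': 0.2, 'b': 0.3, 'c': 0.5}

p_star = {x: 0.0 for x in Q}
for i in range(200):
    p_total = sum(p_star.values())
    alpha = {x: min(P[x] - p_star[x], (1 - p_total) * Q[x]) for x in Q}
    for x in Q:
        p_star[x] += alpha[x]

# the residual P(x) - p_i^*(x) decays geometrically, so 200 rounds is plenty
assert all(abs(p_star[x] - P[x]) < 1e-9 for x in P)
```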

In order to show that the sampling is correct, i.e. that the output distribution is $P$, we want to show that $p_i^*(x) \to P(x)$ as $i \to \infty$. This follows by induction. To get a bound on the description length of the output index, they have to show that

$$\mathbb{E}[\log i^*] \le D(P \| Q) + O(1).$$

This is an interesting little fact on its own. It comes out because, conditioned on the output sample being $x$,

$$\mathbb{E}[\log i^* \mid x_{i^*} = x] \le \log \frac{P(x)}{Q(x)} + O(1),$$

and then expanding out $D(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$: averaging the conditional bound over $x \sim P$ gives exactly $D(P \| Q) + O(1)$.
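Since $\Pr[i^* = i,\ x_{i^*} = x] = \alpha_i(x)$, the bound $\mathbb{E}[\log i^*] \le D(P\|Q) + O(1)$ can even be checked numerically without any sampling, by accumulating $\mathbb{E}[\log i^*]$ straight from the greedy recursion (toy distributions again, and a generous constant standing in for the $O(1)$):

```python
import math

P = {'a': 0.7, 'b': 0.2, 'c': 0.1}
Q = {'a': 0.2, 'b': 0.3, 'c': 0.5}

# Pr[halt at time i with x] = alpha_i(x), so E[log i*] can be accumulated
# exactly while iterating the greedy update (the tail beyond i = 500 is
# negligible since the unhalted mass decays geometrically).
p_star = {x: 0.0 for x in Q}
expected_log = 0.0
for i in range(1, 501):
    p_total = sum(p_star.values())
    alpha = {x: min(P[x] - p_star[x], (1 - p_total) * Q[x]) for x in Q}
    expected_log += sum(alpha.values()) * math.log2(i)
    for x in Q:
        p_star[x] += alpha[x]

kl = sum(p * math.log2(p / Q[x]) for x, p in P.items())
assert expected_log <= kl + 3  # E[log i*] <= D(P||Q) + O(1), with O(1) ~ 3 here
```

For these particular distributions the expected log-index comes out close to the divergence, as the lemma promises.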

One example application of this lemma in the paper is the following problem of simulation. Suppose Alice and Bob share an unlimited amount of common randomness. Alice has some $X$ drawn according to $p_X$ and wants Bob to generate a $Y$ according to the conditional distribution $p_{Y|X}(\cdot \mid X)$. So ideally, $(X, Y)$ have joint distribution $p_{XY}$. They can both generate a sequence of common $Y_1, Y_2, \ldots$ according to the marginal distribution $p_Y$ of $Y$. Then Alice tries to generate the distribution $p_{Y|X}(\cdot \mid X)$ according to this sampling scheme (taking $P = p_{Y|X}(\cdot \mid X)$ and $Q = p_Y$) and sends the index $i^*$ to Bob, who chooses the corresponding $Y_{i^*}$. How many bits does Alice have to send on average? It’s just

$$\mathbb{E}_X\Big[ D\big( p_{Y|X}(\cdot \mid X) \,\|\, p_Y \big) + 2 \log\big( D( p_{Y|X}(\cdot \mid X) \,\|\, p_Y ) + 1 \big) + O(1) \Big],$$

which, after some love from Jensen’s inequality, turns out to be upper bounded by

$$I(X; Y) + 2 \log\big( I(X; Y) + 1 \big) + O(1).$$
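Spelling out that Jensen step (my arithmetic, not the paper’s exact wording): the average of the divergence term is exactly the mutual information,

$$\mathbb{E}_X\big[ D\big( p_{Y|X}(\cdot \mid X) \,\|\, p_Y \big) \big] = \sum_{x,y} p_{XY}(x, y) \log \frac{p_{Y|X}(y \mid x)}{p_Y(y)} = I(X; Y),$$

and since $t \mapsto 2 \log(t + 1)$ is concave, Jensen’s inequality gives

$$\mathbb{E}_X\Big[ 2 \log\big( D( p_{Y|X}(\cdot \mid X) \,\|\, p_Y ) + 1 \big) \Big] \le 2 \log\big( I(X; Y) + 1 \big).$$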

Pretty cute, eh?