I am reading this nice paper by Harsha et al. on The Communication Complexity of Correlation and some of their results rely on the following cute lemma (slightly reworded).
Let $P$ and $Q$ be two distributions on a finite set $\mathcal{X}$ such that the KL-divergence $D(P \| Q)$ is finite. Then there exists a sampling procedure which, when given a sequence of samples $x_1, x_2, \ldots$ drawn i.i.d. according to $Q$, will output (with probability 1) an index $i^*$ such that the sample $x_{i^*}$ is distributed according to the distribution $P$, and the expected encoding length of the index $i^*$ is at most $D(P\|Q) + 2\log\big(D(P\|Q) + 1\big) + O(1)$, where the expectation is taken over the sample sequence and the internal randomness of the procedure.
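Throughout I'll take the divergence to be the usual

$$D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)},$$

with logarithms base 2, so that encoding lengths come out in bits.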
So what is the procedure? It's a rejection sampler that does the following. First set $p_0^*(x) = 0$ for all $x \in \mathcal{X}$ and set $\alpha_0^* = 0$ as well. Then for each $i \ge 1$ do the following:
- Output the index $i$ with probability $\beta_i(x_i) = \frac{p_i^*(x_i)}{(1 - \alpha_{i-1}^*)\, Q(x_i)}$, where $p_i^*(x) = \min\big( P(x) - \alpha_{i-1}(x),\; (1 - \alpha_{i-1}^*)\, Q(x) \big)$, $\alpha_{i-1}(x) = \sum_{j < i} p_j^*(x)$, and $\alpha_{i-1}^* = \sum_x \alpha_{i-1}(x)$; otherwise move on to the next sample $x_{i+1}$.
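To make this concrete, here's a little Python sketch of the procedure (my own rendering, not code from the paper), assuming $P$ and $Q$ are handed to us as dictionaries over the same finite set and the shared sample sequence comes in as an iterator:

```python
import random

def greedy_rejection_sampler(P, Q, samples, rng=None):
    """Sketch of the greedy rejection sampler described above.

    P, Q    : dicts mapping elements of a common finite set to probabilities,
              with Q(x) > 0 wherever P(x) > 0 (so D(P||Q) is finite).
    samples : an iterator producing x_1, x_2, ... drawn i.i.d. from Q.
    rng     : source of the procedure's internal randomness (the accept coins).

    Returns (i_star, x) with x = x_{i_star} distributed according to P.
    """
    rng = rng or random.Random()
    alpha = {x: 0.0 for x in Q}   # alpha_{i-1}(x): prob. of halting before step i with x
    alpha_star = 0.0              # alpha^*_{i-1}: prob. of halting before step i
    for i, x_i in enumerate(samples, start=1):
        # p_i^*(x) = min( P(x) - alpha_{i-1}(x), (1 - alpha^*_{i-1}) Q(x) )
        p_star = {x: min(P.get(x, 0.0) - alpha[x], (1.0 - alpha_star) * Q[x])
                  for x in Q}
        # accept the current index with probability beta_i(x_i)
        if rng.random() < p_star[x_i] / ((1.0 - alpha_star) * Q[x_i]):
            return i, x_i
        # otherwise fold p_i^* into the bookkeeping and look at x_{i+1}
        for x in Q:
            alpha[x] += p_star[x]
        alpha_star = sum(alpha.values())
    raise RuntimeError("the sample stream ended before the sampler halted")

def iid_stream(Q, rng):
    """An endless stream of i.i.d. samples from Q (used as the shared sequence)."""
    support, weights = list(Q), list(Q.values())
    while True:
        yield rng.choices(support, weights=weights)[0]
```

Calling `greedy_rejection_sampler(P, Q, iid_stream(Q, random.Random(1)))` over and over and histogramming the returned samples should recover $P$; the returned indices are what the lemma says we can encode cheaply.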
Ok, so what is going on here? It turns out that $p_i^*(x)$ is tracking the probability that the procedure halts at time $i$ with $x_i = x$ (this is not entirely clear at first). Thus $\alpha_{i-1}(x) = \sum_{j < i} p_j^*(x)$ is the probability that we halted before time $i$ and the sample is $x$, and $\alpha_{i-1}^* = \sum_x \alpha_{i-1}(x)$ is the probability that we halt at some time before $i$. Thus $1 - \alpha_{i-1}^*$ is the probability that the procedure gets to step $i$. Getting back to $p_i^*(x)$, we see that $x_i = x$ with probability $Q(x)$ and we halt with probability $\beta_i(x)$, so indeed, $(1 - \alpha_{i-1}^*)\, Q(x)\, \beta_i(x) = p_i^*(x)$ is the probability that we halt at time $i$ with $x_i = x$.
What we want then is that $\sum_{i \ge 1} p_i^*(x) = P(x)$. So how should we define $p_i^*(x)$? It's the minimum of two terms: in order to stop at time $i$ such that $x_i = x$, the acceptance probability $\beta_i(x)$ can be at most $1$, so the factor $p_i^*(x) = (1 - \alpha_{i-1}^*)\, Q(x)\, \beta_i(x)$ is at most $(1 - \alpha_{i-1}^*)\, Q(x)$. However, we don't want to let $\alpha_i(x) = \alpha_{i-1}(x) + p_i^*(x)$ exceed the target probability $P(x)$, so $p_i^*(x)$ must be at most $P(x) - \alpha_{i-1}(x)$. The authors say that in this sense the procedure is greedy, since it tries to get $\alpha_i(x)$ as close to $P(x)$ as it can.
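To see the greediness concretely, here's a tiny made-up example (mine, not the paper's): take $\mathcal{X} = \{a, b\}$ with $P = (1/2, 1/2)$ and $Q = (3/4, 1/4)$. Then

$$p_1^*(a) = \min\left(\tfrac{1}{2}, \tfrac{3}{4}\right) = \tfrac{1}{2}, \qquad p_1^*(b) = \min\left(\tfrac{1}{2}, \tfrac{1}{4}\right) = \tfrac{1}{4}, \qquad \alpha_1^* = \tfrac{3}{4},$$

$$p_2^*(b) = \min\left(\tfrac{1}{4}, \tfrac{1}{4}\cdot\tfrac{1}{4}\right) = \tfrac{1}{16}, \qquad p_3^*(b) = \min\left(\tfrac{3}{16}, \tfrac{3}{16}\cdot\tfrac{1}{4}\right) = \tfrac{3}{64}, \qquad \ldots$$

so $a$ is completely taken care of at the first step, while the leftover mass on $b$, namely $P(b) - \alpha_i(b) = \tfrac{1}{4}\left(\tfrac{3}{4}\right)^{i-1}$, shrinks geometrically: conditioned on the output being $b$, we have $\Pr[i^* > i] = \tfrac{1}{2}\left(\tfrac{3}{4}\right)^{i-1}$.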
In order to show that the sampling is correct, i.e. that the output distribution is $P$, we want to show that $\sum_{i \ge 1} p_i^*(x) = P(x)$ for every $x$. This follows by induction. To get a bound on the description length of the output index, they have to show that $\mathbb{E}[\log i^*] \le D(P \| Q) + O(1)$.
This is an interesting little fact on its own. It comes out because the probability that the output is $x$ and the index satisfies $i^* > i$ is exactly $P(x) - \alpha_i(x)$, and then expanding out $\mathbb{E}[\log i^*] = \sum_x \sum_{i \ge 1} p_i^*(x) \log i$.
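Granting that fact, my reading of how it turns into the encoding-length bound of the lemma is this: encode $i^*$ with a prefix-free code for the positive integers of length $\log i + 2\log(\log i + 1) + O(1)$ bits, and then concavity of $\log$ (Jensen) gives

$$\mathbb{E}\big[\ell(i^*)\big] \le \mathbb{E}[\log i^*] + 2\,\mathbb{E}\big[\log(\log i^* + 1)\big] + O(1) \le \mathbb{E}[\log i^*] + 2\log\big(\mathbb{E}[\log i^*] + 1\big) + O(1) \le D(P\|Q) + 2\log\big(D(P\|Q) + 1\big) + O(1).$$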
One example application of this Lemma in the paper is the following problem of simulation. Suppose Alice and Bob share an unlimited amount of common randomness. Alice has some $x$ drawn according to $P_X$ and wants Bob to generate a $y$ according to the conditional distribution $P_{Y|X}(\cdot \mid x)$. So ideally, $(x, y)$ have joint distribution $P_{XY}$. They can both generate a sequence of common samples $y_1, y_2, \ldots$ according to the marginal distribution $P_Y$ of $Y$. Then Alice tries to generate the distribution $P_{Y|X}(\cdot \mid x)$ according to this sampling scheme (taking $P = P_{Y|X=x}$ and $Q = P_Y$) and sends the index $i^*$ to Bob, who chooses the corresponding $y_{i^*}$. How many bits does Alice have to send on average? It's just

$$\mathbb{E}_{x \sim P_X}\Big[ D\big(P_{Y|X=x} \,\big\|\, P_Y\big) + 2 \log\big( D(P_{Y|X=x} \,\|\, P_Y) + 1 \big) + O(1) \Big],$$

which, after some love from Jensen's inequality, turns out to be upper bounded by

$$I(X;Y) + 2 \log\big( I(X;Y) + 1 \big) + O(1).$$
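Just to make the protocol concrete, here is a toy sketch on top of the sampler above. It assumes the joint distribution is passed around as a dictionary over $(x, y)$ pairs, and it models the common randomness as a shared seed that both parties use only to generate the sequence of $y$'s; the accept coins stay on Alice's side. All the names here are mine.

```python
def alice_encode(x, P_joint, shared_seed, private_rng=None):
    """Alice's side: run the greedy sampler with P = P_{Y|X=x} and Q = P_Y."""
    P_Y, P_Y_given_x = marginal_and_conditional(P_joint, x)
    shared_ys = iid_stream(P_Y, random.Random(shared_seed))   # common y_1, y_2, ...
    i_star, _ = greedy_rejection_sampler(P_Y_given_x, P_Y, shared_ys, private_rng)
    return i_star                                              # all Alice transmits

def bob_decode(i_star, P_joint, shared_seed):
    """Bob's side: regenerate the shared y-sequence and read off y_{i*}."""
    P_Y, _ = marginal_and_conditional(P_joint, x=None)
    shared_ys = iid_stream(P_Y, random.Random(shared_seed))
    for _ in range(i_star - 1):
        next(shared_ys)
    return next(shared_ys)

def marginal_and_conditional(P_joint, x):
    """P_Y and (if x is given) P_{Y|X=x} from a dict over (x, y) pairs."""
    P_Y = {}
    for (a, y), p in P_joint.items():
        P_Y[y] = P_Y.get(y, 0.0) + p
    if x is None:
        return P_Y, None
    p_x = sum(p for (a, _), p in P_joint.items() if a == x)
    P_Y_given_x = {y: p / p_x for (a, y), p in P_joint.items() if a == x}
    return P_Y, P_Y_given_x
```

The design point is that the shared seed drives only the $y$ sequence, so Bob can reproduce exactly the samples Alice looked at without ever seeing her private accept coins.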
Pretty cute, eh?