Sampling one distribution out of another

I am reading this nice paper by Harsha et al. on The Communication Complexity of Correlation, and some of their results rely on the following cute lemma (slightly reworded).

Let $P$ and $Q$ be two distributions on a finite set $\mathcal{X}$ such that the KL-divergence $D(P \| Q)$ is finite. Then there exists a sampling procedure which, when given a sequence of samples $x_1, x_2, \ldots$ drawn i.i.d. according to $Q$, outputs (with probability 1) an index $i^{\ast}$ such that the sample $x_{i^{\ast}}$ is distributed according to $P$, and the expected encoding length of the index $i^{\ast}$ is at most $D(P \| Q) + 2 \log( D(P \| Q) + 1 ) + O(1)$,

where the expectation is taken over the sample sequence and the internal randomness of the procedure.

So what is the procedure? It’s a rejection sampler that does the following. First set $p_0(x) = 0$ for all $x \in \mathcal{X}$ and set $p_0^{\ast} = 0$ as well. Then for each $i = 1, 2, \ldots$ do the following:

1. For every $x \in \mathcal{X}$, set $\alpha_i(x) = \min\{ P(x) - p_{i-1}(x), \ (1 - p^{\ast}_{i-1}) Q(x) \}$.
2. Set $p_i(x) = p_{i-1}(x) + \alpha_i(x)$.
3. Set $p_i^{\ast} = \sum_{x} p_{i}(x)$.
4. Compute $\beta_i = \alpha_i(x_i)/( (1 - p_{i-1}^{\ast}) Q(x_i) )$.
5. Halt and output the index $i$ with probability $\beta_i$; otherwise continue to step $i+1$.
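The five steps above can be sketched in a few lines of Python. This is my own minimal rendering, not code from the paper: distributions are plain dicts from outcomes to probabilities (assuming $Q(x) > 0$ wherever $P(x) > 0$), and the draws from $Q$ are made internally.

```python
import random

def greedy_rejection_sampler(P, Q, rng=None):
    """Return (i_star, x) with x distributed according to P, using i.i.d. draws from Q.

    P, Q: dicts mapping outcomes to probabilities, with Q(x) > 0 wherever P(x) > 0.
    """
    rng = rng or random.Random()
    outcomes, weights = list(Q), list(Q.values())
    p = {x: 0.0 for x in P}   # p_{i-1}(x)
    p_star = 0.0              # p_{i-1}^*
    i = 0
    while True:
        i += 1
        # Step 1: alpha_i(x) = min{ P(x) - p_{i-1}(x), (1 - p_{i-1}^*) Q(x) }
        # (the max(0, ...) only guards against tiny negative float drift)
        alpha = {x: max(0.0, min(P[x] - p[x], (1.0 - p_star) * Q[x])) for x in P}
        # Draw x_i from Q
        x_i = rng.choices(outcomes, weights=weights)[0]
        # Step 4: acceptance probability for the current draw
        beta = alpha.get(x_i, 0.0) / ((1.0 - p_star) * Q[x_i])
        # Step 5: halt and output the index with probability beta_i
        if rng.random() < beta:
            return i, x_i
        # Steps 2-3: update p_i(x) and p_i^* before moving to step i+1
        for x in p:
            p[x] += alpha[x]
        p_star = sum(p.values())
```

On a toy pair such as $P = (0.7, 0.3)$ and $Q = (0.5, 0.5)$, the empirical frequencies of the output sample converge to $P$, as the lemma promises.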

Ok, so what is going on here? It turns out that $\alpha_i(x)$ is the probability that the procedure halts at time $i$ with $x_i = x$ (this is not entirely clear at first). Thus $p_i(x)$ is the probability that we halt at some time $j \le i$ with sample $x$, and $p_i^{\ast}$ is the probability that we halt by time $i$, so $(1 - p_{i-1}^{\ast})$ is the probability that the procedure reaches step $i$. Getting back to $\alpha_i(x)$: conditioned on reaching step $i$, we draw $x_i = x$ with probability $Q(x)$ and then halt with probability $\beta_i = \alpha_i(x)/((1 - p_{i-1}^{\ast}) Q(x))$, so the probability of halting at time $i$ with $x_i = x$ is indeed $(1 - p_{i-1}^{\ast}) \cdot Q(x) \cdot \beta_i = \alpha_i(x)$.

What we want then is that $P(x) = \sum_{i=1}^{\infty} \alpha_i(x)$. So how should we define $\alpha_i(x)$? It’s the minimum of two terms: halting at time $i$ with $x_i = x$ requires reaching step $i$ (probability $1 - p_{i-1}^{\ast}$) and drawing $x$ (probability $Q(x)$), so $\alpha_i(x)$ can be at most $(1 - p_{i-1}^{\ast}) Q(x)$. On the other hand, we don’t want to let $p_{i}(x) = \sum_{j=1}^{i} \alpha_j(x)$ exceed the target probability $P(x)$, so $\alpha_i(x)$ can also be at most $P(x) - p_{i-1}(x)$. The authors say that in this sense the procedure is greedy, since it tries to get $p_i(x)$ as close to $P(x)$ as it can.
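You can watch this greedy filling-up happen numerically. Here is a small sketch (mine, not from the paper) that iterates the recursion deterministically, with no sampling at all:

```python
def iterate_alphas(P, Q, steps):
    """Run the recursion p_i(x) = p_{i-1}(x) + alpha_i(x) and return the final p_i."""
    p = {x: 0.0 for x in P}
    p_star = 0.0
    for _ in range(steps):
        # alpha_i(x) = min{ P(x) - p_{i-1}(x), (1 - p_{i-1}^*) Q(x) }
        alpha = {x: min(P[x] - p[x], (1.0 - p_star) * Q[x]) for x in P}
        for x in p:
            p[x] += alpha[x]
        p_star = sum(p.values())
    return p
```

With $P = (0.7, 0.3)$ and $Q = (0.5, 0.5)$, the values $p_i(x)$ converge geometrically to $P(x)$, and $p_i^{\ast}$ to $1$, which is exactly the correctness claim below.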

In order to show that the sampling is correct, i.e. that the output distribution is $P$, we want to show that $p_i(x) \to P(x)$ as $i \to \infty$. This follows by induction on $i$. To get a bound on the description length of the output index, they have to show that $\mathbb{E}[ \log i^{\ast} ] \le D(P \| Q) + O(1)$.

This is an interesting little fact on its own. It comes out because, whenever the procedure can halt at time $i$ with $x_i = x$ (i.e. $\alpha_i(x) > 0$), every earlier step $j < i$ must have had $\alpha_j(x) = (1 - p_{j-1}^{\ast}) Q(x) \ge (1 - p_{i-1}^{\ast}) Q(x)$ (otherwise $p_j(x)$ would already equal $P(x)$), and these terms sum to $p_{i-1}(x) \le P(x)$; hence $i \le \frac{1}{1 - p_{i-1}^{\ast}} \cdot \frac{P(x)}{Q(x)} + 1$

and then expanding out $\mathbb{E}[ \log i^{\ast} ] = \sum_{x} \sum_{i} \alpha_i(x) \cdot \log i$.
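To see where the main terms come from (this is a sketch in my own words; the careful accounting of the $+1$ and of the sign of $\log(P(x)/Q(x))$ is in the paper), take logarithms in the bound on $i$ and split:

$\mathbb{E}[ \log i^{\ast} ] \lesssim \sum_{x} \sum_{i} \alpha_i(x) \left( \log \frac{P(x)}{Q(x)} + \log \frac{1}{1 - p_{i-1}^{\ast}} \right) + O(1)$.

The first piece is exactly the divergence: since $\sum_i \alpha_i(x) = P(x)$, it sums to $\sum_x P(x) \log \frac{P(x)}{Q(x)} = D(P \| Q)$. The second piece is a left-endpoint Riemann sum of an increasing function, because $\sum_x \alpha_i(x) = p_i^{\ast} - p_{i-1}^{\ast}$:

$\sum_{i} ( p_i^{\ast} - p_{i-1}^{\ast} ) \log \frac{1}{1 - p_{i-1}^{\ast}} \le \int_0^1 \log \frac{1}{1-t} \, dt = \log e$,

which gets swallowed by the $O(1)$.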

One example application of this lemma in the paper is the following simulation problem. Suppose Alice and Bob share an unlimited amount of common randomness. Alice has some $X$ drawn according to $P(x)$ and wants Bob to generate a $Y$ according to the conditional distribution $P(y | x)$, so that ideally $(X,Y)$ has joint distribution $P(x) P(y | x)$. Using the common randomness, they both generate the same sequence $y_1, y_2, \ldots$ drawn i.i.d. according to the marginal distribution $P(y)$. Alice then runs the sampling procedure above with target distribution $P(y | x)$ on this shared sequence and sends the resulting index $i^{\ast}$ to Bob, who outputs the corresponding $y_{i^{\ast}}$. How many bits does Alice have to send on average? Just $\mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) + 2 \log( D( P(y|x) \| P(y) ) + 1) + O(1) ]$,

which, after some love from Jensen’s inequality, turns out to be upper bounded by $I(X ; Y) + 2 \log( I(X; Y) + 1) + O(1)$.
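Concretely, the Jensen step uses the identity $I(X;Y) = \mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) ]$ together with the concavity of $t \mapsto 2 \log(t + 1)$:

$\mathbb{E}_{P(x)}[ 2 \log( D( P(y|x) \| P(y) ) + 1 ) ] \le 2 \log( \mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) ] + 1 ) = 2 \log( I(X;Y) + 1 )$.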

Pretty cute, eh?