Sampling one distribution out of another

I am reading this nice paper by Harsha et al on The Communication Complexity of Correlation and some of their results rely on the following cute lemma (slightly reworded).

Let P and Q be two distributions on a finite set \mathcal{X} such that the KL-divergence D(P \| Q)  > 0. Then there exists a sampling procedure which, when given a sequence of samples x_1, x_2, \ldots drawn i.i.d. according to Q, will output (with probability 1) an index i^{\ast} such that the sample x_{i^{\ast}} is distributed according to the distribution P and the expected encoding length of the index i^{\ast} is at most

D(P \| Q) + 2 \log( D(P \| Q) + 1 ) + O(1)

where the expectation is taken over the sample sequence and the internal randomness of the procedure.

So what is the procedure? It’s a rejection sampler that does the following. First set p_0(x) = 0 for all x \in \mathcal{X} and set p_0^{\ast} = 0 as well. Then for each i = 1, 2, \ldots do the following:

  1. \alpha_i(x) = \min\{ P(x) - p_{i-1}(x), \ (1 - p^{\ast}_{i-1}) Q(x) \}
  2. p_i(x) = p_{i-1}(x) + \alpha_i(x)
  3. p_{i^{\ast}} = \sum_{x} p_{i}(x)
  4. \beta_i = \alpha_i(x_i)/( (1 - p_{i-1}^{\ast}) Q(x_i) )
  5. Output i with probability \beta_i.

Ok, so what is going on here? It turns out that \alpha_i(x) is tracking the probability that the procedure halts at i with x_i = x (this is not entirely clear at first). Thus p_i(x) is the probability that we halted before time i and the sample is x, and p_i^{\ast} is the probability that we halt at time i. Thus (1 - p_{i-1}^{\ast}) is the probability that the procedure gets to step i. Getting back to \alpha_i(x), we see that x_i = x with probability Q(x_i) and we halt with probability \beta_i, so indeed, \alpha_i(x) is the probability that we halt at time i with x_i = x.

What we want then is that P(x) = \sum_{i=1}^{\infty} \alpha_i(x). So how should we define \alpha_i(x)? It’s the minimum of two terms : in order to stop at time i such that x_i = x, the factor \alpha_i(x) is at most (1 - p_{i-1}^{\ast}) Q(x). However, we don’t want to let p_{i}(x) = \sum_{j=1}^{i} \alpha_j(x) exceed the target probability P(x), so \alpha_i(x) must be less than P(x) - p_{i-1}(x). The authors say that in this sense the procedure is greedy, since it tries to get p_i(x) as close to P(x) as it can.

In order to show that the sampling is correct, i.e. that the output distribution is P, we want to show what p_i(x) \to P(x). This follows by induction. To get a bound on the description length of the output index, they have to show that

\mathbb{E}[ \log i^{\ast} ] \le D(P \| Q) + O(1).

This is an interesting little fact on its own. It comes out because

i \le \frac{1}{1 - p_{i-1}^{\ast}} \cdot \frac{P(x)}{Q(x)} + 1

and then expanding out \mathbb{E}[ \log i^{\ast} ] = \sum_{x} \sum_{i} \alpha_i(x) \cdot \log i.

One example application of this Lemma in the paper is the following problem of simulation. Suppose Alice and Bob share an unlimited amount of common randomness. Alice has some X drawn according to P(x) and wants Bob to generate a Y according to the distribution P(y | x). So ideally, (X,Y) have joint distribution P(x) P(y | x). They can both generate a sequence of common y_1, y_2, \ldots according to the marginal distribution of Y. Then Alice tries to generate the distribution P(y | x) according to this sampling scheme and sends the index i^{\ast} to Bob, who chooses the corresponding y_{i^{\ast}}. How many bits does Alice have to send on average? It’s just

\mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) + 2 \log( D( P(y|x) \| P(y) ) + 1) + O(1) ].

which, after some love from Jensen’s inequality, turns out to be upper bounded by

I(X ; Y) + 2 \log( I(X; Y) + 1) + O(1) ].

Pretty cute, eh?


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.