I am reading this nice paper by Harsha et al. on The Communication Complexity of Correlation and some of their results rely on the following cute lemma (slightly reworded).
Let $P$ and $Q$ be two distributions on a finite set $\mathcal{X}$ such that the KL-divergence $D(P \| Q)$ is finite. Then there exists a sampling procedure which, when given a sequence of samples $x_1, x_2, \ldots$ drawn i.i.d. according to $Q$, will output (with probability 1) an index $i^*$ such that the sample $x_{i^*}$ is distributed according to the distribution $P$, and the expected encoding length of the index $i^*$ is at most $D(P\|Q) + 2\log\big(D(P\|Q) + 1\big) + O(1)$, where the expectation is taken over the sample sequence and the internal randomness of the procedure.
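Throughout I'll take the divergence to be the usual

$$D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)},$$

with logarithms base 2, so that encoding lengths come out in bits.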
So what is the procedure? It's a rejection sampler that does the following. First set $p_0^*(x) = 0$ for all $x \in \mathcal{X}$ and set $\alpha_0^* = 0$ as well. Then for each $i \ge 1$ do the following:
- Output the index $i$ with probability $\beta_i(x_i) = \frac{p_i^*(x_i)}{(1 - \alpha_{i-1}^*)\, Q(x_i)}$, where $p_i^*(x) = \min\big( P(x) - \alpha_{i-1}(x),\; (1 - \alpha_{i-1}^*)\, Q(x) \big)$, $\alpha_{i-1}(x) = \sum_{j < i} p_j^*(x)$, and $\alpha_{i-1}^* = \sum_x \alpha_{i-1}(x)$; otherwise move on to the next sample $x_{i+1}$.
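To make this concrete, here's a little Python sketch of the procedure (my own rendering, not code from the paper), assuming $P$ and $Q$ are handed to us as dictionaries over the same finite set and the shared sample sequence comes in as an iterator:

```python
import random

def greedy_rejection_sampler(P, Q, samples, rng=None):
    """Sketch of the greedy rejection sampler described above.

    P, Q    : dicts mapping elements of a common finite set to probabilities,
              with Q(x) > 0 wherever P(x) > 0 (so D(P||Q) is finite).
    samples : an iterator producing x_1, x_2, ... drawn i.i.d. from Q.
    rng     : source of the procedure's internal randomness (the accept coins).

    Returns (i_star, x) with x = x_{i_star} distributed according to P.
    """
    rng = rng or random.Random()
    alpha = {x: 0.0 for x in Q}   # alpha_{i-1}(x): prob. of halting before step i with x
    alpha_star = 0.0              # alpha^*_{i-1}: prob. of halting before step i
    for i, x_i in enumerate(samples, start=1):
        # p_i^*(x) = min( P(x) - alpha_{i-1}(x), (1 - alpha^*_{i-1}) Q(x) )
        p_star = {x: min(P.get(x, 0.0) - alpha[x], (1.0 - alpha_star) * Q[x])
                  for x in Q}
        # accept the current index with probability beta_i(x_i)
        if rng.random() < p_star[x_i] / ((1.0 - alpha_star) * Q[x_i]):
            return i, x_i
        # otherwise fold p_i^* into the bookkeeping and look at x_{i+1}
        for x in Q:
            alpha[x] += p_star[x]
        alpha_star = sum(alpha.values())
    raise RuntimeError("the sample stream ended before the sampler halted")

def iid_stream(Q, rng):
    """An endless stream of i.i.d. samples from Q (used as the shared sequence)."""
    support, weights = list(Q), list(Q.values())
    while True:
        yield rng.choices(support, weights=weights)[0]
```

Calling `greedy_rejection_sampler(P, Q, iid_stream(Q, random.Random(1)))` over and over and histogramming the returned samples should recover $P$; the returned indices are what the lemma says we can encode cheaply.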
Ok, so what is going on here? It turns out that $p_i^*(x)$ is tracking the probability that the procedure halts at time $i$ with $x_i = x$ (this is not entirely clear at first). Thus $\alpha_{i-1}(x) = \sum_{j < i} p_j^*(x)$ is the probability that we halted before time $i$ and the sample is $x$, and $\alpha_{i-1}^* = \sum_x \alpha_{i-1}(x)$ is the probability that we halt at some time before $i$. Thus $1 - \alpha_{i-1}^*$ is the probability that the procedure gets to step $i$. Getting back to $p_i^*(x)$, we see that $x_i = x$ with probability $Q(x)$ and we halt with probability $\beta_i(x)$, so indeed, $(1 - \alpha_{i-1}^*)\, Q(x)\, \beta_i(x) = p_i^*(x)$ is the probability that we halt at time $i$ with $x_i = x$.
What we want then is that $\sum_{i \ge 1} p_i^*(x) = P(x)$. So how should we define $p_i^*(x)$? It's the minimum of two terms: in order to stop at time $i$ such that $x_i = x$, the acceptance probability $\beta_i(x)$ can be at most $1$, so the factor $p_i^*(x) = (1 - \alpha_{i-1}^*)\, Q(x)\, \beta_i(x)$ is at most $(1 - \alpha_{i-1}^*)\, Q(x)$. However, we don't want to let $\alpha_i(x) = \alpha_{i-1}(x) + p_i^*(x)$ exceed the target probability $P(x)$, so $p_i^*(x)$ must be at most $P(x) - \alpha_{i-1}(x)$. The authors say that in this sense the procedure is greedy, since it tries to get $\alpha_i(x)$ as close to $P(x)$ as it can.
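To see the greediness concretely, here's a tiny made-up example (mine, not the paper's): take $\mathcal{X} = \{a, b\}$ with $P = (1/2, 1/2)$ and $Q = (3/4, 1/4)$. Then

$$p_1^*(a) = \min\left(\tfrac{1}{2}, \tfrac{3}{4}\right) = \tfrac{1}{2}, \qquad p_1^*(b) = \min\left(\tfrac{1}{2}, \tfrac{1}{4}\right) = \tfrac{1}{4}, \qquad \alpha_1^* = \tfrac{3}{4},$$

$$p_2^*(b) = \min\left(\tfrac{1}{4}, \tfrac{1}{4}\cdot\tfrac{1}{4}\right) = \tfrac{1}{16}, \qquad p_3^*(b) = \min\left(\tfrac{3}{16}, \tfrac{3}{16}\cdot\tfrac{1}{4}\right) = \tfrac{3}{64}, \qquad \ldots$$

so $a$ is completely taken care of at the first step, while the leftover mass on $b$, namely $P(b) - \alpha_i(b) = \tfrac{1}{4}\left(\tfrac{3}{4}\right)^{i-1}$, shrinks geometrically: conditioned on the output being $b$, we have $\Pr[i^* > i] = \tfrac{1}{2}\left(\tfrac{3}{4}\right)^{i-1}$.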
In order to show that the sampling is correct, i.e. that the output distribution is $P$, we want to show that $\sum_{i \ge 1} p_i^*(x) = P(x)$ for every $x$. This follows by induction. To get a bound on the description length of the output index, they have to show that $\mathbb{E}[\log i^*] \le D(P \| Q) + O(1)$.
This is an interesting little fact on its own. It comes out because the probability that the output is $x$ and the index satisfies $i^* > i$ is exactly $P(x) - \alpha_i(x)$, and then expanding out $\mathbb{E}[\log i^*] = \sum_x \sum_{i \ge 1} p_i^*(x) \log i$.
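Granting that fact, my reading of how it turns into the encoding-length bound of the lemma is this: encode $i^*$ with a prefix-free code for the positive integers of length $\log i + 2\log(\log i + 1) + O(1)$ bits, and then concavity of $\log$ (Jensen) gives

$$\mathbb{E}\big[\ell(i^*)\big] \le \mathbb{E}[\log i^*] + 2\,\mathbb{E}\big[\log(\log i^* + 1)\big] + O(1) \le \mathbb{E}[\log i^*] + 2\log\big(\mathbb{E}[\log i^*] + 1\big) + O(1) \le D(P\|Q) + 2\log\big(D(P\|Q) + 1\big) + O(1).$$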
One example application of this Lemma in the paper is the following problem of simulation. Suppose Alice and Bob share an unlimited amount of common randomness. Alice has some $x$ drawn according to $P_X$ and wants Bob to generate a $y$ according to the conditional distribution $P_{Y|X}(\cdot \mid x)$. So ideally, $(x, y)$ have joint distribution $P_{XY}$. They can both generate a sequence of common samples $y_1, y_2, \ldots$ according to the marginal distribution $P_Y$ of $Y$. Then Alice tries to generate the distribution $P_{Y|X}(\cdot \mid x)$ according to this sampling scheme (taking $P = P_{Y|X=x}$ and $Q = P_Y$) and sends the index $i^*$ to Bob, who chooses the corresponding $y_{i^*}$. How many bits does Alice have to send on average? It's just

$$\mathbb{E}_{x \sim P_X}\Big[ D\big(P_{Y|X=x} \,\big\|\, P_Y\big) + 2 \log\big( D(P_{Y|X=x} \,\|\, P_Y) + 1 \big) + O(1) \Big],$$

which, after some love from Jensen's inequality, turns out to be upper bounded by

$$I(X;Y) + 2 \log\big( I(X;Y) + 1 \big) + O(1).$$
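Just to make the protocol concrete, here is a toy sketch on top of the sampler above. It assumes the joint distribution is passed around as a dictionary over $(x, y)$ pairs, and it models the common randomness as a shared seed that both parties use only to generate the sequence of $y$'s; the accept coins stay on Alice's side. All the names here are mine.

```python
def alice_encode(x, P_joint, shared_seed, private_rng=None):
    """Alice's side: run the greedy sampler with P = P_{Y|X=x} and Q = P_Y."""
    P_Y, P_Y_given_x = marginal_and_conditional(P_joint, x)
    shared_ys = iid_stream(P_Y, random.Random(shared_seed))   # common y_1, y_2, ...
    i_star, _ = greedy_rejection_sampler(P_Y_given_x, P_Y, shared_ys, private_rng)
    return i_star                                              # all Alice transmits

def bob_decode(i_star, P_joint, shared_seed):
    """Bob's side: regenerate the shared y-sequence and read off y_{i*}."""
    P_Y, _ = marginal_and_conditional(P_joint, x=None)
    shared_ys = iid_stream(P_Y, random.Random(shared_seed))
    for _ in range(i_star - 1):
        next(shared_ys)
    return next(shared_ys)

def marginal_and_conditional(P_joint, x):
    """P_Y and (if x is given) P_{Y|X=x} from a dict over (x, y) pairs."""
    P_Y = {}
    for (a, y), p in P_joint.items():
        P_Y[y] = P_Y.get(y, 0.0) + p
    if x is None:
        return P_Y, None
    p_x = sum(p for (a, _), p in P_joint.items() if a == x)
    P_Y_given_x = {y: p / p_x for (a, y), p in P_joint.items() if a == x}
    return P_Y, P_Y_given_x
```

The design point is that the shared seed drives only the $y$ sequence, so Bob can reproduce exactly the samples Alice looked at without ever seeing her private accept coins.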
Pretty cute, eh?