Sampling one distribution out of another

I am reading this nice paper by Harsha et al. on The Communication Complexity of Correlation, and some of their results rely on the following cute lemma (slightly reworded).

Let P and Q be two distributions on a finite set \mathcal{X} such that the KL-divergence D(P \| Q) < \infty (equivalently, Q(x) > 0 whenever P(x) > 0). Then there exists a sampling procedure which, when given a sequence of samples x_1, x_2, \ldots drawn i.i.d. according to Q, will output (with probability 1) an index i^{\ast} such that the sample x_{i^{\ast}} is distributed according to P, and the expected encoding length of the index i^{\ast} is at most

D(P \| Q) + 2 \log( D(P \| Q) + 1 ) + O(1)

where the expectation is taken over the sample sequence and the internal randomness of the procedure.

So what is the procedure? It’s a rejection sampler that does the following. First set p_0(x) = 0 for all x \in \mathcal{X} and set p_0^{\ast} = 0 as well. Then for each i = 1, 2, \ldots do the following:

  1. \alpha_i(x) = \min\{ P(x) - p_{i-1}(x), \ (1 - p^{\ast}_{i-1}) Q(x) \}
  2. p_i(x) = p_{i-1}(x) + \alpha_i(x)
  3. p_i^{\ast} = \sum_{x} p_{i}(x)
  4. \beta_i = \alpha_i(x_i)/( (1 - p_{i-1}^{\ast}) Q(x_i) )
  5. Output the index i with probability \beta_i and halt; otherwise move on to i+1.

Ok, so what is going on here? It turns out that \alpha_i(x) is tracking the probability that the procedure halts at step i with x_i = x (this is not entirely clear at first). Thus p_i(x) is the probability that we halt at or before step i with output x, and p_i^{\ast} is the probability that we halt at or before step i, so (1 - p_{i-1}^{\ast}) is the probability that the procedure actually reaches step i. Getting back to \alpha_i(x): the sample satisfies x_i = x with probability Q(x), and given that we reach step i we halt with probability \beta_i, so multiplying these factors together shows that \alpha_i(x) is indeed the probability that we halt at step i with x_i = x.

What we want, then, is that P(x) = \sum_{i=1}^{\infty} \alpha_i(x). So how should we define \alpha_i(x)? It’s the minimum of two terms: in order to stop at time i with x_i = x, the increment \alpha_i(x) can be at most (1 - p_{i-1}^{\ast}) Q(x), which is the probability of reaching step i and drawing x. At the same time, we don’t want p_{i}(x) = \sum_{j=1}^{i} \alpha_j(x) to exceed the target probability P(x), so \alpha_i(x) can also be at most P(x) - p_{i-1}(x). The authors say that in this sense the procedure is greedy, since it tries to push p_i(x) as close to P(x) as it can at every step.
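
The procedure is easy to try out. Here is a minimal sketch in Python/NumPy (the function name, the representation of P and Q as probability vectors, and having the sampler draw its own x_i from Q are my choices for illustration, not the paper's):

```python
import numpy as np

def greedy_rejection_sampler(P, Q, rng=None):
    """Sketch of the procedure above.  P and Q are probability vectors over a
    finite alphabet {0, ..., len(P) - 1}, with Q(x) > 0 wherever P(x) > 0.
    Returns (i_star, x) where x = x_{i_star} is distributed according to P."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    p = np.zeros_like(P)   # p_{i-1}(x): mass already accounted for
    p_star = 0.0           # p_{i-1}^*: probability of halting before step i
    i = 0
    while True:
        i += 1
        # greedy increment: as much mass as possible without overshooting P(x)
        alpha = np.minimum(P - p, (1.0 - p_star) * Q)
        x = rng.choice(len(Q), p=Q)                 # x_i drawn i.i.d. from Q
        beta = alpha[x] / ((1.0 - p_star) * Q[x])   # acceptance probability
        if rng.random() < beta:
            return i, x
        p = p + alpha        # p_i(x)
        p_star = p.sum()     # p_i^*
```

Histogramming the returned x over many runs should recover P, and the average of \log i^{\ast} should respect the bound above (up to floating-point issues when p^{\ast} creeps very close to 1).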

In order to show that the sampling is correct, i.e. that the output distribution is P, we want to show that p_i(x) \to P(x) as i \to \infty. This follows by induction. To get a bound on the description length of the output index, they have to show that

\mathbb{E}[ \log i^{\ast} ] \le D(P \| Q) + O(1).

This is an interesting little fact on its own. It comes out because whenever \alpha_i(x) > 0, every earlier increment must have been \alpha_j(x) = (1 - p_{j-1}^{\ast}) Q(x) \ge (1 - p_{i-1}^{\ast}) Q(x) (otherwise p_j(x) would already have reached P(x)), so p_{i-1}(x) \ge (i-1)(1 - p_{i-1}^{\ast}) Q(x); combining this with p_{i-1}(x) \le P(x) yields

i \le \frac{1}{1 - p_{i-1}^{\ast}} \cdot \frac{P(x)}{Q(x)} + 1

and then expanding out \mathbb{E}[ \log i^{\ast} ] = \sum_{x} \sum_{i} \alpha_i(x) \cdot \log i.
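
One way to finish the calculation (this is my own bookkeeping of the constants, not necessarily the paper's) is to take logs in the inequality above and use \log(ab + 1) \le 1 + \log a + [\log b]_{+} for a \ge 1:

\log i \le 1 + \log \frac{1}{1 - p_{i-1}^{\ast}} + \left[ \log \frac{P(x)}{Q(x)} \right]_{+}.

Weighting by \alpha_i(x) and summing over i and x, the last term contributes \sum_x P(x) [\log (P(x)/Q(x))]_{+} \le D(P \| Q) + \log e (using \sum_i \alpha_i(x) = P(x)), and the middle term contributes \sum_i (p_i^{\ast} - p_{i-1}^{\ast}) \log \frac{1}{1 - p_{i-1}^{\ast}} \le \int_0^1 \log(1/u)\, du = \log e, which gives the D(P \| Q) + O(1) bound.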

One example application of this Lemma in the paper is the following problem of simulation. Suppose Alice and Bob share an unlimited amount of common randomness. Alice has some X drawn according to P(x) and wants Bob to generate a Y according to the conditional distribution P(y | x), so that ideally (X,Y) has the joint distribution P(x) P(y | x). Using the common randomness, they both generate the same sequence y_1, y_2, \ldots drawn i.i.d. according to the marginal distribution P(y). Alice then runs the sampling procedure above with target distribution P(y | x) and proposal distribution P(y) on this shared sequence, and sends the resulting index i^{\ast} to Bob, who outputs the corresponding y_{i^{\ast}}. How many bits does Alice have to send on average? It’s just

\mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) + 2 \log( D( P(y|x) \| P(y) ) + 1) + O(1) ],

which, after some love from Jensen’s inequality, turns out to be upper bounded by

I(X ; Y) + 2 \log( I(X; Y) + 1) + O(1).
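
Spelled out, the Jensen step is just that \mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) ] = I(X;Y) and that t \mapsto 2 \log(t + 1) is concave, so

\mathbb{E}_{P(x)}[ 2 \log( D( P(y|x) \| P(y) ) + 1 ) ] \le 2 \log( I(X;Y) + 1 ).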

Pretty cute, eh?

Linkage

I’m blogging from Chicago, where it is a balmy 42 degrees but sunny. Whither spring, I ask! Actually, I’m not blogging so much as linking to a bunch of stuff.

For San Diegans, the SD Asian Film Festival Spring Showcase is going on. It looks like I’ll miss a lot of it but I might try to catch something at the end of the week.

Less Pretentious & More Accurate Titles For Literary Masterworks — funny but possibly NSFW.

This home-scanning program seems creepy, regardless of the constitutionality issues.

Unfortunate headlines strike again.

I really like scallion pancakes. I’ll have to try this out when I get back to San Diego.

I agree that this video is awesome. Yo-Yo Ma and Lil Buck. I think that dude is made of rubber. And steel.

Tom Waits was induced into the Rock and Roll Hall of Fame. I just hope I get to see him live some day.

Some things to skim or read from ArXiV when I get the chance:
Sequential Analysis in High Dimensional Multiple Testing and Sparse Recovery (Matt Malloy, Robert Nowak)
Differential Privacy: on the trade-off between Utility and Information Leakage (Mário S. Alvim, Miguel E. Andrés, Konstantinos Chatzikokolakis, Pierpaolo Degano, Catuscia Palamidessi)
Capacity of Byzantine Consensus with Capacity-Limited Point-to-Point Links (Guanfeng Liang, Nitin Vaidya)
Settling the feasibility of interference alignment for the MIMO interference channel: the symmetric square case (Guy Bresler, Dustin Cartwright, David Tse)
Decentralized Online Learning Algorithms for Opportunistic Spectrum Access (Yi Gai, Bhaskar Krishnamachari)
Online and Batch Learning Algorithms for Data with Missing Features (Afshin Rostamizadeh, Alekh Agarwal, Peter Bartlett)
Nonuniform Coverage Control on the Line (Naomi Ehrich Leonard, Alex Olshevsky)
Degree Fluctuations and the Convergence Time of Consensus Algorithms (Alex Olshevsky, John Tsitsiklis)

ITA 2011 : zombie blogging

I came down with the flu at the tail end of ITA, so I proceeded to fail at a bunch of things, like submitting reviews that were due, writing a submission for ISIT, and blogging the ITA Workshop in time. Cosma already blogged about his favorite talks, beating me to the punch.

I basically missed the first day because of technical glitches with our registration system, but once that was all resolved things went a bit more smoothly (at least I hope they did). The poster session seemed well-attended, and we shot videos of all the posters which will be posted at some point. Ofer did a great job arranging the Graduation Day and poster events. The thing about these conferences is that you end up wanting to talk to people you haven’t seen in a while, and it’s good to hammer out research ideas during the daytime, so I only made it to the keynote and Xiao-Li Meng’s tutorial on MCMC. I felt like I followed the tutorial at the beginning, but even Prof. Meng’s engaging speaking style lost me when it came to modeling something about stars (?). But there will be videos posted of the tutorials soon enough as well. I’ll probably make a post about those. For those who were at the entertainment program, the video for that was of course top priority. For the small number of blog readers who wish to know what I was making:

  • 2 oz. Maker’s Mark bourbon
  • 1 oz. Carpano Antica formula vermouth
  • 1 dash Angostura bitters
  • 1 dash Regan’s No. 6 orange bitters

Shaken with ice, served up with a cherry. I opted for a bourbon Manhattan with a cherry rather than a rye Manhattan with an orange twist (or without garnish) because it was more convenient, and also more 1960s than craft cocktail.

But on to the talks! I did manage to drag my lazy butt to some of them.

Improved rate-equivocation regions for secure cooperative communication
Ninoslav Marina, Hideki Yagi, H. Vincent Poor
They looked at a model where you have a transmitter and also a “blind” helper who is trying to help communicate over a wiretap channel. They show a better achievable rate-equivocation region by introducing another auxiliary random variable (big surprise!), but this doesn’t affect the best secrecy rate. So if you are willing to tolerate less than full equivocation at the eavesdropper then you’ll get an improvement.

Shannon’s inequality
S. Verdú, Princeton
Sergio talked about an alternative to Fano’s inequality used by Shannon:
P_e \ge \frac{1}{6} \frac{ H(X|Y) }{ \log M + \log \log M - \log H(X|Y) }
It was a nice talk, and the kind of talk I think is great at ITA. It’s not a new result, but ITA is a place where you can give a talk that explains some cool connection or new idea you have.

On the zero-error capacity threshold for deletion channels
Ian A. Kash, Michael Mitzenmacher, Justin Thaler, Jon Ullman
A nice piece of work on connecting zero-error capacity for deletion channels with longest common subsequences. The error model is adversarial. You can make a graph where each vertex is a length-n binary string, and connect two vertices if the two strings have a longest common subsequence of length at least (1 - p)n. If two strings are connected then they can’t be in the same code, since an adversary could delete pn bits and create the common subsequence (note: not substring). So you can get a bound on the capacity by getting a bound on the largest independent set in this graph. So then you can use… Turán’s theorem! Hooray! There are more results of course…
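
Just to make the objects concrete, here is a toy Python sketch (my own illustration, not anything from the paper) that builds this confusability graph for tiny n and evaluates the Turán-type lower bound N/(1 + average degree) on the largest independent set, i.e. on the size of a zero-error code:

```python
from itertools import combinations, product

def lcs_length(a, b):
    # standard O(n^2) dynamic program for the longest common subsequence
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def turan_code_bound(n, p):
    """Vertices: length-n binary strings.  Edges: pairs whose LCS has length at
    least (1 - p) * n, i.e. pairs an adversary can confuse with p*n deletions.
    Turan's theorem guarantees an independent set of size N / (1 + avg degree)."""
    verts = ["".join(bits) for bits in product("01", repeat=n)]
    num_edges = sum(1 for a, b in combinations(verts, 2)
                    if lcs_length(a, b) >= (1 - p) * n)
    N = len(verts)
    return N / (1 + 2 * num_edges / N)

print(turan_code_bound(8, 0.25))  # only feasible for tiny n; the interest is asymptotic
```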

Data-driven decision making in healthcare systems
Mohsen Bayati, Stanford, Mark Braverman, U Toronto, Michael Gillam, Microsoft, Mark Smith, MedStar Health, and Eric Horvitz, Microsoft Research
This was a nice talk on doing feature selection via ideas from sparse signal processing/machine learning. The goal is to find a small set of features to help predict whether a patient is at high or low risk of being readmitted soon after being discharged from the hospital. The challenge is that the number of candidate features is huge but the number of data points is small. They do an L1-penalized logistic regression and then derive a threshold based on the cost of doing an intervention (e.g. house visits for high-risk patients).
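
As a cartoon of that pipeline (entirely synthetic data and made-up parameter values, not anything from the talk), the sparse feature selection plus thresholding might look like this in Python with scikit-learn:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: few "patients", many candidate features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))
y = (X[:, 0] - X[:, 3] + rng.normal(size=200) > 0).astype(int)  # "readmitted?"

# The L1 penalty drives most coefficients to exactly zero (feature selection).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selected = np.flatnonzero(clf.coef_)
print(len(selected), "features selected out of", X.shape[1])

# Flag patients whose predicted risk exceeds a threshold; in the talk the
# threshold came from the cost of an intervention, here it's just a number.
threshold = 0.3
high_risk = clf.predict_proba(X)[:, 1] > threshold
```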

Tracking climate models: advances in Climate Informatics
Claire Monteleoni, Columbia CCLS, Gavin A. Schmidt, NASA and Columbia, Shailesh Saroha, and Eva Asplund, Columbia Computer Science
This was an overview of Claire’s work on climate informatics. The basic problem was this: given several models (large-scale simulated systems based on PDEs etc. derived from physics) that predict future temperature, how should you combine them to produce more accurate predictions? She used some tools from her previous work on HMMs to get a system with better prediction accuracy.

On a question of Blackwell concerning hidden Markov chains
Ramon van Handel
The problem is trying to estimate the entropy rate of a process that is a function of a Markov chain (and hence not a Markov chain itself). “Does the conditional distribution of an ergodic hidden Markov chain possess a unique invariant measure?” This was a great talk for the Blackwell session because it started from a question posed by Blackwell and then revisited a few of his other works. Pretty amazing. Oh, and the paper (or one of them).

I think more talks will have to wait for another time (if ever).

Privacy and entropy (needs improvement)

A while ago, Alex Dimakis sent me an EFF article on information theory and privacy, which starts out with an observation of Latanya Sweeney’s that gender, ZIP code, and birthdate are uniquely identifying for a large portion of the population (an updated observation was made in 2006).

What’s weird is that the article veers into “how many bits do you need to uniquely identify someone” based on self-information or surprisal calculations. It paints a bit of a misleading picture about how to answer the question. I’d probably start by taking \log_2(6.625 \times 10^9) and then look at the variables in question.
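
For the record, here is that arithmetic in Python (the attribute counts are rough, and the uniformity/independence assumptions are exactly the kind of simplification that makes such estimates slippery):

```python
import math

world_pop = 6.625e9                 # the figure used above
print(math.log2(world_pop))         # about 32.6 bits to single out one person

# Back-of-the-envelope bits per attribute, assuming uniform and independent:
bits = math.log2(2) + math.log2(43000) + math.log2(365.25 * 80)
print(bits)  # gender + ~43,000 US ZIP codes + birthdate over an ~80-year span
```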

However, the mere existence of this article raises a point: here is a situation where ideas from information theory and probability/statistics can be made relevant to a larger population. It’s a great opportunity to popularize our field (and demonstrate good ways of thinking about it). Why not do it ourselves?

Rudolf Ahlswede (1938 – 2010)

At the end of last week I learned, much to my sadness, that Rudolf Ahlswede passed away in December. There will be some sort of commemoration at the ITA workshop. His Wikipedia entry has been edited, but I couldn’t find an obituary. It does have a rather dashing photograph of him in earlier years. I think the sideburns suited him.

Rudi Ahlswede was one of the pillars of Information Theory and had many seminal works in that field using tools from combinatorics, probability, and graph theory. I came to know his work through my dissertation research on arbitrarily varying channels (AVCs); he had written extensively on the AVC starting in the 1960s. Of particular note (to me) was his paper on the “elimination technique,” which is one of the first derandomization arguments I’ve seen in the literature. And of course he was part of the start of Network Coding. I met Ahlswede at the 2006 ISIT and then most recently in September at ITW in Dublin, where I was presenting a paper on AVCs in which the adversary has a noisy look at the transmitted codeword. He presented in the same session a paper on “channels with time structure,” about which I will make a separate post later. I just wanted to note with sadness the passing of such a giant.

Wiener on control versus learning

I’ve seen this quote excerpted in parts before, but not the whole thing:

I repeat, feedback is a method for controlling a system by reinserting into it the results of its past performance. If these results are merely used as numerical data for the criticism of the system and its regulation, we have the simple feedback of the control engineers. If, however, the information which proceeds backward from the performance is able to change the general method and pattern of performance, we have a process which may well be called learning.
– Norbert Wiener, The Human Use of Human Beings

It is a strange distinction Wiener is trying to make here. First, Wiener tries to make “numerical data” a simple special case, and equates control with the manipulation of numerical data. However, he doesn’t contrast numbers with something else (presumably non-numerical) which can “change the general method and pattern.” Taking it from the other direction, he implies that mere control engineering cannot accomplish “learning.” That is, from numerical data and “criticism of the system” we cannot change how the system works. By Wiener’s lights, pretty much all of the work in mathematical control and machine learning would be classified as control.

I am, of course, missing the context in which Wiener was writing. But I’m not sure what I’m missing. For example, at the time a “control engineer” may have been more of a regulator, so in the first case Wiener may be referring to putting a human in the loop. In the book he makes a distinction between data and algorithms (the “taping”) which has been fuzzed up by computer science. If that data/algorithm distinction is what draws the line between control and learning, and it no longer holds up, is there really a distinction between control and learning?

Linkage

More content-ful posts to come soon, I swear. I got sidetracked by job applications. ‘Tis the season, you know…

The REAL STORY of Alice and Bob. Classic investigative journalism (h/t Bikash Dey)

Tyler Perry’s For Colored Girls gets panned. I should mention that the play was done at MIT my first year there and I still remember it as one of the most affecting pieces of theater that I saw during my time there (and maybe after as well).

Scott McLemee on the poverty of the Rally to Restore Sanity.

Winners of the SD Asian Film Festival “Interpretations” contest. Like contentless scenes from your acting/directing class, but with film!

SAALT put out a report called From Macacas to Turban Toppers: The Rise in Xenophobic and Racist Rhetoric in American Political Discourse (PDF). Reading it is keeping me up past my bedtime. (via Sepia Mutiny).