ITA 2014 Talk: MAP Perturbations and Measure Concentration

Due to weather issues, I was unable to make it to ITA in time to give my talk, which is based on an arXiv preprint with Francesco Orabona, Tamir Hazan, and Tommi Jaakkola. The full work will be presented at ICML 2014 this summer. I decided to give the talk anyway and upload it to YouTube (warning: single take, much stammering):

I have a Mac, so I used the screencast recording features of QuickTime Player, as recommended to me by Manu Sridharan. Worked like a charm.

I plan to post a bit more about this problem later (I know, promises, promises), but in the meantime, this talk is mostly background about the MAP perturbation framework.


the MAP perturbation framework

I started working this fall on an interesting problem (shameless plug!) with Francesco Orabona, Tamir Hazan, and Tommi Jaakkola. What we do there is basically a measure concentration result, but I wanted to write a bit about the motivation for the paper. It’s nicely on the edge of that systems EE / CS divide, so I thought it might be a nice topic for the blog. One name for this idea is “MAP perturbations” so the first thing to do is explain what that means. The basic idea is to take a posterior probability distribution (derived from observed data) and do a random perturbation of the probabilities, and then take the maximum of that perturbed distribution. Sometimes this is called “perturb-and-MAP” but as someone put it, that sounds a bit like “hit-and-run.”

The basic problem is to sample from a particular joint distribution on n variables. For simplicity, let’s consider an n-bit vector \mathbf{x} \in \{0,1\}^n. There are 2^n possible values, so explicitly maximizing the distribution could be computationally tricky. However, we are often aided by the fact that the probability model has some structure. For example, the n bits may be identified with labels {foreground,background} and correspond to n pixels in an image, and there may be some local constraints which make it more likely for adjacent pixels to have the same label. In general, these local constraints get lumped together into a potential function \theta(\mathbf{x}) which assigns a score to each \mathbf{x}. The distribution on \mathbf{x} is a Gibbs distribution:

p(\mathbf{x}) = \frac{1}{Z} \exp(\theta( \mathbf{x} ))

where the normalizing constant Z is the infamous partition function:

Z = \sum_{\mathbf{x}} \exp(\theta(\mathbf{x}))

It’s infamous because it’s often hard to compute explicitly. This also makes sampling from the Gibbs distribution hard.
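To make the cost concrete, here is a minimal brute-force sketch for a toy model. The potential `chain_potential` (a chain that rewards adjacent bits for agreeing) and its `coupling` parameter are made up for illustration and are not from the paper; the point is only that computing Z this way touches all 2^n configurations.

```python
import itertools
import math


def chain_potential(x, coupling=1.0):
    """Hypothetical potential: add `coupling` for each adjacent pair of equal bits."""
    return coupling * sum(1 for a, b in zip(x, x[1:]) if a == b)


def gibbs_distribution(theta, n):
    """Enumerate all 2^n bit vectors and return (Z, {x: p(x)})."""
    states = list(itertools.product((0, 1), repeat=n))
    weights = {x: math.exp(theta(x)) for x in states}
    Z = sum(weights.values())  # the partition function, a sum over 2^n terms
    return Z, {x: w / Z for x, w in weights.items()}


n = 4  # small enough to enumerate; the loop doubles in size with every extra bit
Z, p = gibbs_distribution(chain_potential, n)
print(f"number of states: {len(p)}, Z = {Z:.4f}")
```

This is fine for n = 4, but the same enumeration at a few hundred pixels is hopeless, which is why structure in \theta matters.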

The MAP rule chooses the \mathbf{x} that maximizes this distribution. Doing this means you don’t need to calculate Z: it does not depend on \mathbf{x} and the exponential is increasing, so you can maximize the potential function instead:

\mathbf{X}_{\mathsf{MAP}} =  \mathrm{argmax} \left\{ \theta(\mathbf{x}) \right\}.

This isn’t any easier in general, computationally, but people have put lots of blood, sweat, and tears into creating MAP solvers that use tricks to do this maximization. In some cases, these solvers work pretty well. Our goal will be to use the good solvers as a black box.

Unfortunately, in a lot of applications, we really would like to sample from the Gibbs distribution, because the number-one best configuration \mathbf{x} may not be the only “good” one. In fact, there may be many almost-maxima, and sampling will let you produce a list of those. One way to do this is via Markov-chain Monte Carlo (MCMC), but the problem with all MCMC methods is you have to know how long to run the chain.

The MAP perturbation approach is different — it adds a random function \gamma(\mathbf{x}) to the potential function and solves the resulting MAP problem:

\mathbf{X}_{\mathsf{R-MAP}} =  \mathrm{argmax} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\}

The random function \gamma(\cdot) associates a random variable with each \mathbf{x}. The simplest way to design a perturbation function is to draw an independent and identically distributed (i.i.d.) random variable \gamma(\mathbf{x}) for each \mathbf{x}. We can find the distribution of the randomized MAP predictor when each \gamma(\mathbf{x}) is a Gumbel random variable with zero mean, variance \pi^2/6, and cumulative distribution function

G( y ) = \exp( - \exp( - (y + c))),

where c \approx 0.5772 is the Euler-Mascheroni constant.
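If you want to play with this, one way to draw such a variable is to invert the CDF: solving G(y) = u for a uniform u gives y = -\log(-\log u) - c. The sketch below does that with the standard library only; the function name `sample_zero_mean_gumbel`, the seed, and the sample size are my own choices for illustration.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # the constant c in G(y)


def sample_zero_mean_gumbel(rng):
    """Inverse-CDF draw from G(y) = exp(-exp(-(y + c)))."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, and log(0) is undefined
        u = rng.random()
    return -math.log(-math.log(u)) - EULER_GAMMA


rng = random.Random(0)
samples = [sample_zero_mean_gumbel(rng) for _ in range(200_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"empirical mean = {mean:.4f} (target 0)")
print(f"empirical variance = {var:.4f} (target pi^2/6 = {math.pi ** 2 / 6:.4f})")
```

The empirical mean and variance should land close to 0 and \pi^2/6, up to Monte Carlo error.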

So what’s the distribution of the output of the randomized predictor \mathbf{X}_{\mathsf{R-MAP}}? It turns out that the distribution is exactly that of the Gibbs distribution we want to sample from:

\mathbb{P}_{\gamma}\left( \mathbf{X}_{\mathsf{R-MAP}} = \mathbf{x} \right) = p(\mathbf{x})

and the expected value of the maximal value is the log of the partition function:

\mathbb{E}_{\gamma}\left[ \max_{\mathbf{x}} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\} \right] = \log Z

This follows from properties of the Gumbel distribution.
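For the curious, here is a sketch of the standard argument (the density g = G' and the dummy variable t are notation I am introducing here). Since the \gamma(\mathbf{x}) are independent, the CDF of the perturbed maximum is a product:

\mathbb{P}_{\gamma}\left( \max_{\mathbf{x}} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\} \le t \right) = \prod_{\mathbf{x}} G(t - \theta(\mathbf{x})) = \exp\left( - \sum_{\mathbf{x}} \exp(\theta(\mathbf{x})) \exp(-(t + c)) \right) = G(t - \log Z)

So the maximum is distributed like \log Z plus a zero-mean Gumbel variable, which gives the expectation. For the argmax, note that g(y) = \exp(-(y + c)) G(y), so

\mathbb{P}_{\gamma}\left( \mathbf{X}_{\mathsf{R-MAP}} = \mathbf{x} \right) = \int g(t - \theta(\mathbf{x})) \prod_{\mathbf{x}' \ne \mathbf{x}} G(t - \theta(\mathbf{x}')) \, dt = \frac{\exp(\theta(\mathbf{x}))}{Z} \int g(t - \log Z) \, dt = p(\mathbf{x})

where the middle step uses the product formula above and the last integral is 1 because g(\cdot - \log Z) is a density.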

Great – we’re done. We can just generate the \gamma random variables, shove the perturbed potential function into our black-box MAP solver, and voilà: samples from the Gibbs distribution. Except that there’s a little problem here – we have to generate 2^n Gumbel variables, one for each \mathbf{x}, so we’re still a bit sunk.
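On a toy model where we can afford all 2^n draws, both claims are easy to check numerically. The sketch below is only an illustration under my own assumptions: `chain_potential` is a made-up potential, the "MAP solver" is plain brute force rather than a real black-box solver, and the seed and trial count are arbitrary.

```python
import itertools
import math
import random

EULER_GAMMA = 0.5772156649015329


def chain_potential(x, coupling=1.0):
    """Hypothetical potential: add `coupling` for each adjacent pair of equal bits."""
    return coupling * sum(1 for a, b in zip(x, x[1:]) if a == b)


def gumbel(rng):
    """Zero-mean Gumbel draw via the inverse CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u)) - EULER_GAMMA


def perturb_and_map(states, theta_values, rng):
    """Draw one Gumbel per state (2^n draws), then brute-force the perturbed argmax."""
    best_state, best_value = None, -math.inf
    for x, t in zip(states, theta_values):
        v = t + gumbel(rng)
        if v > best_value:
            best_state, best_value = x, v
    return best_state, best_value


n = 3
states = list(itertools.product((0, 1), repeat=n))
theta_values = [chain_potential(x) for x in states]
Z = sum(math.exp(t) for t in theta_values)

rng = random.Random(0)
trials = 50_000
counts = {x: 0 for x in states}
max_sum = 0.0
for _ in range(trials):
    x, v = perturb_and_map(states, theta_values, rng)
    counts[x] += 1
    max_sum += v

# Largest gap between empirical argmax frequencies and the Gibbs probabilities.
max_gap = max(
    abs(counts[x] / trials - math.exp(t) / Z) for x, t in zip(states, theta_values)
)
log_z_estimate = max_sum / trials
print(f"largest frequency gap: {max_gap:.4f}")
print(f"mean perturbed max = {log_z_estimate:.4f}, log Z = {math.log(Z):.4f}")
```

The inner loop in `perturb_and_map` is exactly the part that does not scale: it needs a fresh Gumbel draw for every one of the 2^n configurations.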

The trick is to come up with a lower-complexity perturbation (something that Tamir and Tommi have been working on for a bit, among others), but I will leave that for another post…