I always end up bookmarking a bunch of papers from ArXiV and then looking at them a bit later than I want. So here are a few notes on some papers from the last month. I have a backlog of reading to catch up on, so I’ll probably split this into a couple of posts.

arXiv:1403.3465v1 [cs.LG]: Analysis Techniques for Adaptive Online Learning
H. Brendan McMahan
This is a nice survey on online learning/optimization algorithms that adapt to the data. These are all variants of the Follow-The-Regularized-Leader algorithms. The goal is to provide a more unified analysis of online algorithms where the regularization is data dependent. The intuition (as I see it) is that you’re doing a kind of online covariance estimation and then regularizing with respect to the distribution as you are learning it. Examples include the McMahan and Streeter (2010) paper and the Duchi et al. (2011) paper. Such adaptive regularizers also appear in dual averaging methods, where they are called “prox-functions.” This is a useful survey, especially if, like me, you’ve kind of checked in and out with the online learning literature and so may be missing the forest for the trees. Or is that the FoReL for the trees?
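
To make the "data-dependent regularization" idea a bit more concrete, here is a minimal sketch of a per-coordinate (diagonal) adaptive step-size update in the spirit of Duchi et al. (2011). It is not code from the survey: the squared loss, the function names, the badly scaled toy data, and the values of `eta` and `eps` are all assumptions chosen for illustration.

```python
import math
import random


def adagrad_least_squares(data, dim, eta=0.5, eps=1e-8):
    """Online least squares with per-coordinate adaptive step sizes.

    Illustrative sketch only: the loss, eta, and eps are assumptions.
    Returns the final weights and the loss suffered in each round.
    """
    w = [0.0] * dim
    grad_sq_sum = [0.0] * dim  # running sum of squared gradients, per coordinate
    losses = []
    for x, y in data:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        losses.append(0.5 * residual ** 2)  # loss before this round's update
        for i in range(dim):
            g = residual * x[i]  # gradient of 0.5 * residual^2 w.r.t. w[i]
            grad_sq_sum[i] += g * g
            # Coordinates with a history of large gradients get smaller steps.
            w[i] -= eta * g / (math.sqrt(grad_sq_sum[i]) + eps)
    return w, losses


def make_badly_scaled_data(n, seed=0):
    """Noiseless linear data whose second feature is 100x larger (hypothetical)."""
    rng = random.Random(seed)
    true_w = [2.0, -3.0]
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0), 100.0 * rng.gauss(0.0, 1.0)]
        y = sum(wi * xi for wi, xi in zip(true_w, x))
        data.append((x, y))
    return data


if __name__ == "__main__":
    w, losses = adagrad_least_squares(make_badly_scaled_data(5000), dim=2)
    print("weights:", w)
    print("mean loss, first 100 rounds:", sum(losses[:100]) / 100)
    print("mean loss, last 100 rounds:", sum(losses[-100:]) / 100)
```

The only data-dependent ingredient is `grad_sq_sum`, which plays the role of the learned (diagonal) geometry: each step moves a coordinate by at most `eta`, no matter how badly scaled the feature is.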

arXiv:1403.4011 [cs.IT]: Whose Opinion to follow in Multihypothesis Social Learning? A Large Deviation Perspective
Wee Peng Tay
This is a sort of learning from expert advice problem, though not in the setting that machine learners would consider it. The more control-oriented folks would recognize it as a multiple-hypothesis test. The model is that there is a single agent (agent $0$) and $K$ experts (agents $1, 2, \ldots, K$). The agent is trying to do an $M$-ary hypothesis test. The experts (and the agent) have access to local (private) observations $Y_k[1], Y_k[2], \ldots, Y_k[n_k]$ for $k \in \{0,1,2,\ldots,K\}$. The observations come from a family of distributions determined by the true hypothesis $m$. The agent $0$ needs to pick one of the $K$ experts to hire — the analogy is that you are an investor picking an analyst to hire. Each expert has its own local loss function $C_k$ which is a function of the amount of data it has as well as the true hypothesis and the decision it makes. This is supposed to model a “bias” for the expert — for example, they may not care to distinguish between two hypotheses. The rest of the paper looks at finding policies/decision rules for the agents that optimize the exponents with respect to their local loss functions, and then looking at how agent $0$ should act to incorporate that advice. This paper is a little out of my wheelhouse, but it seemed interesting enough to take a look at. In particular, it might be interesting to some readers out there.

arXiv:1403.3862 [math.OC]: Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
Ji Liu, Stephen J. Wright
This is another paper on lock-free optimization (cf. HOGWILD!). The key difference, as stated in the introduction, is that they “do not assume that the evaluation vector $\hat{x}$ is a version of $x$ that actually existed in the shared memory at some point in time.” What does this mean? It means that a local processor, when it reads the current state of the iterate, may be performing an update with respect to a point not on the sample path of the algorithm. They do assume that the delay between reading and updating the common state is bounded. To analyze this method they need to use a different analysis technique. The analysis is a bit involved and I’ll have to take a deeper look to understand it better, but from a bird’s-eye view this would make sense as long as the step size is chosen properly and the “hybrid” updates can be shown to be not too far from the original sample path. That’s the stochastic approximator in me talking though.
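
Here is a toy, single-threaded simulation of what an "inconsistent read" looks like. It is only a sketch of the idea, not the algorithm or the analysis from the paper: the quadratic objective, the diagonally dominant matrix, the delay bound `tau`, and the damping factor `omega` are all assumptions. Each coordinate of the evaluation point is drawn from a possibly different iterate among the last `tau + 1`, so the point used for the update may never have existed as a whole.

```python
import random
from collections import deque


def make_problem(n=10, diag=2.0, off=0.05):
    """A small diagonally dominant quadratic f(x) = 0.5 x'Ax - b'x (hypothetical)."""
    A = [[diag if i == j else off for j in range(n)] for i in range(n)]
    b = [1.0] * n
    return A, b


def async_coordinate_descent(A, b, tau=3, omega=0.2, iters=20000, seed=0):
    """Serial simulation of coordinate descent with bounded-delay, inconsistent reads."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    # Keep the last tau + 1 iterates (newest last) to model bounded staleness.
    history = deque([list(x)], maxlen=tau + 1)
    for _ in range(iters):
        # Inconsistent read: coordinate j is taken from a randomly chosen
        # recent iterate, independently for each j.
        x_hat = [history[-1 - rng.randrange(len(history))][j] for j in range(n)]
        i = rng.randrange(n)  # coordinate to update
        grad_i = sum(A[i][j] * x_hat[j] for j in range(n)) - b[i]
        x[i] -= omega * grad_i / A[i][i]  # damped coordinate step
        history.append(list(x))
    return x


if __name__ == "__main__":
    A, b = make_problem()
    x = async_coordinate_descent(A, b)
    exact = 1.0 / (2.0 + 0.05 * 9)  # by symmetry every coordinate equals this
    print("max error:", max(abs(xi - exact) for xi in x))
```

With a conservative step size and a short delay bound, the stale, mixed-up evaluation points stay close enough to the true iterates that the iteration still settles down, which is the bird's-eye intuition above.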

For those readers of the blog who have not submitted papers to machine learning (or related) conferences, the conference review process is a bit like a mini-version of a journal review. You (as the author) get the reviews back and have to write a response and then the reviewers discuss the paper and (possibly, but in my experience rarely) revise their reviews. However, they generally are supposed to take into account the response in the discussion. In some cases people even adjust their scores; when I’ve been a reviewer I often adjust my scores, especially if the author response addresses my questions.

This morning I had the singular experience of having a paper rejected from ICML 2014 in which all of the reviewers specifically marked that they did not read and consider the response. Based on the initial scores the paper was borderline, so the rejection is not surprising. However, we really did try to address their criticisms in our rebuttal. In particular, some misunderstood what our claims were. Had they bothered to read our response (and proposed edits), perhaps they would have realized this.

Highly selective (computer science) conferences often tout their reviews as being just as good as a journal, but in both outcomes and process, it’s a pretty ludicrous claim. I know this post may sound like sour grapes, but it’s not about the outcome, it’s about the process. Why bother with the facade of inviting authors to rebut if the reviewers are unwilling to read the response?

My friend Ranjit is working on this Crash Course in Psychology. Since I’ve never taken psychology, I am learning a lot!

Apparently the solution for lax editorial standards is to scrub away the evidence. (via Kevin Chen).

Some thoughts on high performance computing vs. Map Reduce. I think about this a fair bit, since some of my colleagues work on HPC, which feels like a different beast than a lot of the problems I’ve been thinking about.

A nice behind-the-scenes on Co-Op Sauce, a staple at Chicagoland farmers’ markets.

Due to weather issues, I was unable to make it on time to ITA to give my talk, which is based on an ArXiV preprint with Francesco Orabona, Tamir Hazan, and Tommi Jaakkola. The full work will be presented at ICML 2014 this summer. I decided to give the talk anyway and upload it to YouTube (warning: single take, much stammering):

I have a Mac, so I used the screencast recording features of QuickTime Player, as recommended to me by Manu Sridharan. Worked like a charm.

I plan to post a bit more about this problem later (I know, promises, promises), but in the meantime, this talk is mostly background about the MAP perturbation framework.

I started working this fall on an interesting problem (shameless plug!) with Francesco Orabona, Tamir Hazan, and Tommi Jaakkola. What we do there is basically a measure concentration result, but I wanted to write a bit about the motivation for the paper. It’s nicely on the edge of that systems EE / CS divide, so I thought it might be a nice topic for the blog. One name for this idea is “MAP perturbations” so the first thing to do is explain what that means. The basic idea is to take a posterior probability distribution (derived from observed data) and do a random perturbation of the probabilities, and then take the maximum of that perturbed distribution. Sometimes this is called “perturb-and-MAP” but as someone put it, that sounds a bit like “hit-and-run.”

The basic problem is to sample from a particular joint distribution on $n$ variables. For simplicity, let’s consider an $n$-bit vector $\mathbf{x} \in \{0,1\}^n$. There are $2^n$ possible values, so explicitly maximizing the distribution could be computationally tricky. However, we are often aided by the fact that the probability model has some structure. For example, the $n$ bits may be identified with labels {foreground,background} and correspond to $n$ pixels in an image, and there may be some local constraints which make it more likely for adjacent pixels to have the same label. In general, these local constraints get lumped together into a potential function $\theta(\mathbf{x})$ which assigns a score to each $\mathbf{x}$. The distribution on $\mathbf{x}$ is a Gibbs distribution:

$p(\mathbf{x}) = \frac{1}{Z} \exp(\theta( \mathbf{x} ))$

where the normalizing constant $Z$ is the infamous partition function:

$Z = \sum_{\mathbf{x}} \exp(\theta(\mathbf{x}))$

It’s infamous because it’s often hard to compute explicitly. This also makes sampling from the Gibbs distribution hard.

The MAP rule chooses the $\mathbf{x}$ that maximizes this distribution. Doing this means you don’t need to calculate $Z$ since you can maximize the potential function instead:

$\mathbf{X}_{\mathsf{MAP}} = \mathrm{argmax} \left\{ \theta(\mathbf{x}) \right\}$.

This isn’t any easier in general, computationally, but people have put lots of blood, sweat, and tears into creating MAP solvers that use tricks to do this maximization. In some cases, these solvers work pretty well. Our goal will be to use the good solvers as a black box.

Unfortunately, in a lot of applications, we really would like to sample from the Gibbs distribution, because the number-one best configuration $\mathbf{x}$ may not be the only “good” one. In fact, there may be many almost-maxima, and sampling will let you produce a list of those. One way to do this is via Markov-chain Monte Carlo (MCMC), but the problem with all MCMC methods is you have to know how long to run the chain.

The MAP perturbation approach is different — it adds a random function $\gamma(\mathbf{x})$ to the potential function and solves the resulting MAP problem:

$\mathbf{X}_{\mathsf{R-MAP}} = \mathrm{argmax} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\}$

The random function $\gamma(\cdot)$ associates a random variable to each $\mathbf{x}$. The simplest approach to designing a perturbation function is to associate an independent and identically distributed (i.i.d.) random variable $\gamma(\mathbf{x})$ with each $\mathbf{x}$. We can find the distribution of the randomized MAP predictor when each $\gamma(\mathbf{x})$ is a Gumbel random variable with zero mean, variance $\pi^2/6$, and cumulative distribution function

$G( y ) = \exp( - \exp( - (y + c)))$,

where $c \approx 0.5772$ is the Euler-Mascheroni constant.

So what’s the distribution of the output of the randomized predictor $\mathbf{X}_{\mathsf{R-MAP}}$? It turns out that the distribution is exactly that of the Gibbs distribution we want to sample from:

$\mathbb{P}_{\gamma}\left( \mathbf{X}_{\mathsf{R-MAP}} = \mathbf{x} \right) = p(\mathbf{x})$

and the expected value of the maximal value is the log of the partition function:

$\mathbb{E}_{\gamma}\left[ \max_{\mathbf{x}} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\} \right] = \log Z$

This follows from properties of the Gumbel distribution.
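
Both identities are easy to check numerically on a tiny example where we can afford to enumerate all $2^n$ states. The sketch below is just a sanity check, not anything from the paper: the toy potential `theta` (which rewards adjacent bits that agree), the brute-force "MAP solver", and the sample size are assumptions for illustration.

```python
import math
import random
from itertools import product

EULER_GAMMA = 0.5772156649015329


def zero_mean_gumbel(rng):
    """Draw a Gumbel variable with CDF exp(-exp(-(y + c))), c = Euler-Mascheroni."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u)) - EULER_GAMMA


def theta(x):
    """Toy potential (hypothetical): one point for each pair of agreeing neighbours."""
    return sum(1.0 for a, b in zip(x, x[1:]) if a == b)


def perturb_and_map(states, scores, rng):
    """Brute-force argmax of theta(x) + gamma(x) with a fresh gamma for every x."""
    best_value, best_state = -math.inf, None
    for state, score in zip(states, scores):
        value = score + zero_mean_gumbel(rng)
        if value > best_value:
            best_value, best_state = value, state
    return best_state, best_value


def run(n_bits=3, n_samples=20000, seed=0):
    rng = random.Random(seed)
    states = list(product([0, 1], repeat=n_bits))
    scores = [theta(x) for x in states]
    log_z = math.log(sum(math.exp(s) for s in scores))
    gibbs = {x: math.exp(s - log_z) for x, s in zip(states, scores)}
    counts = dict.fromkeys(states, 0)
    total_max = 0.0
    for _ in range(n_samples):
        state, value = perturb_and_map(states, scores, rng)
        counts[state] += 1
        total_max += value
    empirical = {x: c / n_samples for x, c in counts.items()}
    return gibbs, empirical, log_z, total_max / n_samples


if __name__ == "__main__":
    gibbs, empirical, log_z, mean_max = run()
    for x in gibbs:
        print(x, round(gibbs[x], 3), round(empirical[x], 3))
    print("log Z:", log_z, "average perturbed maximum:", mean_max)
```

Note that the inner loop draws one Gumbel variable per state, which is only feasible here because $n$ is tiny.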

Great – we’re done. We can just generate the $\gamma$ random variables, shove the perturbed potential function into our black-box MAP solver, and voilà: samples from the Gibbs distribution. Except that there’s a little problem here – we have to generate $2^n$ Gumbel variables, one for each $\mathbf{x}$, so we’re still a bit sunk.

The trick is to come up with lower-complexity perturbations (something that Tamir and Tommi have been working on for a bit, among others), but I will leave that for another post…

I recently saw that Andrew Gelman hasn’t really heard of compressed sensing. As someone in the signal processing/machine learning/information theory crowd, it’s a little flabbergasting, but I think it highlights two things that aren’t really appreciated by the systems EE/algorithms crowd: 1) statistics is a pretty big field, and 2) the gulf between much statistical practice and what is being done in SP/ML research is pretty wide.

The other aspect of this is a comment from one of his readers:

Meh. They proved L1 approximates L0 when design matrix is basically full rank. Now all sparsity stuff is sometimes called ‘compressed sensing’. Most of it seems to be linear interpolation, rebranded.

I find such dismissals disheartening: there is a temptation to say that every time another community picks up some models/tools from your community that they are reinventing the wheel. As a short-hand, it can be useful to say “oh yeah, this compressed sensing stuff is like the old sparsity stuff.” However, as a dismissal it’s just being parochial; you have to actually engage with the use of those models/tools. Gelman says it can lead to “better understanding one’s assumptions and goals,” but I think it’s more important to understand others’ goals.

I could characterize rate-distortion theory as just calculating some large deviations rate functions. Dembo and Zeitouni list RD as an application of the LDP, but I don’t think they mean “meh, it’s rebranded LDP.” For compressed sensing, the goal is to do the inference in a computationally and statistically efficient way. One key ingredient is optimization. If you just dismiss all of compressed sensing as “rebranded sparsity” you’re missing the point entirely.

When making an editing pass over a bibliography today, I noticed that the citation for the UC Irvine Machine Learning Repository has changed. It used to be
@misc{Asuncion+Newman:2007,
  author = "A. Asuncion and D.H. Newman",
  year = "2007",
  title = "{UCI} Machine Learning Repository",
  url = "http://archive.ics.uci.edu/ml",
  institution = "University of California, Irvine, School of Information and Computer Sciences"
}
But now it’s this:
@misc{Bache+Lichman:2013,
  author = "K. Bache and M. Lichman",
  year = "2013",
  title = "{UCI} Machine Learning Repository",
  url = "http://archive.ics.uci.edu/ml",
  institution = "University of California, Irvine, School of Information and Computer Sciences"
}
Also, the KDD repository has been merged in with the main repository, so the above is now the citation for both.

Update your BibTeX accordingly! (You too Kunal, but I bet you don’t cite this repo that much).

Here are my much-belated post-ISIT notes. I didn’t do as good a job of taking notes this year, so my points may be a bit cursory. Also, the offer for guest posts is still open! On a related note the slides from the plenary lectures are now available on Dropbox, and are also linked to from the ISIT website.

From compression to compressed sensing
Shirin Jalali (New York University, USA); Arian Maleki (Rice University, USA)
The title says it, mostly. Both data compression and compressed sensing use special structure in the signal to achieve a reduction in storage, but while all signals can be compressed (in a sense), not all signals can be compressively sensed. Can one get a characterization (with an algorithm) that can take a lossy source code/compression method, and use it to recover a signal via compressed sensing? They propose an algorithm called compressible signal pursuit to do that. The full version of the paper is on ArXiV.

Dynamic Joint Source-Channel Coding with Feedback
Tara Javidi (UCSD, USA); Andrea Goldsmith (Stanford University, USA)
This is a JSCC problem with a Markov source, which can be used to model a large range of problems, including some sequential search and learning problems (hence the importance of feedback). The main idea is to map the problem into a partially-observable Markov decision problem (POMDP) and exploit the structure of the resulting dynamic program. They get some structural properties of the solution (e.g. what are the sufficient statistics), but there are a lot of interesting further questions to investigate. I usually have a hard time seeing the difference between finite and infinite horizon formulations, but here the difference was somehow easier for me to understand; in the infinite horizon case, however, the solution is somewhat difficult to compute.

Unsupervised Learning and Universal Communication
Vinith Misra (Stanford University, USA); Tsachy Weissman (Stanford University, USA)
This paper was about universal decoding, sort of. The idea is that the decoder doesn’t know the codebook but it knows the encoder is using a random block code. However, it doesn’t even know the rate. The question is really what can one say in this setting? For example, symmetry dictates that the actual message label will be impossible to determine, so the error criterion has to be adjusted accordingly. The decoding strategy that they propose is a partition of the output space (or “clustering”) followed by a labeling. They claim this is a model for clustering through an information theoretic lens, but since the number of clusters is exponential in the dimension of the space, I think that it’s perhaps more of a special case of clustering. A key concept in their development is something they call the minimum partition information, which takes the place of the maximum mutual information (MMI) used in universal decoding (cf. Csiszár and Körner).

Farzin Haddadpour (Sharif University of Technology, Iran); Mahdi Jafari Siavoshani (The Chinese University of Hong Kong, Hong Kong); Mayank Bakshi (The Chinese University of Hong Kong, Hong Kong); Sidharth Jaggi (Chinese University of Hong Kong, Hong Kong)
Of course I had to go to this paper, since it was on AVCs. The main result is that if one considers maximal error but allows only the encoder to randomize, then one can achieve the same rates over the Gaussian AVC as one can with average error and no randomization. That is, allowing encoder randomization lets one move from average error to max error. An analogous result for discrete channels is in a classic paper by Csiszár and Narayan, and this is the Gaussian analogue. The proof uses a similar quantization/epsilon-net plus union bound that I used in my first ISIT paper (also on Gaussian AVCs, and finally on ArXiV), but it seems that the amount of encoder randomization needed here is more than the amount of common randomness used in my paper.

Coding with Encoding Uncertainty
Jad Hachem (University of California, Los Angeles, USA); I-Hsiang Wang (EPFL, Switzerland); Christina Fragouli (EPFL, Switzerland); Suhas Diggavi (University of California Los Angeles, USA)
This paper was on graph-based codes where the encoder makes errors, but the channel is ideal and the decoder makes no errors. That is, given a generator matrix $G$ for a code, the encoder wiring could be messed up and bits could be flipped or erased when parities are being computed. The resulting error model can’t just be folded into the channel. Furthermore, a small amount of error in the encoder (in just the right place) could be catastrophic. They focus just on edge erasures in this problem and derive a new distance metric between codewords that helps them characterize the maximum number of erasures that an encoder can tolerate. They also look at a random erasure model.

I saw this paper on ArXiV a while back and figured it would be a fun read, and it was. Post-ISIT blogging may have to wait for another day or two.

Finding a most biased coin with fewest flips
Karthekeyan Chandrasekaran, Richard Karp
arXiv:1202.3639 [cs.DS]

The setup of the problem is that you have $n$ coins with biases $\{p_i : i \in [n]\}$. For some given $p \in [\epsilon,1-\epsilon]$ and $\epsilon \in (0,1/2)$, each coin is “heavy” ($p_i = p + \epsilon$) with probability $\alpha$ and “light” ($p_i = p - \epsilon$) with probability $1 - \alpha$. The goal is to use a sequential flipping strategy to find a heavy coin with probability at least $1 - \delta$.

Any such procedure has three components, really. First, you have to keep track of some statistics for each coin $i$. On the basis of that, you need a rule to pick which coin to flip. Finally, you need a stopping criterion.

The algorithm they propose is a simple likelihood-based scheme. If I have flipped a particular coin $i$ a bunch of times and gotten $h_i$ heads and $t_i$ tails, then the likelihood ratio is
$L_i = \left( \frac{p+\epsilon}{p - \epsilon} \right)^{h_i} \left( \frac{ 1 - p - \epsilon }{ 1 -p + \epsilon} \right)^{t_i}$
So what the algorithm does is keep track of these likelihoods for the coins that it has flipped so far. But what coin to pick? It is greedy and chooses a coin $i$ which has the largest likelihood $L_i$ so far (breaking ties arbitrarily).

Note that up to now the prior probability $\alpha$ of a coin being heavy has not been used at all, nor has the failure probability $\delta$. These appear in the stopping criterion. The algorithm keeps flipping coins until there exists at least one $i$ for which
$L_i \ge \frac{1 - \alpha}{\alpha} \cdot \frac{ 1 - \delta }{\delta}$
It then outputs the coin with the largest likelihood. It’s a pretty quick calculation to see that given $(h_i, t_i)$ heads and tails for a coin $i$,
$\mathbb{P}(\mathrm{coin\ }i\mathrm{\ is\ heavy}) = \frac{\alpha L_i}{ \alpha L_i + (1 - \alpha) }$,
from which the threshold condition follows.
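
To spell out that quick calculation (this is just Bayes’ rule and algebra in the notation above, not something taken from the paper): writing $P_H = (p+\epsilon)^{h_i} (1 - p - \epsilon)^{t_i}$ and $P_L = (p-\epsilon)^{h_i} (1 - p + \epsilon)^{t_i}$ for the probability of the observed flips under the heavy and light hypotheses, we have $L_i = P_H / P_L$ and
$\mathbb{P}(\mathrm{coin\ }i\mathrm{\ is\ heavy}) = \frac{\alpha P_H}{\alpha P_H + (1 - \alpha) P_L} = \frac{\alpha L_i}{\alpha L_i + (1 - \alpha)}$.
Asking for this posterior to be at least $1 - \delta$ gives
$\alpha L_i \ge (1 - \delta) \left( \alpha L_i + (1 - \alpha) \right) \iff \delta \alpha L_i \ge (1 - \delta)(1 - \alpha) \iff L_i \ge \frac{1 - \alpha}{\alpha} \cdot \frac{1 - \delta}{\delta}$,
which is exactly the stopping rule.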

This is a simple-sounding procedure, but to analyze it they make a connection to something called a “multitoken Markov game” which models the corresponding multi-armed bandit problem. What they show is that for the simpler case given by this problem, the corresponding algorithm is, in fact, optimal in the sense that it makes the minimum expected number of flips:
$\frac{16}{\epsilon^2} \left( \frac{1 - \alpha}{\alpha} + \log\left( \frac{(1 -\alpha)(1 - \delta)}{\alpha \delta} \right) \right)$
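
The likelihood-based procedure described above is short enough to simulate. What follows is a rough sketch under my own assumptions, not the authors’ code: it uses a finite pool of coins, works with log-likelihood ratios to avoid overflow, assumes $p - \epsilon > 0$ and $p + \epsilon < 1$ strictly, and adds a hypothetical `max_flips` cap so that it always terminates. The parameter values are arbitrary.

```python
import math
import random


def find_heavy_coin(biases, p, eps, alpha, delta, rng, max_flips=100_000):
    """Greedy likelihood-ratio search. Returns (chosen coin index, flips used)."""
    up = math.log((p + eps) / (p - eps))            # log-LR change on heads
    down = math.log((1 - p - eps) / (1 - p + eps))  # log-LR change on tails (< 0)
    threshold = math.log((1 - alpha) / alpha * (1 - delta) / delta)
    log_lr = [0.0] * len(biases)  # every coin starts with L_i = 1
    flips = 0
    while flips < max_flips:
        # Greedy choice: the coin with the largest likelihood ratio so far
        # (ties go to the lowest index).
        i = max(range(len(biases)), key=log_lr.__getitem__)
        if log_lr[i] >= threshold:
            return i, flips
        log_lr[i] += up if rng.random() < biases[i] else down
        flips += 1
    # Safety cap reached: fall back to the current best coin.
    return max(range(len(biases)), key=log_lr.__getitem__), flips


def success_rate(trials=200, n_coins=30, p=0.5, eps=0.1,
                 alpha=0.3, delta=0.05, seed=0):
    """Fraction of trials in which the returned coin really is heavy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Each coin is heavy with probability alpha, light otherwise.
        biases = [p + eps if rng.random() < alpha else p - eps
                  for _ in range(n_coins)]
        chosen, _ = find_heavy_coin(biases, p, eps, alpha, delta, rng)
        wins += biases[chosen] > p
    return wins / trials


if __name__ == "__main__":
    print("empirical success rate:", success_rate())
```

The prior $\alpha$ and the target $\delta$ only enter through `threshold`, matching the observation that they are not needed for choosing which coin to flip.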

The interesting thing here is that the prior distribution on the heavy/lightness plays a pretty crucial role in designing the algorithm. Part of the explore-exploit tradeoff in bandit problems is the issue of hedging against uncertainty in the distribution of payoffs; if instead you have a good handle on what to expect in terms of how the payoffs of the arms should vary, you get a much more tractable problem.

I’m still catching up on my backlog of reading everything, but I’ve decided to set some time aside to take a look at a few papers from ArXiV.

• Lecture Notes on Free Probability by Vladislav Kargin, which is 100 pages of notes from a course at Stanford. Pretty self-explanatory, except for the part where I don’t really know free probability. Maybe reading these will help.
• Capturing the Drunk Robber on a Graph by Natasha Komarov and Peter Winkler. This is on a simple pursuit-evasion game in which the robber (evader) is moving according to a random walk. On a graph with $n$ vertices:

the drunk will be caught with probability one, even by a cop who oscillates on an edge, or moves about randomly; indeed, by any cop who isn’t actively trying to lose. The only issue is: how long does it take? The lazy cop will win in expected time at most $4 n^3/27$ (plus lower-order terms), since that is the maximum possible expected hitting time for a random walk on an n-vertex graph [2]; the same bound applies to the random cop [4]. It is easy to see that the greedy cop who merely moves toward the drunk at every step can achieve $O(n^2)$; in fact, we will show that the greedy cop cannot in general do better. Our smart cop, however, gets her man in expected time $n + o(n)$.

How do you make a smarter cop? In this model the cop can tell where the robber is but has to get there by walking along the graph. Strategies which try to constantly “retarget” are wasteful, so they propose a strategy wherein the cop periodically retargets to eventually meet the robber. I feel like there is a prediction/learning algorithm or idea embedded in here as well.

• Normalized online learning by Stephane Ross, Paul Mineiro, John Langford. Normalization and data pre-processing are the source of many errors and frustrations in machine learning practice. When features are not normalized with respect to each other, procedures like gradient descent can behave poorly. This paper looks at dealing with data normalization in the algorithm itself, making it “unit free” in a sense. It’s the same kind of weights-update rule that we see in online learning but with a few lines changed. They do an adversarial analysis of the algorithm where the adversary gets to scale the features before the learning algorithm gets the data point. In particular, the adversary gets to choose the covariance of the data.
• On the Optimality of Treating Interference as Noise, by Chunhua Geng, Navid Naderializadeh, A. Salman Avestimehr, and Syed A. Jafar. Suppose I have a $K$-user interference channel with gains $\alpha_{ij}$ between transmitter $i$ and receiver $j$. Then if
$\alpha_{ii} \ge \max_{j \ne i} \alpha_{ij} + \max_{k \ne i} \alpha_{ki}$
then treating interference as noise is optimal in terms of generalized degrees of freedom. I don’t really work on this kind of thing, but it’s so appealing from a sense of symmetry.
• Online Learning under Delayed Feedback, by Pooria Joulani, András György, Csaba Szepesvári. This paper is on forecasting algorithms which receive the feedback (e.g. the error) with a delay. Since I’ve been interested in communication with delayed feedback, this seems like a natural learning analogue. They provide ways of modifying existing algorithms to work with delayed feedback. One such method is to run a bunch of predictors in parallel and update them as the feedback is returned. They also propose methods which use partial monitoring and an approach to UCB for bandit problems in the delayed feedback setting.
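
The "bunch of predictors in parallel" idea from the last paper can be sketched in a few lines. This is my own toy rendering of the general idea, not the construction from the paper: the base learner (`RunningMean`), the constant delay, and all names are hypothetical. Each round uses a learner that is not waiting on feedback, creating a new one if every existing learner is busy, and a learner is updated (and freed) once its feedback arrives.

```python
from collections import deque


class RunningMean:
    """Toy base learner (hypothetical): predicts the mean of the outcomes it has seen."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def predict(self):
        return self.total / self.count if self.count else 0.0

    def update(self, outcome):
        self.total += outcome
        self.count += 1


def predict_with_delay(outcomes, delay):
    """Feedback for round t becomes available just before round t + delay + 1.

    Returns the list of predictions and the number of base learners created.
    """
    free = deque()     # learners that are not waiting for feedback
    pending = deque()  # (arrival round, learner, outcome), in arrival order
    n_learners = 0
    predictions = []
    for t, outcome in enumerate(outcomes):
        # Deliver all feedback that has arrived by now and free those learners.
        while pending and pending[0][0] <= t:
            _, learner, feedback = pending.popleft()
            learner.update(feedback)
            free.append(learner)
        if free:
            learner = free.popleft()
        else:
            learner = RunningMean()  # everyone is busy: spawn a new learner
            n_learners += 1
        predictions.append(learner.predict())
        pending.append((t + delay + 1, learner, outcome))
    return predictions, n_learners


if __name__ == "__main__":
    preds, n = predict_with_delay([1.0] * 10, delay=2)
    print("learners used:", n)
    print("predictions:", preds)
```

With a constant delay of $d$ rounds this ends up cycling through $d + 1$ learners, each of which sees an ordinary (undelayed) online problem on its own subsequence of rounds.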