Annina Bracher (ETH Zurich, Switzerland); Amos Lapidoth (ETHZ, Switzerland)
The title pretty much describes it — there are two receivers which are both looking out for a particular message. This is the identification problem, in which the receiver only cares about a particular message (but we don’t know which one) and we have to design a code such that they can detect the message. The number of messages is $2^{2^{nC}}$ where $C$ is the Shannon capacity of the DMC. In the broadcast setting we run into the problem that the errors for the two receivers are entangled. However, their message sets are disjoint. The way out is to look at the average error for each (averaged over the other user’s message). The main result is that the rates only depend on the conditional marginals, and they have a strong converse.

Efficient compression of monotone and m-modal distributions
Jayadev Acharya (University of California, San Diego, USA); Ashkan Jafarpour (University of California, San Diego, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
A monotone distribution is a distribution on $\mathbb{N}$ such that the probabilities are non-increasing. The redundancy for this class is infinite, alas, so they restrict the support to size $k$ (where $k$ can be large). They propose a two-step compression scheme in which the first step is to approximate the true distribution with a piecewise constant step distribution, and then use a compression scheme for step distributions.

Writing on a Dirty Paper in the Presence of Jamming
Amitalok J Budkuley (Indian Institute of Technology, Bombay, India); Bikash K Dey (Indian Institute of Technology Bombay, India); Vinod M Prabhakaran (Tata Institute of Fundamental Research, India)
Ahh, jamming. A topic near and dear to my heart. This paper takes a game-theoretic approach to jamming in a DPC setup: “the capacity of the channel in the presence of the jammer is the unique Nash equilibrium utility of the zero sum communication game between the user and the jammer.” This is a mutual information game, and they show that i.i.d. Gaussian jamming and dirty paper coding are a Nash equilibrium. I looked at an AVC version of this problem in my thesis, and the structure is quite a bit different, so this was an interesting different take on the same problem — how can we use the state information to render adversarial interference as harmless as noise?

Stable Grassmann Manifold Embedding via Gaussian Random Matrices
Hailong Shi (Tsinghua University & Department of Electronic Engineering, P.R. China); Hao Zhang (TsinghuaUniversity, P.R. China); Gang Li (Tsinghua University, P.R. China); Xiqin Wang (Tsinghua University, P.R. China)
This was in the session I was chairing. The idea is that you are given a subspace (e.g., a point on the Grassman manifold) and you want to see what happens when you randomly project this into a lower-dimensional subspace using an i.i.d. Gaussian matrix a la Johnson-Lindenstrauss. The JL Lemma says that projections are length-preserving. Are they also volume-preserving? It turns out that they are (no surprise). The main tools are measure concentration results together with a union bound over a covering set.

Is “Shannon capacity of noisy computing” zero?
Pulkit Grover (Carnegie Mellon University, USA)
Yes. I think. Maybe? Pulkit set up a physical model for computation and used a cut-set argument to show that the total energy expenditure is high. I started looking at the paper in the proceedings and realized that it’s significantly different than the talk though, so I’m not sure I really understood the argument. I should read the paper more carefully. You should too, probably.

After a long stint of proposal writing, I figured I should catch up on some old languishing posts. So here’s a few quick notes on the remainder of ICML 2014.

• Fast Stochastic Alternating Direction Method of Multipliers (Wenliang Zhong; James Kwok): Most of the talks in the Optimization II session were on ADMM or stochastic optimization, or both. This was int he last category. ADMM can have rather high-complexity update rules, especially on large, complex problems, so the goal is to lower the complexity of the update step by making it stochastic. The hard part seems to be controlling the step size.
• An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar): The full version of this paper is on ArXiV. The authors look at a multi-core lock-free stochastic coordinate descent method and characterize how many cores you need to get linear speedups — this depends on the convexity properties of the objective function.
• Communication-Efficient Distributed Optimization using an Approximate Newton-type Method (Ohad Shamir; Nati Srebro; Tong Zhang): This paper looked 1-shot “average at the end” schemes where you divide the data onto multiple machines, have them each train a linear predictor (for example) using stochastic optimization and then average the results. This is just averaging i.i.d. copies of some complicated random variable (the output of an optimization) so you would expect some variance reduction. This method has been studied by a few people int the last few years. While you do get variance reduction, the bias can still be bad. On the other extreme, communicating at every iteration essentially transmits the entire data set (or worse) over the network. They propose a new method for limiting communication by computing an approximate Newton step without approximating the full Hessian. It works pretty well.
• Lower Bounds for the Gibbs Sampler over Mixtures of Gaussians (Christopher Tosh; Sanjoy Dasgupta): This was a great talk about how MCMC can be really slow to converge. The model is a mixture of Gaussians with random weights (Dirichlet) and means (Gaussian I think). Since the posterior on the parameters is hard to compute, you might want to do Gibbs sampling. They use conductance methods to get a lower bound on the mixing time of the chain. The tricky part is that the cluster labels are permutation invariant — I don’t care if you label clusters (1,2) versus (2,1), so they need to construct some equivalence classes. They also have further results on what happens when the number of clusters is misspecified. I really liked this talk because MCMC always seems like black magic to me (and I even used it in a paper!)
• (Near) Dimension Independent Risk Bounds for Differentially Private Learning (Prateek Jain; Abhradeep Guha Thakurta): Abhradeep presented a really nice paper with a tighter analysis of output and objective perturbation methods for differentially private ERM, along with a new algorithm for risk minimization on the simplex. Abhradeep really only talked about the first part. If you focus on scalar regret, they show that essentially the error comes from taking the inner product of a noise vector with a data vector. If the noise is Gaussian then the noise level is dimension-independent for bounded data. This shows that taking $(\epsilon,\delta)$-differential privacy yield better sample complexity results than $(\epsilon,)$-differential privacy. This feels similar in flavor to a recent preprint on ArXiV by Beimel, Nissim, and Stemmer.
• Near-Optimally Teaching the Crowd to Classify (Adish Singla; Ilija Bogunovic; Gabor Bartok; Amin Karbasi; Andreas Krause): This was one of those talks where I would have to go back to look at the paper a bit more. The idea is that you want to train annotators to do better in a crowd system like Mechanical Turk — which examples should you give them to improve their performance? They model the learners as doing some multiplicative weights update. Under that model, the teacher has to optimize to pick a batch of examples to give to the learner. This is hard, so they use a submodular surrogate function and optimize over that.
• Discrete Chebyshev Classifiers (Elad Eban; Elad Mezuman; Amir Globerson): This was an award-winner. The setup is that you have categorical (not numerical) features on $n$ variables and you want to do some classification. They consider taking pairwise inputs and compute for each tuple $(x_i, x_j, y)$ a marginal $\mu_{ij}(x_i, x_j, y)$. If you want to create a rule $f: \mathcal{X} \to \mathcal{Y}$ for classification, you might want to pick one that has best worst-case performance. One approach is to take the one which has best worst-case performance over all joint distributions on all variables that agree with the empirical marginals. This optimization looks hard because of the exponential number of variables, but they in fact show via convex duality and LP relaxations that it can be solved efficiently. To which I say: wow! More details are in the paper, but the proofs seem to be waiting for a journal version.

I was a somewhat inconsistent note-taker here. Because a lot of the talks I attended were sufficiently out-of-area for me that I didn’t get the context for the work, I often found myself jotting a few “look at this later” pointers to myself rather than actual ideas from the talk.

First, the plenaries: Eric Horvitz, Michael Kearns, and Michael Jordan. Horvitz talked about how we’ve made a lot of progress in machine learning but there’s more work to be done in bringing humans back into the loop. Examples include developing semantics for what features mean, how to visualize the results, adding humans into the loop (e.g. active learning or interactive settings), crowdsourcing, and building tools that are sensitive to human cognitive limitations, like detecting and informing people of “surprising events,” which involves knowing what surprising means. He also announced a new data set, COCO for “common objects in context” (not Cocoa Puffs) which has around 300k-400k images and lots of annotations. The goal was to build al library of objects that a 4-year-old can recognize. Can a computer?

I honestly was a little too zonked/jetlagged to understand Michael Kearns’ talk, which was on challenges in algorithmic trading. He was focused on problems that brokers face, rather than the folks who are holding the risk. Michael Jordan gave a variant on a talk I’ve seen him give in the last few plenary/big talks I’ve seen: computation, statistics, and big data. The three examples he talked about were local differential privacy, bounds for distributed estimation, and the bag of little bootstraps.

As far as the research talks go, here are a few from the first day:

• Robust Principal Component Analysis with Complex Noise(Qian Zhao; Deyu Meng; Zongben Xu; Wangmeng Zuo; Lei Zhang): This paper interpreted the Robust PCA problem (given $Y = L = E$ where $L$ is low-rank and $E$ is sparse, recover $L$) in terms of MAP inference. The solution generally looks like a nuclear-norm plus $L_1$ regularization, which they claim implies a kind of Laplace-like model for the noise. They build a generative model and then change the distributions around to get different noise models.
• Discriminative Features via Generalized Eigenvectors (Nikos Karampatziakis; Paul Mineiro): This was on how to learn features that are discriminative in a multiclass setting while still being somewhat efficient. The main idea was to look at correlations in the existing features via the tensor $x \otimes x \otimes y$ where $x$ are the features and $y$ are the labels, and to then find generalized eigenvalues and eigenvectors by looking for vectors $v$ that maximize (for a given $(i,j)$ the ratio $\frac{ \mathbb{E}[ (v^{\top} x)^2 | y = i] }{ \mathbb{E}[ (v^{\top} x)^2 | y = j] }$. This nonlinearity is important for reasons which I wasn’t entirely sure about.
• Randomized Nonlinear Component Analysis (David Lopez-Paz; Suvrit Sra; Alex Smola; Zoubin Ghahramani; Bernhard Schoelkopf): I really enjoyed this talk — basically the idea is kernel versions of PCA and CCA have annoyingly large running times. So what they do here is linearize the kernel using sampling and then do some linear component analysis on the resulting features. The key tool is to use Matrix Bernstein inequalities to bound the kernel approximations.
• Memory and Computation Efficient PCA via Very Sparse Random Projections (Farhad Pourkamali Anaraki; Shannon Hughes): This talk was on efficient approximations to PCA for large data sets, but not in a streaming setting. The idea was, as I recall, that you have big data sets and different sites. Each site takes a very sparse random projection of its data (e.g. via a random signed Bernoulli matrix) and then these get aggregated via an estimator. They show that the estimator is unbiased and the variance depends on the kurtosis of the distribution of elements in the projection matrix. One thing that was interesting to me is that the covariance estimate has bias term towards the canonical basis, which is one of those facts that makes sense after you hear it.
• Concept Drift Detection Through Resampling (Maayan Harel; Shie Mannor; Ran El-Yaniv; Koby Crammer): This talk was sort of about change-detection, but not really. The idea is that a learning algorithm sees examples sequentially and wants to tell if there is a significant change in the expected risk of the distribution. The method they propose is a sequential permutation test — the challenge is that a gradual change in risk might be hard to detect, and the number of possible hypotheses to consider grows rather rapidly. I got some more clarification from Harel’s explanation at the poster, but I think this is one where reading the paper will make it clearer.

Noted without notes, but I enjoyed the posters (sometimes I read them since the presenter was not around):

• An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar)
• Clustering in the Presence of Background Noise (Shai Ben-David; Nika Haghtalab)
• Demystifying Information-Theoretic Clustering (Greg Ver Steeg; Aram Galstyan; Fei Sha; Simon DeDeo)
• Consistency of Causal Inference under the Additive Noise Model (Samory Kpotufe; Eleni Sgouritsa; Dominik Janzing; Bernhard Schoelkopf)
• Concentration in unbounded metric spaces and algorithmic stability (Aryeh Kontorovich)
• Hard-Margin Active Linear Regression (Zohar Karnin; Elad Hazan)
• Heavy-tailed regression with a generalized median-of-means (Daniel Hsu; Sivan Sabato)

My friend Cynthia her friends have a tumblr on inclusivity in STEM. See also the quarterly Model View Culture, which I think I had seen an article from but didn’t realize it was a whole journal. Thanks to Lily Irani for the link.

This list of streamable Errol Morris movies is dangerous.

How Chicago’s neighborhoods got their names. It does not explain Mr. Wicker’s crazy hat though.

Alex Smola gave a talk at DIMACS recently where he talked about the alias method for generating biased random variables. I think he even snagged the figures from that website as well…

I always end up bookmarking a bunch of papers from ArXiV and then looking at them a bit later than I want. So here are a few notes on some papers from the last month. I have a backlog of reading to catch up on, so I’ll probably split this into a couple of posts.

arXiv:1403.3465v1 [cs.LG]: Analysis Techniques for Adaptive Online Learning
H. Brendan McMahan
This is a nice survey on online learning/optimization algorithms that adapt to the data. These are all variants of the Follow-The-Regularized-Leader algorithms. The goal is to provide a more unified analysis of online algorithms where the regularization is data dependent. The intuition (as I see it) is that you’re doing a kind of online covariance estimation and then regularizing with respect to the distribution as you are learning it. Examples include the McMahan and Streeter (2010) paper and the Duchi et al. (2011) paper. Such adaptive regularizers also appear in dual averaging methods, where they are called “prox-functions.” This is a useful survey, especially if, like me, you’ve kind of checked in and out with the online learning literature and so may be missing the forest for the trees. Or is that the FoReL for the trees?

arXiv:1403.4011 [cs.IT]: Whose Opinion to follow in Multihypothesis Social Learning? A Large Deviation Perspective
Wee Peng Tay
This is a sort of learning from expert advice problem, though not in the setting that machine learners would consider it. The more control-oriented folks would recognize it as a multiple-hypothesis test. The model is that there is a single agent (agent $0$) and $K$ experts (agents $1, 2, \ldots, K$). The agent is trying to do an $M$-ary hypothesis test. The experts (and the agent) have access to local (private) observations $Y_k[1], Y_k[2], \ldots, Y_k[n_k]$ for $k \in \{0,1,2,\ldots,K\}$. The observations come from a family of distributions determined by the true hypothesis $m$. The agent $0$ needs to pick one of the $K$ experts to hire — the analogy is that you are an investor picking an analyst to hire. Each expert has its own local loss function $C_k$ which is a function of the amount of data it has as well as the true hypothesis and the decision it makes. This is supposed to model a “bias” for the expert — for example, they may not care to distinguish between two hypotheses. The rest of the paper looks at finding policies/decision rules for the agents that optimize the exponents with respect to their local loss functions, and then looking at how agent $0$ should act to incorporate that advice. This paper is a little out of my wheelhouse, but it seemed interesting enough to take a look at. In particular, it might be interesting to some readers out there.

arXiv:1403.3862 [math.OC] Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
Ji Liu, Stephen J. Wright
This is another paper on lock-free optimization (c.f. HOGWILD!). The key difference, as stated in the introduction, is that they “do not assume that the evaluation vector $\hat{x}$ is a version of $x$ that actually existed in the shared memory at some point in time.” What does this mean? It means that a local processor, when it reads the current state of the iterate, may be performing an update with respect to a point not on the sample path of the algorithm. They do assume that the delay between reading and updating the common state is bounded. To analyze this method they need to use a different analysis technique. The analysis is a bit involved and I’ll have to take a deeper look to understand it better, but from a birds-eye view this would make sense as long as the step size is chosen properly and the “hybrid” updates can be shown to be not too far from the original sample path. That’s the stochastic approximator in me talking though.

For those readers of the blog who have not submitted papers to machine learning (or related) conferences, the conference review process is a bit like a mini-version of a journal review. You (as the author) get the reviews back and have to write a response and then the reviewers discuss the paper and (possibly, but in my experience rarely) revise their reviews. However, they generally are supposed to take into account the response in the discussion. In some cases people even adjust their scores; when I’ve been a reviewer I often adjust my scores, especially if the author response addresses my questions.

This morning I had the singular experience of having a paper rejected from ICML 2014 in which all of the reviewers specifically marked that they did not read and consider the response. Based on the initial scores the paper was borderline, so the rejection is not surprising. However, we really did try to address their criticisms in our rebuttal. In particular, some misunderstood what our claims were. Had they bothered to read our response (and proposed edits), perhaps they would have realized this.

Highly selective (computer science) conferences often tout their reviews as being just as good as a journal, but in both outcomes and process, it’s a pretty ludicrous claim. I know this post may sound like sour grapes, but it’s not about the outcome, it’s about the process. Why bother with the facade of inviting authors to rebut if the reviewers are unwilling to read the response?

My friend Ranjit is working on this Crash Course in Psychology. Since I’ve never taken psychology, I am learning a lot!

Apparently the solution for lax editorial standards is to scrub away the evidence. (via Kevin Chen).

Some thoughts on high performance computing vs. Map Reduce. I think about this a fair bit, since some of my colleagues work on HPC, which feels like a different beast than a lot of the problems I’ve been thinking about.

A nice behind-the-scenes on Co-Op Sauce, a staple at Chicagoland farmers’ markets.

Due to weather issues, I was unable to make it on time to ITA to give my talk, which is based on an ArXiV preprint with Francesco Orabona, Tamir Hazan, and Tommi Jaakkola. The full work will be presented at ICML 2014 this summer. I decided to give the talk anyway and upload it to YouTube (warning: single take, much stammering):

I have a Mac, so I used the screencast recording features of QuickTime Player, as recommended to me by Manu Sridharan. Worked like a charm.

I plan to post a bit more about this problem later (I know, promises, promises), but in the meantime, this talk is mostly background about the MAP perturbation framework.

I started working this fall on an interesting problem (shameless plug!) with Francesco Orabona, Tamir Hazan, and Tommi Jaakkola. What we do there is basically a measure concentration result, but I wanted to write a bit about the motivation for the paper. It’s nicely on the edge of that systems EE / CS divide, so I thought it might be a nice topic for the blog. One name for this idea is “MAP perturbations” so the first thing to do is explain what that means. The basic idea is to take a posterior probability distribution (derived from observed data) and do a random perturbation of the probabilities, and then take the maximum of that perturbed distribution. Sometimes this is called “perturb-and-MAP” but as someone put it, that sounds a bit like “hit-and-run.”

The basic problem is to sample from a particular joint distribution on $n$ variables. For simplicity, let’s consider an $n$-bit vector $\mathbf{x} \in \{0,1\}^n$. There are $2^n$ possible values, so explicitly maximizing the distribution could be computationally tricky. However we are often aided by the fact probability model has some structure. For example, the $n$ bits may be identified with labels {foreground,background} and correspond to $n$ pixels in an image, and there may be some local constraints which make it more likely for adjacent pixels to have the same label. In general, these local constraints get lumped together into a potential function $\theta(\mathbf{x})$ which assigns a score to each $\mathbf{x}$. The distribution on $\mathbf{x}$ is a Gibbs distribution:

$p(\mathbf{x}) = \frac{1}{Z} \exp(\theta( \mathbf{x} ))$

where the normalizing constant $Z$ is the infamous partition function:

$Z = \sum_{\mathbf{x}} \exp(\theta(\mathbf{x}))$

It’s infamous because it’s often hard to compute explicitly. This also makes sampling from the Gibbs distribution hard.

The MAP rule chooses the $\mathbf{x}$ that maximizes this distribution. Doing this means you don’t need to calculate $Z$ since you can maximize the potential function instead:

$\mathbf{X}_{\mathsf{MAP}} = \mathrm{argmax} \left\{ \theta(\mathbf{x}) \right\}$.

This isn’t any easier in general, computationally, but people have put lots of blood, sweat, and tears into creating MAP solvers that use tricks to do this maximization. In some cases, these solvers work pretty well. Our goal will be to use the good solvers as a black box.

Unfortunately, in a lot of applications, we really would like to sample from he Gibbs distribution, because the number-one best configuration $\mathbf{x}$ may not be the only “good” one. In fact, there may be many almost-maxima, and sampling will let you produce a list of those. One way to do this is via Markov-chain Monte Carlo (MCMC), but the problem with all MCMC methods is you have to know how long to run the chain.

The MAP perturbation approach is different — it adds a random function $\gamma(\mathbf{x})$ to the potential function and solves the resulting MAP problem:

$\mathbf{X}_{\mathsf{R-MAP}} = \mathrm{argmax} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\}$

The random function $\gamma(\cdot)$ associates a random variable to each $\mathbf{x}$. The simplest approach to designing a perturbation function is to associate an independent and identically distributed (i.i.d.) random variable $\gamma(\mathbf{x})$ for each $\mathbf{x}$. We can find the distribution of the randomized MAP predictor when each $\gamma(\mathbf{x})$ a Gumbel random variable with zero mean, variance $\pi^2/6$, and cumulative distribution function

$G( y ) = \exp( - \exp( - (y + c)))$,

where $c \approx 0.5772$ is the Euler-Mascheroni constant.

So what’s the distribution of the output of the randomized predictor $\mathbf{X}_{\mathsf{R-MAP}}$? It turns out that the distribution is exactly that of the Gibbs distribution we want to sample from:

$\mathbb{P}_{\gamma}\left( \mathbf{X}_{\mathsf{R-MAP}} = \mathbf{x} \right) = p(\mathbf{x})$

and the expected value of the maximal value is the log of the partition function:

$\mathbb{E}_{\gamma}\left[ \max_{\mathbf{x}} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\} \right] = \log Z$

This follows from properties of the Gumbel distribution.

Great – we’re done. We can just generate the $\gamma$ random variables, shove the perturbed potential function into our black-box MAP solver, and voilà: samples from the Gibbs distribution. Except that there’s a little problem here – we have to generate $2^n$ Gumbel variables, one for each $\mathbf{x}$, so we’re still a bit sunk.

The trick is to come up with lower-complexity perturbation (something that Tamir and Tommi have been working on for a bit, among others), but I will leave that for another post…

I recently saw that Andrew Gelman hasn’t really heard of compressed sensing. As someone in the signal processing/machine learning/information theory crowd, it’s a little flabbergasting, but I think it highlights two things that aren’t really appreciated by the systems EE/algorithms crowd: 1) statistics is a pretty big field, and 2) the gulf between much statistical practice and what is being done in SP/ML research is pretty wide.

The other aspect of this is a comment from one of his readers:

Meh. They proved L1 approximates L0 when design matrix is basically full rank. Now all sparsity stuff is sometimes called ‘compressed sensing’. Most of it seems to be linear interpolation, rebranded.

I find such dismissals disheartening — there is a temptation to say that every time another community picks up some models/tools from your community that they are reinventing the wheel. As a short-hand, it can be useful to say “oh yeah, this compressed sensing stuff is like the old sparsity stuff.” However, as a dismissal it’s just being parochial — you have to actually engage with the use of those models/tools. Gelman says it can lead to “better understanding one’s assumptions and goals,” but I think it’s more important to “understand what others’ goals.”

I could characterize rate-distortion theory as just calculating some large deviations rate functions. Dembo and Zeitouni list RD as an application of the LDP, but I don’t think they mean “meh, it’s rebranded LDP.” For compressed sensing, the goal is to do the inference in a computationally and statistically efficient way. One key ingredient is optimization. If you just dismiss all of compressed sensing as “rebranded sparsity” you’re missing the point entirely.