NIPS 2012 : the rest of it

Almost a month later, I’m finishing up blogging about NIPS. Merry Christmas and all that (is anyone reading this thing?), and here’s to a productive 2013, research-wise. It’s a bit harder to blog these things because unlike a talk, it’s hard to take notes during a poster presentation.

Overall, I found NIPS to be a bit overwhelming — the single-track format makes it feel somehow more crowded than ISIT, but also it was hard for me to figure out how to strike the right balance of going to talks/posters and spending time talking to people and getting to know what they are working on. Now that I am fairly separated from my collaborators, conferences should be a good time to sit down and work on some problems, but somehow things are always a bit more frantic than I want them to be.

Anyway, from the rest of the conference, here are a few talks/posters that I went to and remembered something about.

T. Dietterich
Challenges for Machine Learning in Computational Sustainability
This was a plenary talk on machine learning problems that arise in natural resources management. There was a lot in this talk, with problems ranging from prediction (for bird migrations, etc.) to imputation of missing data and classification. These were real-world hands-on problems and one thing I got out of it is how much work you need to put into making algorithms that work for the data you have, rather than pulling some off-the-shelf works-great-in-theory method. He gave a version of this talk at TTI but I think the new version is better.

K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schölkopf
Learning from Distributions via Support Measure Machines
This was on generalizing SVMs to take distributions as inputs instead of points — instead of getting individual points as training data, you get distributions (perhaps like clusters) and you have to do learning/classification on that kind of data. Part of the trick here is finding the right mathematical framework that remains computationally tractable.
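One way to make "distributions as inputs" concrete is a kernel between sample sets. The sketch below averages an RBF kernel over all pairs of points drawn from two distributions, which is the empirical version of a kernel mean embedding inner product. The RBF choice, the bandwidth `gamma`, and the function names are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def rbf_gram(A, B, gamma):
    """Pairwise RBF kernel values between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def mean_embedding_kernel(P, Q, gamma=1.0):
    """Kernel between two distributions, each given as an array of samples.

    This is the average of k(x, y) over all x in P and y in Q.
    """
    return rbf_gram(P, Q, gamma).mean()

rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.1, size=(30, 2))  # samples from one "input distribution"
cluster_b = rng.normal(5.0, 0.1, size=(30, 2))  # samples from a far-away one
print(mean_embedding_kernel(cluster_a, cluster_a))
print(mean_embedding_kernel(cluster_a, cluster_b))
```

A Gram matrix built from such a kernel over the training distributions could then be handed to any standard kernel SVM solver.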

J. Duchi, M. Jordan, M. Wainwright
Privacy Aware Learning
Since I work on privacy, this was of course interesting to me — John told me a bit about the work at Allerton. The model of privacy is different than the “standard” differential privacy model — data is stochastic and the algorithm itself (the learner) is not trusted, so noise has to be added to individual data points. A bird’s eye view of the idea is this : (1) stochastic gradient descent (SGD) is good for learning, and is robust to noise (e.g. noisy gradients), (2) noise is good at protecting privacy, so (3) SGD can be used to guarantee privacy by using noisy gradients. Privacy is measured here in terms of the mutual information between the data point and a noisy gradient using that data point. The result is a slowdown in the convergence rate that is a function of the mutual information bound, and it appears in the same place in the upper and lower bounds.
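The three-step argument above can be illustrated with a toy sketch. This is not the mechanism or the privacy analysis from the paper: it assumes a squared loss for estimating a mean, Gaussian noise added to each per-point gradient, and a 1/t step size, and the name `noisy_sgd_mean` is made up for illustration.

```python
import numpy as np

def noisy_sgd_mean(data, noise_std, seed=0):
    """Estimate the mean of `data` by SGD on 0.5 * (theta - x)^2,
    perturbing every per-point gradient before the learner sees it."""
    rng = np.random.default_rng(seed)
    theta = 0.0
    for t, x in enumerate(data, start=1):
        grad = theta - x                                 # gradient at one data point
        noisy_grad = grad + rng.normal(0.0, noise_std)   # only this is released
        theta -= noisy_grad / t                          # step size 1/t
    return theta

rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=5000)
print(noisy_sgd_mean(data, noise_std=0.0))  # plain SGD
print(noisy_sgd_mean(data, noise_std=1.0))  # SGD on noisy gradients
```

With the 1/t step size the iterate is exactly the running average of the perturbed points, so the added noise does not bias the estimate but does inflate its variance. That is a toy version of the slowdown in the convergence rate.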

J. Wiens, J. Guttag, E. Horvitz
Patient Risk Stratification for Hospital-Associated C. Diff as a Time-Series Classification Task
This was a cool paper on predicting which patients would be infected with C. Diff (a common disease people get as a secondary infection from being in the hospital). Since we have different data for each patient and lots of missing data, the classification problem is not easy. They try to assess a time-evolving risk of infection and then predict whether or not the patient will test positive for C. Diff.

P. Loh, M. Wainwright
No voodoo here! Learning discrete graphical models via inverse covariance estimation
This paper won a best paper award. The idea is that for Gaussian graphical models the inverse covariance matrix is graph-compatible: zeros correspond to missing edges. However, this is not true/easy to do for discrete graphical models. So instead they build the covariance matrix for all tuples of variables, \{X_1, X_2, X_3, X_4, X_1 X_2, X_1 X_3, \ldots \} (really what they want is a triangulation of the graph), and then show that indeed, the inverse covariance matrix does respect the graph structure in a sense. More carefully, they have to augment the variables with the power set of the maximal cliques in a triangulation of the original graphical model. The title refers to so-called “nonparanormal” methods which are also used for discrete graphical models.
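A small sanity check of the simplest case: a binary chain X_1 - X_2 - X_3 - X_4 is already triangulated and every separator is a single node, so no augmented variables should be needed and the ordinary inverse covariance should already be tridiagonal. The sketch below computes the exact covariance by enumerating all 16 states. The parameter values and the function name are arbitrary choices for illustration, not from the paper.

```python
import itertools
import numpy as np

def ising_chain_covariance(edge_weights, node_weights):
    """Exact covariance of a {0,1}-valued chain model with
    p(x) proportional to exp(sum_i b_i x_i + sum_i w_i x_i x_{i+1})."""
    n = len(node_weights)
    states = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)
    log_weight = states @ node_weights
    for i, w in enumerate(edge_weights):
        log_weight += w * states[:, i] * states[:, i + 1]
    probs = np.exp(log_weight)
    probs /= probs.sum()
    centered = states - probs @ states
    return (centered * probs[:, None]).T @ centered

cov = ising_chain_covariance(
    edge_weights=np.array([0.8, -0.5, 1.2]),
    node_weights=np.array([0.1, -0.3, 0.2, 0.0]),
)
precision = np.linalg.inv(cov)
np.set_printoptions(precision=4, suppress=True)
print(cov)        # dense: every pair of variables is correlated
print(precision)  # entries for the non-edges (1,3), (1,4), (2,4) vanish
```

For graphs with larger separators (a 4-cycle, say) this node-level matrix is not enough in general, which is where the augmented tuples come in.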

V. Kanade, Z. Liu, B. Radunovic
Distributed Non-Stochastic Experts
This was a star-network with a centralized learner and a bunch of experts, except that the expert advice arrives at arbitrary times — there’s a tradeoff between how often the experts communicate with the learner and the achievable regret, and they try to quantify this tradeoff.

M. Streeter, B. McMahan
No-Regret Algorithms for Unconstrained Online Convex Optimization
There’s a problem with online convex optimization when the feasible set is unbounded. In particular, we would want to know that the optimal x^{\ast} is bounded so that we could calculate the rate of convergence. They get around this by proposing an algorithm called “reward doubling” which tries to maximize reward instead of minimizing regret.

Y. Chen, S. Sanghavi, H. Xu
Clustering Sparse Graphs
Suppose you have a graph and want to partition it into clusters with high intra-cluster edge density and low inter-cluster density. They come up with a nuclear-norm plus L_1 objective function to find the clusters. It seems to work pretty well, and they can analyze it in the planted partition / stochastic blockmodel setting.

P. Shenoy, A. Yu
Strategic Impatience in Go/NoGo versus Forced-Choice Decision-Making
This was a talk on cognitive science experimental design. They explain the difference between these two tasks in terms of a cost-asymmetry and use some decision analysis to explain a bias in the Go/NoGo task in terms of Bayes-risk minimization. The upshot is that the difference between these two tasks may not represent a difference in cognitive processing, but in the cost structure used by the brain to make decisions. It’s kind of like changing the rules of the game, I suppose.

S. Kpotufe, A. Boularias
Gradient Weights help Nonparametric Regressors
This was a super-cute paper, which basically says that if the regressor is very sensitive in some coordinates and not so much in others, you can use information about the gradient/derivative of the regressor to rebalance things and come up with a much better estimator.

K. Jamieson, R. Nowak, B. Recht
Query Complexity of Derivative-Free Optimization
Sometimes taking derivatives is expensive or hard, but you can approximate them by evaluating the function at two close points and taking a finite difference. This requires the function evaluations to be accurate. Here they look at how to handle approximate gradients computed with noisy function evaluations and find the convergence rate for those procedures.
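To see why noisy evaluations are a problem, here is a minimal sketch of a coordinate-wise central-difference gradient. The quadratic test function, the noise level, and the step sizes are all assumptions chosen for illustration, and this is not the procedure analyzed in the paper.

```python
import numpy as np

def finite_difference_gradient(f, x, h):
    """Approximate the gradient of f at x using two evaluations per coordinate."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

def quadratic(x):
    return float(x @ x)  # true gradient is 2x

rng = np.random.default_rng(0)

def noisy_quadratic(x, sigma=1e-3):
    return quadratic(x) + sigma * rng.standard_normal()

x0 = np.array([1.0, -2.0])
print(finite_difference_gradient(quadratic, x0, h=1e-3))        # close to [2, -4]
print(finite_difference_gradient(noisy_quadratic, x0, h=1e-1))  # noise is tolerable
print(finite_difference_gradient(noisy_quadratic, x0, h=1e-4))  # noise is amplified
```

The evaluation noise gets divided by 2h, so shrinking h to reduce the approximation error makes the noise error larger. That tension is what makes the noisy setting interesting.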

NIPS 2012 : day two

I took it a bit easy today at the conference and managed to spend some time talking to collaborators about work, so perhaps I wasn’t 100% tuned in to the talks and posters. In general I find that for many posters it’s hard to understand what the motivating problem is: it’s not clear from the poster, and it’s not always clear from the explanation. Here were a few papers which I thought were interesting:

W. Koolen, D. Adamskiy, M. Warmuth
Putting Bayes to sleep
Some signals look sort of jump Markov — the distribution of the data changes over time so that there are segments which have distribution A, then later it switches to B, then perhaps back to A, and so on. A prediction procedure which “mixes past posteriors” works well in this setting but it was not clear why. This paper provides a Bayesian interpretation for the predictor as mixing in a “sleeping experts” setting.

J. Duchi, M. Jordan, M. Wainwright, A. Wibisono
Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
This paper looked at stochastic gradient descent when function evaluations are cheap but gradient evaluations are expensive. The idea is to compute an unbiased approximation to the gradient by evaluating the function at \theta_t and \theta_t + \mathrm{noise} and then forming a finite-difference approximation to the gradient. Some of the attendees claimed this is similar to an approach proposed by Nesterov, but the distinction was unclear to me.
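A rough sketch of that kind of two-point estimator, assuming a Gaussian random direction z and a smoothing scale u (both are choices made here for illustration, and the paper's exact construction may differ). For a linear function the estimate is exactly unbiased. For a general smooth function it is unbiased for a slightly smoothed version of the function.

```python
import numpy as np

def two_point_gradient_estimate(f, theta, u, rng):
    """Gradient estimate from f(theta) and f(theta + u * z), with z ~ N(0, I)."""
    z = rng.standard_normal(theta.shape)
    return (f(theta + u * z) - f(theta)) / u * z

a = np.array([1.0, -2.0, 0.5])

def linear(theta):
    return float(a @ theta)  # true gradient is a everywhere

rng = np.random.default_rng(0)
theta = np.zeros(3)
estimates = np.array(
    [two_point_gradient_estimate(linear, theta, u=0.1, rng=rng) for _ in range(20000)]
)
print(estimates.mean(axis=0))  # the average should be close to a
print(estimates.std(axis=0))   # each single estimate is very noisy
```

A single estimate is cheap but has high variance, which is why the finite-sample convergence rates pick up a dependence on the dimension.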

J. Lloyd, D. Roy, P. Orbanz, Z. Ghahramani
Random function priors for exchangeable graphs and arrays
This paper looked at Bayesian modeling for structures like undirected graphs which may represent interactions, like protein-protein interactions. Infinite random graphs whose distributions are invariant under permutations of the vertex set can be associated to a structure called a graphon. Here they put a prior on graphons, namely a Gaussian process prior, and then try to do inference on real graphs to estimate the kernel function of the process, for example.
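For readers who have not seen graphons, the generative recipe is short: give each vertex a uniform latent position U_i, then connect i and j independently with probability W(U_i, U_j). The sketch below samples a finite graph this way. The particular W used here (a product function) is an arbitrary stand-in, not the Gaussian process construction from the paper.

```python
import numpy as np

def sample_graph_from_graphon(W, n, seed=0):
    """Sample an undirected n-vertex graph from a symmetric graphon W on [0,1]^2."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)                 # latent position of each vertex
    probs = W(u[:, None], u[None, :])       # edge probability for every pair
    # Draw each pair once (strict upper triangle), then mirror it.
    upper = np.triu(rng.uniform(size=(n, n)) < probs, k=1)
    return (upper | upper.T).astype(int)

def product_graphon(x, y):
    # Hypothetical example: vertices with larger latent position get more edges.
    return x * y

adjacency = sample_graph_from_graphon(product_graphon, n=8)
print(adjacency)
```

Relabeling the vertices does not change the distribution of the sampled graph, which is the exchangeability property the paper builds on.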

N. Le Roux, M. Schmidt, F. Bach
A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
This was a paper marked for oral presentation. The idea is that in gradient descent it is expensive to evaluate gradients if your objective function looks like \sum_{i=1}^{n} f(\theta, x_i), where x_i are your data points and n is huge. This is because you have to evaluate n gradients. On the other hand, stochastic gradient descent can be slow because it picks a single i and does a gradient step at each iteration on f(\theta_t, x_i). Here what they do at step t is pick a random point j, evaluate its gradient, but then take a gradient step on all n points. For points i \ne j they just use the gradient from the last time i was picked. Let T_i(t) be the last time i was picked before time t, and T_j(t) = t. Then they take a gradient step along \sum_{i = 1}^{n} \nabla f(\theta_{T_i(t)}, x_i). This works surprisingly well.
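The update described above is easy to sketch. The version below assumes a least-squares loss f(\theta, (x_i, y_i)) = (x_i^T \theta - y_i)^2 / 2, zero-initialized stored gradients, and a conservative constant step size of 1/(16 L) with L = \max_i \|x_i\|^2. Those choices and the function name are illustrative assumptions, not necessarily what the authors used.

```python
import numpy as np

def sag_least_squares(X, y, n_steps, seed=0):
    """Stochastic average gradient sketch for 0.5 * (x_i . theta - y_i)^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    stored = np.zeros((n, d))   # gradient from the last time each point was picked
    grad_sum = np.zeros(d)      # running sum of the stored gradients
    step = 1.0 / (16.0 * np.max(np.sum(X**2, axis=1)))
    for _ in range(n_steps):
        j = rng.integers(n)
        fresh = (X[j] @ theta - y[j]) * X[j]   # only one new gradient per step
        grad_sum += fresh - stored[j]          # swap point j's old gradient for the new one
        stored[j] = fresh
        theta -= step * grad_sum / n           # step along the average of all stored gradients
    return theta

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true                              # noiseless targets, so theta_true is optimal
print(sag_least_squares(X, y, n_steps=30000))
```

The price for the cheap iterations is memory: one stored gradient per data point, although for linear models like this one a single scalar per point would do.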

Stephane Mallat
Classification with Deep Invariant Scattering Networks
This was an invited talk — Mallat was trying to explain why deep networks seem to do learning well (it all seems a bit like black magic), but his explanation felt a bit heuristic to me in the end. The first main point he had is that wavelets are good at capturing geometric structure like translation and rotation, and appear to have favorable properties with respect to “distortions” in the signal. The notion of distortion is a little vague, but the idea is that if two signals (say images) are similar but one is slightly distorted, they should map to representations which are close to each other. The mathematics behind his analysis framework was group theoretic — he wants to estimate the group of actions which manipulate images. In a sense, this is a control-theory view of the problem (at least it seemed to me). The second point that I understood was that sparsity in representation has a big role to play in building efficient and layered representations. I think I’d have to see the talk again to understand it better, but in the end I wasn’t sure that I understood why deep networks are good, but I did understand some more interesting things about wavelet representations, which is cool.

Postdoc at Cornell in smart grid, learning, optimization and control

Applicants are sought for postdoctoral scholar position(s) at Cornell University in the areas of smart grid, learning, optimization, and control. Topics include, but are not limited to,

  1. the economics and operation of power systems with significant penetration of intermittent renewables.
  2. stochastic optimization, learning, game theory, mechanism design, and their applications.
  3. inference and control involving heavy tail distributions.

Successful candidates will participate in research activities led by Professors Lang Tong and/or Eilyan Bitar.

To apply, please send your CV, two recent papers, and the names of 2-3 references to Lang Tong (ltong@ece.cornell.edu) and Eilyan Bitar (eyb5@cornell.edu).