# NIPS 2012 : day two

I took it a bit easy today at the conference and spent some time talking to collaborators about work, so perhaps I wasn't 100% tuned in to the talks and posters. In general I find that for many posters it's hard to tell what the motivating problem is: it's not clear from the poster, and it's not always clear from the explanation either. Here are a few papers I thought were interesting:

W. Koolen, D. Adamskiy, M. Warmuth
Putting Bayes to sleep
Some signals look sort of jump-Markov: the distribution of the data changes over time, so there are segments with distribution A, then a switch to B, then perhaps back to A, and so on. A prediction procedure which "mixes past posteriors" works well in this setting, but it was not clear why. This paper provides a Bayesian interpretation of that predictor as mixing in a "sleeping experts" setting.
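A rough sketch of the "mixing past posteriors" style of update being interpreted here (the learning rate `eta` and mixing weight `alpha` are my own toy parameters, and this is a simplified variant for illustration, not the paper's exact algorithm):

```python
import numpy as np

def mix_past_posteriors(losses, eta=1.0, alpha=0.05):
    """Simplified 'mixing past posteriors' sketch: expert weights are
    updated multiplicatively by exp(-eta * loss), then mixed with the
    average of all past posteriors, so an expert that was good in an
    earlier segment keeps enough weight to be revived quickly.
    losses: (T, n) array of per-round losses for n experts."""
    T, n = losses.shape
    w = np.full(n, 1.0 / n)
    past = [w.copy()]
    preds = []
    for t in range(T):
        preds.append(w.copy())            # weights used to predict at round t
        v = w * np.exp(-eta * losses[t])  # multiplicative (Bayesian) update
        v /= v.sum()
        past.append(v.copy())
        # mix in the average of past posteriors ("waking" sleeping experts)
        w = (1 - alpha) * v + alpha * np.mean(past, axis=0)
    return np.array(preds)
```

With `alpha = 0` this collapses to plain exponential weights, which forgets a segment's good expert; the mixing term is what handles the switch-back-to-A behavior.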

J. Duchi, M. Jordan, M. Wainwright, A. Wibisono
Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
This paper looked at stochastic gradient descent when function evaluations are cheap but gradient evaluations are expensive. The idea is to compute a nearly unbiased approximation to the gradient by evaluating the function at $\theta_t$ and at $\theta_t + \mathrm{noise}$, and then taking a discrete (finite-difference) approximation to the gradient along the noise direction. Some of the attendees claimed this is similar to an approach proposed by Nesterov, but the distinction was unclear to me.
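A minimal sketch of this kind of two-point gradient estimator (the step sizes and the quadratic test function are my own toy choices, not the paper's; its actual smoothing and step-size analysis is more careful):

```python
import numpy as np

def two_point_gradient(f, theta, delta=1e-3, rng=None):
    """Gradient estimate from two function values: perturb theta in a
    random Gaussian direction u and form the finite difference
    (f(theta + delta*u) - f(theta)) / delta along u."""
    if rng is None:
        rng = np.random.default_rng()
    u = rng.standard_normal(theta.shape)
    return (f(theta + delta * u) - f(theta)) / delta * u

def zero_order_sgd(f, theta0, steps=3000, lr=0.02, delta=1e-3, seed=0):
    """SGD that only ever evaluates f, never its gradient."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * two_point_gradient(f, theta, delta, rng)
    return theta
```

For Gaussian directions the estimator satisfies $\mathbb{E}[u u^\top \nabla f] = \nabla f$, so up to the $O(\delta)$ finite-difference error it is an unbiased gradient, at the cost of extra variance that grows with dimension.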

J. Lloyd, D. Roy, P. Orbanz, Z. Ghahramani
Random function priors for exchangeable graphs and arrays
This paper looked at Bayesian modeling for structures like undirected graphs which may represent interactions, like protein-protein interactions. Infinite random graphs whose distributions are invariant under permutations of the vertex set can be associated to a structure called a graphon. Here they put a prior on graphons, namely a Gaussian process prior, and then try to do inference on real graphs to estimate the kernel function of the process, for example.
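The generative side of the graphon story can be sketched in a few lines (the particular graphon `W` in the test is an arbitrary toy choice, not from the paper):

```python
import numpy as np

def sample_graphon_graph(W, n, seed=0):
    """Sample an n-vertex exchangeable random graph from a graphon W:
    draw u_i ~ Uniform(0, 1) for each vertex, then include each edge
    {i, j} independently with probability W(u_i, u_j)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.uniform() < W(u[i], u[j]):
                A[i, j] = A[j, i] = 1
    return A
```

The paper's move, as I understood it, is to treat the function $W$ itself as unknown and put a Gaussian process prior on it (through a suitable link), then infer it from an observed graph.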

N. Le Roux, M. Schmidt, F. Bach
A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
This was a paper marked for oral presentation. The idea is that in gradient descent it is expensive to evaluate gradients if your objective function looks like $\sum_{i=1}^{n} f(\theta, x_i)$, where the $x_i$ are your data points and $n$ is huge, because each step requires evaluating $n$ gradients. On the other hand, stochastic gradient descent can be slow because at each iteration it picks a single $i$ and takes a gradient step on $f(\theta_t, x_i)$ alone. Here, at step $t$ they pick a random point $j$ and evaluate its gradient, but then take a gradient step using all $n$ points: for points $i \ne j$ they reuse the gradient from the last time $i$ was picked. If $T_i(t)$ is the last time $i$ was picked before time $t$, with $T_j(t) = t$, then the update steps along the stored gradients $\sum_{i = 1}^{n} \nabla f(\theta_{T_i(t)}, x_i)$. This works surprisingly well.
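A minimal sketch of this scheme (the function name, step size, and least-squares test problem below are my own illustration; the paper's step-size analysis is what actually gives the exponential rate):

```python
import numpy as np

def sag(grad_i, theta0, n, steps=5000, lr=0.05, seed=0):
    """Stochastic Average Gradient (sketch): keep the most recently
    computed gradient for every data point; each step refreshes one
    random point's gradient and moves along the average of all n
    stored gradients."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    # g[i] holds the gradient of f(., x_i) at theta_{T_i(t)}
    g = np.zeros((n,) + theta.shape)
    for _ in range(steps):
        j = rng.integers(n)
        g[j] = grad_i(theta, j)          # refresh only point j's gradient
        theta = theta - lr * g.mean(axis=0)
    return theta
```

Each iteration costs one gradient evaluation, like SGD, but the averaged stale gradients behave much more like the full gradient as the iterates settle down.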

Stephane Mallat
Classification with Deep Invariant Scattering Networks
This was an invited talk. Mallat was trying to explain why deep networks seem to learn so well (it all seems a bit like black magic), but his explanation felt a bit heuristic to me in the end. His first main point was that wavelets are good at capturing geometric structure like translation and rotation, and appear to have favorable properties with respect to "distortions" in the signal. The notion of distortion is a little vague, but the idea is that if two signals (say images) are similar but one is slightly distorted, they should map to representations which are close to each other. The mathematics behind his analysis framework was group theoretic: he wants to estimate the group of actions which manipulate images. In a sense this is a control-theory view of the problem (at least it seemed so to me). The second point I took away was that sparsity of representation plays a big role in building efficient, layered representations. I'd have to see the talk again to understand it better; in the end I wasn't sure I understood why deep networks are good, but I did learn some more interesting things about wavelet representations, which is cool.


# NIPS 2012 : day one

I am attending NIPS this year for the first time, and so I figured it would be good to blog about some of it here. I totally dropped the ball on Allerton, so maybe I'll make up for it by writing more about the actual talks here. Fortunately, or unfortunately, most of the conference is about things I have almost no experience with, so I am facing a bit of an explore/exploit tradeoff in my selection process.

Every day of the conference has a poster session from 7 to midnight: there are 90+ posters in a single room, and people drift in and out, hanging out with friends and looking at posters. My poster (a paper with Kamalika Chaudhuri and Kaushik Sinha on differentially private approximations to PCA) was last night, so I was on the presenting end of things. I gave up at 10:30 because I was getting hoarse and tired, but even then there were a fair number of people milling about. Since I was (mostly) at my poster, I missed out on most of the other work.

During the day the conference is a single-track affair with invited and highlighted talks. There are two kinds of highlighted talks: some papers are marked for oral presentation, and some are marked as "spotlights," which means the authors get to make a five-minute elevator pitch for their poster in front of the whole conference. Those start today, and I'm looking forward to it.

In the meantime, here is a picture from the hike I took yesterday with Erin:

Mountain Range on a hike near Lake Lily.