A few weeks ago I got to go to Bellairs in Holetown, Barbados for a workshop organized by Mike Rabbat and Mark Coates of McGill University. It’s a small workshop, mostly for Mike and Mark’s students, and it’s a chance to interact closely and perhaps start some new research collaborations. Here’s a brief summary of the workshop as I remember it from my notes.
Better late than never, I suppose. A few weeks ago I escaped the cold of New Jersey to my old haunts of San Diego. Although La Jolla was always a bit fancy for my taste, it’s hard to beat a conference which boasts views like this:
I’ll just recap a few of the talks that I remember from my notes — I didn’t really take notes during the plenaries so I don’t have much to say about them. Mostly this was due to laziness, but finding the time to blog has been challenging in this last year, so I think I have to pick my battles. Here’s a smattering consisting of
Emina Soljanin talked about designing codes that are good for fast access to the data in distributed storage. Initial work focused on how to repair codes under disk failures. She looked at how easy it is to retrieve the information afterwords to guarantee some QoS for the storage system. Adam Kalai talked about designing compression schemes that work for an “audience” of decoders. The decoders have different priors on the set of elements/messages so the idea is to design an encoder that works for this ensemble of decoders. I kind of missed the first part of the talk so I wasn’t quite sure how this relates to classical work in mismatched decoding as done in the information theory world. Gireeja Ranade gave a great talk about defining notions of capacity/rate need to control a system which as multiplicative uncertainty. That is, where has the uncertainty. She gave a couple of different notions of capacity, relating to the ratio — either the expected value of the square or the log, appropriately normalized. She used a “deterministic model” to give an explanation of how control in this setting is kind of like controlling the number of significant bits in the state: uncertainty increases this and you need a certain “amount” of control to cancel that growth.
(Learning and statistics)
I learned about active regression approaches from Sivan Sabato that provably work better than passive learning. The idea there is do to use a partition of the X space and then do piecewise constant approximations to a weight function that they use in a rejection sampler. The rejection sampler (which I thought of as sort of doing importance sampling to make sure they cover the space) helps limit the number of labels requested by the algorithm. Somehow I had never met Raj Rao Nadakuditi until now, and I wish I had gotten a chance to talk to him further. He gave a nice talk on robust PCA, and in particular how outliers “break” regular PCA. He proposed a combination of shrinkage and truncation to help make PCA a bit more stable/robust. Laura Balzano talked about “estimating subspace projections from incomplete data.” She proposed an iterative algorithm for doing estimation on the Grassmann manifold that can do subspace tracking. Constantine Caramanis talked about a convex formulation for mixed regression that gives a guaranteed solution, along with minimax sample complexity bounds showing that it is basically optimal. Yingbin Liang talked about testing approaches for understanding if there is an “anomalous structure” in a sequence of data. Basically for a sequence , the null hypothesis is that they are all i.i.d. and the (composite) alternative is that there an interval of indices which are instead. She proposed a RKHS-based discrepancy measure and a threshold test on this measure. Pradeep Ravikumar talked about a “simple” estimator that was a “fix” for ordinary least squares with some soft thresholding. He showed consistency for linear regression in several senses, competitive with LASSO in some settings. Pretty neat, all said, although he also claimed that least squares was “something you all know from high school” — I went to a pretty good high school, and I don’t think we did least squares! Sanmi Koyejo talked about a Bayesian devision theory approach to variable selection that involved minimizing some KL-divergence. Unfortunately, the resulting optimization ended up being NP-hard (for reasons I can’t remember) and so they use a greedy algorithm that seems to work pretty well.
Cynthia Dwork gave a tutorial on differential privacy with an emphasis on the recent work involving false discovery rate. In addition to her plenary there were several talks on differential privacy and other privacy measures. Kunal Talwar talked about their improved analysis of the SuLQ method for differentially private PCA. Unfortunately there were two privacy sessions in parallel so I hopped over to see John Duchi talk about definitions of privacy and how definitions based on testing are equivalent to differential privacy. The testing framework makes it easier to prove minimax bounds, though, so it may be a more useful view at times. Nadia Fawaz talked about privacy for time-series data such as smart meter data. She defined different types of attacks in this setting and showed that they correspond to mutual information or directed mutual information, as well as empirical results on a real data set. Raef Bassily studied a estimation problem in the streaming setting where you want to get a histogram of the most frequent items in the stream. They reduce the problem to one of finding a “unique heavy hitter” and develop a protocol that looks sort of like a code for the MAC: they encode bits into a real vector, had noise, and then add those up over the reals. It’s accepted to STOC 2015 and he said the preprint will be up soon.
Last week I attended the National Academies Keck Futures Initiative (NAKFI) Conference on Collective Behavior, which was really a huge amount of fun. I learned a ton of science (and that I basically know nothing about science — or rather, there is soooo much science to know), and had some very interesting discussion about… stuff. Why am I so cagey? Because the details of discussions at the conference are officially embargoed until the report is issued by the National Academies in spring.
This embargo concept is not entirely new to me, but coming as I do from a tribe that tries to post things on ArXiV as fast as possible, the idea that one should keep mum for a few months feels a bit strange. It makes a lot of sense — people presented posters on work in progress or partial results that they were still working on, and without an embargo there is a potential danger of getting scooped, which could inhibit the free and open sharing of ideas. I certainly felt more comfortable talking about (possibly half-baked) future research ideas, although that was primarily because I didn’t think the ecologist I was conversing with would care as much about stochastic gradient methods.
Embargoes seem to be the norm in Science because of… Science… and Nature… and PNAS. If you have a high-profile article to appear in one of those fancy journals, they want the credit for having chosen it/are the venue in which it appeared. Slapping up your preprint on ArXiV is not on, since it bursts the balloon (although Nature says “[n]either conference presentations nor posting on recognized preprint servers constitute prior publication”). This is newsworthy science, and there’s a relationship between the press, the academic press, and the research community that has been discussed at length.
I came across a blog called Embargo Watch that looks to see how the media/reporters breach the embargoes imposed by the publisher. Indeed, if you look at various embargo policies (even PLoS has one!) show that the embargo thing is really about controlling the news media’s description of the article prior to publication. There’s been a longstanding (un?)healthy debate about the value of embargoes. Personally, I’d prefer to see a someone who studies communication and science studies (like Marisa) do a more critical evaluation of the role of embargoes in enforcing particular constructions and interpretations of the scientific process, the role of power and control, and how researchers propagate and resist the tensions inherent in publishing in high-impact journals.
Regardless, I am following the embargo and keeping quiet while trying to process everything I learned last week. I guess I am glad the ArXiV is there for me — it’s a little more my speed. Actually, it may be a bit too speedy, but it works for now. I think people working in engineering, computer science, and mathematics might find the notion of an embargo somewhat puzzling, as I did. Does this concept even make sense in those fields?
After a long stint of proposal writing, I figured I should catch up on some old languishing posts. So here’s a few quick notes on the remainder of ICML 2014.
- Fast Stochastic Alternating Direction Method of Multipliers (Wenliang Zhong; James Kwok): Most of the talks in the Optimization II session were on ADMM or stochastic optimization, or both. This was int he last category. ADMM can have rather high-complexity update rules, especially on large, complex problems, so the goal is to lower the complexity of the update step by making it stochastic. The hard part seems to be controlling the step size.
- An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar): The full version of this paper is on ArXiV. The authors look at a multi-core lock-free stochastic coordinate descent method and characterize how many cores you need to get linear speedups — this depends on the convexity properties of the objective function.
- Communication-Efficient Distributed Optimization using an Approximate Newton-type Method (Ohad Shamir; Nati Srebro; Tong Zhang): This paper looked 1-shot “average at the end” schemes where you divide the data onto multiple machines, have them each train a linear predictor (for example) using stochastic optimization and then average the results. This is just averaging i.i.d. copies of some complicated random variable (the output of an optimization) so you would expect some variance reduction. This method has been studied by a few people int the last few years. While you do get variance reduction, the bias can still be bad. On the other extreme, communicating at every iteration essentially transmits the entire data set (or worse) over the network. They propose a new method for limiting communication by computing an approximate Newton step without approximating the full Hessian. It works pretty well.
- Lower Bounds for the Gibbs Sampler over Mixtures of Gaussians (Christopher Tosh; Sanjoy Dasgupta): This was a great talk about how MCMC can be really slow to converge. The model is a mixture of Gaussians with random weights (Dirichlet) and means (Gaussian I think). Since the posterior on the parameters is hard to compute, you might want to do Gibbs sampling. They use conductance methods to get a lower bound on the mixing time of the chain. The tricky part is that the cluster labels are permutation invariant — I don’t care if you label clusters (1,2) versus (2,1), so they need to construct some equivalence classes. They also have further results on what happens when the number of clusters is misspecified. I really liked this talk because MCMC always seems like black magic to me (and I even used it in a paper!)
- (Near) Dimension Independent Risk Bounds for Differentially Private Learning (Prateek Jain; Abhradeep Guha Thakurta): Abhradeep presented a really nice paper with a tighter analysis of output and objective perturbation methods for differentially private ERM, along with a new algorithm for risk minimization on the simplex. Abhradeep really only talked about the first part. If you focus on scalar regret, they show that essentially the error comes from taking the inner product of a noise vector with a data vector. If the noise is Gaussian then the noise level is dimension-independent for bounded data. This shows that taking -differential privacy yield better sample complexity results than -differential privacy. This feels similar in flavor to a recent preprint on ArXiV by Beimel, Nissim, and Stemmer.
- Near-Optimally Teaching the Crowd to Classify (Adish Singla; Ilija Bogunovic; Gabor Bartok; Amin Karbasi; Andreas Krause): This was one of those talks where I would have to go back to look at the paper a bit more. The idea is that you want to train annotators to do better in a crowd system like Mechanical Turk — which examples should you give them to improve their performance? They model the learners as doing some multiplicative weights update. Under that model, the teacher has to optimize to pick a batch of examples to give to the learner. This is hard, so they use a submodular surrogate function and optimize over that.
- Discrete Chebyshev Classifiers (Elad Eban; Elad Mezuman; Amir Globerson): This was an award-winner. The setup is that you have categorical (not numerical) features on variables and you want to do some classification. They consider taking pairwise inputs and compute for each tuple a marginal . If you want to create a rule for classification, you might want to pick one that has best worst-case performance. One approach is to take the one which has best worst-case performance over all joint distributions on all variables that agree with the empirical marginals. This optimization looks hard because of the exponential number of variables, but they in fact show via convex duality and LP relaxations that it can be solved efficiently. To which I say: wow! More details are in the paper, but the proofs seem to be waiting for a journal version.
This is my first time at ICML, and every paper here has a talk and a poster. It’s a lot of work to prepare, but one nice benefit is that because my poster had to be done before I left, the talk was also pretty much done at the same time, modulo minor tweaks. Having to be ready early means less last-minute preparations and lower-stress at the conference overall. Another plus is that some talks are probably better as posters and some posters are probably better as talks, so the two modes of presentation gives a diversity to the delivery process. Some people also prefer talks to posters or vice-versa, so that’s good for them as well. Finally, the conference has 6 parallel tracks, so knowing that there’s a poster takes some of the stress out of deciding which session to attend — you can always catch the poster if you missed the talk.
The major minus is time. Sessions run from 8:30 to 6 and then posters run from 7 to 11 PM — it’s overwhelming! You can easily spend the entire conference at talks and then at posters, resulting in a brain overload. This also leaves less time for chatting and catching up with colleagues over dinner, starting up new research ideas or continuing ongoing projects in person, and the informal communication that happens at conferences. People do make time for that, but the format less conducive to it, or so it appeared to me. I ended up taking time off a bit during the sessions to take a walk around the Olympic park and have a chat, and I saw others leaving to do some sightseeing, so perhaps I am adhering to the schedule too much.
It’s interesting how different the modes of conference/social research communication are across research disciplines. I’ve yet to go to ICASSP or ICC, and while I have been to a medical informatics conference once, I haven’t gone to a Big Science conference or the joint meetings for mathematics or statistics. I imagine the whole purpose and format of those is completely different, and it makes me wonder if the particular formats of machine learning conferences are intentional: since there is rarely an extended/journal version of the paper, the conference is the only opportunity for attendees to really buttonhole the author and ask questions about details that are missing from the paper. Perhaps maximizing author exposure is a means to an end.
I was a somewhat inconsistent note-taker here. Because a lot of the talks I attended were sufficiently out-of-area for me that I didn’t get the context for the work, I often found myself jotting a few “look at this later” pointers to myself rather than actual ideas from the talk.
First, the plenaries: Eric Horvitz, Michael Kearns, and Michael Jordan. Horvitz talked about how we’ve made a lot of progress in machine learning but there’s more work to be done in bringing humans back into the loop. Examples include developing semantics for what features mean, how to visualize the results, adding humans into the loop (e.g. active learning or interactive settings), crowdsourcing, and building tools that are sensitive to human cognitive limitations, like detecting and informing people of “surprising events,” which involves knowing what surprising means. He also announced a new data set, COCO for “common objects in context” (not Cocoa Puffs) which has around 300k-400k images and lots of annotations. The goal was to build al library of objects that a 4-year-old can recognize. Can a computer?
I honestly was a little too zonked/jetlagged to understand Michael Kearns’ talk, which was on challenges in algorithmic trading. He was focused on problems that brokers face, rather than the folks who are holding the risk. Michael Jordan gave a variant on a talk I’ve seen him give in the last few plenary/big talks I’ve seen: computation, statistics, and big data. The three examples he talked about were local differential privacy, bounds for distributed estimation, and the bag of little bootstraps.
As far as the research talks go, here are a few from the first day:
- Robust Principal Component Analysis with Complex Noise(Qian Zhao; Deyu Meng; Zongben Xu; Wangmeng Zuo; Lei Zhang): This paper interpreted the Robust PCA problem (given where is low-rank and is sparse, recover ) in terms of MAP inference. The solution generally looks like a nuclear-norm plus regularization, which they claim implies a kind of Laplace-like model for the noise. They build a generative model and then change the distributions around to get different noise models.
- Discriminative Features via Generalized Eigenvectors (Nikos Karampatziakis; Paul Mineiro): This was on how to learn features that are discriminative in a multiclass setting while still being somewhat efficient. The main idea was to look at correlations in the existing features via the tensor where are the features and are the labels, and to then find generalized eigenvalues and eigenvectors by looking for vectors that maximize (for a given the ratio . This nonlinearity is important for reasons which I wasn’t entirely sure about.
- Randomized Nonlinear Component Analysis (David Lopez-Paz; Suvrit Sra; Alex Smola; Zoubin Ghahramani; Bernhard Schoelkopf): I really enjoyed this talk — basically the idea is kernel versions of PCA and CCA have annoyingly large running times. So what they do here is linearize the kernel using sampling and then do some linear component analysis on the resulting features. The key tool is to use Matrix Bernstein inequalities to bound the kernel approximations.
- Memory and Computation Efficient PCA via Very Sparse Random Projections (Farhad Pourkamali Anaraki; Shannon Hughes): This talk was on efficient approximations to PCA for large data sets, but not in a streaming setting. The idea was, as I recall, that you have big data sets and different sites. Each site takes a very sparse random projection of its data (e.g. via a random signed Bernoulli matrix) and then these get aggregated via an estimator. They show that the estimator is unbiased and the variance depends on the kurtosis of the distribution of elements in the projection matrix. One thing that was interesting to me is that the covariance estimate has bias term towards the canonical basis, which is one of those facts that makes sense after you hear it.
- Concept Drift Detection Through Resampling (Maayan Harel; Shie Mannor; Ran El-Yaniv; Koby Crammer): This talk was sort of about change-detection, but not really. The idea is that a learning algorithm sees examples sequentially and wants to tell if there is a significant change in the expected risk of the distribution. The method they propose is a sequential permutation test — the challenge is that a gradual change in risk might be hard to detect, and the number of possible hypotheses to consider grows rather rapidly. I got some more clarification from Harel’s explanation at the poster, but I think this is one where reading the paper will make it clearer.
Noted without notes, but I enjoyed the posters (sometimes I read them since the presenter was not around):
- An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar)
- Clustering in the Presence of Background Noise (Shai Ben-David; Nika Haghtalab)
- Demystifying Information-Theoretic Clustering (Greg Ver Steeg; Aram Galstyan; Fei Sha; Simon DeDeo)
- Consistency of Causal Inference under the Additive Noise Model (Samory Kpotufe; Eleni Sgouritsa; Dominik Janzing; Bernhard Schoelkopf)
- Concentration in unbounded metric spaces and algorithmic stability (Aryeh Kontorovich)
- Hard-Margin Active Linear Regression (Zohar Karnin; Elad Hazan)
- Heavy-tailed regression with a generalized median-of-means (Daniel Hsu; Sivan Sabato)