# Banff blog

I figured I would blog about this week’s workshop at Banff in a more timely fashion. Due to the scheduling of flights out of Calgary, I will have to miss the last day of talks. The topics of people’s presentations varied rather widely, and many were not about the sort of Good-Turing estimator setup. Sometimes it was a bit hard to see how to see how the problems or approaches were related (not that they had to be directly), but given that the crowd had widely varying backgrounds, presenters had a hard time because the audience had to check in a new set of notation or approach for every talk. The advantage is that there were lots of questions — the disadvantage is that people insisted on “finishing” their presentations. By mid-week my brain was over-full, and a Wednesday afternoon hike up Sulphur Mountain was the perfect solution.

The view from Sulpur Mountain

My notes from the workshop are not chronological — due to travel schedules (including conflicts with FOCS), the order of topics did not necessarily take a particular “flow” from topic A to topic B, although talks with a common methodological approach were grouped together. The last day had talks on sparsity and goodness-of-fit testing (by Sanghavi, Saligrama, and Huang), but unfortunately I could not make those.

Distribution estimation

The presentations on Tuesday were on what might be called “foundational” aspects of the problem : given a sequences of random samples $X_1, X_2, \ldots$ taking values in a set (alphabet) $\mathcal{X}$ whose size is unknown, what can we infer about the distribution that gave rise to those samples? The interesting case is when we have $n$ samples, but the alphabet is much larger (but still of unknown size), so our sample cannot even cover all of the alternatives. Alon Orlitsky laid out an information theoretic take on the problem : suppose that the samples are generated i.i.d. according to some distribution $atex P$. The best we can hope to do is recover the probability multiset of $P$, that is the set of values $\{P(x) : x \in \mathcal{X} \}$. The empirical distribution of the samples is a maximum-likelihood estimator but is often looks quite bad when $|\mathcal{X}|$ is large. Since we don’t care about the $x$ values themselves, just their probabilities, we can think instead about the pattern of a sequence, where we replace the first unique element by 1, the second by 2, and so on, so AnandSarwate becomes 121234156178. This is called the pattern of the sample, and we can instead find the $\hat{P}$ which maximizes the likelihood of seeing a particular pattern. For some short patterns this can be done, but it’s quite tricky in general. The corresponding multiset estimates look much closer to the true distribution, especially when the number of samples is smaller than the support of the distribution.

Prasad Santhanam talked about a prediction and compression view of the problem, which was the original motivation for the Good-Turing estimator : given the sequence of samples, how should we update our belief about what the next sample will be? En-Hui Yang talked more specifically about compression, and proposed an approach to learning structure in the source via context-free grammars. Krisha Viswanathan talked a bit more about the so-called pattern maximum likelihood estimator and what we know about it — for example, we think it’s unique but it may not be. It’s related to maximizing the permanent of a matrix parameterized by the probability distribution, which is known to be #P-complete in general, but not necessarily for this specific permanent. His talked segued nicely into Pascal Vontobel‘s talk about approximating these permanents using the “Bethe approximation.” I really ought to learn more physics.

A very different perspective was given that morning by Peter Orbanz, Sonia Petrone, and Jaeyong Lee, who discussed Bayesian nonparametric approaches to the the modeling problem. Peter and Sonia’s talks were on fundamentals — in particular, deriving the Poisson-Dirichlet and Pitman-Yor priors, the mathematical difficulties in doing Bayesian updating, and how these priors are related to different predictive distributions. Jaeyong Lee talked about a species sampling model in which species are drawn from a distribution with a discrete and continuous component. He showed that such models are only consistent in certain special cases. Species sampling was one of the original motivations for these problems (see my previous post).

Entropy estimation and property testing

Several of the speakers talked about estimation of properties of distributions, such as the entropy or support. Mesrob Ohanessian talked about consistent estimation in the “rare events regime” where we have a sequence of distributions $\{P_n\}$ on alphabets $\{ \mathcal{X}_n \}$ where $P_n(x) = \Theta(1/n)$. He argued that if $P_n \to P$ weakly and we have an estimator $\tilde{P}_n \to P$ weakly then we can estimate many properties consistently. Aaron Wagner presented some older work (which I blogged about before) on problems of hypothesis testing in this regime — given two processes $\{P_n\}$ and $\{Q_n\}$, can you tell if a third process is equal to one or the other?

Paul and Greg Valiant presented their recent work on estimating symmetric properties of distributions. If you have a bound $m$ on the support tide, they show more or less tight sample complexity bounds of $\theta(m / \log m)$ samples to estimate these properties. For the upper bound they show, using Stein’s method, the convergence of functionals of the probability multiset. This convergence is with respect to the Wasserstein (or earthmover) metric. The key is to do a random-sized resampling that renders some of the statistics independent (more details later if I get around to blogging it). The lower bound comes from explicitly constructing some distributions which are indistinguishable from samples but have very different entropies. Paul’s portion of the talk was on linear estimators and showed that they can be surprisingly good. The final take on estimation was given by Al Hero, who gave a brief second talk about entropy estimation for continuous sources and then asked how we can reconcile estimation in the continuous domain with the discrete domain approaches.

The different perspective, known as property testing, was given by Tugkan Batu and Ronitt Rubenfeld. There you have an unknown distribution $P$ from which you draw samples, but you just want to ask if $P$ has a given property. They show that you can get away with fewer samples in this setting. Tugkan focused on what kind of properties you can test when there is one source, and Ronitt talked about testing sets of distributions. In that case you can have a query model in which you say “give me a sample from distribution $P_i$,” or a model in which a source is chosen at random and the sample given to you. At the end of his talk, Aaron showed a sketch of an argument reducing his problem to a combination of property testing and estimation.

Markov stuff

There were two talks on what might be best described as stochastic processes. Serdar Yüksel talked about sequences of estimators converging from the perspective of control, and particular looked at estimators which looked like quantizers. It was a little over my head. Ramon Van Handel gave a talk on model selection for hidden Markov models in which the hidden states have an unknown number of possible values and you want to estimate them. This is quite hard and there are various penalized ML estimators (BIC penalty or MDL), but he claimed there are no correct proofs of the consistency of these estimators for large alphabets. This problem seems too hard, so he described two other problems in which the some of the same fundamental issues appear : the observations are not independent, the geometry of the likelihoods is weird, and there is no upper bound on the number of states.

Learning

There were a few talks on learning algorithms. Dean Foster talked about modeling text using HMMs but also trying to cluster words semantically using multi-view learning and canonical correlation analysis (CCA). Alexandr Andoni talked about a budgeted learning game which he called precision sampling. In that model, you want to estimate the sum of some numbers $x_1, \ldots, x_n$, but you will get noisy answers $\tilde{x}_1, \ldots, \tilde{x}_n$. You can specify precisions $u_i$ such that $|x_i - \tilde{x}_i| \le u_i$ at a cost $1/u_i$. Suppose you pick the precisions and independently an adversary picks the $\{x_i\}$. You reveal your precisions and the adversary produces $\{\tilde{x}_i\}$. How well can you estimate the sum as a function of the budget? David Woodruff talked about the perceptron algorithm (and other problems in Euclidean space), and how to lower the complexity by essentially sampling points to do the update steps. Given my recent interests in stochastic approximation, that talk was quite interesting. Plus, David was one of the first people I met when I started at MIT, so we came full circle.

Networks and graphs

Finally, several of the speakers talked about problems on graphs and networks whose techniques or ideas could possibly prove useful for estimation. Sidharth Jaggi talked about network tomography. Al Hero talked about inferring correlation in graphical models via better thresholding. Michael Mahoney talked about how certain approximate optimization algorithms like PageRank can really be viewed as regularized optimizations with different regularization functions.

Finally, in one of my favorite presentations, Balazs Szegedy talked about inferring structures in very large graphs from sampling, using the Szemeredi Regularity Lemma and other tools. He covered arithmetic progressions and graphons and all sorts of cool stuff. Things got a bit technical for a while, and when he saw that some people were getting lost he encouraged us to “enjoy the music of it.” I’m totally going to use that line.

This site uses Akismet to reduce spam. Learn how your comment data is processed.