More like an old paper… this (finally) a journal version of some older work that we did on analyzing Bayesian nonparametric estimators form an information-theoretic (redundancy) perspective.

Exchangeable random partition processes are the basis for Bayesian approaches to statistical inference in large alphabet settings. On the other hand, the notion of the pattern of a sequence provides an information-theoretic framework for data compression in large alphabet scenarios. Because data compression and parameter estimation are intimately related, we study the redundancy of Bayes estimators coming from Poisson-Dirichlet priors (or “Chinese restaurant processes”) and the Pitman-Yor prior. This provides an understanding of these estimators in the setting of unknown discrete alphabets from the perspective of universal compression. In particular, we identify relations between alphabet sizes and sample sizes where the redundancy is small, thereby characterizing useful regimes for these estimators.

In the large alphabet setting, one thing we might be interested in is sequential prediction: I observe a sequence of butterfly species and want to predict whether the next butterfly I collect will be new or one that I have seen before. One simple way to do this prediction is to put a prior on the set of all distributions on infinite supports and do inference on that model given the data. This corresponds to the so-called Chinese Restaurant Process (CRP) approach to the problem. The information-theoretic view is that sequential prediction is equivalent to compression: the estimator is assigning a probability $q(x^n)$ to the sequence $x^n$ seen so far. An estimator is good if for any distribution $p$, if $x^n$ is drawn i.i.d. according to $p$, then the divergence between $p(x^n)$ and $q(x^n)$ is “small.” The goal of this work is to understand when CRP estimators are good in this sense.

This sort of falls in with the “frequentist analysis of Bayesian procedures” thing which some people work on.

After a long stint of proposal writing, I figured I should catch up on some old languishing posts. So here’s a few quick notes on the remainder of ICML 2014.

• Fast Stochastic Alternating Direction Method of Multipliers (Wenliang Zhong; James Kwok): Most of the talks in the Optimization II session were on ADMM or stochastic optimization, or both. This was int he last category. ADMM can have rather high-complexity update rules, especially on large, complex problems, so the goal is to lower the complexity of the update step by making it stochastic. The hard part seems to be controlling the step size.
• An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar): The full version of this paper is on ArXiV. The authors look at a multi-core lock-free stochastic coordinate descent method and characterize how many cores you need to get linear speedups — this depends on the convexity properties of the objective function.
• Communication-Efficient Distributed Optimization using an Approximate Newton-type Method (Ohad Shamir; Nati Srebro; Tong Zhang): This paper looked 1-shot “average at the end” schemes where you divide the data onto multiple machines, have them each train a linear predictor (for example) using stochastic optimization and then average the results. This is just averaging i.i.d. copies of some complicated random variable (the output of an optimization) so you would expect some variance reduction. This method has been studied by a few people int the last few years. While you do get variance reduction, the bias can still be bad. On the other extreme, communicating at every iteration essentially transmits the entire data set (or worse) over the network. They propose a new method for limiting communication by computing an approximate Newton step without approximating the full Hessian. It works pretty well.
• Lower Bounds for the Gibbs Sampler over Mixtures of Gaussians (Christopher Tosh; Sanjoy Dasgupta): This was a great talk about how MCMC can be really slow to converge. The model is a mixture of Gaussians with random weights (Dirichlet) and means (Gaussian I think). Since the posterior on the parameters is hard to compute, you might want to do Gibbs sampling. They use conductance methods to get a lower bound on the mixing time of the chain. The tricky part is that the cluster labels are permutation invariant — I don’t care if you label clusters (1,2) versus (2,1), so they need to construct some equivalence classes. They also have further results on what happens when the number of clusters is misspecified. I really liked this talk because MCMC always seems like black magic to me (and I even used it in a paper!)
• (Near) Dimension Independent Risk Bounds for Differentially Private Learning (Prateek Jain; Abhradeep Guha Thakurta): Abhradeep presented a really nice paper with a tighter analysis of output and objective perturbation methods for differentially private ERM, along with a new algorithm for risk minimization on the simplex. Abhradeep really only talked about the first part. If you focus on scalar regret, they show that essentially the error comes from taking the inner product of a noise vector with a data vector. If the noise is Gaussian then the noise level is dimension-independent for bounded data. This shows that taking $(\epsilon,\delta)$-differential privacy yield better sample complexity results than $(\epsilon,)$-differential privacy. This feels similar in flavor to a recent preprint on ArXiV by Beimel, Nissim, and Stemmer.
• Near-Optimally Teaching the Crowd to Classify (Adish Singla; Ilija Bogunovic; Gabor Bartok; Amin Karbasi; Andreas Krause): This was one of those talks where I would have to go back to look at the paper a bit more. The idea is that you want to train annotators to do better in a crowd system like Mechanical Turk — which examples should you give them to improve their performance? They model the learners as doing some multiplicative weights update. Under that model, the teacher has to optimize to pick a batch of examples to give to the learner. This is hard, so they use a submodular surrogate function and optimize over that.
• Discrete Chebyshev Classifiers (Elad Eban; Elad Mezuman; Amir Globerson): This was an award-winner. The setup is that you have categorical (not numerical) features on $n$ variables and you want to do some classification. They consider taking pairwise inputs and compute for each tuple $(x_i, x_j, y)$ a marginal $\mu_{ij}(x_i, x_j, y)$. If you want to create a rule $f: \mathcal{X} \to \mathcal{Y}$ for classification, you might want to pick one that has best worst-case performance. One approach is to take the one which has best worst-case performance over all joint distributions on all variables that agree with the empirical marginals. This optimization looks hard because of the exponential number of variables, but they in fact show via convex duality and LP relaxations that it can be solved efficiently. To which I say: wow! More details are in the paper, but the proofs seem to be waiting for a journal version.

Aalto University is a new university with over a century of experience. Created from a high-profile merger between three leading universities in Finland – the Helsinki School of Economics, Helsinki University of Technology and the University of Art and Design Helsinki – Aalto University opens up new possibilities for strong multidisciplinary education and research. The university has 20 000 students and a staff of 5,000 including 350 professors.

The stochastics research group at the Department of Mathematics and Systems Analysis is currently undergoing a period of regeneration, as new associate and assistant professors have been employed to replace previous long-term faculty, and several new young researchers are being recruited with the aim of significant growth. To strengthen this line of development, we are now seeking to hire a postdoctoral researcher with a PhD in mathematics or a related area.

The postdoctoral researcher will carry out research in collaboration with the stochastics research group, with a small amount of teaching duties included. The salary is competitive, based on experience and qualifications, and includes occupational health and a travel budget for international conferences and workshops. The position is for one year with a possible extension for another year, starting preferably in September 2014 and no later than January 2015.

Further information and instructions for applying:

Please feel free to forward this message to colleagues and potential candidates. The application deadline is 13 June 2014.

One of the things I’m always asked when giving a talk on differential privacy is “how should we interpret $\epsilon$?” There a lot of ways of answering this but one way that seems to make more sense to people who actually think about risk, hypothesis testing, and prediction error is through the “area under the curve” metric, or AUC. This post came out of a discussion from a talk I gave recently at Boston University, and I’d like to thank Clem Karl for the more detailed questioning.

Zoltán Szabó of the Gatsby Unit forwarded me a link to his Information Theoretical Estimators Toolbox, which has MATLAB-friendly estimators for standard information-theoretic quantities. It might be of interest to readers of the blog.

My cousin Supriya has started a blog, wading through soup, on green parenting and desi things. Her recent post, Pretty in Pink: Can Boys Wear Pink? made it to HuffPo.

Larry Wasserman is quitting blogging.

If you have a stomach for horrible things, here are some images from the Nauru immigration center, where hundreds of (mostly Iranian) asylum-seekers are kept by the Australian government (via mefi).

At Rutgers, I am going to be in a union. Recent grad student union actions have come under fire from peeved faculty at UChicago (a place with horrendous institutional politics if I have ever seen one). Corey Robin breaks it down.

I recently saw that Andrew Gelman hasn’t really heard of compressed sensing. As someone in the signal processing/machine learning/information theory crowd, it’s a little flabbergasting, but I think it highlights two things that aren’t really appreciated by the systems EE/algorithms crowd: 1) statistics is a pretty big field, and 2) the gulf between much statistical practice and what is being done in SP/ML research is pretty wide.

The other aspect of this is a comment from one of his readers:

Meh. They proved L1 approximates L0 when design matrix is basically full rank. Now all sparsity stuff is sometimes called ‘compressed sensing’. Most of it seems to be linear interpolation, rebranded.

I find such dismissals disheartening — there is a temptation to say that every time another community picks up some models/tools from your community that they are reinventing the wheel. As a short-hand, it can be useful to say “oh yeah, this compressed sensing stuff is like the old sparsity stuff.” However, as a dismissal it’s just being parochial — you have to actually engage with the use of those models/tools. Gelman says it can lead to “better understanding one’s assumptions and goals,” but I think it’s more important to “understand what others’ goals.”

I could characterize rate-distortion theory as just calculating some large deviations rate functions. Dembo and Zeitouni list RD as an application of the LDP, but I don’t think they mean “meh, it’s rebranded LDP.” For compressed sensing, the goal is to do the inference in a computationally and statistically efficient way. One key ingredient is optimization. If you just dismiss all of compressed sensing as “rebranded sparsity” you’re missing the point entirely.

I took a look at this interesting paper by Sriperumbudur et al., On the empirical estimation of integral probability metrics (Electronic Journal of Statistics Vol. 6 (2012) pp.1550–1599). The goal of the paper is to estimate a distance or divergence between two distributions $P$ and $Q$ based on samples from each distribution. This sounds pretty vague at first… what kind of distributions? How many samples? This paper looks at integral probability metrics, which have the form

$\gamma(P,Q) = \sup_{f \in \mathcal{F}} \left| \int_{S} f dP - \int_{S} f dQ \right|$

where $S$ is a measurable space on which $P$ and $Q$ are defined, and $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. This class doesn’t contain Csiszár $\phi$-divergences (also known as Ali-Silvey distances), but does contain the total variational distance.

Different choices of the function class give rise to different measures of difference used in so-called two-sample tests, such as the Kolmogorov-Smirnov test. The challenge in practically using these tests is that it’s hard to get bounds on how fast an estimator of $\gamma(P,Q)$ converges if we have to estimate it from samples of $P$ and $Q$. The main result of the paper is to provide estimators with consistency and convergence guarantees. In particular, they estimators are based on either linear programming or (in the case of kernel tests) in closed form.

The second section of the paper connects tests based on IPMs with the risk associated to classification rules for separating $P$ and $Q$ when the separation rule is restricted to come from the function class $\mathcal{F}$ associated to the rule. This is a nice interpretation of these two-sample tests — they are actually doing similar things for restricted classes of classifiers/estimators.

Getting back to KL divergence and non-IPM measures, since total variation gives a lower bound on the KL divergence, they also provide lower bounds on the total variation distance in terms of other IPM metrics. This is important since the total variation distance can’t be estimated itself in a strongly consistent way. This could be useful for algorithms which need to estimate the total variation distance for continuous data. In general, estimating distances between multivariate continuous distributions can become a bit of a mess when you have to use real data — doing a plug-in estimate using, e.g. a kernel density estimator is not always the best way to go, and instead attacking the distance you want to measure directly could yield better results.

I’m still catching up on my backlog of reading everything, but I’ve decided to set some time aside to take a look at a few papers from ArXiV.

• Lecture Notes on Free Probability by Vladislav Kargin, which is 100 pages of notes from a course at Stanford. Pretty self-explanatory, except for the part where I don’t really know free probability. Maybe reading these will help.
• Capturing the Drunk Robber on a Graph by Natasha Komarov and Peter Winkler. This is on a simple pursuit-evasion game in which the robber (evader) is moving according to a random walk. On a graph with $n$ vertices:

the drunk will be caught with probability one, even by a cop who oscillates on an edge, or moves about randomly; indeed, by any cop who isn’t actively trying to lose. The only issue is: how long does it take? The lazy cop will win in expected time at most $4 n^3/27$ (plus lower-order terms), since that is the maximum possible expected hitting time for a random walk on an n-vertex graph [2]; the same bound applies to the random cop [4]. It is easy to see that the greedy cop who merely moves toward the drunk at every step can achieve $O(n^2)$; in fact, we will show that the greedy cop cannot in general do better. Our smart cop, however, gets her man in expected time $n + o(n)$.

How do you make a smarter cop? In this model the cop can tell where the robber is but has to get there by walking along the graph. Strategies which try to constantly “retarget” are wasteful, so they propose a strategy wherein the cop periodically retargets to eventually meet the robber. I feel like there is a prediction/learning algorithm or idea embedded in here as well.

• Normalized online learning by Stephane Ross, Paul Mineiro, John Langford. Normalization and data pre-processing is the source of many errors and frustrations in machine learning practice. When features are not normalized with respect to each other, procedures like gradient descent can behave poorly. This paper looks at dealing with data normalization in the algorithm itself, making it “unit free” in a sense. It’s the same kind of weights-update rule that we see in online learning but with a few lines changed. They do an adversarial analysis of the algorithm where the adversary gets to scale the features before the learning algorithm gets the data point. In particular, the adversary gets to choose the covariance of the data.
• On the Optimality of Treating Interference as Noise, by Chunhua Geng, Navid Naderializadeh, A. Salman Avestimehr, and Syed A. Jafar. Suppose I have a $K$-user interference channel with gains $\alpha_{ij}$ between transmitter $i$ and receiver $j$. Then if
$\alpha_{ii} \ge \max_{j \ne i} \alpha_{ij} + \max_{k \ne i} \alpha_{ki}$
then treating interference as noise is optimal in terms of generalized degrees of freedom. I don’t really work on this kind of thing, but it’s so appealing from a sense of symmetry.
• Online Learning under Delayed Feedback, byPooria Joulani, András György, Csaba Szepesvári. This paper is on forecasting algorithms which receive the feedback (e.g. the error) with a delay. Since I’ve been interested in communication with delayed feedback, this seems like a natural learning analogue. They provide ways of modifying existing algorithms to work with delayed feedback — one such method is to run a bunch of predictors in parallel and update them as the feedback is returned. They also propose methods which use partial monitoring and an approach to UCB for bandit problems in the delayed feedback setting.

I came across this sally in the Bayesian/frequentist wars:

In general, the religious Bayesian states that no good and only harm can come from randomized experiments. In principle, he is opposed even to random sampling in opinion polling. However, this principle puts him in untenable computational positions, and a pragmatic Bayesian will often ignore what seems useless design information if there are no obvious quirks in a randomly selected sample.

– Herman Chernoff, Sequential Analysis and Optimal Design, Philadelphia : SIAM, 1972

This doesn’t seem to capture the current state of things, but the upshot here is that Chernoff is calling shenanigans on the “philosophical consistency” of Bayesian statistics.

Sometimes I wonder if what is needed is a Kinsey scale for statistical practice… can one be Bayes-curious?