I’m still catching up on my reading backlog, but I’ve decided to set aside some time to take a look at a few papers from the arXiv.

• Lecture Notes on Free Probability by Vladislav Kargin, which is 100 pages of notes from a course at Stanford. Pretty self-explanatory, except for the part where I don’t really know free probability. Maybe reading these will help.
• Capturing the Drunk Robber on a Graph by Natasha Komarov and Peter Winkler. This is on a simple pursuit-evasion game in which the robber (evader) is moving according to a random walk. On a graph with $n$ vertices:

the drunk will be caught with probability one, even by a cop who oscillates on an edge, or moves about randomly; indeed, by any cop who isn’t actively trying to lose. The only issue is: how long does it take? The lazy cop will win in expected time at most $4 n^3/27$ (plus lower-order terms), since that is the maximum possible expected hitting time for a random walk on an n-vertex graph [2]; the same bound applies to the random cop [4]. It is easy to see that the greedy cop who merely moves toward the drunk at every step can achieve $O(n^2)$; in fact, we will show that the greedy cop cannot in general do better. Our smart cop, however, gets her man in expected time $n + o(n)$.

How do you make a smarter cop? In this model the cop can tell where the robber is but has to get there by walking along the graph. Strategies which try to constantly “retarget” are wasteful, so they propose a strategy wherein the cop periodically retargets to eventually meet the robber. I feel like there is a prediction/learning algorithm or idea embedded in here as well.

• Normalized online learning by Stephane Ross, Paul Mineiro, John Langford. Normalization and data pre-processing are a source of many errors and frustrations in machine learning practice. When features are not normalized with respect to each other, procedures like gradient descent can behave poorly. This paper looks at handling data normalization inside the algorithm itself, making it “unit free” in a sense. It’s the same kind of weight-update rule that we see in online learning, but with a few lines changed. They do an adversarial analysis of the algorithm in which the adversary gets to scale the features before the learning algorithm sees the data point. In particular, the adversary gets to choose the covariance of the data.
• On the Optimality of Treating Interference as Noise, by Chunhua Geng, Navid Naderializadeh, A. Salman Avestimehr, and Syed A. Jafar. Suppose I have a $K$-user interference channel with gains $\alpha_{ij}$ between transmitter $i$ and receiver $j$. Then if
$\alpha_{ii} \ge \max_{j \ne i} \alpha_{ij} + \max_{k \ne i} \alpha_{ki}$
then treating interference as noise is optimal in terms of generalized degrees of freedom. I don’t really work on this kind of thing, but the symmetry of the condition makes it very appealing.
• Online Learning under Delayed Feedback, by Pooria Joulani, András György, Csaba Szepesvári. This paper is on forecasting algorithms which receive their feedback (e.g. the error) with a delay. Since I’ve been interested in communication with delayed feedback, this seems like a natural learning analogue. They provide ways of modifying existing algorithms to work with delayed feedback: one such method is to run a bunch of predictors in parallel and update each one as its feedback comes back (a rough sketch of that idea is below). They also propose methods which use partial monitoring, and an approach to UCB for bandit problems in the delayed feedback setting.
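As a concrete (and very rough) sketch of the parallel-copies idea — this is my own reconstruction, not the paper’s algorithm; the base learner (online gradient descent on squared loss) and the bounded-delay assumption are choices I made purely for illustration:

```python
import numpy as np

class OnlineGD:
    """Hypothetical base learner: online gradient descent for squared loss."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def predict(self, x):
        return self.w @ x

    def update(self, x, y):
        # gradient of (w.x - y)^2 / 2 with respect to w
        self.w -= self.lr * (self.w @ x - y) * x

class DelayedFeedbackWrapper:
    """Round-robin over (max_delay + 1) copies of the base learner, so each copy
    has already received its own feedback before it is asked to predict again."""
    def __init__(self, dim, max_delay, lr=0.1):
        self.copies = [OnlineGD(dim, lr) for _ in range(max_delay + 1)]
        self.t = 0
        self.owner = {}  # round -> index of the copy that predicted in that round

    def predict(self, x):
        idx = self.t % len(self.copies)
        self.owner[self.t] = idx
        self.t += 1
        return self.copies[idx].predict(x)

    def feedback(self, round_id, x, y):
        # delayed feedback for an earlier round is routed to the copy that predicted then
        self.copies[self.owner.pop(round_id)].update(x, y)
```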

I’ve started doing more machine learning research lately, which means I’ve been sullying my delicate theorist’s hands testing out my algorithms on data. Perhaps the most (over)used dataset is the MNIST handwritten digits collection, which has been put into MATLAB form by Sam Roweis (RIP). As a baseline, I wanted to see how an SVM would perform after I projected the data (using PCA) onto the top 100 dimensions. The primal program is

$\min_{\mathbf{w},b,\mathbf{z}} \frac{1}{2} \| \mathbf{w} \|_2^2 + C \sum_{i=1}^{n} z_i$
s.t. $y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - z_i, \quad z_i \ge 0, \quad i = 1, 2, \ldots, n.$

I chose some “reasonable” value for $C$ and trained a classifier for every pair of digits, getting the following error rates on the test set (in percentages, rounded).

     0      1      2      3      4      5      6      7      8      9
0    0
1    0      0
2    0.56   0.43   0
3    0.33   0.45   2.37   0
4    0.04   0.06   1.17   0.23   0
5    1.02   0.11   1.89   3.77   0.72   0
6    0.52   0      1.31   0.08   0.60   1.66   0
7    0.01   0.15   1.01   0.80   0.80   0.42   0      0
8    0.43   1.15   2.22   2.69   0.38   3.41   0.54   0.47   0
9    0.20   0.14   0.85   1.13   3.03   1.02   0      3.82   1.27   0

The rows and columns are the digits from 0 to 9, so for example the error for classifying 0 versus 1 was zero percent, while deciding between 9 and 7 incurs about 3.8 percent error. I did this to get a sense of which digits are “harder” for the SVM to distinguish, so that I could pick a good pair for experiments, or better yet, pick a pair based on a target error criterion. Running experiments on synthetic Gaussian examples is all fine and good, but it helps to have a range of data sets to test, for example, how resilient an algorithm is to more noise.
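For what it’s worth, here is roughly how I would reproduce the experiment above with scikit-learn instead of MATLAB (a sketch under my own assumptions; only the PCA dimension and the existence of some fixed value of $C$ come from the description above):

```python
import numpy as np
from itertools import combinations
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def pairwise_svm_test_errors(X_train, y_train, X_test, y_test, C=1.0, dim=100):
    """Project onto the top `dim` principal components, then train a linear SVM
    for every pair of digits and report the test error in percent."""
    pca = PCA(n_components=dim).fit(X_train)
    Xtr, Xte = pca.transform(X_train), pca.transform(X_test)
    errors = {}
    for a, b in combinations(range(10), 2):
        tr_mask = np.isin(y_train, [a, b])
        te_mask = np.isin(y_test, [a, b])
        clf = SVC(kernel="linear", C=C).fit(Xtr[tr_mask], y_train[tr_mask])
        err = np.mean(clf.predict(Xte[te_mask]) != y_test[te_mask])
        errors[(a, b)] = 100 * err
    return errors
```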

Assumptionless consistency of the Lasso
Sourav Chatterjee
The title says it all. Given $p$-dimensional data points $\{ \mathbf{x}_i : i \in [n] \}$, the Lasso tries to fit the model $\mathbb{E}( y_i | \mathbf{x}_i) = \boldsymbol{\beta}^T \mathbf{x}_i$ by minimizing the $\ell^1$-penalized squared error
$\sum_{i=1}^{n} (y_i - \boldsymbol{\beta}^T \mathbf{x}_i)^2 + \lambda \| \boldsymbol{\beta} \|_1$.
The paper analyzes the Lasso in the setting where the data are random: there are $n$ i.i.d. copies $\{(\mathbf{X}_i, Y_i) : i \in [n] \}$ of a pair of random variables $(\mathbf{X},Y)$. The assumptions on $(\mathbf{X},Y)$ are: (1) each coordinate is bounded, $|X_j| \le M$; (2) $Y = (\boldsymbol{\beta}^*)^T \mathbf{X} + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0,\sigma^2)$, where $\boldsymbol{\beta}^*$ and $\sigma$ are unknown. Basically that’s all that’s needed: given a bound on $\|\boldsymbol{\beta}\|_1$, he derives a bound on the mean-squared prediction error.
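A toy instance of this setting (my own, just to fix the notation; the sparsity pattern and the penalty level are arbitrary choices). Note that scikit-learn’s Lasso scales the squared error by $1/(2n)$, so its alpha corresponds to $\lambda/(2n)$ in the notation above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, sigma = 200, 500, 0.5
X = rng.uniform(-1.0, 1.0, size=(n, p))        # coordinates bounded by M = 1
beta_star = np.zeros(p)
beta_star[:5] = 1.0                             # unknown true coefficient vector
y = X @ beta_star + sigma * rng.standard_normal(n)

fit = Lasso(alpha=0.05, fit_intercept=False).fit(X, y)

# mean-squared prediction error, the quantity the paper bounds
mse_pred = np.mean((X @ fit.coef_ - X @ beta_star) ** 2)
```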

On Learnability, Complexity and Stability
Silvia Villa, Lorenzo Rosasco, Tomaso Poggio
This is a handy survey on the three topics in the title. It’s only 10 pages long, so it’s a nice fast read.

Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression
Francis Bach
A central challenge in stochastic optimization is understanding when the convergence rate of the excess loss, which is usually $O(1/\sqrt{n})$, can be improved to $O(1/n)$. Most often this requires additional assumptions on the loss functions (which can sometimes get a bit baroque and hard to check). This paper considers constant-step-size algorithms, but tracks the averaged iterate $\bar{\theta}_n = \frac{1}{n} \sum_{k=0}^{n-1} \theta_k$. I’m still trying to slot this in with other things I know about stochastic optimization, but it’s definitely worth a skim if you’re interested in the topic.
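To fix ideas, here is a bare-bones version of constant-step-size SGD with iterate averaging for logistic regression (my own sketch, not Bach’s algorithm or analysis; the step size and the single pass over the data are arbitrary choices):

```python
import numpy as np

def averaged_sgd_logistic(X, y, step=0.1, passes=1, seed=0):
    """y has entries in {-1, +1}; returns the average of the SGD iterates."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    t = 0
    for _ in range(passes):
        for i in rng.permutation(n):
            # gradient of log(1 + exp(-y_i <theta, x_i>)) with respect to theta
            margin = y[i] * (X[i] @ theta)
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))
            theta -= step * grad                  # constant step size, no decay
            t += 1
            theta_bar += (theta - theta_bar) / t  # running average of the iterates
    return theta_bar
```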

On Differentially Private Filtering for Event Streams
Jerome Le Ny
Jerome Le Ny has been putting differential privacy into signal processing and control contexts for the past year, and this is another paper in that line of work. This is important because we’re still trying to understand how time-series data can be handled in the differential privacy setting. This paper looks at “event streams” which are discrete-valued continuous-time signals (think of count processes), and the problem is to design a differentially private filtering system for such signals.

Gossips and Prejudices: Ergodic Randomized Dynamics in Social Networks
Paolo Frasca, Chiara Ravazzi, Roberto Tempo, Hideaki Ishii
This appears to be a gossip version of Acemoglu et al.’s work on “stubborn” agents in the consensus setting. They show similar qualitative behavior — opinions fluctuate but their average over time converges (the process is ergodic). This version of the paper has more of a tutorial feel to it, so the results are a bit easier to parse.

Last week I was reading Active Learning via Perfect Selective Classification by El-Yaniv and Wiener, and came across a neat result due to Hug and Reitzner that they use in some of their bounds for active learning on Gaussian distributions.

The setup is the following: let $X_1, X_2, \ldots, X_n$ be $n$ i.i.d. Gaussian vectors with distribution $\mathcal{N}(0,I_d)$ in $\mathbb{R}^d$. The convex hull $P_n$ of these points is called a Gaussian polytope. This is a random polytope, of course, and we can ask various things about its shape: what is the distribution of the number of vertices, or of the number of $k$-faces? Let $f_k(P_n)$ be the number of $k$-faces. Distributions are hard to get at, but for general $k$ the expected number of $k$-faces (as $n \to \infty$) is given by

$\mathbb{E}[ f_k(P_n)] = \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1}(\pi \ln n)^{(d-1)/2} (1 + o(1))$,

where $\beta_{k,d-1}$ is the internal angle of a regular $(d-1)$-simplex at one of its $k$-dimensional faces. What Hug and Reitzner show is a bound on the variance (which El-Yaniv and Wiener then use in a Chebyshev bound): there exists a constant $c_d$ such that

$\mathrm{Var}( f_k(P_n) ) \le c_d (\ln n)^{(d-1)/2}$

So the variance of the number of $k$-faces can be upper bounded by something that does not depend at all on the actual value of $k$. In fact, they show that

$f_k(P_n) (\ln n)^{-(d-1)/2} \to \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1} \pi^{(d-1)/2}$

in probability as $n \to \infty$. That is, appropriately normalized, the number of faces converges to a constant.

To me this result was initially surprising, but after some more thought it makes a bit more sense. A $k$-face is determined by $k+1$ of the points, so there are $\binom{n}{k+1}$ candidate sets; the formula for the mean says that only about $\frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1}(\pi \ln n)^{(d-1)/2}$ of them actually end up forming a $k$-face of the polytope. This also explains why the simplex-related quantity appears: points that are on “opposite sides” of the sphere (the level sets of the density) are not going to form a face together. As $n \to \infty$ this count grows, but the rate of growth in the number of faces is $(\ln n)^{(d-1)/2}$, regardless of $k$.
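Out of curiosity, this is easy to poke at numerically (my own quick check, not from any of the papers above): sample Gaussian points, build the convex hull with scipy, and watch the vertex and facet counts grow like $(\ln n)^{(d-1)/2}$.

```python
import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
d = 3
for n in [100, 1_000, 10_000, 100_000]:
    pts = rng.standard_normal((n, d))       # n i.i.d. N(0, I_d) points
    hull = ConvexHull(pts)
    f0 = len(hull.vertices)                 # number of vertices (0-faces)
    f2 = len(hull.simplices)                # number of facets ((d-1)-faces)
    scale = np.log(n) ** ((d - 1) / 2)      # predicted growth rate
    print(f"n={n:>6}  vertices={f0:>4}  facets={f2:>4}  vertices/scale={f0 / scale:.2f}")
```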

I’m not sure where this result will be useful for me (yet!) but it seemed like something that the technically-minded readers of the blog would find interesting as well.

Venkatesh Saligrama sent out a call for an ICML workshop he is organizing:

I wanted to bring to your attention an ICML workshop on “Machine Learning with Test-Time Budgets” that I am helping organize. The workshop will be held during the ICML week. The workshop will feature presentations both from data-driven as well as model-based perspectives and will feature researchers from machine learning and control/decision theory.

We are accepting papers related to these topics. Please let me know if you have questions about the workshop or wish to submit a paper.

I wanted to write a few words about the workshop at the Bellairs Research Institute. I just returned from sunny Barbados to frigid Chicago, so writing this will help me remember the sunshine and sand:

[Photo: the beach at Bathsheba on the east coast of Barbados]

Mike Rabbat put on a great program this year, and there were lots of talks on a range of topics in machine learning, signal processing, and optimization. The format of the workshop was to have talks with lots of room for questions and discussion. Talks were given out on the balcony where we were staying, and we had to end at about 2:30 because the sunshine would creep into our conference area, baking those of us sitting too far west.

I think this is the end of my ITA blogging! But there were some issues that came up during the conference that may be of interest to some of the readers of this blog (although from anecdotal reports, there are many people who read but never comment, so I’m not sure what to do to encourage more discussions).

I promised some ITA blogging, so here it is. Maybe Alex will blog a bit too. These notes will by necessity be cursory, but I hope some people will find some of these papers interesting enough to follow up on them.

A Reverse Pinsker Inequality
Daniel Berend, Peter Harremoës, Aryeh Kontorovich
Aryeh gave this talk about what can be said in the reverse direction of Pinsker’s inequality. Of course, in general you can’t say much, but what they do is expand the KL divergence in terms of the total variation distance, with coefficients involving the balance coefficient of the distribution, $\beta = \inf \{ P(A) : P(A) \ge 1/2 \}$.
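For reference, the inequality being reversed here is Pinsker’s inequality (a standard fact, not part of this paper’s contribution): with the KL divergence measured in nats and the total variation distance $\delta(P,Q) = \sup_A |P(A) - Q(A)|$,

$D(P \| Q) \ge 2\, \delta(P,Q)^2$.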

Unfolding the entropy power inequality
Mokshay gave a talk on the entropy power inequality. Given independent vector random variables $X_1$ and $X_2$, we know that $h(X_1 + X_2) \ge h(Z_1 + Z_2)$, where $Z_1$ and $Z_2$ are independent isotropic Gaussian vectors with the same differential entropies as $X_1$ and $X_2$. The question in this paper is this: can we insert a term between the two sides of this inequality? The answer is yes! They define a spherical rearrangement of the densities of $X_1$ and $X_2$ into variables $X_1^{\ast}$ and $X_2^{\ast}$ with spherically symmetric decreasing densities, and show that the differential entropy of their sum lies between the two sides of the regular EPI.
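For context, the classical EPI for independent $X_1, X_2 \in \mathbb{R}^d$ is usually written as (standard form, included only for reference, not the new result of the talk)

$e^{2 h(X_1 + X_2)/d} \ge e^{2 h(X_1)/d} + e^{2 h(X_2)/d}$,

which is equivalent to the $h(X_1 + X_2) \ge h(Z_1 + Z_2)$ formulation above.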

Improved lower bounds on the total variation distance and relative entropy for the Poisson approximation
Igal Sason
The lower bounds being improved here were based on the Chen-Stein method; the paper strengthens them by sharpening the analysis within that method.

Fundamental limits of caching
This talk was on tradeoffs in caching. If there are $N$ files, $K$ users, and a cache of size $M$ at each user, how should the users cache files so as to let a broadcaster serve everyone with as little bandwidth as possible? More simply, suppose there are three people who each may want to watch one of three different TV shows, and each can buffer the content of one show. Since a priori you don’t know which show each person will want, a natural idea is to buffer/cache the first third of each show at each user. They show that this is highly suboptimal. Because the content provider can XOR together parts of the content destined for different users, the caching strategy should not be the same at each user, and the real benefit comes from the global cache size.
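Here is a toy version of the XOR trick for two users and two files (my own illustration of the idea, not the construction from the talk):

```python
# Two files A and B, each split in half. User 1 caches (A1, B1), user 2 caches (A2, B2).
# If user 1 requests A and user 2 requests B, one broadcast of A2 XOR B1 serves both,
# instead of sending A2 and B1 separately.
A1, A2 = b"A-part1", b"A-part2"
B1, B2 = b"B-part1", b"B-part2"

def xor(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

broadcast = xor(A2, B1)            # a single coded transmission
assert xor(broadcast, B1) == A2    # user 1 cancels B1 from its cache to recover A2
assert xor(broadcast, A2) == B1    # user 2 cancels A2 from its cache to recover B1
```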

Simple outer bounds for multiterminal source coding
This was a very cute result on using the HGR maximal correlation to get outer bounds for multiterminal source coding without first deriving a single-letterization of the outer bound. The main idea is to use two properties of the HGR maximal correlation: it tensorizes (which takes care of the multi-letter part), and it satisfies the strong data processing inequality from Erkip and Cover’s paper.

The City of Chicago has a big open data initiative, and they are putting data online at the City of Chicago Data Portal. Lots of interesting stuff here, and some potential to get data sets for machine learning tasks.