I’m still catching up on my reading backlog, but I’ve decided to set aside some time to take a look at a few papers from the ArXiV.

• Lecture Notes on Free Probability by Vladislav Kargin, which is 100 pages of notes from a course at Stanford. Pretty self-explanatory, except for the part where I don’t really know free probability. Maybe reading these will help.
• Capturing the Drunk Robber on a Graph by Natasha Komarov and Peter Winkler. This is on a simple pursuit-evasion game in which the robber (evader) is moving according to a random walk. On a graph with $n$ vertices:

the drunk will be caught with probability one, even by a cop who oscillates on an edge, or moves about randomly; indeed, by any cop who isn’t actively trying to lose. The only issue is: how long does it take? The lazy cop will win in expected time at most $4 n^3/27$ (plus lower-order terms), since that is the maximum possible expected hitting time for a random walk on an n-vertex graph [2]; the same bound applies to the random cop [4]. It is easy to see that the greedy cop who merely moves toward the drunk at every step can achieve $O(n^2)$; in fact, we will show that the greedy cop cannot in general do better. Our smart cop, however, gets her man in expected time $n + o(n)$.

How do you make a smarter cop? In this model the cop can tell where the robber is but has to get there by walking along the graph. Strategies which try to constantly “retarget” are wasteful, so they propose a strategy wherein the cop periodically retargets to eventually meet the robber. I feel like there is a prediction/learning algorithm or idea embedded in here as well.

• Normalized online learning by Stephane Ross, Paul Mineiro, John Langford. Normalization and data pre-processing are a source of many errors and frustrations in machine learning practice. When features are not normalized with respect to each other, procedures like gradient descent can behave poorly. This paper looks at handling data normalization in the algorithm itself, making it “unit free” in a sense. It’s the same kind of weight-update rule that we see in online learning but with a few lines changed. They do an adversarial analysis of the algorithm in which the adversary gets to scale the features before the learning algorithm sees the data point. In particular, the adversary gets to choose the covariance of the data.
• On the Optimality of Treating Interference as Noise, by Chunhua Geng, Navid Naderializadeh, A. Salman Avestimehr, and Syed A. Jafar. Suppose I have a $K$-user interference channel with gains $\alpha_{ij}$ between transmitter $i$ and receiver $j$. Then if
$\alpha_{ii} \ge \max_{j \ne i} \alpha_{ij} + \max_{k \ne i} \alpha_{ki}$
then treating interference as noise is optimal in terms of generalized degrees of freedom. I don’t really work on this kind of thing, but it’s so appealing from a sense of symmetry.
• Online Learning under Delayed Feedback, by Pooria Joulani, András György, Csaba Szepesvári. This paper is on forecasting algorithms which receive their feedback (e.g., the error) with a delay. Since I’ve been interested in communication with delayed feedback, this seems like a natural learning analogue. They provide ways of modifying existing algorithms to work with delayed feedback — one such method is to run a bunch of predictors in parallel and update each one as its feedback is returned. They also propose methods which use partial monitoring and an approach to UCB for bandit problems in the delayed-feedback setting.
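The parallel-copies trick is simple enough to sketch. Below is a minimal, hypothetical illustration of the reduction as I understand it (not the authors’ code): `make_learner` is a stand-in factory for any online learner with `predict(x)` and `update(x, y)` methods. Each prediction is served by a copy that is not waiting on feedback, and when a delayed label arrives only the copy that made that prediction gets updated.

```python
# Hypothetical sketch of the "run several learner copies in parallel" reduction
# for delayed feedback; make_learner is a stand-in for any online learner
# exposing predict(x) and update(x, y).
class DelayedFeedbackWrapper:
    def __init__(self, make_learner):
        self.make_learner = make_learner
        self.free = []        # copies not currently waiting on feedback
        self.pending = {}     # prediction id -> (copy that predicted, features)
        self.next_id = 0

    def predict(self, x):
        learner = self.free.pop() if self.free else self.make_learner()
        pid = self.next_id
        self.next_id += 1
        self.pending[pid] = (learner, x)
        return pid, learner.predict(x)

    def feedback(self, pid, y):
        learner, x = self.pending.pop(pid)
        learner.update(x, y)          # only the copy that made this prediction learns
        self.free.append(learner)     # it is now available for future predictions
```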

Persi Diaconis gave the second annual Billingsley Lecture at UChicago yesterday on the topic of coincidences and what a skeptical statistician/probabilist should say about them. He started out by talking about how Jung was fascinated by coincidences (apparently there’s one about fish coming up all the time in conversation).

It was mostly a general-audience talk (with some asides about Poisson approximation), and the first part was on the birthday problem and its variants. Abstracted away, the question is: given $n$ balls (people) and $C$ bins/categories (days), how big should $n$ be so that there’s an even chance that two balls land in the same bin? It turns out $n \approx 1.2 \sqrt{C}$, as we know, but we can extend this to deal with approximate matches (you need only 7 people to get two birthdays within a week of each other with probability around 1/2). If you want to put a graph on it you can ask social-network coincidence questions and get some scalings as a function of the number of edges and the number of categories — here there are $n$ vertices and $C$ colors for the vertices. What these calculations show, of course, is that most coincidences are not so surprising, at least in this probabilistic sense. A more advanced treatment can be found in Sukhada Fadnavis’s preprint (which also has something about a “shameful conjecture” on chromatic polynomials that was proved in 2000, but I don’t know why it is shameful).

The second part of the talk was on problems arising in the study of ESP — namely that experimental controls are not really present, so the notion of a “trial” is hard to pin down, leading (of course) to more events being falsely perceived as surprising coincidences. He closed with some remarks about how our perception of coincidence is really about how our minds work, and pointed to some work by Ruma Falk for those interested in that angle of things.
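Just for fun, here is a quick numerical check of the $n \approx 1.2\sqrt{C}$ rule of thumb from the birthday-problem part of the talk (a throwaway sketch, nothing more):

```python
import math

def collision_prob(n, C):
    """P(at least two of n balls land in the same one of C equally likely bins)."""
    p_distinct = 1.0
    for i in range(n):
        p_distinct *= (C - i) / C
    return 1.0 - p_distinct

C = 365
n_star = next(n for n in range(1, C + 1) if collision_prob(n, C) >= 0.5)
print(n_star, 1.2 * math.sqrt(C))   # 23 people vs. the rule of thumb's ~22.9
```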

I was unaware of this body of Diaconis’s work, and it was nice to have a high-level talk to cap off the day.

Sudeep Kamath sent me a note about a recent result he posted on the ArXiV that relates to an earlier post of mine on the HGR maximal correlation and an inequality by Erkip and Cover for Markov chains $U -- X -- Y$ which I had found interesting:
$I(U ; Y) \le \rho_m(X,Y)^2 I(U ; X)$.
Since learning about this inequality, I’ve seen a few talks which have used the inequality in their proofs, at Allerton in 2011 and at ITA this year. Unfortunately, the stated inequality is not correct!

On Maximal Correlation, Hypercontractivity, and the Data Processing Inequality studied by Erkip and Cover
Venkat Anantharam, Amin Gohari, Sudeep Kamath, Chandra Nair

What this paper shows is that the inequality does not hold with $\rho_m(X,Y)^2$, but does hold with another quantity:
$I(U ; Y) \le s^*(X;Y) I(U ; X)$
where $s^*(X;Y)$ is given by the following definition.

Let $X$ and $Y$ be random variables with joint distribution $(X, Y) \sim p(x, y)$. We define
$s^*(X;Y) = \sup_{r(x) \ne p(x)} \frac{ D( r(y) \| p(y) ) }{ D( r(x) \| p(x) ) }$,
where $r(y)$ denotes the $y$-marginal distribution of $r(x, y) := r(x)p(y|x)$ and the supremum on the right hand side is over all probability distributions $r(x)$ that are different from the probability distribution $p(x)$. If either $X$ or $Y$ is a constant, we define $s^*(X; Y)$ to be 0.

Suppose $(X,Y)$ have joint distribution $P_{XY}$ (I know I am changing notation here but it’s easier to explain). The key to showing their result is through deriving variational characterizations of $\rho_m$ and $s^*$ in terms of the function
$t_{\lambda}( P_X ) := H( P_Y ) - \lambda H( P_X )$
Rather than getting into that in the blog post, I recommend reading the paper.

The implication of this result is that the inequality of Erkip and Cover is not correct: not only is $\rho_m(X,Y)^2$ not the supremum of the ratio, there are distributions for which it is not an upper bound. The counterexample in the paper is the following: $X \sim \mathsf{Bernoulli}(1/2)$, and $Y$ is generated via this asymmetric erasure channel:

Joint distribution counterexample (Fig. 2 of the paper)

How can we calculate $\rho_m(X,Y)$? If either $X$ or $Y$ is binary-valued, then
$\rho_m(X,Y)^2 = -1 + \sum_{x,y} \frac{ p(x,y)^2 }{ p(x) p(y) }$
So for this example $\rho_m(X,Y)^2 = 0.6$. However, $s^*( X,Y) = \frac{1}{2} \log_2(12/5) > 0.6$, and there exists a sequence of variables $U_i$ satisfying the Markov chain $U_i -- X -- Y$ such that the ratio approaches $s^*$.
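If you want to play with examples yourself, the formula above is easy to code up. The sketch below uses made-up erasure probabilities (the actual values for the counterexample are in Fig. 2 of the paper), and the formula is only valid as $\rho_m^2$ when $X$ or $Y$ is binary:

```python
import numpy as np

def rho_m_squared(P):
    """-1 + sum_{x,y} p(x,y)^2 / (p(x) p(y)); equals rho_m(X,Y)^2 when X or Y is binary."""
    px, py = P.sum(axis=1), P.sum(axis=0)
    mask = P > 0
    return -1.0 + np.sum(P[mask] ** 2 / np.outer(px, py)[mask])

# X ~ Bernoulli(1/2) through an asymmetric erasure channel, Y in {0, e, 1}.
# The erasure probabilities a, b below are placeholders, not the paper's Fig. 2 values.
a, b = 0.3, 0.7
P = 0.5 * np.array([[1 - a, a, 0.0],
                    [0.0,   b, 1 - b]])
print(rho_m_squared(P))
```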

So where is the error in the original proof? Anantharam et al. point out that the Taylor series expansion used in the proof of the inequality with $\rho_m(X,Y)^2$ may not be valid at all points.

This seems to just be the start of a longer story, which I look forward to reading in the future!

Last week I was reading Active Learning via Perfect Selective Classification by El-Yaniv and Wiener, and came across a neat result due to Hug and Reitzner that they use in some of their bounds for active learning on Gaussian distributions.

The setup is the following: let $X_1, X_2, \ldots, X_n$ be $n$ i.i.d. Gaussian vectors with distribution $\mathcal{N}(0,I_d)$ in $\mathbb{R}^d$. The convex hull $P_n$ of these points is called a Gaussian polytope. This is a random polytope, of course, and we can ask various things about its shape: what is the distribution of the number of vertices, or the number of $k$-faces? Let $f_k(P_n)$ be the number of $k$-faces. Distributions are hard, but for general $k$ the expected number of faces (as $n \to \infty$) is given by

$\mathbb{E}[ f_k(P_n)] = \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1}(\pi \ln n)^{(d-1)/2} (1 + o(1))$,

where $\beta_{k,d-1}$ is the internal angle of a regular $(d-1)$-simplex at one of its $k$-dimensional faces. What Hug and Reitzner show is a bound on the variance (which El-Yaniv and Wiener then use in a Chebyshev bound): there exists a constant $c_d$ such that

$\mathrm{Var}( f_k(P_n) ) \le c_d (\ln n)^{(d-1)/2}$

So the variance of the number of $k$-faces can be upper bounded by something that does not depend at all on the actual value of $k$. In fact, they show that

$f_k(P_n) (\ln n)^{-(d-1)/2} \to \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1} \pi^{(d-1)/2}$

in probability as $n \to \infty$. That is, appropriately normalized, the number of faces converges to a constant.

To me this result was initially surprising, but after some more thought it makes a bit more sense. If you give me a cloud of Gaussian points, then I need $k+1$ points to define a $k$-face. The formula for the mean says that if I look at sets of $k+1$ points, only about $\frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1}(\pi \ln n)^{(d-1)/2}$ of them will form an actual $k$-face of the polytope. This also explains why the simplex-related quantity appears — points that are on “opposite sides” of the sphere (the level sets of the density) are not going to form a face together. As $n \to \infty$ this number grows, but effectively the rate of growth in the number of faces with $n$ is $(\ln n)^{(d-1)/2}$, regardless of $k$.
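Since the formula is so concrete, it’s easy to sanity-check numerically in the plane, where the hull vertices are the interesting faces ($k = 0$) and $\beta_{0,1} = 1/2$ is the internal angle of a segment at an endpoint. This is just a rough Monte Carlo sketch; the $o(1)$ terms mean the agreement is only approximate:

```python
import numpy as np
from scipy.spatial import ConvexHull

d, n, trials = 2, 20000, 50
counts = [len(ConvexHull(np.random.randn(n, d)).vertices) for _ in range(trials)]

# E[f_0(P_n)] ~ (2^d / sqrt(d)) * binom(d, 1) * beta_{0,d-1} * (pi * ln n)^{(d-1)/2}
beta_01 = 0.5
predicted = (2 ** d / np.sqrt(d)) * 2 * beta_01 * (np.pi * np.log(n)) ** ((d - 1) / 2)
# the leading-order prediction is about 15.8 here; the empirical mean lags it a bit
print(np.mean(counts), predicted)
```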

I’m not sure where this result will be useful for me (yet!) but it seemed like something that the technically-minded readers of the blog would find interesting as well.

On Lalitha’s recommendation I read Frank Nielsen’s paper “Cramer-Rao Lower Bound and Information Geometry,” which is a survey of how C.R. Rao’s work has impacted information geometry. I remember spending some time in grad school trying to learn information geometry (mostly for fun), but since it ended up not being particularly useful in my research, I’m afraid a lot of it has leaked out of my ears. This paper has a short introduction to the Cramer-Rao lower bound and an introduction to information geometry which might be a nice read for some of the readers of this blog. It’s certainly faster than trying to read Amari’s monograph! In particular, it goes over the “highlights” of geodesics and other geometric features on the manifold of probability distributions.

The paper mentions the sub-family of f-divergences known as $\alpha$-divergences, which are given by

$D_{\alpha}(p \| q) = \frac{4}{1 - \alpha^2} \left( 1 - \int p(x)^{(1 - \alpha)/2} q(x)^{(1 + \alpha)/2} dx \right)$

The KL divergence is $D_{-1}(p \| q)$ — you have to take the limit as $\alpha \to -1$. Within this family of divergences we have the relation $D_{\alpha}(p \| q) = D_{-\alpha}(q \| p)$. Consider a pair of random variables $(X,Y)$ with joint distribution $P_{XY}$ and marginal distributions $P_X$ and $P_Y$. If we take $q = P_X P_Y$ and $p = P_{XY}$ then the mutual information is $D_{-1}( p \| q )$. But we can also take

$D_{-1}( P_{X} P_{Y} \| P_{XY}) = D_1( P_{XY} \| P_{X} P_{Y} )$

Thus it turns out that the “lautum information” defined by Palomar and Verdú is a special case of this: it’s the 1-divergence between the joint distribution and the product of the marginals. While their paper mentions that the lautum information is an f-divergence, it doesn’t discuss the connection to this family of divergences. Nielsen’s paper calls this the “reverse Kullback-Leibler divergence,” but some googling doesn’t seem to indicate that this is a common term, or indeed whether it has some use in information geometry. Palomar and Verdú give several operational interpretations of the lautum information.
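Here is a small sketch (with a made-up joint pmf) that checks these identities numerically: taking $\alpha$ near $-1$ recovers the mutual information $D_{-1}(P_{XY} \| P_X P_Y)$, and taking $\alpha$ near $+1$ recovers the lautum information $D(P_X P_Y \| P_{XY})$:

```python
import numpy as np

def alpha_div(p, q, alpha):
    """Amari alpha-divergence between discrete pmfs (for alpha != +/-1)."""
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - np.sum(p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2)))

def kl(p, q):
    return np.sum(p * np.log(p / q))

# made-up joint distribution of (X, Y), just for illustration
Pxy = np.array([[0.3, 0.2],
                [0.1, 0.4]])
p = Pxy.flatten()                                          # joint distribution
q = np.outer(Pxy.sum(axis=1), Pxy.sum(axis=0)).flatten()   # product of marginals

print(kl(p, q), alpha_div(p, q, -0.9999))   # mutual information I(X;Y) = D_{-1}(p||q)
print(kl(q, p), alpha_div(p, q, +0.9999))   # lautum information L(X;Y) = D_{+1}(p||q)
```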

I have a subscription to Harper’s and try to solve the cryptic crossword each month in the vain hope that I will win a free year’s subscription. The puzzles back to 1976 have been posted online.

My friend (and ex-fellow actor) Stephen Larson’s project OpenWorm was written up in Wired UK.

A distribution that appears frequently in differential privacy is the Laplace distribution. While in the scalar case we have seen that Laplace noise may not be the best, it’s still the easiest example to start with. Suppose we have $n$ scalars $x_i \in [0,1]$ and we want to compute the average $\frac{1}{n} \sum_{i} x_i$ in a differentially private way. One way to do this is to release $Y = \frac{1}{n} \sum_{i} x_i + Z$, where $Z$ has a Laplace distribution:

$p(z) = \frac{\lambda}{2} \exp( - \lambda |z| )$.

To see that this is differentially private, note that by changing one value of $x_i$ the average can change by at most $\pm \frac{1}{n}$. Let $\bar{x}$ and $\bar{x}'$ be the averages of the original data and of the data with one element changed. The output in these two cases has density $p( y - \bar{x})$ and $p(y - \bar{x}')$, respectively, so for any output value $y$,

$\frac{ p(y - \bar{x}) }{ p(y - \bar{x}') } \le \exp( \lambda ( |y - \bar{x}'| - |y - \bar{x}| ) ) \le \exp( \lambda |\bar{x} - \bar{x}'| ) \le \exp( \lambda/n )$

So we can see that by choosing $\lambda = n \epsilon$ we get an $\epsilon$-differentially private approximation to the average.
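As a quick sketch (using numpy’s parametrization, where the scale is $1/\lambda$), the mechanism is a one-liner; the data and $\epsilon$ below are just for illustration:

```python
import numpy as np

def private_mean(x, eps):
    """eps-differentially private mean of scalars in [0,1] via the Laplace mechanism.

    The density p(z) = (lambda/2) exp(-lambda |z|) with lambda = n * eps is a
    Laplace distribution with numpy scale parameter 1 / (n * eps).
    """
    n = len(x)
    return np.mean(x) + np.random.laplace(scale=1.0 / (n * eps))

x = np.random.rand(1000)                # toy data in [0,1]
print(np.mean(x), private_mean(x, eps=0.5))
```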

What if we now have $n$ vectors $\mathbf{x}_i \in [0,1]^d$? Well, one candidate is to release a differentially private version of the mean by computing $\mathbf{Y} = \frac{1}{n} \sum_{i} \mathbf{x}_i + \mathbf{Z}$, where $\mathbf{Z}$ has a distribution that looks Laplace-like but in higher dimensions:

$p(\mathbf{z}) \propto \exp( - \lambda \| \mathbf{z} \| )$

Now we can do the same calculation with means $\bar{\mathbf{x}}$ and $\bar{\mathbf{x}}'$:

$\frac{ p(\mathbf{y} - \bar{\mathbf{x}}) }{ p(\mathbf{y} - \bar{\mathbf{x}}') } \le \exp( \lambda ( \|\mathbf{y} - \bar{\mathbf{x}}'\| - \|\mathbf{y} - \bar{\mathbf{x}}\| ) ) \le \exp( \lambda \|\bar{\mathbf{x}} - \bar{\mathbf{x}}'\| )$

Now the Euclidean norm of the average vector can change by at most $\sqrt{d}/n$ (by replacing $\mathbf{x}_i = \mathbf{0}$ with $\mathbf{x}_i' = \mathbf{1}$, for example), so the overall bound is $\exp(\lambda \sqrt{d}/n)$, which means that choosing $\lambda = n \epsilon / \sqrt{d}$ gives $\epsilon$-differential privacy.

Sampling from exponentials is easy, but what about sampling from this distribution? Here’s where people can fall into a trap because they are not careful about transformations of random variables. It’s tempting (if you are rusty on your probability) to say that

$p(\mathbf{z}) = C(\lambda) \exp( - \lambda \| \mathbf{z} \| )$

and then say “well, the norm looks exponentially distributed and the direction is isotropic so we can just sample the norm with an exponential distribution and the uniform direction by taking i.i.d. Gaussians and normalizing them.” But that’s totally wrong because that is implicitly doing a change of variables without properly adjusting the density function. The correct thing to do is to change from Euclidean coordinates to spherical coordinates. We have a map $T : (z_1, z_2, \ldots, z_d) \to (r, \phi_1, \phi_2, \ldots, \phi_{d-1})$ whose Jacobian is

$J(r, \phi_1, \phi_2, \ldots, \phi_{d-1}) = r^{d-1} \sin^{d-2}(\phi_1) \cdots \sin(\phi_{d-2})$.

Plugging this in and noting that $r = \|\mathbf{z}\|$ we get

$p(r, \phi_1, \phi_2, \ldots, \phi_{d-1}) = C'(\lambda,\phi_1,\ldots, \phi_{d-1}) \exp( - \lambda r )$.

So now we can see that the distribution factorizes and indeed the radius and direction are independent. The radius is not exponentially distributed; it’s Erlang with parameters $(d,\lambda)$. We can generate it by taking the sum of $d$ exponential variables with parameter $\lambda$. The direction we can pick uniformly by sampling $d$ i.i.d. Gaussians and normalizing them.
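Putting the pieces together, here is a small sketch of the correct sampler (radius drawn as a Gamma/Erlang variable with shape $d$ and scale $1/\lambda$, direction from normalized Gaussians); the parameters at the end just tie it back to the vector-mean example above:

```python
import numpy as np

def sample_exp_norm(d, lam, size=1):
    """Draw samples from p(z) proportional to exp(-lam * ||z||_2) on R^d.

    Radius: Erlang(d, lam), i.e. a Gamma with shape d and scale 1/lam
    (equivalently, a sum of d independent Exp(lam) variables).
    Direction: uniform on the sphere via normalized i.i.d. Gaussians.
    """
    r = np.random.gamma(shape=d, scale=1.0 / lam, size=(size, 1))
    g = np.random.randn(size, d)
    u = g / np.linalg.norm(g, axis=1, keepdims=True)
    return r * u

# Noise for the eps-differentially private vector mean discussed above:
n, d, eps = 1000, 5, 0.5
Z = sample_exp_norm(d, lam=n * eps / np.sqrt(d), size=1)[0]
```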

In general sampling distributions for differentially private mechanisms can be complicated — for example in our work on PCA we had to use an MCMC procedure in our experiments to sample from the distribution in our algorithm. This means we could really only approximate our algorithm in the experiments, of course. There are also places to slip up in sampling from simple-looking distributions, and I’d be willing to bet that in some implementations out there people are not sampling from the correct distribution.

(Thanks to Kamalika Chaudhuri for inspiring this post.)

I tend to flag papers on ArXiV that I want to take a look at in (soon to be defunct, *sniff*) Google Reader. Here are some papers from the last month that I found interesting. I’ll post a few more of these as I work through my backlog…

Local Privacy and Statistical Minimax Rates (John C. Duchi, Michael I. Jordan, Martin J. Wainwright) — this is a paper proving minimax lower bounds for differential privacy. The approach is based on the Fano/Le Cam style of getting minimax bounds by constructing a packing of instances of the problem.

Bernstein – von Mises Theorem for growing parameter dimension (Vladimir Spokoiny) — I’m generally interested in the consistency properties of Bayesian procedures, and this looks at how fast the parameter dimension can grow with the sample size while still getting the same consistency from the BvM theorem.

On the problem of reversibility of the entropy power inequality (Sergey G. Bobkov, Mokshay M. Madiman) — More results on the EPI. Reversing it is the same as reversing the Brunn-Minkowski inequality (consider uniform distributions), but there is an interesting impossibility result here (Theorem 1.3): “For any constant $C$, there is a convex probability distribution $\mu$ on the real line with a finite entropy, such that $\min \{ H(X+Y), H(X-Y) \} \ge C H(X)$, where $X$ and $Y$ are independent random variables, distributed according to $\mu$.” The distribution they use is a truncated Pareto distribution but the calculations seem hairy.

A universal, operational theory of unicast multi-user communication with fidelity criteria (Mukul Agarwal, Sanjoy Mitter, Anant Sahai) — This is the culmination of Mukul’s work starting from a very nice paper I cite all the time from Allerton. There are several results and commentary in here — there’s a fair bit of philosophy, so it’s worth a more patient read than I could give it so far (only so many hours in the day, after all!)

The Convergence Rate of Majority Vote under Exchangeability (Miles E. Lopes) — The title says it all, really. The bounds are actually in terms of the mixture distribution of the exchangeable sequence of Bernoulli votes.

Maybe more like “paper whenever I feel like it.” This is a post on a now not-so-recent ArXiV preprint by Quan Geng and Pramod Viswanath on constructing the best mechanism (output distribution) for guaranteeing differential privacy under a utility constraint.

Having seen a talk recently by John Ioannidis on how medical research is (often) bunk, I found this finer corrective by Larry Wasserman nice to read.

Computer science conferences are often not organized by the ACM, but instead there are different foundations for machine learning and vision and so on that basically exist to organize the annual conference(s). At least, that is what I understand. There are a few which are run by the ACM, and there’s often debate about whether or not the ACM affiliation is worth it, given the overheads and so on. Boaz Barak had a post a little over a week ago making the case for sticking with the ACM. Given the hegemonic control of the IEEE on all things EE (more or less), this debate is new to me. As far as I can tell, ISIT exists to cover some of the cost of publishing the IT Transactions, and so it sort of has to be run by IEEE.

As mentioned before, Tara Javidi has a nice post up on what it means for one random variable to be stochastically less variable than another.

Paul Mineiro has a bigger-picture view of NIPS — I saw there were lots of papers on “deep learning” but it’s not really my area so I missed many of those posters.

David Eppstein’s top 10 cs.DS papers from 2012.