Yet more skims from ArXiV

I’m still catching up on my reading backlog, but I’ve decided to set aside some time to take a look at a few papers from ArXiV.

  • Lecture Notes on Free Probability by Vladislav Kargin, which is 100 pages of notes from a course at Stanford. Pretty self-explanatory, except for the part where I don’t really know free probability. Maybe reading these will help.
  • Capturing the Drunk Robber on a Graph by Natasha Komarov and Peter Winkler. This is on a simple pursuit-evasion game in which the robber (evader) is moving according to a random walk. On a graph with n vertices:

    the drunk will be caught with probability one, even by a cop who oscillates on an edge, or moves about randomly; indeed, by any cop who isn’t actively trying to lose. The only issue is: how long does it take? The lazy cop will win in expected time at most 4 n^3/27 (plus lower-order terms), since that is the maximum possible expected hitting time for a random walk on an n-vertex graph [2]; the same bound applies to the random cop [4]. It is easy to see that the greedy cop who merely moves toward the drunk at every step can achieve O(n^2); in fact, we will show that the greedy cop cannot in general do better. Our smart cop, however, gets her man in expected time n + o(n).

    How do you make a smarter cop? In this model the cop can tell where the robber is but has to get there by walking along the graph. Strategies which constantly “retarget” are wasteful, so they propose a strategy in which the cop retargets only periodically and still intercepts the robber. I feel like there is a prediction/learning algorithm or idea embedded in here as well. (A toy simulation of the basic chase model appears after this list.)

  • Normalized online learning by Stephane Ross, Paul Mineiro, John Langford. Normalization and data pre-processing are the source of many errors and frustrations in machine learning practice. When features are not normalized with respect to each other, procedures like gradient descent can behave poorly. This paper looks at handling data normalization inside the algorithm itself, making it “unit free” in a sense. It’s the same kind of weight-update rule that we see in online learning, but with a few lines changed. They do an adversarial analysis of the algorithm in which the adversary gets to scale the features before the learning algorithm sees the data point. In particular, the adversary gets to choose the covariance of the data. (A generic sketch of per-feature scaling appears after this list.)
  • On the Optimality of Treating Interference as Noise, by Chunhua Geng, Navid Naderializadeh, A. Salman Avestimehr, and Syed A. Jafar. Suppose I have a K-user interference channel with gains \alpha_{ij} between transmitter i and receiver j. Then if, for every user i,
    \alpha_{ii} \ge \max_{j \ne i} \alpha_{ij} + \max_{k \ne i} \alpha_{ki}
    treating interference as noise is optimal in terms of generalized degrees of freedom. I don’t really work on this kind of thing, but the symmetry of the condition is quite appealing.
  • Online Learning under Delayed Feedback, by Pooria Joulani, András György, Csaba Szepesvári. This paper is on forecasting algorithms which receive the feedback (e.g. the error) with a delay. Since I’ve been interested in communication with delayed feedback, this seems like a natural learning analogue. They provide ways of modifying existing algorithms to work with delayed feedback — one such method is to run a bunch of predictors in parallel and update them as the feedback is returned (a rough sketch of this idea appears after the list). They also propose methods which use partial monitoring and a UCB-style approach for bandit problems in the delayed-feedback setting.
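
A toy simulation of the drunk-robber chase (not the authors’ smart-cop strategy, just the basic model): the greedy cop walks one step along a shortest path toward the robber each round, while the robber does a simple random walk. The graph representation, the capture convention, and the cycle used in the demo are my own illustrative choices.

import random
from collections import deque

def next_step_toward(adj, src, dst):
    """First vertex on a shortest path from src to dst (plain BFS)."""
    if src == dst:
        return src
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                if v == dst:
                    while parent[v] != src:   # walk back to src's neighbor
                        v = parent[v]
                    return v
                queue.append(v)
    return src   # dst unreachable; stay put

def capture_time(adj, cop, robber, rng=random):
    """Rounds until the greedy cop lands on the drunk robber's vertex.
    Convention: cop moves one step toward the robber, then the robber
    takes one random-walk step (if not already caught)."""
    rounds = 0
    while cop != robber:
        rounds += 1
        cop = next_step_toward(adj, cop, robber)
        if cop == robber:
            break
        robber = rng.choice(adj[robber])
    return rounds

# demo: a cycle on n vertices, cop and robber starting diametrically opposite
n = 50
adj = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
trials = [capture_time(adj, 0, n // 2) for _ in range(500)]
print("average capture time:", sum(trials) / len(trials))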
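
For the normalized online learning paper, here is a generic sketch of the per-feature scaling idea (my own illustration, not the authors’ exact NAG update): an online least-squares learner whose coordinate-wise updates are divided by the running scale of that feature, so the trajectory is insensitive to how the features are scaled.

import numpy as np

class ScaleFreeSGD:
    """Online least squares where each coordinate's update is divided by the
    square of the largest |x_i| seen so far, so rescaling a feature rescales
    the corresponding weight inversely and leaves predictions unchanged.
    (The paper's NAG algorithm also rescales the weights when a new maximum
    arrives and adapts the global learning rate; that is omitted here.)"""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.scale = np.zeros(dim)   # running max of |x_i| per feature
        self.lr = lr

    def update(self, x, y):
        x = np.asarray(x, dtype=float)
        self.scale = np.maximum(self.scale, np.abs(x))
        s = np.where(self.scale > 0, self.scale, 1.0)
        pred = float(self.w @ x)
        grad = (pred - y) * x                 # gradient of squared loss
        self.w -= self.lr * grad / s ** 2     # per-feature normalization
        return pred

# demo: the second feature is on a wildly different scale
rng = np.random.default_rng(0)
learner = ScaleFreeSGD(dim=2)
for _ in range(1000):
    x = np.array([rng.normal(), 1e6 * rng.normal()])
    y = 2.0 * x[0] + 3e-6 * x[1]
    learner.update(x, y)
print(learner.w)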
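
And for the delayed-feedback paper, a rough sketch of the parallel-copies idea: keep a pool of base learners and let each prediction be made by a copy that is not waiting on feedback. The toy MeanLearner, the pool interface, and the fixed delay of three rounds are illustrative assumptions, not the paper’s actual constructions.

import random

class MeanLearner:
    """Toy base learner: predicts the running mean of the targets it has seen."""
    def __init__(self):
        self.total, self.count = 0.0, 0
    def predict(self, x=None):
        return self.total / self.count if self.count else 0.0
    def update(self, y):
        self.total += y
        self.count += 1

class DelayedFeedbackPool:
    """Each prediction is made by a copy of the base learner that has no
    outstanding feedback, so every copy runs an ordinary (undelayed) online
    algorithm on the subsequence of rounds it handles."""
    def __init__(self, make_learner):
        self.make_learner = make_learner
        self.free = []       # copies not waiting for feedback
        self.pending = {}    # round index -> copy that predicted in that round
    def predict(self, t, x=None):
        learner = self.free.pop() if self.free else self.make_learner()
        self.pending[t] = learner
        return learner.predict(x)
    def feedback(self, t, y):
        learner = self.pending.pop(t)
        learner.update(y)
        self.free.append(learner)

# demo: the target for round t only arrives three rounds later
pool, buffered = DelayedFeedbackPool(MeanLearner), []
for t in range(100):
    pool.predict(t)
    buffered.append((t, random.gauss(1.0, 0.1)))
    if len(buffered) > 3:
        pool.feedback(*buffered.pop(0))
print("copies created:", len(pool.free) + len(pool.pending))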

Capacity-achieving philately

The following email came through the ITSOC mailing list, but may be of interest to other readers of the blog.

Dear Colleagues,

We are making a proposal to the United States Postal Service for the production of a stamp honoring Claude Elwood Shannon on the 100th anniversary of his birth.

The proposal is available at: http://www.itsoc.org/about/shannons-centenary-us-postal-stamp/

Please add your endorsement on that page, and then spread the word!

We would love to have the endorsements of your friends, colleagues, department chair, dean, university president, CEO, government representatives, school-aged children, and the public at large. [Contact information for endorsing individuals will not be posted.]

Thanks for your support!
Best,
Michelle Effros

HGR maximal correlation revisited : a corrected reverse inequality

Sudeep Kamath sent me a note about a recent result he posted on the ArXiV that relates to an earlier post of mine on the HGR maximal correlation and an inequality by Erkip and Cover for Markov chains U -- X -- Y which I had found interesting:
I(U ; Y) \le \rho_m(X,Y)^2 I(U ; X).
Since learning about this inequality, I’ve seen a few talks which have used the inequality in their proofs, at Allerton in 2011 and at ITA this year. Unfortunately, the stated inequality is not correct!

On Maximal Correlation, Hypercontractivity, and the Data Processing Inequality studied by Erkip and Cover
Venkat Anantharam, Amin Gohari, Sudeep Kamath, Chandra Nair

What this paper shows is that the inequality holds not with \rho_m(X,Y)^2, but with another quantity:
I(U ; Y) \le s^*(X;Y) I(U ; X)
where s^*(X;Y) is given by the following definition.

Let X and Y be random variables with joint distribution (X, Y) \sim p(x, y). We define
s^*(X;Y) = \sup_{r(x) \ne p(x)} \frac{ D( r(y) \| p(y) ) }{ D( r(x) \| p(x) ) },
where r(y) denotes the y-marginal distribution of r(x, y) := r(x)p(y|x) and the supremum on the right hand side is over all probability distributions r(x) that are different from the probability distribution p(x). If either X or Y is a constant, we define s^*(X; Y) to be 0.
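
For binary X the supremum is over a one-parameter family of distributions r(x), so it is easy to estimate numerically. Here is a small sketch (my own, not from the paper); the function names and grid size are arbitrary, and the ratio of divergences does not depend on the base of the logarithm.

import numpy as np

def kl(p, q):
    """D(p || q) for discrete distributions; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def s_star_binary_x(p_x, channel, grid=4000):
    """Grid estimate of s*(X;Y) when X is binary.
    p_x: length-2 marginal of X; channel[x, y] = p(y|x)."""
    p_x = np.asarray(p_x, dtype=float)
    channel = np.asarray(channel, dtype=float)
    p_y = p_x @ channel
    best = 0.0
    for a in np.linspace(1e-4, 1 - 1e-4, grid):
        r_x = np.array([a, 1.0 - a])
        if np.allclose(r_x, p_x):
            continue                  # the sup excludes r(x) = p(x)
        r_y = r_x @ channel           # r(y) from r(x, y) = r(x) p(y|x)
        best = max(best, kl(r_y, p_y) / kl(r_x, p_x))
    return best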

Suppose (X,Y) have joint distribution P_{XY} (I know I am changing notation here, but it’s easier to explain). The key to their result is deriving variational characterizations of \rho_m and s^* in terms of the function
t_{\lambda}( P_X ) := H( P_Y ) - \lambda H( P_X )
Rather than getting into that in the blog post, I recommend reading the paper.

The implication of this result is that the inequality of Erkip and Cover is not correct: not only is \rho_m(X,Y)^2 not the supremum of the ratio, but there are distributions for which it is not even an upper bound. The counterexample in the paper is the following: X \sim \mathsf{Bernoulli}(1/2), and Y is generated via this asymmetric erasure channel:

Joint distribution counterexample (Fig. 2 of the paper)

How can we calculate \rho_m(X,Y)? If either X or Y is binary-valued, then
\rho_m(X,Y)^2 = -1 + \sum_{x,y} \frac{ p(x,y)^2 }{ p(x) p(y) }
So for this example \rho_m(X,Y)^2 = 0.6. However, s^*(X,Y) = \frac{1}{2} \log_2(12/5) > 0.6, and there exists a sequence of variables U_i satisfying the Markov chain U_i -- X -- Y such that the ratio approaches s^*.
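
The formula above is just a sum over the joint distribution, so it is simple to compute directly. A quick sketch below (the example joint distribution is purely illustrative and is not the Fig. 2 counterexample from the paper); one could compare its output with the grid estimate of s^* sketched earlier.

import numpy as np

def rho_m_squared(joint):
    """rho_m(X,Y)^2 = -1 + sum_{x,y} p(x,y)^2 / (p(x) p(y));
    valid when X or Y is binary-valued, as in the formula above."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)
    p_y = joint.sum(axis=0, keepdims=True)
    return float(np.sum(joint ** 2 / (p_x * p_y)) - 1.0)

# purely illustrative joint distribution (not the paper's Fig. 2 example)
example = np.array([[0.30, 0.10, 0.10],
                    [0.05, 0.05, 0.40]])
print(rho_m_squared(example))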

So where is the error in the original proof? Anantharam et al. point out that the Taylor series expansion used in the original proof of the inequality with \rho_m(X,Y)^2 may not be valid at all points.

This seems to just be the start of a longer story, which I look forward to reading in the future!

Expected number of faces in Gaussian polytopes

Last week I was reading Active Learning via Perfect Selective Classification by El-Yaniv and Wiener, and came across a neat result due to Hug and Reitzner that they use in some of their bounds for active learning on Gaussian distributions.

The setup is the following: let X_1, X_2, \ldots, X_n be n i.i.d. Gaussian vectors in \mathbb{R}^d, each with distribution \mathcal{N}(0, I_d). The convex hull P_n of these points is called a Gaussian polytope. This is a random polytope of course, and we can ask various things about its shape: what is the distribution of the number of vertices, or the number of k-faces? Let f_k(P_n) be the number of k-faces. Exact distributions are hard to get at, but for general k the expected number of k-faces (as n \to \infty) is given by

\mathbb{E}[ f_k(P_n)] = \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1}(\pi \ln n)^{(d-1)/2} (1 + o(1)),

where \beta_{k,d-1} is the internal angle of a regular (d-1)-simplex at one of its k-dimensional faces. What Hug and Reitzner show is a bound on the variance (which El-Yaniv and Wiener then use in a Chebyshev bound): there exists a constant c_d such that

\mathrm{Var}( f_k(P_n) ) \le c_d (\ln n)^{(d-1)/2}

So the variance of the number of k-faces can be upper bounded by something that does not depend at all on the actual value of k. In fact, they show that

f_k(P_n) (\ln n)^{-(d-1)/2} \to \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1} \pi^{(d-1)/2}

in probability as n \to \infty. That is, appropriately normalized, the number of faces converges to a constant.

To me this result was initially surprising, but after some more thought it makes a bit more sense. If you give me a cloud of Gaussian points, then I need k+1 points to define a k-face. The formula for the mean says that out of the \binom{n}{k+1} possible sets of k+1 points, only about \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1}(\pi \ln n)^{(d-1)/2} of them end up forming a real k-face of the polytope, so the fraction of such sets that form faces is vanishingly small. This also explains why the simplex-related quantity appears: points that are on “opposite sides” of the sphere (the level sets of the density) are not going to form a face together. As n \to \infty this count keeps growing, but the rate of growth in the number of faces with n is only (\ln n)^{(d-1)/2}, regardless of k.
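
One way to see the (\ln n)^{(d-1)/2} growth empirically is to count hull vertices (the k = 0 case) of Gaussian point clouds with scipy. A Monte Carlo sketch, with the dimension, sample sizes, and number of trials chosen arbitrarily:

import numpy as np
from scipy.spatial import ConvexHull

def mean_vertex_count(n, d=3, trials=20, seed=0):
    """Monte Carlo estimate of E[f_0(P_n)], the expected number of vertices
    of the convex hull of n i.i.d. N(0, I_d) points."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(trials):
        points = rng.standard_normal((n, d))
        counts.append(len(ConvexHull(points).vertices))
    return float(np.mean(counts))

# the counts should grow roughly like (ln n)^{(d-1)/2}
for n in (100, 1000, 10000):
    print(n, mean_vertex_count(n), np.log(n) ** ((3 - 1) / 2))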

I’m not sure where this result will be useful for me (yet!) but it seemed like something that the technically-minded readers of the blog would find interesting as well.

RAR : a cry of rage

I’ve been trying to prepare a camera-ready article for the Signal Processing Magazine, and the instructions from IEEE include the following snippet:

*VERY IMPORTANT: All source files ( .tex, .doc, .eps, .ps, .bib, .db, .tif, .jpeg, …) may be uploaded as a single .rar archived file. Please do not attempt to upload files with extensions .shs, .exe, .com, .vbs, .zip as they are restricted file types.

While I have encountered .rar files before, I was not very familiar with the file format or its history. I didn’t know it was a proprietary format — that seems like a weird choice for IEEE to make (although perhaps no weirder than PDF).

What’s confusing to me is that ArXiV manages to handle .zip files just fine. Is .tgz so passé now? My experience with RAR is that it is good for compressing (and splitting) large files into easier-to-manage segments. All of that efficiency seems wasted for a single paper with associated figures and bibliography files and whatnot.

I was trying to find the actual compression algorithm, but like most modern compression software, the innards are a fair bit more complex than the base algorithmic ideas. The Wikipedia article suggests it does a blend of Lempel-Ziv (a variant of LZ77) and prediction by partial matching, but I imagine there’s a fair bit of tweaking. What I couldn’t figure out is whether there is a new algorithmic idea in there (like in the Burrows-Wheeler Transform (BWT)), or whether it’s more a blend of these existing techniques.

Anyway, this silliness means I have to find some extra software to help me compress. SimplyRAR for MacOS seems to work pretty well.

C.R. Rao and information geometry

On Lalitha’s recommendation I read Frank Nielsen’s paper “Cramer-Rao Lower Bound and Information Geometry,” which is a survey of how C.R. Rao’s work has impacted information geometry. I remember spending some time in grad school trying to learn information geometry (mostly for fun), but since it ended up not being particularly useful in my research, I’m afraid a lot of it has leaked out of my ears. This paper has a short introduction to the Cramer-Rao lower bound and an introduction to information geometry which might be a nice read for some of the readers of this blog. It’s certainly faster than trying to read Amari’s monograph! In particular, it goes over the “highlights” of geodesics and other geometric features on the manifold of probability distributions.

The paper mentions the sub-family of f-divergences known as \alpha-divergences, which are given by

D_{\alpha}(p \| q) = \frac{4}{1 - \alpha^2} \left( 1 - \int p(x)^{(1 - \alpha)/2} q(x)^{(1 + \alpha)/2} dx \right)

The KL divergence is D_{-1}(p \| q) — you have to take the limit as \alpha \to -1. Within this family of divergences we have the relation D_{\alpha}(p \| q) = D_{-\alpha}(q \| p). Consider a pair of random variables (X,Y) with joint distribution P_{XY} and marginal distributions P_X and P_Y. If we take q = P_X P_Y and p = P_{XY} then the mutual information is D_{-1}( p \| q ). But we can also take

D_{-1}( P_{X} P_{Y} \| P_{XY}) = D_1( P_{XY} \| P_{X} P_{Y} )

Thus it turns out that the “lautum information” defined by Palomar and Verdú is a special case of this: it’s the 1-divergence between the joint distribution and the product of the marginals. While their paper mentions that the lautum information is an f-divergence, it doesn’t discuss its connection to this family of divergences. Nielsen’s paper calls this the “reverse Kullback-Leibler divergence,” but some googling doesn’t seem to indicate that this is a common term, or whether it has some use in information geometry. Palomar and Verdú give several operational interpretations of the lautum information.
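
As a sanity check on these relations, here is a small numerical sketch for discrete distributions (the toy joint distribution is made up for illustration): D_{-1} of the joint from the product of marginals recovers the mutual information, D_1 recovers the lautum information, and D_{\alpha}(p \| q) = D_{-\alpha}(q \| p).

import numpy as np

def alpha_divergence(p, q, alpha):
    """D_alpha(p || q) for discrete distributions, using the formula above;
    alpha = -1 and alpha = +1 are handled as the KL limits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, -1.0):
        return float(np.sum(p * np.log(p / q)))      # D(p || q)
    if np.isclose(alpha, 1.0):
        return float(np.sum(q * np.log(q / p)))      # D(q || p)
    integral = np.sum(p ** ((1 - alpha) / 2) * q ** ((1 + alpha) / 2))
    return float(4.0 / (1.0 - alpha ** 2) * (1.0 - integral))

# toy joint distribution and product of its marginals (illustrative numbers)
joint = np.array([[0.3, 0.2], [0.1, 0.4]])
product = np.outer(joint.sum(axis=1), joint.sum(axis=0))
p, q = joint.ravel(), product.ravel()
print(alpha_divergence(p, q, -1), alpha_divergence(p, q, -0.999))  # mutual information
print(alpha_divergence(p, q, 1), alpha_divergence(q, p, -1))       # lautum information, two ways
print(alpha_divergence(p, q, 0.5), alpha_divergence(q, p, -0.5))   # D_alpha(p||q) = D_{-alpha}(q||p)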

ICML Workshop on Machine Learning with Test-Time Budgets

Venkatesh Saligrama sent out a call for an ICML workshop he is organizing:

I wanted to bring to your attention an ICML workshop on “Machine Learning with Test-Time Budgets” that I am helping organize. The workshop will be held during the ICML week. The workshop will feature presentations both from data-driven as well as model-based perspectives and will feature researchers from machine learning and control/decision theory.

We are accepting papers related to these topics. Please let me know if you have questions about the workshop or wish to submit a paper.

Postdoc position at KTH

(via David Tse)

The School of Electrical Engineering and the ACCESS Linnaeus Center at the KTH Royal Institute of Technology, Stockholm, Sweden, are pleased to announce post-doctoral positions in information and communication theory.

The ability and interest to work across traditional disciplines and to initiate new research collaborations are essential. Candidates should have a PhD (or be near completion) in a relevant field and a strong research and publication record. The duration of the position is 12 months which may be extended by an additional 12 months. The starting date is during fall or winter of 2013.

Candidates interested in a position should send their application material (as a single pdf file) to: openpos-commth@ee.kth.se and 0129@ee.kth.se no later than 20 April 2013. Position reference number E-2013-0129. Write this reference number on your application. The application can include any material that supports the candidate’s qualifications, but as a minimum it should include a CV, contact information of two reference persons, a full list of publications, a brief research statement, and information about academic track record and performance. Do not send any compressed files. Female candidates are explicitly invited to apply.

Sincerely
Mikael Skoglund and Lars K. Rasmussen
KTH Royal Institute of Technology, Stockholm

The KTH School of Electrical Engineering
The ACCESS Center
The KTH EE Communication Theory Lab

Some not-so-recent ArXiV skims

I tend to flag papers on ArXiV that I want to take a look at in (soon to be defunct, *sniff*) Google Reader. Here are some papers from the last month that I found interesting. I’ll post a few more of these as I work through my backlog…

Local Privacy and Statistical Minimax Rates (John C. Duchi, Michael I. Jordan, Martin J. Wainwright) — this is a paper proving minimax lower bounds under local differential privacy constraints. The approach is based on the Fano/Le Cam style of getting minimax bounds by constructing a packing of instances of the problem.

Bernstein – von Mises Theorem for growing parameter dimension (Vladimir Spokoiny) — I’m generally interested in the consistency properties of Bayesian procedures, and this looks at how fast the parameter dimension can grow with the sample size while the Bernstein–von Mises theorem still delivers the same kind of consistency.

On the problem of reversibility of the entropy power inequality (Sergey G. Bobkov, Mokshay M. Madiman) — More results on the EPI. Reversing it is the same as reversing the Brunn-Minkowski inequality (consider uniform distributions), but there is an interesting impossibility result here (Theorem 1.3): “For any constant C, there is a convex probability distribution \mu on the real line with a finite entropy, such that \min \{ H(X+Y), H(X-Y) \} \ge C H(X), where X and Y are independent random variables, distributed according to \mu.” The distribution they use is a truncated Pareto distribution but the calculations seem hairy.

A universal, operational theory of unicast multi-user communication with fidelity criteria (Mukul Agarwal, Sanjoy Mitter, Anant Sahai) — This is the culmination of Mukul’s work starting from a very nice paper I cite all the time from Allerton. There are several results and commentary in here — there’s a fair bit of philosophy, so it’s worth a more patient read than I could give it so far (only so many hours in the day, after all!)

The Convergence Rate of Majority Vote under Exchangeability (Miles E. Lopes) — The title says it all, really. The bounds are actually in terms of the mixture distribution of the exchangeable sequence of Bernoulli votes.

Paper a day(?) : optimal mechanisms for differential privacy

Maybe more like “paper whenever I feel like it.” This is a post on a now not-so-recent ArXiV preprint by Quan Geng and Pramod Viswanath on constructing the best mechanism (output distribution) for guaranteeing differential privacy under a utility constraint.
