# C.R. Rao and information geometry

On Lalitha’s recommendation I read Frank Nielsen’s paper “Cramer-Rao Lower Bound and Information Geometry,” which is a survey how C.R. Rao’s work has impacted information geometry. I remember spending some time in grad school trying to learn information geometry (mostly for fun), but since it ended up not being particularly useful in my research, I’m afraid a lot of it has leaked out of my ears. This paper has a short introduction to the Cramer-Rao lower bound and an introduction to information geometry which might be a nice read for some of the readers of this blog. It’s certainly faster than trying to read Amari’s monograph! In particular, it goes over the “highlights” of geodesics and other geometric features on the manifold of probability distributions.

The paper mentions the sub-family of f-divergences known as $\alpha$-divergences, which are given by

$D_{\alpha}(p \| q) = \frac{4}{1 - \alpha^2} \left( 1 - \int p(x)^{(1 - \alpha)/2)} q(x)^{(1 + \alpha)/2} dx \right)$

The KL divergence is $D_{-1}(p \| q)$ — you have to take the limit as $\alpha \to -1$. Within this family of divergences we have the relation $D_{\alpha}(p \| q) = D_{-\alpha}(q \| p)$. Consider a pair of random variables $(X,Y)$ with joint distribution $P_{XY}$ and marginal distributions $P_X$ and $P_Y$. If we take $q = P_X P_Y$ and $p = P_{XY}$ then the mutual information is $D_{-1}( p \| q )$. But we can also take

$D_{-1}( P_{X} P_{Y} \| P_{XY}) = D_1( P_{XY} \| P_{X} P_{Y} )$

Thus it turns out that the “lautum information” defined by Palomar and Verdú is a special case of this: it’s the 1-divergence between the the joint distribution and the product of the marginals. While their paper mentions the lautum information is an f-divergence, it doesn’t discuss this connection to this family of divergences. Nielsen’s paper calls this the “reverse Kullback-Leibler divergence,” but some googling doesn’t seem to indicate that this is a common term, or indeed if it has some use in information geometry. Palomar and Verdú give several operational interpretations of the lautum information.

# Some not-so-recent ArXiV skims

I tend to flag papers on ArXiV that I want to take a look at in (soon to be defunct, *sniff*) Google Reader. Here are some papers from the last month that I found interesting. I’ll post a few more of these as I work through my backlog…

Local Privacy and Statistical Minimax Rates (John C. Duchi, Michael I. Jordan, Martin J. Wainwright) — this is a paper proving minimax lower bounds for differential privacy. The approach is based on the Fano/Le Cam style of getting minimax bounds by constructing a packing of instances of the problem.

Bernstein – von Mises Theorem for growing parameter dimension (Vladimir Spokoiny) — I’m generally interested in the consistency properties of Bayesian procedures, and this looks at the effect of asymptotically growing the problem size to see how fast the problem can grow while still getting the same consistency from the BvM theorem.

On the problem of reversibility of the entropy power inequality (Sergey G. Bobkov, Mokshay M. Madiman) — More results on the EPI. Reversing it is the same as reversing the Brunn-Minkowski inequality (consider uniform distributions), but there is an interesting impossibility result here (Theorem 1.3): “For any constant $C$, there is a convex probability distribution $\mu$ on the real line with a finite entropy, such that $\min \{ H(X+Y), H(X-Y) \} \ge C H(X)$, where $X$ and $Y$ are independent random variables, distributed according to $\mu$.” The distribution they use is a truncated Pareto distribution but the calculations seem hairy.

A universal, operational theory of unicast multi-user communication with fidelity criteria (Mukul Agarwal, Sanjoy Mitter, Anant Sahai) — This is the culmination of Mukul’s work starting from a very nice paper I cite all the time from Allerton. There are several results and commentary in here — there’s a fair bit of philosophy, so it’s worth a more patient read than I could give it so far (only so many hours in the day, after all!)

The Convergence Rate of Majority Vote under Exchangeability (Miles E. Lopes) — The title says it all, really. The bounds are actually in terms of the mixture distribution of the exchangeable sequence of Bernoulli votes.

/

A rather pretty video of an L-system made by my friend Steve.

LACMA, which I finally saw with a friend in February, has decided to offer high-resolution downloads of many of the items in its collection. This Ganesha has a pretty impressive belly. Via MeFi.

This may answer David Bowie’s question.

This slideshow makes me want to go to Slurping Turtle again.

Sometimes I wish we could just name p-values something else that is more descriptive. There’s been a fair bit of misunderstanding about them going on lately.

# ITA 2013 : part the third

I think this is the end of my ITA blogging! But there were some issues that came up during the conference that may be of interest to some of the readers of this blog (although from anecdotal reports, there are many people who read but never comment, so I’m not sure what to do to encourage more discussions).

Having seen a talk recently by John Ioannidis on how medical research is (often) bunk, this finer corrective by Larry Wasserman was nice to read.

Computer science conferences are often not organized by the ACM, but instead there are different foundations for machine learning and vision and so on that basically exist to organize the annual conference(s). At least, that is what I understand. There are a few which are run by the ACM, and there’s often debate about whether or not the ACM affiliation is worth it, given the overheads and so on. Boaz Barak had a post a little over a week ago making the case for sticking with the ACM. Given the hegemonic control of the IEEE on all things EE (more or less), this debate is new to me. As far as I can tell, ISIT exists to cover some of the cost of publishing the IT Transactions, and so it sort of has to be run by IEEE.

As mentioned before, Tara Javidi has a nice post up on what it means for one random variable to be stochastically less variable than another.

Paul Miniero has a bigger picture view of NIPS — I saw there were lots of papers on “deep learning” but it’s not really my area so I missed many of those posters.

David Eppstein’s top 10 cs.DS papers from 2012.

Here’s a roundup of some interesting posts/pages on technical things.

Over at Larry Wasserman’s blog, Rob Tibshirani suggests 9 Great Statistics papers published after 1970. You know, in case you were looking for some light reading over winter break.

Videos from the DIMACS Differential Privacy Workshop are up.

All of these ads for jobs this year want someone who works on Big Data. But… do you really have big data? Or, as I like to ask, “how big is big, anyway?”

Speaking of big data, this talk by Peter Bartlett looks cool. (h/t Andrew Gelman)

Max Raginsky and Igal Sason have a tutorial on measure concentration. Log Sobolev inequalities are a dish best served cold.

I’ll probably do an ArXiV roundup sometime soon — trying to catch up on a backlog of reading and thinking lately.

/

# Linkage : Black Friday edition

This is an amazing video that makes me miss the Bay Area. (via Bobak Nazer)

Also via Bobak, we’re number 8 and 10!

Since it’s holiday season, I figured it’s time to link to some profanity-laden humor about the holidays. For the new, The Hater’s Guide to the Williams-Sonoma Catalog, and the classic It’s Decorative Gourd Season….

A Game of Food Trucks. (via MetaFilter)

Larry Wasserman takes on the Bayesian/Frequentist debate.

My friend Erik, who started the Mystery Brewing Company, has a blog called Top Fermented. He is now starting a podcast, which also has an RSS feed.

/

Via Amber, a collection of grassroots feminist political posters from India.

Via John, some fun investigations on how 355/113 is an amazingly good approximation to $\pi$. Also related are the Stern-Brocot trees, which can give continued fraction expansions.

I had missed this speech by a 10 year old on gay marriage when it happened (I was in India), but it’s pretty heartwarming. For more background on how the principal originally deemed the story “inappropriate.”

What is a Bayesian?

Unrelatedly, ML Hipster — tight bounds and tight jeans.

# ArXiV notes : July 2012 part 2

Differentially Private Filtering and Differentially Private Kalman Filtering
Jerome Le Ny, George J. Pappas
These papers apply the differential privacy model to classical signal processing and look at the effect of adding noise before filtering and summing versus adding noise after filtering and summing. The key is that the operations have to be causal, I think — we usually think about differential privacy as operating offline or online in very restricted settings, but here the signals are coming in as time series.

Finite sample posterior concentration in high-dimensional regression
Nate Strawn, Artin Armagan, Rayan Saab, Lawrence Carin, David Dunson
They study ultra high-dimensional linear regression (cue guitars and reverb) in the “large p, small n” setting. The goal is to get a Bernstein-von Mises type theorem — e.g. assuming the data comes from a linear model $\beta_0$, then using Bayesian inference the posterior should concentrate around $\beta_0$. They of course need a sparsity assumption, and the prior must assign reasonable mass around the true parameter and assign low probability to non-sparse signals. The methods use some ideas from compressed sensing (the Dantzig selector) and should be of interest to people working in that area.

Identifying Users From Their Rating Patterns
José Bento, Nadia Fawaz, Andrea Montanari, and Stratis Ioannidis
This is a paper about recommender systems as part of the 2011 CAMRa challenge. They look at the problem of re-identification of users in this data and show that looking at the time stamps of movie ratings is much more useful than looking at the rating values. This suggests to me that people should use a masking system like Anant and Parv’s “Incentivizing Anonymous ‘Peer-to-Peer’ Reviews” (P. Venkitasubramaniam and A. Sahai, Allerton 2008) paper.

Michael Kearns, Mallesh M. Pai, Aaron Roth, Jonathan Ullman
This proposes a variant of differential privacy which they call joint differential privacy and look at mechanism designs that satisfy privacy and are incentive compatible. At first glance, these should be incompatible, since the latter implies “revealing the truth.” The model is one in which each agent has a finite set of actions but its own payoff/value is private. This is somewhat out of my area, so I can’t really get the nuances of what’s going on here, but a crucial assumption here is that there are a large number of agents. Joint differential privacy seems to be a form of $(\epsilon,\delta)$ differential privacy on the utility functions of the users.

# ISIT 2012 : more talks

Since I am getting increasingly delayed by post-ISIT and pre-SPCOM business, I am going to have to keep the rest of blogging about ISIT a little short. This post will mention some talks, and I’ll keep the other stuff for a (final) post.

Efficient Tracking of Large Classes of Experts
András György, Tamas Linder, Gabor Lugosi
This paper was on expanding the reference class against one is competing in a “prediction with experts” problem. Instead of doing well against the best expert chosen in hindsight, you compete against the best meta-expert which can switch between the existing experts. This leads to a transition diagram that is kind of complicated, but they propose a unifying approach which traces along branches — the key is that every transition path can be well approximated, so the space of possibilities one is tracking will not blow up tremendously.

Information-Theoretically Optimal Compressed Sensing via Spatial Coupling and Approximate Message Passing
David Donoho, Adel Javanmard, Andrea Montanari
What a trendy title! Basically this problem looks at the compressed sensing problem when the sensing matrix is banded (this is what spatially coupled means), and solves it using Bayesian approximate message passing to do progressive decoding and elimination. The optimality is in the sense of matching with the Renyi dimension of the signal class for the data. I alas did not take notes for the next talk, which also seemed related: Hybrid Generalized Approximate Message Passing with Applications to Structured Sparsity (Sundeep Rangan, Alyson Fletcher, Vivek Goyal, Philip Schniter)

Quantized Stochastic Belief Propagation: Efficient Message-Passing for Continuous State Spaces
Nima Noorshams, Martin Wainwright
This problem was on BP when the state space is continuous — instead of passing the whole belief distribution, nodes pass along samples from the distribution and the receiving node does a kind of interpolation/estimate of the density. They show that this process converges on trees. This is related to a problem I’ve been thinking about for decentralized inference, but with a different approach.

Synchrony Amplification
Ueli Maurer, Björn Tackmann
This was a cool talk on a framework for thinking about synchrony in clocks — the model is pretty formal, and it’s something I never really think about but it seemed like a fun way to think about these problems. Basically they want to formalize how you can take a given clock (a sequence of ticks) and convert it into another clock. The goal is to not throw out too many ticks (which equals slowdown), while achieving synchrony.

Non-coherent Network Coding: An Arbitrarily Varying Channel Approach
Mahdi Jafari Siavoshani, Shenghao Yang, Raymond Yeung
Of course I have to go to a talk with AVC in the title. This looks at the same operator channel for network coding but then they assume the network matrix may be arbitrarily varying (with known rank). In this model they can define all the usual AVC concepts and they get similar sorts of results that you see for AVCs, like dichotomies between deterministic coding with average error and randomized coding.

Alternating Markov Chains for Distribution Estimation in the Presence of Errors
A two-sample test means you have two strings $x^n$ and $y^n$ and you want to know if they are from the same distribution. He looked at the weak convergence of the asymptotically optimal test to get bounds on the false alarm probability.
This was on a model where two nodes get to observe $X^n$ and $Y^n$ drawn i.i.d. from either $P_{XY}$ or $Q_{XY}$ and they separately compress their observations into messages $W_1$ and $W_1$. The decision rule is to decide $P_{XY}$ if $W_1 = W_2$. What’s the best exponent?
This was on queuing. Customers come in and sample the loads of $L$ queues and then pick one to join. Their strategies may differ, so there is a game between the customers and this can affect the distribution of queue sizes. As a flavor of the weird stuff that can happen, suppose all customers but one only sample one queue and join that queue. Then the remaining customer will experience less delay if they sample two and join the shorter one. However, if all but one sample two and join the shorter one, then it’s better for her to sample just one. At least, that’s how I understood it. I’m not really a queueing guy.