Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. People who read this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.

Suppose we have a parametric family of distributions \mathcal{P} = \{P_{\theta} : \theta \in \mathcal{D}\}, where \mathcal{D} is a metric space with metric d(\cdot,\cdot). For two distributions P_1 and P_2 define the affinity \|P_1 \wedge P_2 \| by:

\|P_1 \wedge P_2 \| = 1 - \frac{1}{2} \| P_1 - P_2 \|_1

Let \mathop{\rm co}(\cdot) denote the convex hull. Then Le Cam’s lemmas is the following.

Le Cam’s Lemma. Let \hat{\theta} be an estimator of \theta(P) on \mathcal{P}. Suppose D_1 and D_2 be two sets such that d(s_1,s_2) \ge 2 \delta for all (s_1,s_2) \in D_1 \times D_2, and \mathcal{P}_1 and \mathcal{P}_2 be two subsets of \mathcal{P} such that \theta(P) \in D_i when P \in \mathcal{P}_i. Then

\sup_{P \in \mathcal{P}} \mathbb{E}_P[ d(\hat{\theta}, \theta(P)) ] \ge \delta \cdot \sup_{P_i \in \mathop{\rm co}(\mathcal{P}_i)} \| P_1 \wedge P_2 \|

This lemma gives a lower bound on the error of parameter estimates in terms of the total variational distance between the distributions associated to different parameter sets. It’s a bit different than the bounds we usually think of like Stein’s Lemma, and also a bit different than bounds like the Cramer-Rao bound.

Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by the H = \{-1, 1\}^m, the vertices of the hypercube. We’ll write t \sim_j t' for t,t' \in H if they differ in the j-th coordinate.

Assouad’s Lemma. Let \mathcal{F}_m = \{P_{t} : t \in H\} be a set of 2^m probability measures indexed by H, and suppose there are m pseudo-distances d_m on \mathcal{D} such that for any pair (x,y)

d(x,y) = \sum_{j}^m d_j(x,y)

and that if t \sim_j t'

d_j( \theta(P_t), \theta(P_{t'}) ) \ge \alpha_m

Then

\max_{P_t \in \mathcal{F}_m} \mathbb{E}_{t}[ d(\hat{\theta},\theta(P_t))] \ge m \cdot \frac{\alpha_m}{2} \min\{ \|P_t \wedge P_{t'} \| : t \sim_j t', j \le m\}

The min comes about because it is the weakest over all neighbors (that is, over all j) of P_t in the hypercube. Assouad’s Lemma has been used in various different places, from covariance estimation, learning, and other minimax problems.

Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.

Fano’s Lemma. Let \mathcal{M}_r \subset \mathcal{P} contain r probability measures such that for all j \ne j' with j,j' \le r

d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r

and

D(P_j~\|~P_{j'}) \le \beta_r

Then

\max_j \mathbb{E}_j[ d(\hat{\theta},\theta(P_j)) ] \ge \frac{\alpha_r}{2}\left( 1 - \frac{\beta_r + \log 2}{\log r} \right)

Here D(\cdot\|\cdot) is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message W uniformly in \{1, 2, \ldots, r\} and then setting the output Y to have the distribution P_j conditioned on W = j.

The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.

I’m giving a talk on Friday, so come on down! This has been your daily self-promotion effort, thank you for reading.

Consensus in context : leveraging the network to accelerate distributed consensus

October 30, 2009 ENS 637
11:00 am

Gossip algorithms are a class of decentralized solutions to the problem of achieving consensus in a network of agents. They have attracted recent research interest because they are simple and robust — attractive qualities for wireless ad-hoc and sensor networks. Unfortunately, the standard gossip protocol converges very slowly for many popular network models. I will discuss three ways to leverage properties of the network to achieve faster convergence : routing, broadcast, and mobility.

Joint work with Alex G. Dimakis, Tuncer Can Aysal, Mehmet Ercan Yildiz, Martin Wainwright, and Anna Scaglione.

I am writing this from my new iphone, for which there is a wordpress app that will let me write posts from my phone. I have to say that although the concept is cool, having to write anything complicated on this touch screen is a little maddening. That being said, I hope to blog soon about some recent reads, canonical angles between subspaces, and a generalization of Fano’s inequality that I learned about in a paper by Bin Yu.

NPR’s All Things Considered is running a 4-part series on privacy issues in the modern digital era. Since I’ve started working on privacy research (specifically related to privacy in machine learning problems and in medical databases) these popular news stories are a good insight into how people in general think about privacy. The first segment is on data companies and their increasing lack of disclosure, and today’s was mostly about facebook. I’m looking forward to the rest of it — I’ve already had one or two nebulous research ideas, which is always a nice feeling.

From Terence Tao’s blog, I see that they have finished writing the first polymath project on A new proof of the density Hales-Jewett theorem. It’s a fascinating experiment in math research (or is it maths research?) and it’s great to see that it worked.

Inside Higher Ed has an opinion essay by UToronto’s Mark Kingwell arguing against the institution of tenure, or rather arguing that it should be revisited. Some of the comments are as interesting as the article itself, but it’s worth a read.

As an aside, I disagree with Kingwell, but I don’t have a fully articulated critique (yet).

My friend Reno has a California Bankruptcy Blog.

The ISIT 2010 site seems quite definitive, no? (h/t Pulkit.)

The Times has a nice profile of Martin Gardner.

My buddy, buildingmate at UCSD, and fellow MIT thespian Stephen Larson premiered the Whole Brain Catalog at the Society for Neuroscience conference.

A fascinating article on the US-Mexico border (h/t Animikwaan.)

Kanye West is an oddly compelling trainwreck. (via MeFi).

I was at the Allerton conference last week for a quick visit home — I presented a paper I wrote with Can Aysal and Alex Dimakis on analyzing broadcast-based consensus algorithms when channels vary over time. The basic intuition is that if you can get some long-range connections “enough of the time,” then you’ll get a noticeable speed up in reaching a consensus value, but with high connectivity the variance of the consensus value from the true average increases. It was a fun chance to try out some new figures that I made in OmniGraffle.

The plenary was given by Abbas El Gamal on a set of new lecture notes on network information theory that he has been developing with Young-Han Kim. He went through the organization of the material and tried to show how much of the treatment could be developed using robust typicality (see Orlitsky-Roche) and a few basic packing and covering lemmas. This in turn simplifies many of the proofs structurally and can give a unified view. Since I’ve gotten an inside peek at the notes, I can say they’re pretty good and clear, but definitely dense going at times. They are one way to answer the question of “well, what do we teach after Cover and Thomas.” They’re self-contained and compact .

What struck me is that these are really notes on network information theory and not on advanced information theory, which is the special topics class I took at Berkeley. It might be nice to have some of the “advanced but non-network” material in a little set of notes. Maybe the format of Körner’s Fourier Analysis book would work : it could cover things like strong converses, delay issues, universality, wringing lemmas, AVCs, and so on. Almost like a wiki of little technical details and so on that could serve as a reference, rather than a class. The market’s pretty small though…

I felt a bit zombified during much of the conference, so I didn’t take particularly detailed notes, but here were a few talks that I found interesting.

  • Implementing Utility-Optimal CSMA (Lee, Jinsung (KAIST), Lee, Junhee (KAIST), Yi, Yung (KAIST), Chong, Song (KAIST), Proutiere, Alexandre (Microsoft Research), Chiang, Mung (Princeton University)) — There are lots of different models and algorithms for distributed scheduling in interference-limited networks. Many of these protocols involve message passing and the overhead from the messages may become heavy. This paper looked at how to use the implicit feedback from CSMA. They analyze a simple two-link system with two parameters (access and hold) and then use what I though was a “effective queue size” scheduling method. Some of the analysis was pretty tricky, using stochastic approximation tools. There were also extensive simulation results from a real deployment.
  • LP Decoding Meets LP Decoding: A Connection between Channel Coding and Compressed Sensing (Dimakis, Alexandros G. (USC), Vontobel, Pascal O. (Hewlett-Packard Laboratories)) — This paper proved some connections between Linear Programming (LP) decoding of LDPC codes and goodness of compressed sensing matrices. A simple example : if a parity check matrix is good for LP decoding then it is also good for compressed sensing. In particular, if it can correct k errors it can detect k-sparse signals, and if it can correct a specific error pattern \vec{e} it can detect all signals whose support is the same as \vec{e}. There were many other extensions of this results as well, to other performance metrics, and also the Gaussian setting.
  • The Compound Capacity of Polar Codes (Hassani, Hamed (EPFL), Korada, Satish Babu (EPFL), Urbanke, Ruediger (EPFL)) — Rudi gave an overview of polar codes from two different perspectives and then showed that in general polar codes do not achieve the compound channel capacity, but it’s not clear if the problem is with the code or with the decoding algorithm (so that’s still an open question).
  • Source and Channel Simulation Using Arbitrary Randomness (Altug, Yucel (Cornell University), Wagner, Aaron (Cornell University)) — The basic source simulation problem is to simulate an arbitrary source Y_1^n with an iid process (say equiprobable coin flips) X_1^n. Using the information spectrum method, non-matching necessary and sufficient conditions can be found for when this can be done. These are in terms of sup-entropy rates and so on. Essentially though, it’s a theory built on the limits or support of the spectra of the two processes X and Y. If the entropy spectrum in X dominates that of Y then you’re in gravy. This paper proposed a more refined notion of dominance which looks a bit like majorization of some sort. I think they showed that was sufficient, but maybe it’s also necessary too. Things got a bit rushed towards the end.
  • Channels That Die (Varshney, Lav R. (MIT), Mitter, Sanjoy K. (MIT), Goyal, Vivek K (MIT)) — This paper looked at a channel which has two states, alive (a) and dead (d), and at some random time during transmission, the channel will switch from alive to dead. For example, it might switch according to a Markov chain. Once dead, it stays dead. If you want to transmit over this channel in a Shannon-reliable sense you’re out of luck, but what you can do is try to maximize the expected number of bits you get through before the channel dies. I think they call this the “volume.” So you try to send a few bits at a time (say b_1, b_2, \ldots b_k are the number of bits in each chunk). How do you size the chunks to maximize the volume? This can be solved with dynamic programming. There are still several open questions left, but the whole construction relies on using “optimal codes of finite blocklength” which also needs to be solved. Lav has some interesting ideas along these lines that he told me about when I was in Boston two weeks ago…
  • The Feedback Capacity Region of El Gamal-Costa Deterministic Interference Channel (Suh, Changho (UC Berkeley), Tse, David (UC Berkeley)) — This paper found a single letter expression for the capacity region, which looks like the Han-Kobayashi achievable region minus the two weird 2 R_i + R_j constraints. Achievability uses Han-Kobayashi in a block-Markov encoding with backward decoding, and the converse uses the El Gamal-Costa argument with some additional Markov properties. For the “bit-level” deterministic channel model, there is an interpretation of the feedback as filling a “hole” in the interference graph.
  • Upper Bounds to Error Probability with Feedback (Nakiboglu, Baris (MIT), Zheng, Lizhong (MIT)) — This is a follow-on to their ISIT paper, which uses a kind of “tilted posterior matching” (a phrase which means a lot to people who really know feedback and error exponents in and out) in the feedback scheme. Essentially there’s a tradeoff between getting close to decoding quickly and minimizing the error probability, and so if you change the tilting parameter in Baris’ scheme you can do a little better. He analyzes a two phase scheme (more phases would seem to be onerous).
  • Infinite-Message Distributed Source Coding for Two-Terminal Interactive Computing (Ma, Nan (Boston University), Ishwar, Prakash (Boston University)) — This looked at the interactive function computation problem, in which Alice has X^n, Bob has Y^n and they want to compute some functions \{f_A(X_i,Y_i) : i \le n\} and \{f_B(X_i,Y_i) : i \le n\}. The pairs (X_i, Y_i) are iid and jointly distributed according to some distribution p_{XY}. As an example, (X_i,Y_i) could be correlated bits and f_A and f_B could be the AND function. In previous work the authors characterized the the rate region (allowing vanishing probability of error) for t rounds of communication, and here they look at the case when there are infinitely many back-and-forth messages. The key insight is to characterize the sum-rate needed as a function of p_{XY} — that is, to look at the function R : \mathcal{P}_{XY} \to \mathbb{R} and look at that surface’s analytic properties. In particular, they show it’s the minimum element of a class \mathcal{F} of functions which have some convexity properties. There is a preprint on ArXiV.

From The Information Theory of Emotions of Musical Chords, posted on ArXiV today:

My basic hypothesis is as follows. While perceiving a separate musical chord, the value of a relative
objective function L that is directly related to the main proportion of pitches of its constituent
sounds is generated in the mind of the subject. The idea of an increasing value of the objective
function ( L>1 ) accompanied by positive utilitarian emotions corresponds to a major chord, while
the idea of a decreasing value of the objective function ( L<1 ) accompanied by negative utilitarian
emotions corresponds to a minor chord.

I naturally thought of The Mathematics of Love:

Due to extensive traveling around. At the end of August and beginning of September I attended a workshop at the American Institute of Mathematics on Permanents and modeling probability distributions. It was a lot of fun, and a lot different than any workshop I’d been to before. There were only 20 or so participants and every day we’d break out in smaller groups to actually work on some sub-problem or area. By the end of the week I felt like I’d had a crash course in large-alphabet probability estimation and also got a sense of the huge range of problems and techniques (from exchangeable random partition processes to Markov chain Monte Carlo) involved in tackling them.

Last week I gave a talk on gossip algorithms at UC Davis and got a chance to talk with a lot of people there. It was a great trip, except for the 6:30 flight time out of San Diego. Then last weekend there were heirloom tomatoes:

Heirloom Tomato with bread and Cowgirl Creamery Red Tail cheese

Heirloom Tomato with bread and Cowgirl Creamery Red Tail cheese


And also peach pie:
Peach pie...

Peach pie...


Mmmm, pie.

Next Page »