ISIT 2010 : some talks on statistics and probability

For many of the talks I attended I didn’t take notes — partly because I didn’t feel expert enough to note things down correctly, and partly because I am

RÉNYI DIVERGENCE AND MAJORIZATION (Tim van Erven; Centrum Wiskunde & Informatica, Peter Harremoës; Copenhagen Business College)
Peter gave a talk on some properties of the Rényi divergence

D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \int p^{\alpha} q^{1 - \alpha} d \mu

where \mu is a dominating measure, and a related geometric quantity, the Lorenz diagram. This is the set of all points in \mathbb{R}^2 of the form (\int f dP, \int f dQ) for some measurable function f taking values in [0,1]. It turns out that subset relationships of Lorenz diagrams are the same as majorization relationships between the measures. This is related to something called Markov ordering from a recent paper by Gorban, Gorban, and Judge. This was one of those nice talks which pointed me towards some results and tools that are new to me but that I could not fully understand during the talk.
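For discrete distributions (with the counting measure playing the role of the dominating measure \mu), the divergence above reduces to a finite sum. A minimal sketch, my own and not from the talk:

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) = (1/(alpha-1)) * log sum_i p_i^alpha q_i^(1-alpha).

    Assumes the support of p is contained in the support of q.
    """
    if alpha == 1:  # the limit alpha -> 1 recovers the KL divergence
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q) if pi > 0)
    return math.log(s) / (alpha - 1)

p = [0.5, 0.5]
q = [0.9, 0.1]
d_half = renyi_divergence(p, q, 0.5)  # Renyi divergence is nondecreasing in alpha
d_two = renyi_divergence(p, q, 2.0)
```

One useful sanity check: D_{\alpha}(P \| P) = 0 for every \alpha, and the divergence is monotone in \alpha.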

MINIMAX LOWER BOUNDS VIA F-DIVERGENCES (Adityanand Guntuboyina; Yale University)
This talk was on minimax lower bounds for parameter estimation which can be calculated in terms of generalized f-divergences, which are of the form

D_f(P \| Q) = \int f \left( \frac{dP}{dQ} \right) dQ

Suppose T is an estimator for some parameter \theta. We want universal (over T) lower bounds on \mathbb{P}(T(X^n) \ne \theta). This gives a generalization of Fano’s inequality:

\mathbb{P}(T(X^n) \ne \theta) \ge \inf_Q \sum_{\theta} D_f(P_{\theta} \| Q)

This has a very geometric feel to it but I had a bit of a hard time following all of the material since I didn’t know the related literature very well (much of it was given in relation to Yang and Barron’s 1999 paper).
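The f-divergence above is easy to compute in the discrete case, where it becomes \sum_i q_i f(p_i/q_i): choosing f(t) = t \log t recovers the KL divergence, and f(t) = |t-1|/2 the total variation distance. A small illustrative sketch (the function names are my own):

```python
import math

def f_divergence(p, q, f):
    """Discrete D_f(P||Q) = sum_i q_i * f(p_i / q_i), assuming q_i > 0 on the support."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q) if qi > 0)

p = [0.5, 0.5]
q = [0.25, 0.75]
kl = f_divergence(p, q, lambda t: t * math.log(t))   # KL divergence
tv = f_divergence(p, q, lambda t: abs(t - 1) / 2)    # total variation
```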

MUTUAL INFORMATION SADDLE POINTS IN CHANNELS OF EXPONENTIAL FAMILY TYPE (Todd Coleman; University of Illinois, Maxim Raginsky; Duke University)
In some cases where we have a class of channels and a class of channel inputs, there is a saddle point. The main example is that Gaussian noise is the worst for a given power constraint and Gaussian inputs are the best. Another example is the “bits through queues” channel. This paper gave a more general class of channels of “exponential family type” and a more general condition under which you get a saddle point for the mutual information. The channel models are related to the Gibbs variational principle, and the arguments had a kind of “free energy” interpretation. Ah, statistical physics rears its head again.

INFORMATION-THEORETIC BOUNDS ON MODEL SELECTION FOR GAUSSIAN MARKOV RANDOM FIELDS (Wei Wang, Martin J. Wainwright, Kannan Ramchandran; University of California, Berkeley)
This was on two problems: inferring edges in a graphical model from iid samples from that model, and inverse covariance estimation. They are similar, but not quite the same. The goal was to prove necessary conditions for doing this; these necessary conditions match the corresponding achievable rates from polytime algorithms. The main result was that you need n = \Omega( d^2 \log p) samples, where p is the number of variables, and d is the max degree in the graphical model. The proof approach was through modeling graph selection as channel coding in which the channel outputs samples from a graphical model chosen by the encoder. If the decoder can identify the model then the problem is solvable. This is analogous to the talk Sriram Vishwanath gave today on matrix completion.

ISIT 2010 : Michael Jordan

Perhaps malapropos for the NBA Finals, Prof. Michael Jordan gave the first plenary talk at ISIT. It was a great overview of nonparametric Bayesian modeling. In particular, he covered his favorite Chinese restaurant process (along with its stick-breaking and Pitman-Yor relatives), hierarchical Dirichlet priors, and all the other jargon-laden elements of modeling. At the end he covered some of the rather stunning successes of this approach in applications with lots of data to learn from. What was missing for me was a sense of how these approaches worked in the data-poor regime, so I asked a question (foolishly) about sample complexity. Alas, since that is a “frequentist” question and Jordan is a “Bayesian,” I didn’t quite get the answer to the question I was trying to ask, but that’s what happens when you don’t phrase things properly.

One nice thing that I learned was the connection to Kingman’s earlier work on characterizing random measures via non-homogeneous Poisson processes. Kingman has been popping up all over the place in my reading, from urn processes to exchangeable partition processes (also known as paintbox processes). When I get back to SD, it will be back to the classics for me!

Shannon theory helps decipher Pictish?

Well, if not decipher, at least claim that there is something to read. A recent paper claims that Pictish inscriptions are a form of written language:

Lo and behold, the Shannon entropy of Pictish inscriptions turned out to be what one would expect from a written language, and not from other symbolic representations such as heraldry.

The full paper has more details. From reading the popular account I thought it was just a simple hypothesis test using the empirical entropy as a test statistic and “heraldry” as the null hypothesis, but it is a little more complicated than that.

After identifying the set of symbols in Pictish inscriptions, the question is how related adjacent symbols are to each other. That is, can the symbols be read sequentially? What they do is renormalize Shannon’s F_2 statistic (from the paper “Prediction and entropy of printed English”), which is essentially the empirical entropy of the current symbol conditioned on the preceding symbol. They compute:

U_r = F_2 / \log\left( \frac{N_d}{N_u} \right)

where N_d and N_u are the number of distinct di-grams and uni-grams, respectively. Why normalize? The statistic F_2 by itself does not discriminate well between semasiographic systems (symbolic systems like heraldry) and lexigraphic systems (e.g. alphabets or syllabaries).
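A toy illustration of the normalized statistic (my own sketch: it takes N_d and N_u to be counts of distinct di-grams and uni-grams, estimates F_2 from empirical di-gram frequencies, and the toy string is obviously not a Pictish corpus):

```python
import math
from collections import Counter

def u_r(symbols):
    """U_r = F_2 / log(N_d / N_u); assumes N_d > N_u so the log is positive."""
    digrams = list(zip(symbols, symbols[1:]))
    contexts = Counter(symbols[:-1])   # counts of the conditioning symbol
    di = Counter(digrams)
    n = len(digrams)
    # F_2 = empirical entropy of the next symbol given the preceding one
    f2 = 0.0
    for (a, b), c in di.items():
        p_ab = c / n                   # joint di-gram frequency
        p_b_given_a = c / contexts[a]  # conditional frequency
        f2 -= p_ab * math.log2(p_b_given_a)
    n_d = len(di)                      # distinct di-grams
    n_u = len(set(symbols))            # distinct uni-grams
    return f2 / math.log2(n_d / n_u)

u = u_r("abracadabra")
```

For "abracadabra" there are 5 distinct uni-grams and 7 distinct di-grams, so the normalizer is \log(7/5).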

Another feature which the authors think is important is the number of digrams which are repeated in the text. If S_d is the number of digrams appearing once and T_d is the total number of digrams, they use a “di-gram repetition factor”

C_r = \frac{N_d}{N_u} + a \cdot \frac{S_d}{T_d}

where the tradeoff factor a is chosen via cross-validation on known corpora.
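A matching sketch for the repetition factor (again my own illustration; the value of a below is an arbitrary placeholder, since the paper’s cross-validated value is not reproduced here):

```python
from collections import Counter

def c_r(symbols, a=7.0):  # a = 7.0 is a made-up illustrative value
    """C_r = N_d/N_u + a * S_d/T_d for a sequence of symbols."""
    digrams = Counter(zip(symbols, symbols[1:]))
    n_d = len(digrams)                                # distinct di-grams
    n_u = len(set(symbols))                           # distinct uni-grams
    s_d = sum(1 for c in digrams.values() if c == 1)  # di-grams appearing once
    t_d = sum(digrams.values())                       # total di-grams
    return n_d / n_u + a * s_d / t_d

score = c_r("abracadabra")
```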

They then propose a two-step decision process. First they compare C_r to a threshold — if it is small then they deem the system to be more “heraldic”. If C_r is large then they do a three-way decision based on U_r. If U_r is small then the text corresponds to letters; if larger, syllables; and larger still, words.

In this paper “entropy” is used as a statistic with discriminatory value. It is not clear a priori that human writing systems should display empirical entropies with particular values, but since the statistic works well on other known corpora, it seems like reasonable evidence. I think the authors are relatively careful about this, which is nice, since the popular coverage might make one think that purported alien transmissions could easily fall to a similar analysis. Maybe that’s how Jeff Goldblum managed to get his Mac to reprogram the alien ship in Independence Day.

Update: I forgot to link to a few related things. The statistics in this paper are a little more convincing than those in the work on the Indus script (see Cosma’s lengthy analysis). In particular, they do a somewhat better job of justifying their statistic as discriminating on known corpora. Pictish would seem to be woefully undersampled, so it is important to justify the statistic as discriminatory for small data sets.

paper a day (after a long hiatus) : Assouad, Fano, and Le Cam

Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. People who read this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.

Suppose we have a parametric family of distributions \mathcal{P} = \{P_{\theta} : \theta \in \mathcal{D}\}, where \mathcal{D} is a metric space with metric d(\cdot,\cdot). For two distributions P_1 and P_2 define the affinity \|P_1 \wedge P_2 \| by:

\|P_1 \wedge P_2 \| = 1 - \frac{1}{2} \| P_1 - P_2 \|_1
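In the discrete case the affinity is immediate to compute; it equals 1 when the distributions coincide and 0 when their supports are disjoint. A quick sketch:

```python
def affinity(p1, p2):
    """||P1 ^ P2|| = 1 - (1/2) * ||P1 - P2||_1 for discrete distributions."""
    return 1 - 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))

aff = affinity([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])  # overlapping supports
```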

Let \mathop{\rm co}(\cdot) denote the convex hull. Then Le Cam’s lemma is the following.

Le Cam’s Lemma. Let \hat{\theta} be an estimator of \theta(P) on \mathcal{P}. Let D_1 and D_2 be two sets such that d(s_1,s_2) \ge 2 \delta for all (s_1,s_2) \in D_1 \times D_2, and let \mathcal{P}_1 and \mathcal{P}_2 be two subsets of \mathcal{P} such that \theta(P) \in D_i when P \in \mathcal{P}_i. Then

\sup_{P \in \mathcal{P}} \mathbb{E}_P[ d(\hat{\theta}, \theta(P)) ] \ge \delta \cdot \sup_{P_i \in \mathop{\rm co}(\mathcal{P}_i)} \| P_1 \wedge P_2 \|

This lemma gives a lower bound on the error of parameter estimates in terms of the total variation distance between the distributions associated to different parameter sets. It’s a bit different from the bounds we usually think of, like Stein’s lemma, and also different from bounds like the Cramér-Rao bound.

Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by H = \{-1, 1\}^m, the vertices of the hypercube. We’ll write t \sim_j t' for t,t' \in H if they differ in the j-th coordinate.

Assouad’s Lemma. Let \mathcal{F}_m = \{P_{t} : t \in H\} be a set of 2^m probability measures indexed by H, and suppose there are m pseudo-distances d_1, \ldots, d_m on \mathcal{D} such that for any pair (x,y)

d(x,y) = \sum_{j=1}^m d_j(x,y)

and that if t \sim_j t'

d_j( \theta(P_t), \theta(P_{t'}) ) \ge \alpha_m

Then

\max_{P_t \in \mathcal{F}_m} \mathbb{E}_{t}[ d(\hat{\theta},\theta(P_t))] \ge m \cdot \frac{\alpha_m}{2} \min\{ \|P_t \wedge P_{t'} \| : t \sim_j t', j \le m\}

The min comes about because the bound is weakest over all neighbors (that is, over all j) of P_t in the hypercube. Assouad’s Lemma has been used in various places, from covariance estimation to learning and other minimax problems.
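To see the lemma in action, here is a standard illustrative instance (my own example, for a single observation) with product Bernoulli measures indexed by the hypercube:

```latex
% Product Bernoulli instance: for t \in \{-1,1\}^m let
P_t = \prod_{j=1}^m \mathrm{Bern}\Big( \tfrac{1 + \delta t_j}{2} \Big),
\qquad
\theta(P_t) = \Big( \tfrac{1 + \delta t_1}{2}, \ldots, \tfrac{1 + \delta t_m}{2} \Big),
\qquad d = \ell_1 .
% If t \sim_j t', the parameters differ only in coordinate j by \delta, so \alpha_m = \delta.
% The product measures also differ only in that coordinate, so
\|P_t \wedge P_{t'}\| = 1 - \tfrac{1}{2} \|P_t - P_{t'}\|_1 = 1 - \delta .
% Assouad's lemma then gives
\max_{t} \mathbb{E}_t\big[ d(\hat{\theta}, \theta(P_t)) \big] \ge m \cdot \tfrac{\delta}{2} (1 - \delta),
% which is maximized at \delta = 1/2, yielding the dimension-dependent lower bound m/8.
```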

Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.

Fano’s Lemma. Let \mathcal{M}_r \subset \mathcal{P} contain r probability measures such that for all j \ne j' with j,j' \le r

d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r

and

D(P_j~\|~P_{j'}) \le \beta_r

Then

\max_j \mathbb{E}_j[ d(\hat{\theta},\theta(P_j)) ] \ge \frac{\alpha_r}{2}\left( 1 - \frac{\beta_r + \log 2}{\log r} \right)

Here D(\cdot\|\cdot) is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message W uniformly in \{1, 2, \ldots, r\} and then setting the output Y to have the distribution P_j conditioned on W = j.
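Plugging numbers in shows how the bound behaves; here is a sketch evaluating the right-hand side for illustrative (made-up) values of \alpha_r, \beta_r, and r:

```python
import math

def fano_lower_bound(alpha_r, beta_r, r):
    """RHS of the Fano-type bound: (alpha_r/2) * (1 - (beta_r + log 2) / log r)."""
    return (alpha_r / 2) * (1 - (beta_r + math.log(2)) / math.log(r))

# more hypotheses (larger r) or smaller pairwise KL (beta_r) strengthen the bound
lb = fano_lower_bound(alpha_r=1.0, beta_r=0.5, r=16)
```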

The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.

Rationing healthcare

A recurring link in my facebook feed today was to an article in the NY Times Magazine on rationing health care. It’s worth reading, but this passage made me squirm a bit:

But even in emergency rooms, people without health insurance may receive less health care than those with insurance. Joseph Doyle, a professor of economics at the Sloan School of Management at M.I.T., studied the records of people in Wisconsin who were injured in severe automobile accidents and had no choice but to go to the hospital. He estimated that those who had no health insurance received 20 percent less care and had a death rate 37 percent higher than those with health insurance. This difference held up even when those without health insurance were compared with those without automobile insurance, and with those on Medicaid — groups with whom they share some characteristics that might affect treatment. The lack of insurance seems to be what caused the greater number of deaths.

Oh correlation/causation fallacy, I long for your demise. At least he said “seems.”

ISIT 2009 : plenaries and the Shannon Award

As I mentioned earlier, Rich Baraniuk’s plenary was quite energetic and entertaining. David Tse gave the next plenary, called It’s Easier to Approximate, mainly building on his recent string of work with Raul Etkin, Hua Wang, Suhas Diggavi, Salman Avestimehr, Guy Bresler, Changho Suh, and Mohammad Ali Maddah-Ali (and others too I imagine). He motivated the idea of using deterministic models for multiterminal Gaussian problems essentially by appealing to the idea of building approximation algorithms, although the connection to that community wasn’t made as explicitly (cf. Michelle Effros’s plenary). David is also a great speaker, so even though I had seen a lot of these results before from being at Berkeley, the talk helped really put it all together. Raymond Yeung gave another great talk on information inequalities and for the first time I think I understood the point of “non-Shannon information inequalities” in terms of their connections to other disciplines. Hopefully I’ll get around to posting something about the automatic inequality provers out there. Unfortunately I missed all of Friday, so I didn’t get to see Noga Alon’s plenary, but I’m sure it was great too. Does anyone else care to comment?

The Shannon Lecture was given, of course, by Jorma Rissanen, and it was on a basic question in statistics: model selection. Rissanen developed the Minimum Description Length (MDL) principle, which I had always understood in a fuzzy sense to mean that you choose a model which is easy to describe information theoretically, but I never had a good handle on what that meant besides taking some logs. The talk was peppered with good bits of philosophy. One which stood out to me was that our insistence that there be a “true model” for the data often leads to problems. That is, sometimes we’re better off not assuming that the data was generated according to a particular model, but rather focusing on finding the best-fitting model in our class. I got to chat with Rissanen later about this and pointed out that it’s a bit like religion to assume an underlying true distribution. Another great tidbit was his claim that “nonparametric models are nonsense,” by which he meant that essentially every model is parametric — it’s just that sometimes the number of parameters is mind-bogglingly huge. The most interesting thing is that there were new results in the Shannon lecture, and Rissanen is working on a paper with those results now!

The big news was of course that Te Sun Han is going to be next year’s Shannon Awardee. I was very happy to hear this — I’ve been hoping he would win it for a while now, so I had a big grin on my face leaving the banquet…

Infocom 2009 : network statistics problems

Graph Sampling Techniques for Studying Unstructured Overlays
Amir Hassan Rasti Ekbatani (University of Oregon, US); Mojtaba Torkjazi (University of Oregon, US); Reza Rejaie (University of Oregon, US); Nick Duffield (AT&T Labs – Research, US); Walter Willinger (AT&T Labs – Research, US); Daniel Stutzbach (Stutzbach Enterprises LLC, US)
The question in this paper was how to collect network statistics (such as node degree) in peer-to-peer networks. Since peers join and leave, selecting a random peer is difficult because there are biases caused by high-degree peers and low-lifetime peers. One approach to correct for this is to use Markov chain Monte Carlo (MCMC) to sample from the nodes. This talk proposed respondent-driven sampling (RDS), a technique used to map social networks. The idea is to do a random walk on the network and gather statistics along the walk (not just at the end, as in MCMC). These measurements are biased, but the bias can be corrected at the end (using something called the Hansen-Hurwitz estimator, which I don’t know much about). Experiments on hierarchical scale-free graphs showed a big improvement for RDS over vanilla MCMC.
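The bias-correction idea can be seen in a toy simulation (my own sketch, not the paper’s estimator in detail): a random walk visits nodes with probability proportional to degree, so reweighting each observation by 1/degree, Hansen-Hurwitz style, recovers an unbiased node average.

```python
import random

# a made-up graph: a star plus a triangle, so degrees are very unequal
graph = {
    0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0, 4], 4: [0, 3],
}

random.seed(0)
node = 0
visits = []
for _ in range(20000):
    node = random.choice(graph[node])  # one step of the random walk
    visits.append(node)

deg = lambda v: len(graph[v])
# naive average over visits over-weights high-degree nodes (biased upward)
naive = sum(deg(v) for v in visits) / len(visits)
# self-normalized importance weighting by 1/degree removes the bias
w = [1 / deg(v) for v in visits]
corrected = sum(wi * deg(vi) for wi, vi in zip(w, visits)) / sum(w)
true_mean = sum(deg(v) for v in graph) / len(graph)  # = 2.0
```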

A Framework for Efficient Class-based Sampling
Mohit Saxena (Purdue University, US); Ramana Rao Kompella (Purdue University, US)
This talk was about increasing the coverage of monitoring flows in networks. Currently routers sample flows uniformly, which catches the big “elephant” flows but not the low-volume “mouse” flows. In order to increase coverage, the authors propose dividing flows into classes and then sample the classes. So the architecture is to classify a packet first, then choose whether to sample it. The goal is to do a combination of sketching and hashing that shifts the complexity to slower memory, thereby saving costs. They use something called a composite Bloom filter to do this.
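A plain Bloom filter, the building block mentioned above, is easy to sketch; the paper’s composite structure and class-based sampling logic are more involved, so the class below is only a generic illustration:

```python
import hashlib

class BloomFilter:
    """Set-membership sketch: no false negatives, tunable false-positive rate."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)

    def _positions(self, item):
        # derive k bit positions from salted SHA-256 hashes
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("flow:10.0.0.1->10.0.0.2")  # a hypothetical flow identifier
```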

Estimating Hop Distance Between Arbitrary Host Pairs
Brian Eriksson (University of Wisconsin – Madison, US); Paul Barford (University of Wisconsin – Madison, US); Rob Nowak (University of Wisconsin, Madison, US)
Given a big network (like the internet) with N end hosts and M monitors, how can we estimate the hop distance between two end hosts from measurements between end hosts and monitors? This paper proposes using passive measurements, so that there may be many holes in the (N+M) \times (N+M) matrix of pairwise measurements. Their approach is to use a weighted version of multidimensional scaling (MDS) to deal with the holes. This approach can also be extended to deal with side information about the network topology (for example autonomous system IDs).