ISIT 2010 : a few more talks

I think I lack the willpower to write up more notes on talks, and there are other things I’d like to blog about, but here are one or two sentences on some other talks that I found interesting… I also enjoyed the energy session on Friday and the talks by Sundeep and Galen on compressed sensing, but time has gotten the better of me. Next time, folks.

Channel Intrinsic Randomness
Matthieu Bloch
This was on extracting random bits from the output of a noisy channel; the bits should be independent of the input to the channel. Matthieu uses the enigmatic information spectrum method to get his results — thanks to the plenary lecture I was able to understand it a bit better than I might have otherwise.

Assisted Common Information
Vinod Prabhakaran and Manoj Prabhakaran
I was very interested in this talk because I have been thinking of a related problem. Two terminals observe correlated sources X_1^n and Y_1^n respectively. A genie observes both sources and sends messages at rates R_1 and R_2 to the two terminals, who then have to produce variables W_1 and W_2 which have large entropies and are also equal with high probability. This problem is connected to the Gács-Körner problem and Wyner’s common information problem, and also possibly to this recent preprint by Andrej Bogdanov and Elchanan Mossel. They manage to solve it using a novel construction of “monotone regions.”

Patterns and Exchangeability
N. P. Santhanam and M. Madiman
This work grew out of the AIM workshop on permanents. De Finetti’s theorem says an exchangeable process can be built up as a mixture of iid processes. Kingman showed that an exchangeable partition process is built up from what he called “paintbox processes.” One thing we discovered at the workshop was that the pattern process of an iid process is the same as a paintbox process (and vice versa). The paper then works through many connections between these processes, certain limits of graphs, and universal compression.

Universal Hypothesis Testing in the Learning-Limited Regime
Benjamin G. Kelly, Thitidej Tularak, Aaron B. Wagner, and Pramod Viswanath
This was a really great talk. The problem here is that for each n you get n samples X^n distributed i.i.d. according to p_n or q_n on an alphabet \mathcal{A}_n which can grow with n. Given a new sequence Z^n you have to decide whether it was generated according to p_n or q_n. They show a number of results: consistency is possible when |\mathcal{A}_n| grows sublinearly in n, impossible when it grows quadratically in n, and various intermediate results. In particular, for well-behaved distributions with |\mathcal{A}_n| = \Theta(n^{\alpha}) in which all probabilities are \Theta(n^{-\alpha}), they can get some consistency results, but the generalized likelihood ratio test (GLRT) is inconsistent for \alpha = 1.

Feature Extraction for Universal Hypothesis Testing via Rank-Constrained Optimization
Dayu Huang and Sean Meyn
This talk was of interest to me because I have been looking at hypothesis testing problems in connection with election auditing. In universal testing you know a lot about the distribution under one hypothesis, but much less about the other. The Hoeffding test is a threshold test on the KL divergence between the empirical distribution of the data and the known hypothesis. This test is asymptotically optimal but has high variance when the data is high-dimensional, so for smaller sample sizes a so-called mismatched divergence test may be better. In this paper they look at how to trade off the variance and the error exponent of the test.
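To make the Hoeffding test concrete, here is a minimal sketch (mine, not from the paper) for a finite alphabet: form the empirical distribution of the samples and threshold its KL divergence to the known hypothesis. The null distribution p0, the alternative q, and the threshold eta are made-up illustrative values.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence D(p || q) for discrete distributions (natural log)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def hoeffding_test(samples, p0, eta):
    """Declare 'null' if D(empirical || p0) <= eta, otherwise 'alternative'."""
    counts = np.bincount(samples, minlength=len(p0))
    p_hat = counts / counts.sum()
    return "null" if kl_divergence(p_hat, p0) <= eta else "alternative"

# toy example: known hypothesis p0, second data set actually drawn from a different q
rng = np.random.default_rng(0)
p0 = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
print(hoeffding_test(rng.choice(3, size=200, p=p0), p0, eta=0.05))  # typically "null"
print(hoeffding_test(rng.choice(3, size=200, p=q), p0, eta=0.05))   # typically "alternative"
```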

ISIT 2010 : Abbas El Gamal and Te Sun Han

I seem to have gotten all behind on wrapping up the ISIT blogging, so the remainder may be more compressed takes on things. Not compressed in the compressed sensing sense, where the signals are sparse and my comments would let you reconstruct the talks, but more like lossy compression with D \to \sigma^2 (in the Gaussian case).

Abbas El Gamal gave a very nice plenary on “Coding for Noisy Networks” in which he surveyed a lot of different eras and streams of work on network information theory and tried to tie them together in a single conceptual framework. There was a nice mix of older and newer results. The thing I liked best was that he was very optimistic about making progress on understanding how to communicate over networks from an information-theoretic perspective, which counteracts the sentiment I have heard that “well, it’s just too messy.”

Te Sun Han gave the Shannon Lecture, of course, and he used his time to give a tutorial on the information spectrum method. I had tried to read the book earlier and honestly found it a little impenetrable (or rather, I wasn’t sure what I was supposed to use from it). The talk was more like reading the papers — concisely stated, but with a clear line of intuition. I know some people are not big fans of Shannon Lectures as tutorials, but there is also a case to be made that most people are unfamiliar with the information spectrum method. A nice example he gave was showing when the output of an optimal source coder looks “completely random.” Maybe this has been done already, but is there a connection between existing theories of pseudorandomness and the information spectrum method?

ISIT 2010 : error analysis and asynchrony

The remaining blogging from ISIT on my end will probably be delayed a bit longer since I have camera-ready deadlines for EVT and ITW, a paper with Can Aysal for the special SP gossip issue that is within epsilon of getting done (and hence keeps getting put off), and a paper for Allerton with Michele Wigger and Young-Han Kim that needs some TLC. Phew! Maybe Alex will pick up the slack… (hint, hint)

Asynchronous Capacity per Unit Cost (Venkat Chandar, Aslan Tchamkerten, David Tse)
This paper was the lone non-adversarial coding paper in the session my paper was in. Aslan talked about a coding model in which B bits arrive at the encoder at a random time within a large window [0,A]. The bits have to be transmitted to the receiver with delay less than D. The decoder hears noise when the encoder is silent, and furthermore doesn’t know when the encoder starts transmitting; they both know A, however. They do a capacity per unit cost analysis for this system and give capacity results in various cases, depending on whether or not there is a 0-cost symbol, whether or not the delay is subexponential in the number of bits to be sent, and so on.

Moderate Deviation Analysis of Channel Coding: Discrete Memoryless Case (Yucel Altug, Aaron Wagner)
This was a paper that looked at a kind of intermediate behavior in summing and normalizing random variables. If you sum up n iid variables and normalize by n you get the law of large numbers and large deviations analysis. If you normalize by \sqrt{n} you get a central limit theorem (CLT). What if you normalize by n^{2/3}, for example? It turns out you get something like
\mathbb{P}( \frac{1}{n^{\beta}} \sum_{i=1}^{n} X_i \ge \epsilon ) \le \exp( -n^{2 \beta - 1} I_{\mathcal{N}}(\epsilon) )
where I_{\mathcal{N}}(\epsilon) is the rate function for the Gaussian you get out of the CLT. This is useful for error exponent analysis because it lets you get a handle on how fast the error can decay if your sequence of codes has rate tending to capacity. They use these techniques to show that for a sequence of codes with gaps \epsilon_n to capacity, the error decays like \exp(-n \epsilon_n^2/(2\sigma)), where \sigma is the dispersion of the channel.
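As a quick numerical sanity check of the displayed bound (my own sketch, not from the paper), take centered \pm 1 variables so that \sigma^2 = 1 and I_{\mathcal{N}}(\epsilon) = \epsilon^2/2; for this choice the right-hand side is exactly Hoeffding’s inequality, so the Monte Carlo estimate should sit below the bound for every n. The parameter choices are mine.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, eps, trials = 0.6, 0.5, 200_000   # 1/2 < beta < 1 is the moderate-deviations regime

for n in [100, 1_000, 10_000, 100_000]:
    # S_n for Rademacher (+/-1) variables: S_n = 2*Binomial(n, 1/2) - n
    s = 2 * rng.binomial(n, 0.5, size=trials) - n
    p_hat = np.mean(s >= eps * n**beta)
    bound = np.exp(-(n ** (2 * beta - 1)) * eps**2 / 2)   # exp(-n^{2*beta-1} * eps^2 / (2*sigma^2))
    print(f"n={n:>7d}  P_hat={p_hat:.4f}  bound={bound:.4f}")
```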

Minimum energy to send k bits with and without feedback (Yury Polyanskiy, H. Vincent Poor, Sergio Verdu)
This was on the minimum energy needed to send k bits, but in the non-asymptotic regime (in keeping philosophically with the authors’ earlier line of work), for the AWGN channel. Without feedback they show an achievability result and a converse for sending M codewords with error \epsilon and energy E. These give matching scalings up to the first three terms as E \to \infty. The first term is (E/N_0) \log e. Interestingly, with feedback, they show that the first term is 1/(1 - \epsilon) times the term without feedback, which makes sense since as the error goes to 0 the rates should coincide. They also have a surprising result showing that with energy E > k N_0 you can send 2^k messages with 0 error.

ISIT 2010 : some talks on statistics and probability

For many of the talks I attended I didn’t take notes — partly this is because I didn’t feel expert enough to note things down correctly, and partly because I am…

RÉNYI DIVERGENCE AND MAJORIZATION (Tim van Erven; Centrum Wiskunde & Informatica, Peter Harremoës; Copenhagen Business College)
Peter gave a talk on some properties of the Rényi divergence

D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \int p^{\alpha} q^{1 - \alpha} d \mu

where \mu is a dominating measure, and a related geometric object, the Lorenz diagram. This is the set of all points in \mathbb{R}^2 of the form (\int f dP, \int f dQ) for some function f taking values in [0,1]. It turns out that subset relationships between Lorenz diagrams are the same as majorization relationships between the measures. This is related to something called the Markov ordering from a recent paper by Gorban, Gorban, and Judge. This was one of those nice talks which pointed me towards results and tools that are new to me but that I could not fully absorb during the talk.
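For discrete distributions the integral is just a finite sum, so the Rényi divergence is easy to compute directly. A minimal sketch (the distributions and values of \alpha are made up):

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P || Q) = log( sum_x p(x)^alpha q(x)^(1-alpha) ) / (alpha - 1), natural log."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, 1.0):   # alpha -> 1 recovers the ordinary KL divergence
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return float(np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
for a in [0.5, 1.0, 2.0]:
    print(a, renyi_divergence(p, q, a))   # nondecreasing in alpha
```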

MINIMAX LOWER BOUNDS VIA F-DIVERGENCES (Adityanand Guntuboyina; Yale University)
This talk was on minimax lower bounds for parameter estimation that can be calculated in terms of generalized f-divergences, which are of the form

D_f(P \| Q) = \int f \left( \frac{dP}{dQ} \right) dQ

Suppose T is an estimator for some parameter \theta. We want universal (over T) lower bounds on \mathbb{P}(T(X^n) \ne \theta). This gives a generalization of Fano’s inequality:

\mathbb{P}(T(X^n) \ne \theta) \ge \inf_Q \sum_{\theta} D_f(P_{\theta} \| Q)

This has a very geometric feel to it, but I had a bit of a hard time following all of the material since I didn’t know the related literature very well (much of it was presented in relation to Yang and Barron’s 1999 paper).
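As a concrete illustration of the D_f family itself (not of the bounds in the talk), here is a small sketch computing a few standard f-divergences for discrete distributions; the distributions are made up and the choices of f are the usual ones.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) f(p(x)/q(x)) for discrete P, Q with q > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# standard convex functions f with f(1) = 0 (assuming p > 0 here, so t > 0)
kl = lambda t: t * np.log(t)           # f(t) = t log t   ->  KL divergence
tv = lambda t: 0.5 * np.abs(t - 1)     # f(t) = |t-1|/2   ->  total variation
chi2 = lambda t: (t - 1) ** 2          # f(t) = (t-1)^2   ->  chi-square divergence

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
for name, f in [("KL", kl), ("TV", tv), ("chi^2", chi2)]:
    print(name, f_divergence(p, q, f))
```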

MUTUAL INFORMATION SADDLE POINTS IN CHANNELS OF EXPONENTIAL FAMILY TYPE (Todd Coleman; University of Illinois, Maxim Raginsky; Duke University)
In some cases where we have a class of channels and a class of channel inputs, there is a saddle point. The main example is that Gaussian noise is the worst noise for a given power constraint and Gaussian inputs are the best inputs. Another example is the “bits through queues” problem. This paper considered a general class of channels of “exponential family type” and gave conditions under which the mutual information has a saddle point. The channel models are related to the Gibbs variational principle, and the arguments had a kind of “free energy” interpretation. Ah, statistical physics rears its head again.

INFORMATION-THEORETIC BOUNDS ON MODEL SELECTION FOR GAUSSIAN MARKOV RANDOM FIELDS (Wei Wang, Martin J. Wainwright, Kannan Ramchandran; University of California, Berkeley)
This was on two problems: inferring the edges in a graphical model from iid samples from that model, and inverse covariance estimation. They are similar, but not quite the same. The goal was to prove necessary conditions for solving these problems; these necessary conditions match the corresponding achievable rates of polynomial-time algorithms. The main result was that you need n = \Omega( d^2 \log p) samples, where p is the number of variables and d is the maximum degree in the graphical model. The proof approach was to model graph selection as channel coding, in which the channel outputs samples from a graphical model chosen by the encoder. If the decoder can identify the model then the problem is solvable. This is analogous to the talk Sriram Vishwanath gave today on matrix completion.

ISIT 2010 : Michael Jordan

Perhaps malapropos for the NBA Finals, Prof. Michael Jordan gave the first plenary talk at ISIT. It was a great overview of nonparametric Bayesian modeling. In particular, he covered his favorite Chinese restaurant process (closely related to the Pitman-Yor process and its stick-breaking construction), hierarchical Dirichlet priors, and all the other jargon-laden elements of modeling. At the end he covered some of the rather stunning successes of this approach in applications with lots of data to learn from. What was missing for me was a sense of how these approaches work in the data-poor regime, so I asked a question (foolishly) about sample complexity. Alas, since that is a “frequentist” question and Jordan is a “Bayesian,” I didn’t quite get the answer to the question I was trying to ask, but that’s what happens when you don’t phrase things properly.

One nice thing that I learned was the connection to Kingman’s earlier work on characterizing random measures via non-homogeneous Poisson processes. Kingman has been popping up all over the place in my reading, from urn processes to exchangeable partition processes (also known as paintbox processes). When I get back to SD, it will be back to the classics for me!

From Polya urns to the OK Corral

A new postdoc here, Punyaslok Purkayastha, pointed out to me a Probability Surveys paper by Robin Pemantle on random processes with reinforcement, and I’ve been reading through it in spare moments (usually on the bus). I have had a related problem knocking about my head (Ram Rajagopal and I thought we could do it in our spare time, but we never managed to find enough time to work on it). All this stuff is pretty well known already, but I thought it was a nice story.

A Polya urn process starts by taking an urn with R_0 = a red balls and B_0 = b black balls at time 0. At each time, draw a ball uniformly from the urn and then replace it together with another ball of the same color. Let X_{n+1} be the color of the (n+1)-th draw, where red is 0 and black is 1, so that X_{n+1} = 0 with probability R_n / (R_n + B_n). The urn is updated according to R_{n+1} = R_n + 1(X_{n+1} = 0) and B_{n+1} = B_n + 1(X_{n+1} = 1).

To figure out the asymptotic behavior of this process, just note that the sequence of colors is exchangeable. If x_1^n has k 0’s (red draws) in it, then

\mathbb{P}(X_1^{n} = x_1^n) = \frac{ \prod_{i = 0}^{k-1} (a + i) \prod_{i=0}^{n-k-1} (b + i)}{ \prod_{i=0}^{n-1} (a + b + i) }

From this it is clear that the probability of seeing x_1^n depends only on its type (empirical distribution, for non-information theorists). Since the sequence is exchangeable, de Finetti’s theorem shows that the fraction of red balls W_n = R_n/(R_n + B_n) converges almost surely to a random variable W, and taking the limit in the formula above shows that W has a beta distribution with parameters (a,b):

f_W(w) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} w^{a-1} (1 - w)^{b-1}
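A quick simulation (my own sketch) makes the de Finetti story tangible: run many independent Polya urns started from (a,b) balls and compare the empirical mean and variance of the terminal red fraction with the Beta(a,b) values a/(a+b) and ab/((a+b)^2(a+b+1)).

```python
import numpy as np

def polya_fraction(a, b, steps, rng):
    """Run one Polya urn for `steps` draws and return the final fraction of red balls."""
    red, black = a, b
    for _ in range(steps):
        if rng.random() < red / (red + black):
            red += 1      # drew red: put back an extra red ball
        else:
            black += 1    # drew black: put back an extra black ball
    return red / (red + black)

rng = np.random.default_rng(2)
a, b, steps, runs = 2, 3, 1_000, 2_000
w = np.array([polya_fraction(a, b, steps, rng) for _ in range(runs)])

print("empirical mean/var of W:", w.mean(), w.var())
print("Beta(a,b) mean/var:     ", a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1)))
```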

A more general process is the Friedman urn process, in which you add \alpha balls of the same color you drew and \beta balls of the opposite color. If \alpha > \beta > 0 then Freedman (different spelling, different guy) showed that the proportion of red balls tends almost surely to 1/2. This happens because the evolution ends up looking like a symmetric two-state Markov chain, which has (1/2, 1/2) as its stationary distribution. Note that here the limiting fraction is an almost-sure constant, rather than a nondegenerate random variable as in the Polya process.
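The same kind of sketch for a Friedman urn (the parameters \alpha = 2, \beta = 1 and the starting compositions are my own choices) shows the fraction of red balls settling near 1/2 regardless of where you start:

```python
import numpy as np

def friedman_fraction(red, black, add_same, add_other, steps, rng):
    """Friedman urn: add `add_same` balls of the drawn color and `add_other` of the opposite color."""
    for _ in range(steps):
        if rng.random() < red / (red + black):
            red, black = red + add_same, black + add_other
        else:
            red, black = red + add_other, black + add_same
    return red / (red + black)

rng = np.random.default_rng(3)
for start in [(1, 9), (5, 5), (9, 1)]:
    fracs = [friedman_fraction(*start, add_same=2, add_other=1, steps=20_000, rng=rng)
             for _ in range(20)]
    print(start, round(sum(fracs) / len(fracs), 3))   # all close to 0.5
```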

Let’s look at a special case where instead of adding a ball of the same color that you drew, you only add a ball of the opposite color (which is a Friedman process with parameters (0,1)). Freedman’s result doesn’t apply here, but Kingman has a nice paper (which I am reading now) relating this process to a third process: the OK Corral process.

In the OK Corral process, say we start with (c,d) good and bad cowboys. At each step we choose one cowboy uniformly at random, and he kills a cowboy on the opposite side. This continues until everyone on one side has been shot. The process is precisely the time reversal of the special Friedman urn process above. If we let S_N denote the number of surviving cowboys when we start with (N,N), Kingman proves that

V_N = \frac{3S_N^4}{2N^3}

converges to a gamma distribution with parameter 1/2:

f_V(v) = v^{-1/2} e^{-v}/\Gamma(1/2)

The weird thing about this is that you have to scale S_N by N^{3/4} to get a nondegenerate limiting distribution.
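Here is a small simulation of the OK Corral process (my own sketch). It only checks the N^{3/4} scaling rather than Kingman’s constants: the quantiles of S_N / N^{3/4} should look roughly the same for different N.

```python
import numpy as np

def ok_corral_survivors(n, rng):
    """Start with (n, n) cowboys; a uniformly chosen survivor kills someone on the other side."""
    good, bad = n, n
    while good > 0 and bad > 0:
        if rng.random() < good / (good + bad):
            bad -= 1    # a good cowboy fires
        else:
            good -= 1   # a bad cowboy fires
    return good + bad   # survivors on the winning side

rng = np.random.default_rng(4)
runs = 2_000
for n in [200, 800]:
    s = np.array([ok_corral_survivors(n, rng) for _ in range(runs)])
    print(n, np.round(np.percentile(s / n**0.75, [25, 50, 75]), 2))
```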

Kingman and Volkov extend the analysis by embedding the urns into birth processes, which is another technique for analyzing balls-into-bins processes with feedback and urns with reinforcement. But that’s another story for another time (but maybe not another blog post, since this one took a bit longer than I expected).

Bayes-Ball in a nutshell

One of the fun things about graphical models is that arguments can be done by looking at diagrams (kind of like a diagram chase in algebraic topology). One such trick is from R.D. Shachter’s UAI paper “Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams)” (see it here, for example). It is a handy method for figuring out conditional independence relations, and a good shortcut for determining when certain conditional mutual information quantities are equal to 0. The diagram below shows the different rules for when the ball can pass through a node or when it bounces off; gray means that the variable is observed (or is in the conditioning). I tend to forget the rules, so I made this little chart summary to help myself out.
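Since the rules are easy to state but easy to forget, here is a small sketch of the Bayes-Ball idea written as a reachability check (a simplified version of Shachter’s algorithm with no deterministic nodes or requisite-information bookkeeping; the example DAG is made up). The rules it encodes: at an unobserved node the ball coming from a child passes to both parents and children, coming from a parent it passes only to children; at an observed node the ball coming from a parent bounces back to the parents, and coming from a child it is blocked.

```python
from collections import deque

def bayes_ball_reachable(parents, source, observed):
    """
    Return the set of nodes the ball can reach from `source` given `observed`.
    Nodes outside this set (and outside `observed`) are d-separated from `source`,
    hence conditionally independent of it in any distribution factorizing over the DAG.
    `parents` maps each node to the list of its parents.
    """
    children = {v: [] for v in parents}
    for v, ps in parents.items():
        for p in ps:
            children[p].append(v)

    queue = deque([(source, "child")])   # pretend the ball enters the source from a child
    visited, reachable = set(), set()
    while queue:
        node, came_from = queue.popleft()
        if (node, came_from) in visited:
            continue
        visited.add((node, came_from))
        if node in observed:
            if came_from == "parent":            # observed node: bounce back up to the parents
                queue.extend((p, "child") for p in parents[node])
            # observed node entered from a child: the ball is blocked
        else:
            reachable.add(node)
            if came_from == "child":             # unobserved, from a child: also pass up to parents
                queue.extend((p, "child") for p in parents[node])
            queue.extend((c, "parent") for c in children[node])   # pass down to children either way
    return reachable

# toy DAG: a -> c <- b, c -> d
parents = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
print(bayes_ball_reachable(parents, "a", observed=set()))   # no 'b': a is independent of b
print(bayes_ball_reachable(parents, "a", observed={"d"}))   # observing d activates the collider, so 'b' appears
```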