Bayes-Ball in a nutshell

One of the fun things about graphical models is that arguments can be made by looking at diagrams (kind of like a diagram chase in algebraic topology). One such trick is from R.D. Shachter’s UAI paper “Bayes-Ball: The Rational Pastime (for Determining Irrelevance and Requisite Information in Belief Networks and Influence Diagrams)” (see it here, for example). This is a handy method for figuring out conditional independence relations, and is a good short-cut for deciding when certain conditional mutual information quantities are equal to 0. The diagram below shows the different rules for when the ball can pass through a node or when it bounces off. Gray means that the variable is observed (or is in the conditioning). I tend to forget the rules, so I made this little chart summary to help myself out.
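For the programmatically inclined, here is a little sketch of the ball-passing rules in Python. This is the “reachability” formulation of the algorithm (essentially Algorithm 3.1 in Koller and Friedman’s textbook); the graph encoding and function names are my own, so treat it as a sketch rather than a reference implementation.

```python
from collections import deque

def bayes_ball_reachable(parents, children, x, observed):
    """Nodes d-connected to x given `observed`, via the Bayes ball.

    parents, children: dicts mapping each node to a list of nodes.
    Any node not returned is conditionally independent of x given
    the observed set.
    """
    # Phase 1: ancestors of the observed set.  A collider with an
    # observed descendant lets the ball pass between its parents.
    anc, stack = set(), list(observed)
    while stack:
        v = stack.pop()
        if v not in anc:
            anc.add(v)
            stack.extend(parents[v])

    # Phase 2: BFS over (node, direction) pairs.  "up" means the ball
    # arrived from a child; "down" means it arrived from a parent.
    queue = deque([(x, "up")])
    visited, reachable = set(), set()
    while queue:
        node, direction = queue.popleft()
        if (node, direction) in visited:
            continue
        visited.add((node, direction))
        if node not in observed:
            reachable.add(node)
        if direction == "up" and node not in observed:
            # ball passes through: up to parents, down to children
            for p in parents[node]:
                queue.append((p, "up"))
            for c in children[node]:
                queue.append((c, "down"))
        elif direction == "down":
            if node not in observed:
                # chain/fork: keep rolling down to children
                for c in children[node]:
                    queue.append((c, "down"))
            if node in anc:
                # active collider: bounce back up to the parents
                for p in parents[node]:
                    queue.append((p, "up"))
    return reachable

# The classic v-structure a -> c <- b: a and b are marginally
# independent, but conditioning on c makes them dependent.
parents  = {"a": [], "b": [], "c": ["a", "b"]}
children = {"a": ["c"], "b": ["c"], "c": []}
print("b" in bayes_ball_reachable(parents, children, "a", set()))   # False
print("b" in bayes_ball_reachable(parents, children, "a", {"c"}))   # True
```

The v-structure example at the end is exactly the “explaining away” rule from the chart: the ball bounces off an unobserved collider but passes through once the collider is observed.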

Another take on matched sources and channels

Ehsan Ardestanizadeh posed a little problem in Young-Han Kim‘s group meeting before break, and Paolo Minero presented a cute solution which I thought might be of interest. It’s related to ideas my advisor, Michael Gastpar, worked on in his thesis.

Suppose I have a Gaussian variable S that is \mathcal{N}(0,1), so zero-mean and unit variance. If I observe Y = S + Z where Z is also \mathcal{N}(0,1) and independent of S, the minimum mean-square error (MMSE) estimate of S given Y is just

\hat{S} = \mathbb{E}[ S | Y ] = \frac{1}{2} Y

and the MSE is \mathbb{E}[ (S - \hat{S})^2 ] = 1/2. The question is this: can we somehow get an MSE of less than 1/2 by “encoding” S? Suppose now we can take any function f and let X = f(S), and Y = X + Z but with the restriction that \mathsf{Var}(X) \le 1. That is, the encoding cannot use any more power.
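Before getting to the answer, here is a quick numerical sanity check of the uncoded baseline (a numpy sketch of the setup above, nothing more):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
S = rng.standard_normal(N)        # S ~ N(0,1)
Z = rng.standard_normal(N)        # Z ~ N(0,1), independent of S
Y = S + Z                         # "uncoded" transmission: X = S

S_hat = 0.5 * Y                   # MMSE estimate E[S|Y] = Y/2
print(np.mean((S - S_hat)**2))    # prints roughly 0.5
```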

The answer is no, and comes via an “information theoretic” argument. Consider the vector version of the problem where you have n iid unit variance Gaussians S^n and you want to estimate S^n from Y^n = f(S^n) + Z^n where Z^n is iid unit variance Gaussian as well. The goal is to minimize the average per-letter distortion:

d(\hat{S}^n, S^n) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[ (S_i - \hat{S}_i)^2 ]

This is just the problem of joint source-channel coding of a Gaussian source over an AWGN channel with quadratic distortion, where the encoding function f must satisfy a unit power constraint. For this problem the rate-distortion function is

R(D) = \frac{1}{2} \log \frac{1}{D}

and the capacity of the channel is \frac{1}{2} \log(1 + 1) = \frac{1}{2}. Since separate source coding (compression) followed by channel coding (error control) is optimal, in order to get distortion D the rate R(D) \le 1/2 so D \ge 1/2. Furthermore, this is achievable with no coding at all by just setting f(S^n) = S^n.
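To spell out the chain of inequalities hiding in that last sentence: any encoder and estimator achieving per-letter distortion D must satisfy

n R(D) \le I(S^n; \hat{S}^n) \le I(X^n; Y^n) \le nC = \frac{n}{2}

where the first step is the rate-distortion converse, the second is data processing along the Markov chain S^n \to X^n \to Y^n \to \hat{S}^n, and the third is the channel coding converse under the power constraint. Substituting R(D) = \frac{1}{2} \log \frac{1}{D} gives D \ge 1/2.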

Now if there were a scheme for the single-letter case which got MSE less than 1/2, we could apply it to each letter separately to get a vector scheme with distortion less than 1/2. But since D \ge 1/2 for any code, we get a contradiction. Thus encoding does not help in the single-letter case either. If S isn’t Gaussian, though, the whole story changes.

Concert Bleg : The Messiah

I’m singing again!

The Messiah

Orchestra Nova
Conducted by Jung-Ho Pak

Bach Collegium San Diego
Ruben Valenzuela, Music Director

Guest Artists:
Virginia Sublett, Soprano
Katherine Lundeen, Alto
Robert MacNeil, Tenor
John Polhamus, Bass

St. Paul’s Cathedral, San Diego
Thursday, December 10, 7:30 p.m.

St. James by-the-Sea Episcopal Church, La Jolla
Friday, December 11, 7:30 p.m.

Solana Beach Presbyterian Church, Solana Beach
Saturday, December 12, 7:30 p.m.

This season’s Masterpiece Messiah is an encore presentation of our dramatic video experience of the great masterpieces of art complementing the most famous of all oratorios, George Frideric Handel’s Messiah. Joined again by Bach Collegium San Diego, our interpretation has become well-known for its original 18th-century period approach, creating an unforgettable emotional experience that goes beyond most traditional performances.

New this year and by popular demand, we’re adding a performance at St. Paul’s Cathedral, Downtown San Diego. Don’t miss out on one of the hottest tickets in town during December. Buy tickets online or call 858-350-0290.

The sun always rises?

I went to a workshop at the end of the summer at the American Institute of Mathematics on Permanents and modeling probability distributions. The main question we looked at was how to estimate a probability distribution from samples when you don’t know, for example, how many possible values there are. A simple version of this, which is often mentioned, is estimating the number of butterfly species from a sample in which many species appear only once. As C.B. Williams wrote:

About 1940, a closer cooperation in a new field was started by a letter from Dr Steven Corbet, who had collected butterflies for many years in Malaya. On studying his collections after returning to England, he found that he had approximately 9000 specimens which included 316 species. Of these 118 were represented by only a single individual; 74 by 2 individuals, 44 by 3 and so on; the greater the number of individuals per species, the fewer were the number of species at that level. From this skew distribution it followed that the 37% rarer species accounted for only 1.3% of the total sample.

R.A. Fisher, Corbet and Williams wrote a paper in the Journal of Animal Ecology in 1943 to estimate the number of species (“The relation between the number of species and the number of individuals in a random sample of an animal population”). Another famous example is estimating the number of words Shakespeare knew, as in the work of Thisted and Efron (popularized a bit more by the application to determining if a new poem is by Shakespeare or not).

At the workshop I learned from Susan Holmes about this collection of essays by S.L. Zabell called Symmetry and its Discontents, which I have been enjoying tremendously. Alon Orlitsky gave a short presentation orienting the workshop participants, and started out with a “simpler” problem of inductive inference — given that you saw a trial succeed p times and fail q times, what is the bias of the coin? Bayes thought about this a lot, as did Laplace, who wrote this infamous example (excerpted from Zabell’s book):

Thus we find that an event having occurred successively any number of times, the probability that it will happen again the next time is equal to this number increased by unity, divided by the same number, increased by two units. Placing the most ancient epoch of history at five thousand years ago, or at least 1826213 days, and the sun having risen constantly in the interval at each revolution of twenty-four hours, it is a bet of 1826214 to one that it will rise again tomorrow. [Essai philosophique p. xvii]
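The calculation behind the bet is Laplace’s rule of succession: with a uniform prior on the bias \theta, after p successes and q failures the posterior is \mathrm{Beta}(p+1, q+1), and the predictive probability of another success is

\frac{ \int_0^1 \theta \cdot \theta^p (1 - \theta)^q \, d\theta }{ \int_0^1 \theta^p (1 - \theta)^q \, d\theta } = \frac{p+1}{p+q+2}

With q = 0 and p = 1826213 sunrises this gives a probability of 1826214/1826215, or odds of 1826214 to one.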

Laplace gets a lot of flak for this estimate, and it’s an infamous example of the “excesses of probabilistic reasoning.” But as I learned this morning, Laplace immediately went on to say:

But this number is incomparably greater for him who, recognizing in the totality of the phenomena the regulatory principle of days and seasons, sees that nothing at the present moment can arrest the course of it.

Laplace was advocating this method of calculation formally, as the correct way to compute a probability based only on the information from the sample (e.g. the number of repeated successes/events). As Persi Diaconis pointed out at the workshop, Laplace would be “turning over in his grave” at some of the things people have accused him of mathematically.

And that is your historical nugget for the day.

Lazy random walks on the torus via Dirichlet forms

One of the simple example graphs I’ve used in some of my research on gossip algorithms has been the 2-dimensional torus with n vertices, which looks like a \sqrt{n} \times \sqrt{n} grid with the top and left edges wrapped around to connect with the bottom and right edges. Every vertex has 4 neighbors. Now imagine a very lazy random walk on this graph in which a random walker moves from vertex u to each of its neighbors with probability 1/n (and stays put with the remaining probability). It’s “well known” that this random walk takes around n^2 steps to mix. That is, if P is the matrix of transition probabilities then

T_{r} = \frac{1}{1 - \lambda_2(P)} \approx n^2

Here \lambda_2(P) is the second largest eigenvalue of P and the relaxation time T_{r} is the inverse of the spectral gap of the matrix. One way of characterizing T_{r} for reversible Markov chains is via the Dirichlet form. For a function f : V \to \mathbb{R} on the states of the chain define the Dirichlet form D(f,f) by

D(f,f) = \frac{1}{2} \sum_{u,v \in V} \pi_u P_{uv} (f(u) - f(v))^2

In our example the stationary distribution \pi_u = 1/n and P_{uv} = 1/n for all edges in the graph. We write f \perp \pi if

\sum_{v \in V} \pi_v f(v) = 0

Define a norm associated with \pi via

\|f\|^2 = \sum_{v \in V} \pi_v f(v)^2

Then the characterization is

T_{r} = \sup_{f \perp \pi, f \ne 0} \frac{ \|f\|^2 }{ D(f,f) }

One question I asked myself today was whether it was “easy” to see what f you should choose in the grid example to get the scaling of n^2. Here’s one choice that gives the correct scaling. We’ll set f to be constant on each column. Assume without loss of generality that \sqrt{n} is divisible by 4 and set m = \sqrt{n}/4. The values for f on the columns will be like two triangles:

\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{m}{n},  \frac{m}{n}, \frac{m-1}{n}, \ldots, \frac{1}{n}, \frac{-1}{n}, \frac{-2}{n}, \ldots, \frac{-m}{n}, \frac{-m}{n}, \frac{-m+1}{n}, \ldots, \frac{-1}{n} \}

Now we can evaluate the norm, noting that there are \sqrt{n} vertices per column:

\|f\|^2 = 4 \sum_{i =1}^{m} \frac{1}{n} \sqrt{n} \frac{i^2}{n^2} = c \frac{1}{n}

This is because the sum of the first m squares scales like m^3 and m = \sqrt{n}/4. Now turning to the Dirichlet form, note that each difference between columns is at most 2/n and there are fewer than n edges for which f(u) \ne f(v). Thus:

D(f,f) \le \frac{1}{2} n \frac{1}{n} \frac{1}{n} \frac{4}{n^2} = c' \frac{1}{n^3}

Taking the ratio gives the lower bound T_r \ge c'' n^2, as desired.

The first f I tried was just equal to +1 on the first half of the columns and -1 on the second half of the columns. This ends up giving a suboptimal bound: the norm is \|f\|^2 = 1, but the denominator picks up 2 \sqrt{n} terms of constant size. The key is to make all the differences (f(u) - f(v))^2 in the denominator small while keeping \|f\|^2 large enough. Even though the triangle-wave f sums over n small differences in the denominator, the total stays small enough to pay for the \|f\|^2 = c/n in the numerator.
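As a sanity check on all of this, here is a small numerical sketch (my own; the indexing conventions are arbitrary) that builds the lazy walk on a 16 \times 16 torus, computes the relaxation time from the spectrum, and compares it with the Dirichlet-form bound from the triangle-wave f:

```python
import numpy as np

side = 16                    # sqrt(n), divisible by 4
n = side * side
m = side // 4

# Lazy walk: each of the 4 neighbor moves has probability 1/n,
# so the walker stays put with probability 1 - 4/n.
P = np.zeros((n, n))
for x in range(side):
    for y in range(side):
        u = x * side + y
        for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
            v = ((x + dx) % side) * side + (y + dy) % side
            P[u, v] = 1.0 / n
        P[u, u] = 1.0 - 4.0 / n

# Relaxation time from the spectrum (P is symmetric here and the
# stationary distribution is uniform).
eigs = np.sort(np.linalg.eigvalsh(P))
T_relax = 1.0 / (1.0 - eigs[-2])

# Triangle-wave test function, constant on columns.
col = np.concatenate([np.arange(1, m + 1), np.arange(m, 0, -1),
                      -np.arange(1, m + 1), -np.arange(m, 0, -1)]) / n
f = np.repeat(col, side)     # f(u) depends only on the column of u
pi = np.ones(n) / n

# Dirichlet-form ratio: a lower bound on T_relax by the
# variational characterization above.
norm2 = pi @ f**2
D = 0.5 * np.sum(pi[:, None] * P * (f[:, None] - f[None, :])**2)

print(f"n^2 = {n**2},  T_relax = {T_relax:.1f},  bound = {norm2/D:.1f}")
```

By the variational characterization the printed ratio can never exceed T_relax; for this f the two should come out within a small constant factor of each other, both scaling like n^2.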

While doing this calculation, I noticed that the book Markov Chains and Mixing Times is also online — it makes a handy reference and is a little easier to use than my old go-to, the Aldous-Fill book.

paper a day (long time period) : Assouad, Fano, and Le Cam

Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. Readers of this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.

Suppose we have a parametric family of distributions \mathcal{P} = \{P_{\theta} : \theta \in \mathcal{D}\}, where \mathcal{D} is a metric space with metric d(\cdot,\cdot). For two distributions P_1 and P_2 define the affinity \|P_1 \wedge P_2 \| by:

\|P_1 \wedge P_2 \| = 1 - \frac{1}{2} \| P_1 - P_2 \|_1

Let \mathop{\rm co}(\cdot) denote the convex hull. Then Le Cam’s lemma is the following.

Le Cam’s Lemma. Let \hat{\theta} be an estimator of \theta(P) on \mathcal{P}. Suppose D_1 and D_2 are two sets such that d(s_1,s_2) \ge 2 \delta for all (s_1,s_2) \in D_1 \times D_2, and let \mathcal{P}_1 and \mathcal{P}_2 be two subsets of \mathcal{P} such that \theta(P) \in D_i when P \in \mathcal{P}_i. Then

\sup_{P \in \mathcal{P}} \mathbb{E}_P[ d(\hat{\theta}, \theta(P)) ] \ge \delta \cdot \sup_{P_i \in \mathop{\rm co}(\mathcal{P}_i)} \| P_1 \wedge P_2 \|

This lemma gives a lower bound on the error of parameter estimates in terms of the total variation distance between the distributions associated with different parameter sets. It’s a bit different from the bounds we usually think of, like Stein’s Lemma, and also a bit different from bounds like the Cramér-Rao bound.
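As a quick illustration (my own toy example, not from the paper): to estimate the mean of a Gaussian from n samples, take P_i = \mathcal{N}(\theta_i, 1)^{\otimes n} with \theta_2 - \theta_1 = 2\delta and d(x,y) = |x - y|. Then D(P_1 \| P_2) = 2 n \delta^2, so Pinsker’s inequality gives

\|P_1 \wedge P_2\| = 1 - \frac{1}{2} \|P_1 - P_2\|_1 \ge 1 - \sqrt{ \frac{1}{2} D(P_1 \| P_2) } = 1 - \delta \sqrt{n}

Choosing \delta = \frac{1}{2\sqrt{n}} makes the affinity at least 1/2, and Le Cam’s lemma gives

\sup_{P \in \mathcal{P}} \mathbb{E}_P[ | \hat{\theta} - \theta(P) | ] \ge \frac{1}{4\sqrt{n}}

which is the familiar n^{-1/2} rate.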

Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by H = \{-1, 1\}^m, the vertices of the hypercube. We’ll write t \sim_j t' for t,t' \in H if they differ only in the j-th coordinate.

Assouad’s Lemma. Let \mathcal{F}_m = \{P_{t} : t \in H\} be a set of 2^m probability measures indexed by H, and suppose there are m pseudo-distances d_1, \ldots, d_m on \mathcal{D} such that for any pair (x,y)

d(x,y) = \sum_{j=1}^{m} d_j(x,y)

and that if t \sim_j t'

d_j( \theta(P_t), \theta(P_{t'}) ) \ge \alpha_m

Then

\max_{P_t \in \mathcal{F}_m} \mathbb{E}_{t}[ d(\hat{\theta},\theta(P_t))] \ge m \cdot \frac{\alpha_m}{2} \min\{ \|P_t \wedge P_{t'} \| : t \sim_j t', j \le m\}

The min comes about because the bound is limited by the weakest pair of neighbors (that is, the worst j) of P_t in the hypercube. Assouad’s Lemma has been used in many different places, including covariance estimation, learning, and other minimax problems.
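To see the hypercube structure in action (again a toy example of my own): let P_t = \mathcal{N}(\tau t, I_m)^{\otimes n} for t \in H, with d_j(x,y) = |x_j - y_j|, so that \alpha_m = 2\tau. For t \sim_j t' we have D(P_t \| P_{t'}) = 2 n \tau^2, so Pinsker gives \|P_t \wedge P_{t'}\| \ge 1 - \tau \sqrt{n}. Taking \tau = \frac{1}{2\sqrt{n}},

\max_{P_t \in \mathcal{F}_m} \mathbb{E}_{t}[ \| \hat{\theta} - \theta(P_t) \|_1 ] \ge m \cdot \tau \cdot \frac{1}{2} = \frac{m}{4\sqrt{n}}

so the hypercube lets the bound pay for the dimension m while only ever comparing neighboring distributions.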

Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.

Fano’s Lemma. Let \mathcal{M}_r \subset \mathcal{P} contain r probability measures such that for all j \ne j' with j,j' \le r

d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r

and

D(P_j~\|~P_{j'}) \le \beta_r

Then

\max_j \mathbb{E}_j[ d(\hat{\theta},\theta(P_j)) ] \ge \frac{\alpha_r}{2}\left( 1 - \frac{\beta_r + \log 2}{\log r} \right)

Here D(\cdot\|\cdot) is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message W uniformly in \{1, 2, \ldots, r\} and then setting the output Y to have the distribution P_j conditioned on W = j.
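To sketch the reduction a little further (assuming d obeys the triangle inequality): given \hat{\theta}, let the decoder output \hat{W} = \arg\min_j d(\hat{\theta}, \theta(P_j)). Since the \theta(P_j) are \alpha_r-separated, if d(\hat{\theta}, \theta(P_W)) < \alpha_r/2 then \hat{W} = W, so

\max_j \mathbb{E}_j[ d(\hat{\theta}, \theta(P_j)) ] \ge \frac{\alpha_r}{2} \mathbb{P}(\hat{W} \ne W) \ge \frac{\alpha_r}{2} \left( 1 - \frac{I(W; Y) + \log 2}{\log r} \right)

by the usual Fano inequality, and convexity of the divergence gives I(W; Y) \le \frac{1}{r^2} \sum_{j,j'} D(P_j \| P_{j'}) \le \beta_r.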

The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.

In case you are in Austin…

I’m giving a talk on Friday, so come on down! This has been your daily self-promotion effort, thank you for reading.

Consensus in context : leveraging the network to accelerate distributed consensus

October 30, 2009 ENS 637
11:00 am

Gossip algorithms are a class of decentralized solutions to the problem of achieving consensus in a network of agents. They have attracted recent research interest because they are simple and robust — attractive qualities for wireless ad-hoc and sensor networks. Unfortunately, the standard gossip protocol converges very slowly for many popular network models. I will discuss three ways to leverage properties of the network to achieve faster convergence : routing, broadcast, and mobility.

Joint work with Alex G. Dimakis, Tuncer Can Aysal, Mehmet Ercan Yildiz, Martin Wainwright, and Anna Scaglione.

Definitely not the desired interface

I am writing this from my new iPhone, for which there is a WordPress app that lets me write posts from my phone. I have to say that although the concept is cool, having to write anything complicated on this touch screen is a little maddening. That being said, I hope to blog soon about some recent reads, canonical angles between subspaces, and a generalization of Fano’s inequality that I learned about in a paper by Bin Yu.

NPR on the end of privacy

NPR’s All Things Considered is running a 4-part series on privacy issues in the modern digital era. Since I’ve started working on privacy research (specifically related to privacy in machine learning problems and in medical databases), these popular news stories give good insight into how people in general think about privacy. The first segment is on data companies and their increasing lack of disclosure, and today’s was mostly about Facebook. I’m looking forward to the rest of it — I’ve already had one or two nebulous research ideas, which is always a nice feeling.