November 2009


One of the simple example graphs I’ve used in some of my research on gossip algorithms has been the 2-dimensional torus with n vertices, which looks like a \sqrt{n} \times \sqrt{n} grid with the top and left edges wrapped around to connect with the bottom and right edges. Every vertex has 4 neighbors. Now imagine a very lazy random walk on this graph in which a random walker moves from vertex u to one of its neighbors with probability 1/n. It’s “well known” that this random walk takes around n^2 steps to mix. That is, if P is the matrix of transition probabilities then

T_{r} = \frac{1}{1 - \lambda_2(P)} \approx n^2

Here \lambda_2(P) is the second largest eigenvalue of P and the relaxation time T_{r} is the inverse of the spectral gap of the matrix. One way of characterizing T_{r} for reversible Markov chains is via the Dirichlet form. For a function f : V \to \mathbb{R} on the states of the chain define the Dirichlet form D(f,f) by

D(f,f) = \frac{1}{2} \sum_{u,v \in V} \pi_u P_{uv} (f(u) - f(v))^2

In our example the stationary distribution pi_u = 1/n and P_{uv} = 1/n for all edges in the graph. We write f \perp \pi if

\sum_{v \in V} \pi_v f(v) = 0

Define a norm associated with \pi via

\|f\|^2 = \sum_{v \in V} \pi_i f(v)^2

Then the characterization is

T_{r} = \sup_{f \perp \pi, f \ne 0} \frac{ \|f\|^2 }{ D(f,f) }

One question I asked myself today was whether it was “easy” to see what f you should choose in the grid example to get the scaling of n^2. Here’s one choice that gives the correct scaling. We’ll set f to be constant on each column. Assume without loss of generality that \sqrt{n} is divisible by 4 and set m = \sqrt{n}/4. The values for f on the columns will be like two triangles:

\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{m}{n},  \frac{m}{n}, \frac{m-1}{n}, \ldots, \frac{1}{n}, \frac{-1}{n}, \frac{-2}{n}, \ldots, \frac{-m}{n}, \frac{-m}{n}, \frac{-m+1}{n}, \ldots, \frac{-1}{n} \}

Now we can evaluate the norm, noting that there are \sqrt{n} vertices per column:

\|f\|^2 = 4 \sum_{i =1}^{m} \frac{1}{n} \sqrt{n} \frac{i^2}{n^2} = c \frac{1}{n}

This is because the sum of the first m squares scales like m^3 and m = \sqrt{n}/4. Now turning to the Dirichlet form, note that each difference between columns is at most 2/n and there are fewer than n edges for which f(u) \ne f(v). Thus:

D(f,f) \le \frac{1}{2} n \frac{1}{n} \frac{1}{n} \frac{4}{n^2} = c' \frac{1}{n^3}

Taking the ratio gives the lower bound of T_r \ge c''/n^2 as desired.

The first f I tried was just equal to +1 on the first half of the columns and -1 on the second half of the columns. This ends up giving a suboptimal bound, because the norm \|f\|^2 = 1 but in the denominator we get 2 \sqrt{n} positive terms. The key is to make all the differences (f(u) - f(v))^2 in the denominator small while keeping the average of \|f\|^2 large enough. Even though you sum over n small differences in the denominator, it stays small enough to pay for the \|f\|^2 = c/n in the numerator.

While doing this calculation, I noticed that the book Markov Chains and Mixing Times is also online — it makes a handy reference and is a little easier to use than my old go-to, the Aldous-Fill book.

The Large Hadron Collider has been turned on.

Did the world end yet? Guess not.

Carry on, folks, nothing to see here.

Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. People who read this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.

Suppose we have a parametric family of distributions \mathcal{P} = \{P_{\theta} : \theta \in \mathcal{D}\}, where \mathcal{D} is a metric space with metric d(\cdot,\cdot). For two distributions P_1 and P_2 define the affinity \|P_1 \wedge P_2 \| by:

\|P_1 \wedge P_2 \| = 1 - \frac{1}{2} \| P_1 - P_2 \|_1

Let \mathop{\rm co}(\cdot) denote the convex hull. Then Le Cam’s lemmas is the following.

Le Cam’s Lemma. Let \hat{\theta} be an estimator of \theta(P) on \mathcal{P}. Suppose D_1 and D_2 be two sets such that d(s_1,s_2) \ge 2 \delta for all (s_1,s_2) \in D_1 \times D_2, and \mathcal{P}_1 and \mathcal{P}_2 be two subsets of \mathcal{P} such that \theta(P) \in D_i when P \in \mathcal{P}_i. Then

\sup_{P \in \mathcal{P}} \mathbb{E}_P[ d(\hat{\theta}, \theta(P)) ] \ge \delta \cdot \sup_{P_i \in \mathop{\rm co}(\mathcal{P}_i)} \| P_1 \wedge P_2 \|

This lemma gives a lower bound on the error of parameter estimates in terms of the total variational distance between the distributions associated to different parameter sets. It’s a bit different than the bounds we usually think of like Stein’s Lemma, and also a bit different than bounds like the Cramer-Rao bound.

Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by the H = \{-1, 1\}^m, the vertices of the hypercube. We’ll write t \sim_j t' for t,t' \in H if they differ in the j-th coordinate.

Assouad’s Lemma. Let \mathcal{F}_m = \{P_{t} : t \in H\} be a set of 2^m probability measures indexed by H, and suppose there are m pseudo-distances d_m on \mathcal{D} such that for any pair (x,y)

d(x,y) = \sum_{j}^m d_j(x,y)

and that if t \sim_j t'

d_j( \theta(P_t), \theta(P_{t'}) ) \ge \alpha_m

Then

\max_{P_t \in \mathcal{F}_m} \mathbb{E}_{t}[ d(\hat{\theta},\theta(P_t))] \ge m \cdot \frac{\alpha_m}{2} \min\{ \|P_t \wedge P_{t'} \| : t \sim_j t', j \le m\}

The min comes about because it is the weakest over all neighbors (that is, over all j) of P_t in the hypercube. Assouad’s Lemma has been used in various different places, from covariance estimation, learning, and other minimax problems.

Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.

Fano’s Lemma. Let \mathcal{M}_r \subset \mathcal{P} contain r probability measures such that for all j \ne j' with j,j' \le r

d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r

and

D(P_j~\|~P_{j'}) \le \beta_r

Then

\max_j \mathbb{E}_j[ d(\hat{\theta},\theta(P_j)) ] \ge \frac{\alpha_r}{2}\left( 1 - \frac{\beta_r + \log 2}{\log r} \right)

Here D(\cdot\|\cdot) is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message W uniformly in \{1, 2, \ldots, r\} and then setting the output Y to have the distribution P_j conditioned on W = j.

The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.

Follow

Get every new post delivered to your Inbox.

Join 908 other followers