Lazy random walks on the torus via Dirichlet forms

One of the simple example graphs I’ve used in some of my research on gossip algorithms has been the 2-dimensional torus with $n$ vertices, which looks like a $\sqrt{n} \times \sqrt{n}$ grid with the top and left edges wrapped around to connect with the bottom and right edges. Every vertex has 4 neighbors. Now imagine a very lazy random walk on this graph in which a random walker moves from vertex $u$ to one of its neighbors with probability $1/n$. It’s “well known” that this random walk takes around $n^2$ steps to mix. That is, if $P$ is the matrix of transition probabilities then

$T_{r} = \frac{1}{1 - \lambda_2(P)} \approx n^2$

Here $\lambda_2(P)$ is the second largest eigenvalue of $P$ and the relaxation time $T_{r}$ is the inverse of the spectral gap of the matrix. One way of characterizing $T_{r}$ for reversible Markov chains is via the Dirichlet form. For a function $f : V \to \mathbb{R}$ on the states of the chain define the Dirichlet form $D(f,f)$ by

$D(f,f) = \frac{1}{2} \sum_{u,v \in V} \pi_u P_{uv} (f(u) - f(v))^2$

In our example the stationary distribution $pi_u = 1/n$ and $P_{uv} = 1/n$ for all edges in the graph. We write $f \perp \pi$ if

$\sum_{v \in V} \pi_v f(v) = 0$

Define a norm associated with $\pi$ via

$\|f\|^2 = \sum_{v \in V} \pi_i f(v)^2$

Then the characterization is

$T_{r} = \sup_{f \perp \pi, f \ne 0} \frac{ \|f\|^2 }{ D(f,f) }$

One question I asked myself today was whether it was “easy” to see what $f$ you should choose in the grid example to get the scaling of $n^2$. Here’s one choice that gives the correct scaling. We’ll set $f$ to be constant on each column. Assume without loss of generality that $\sqrt{n}$ is divisible by 4 and set $m = \sqrt{n}/4$. The values for $f$ on the columns will be like two triangles:

$\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{m}{n}, \frac{m}{n}, \frac{m-1}{n}, \ldots, \frac{1}{n}, \frac{-1}{n}, \frac{-2}{n}, \ldots, \frac{-m}{n}, \frac{-m}{n}, \frac{-m+1}{n}, \ldots, \frac{-1}{n} \}$

Now we can evaluate the norm, noting that there are $\sqrt{n}$ vertices per column:

$\|f\|^2 = 4 \sum_{i =1}^{m} \frac{1}{n} \sqrt{n} \frac{i^2}{n^2} = c \frac{1}{n}$

This is because the sum of the first $m$ squares scales like $m^3$ and $m = \sqrt{n}/4$. Now turning to the Dirichlet form, note that each difference between columns is at most $2/n$ and there are fewer than $n$ edges for which $f(u) \ne f(v)$. Thus:

$D(f,f) \le \frac{1}{2} n \frac{1}{n} \frac{1}{n} \frac{4}{n^2} = c' \frac{1}{n^3}$

Taking the ratio gives the lower bound of $T_r \ge c''/n^2$ as desired.

The first $f$ I tried was just equal to +1 on the first half of the columns and -1 on the second half of the columns. This ends up giving a suboptimal bound, because the norm $\|f\|^2 = 1$ but in the denominator we get $2 \sqrt{n}$ positive terms. The key is to make all the differences $(f(u) - f(v))^2$ in the denominator small while keeping the average of $\|f\|^2$ large enough. Even though you sum over $n$ small differences in the denominator, it stays small enough to pay for the $\|f\|^2 = c/n$ in the numerator.

While doing this calculation, I noticed that the book Markov Chains and Mixing Times is also online — it makes a handy reference and is a little easier to use than my old go-to, the Aldous-Fill book.

paper a (long time period) : Assouad, Fano, and Le Cam

Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. People who read this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.

Suppose we have a parametric family of distributions $\mathcal{P} = \{P_{\theta} : \theta \in \mathcal{D}\}$, where $\mathcal{D}$ is a metric space with metric $d(\cdot,\cdot)$. For two distributions $P_1$ and $P_2$ define the affinity $\|P_1 \wedge P_2 \|$ by:

$\|P_1 \wedge P_2 \| = 1 - \frac{1}{2} \| P_1 - P_2 \|_1$

Let $\mathop{\rm co}(\cdot)$ denote the convex hull. Then Le Cam’s lemmas is the following.

Le Cam’s Lemma. Let $\hat{\theta}$ be an estimator of $\theta(P)$ on $\mathcal{P}$. Suppose $D_1$ and $D_2$ be two sets such that $d(s_1,s_2) \ge 2 \delta$ for all $(s_1,s_2) \in D_1 \times D_2$, and $\mathcal{P}_1$ and $\mathcal{P}_2$ be two subsets of $\mathcal{P}$ such that $\theta(P) \in D_i$ when $P \in \mathcal{P}_i$. Then

$\sup_{P \in \mathcal{P}} \mathbb{E}_P[ d(\hat{\theta}, \theta(P)) ] \ge \delta \cdot \sup_{P_i \in \mathop{\rm co}(\mathcal{P}_i)} \| P_1 \wedge P_2 \|$

This lemma gives a lower bound on the error of parameter estimates in terms of the total variational distance between the distributions associated to different parameter sets. It’s a bit different than the bounds we usually think of like Stein’s Lemma, and also a bit different than bounds like the Cramer-Rao bound.

Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by the $H = \{-1, 1\}^m$, the vertices of the hypercube. We’ll write $t \sim_j t'$ for $t,t' \in H$ if they differ in the j-th coordinate.

Assouad’s Lemma. Let $\mathcal{F}_m = \{P_{t} : t \in H\}$ be a set of $2^m$ probability measures indexed by $H$, and suppose there are $m$ pseudo-distances $d_m$ on $\mathcal{D}$ such that for any pair $(x,y)$

$d(x,y) = \sum_{j}^m d_j(x,y)$

and that if $t \sim_j t'$

$d_j( \theta(P_t), \theta(P_{t'}) ) \ge \alpha_m$

Then

$\max_{P_t \in \mathcal{F}_m} \mathbb{E}_{t}[ d(\hat{\theta},\theta(P_t))] \ge m \cdot \frac{\alpha_m}{2} \min\{ \|P_t \wedge P_{t'} \| : t \sim_j t', j \le m\}$

The min comes about because it is the weakest over all neighbors (that is, over all j) of $P_t$ in the hypercube. Assouad’s Lemma has been used in various different places, from covariance estimation, learning, and other minimax problems.

Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.

Fano’s Lemma. Let $\mathcal{M}_r \subset \mathcal{P}$ contain $r$ probability measures such that for all $j \ne j'$ with $j,j' \le r$

$d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r$

and

$D(P_j~\|~P_{j'}) \le \beta_r$

Then

$\max_j \mathbb{E}_j[ d(\hat{\theta},\theta(P_j)) ] \ge \frac{\alpha_r}{2}\left( 1 - \frac{\beta_r + \log 2}{\log r} \right)$

Here $D(\cdot\|\cdot)$ is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message $W$ uniformly in $\{1, 2, \ldots, r\}$ and then setting the output $Y$ to have the distribution $P_j$ conditioned on $W = j$.

The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.