paper a (long time period) : Assouad, Fano, and Le Cam

Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. People who read this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.

Suppose we have a parametric family of distributions $\mathcal{P} = \{P_{\theta} : \theta \in \mathcal{D}\}$, where $\mathcal{D}$ is a metric space with metric $d(\cdot,\cdot)$. For two distributions $P_1$ and $P_2$ define the affinity $\|P_1 \wedge P_2 \|$ by:

$\|P_1 \wedge P_2 \| = 1 - \frac{1}{2} \| P_1 - P_2 \|_1$

Let $\mathop{\rm co}(\cdot)$ denote the convex hull. Then Le Cam’s lemmas is the following.

Le Cam’s Lemma. Let $\hat{\theta}$ be an estimator of $\theta(P)$ on $\mathcal{P}$. Suppose $D_1$ and $D_2$ be two sets such that $d(s_1,s_2) \ge 2 \delta$ for all $(s_1,s_2) \in D_1 \times D_2$, and $\mathcal{P}_1$ and $\mathcal{P}_2$ be two subsets of $\mathcal{P}$ such that $\theta(P) \in D_i$ when $P \in \mathcal{P}_i$. Then

$\sup_{P \in \mathcal{P}} \mathbb{E}_P[ d(\hat{\theta}, \theta(P)) ] \ge \delta \cdot \sup_{P_i \in \mathop{\rm co}(\mathcal{P}_i)} \| P_1 \wedge P_2 \|$

This lemma gives a lower bound on the error of parameter estimates in terms of the total variational distance between the distributions associated to different parameter sets. It’s a bit different than the bounds we usually think of like Stein’s Lemma, and also a bit different than bounds like the Cramer-Rao bound.

Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by the $H = \{-1, 1\}^m$, the vertices of the hypercube. We’ll write $t \sim_j t'$ for $t,t' \in H$ if they differ in the j-th coordinate.

Assouad’s Lemma. Let $\mathcal{F}_m = \{P_{t} : t \in H\}$ be a set of $2^m$ probability measures indexed by $H$, and suppose there are $m$ pseudo-distances $d_m$ on $\mathcal{D}$ such that for any pair $(x,y)$

$d(x,y) = \sum_{j}^m d_j(x,y)$

and that if $t \sim_j t'$

$d_j( \theta(P_t), \theta(P_{t'}) ) \ge \alpha_m$

Then

$\max_{P_t \in \mathcal{F}_m} \mathbb{E}_{t}[ d(\hat{\theta},\theta(P_t))] \ge m \cdot \frac{\alpha_m}{2} \min\{ \|P_t \wedge P_{t'} \| : t \sim_j t', j \le m\}$

The min comes about because it is the weakest over all neighbors (that is, over all j) of $P_t$ in the hypercube. Assouad’s Lemma has been used in various different places, from covariance estimation, learning, and other minimax problems.

Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.

Fano’s Lemma. Let $\mathcal{M}_r \subset \mathcal{P}$ contain $r$ probability measures such that for all $j \ne j'$ with $j,j' \le r$

$d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r$

and

$D(P_j~\|~P_{j'}) \le \beta_r$

Then

$\max_j \mathbb{E}_j[ d(\hat{\theta},\theta(P_j)) ] \ge \frac{\alpha_r}{2}\left( 1 - \frac{\beta_r + \log 2}{\log r} \right)$

Here $D(\cdot\|\cdot)$ is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message $W$ uniformly in $\{1, 2, \ldots, r\}$ and then setting the output $Y$ to have the distribution $P_j$ conditioned on $W = j$.

The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.