Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. Readers of this blog who have taken information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds for hypothesis testing and parameter estimation.
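Suppose we have a parametric family of distributions $\mathcal{G} = \{ P_{\theta} : \theta \in \mathcal{D} \}$, where $\mathcal{D}$ is a metric space with metric $d(\cdot,\cdot)$. For two distributions $P_1$ and $P_2$ define the affinity $\| P_1 \wedge P_2 \|$ by:
$$\| P_1 \wedge P_2 \| = 1 - \frac{1}{2} \| P_1 - P_2 \|_1$$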
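To make the affinity concrete, here is a minimal sketch for distributions on a finite alphabet, where the affinity can be written as the sum of the pointwise minimum of the two probability vectors, which equals one minus half the L1 distance. The function names and the two example distributions are made up for illustration.

```python
def affinity(p, q):
    """Affinity sum_x min(p(x), q(x)) of two distributions on the same finite alphabet."""
    return sum(min(a, b) for a, b in zip(p, q))


def l1_distance(p, q):
    """L1 distance sum_x |p(x) - q(x)|."""
    return sum(abs(a - b) for a, b in zip(p, q))


# Two made-up distributions on a three-letter alphabet.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

aff = affinity(p, q)
# The two expressions for the affinity should agree up to rounding.
assert abs(aff - (1 - 0.5 * l1_distance(p, q))) < 1e-12
print(aff)
```

An affinity of 1 means the two distributions are identical, and an affinity of 0 means they have disjoint supports.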
Let $\mathrm{co}(\cdot)$ denote the convex hull. Then Le Cam’s lemma is the following.
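Le Cam’s Lemma. Let $\hat{\theta}$ be an estimator of $\theta(P)$ on $\mathcal{G}$. Let $D_1$ and $D_2$ be two sets such that $d(s_1, s_2) \ge 2\delta$ for all $(s_1, s_2) \in D_1 \times D_2$, and let $\mathcal{G}_1$ and $\mathcal{G}_2$ be two subsets of $\mathcal{G}$ such that $\theta(P) \in D_i$ when $P \in \mathcal{G}_i$. Then
$$\sup_{P \in \mathcal{G}} \mathbb{E}_P\left[ d(\hat{\theta}, \theta(P)) \right] \ge \delta \cdot \sup_{P_i \in \mathrm{co}(\mathcal{G}_i)} \| P_1 \wedge P_2 \|$$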
This lemma gives a lower bound on the error of parameter estimates in terms of the total variation distance between the distributions associated with different parameter sets: the affinity is one minus the total variation distance, so the bound is strongest when the two sets contain distributions that are hard to tell apart. It’s a bit different from the bounds we usually think of, like Stein’s Lemma, and also a bit different from bounds like the Cramer-Rao bound.
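As a rough illustration, consider the simplest case in which each of the two sets of distributions is a single point, so the convex hulls are trivial and the bound is just $\delta$ times the affinity of the two distributions. The sketch below assumes a made-up example that is not from the paper: $n$ i.i.d. Bernoulli samples with parameter $1/2 + \epsilon$ versus $1/2 - \epsilon$ and loss $|\hat{\theta} - \theta|$, so that $\delta = \epsilon$. The function names are hypothetical.

```python
from math import comb


def product_bernoulli_affinity(p, q, n):
    """Affinity between n i.i.d. Bernoulli(p) draws and n i.i.d. Bernoulli(q) draws.

    Every binary sequence with k ones has the same probability under each
    model, so the sum over 2^n sequences collapses to a sum over k.
    """
    return sum(
        comb(n, k) * min(p**k * (1 - p) ** (n - k), q**k * (1 - q) ** (n - k))
        for k in range(n + 1)
    )


def le_cam_two_point_bound(eps, n):
    """delta * affinity for theta_1 = 1/2 + eps versus theta_2 = 1/2 - eps.

    With d(s, t) = |s - t| the two parameters are 2 * eps apart, so delta = eps.
    """
    return eps * product_bernoulli_affinity(0.5 + eps, 0.5 - eps, n)


for n in (1, 10, 100):
    print(n, le_cam_two_point_bound(0.1, n))
```

The bound shrinks as $n$ grows because the two product distributions become easier to distinguish; in practice one would let $\epsilon$ depend on $n$ to keep the affinity bounded away from zero.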
Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by $\mathcal{H}_m = \{-1, 1\}^m$, the vertices of the hypercube. We’ll write $t \sim_j t'$ for $t, t' \in \mathcal{H}_m$ if they differ only in the j-th coordinate.
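Assouad’s Lemma. Let $\mathcal{F}_m = \{ P_t : t \in \mathcal{H}_m \}$ be a set of $2^m$ probability measures indexed by $\mathcal{H}_m$, and suppose there are $m$ pseudo-distances $d_1, \ldots, d_m$ on $\mathcal{D}$ such that for any pair $(x, y)$
$$d(x, y) = \sum_{j=1}^{m} d_j(x, y)$$
and that if $t \sim_j t'$
$$d_j(\theta(P_t), \theta(P_{t'})) \ge \delta$$
Then
$$\max_{P_t \in \mathcal{F}_m} \mathbb{E}_t\left[ d(\hat{\theta}, \theta(P_t)) \right] \ge m \cdot \frac{\delta}{2} \min\left\{ \| P_t \wedge P_{t'} \| : t \sim_j t', \ j \le m \right\}$$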
The min comes about because the bound is set by the weakest case over all neighbors (that is, over all j) of $t$ in the hypercube. Assouad’s Lemma has been used in a variety of settings, including covariance estimation, learning, and other minimax problems.
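Here is a small sketch of how one might evaluate an Assouad-type bound numerically, taking the bound to be $m \cdot (\delta/2)$ times the smallest affinity between hypercube neighbors. The family used below is a made-up example, not one from the paper: $P_t$ has $m$ independent binary coordinates, with the j-th coordinate Bernoulli with parameter $1/2 + t_j \epsilon$, and $d_j(x, y) = |x_j - y_j|$, so that neighbors are separated by $\delta = 2\epsilon$. All function names are hypothetical.

```python
from itertools import product


def affinity(p, q):
    """Affinity sum_x min(p(x), q(x)) of two probability vectors."""
    return sum(min(a, b) for a, b in zip(p, q))


def assouad_bound(family, m, delta):
    """m * (delta / 2) * min affinity over all pairs of hypercube neighbors.

    `family` maps each vertex t in {-1, +1}^m to a probability vector.
    """
    worst = min(
        affinity(family[t], family[t[:j] + (-t[j],) + t[j + 1:]])
        for t in product((-1, 1), repeat=m)
        for j in range(m)
    )
    return m * delta / 2 * worst


def product_family(m, eps):
    """Hypothetical family: coordinate j of P_t is Bernoulli(1/2 + t_j * eps)."""
    family = {}
    for t in product((-1, 1), repeat=m):
        probs = []
        for x in product((0, 1), repeat=m):
            pr = 1.0
            for t_j, x_j in zip(t, x):
                theta_j = 0.5 + t_j * eps
                pr *= theta_j if x_j == 1 else 1 - theta_j
            probs.append(pr)
        family[t] = probs
    return family


m, eps = 3, 0.1
# delta = 2 * eps is an assumption tied to the choice d_j(x, y) = |x_j - y_j|.
print(assouad_bound(product_family(m, eps), m, delta=2 * eps))
```

The factor of $m$ is what makes this kind of bound attractive: each coordinate of the hypercube contributes its own two-point testing problem.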
Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.
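Fano’s Lemma. Let $\mathcal{M}_r \subset \mathcal{G}$ contain $r$ probability measures such that for all $j \ne j'$ with $j, j' \le r$
$$d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r$$
and
$$D(P_j \,\|\, P_{j'}) \le \beta_r$$
Then
$$\max_j \mathbb{E}_j\left[ d(\hat{\theta}, \theta(P_j)) \right] \ge \frac{\alpha_r}{2} \left( 1 - \frac{\beta_r + \log 2}{\log r} \right)$$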
Here $D(\cdot \,\|\, \cdot)$ is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message $W$ uniformly in $\{1, 2, \ldots, r\}$ and then setting the output $Y$ to have the distribution $P_j$ conditioned on $W = j$.
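To get a feel for the numbers, the sketch below evaluates a Fano-type lower bound of the form $\frac{\alpha_r}{2}\left(1 - \frac{\beta_r + \log 2}{\log r}\right)$, where $\alpha_r$ is the minimum separation between parameters, $\beta_r$ is the maximum pairwise KL-divergence (in nats), and $r$ is the number of hypotheses. The function name is hypothetical, and clipping at zero is my own addition: the expression can go negative for small $r$, in which case the bound is vacuous because the loss is nonnegative.

```python
from math import log


def fano_lower_bound(alpha, beta, r):
    """(alpha / 2) * (1 - (beta + log 2) / log r), clipped at zero.

    alpha: minimum separation between the parameters of distinct hypotheses
    beta:  maximum pairwise KL-divergence, in nats
    r:     number of hypotheses (at least 2)
    """
    if r < 2:
        raise ValueError("need at least two hypotheses")
    return max(0.0, alpha / 2 * (1 - (beta + log(2)) / log(r)))


# Illustrative values only: more hypotheses help as long as beta stays small.
for r in (2, 16, 1024):
    print(r, fano_lower_bound(alpha=1.0, beta=0.5, r=r))
```

With only two hypotheses the expression is never positive, which is one reason Fano-style arguments need a large, well-separated set of hypotheses whose pairwise divergences stay small.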
The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.