Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. People who read this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.
Suppose we have a parametric family of distributions $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\Theta$ is a metric space with metric $d$. For two distributions $P$ and $Q$ with densities $p$ and $q$ with respect to a common dominating measure $\mu$, define the affinity $\|P \wedge Q\|$ by:

$$\|P \wedge Q\| = \int \min(p, q)\, d\mu = 1 - \frac{1}{2} \int |p - q|\, d\mu.$$
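For discrete distributions the affinity is just a sum of coordinate-wise minima, which makes it easy to play with numerically. Here is a minimal sketch (my own illustration, not from the paper), assuming two distributions on a common finite alphabet:

```python
import numpy as np

def affinity(p, q):
    """Affinity ||P ∧ Q|| = sum_x min(p(x), q(x)) between two pmfs
    on the same finite alphabet; this equals 1 - TV(P, Q)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.minimum(p, q).sum())

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(affinity(p, q))                  # 0.9
print(1 - 0.5 * np.abs(p - q).sum())   # same number via total variation
```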
Let $\mathrm{co}(\cdot)$ denote the convex hull of a set of distributions. Then Le Cam's lemma is the following.
Le Cam's Lemma. Let $\hat{\theta}$ be an estimator of $\theta(P)$ on $\mathcal{P}$. Suppose $D_1$ and $D_2$ are two subsets of $\Theta$ such that $d(s_1, s_2) \ge 2\delta$ for all $s_1 \in D_1$ and $s_2 \in D_2$, and let $\mathcal{P}_1$ and $\mathcal{P}_2$ be two subsets of $\mathcal{P}$ such that $\theta(P) \in D_i$ when $P \in \mathcal{P}_i$. Then

$$\sup_{P \in \mathcal{P}} \mathbb{E}_P\big[ d(\hat{\theta}, \theta(P)) \big] \;\ge\; \delta \cdot \sup_{P_1 \in \mathrm{co}(\mathcal{P}_1),\, P_2 \in \mathrm{co}(\mathcal{P}_2)} \| P_1 \wedge P_2 \|.$$
This lemma gives a lower bound on the error of parameter estimates in terms of the total variation distance between the distributions associated with the two parameter sets. It's a bit different from the bounds we usually think of, like Stein's Lemma, and also a bit different from bounds like the Cramér-Rao bound.
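To get a feel for how the bound is used, here is a toy two-point instantiation (my own sketch, not from Yu's paper): estimating the bias of a coin from $n$ flips, with the two hypotheses $p_0 - \delta$ and $p_0 + \delta$. The number of heads is a sufficient statistic, so the affinity of the $n$-fold product measures is just the affinity of two binomial laws.

```python
import numpy as np
from scipy.stats import binom

def affinity(p, q):
    # ||P ∧ Q|| for two pmfs on the same finite support
    return float(np.minimum(p, q).sum())

def le_cam_two_point_bound(n, delta, p0=0.5):
    """Two-point Le Cam bound for estimating a Bernoulli bias from n flips.

    Hypotheses: theta_1 = p0 - delta and theta_2 = p0 + delta, so the two
    parameter sets are 2*delta apart and the lemma gives
        sup_P E_P |theta_hat - theta(P)| >= delta * ||P_1^n ∧ P_2^n||.
    The number of heads is sufficient, so the affinity of the n-fold
    products equals the affinity of the two Binomial(n, .) laws.
    """
    k = np.arange(n + 1)
    aff = affinity(binom.pmf(k, n, p0 - delta), binom.pmf(k, n, p0 + delta))
    return delta * aff

for n in [10, 100, 1000, 10000]:
    delta = 0.5 / np.sqrt(n)   # keep the two hypotheses statistically close
    print(n, le_cam_two_point_bound(n, delta))
```

With $\delta \asymp 1/\sqrt{n}$ the affinity stays bounded away from zero, so the bound scales like $1/\sqrt{n}$, the familiar minimax rate for this problem.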
Le Cam's lemma can be used to prove Assouad's lemma, which is a statement about a more structured set of distributions indexed by $\{-1,+1\}^m$, the vertices of the hypercube. We'll write $\tau \sim_j \tau'$ for $\tau, \tau' \in \{-1,+1\}^m$ if they differ only in the $j$-th coordinate.
Assouad's Lemma. Let $\mathcal{F}_m = \{ P_\tau : \tau \in \{-1,+1\}^m \}$ be a set of $2^m$ probability measures indexed by $\tau \in \{-1,+1\}^m$, and suppose there are $m$ pseudo-distances $d_1, \ldots, d_m$ on $\Theta$ such that for any pair $x, y$

$$d(x, y) = \sum_{j=1}^{m} d_j(x, y),$$

and that if $\tau \sim_j \tau'$,

$$d_j\big( \theta(P_\tau), \theta(P_{\tau'}) \big) \ge \alpha_m.$$

Then

$$\max_{\tau \in \{-1,+1\}^m} \mathbb{E}_{P_\tau}\big[ d(\hat{\theta}, \theta(P_\tau)) \big] \;\ge\; m \cdot \frac{\alpha_m}{2} \cdot \min\big\{ \| P_\tau \wedge P_{\tau'} \| : \tau \sim_j \tau' \text{ for some } j \big\}.$$
The min comes about because it is the weakest affinity over all neighbors (that is, over all $j$) of $\tau$ in the hypercube. Assouad's Lemma has been used in various different places, from covariance estimation to learning and other minimax problems.
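As a concrete (and entirely hypothetical) instance, consider jointly estimating the biases of $m$ independent coins in $\ell_1$ loss, where under $P_\tau$ coin $j$ is flipped $n$ times with bias $1/2 + \tau_j \delta$. Neighbors in the hypercube differ in a single coin, so the min-affinity is again a two-binomial computation. This is my own sketch of how the lemma gets applied, not an example from the paper.

```python
import numpy as np
from scipy.stats import binom

def affinity(p, q):
    return float(np.minimum(p, q).sum())

def assouad_bound(m, n, delta):
    """Assouad bound for jointly estimating m coin biases under l1 loss.

    Hypothetical hypercube family: P_tau, tau in {-1,+1}^m, is the joint law
    of m independent coins, where coin j is flipped n times with bias
    1/2 + tau_j * delta. Take d_j(x, y) = |x_j - y_j|, so d is the l1
    distance and alpha_m = 2 * delta. Neighbors tau ~_j tau' differ only in
    block j, so ||P_tau ∧ P_tau'|| reduces to the affinity between
    Binomial(n, 1/2 - delta) and Binomial(n, 1/2 + delta). The lemma gives
        max_tau E d(theta_hat, theta(P_tau)) >= m * (alpha_m / 2) * min-affinity.
    """
    k = np.arange(n + 1)
    aff = affinity(binom.pmf(k, n, 0.5 - delta), binom.pmf(k, n, 0.5 + delta))
    alpha_m = 2 * delta
    return m * (alpha_m / 2) * aff

# with delta ~ 1/sqrt(n) the l1 risk lower bound scales like m / sqrt(n)
for n in [100, 400, 1600]:
    print(n, assouad_bound(m=20, n=n, delta=0.5 / np.sqrt(n)))
```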
Yu then shows how to prove Fano's inequality from Assouad's inequality. In information theory we see Fano's Lemma as a statement about random variables, and it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano's inequality.
Fano's Lemma. Let $\mathcal{M}_r \subset \mathcal{P}$ contain $r$ probability measures $P_1, P_2, \ldots, P_r$ such that for all $j \ne j'$ with $j, j' \in \{1, 2, \ldots, r\}$,

$$d\big( \theta(P_j), \theta(P_{j'}) \big) \ge \alpha_r$$

and

$$D\big( P_j \,\|\, P_{j'} \big) \le \beta_r.$$

Then

$$\max_{j} \mathbb{E}_{P_j}\big[ d(\hat{\theta}, \theta(P_j)) \big] \;\ge\; \frac{\alpha_r}{2} \left( 1 - \frac{\beta_r + \log 2}{\log r} \right).$$
Here $D(\cdot \| \cdot)$ is the KL-divergence. The proof follows from the regular Fano's inequality by choosing a message $W$ uniformly in $\{1, 2, \ldots, r\}$ and then setting the output $X$ to have the distribution $P_j$ conditioned on $W = j$.
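To see the bound in action, here is a small numerical sketch (my own, with made-up numbers rather than anything from the paper) for a Gaussian location family, where the KL divergences are easy to write down. The bound is only nontrivial when $\beta_r + \log 2 < \log r$, so one needs many well-separated hypotheses with small pairwise KL divergence.

```python
import numpy as np

def fano_bound(alpha_r, beta_r, r):
    """Evaluate the Fano-type bound (alpha_r / 2) * (1 - (beta_r + log 2) / log r).

    alpha_r : minimum pairwise distance between the parameters theta(P_j)
    beta_r  : maximum pairwise KL divergence D(P_j || P_j')
    r       : number of hypotheses
    """
    return (alpha_r / 2) * (1 - (beta_r + np.log(2)) / np.log(r))

# Hypothetical Gaussian location example: P_j is n i.i.d. draws from
# N(theta_j, sigma^2 I), so D(P_j || P_j') = n * ||theta_j - theta_j'||^2 / (2 sigma^2).
n, sigma = 100, 1.0
r = 256                          # number of well-separated hypotheses (made up)
alpha = 0.05                     # assumed minimum pairwise separation
max_sq_dist = (4 * alpha) ** 2   # assume pairwise distances lie in [alpha, 4*alpha]
beta = n * max_sq_dist / (2 * sigma ** 2)
print(fano_bound(alpha, beta, r))   # about 0.013
```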
The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.