The Johnson-Lindenstrauss (JL) lemma shows how to project points into a much lower dimension without significantly distorting their pairwise Euclidean distances.
Say you have $n$ arbitrary vectors $x_1, \ldots, x_n$ in $\mathbb{R}^d$.
Use a fat random (normalized) Gaussian matrix $A$ of dimensions $m \times d$, where $m = O\!\left(\frac{\log n}{\epsilon^2}\right)$, to project the points into $m$ dimensions.
Then, with probability at least 1/2, the projected points have pairwise distances that satisfy:
$$(1-\epsilon)\,\|x_i - x_j\|_2^2 \;\le\; \|Ax_i - Ax_j\|_2^2 \;\le\; (1+\epsilon)\,\|x_i - x_j\|_2^2 \quad \text{for all pairs } i, j.$$
This is quite remarkable since we reduced the dimension from $d$ down to $O\!\left(\frac{\log n}{\epsilon^2}\right)$, independently of $d$.
This is a very fat matrix.
A number of other matrices beyond Gaussian have been shown to also work with similar guarantees.
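As a quick sanity check (not a proof), here is a small numerical experiment; the constant 8 in the choice of $m$ below is an arbitrary assumption for illustration, since the lemma only promises $m = O(\log n / \epsilon^2)$:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n, d, eps = 200, 10_000, 0.25
m = int(np.ceil(8 * np.log(n) / eps**2))      # assumed constant 8, for illustration only

X = rng.standard_normal((n, d))               # n arbitrary vectors in R^d
A = rng.standard_normal((m, d)) / np.sqrt(m)  # fat, normalized Gaussian matrix
Y = X @ A.T                                   # projected points in R^m

# ratio of squared pairwise distances after/before projection
ratios = pdist(Y, 'sqeuclidean') / pdist(X, 'sqeuclidean')
print(m, ratios.min(), ratios.max())          # compare the spread with 1 - eps and 1 + eps
```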
The Johnson-Lindenstrauss lemma sounds similar to the Restricted Isometry Property (RIP).
A fat $m \times d$ matrix $A$ has the RIP property if
$$(1-\epsilon)\,\|x\|_2^2 \;\le\; \|Ax\|_2^2 \;\le\; (1+\epsilon)\,\|x\|_2^2,$$
for all vectors $x$ that are $k$-sparse, i.e. $\|x\|_0 \le k$.
This means that the matrix $A$ does not significantly distort the norm of any sparse vector.
As before, well-known results have shown that random Gaussian matrices satisfy RIP with high probability if we set $m = O\!\left(\frac{k \log(d/k)}{\epsilon^2}\right)$.
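Certifying RIP for a given matrix is computationally hard in general, but we can at least spot-check the definition by drawing random $k$-sparse vectors and measuring how much their norms are distorted. This is only a heuristic sketch, with assumed constants chosen so it runs quickly; it does not verify RIP:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, eps = 10_000, 10, 0.5
m = int(np.ceil(2 * k * np.log(d / k) / eps**2))  # assumed constant 2, for speed only

A = rng.standard_normal((m, d)) / np.sqrt(m)      # fat, normalized Gaussian matrix

worst = 0.0
for _ in range(1_000):
    support = rng.choice(d, size=k, replace=False)  # random support of size k
    coeffs = rng.standard_normal(k)                 # nonzero entries of a k-sparse x
    # A @ x equals A[:, support] @ coeffs because x vanishes off its support
    ratio = np.linalg.norm(A[:, support] @ coeffs) ** 2 / np.linalg.norm(coeffs) ** 2
    worst = max(worst, abs(ratio - 1.0))

print(m, worst)   # worst observed distortion over the sampled sparse vectors
```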
To understand the similarities and differences between RIP and JL, we need to see how these results are proven. The central question is how fat random matrices distort the length of vectors and with what probability.
All known proofs of JL (for each matrix family) start with a lemma that deals with one fixed vector:
Lemma 1: For each given vector $x$, the matrix $A$ does not distort its length too much, with good probability:
$$\Pr\Big[(1-\epsilon)\,\|x\|_2^2 \;\le\; \|Ax\|_2^2 \;\le\; (1+\epsilon)\,\|x\|_2^2\Big] \;\ge\; 1 - \delta.$$
The key issue is the difference between the `for all' and the `for each' vector quantifiers.
If we first fix the vector and then draw the matrix randomly, this is much easier than promising that a random matrix, once realized, works for all vectors simultaneously w.h.p.
Clearly, if we first realize the matrix, there are certainly vectors for which the guarantee of Lemma 1 cannot hold: since the matrix is fat, it has a nontrivial nullspace, so there are nonzero vectors $x$ with $Ax = 0$. Lemma 1 simply says that this is unlikely and gives a `for each vector, a random matrix is good whp' guarantee.
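This `for each' guarantee is easy to probe empirically: fix one vector, draw many independent matrices, and count how often the fixed vector is distorted by more than $\epsilon$. In this sketch the failure fraction plays the role of $\delta$ in Lemma 1 (the particular sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, eps, trials = 500, 100, 0.25, 2_000

x = rng.standard_normal(d)            # one fixed vector, chosen before seeing A
x_norm2 = np.linalg.norm(x) ** 2

failures = 0
for _ in range(trials):
    A = rng.standard_normal((m, d)) / np.sqrt(m)   # fresh random matrix each time
    ratio = np.linalg.norm(A @ x) ** 2 / x_norm2
    failures += not (1 - eps <= ratio <= 1 + eps)

print(failures / trials)              # empirical estimate of the failure probability
```

Increasing $m$ drives this fraction down, which is exactly the trade-off discussed next.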
How small we can make $\delta$ depends crucially on how small we make $m$. Since we want to achieve dimensionality reduction, we want to make the matrix as fat as possible, i.e. $m$ as small as possible. But then $\delta$ will become larger.
To prove JL for some family of matrices we need to make $\delta \le \frac{1}{n^2}$, roughly one over the number of pairs of vectors.
If we have Lemma 1 with this guarantee, we just apply it to each difference vector $x_i - x_j$ and do a union bound over all $\binom{n}{2}$ pairs, obtaining the pairwise distance distortion guarantee for all the vectors. Hence JL is Lemma 1 + union bound over all pairs.
It turns out that for Gaussian i.i.d. matrices, if you make $m = O\!\left(\frac{\log n}{\epsilon^2}\right)$, the probability of error is smaller than $\frac{1}{n^2}$. This follows by the wonderful symmetries of the Gaussian: for any fixed unit-norm vector $x$, $\|Ax\|_2$ has the same distribution as $\|Ae_1\|_2$, i.e. the $\ell_2$ norm of the first column of $A$. Therefore $m\,\|Ax\|_2^2$ has a Chi-square distribution with $m$ degrees of freedom, and standard concentration results suffice to bound $\Pr\big[\,\big|\|Ax\|_2^2 - 1\big| > \epsilon\,\big]$ by $\frac{1}{n^2}$.
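Since the distribution is exactly Chi-square, we can compute the per-vector failure probability numerically and watch it drop below $1/n^2$ as $m$ grows; the specific values of $m$ below are just illustrative:

```python
from scipy.stats import chi2

n, eps = 10_000, 0.25
for m in (200, 400, 800, 1600):
    # Pr[ | ||Ax||^2 - 1 | > eps ]  where  m * ||Ax||^2  ~  Chi-square(m)
    delta = chi2.cdf(m * (1 - eps), df=m) + chi2.sf(m * (1 + eps), df=m)
    print(m, delta, delta < 1 / n**2)   # union bound over ~n^2/2 pairs needs delta <= 1/n^2
```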
To prove RIP, exactly the same recipe can be followed: first prove Lemma 1 for one vector and then do a union bound over all $k$-sparse vectors (more precisely, over all $\binom{d}{k}$ supports and an $\epsilon$-net within each one, since there are infinitely many sparse vectors). The problem is that now we have a much larger union bound and therefore we need $\delta \le 2^{-c\,k \log(d/k)}$ for some constant $c$. This enforces a much larger $m$ compared to JL.
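A back-of-the-envelope comparison makes the gap concrete (the constant 8 is a placeholder; only the scaling matters):

```python
import numpy as np

eps = 0.25
n, d, k = 10_000, 1_000_000, 50

m_jl  = 8 * np.log(n) / eps**2            # ~ log(n) / eps^2      (JL, n points)
m_rip = 8 * k * np.log(d / k) / eps**2    # ~ k log(d/k) / eps^2  (RIP, k-sparse in R^d)
print(int(m_jl), int(m_rip))              # the RIP bound is larger by roughly a factor of k here
```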
The general recipe emerges. Say you want to prove a good property for all hairy vectors.
First show that for each fixed vector, a random matrix has the good property with probability at least $1-\delta$.
Then ensure that $\delta$ is smaller than one over the number of hairy vectors (or the size of a suitable net of them). This will limit how much dimensionality reduction you can achieve.
Finally, a union bound establishes the property for all hairy vectors.
Is a hairy vector a cowlick in the hairy ball theorem (the only technical thing that came up when googling “hairy”) or is it just a troublesome exception that fails?