Johnson-Lindenstrauss, Restricted Isometry (RIP), for-each vs for-all, and all that

The Johnson-Lindenstrauss lemma shows how to project points in much lower dimension without significant distortion of their pairwise Euclidean distances.

Say you have n arbitrary vectors p_1, p_2 \ldots, p_n in \mathbb{R}^n.
Use a fat random (normalized) Gaussian matrix of dimensions m \times n where

m= \frac{c}{\epsilon^2} \log n,

to project the points in m dimensions.

Then, with probability at least 1/2, the projected points A p_1, A p_2, \ldots A p_n have pairwise distances that satisfy:

(1-\epsilon) ||p_i -p_j ||_2 \leq ||A p_i - A p_j ||_2 \leq (1+\epsilon) ||p_i -p_j||_2.

This is quite remarkable since we reduced the dimension from n into \log n.

This is a very fat matrix.

A number of other matrices beyond Gaussian have been shown to also work with similar guarantees.

The Johnson-Lindenstrauss lemma sounds similar to the Restricted Isometry Property (RIP).
A fat m \times n matrix A has the RIP property if

(1-\epsilon) ||x||_2 \leq ||A x ||_2 \leq ||(1+\epsilon) ||x||_2,

for all vectors x that are k-sparse, i.e. ||x||_0 \leq k.

This means that the matrix A does not distort the \ell_2 norm of sparse vectors.
As before, well-known results have shown that random Gaussian matrices satisfy RIP with high probability if we set
m = \Theta( k \log(n/k)).

To understand the similarities and differences between RIP and JL, we need to see how these results are proven. The central question is how fat random matrices distort the length of vectors and with what probability.

 

All known proofs of JL (for each matrix family) start with a lemma that deals with one fixed vector:
Lemma 1:

For each given vector x, the matrix A does not distort the length too much, with good probability:

Pr[ (1-\epsilon) ||x||_2 \leq ||Ax||_2 \leq (1+\epsilon) ||x||_2] \geq 1- P_{fail}(m,n) \quad (1)

The key issue is between for all versus for each vector quantifiers.
If we first fix the vector and throw the matrix randomly, this is much easier than promising something for a random matrix that once realized, works w.h.p. for all vectors.

Clearly, if first realize the matrix, there are certainly vectors where lemma 1 cannot hold: since the matrix is fat, it has a nullspace so there are nonzero vectors that have ||Ax||=0. Lemma 1 simply says that this is unlikely and gives a `for each vector, a random matrix is good whp’ guarantee.

How small we can make P_{fail} depends crucially on how small we make m. Since we want to achieve dimensionality reduction we want to make the matrix as fat as possible, i.e. m as small as possible. But P_{fail} will become larger.

To prove JL for some family of matrices we need to make P_{fail}\leq 1/n^2.

If we have lemma 1 with this guarantee, we just do a union bound over all {n \choose 2}= \Theta(n^2) pairs of vectors and obtain the pairwise distance distortion guarantee for all vectors p_1, p_2, \ldots p_n. Hence JL is Lemma 1 + union bound over all pairs.

It turns out that for Gaussian iid matrices, if you make m=\frac{1}{\epsilon^2} \log n, the probability of error is smaller than P_{fail} \leq 1/n^2. This follows by the wonderful symmetries of the Gaussian: ||Ax ||_2 has the same distribution as ||A e_1 ||_2 i.e. the \ell_2 norm of the first column of A. Therefore ||Ax ||^2_2 has a Chi-square distribution with m degrees of freedom and standard concentration results suffice to bound P_{fail}.

To prove RIP, exactly the same recipe can be followed: first prove Lemma 1 for one vector and then do a union bound over all {n \choose k} k-sparse vectors. The problem is now we have a much larger union bound and therefore we need P_{fail} \leq \frac{1}{{n \choose k}}. This enforces a much larger m compared to JL.

looking at a bipartite expander

The general recipe emerges. Say you want to prove a good property for all hairy vectors.

First show that for each fixed vector a random matrix has the good property with probability at least 1- P_{fail}.

Then ensure P_{fail} \leq \frac{1}{\text{No of hairy vectors}}. This will limit how much dimensionality reduction you can achieve.

Finally, a union bound establishes the property for all hairy vectors.

Advertisements

Irving S. Reed

USC announced today: “Professor Irving S. Reed passed away this morning, September 11, 2012, at the age of 88.  Professor Reed was a member of the Ming Hsieh Department of Electrical Engineering faculty from 1963 until his retirement in 1993 as Powell Chair of Engineering Emeritus.”

Prof. Reed had made significant contributions in radar design theory, programming languages and, of course, coding theory.  Reed-Muller and Reed-Solomon codes have been incredibly influential in electrical engineering, theoretical computer science and mathematics.

Reed-Solomon codes are used to protect data in countless applications including CDs, DVDs, DSL and RAID. They were also used to encode the digital images sent back by space probes like the Voyager. RS codes are my favorite example of something practically useful and at the same time theoretically deep.

The announcement mentions: “Millions of people today enjoy the benefits of Reed’s many inventions and contributions of technology, without being aware of their remarkable benefactor.”

Meanwhile in Greece: Tax inspectors besieged in Hydra.

This Friday, tax inspectors arrived in the island of Hydra located in the Aegean sea near Athens. Hydra (also spelled Ydra) is a high-end tourist destination with a strong yachting and maritime tradition. 

As in.gr reported, one restaurant owner was caught for tax evasion and fainted. Tax inspectors arrested her son and led him to the local police station.

Subsequently, protesting locals besieged the police station where he was kept. The tax inspectors were forced to remain in the station until the morning, when special police units arrived from Athens to set them free.

 

Image

A Ranking of World University Rankings

Shanghai Jiao Tong University recently published its Academic Ranking of World Universities (ARWU) for 2012. As usual, it contains both general university rankings and many specialized field and location based lists.
Ranking universities is an impressive dimensionality reduction challenge:
map from academic institutions to one dimension, hoping that your projection relates to loosely defined concepts like academic reputation. Think of trying to land a fighter jet on an aircraft carrier while reading Tractatus Logico-Philosophicus to understand the concept of landing.

Rankings of specific departments are much less meaningless compared to university rankings and serve different purposes. As a student, I must admit that rankings seemed very useful. They allegedly help in another aircraft carrier landing task of choosing a school.

Just like everybody else, I still look rankings but now I have several concerns. Graduate school rankings in particular create a self-fulfilling prophecy: the best students and young faculty join higher ranked schools when given the option, hence creating a positive feedback loop.

US News almost has a de-facto national ranking monopoly, especially in the eyes of students. This sounds like a dangerous thing. Universities and departments have strong incentives to optimize a specific formula involving indicators like PhDs awarded per year and papers per academic staff. Aligning with a specific ranking methodology might not necessarily align with society’s or student interests.

Instead of thinking about these important questions, I compiled my personal ranking of world university rankings. The methodology is highly scientific: I read through each list and counted how many times I screamed: ‘Seriously, you ranked this school higher than this other school?‘.

The ranking with the lowest number of screams wins. Interestingly, ARWU wins my least absurd ranking award. The rest are:

1. ARWU

2. Times Higher Education World University Rankings

3. QS World University Rankings (aka US News world)
(Imperial at #6 and Stanford at #11 /facepalm)

(RatER, Forbes remain unranked)

Advice on giving talks

We are at ISIT and I realize I am going over the same points multiple times with my students, so I thought of summarizing everything here.

How to give a better ISIT Talk.

1. Take your talks very seriously.

Do practice runs. Many of them. Your only hope for academia is by giving great talks. Give a practice talk to your friends. In the middle of your talk pause and quiz them: ok, did you get why alpha and beta are not independent? (hint: they did not).
If they did not, it is your problem not their problem.

2. They do not remember what alpha is.

In most talks, your audience does not understand what the notation is, what the problem is, or why they should care. Think of yourself: how often do you sleep or suffer through talks without even knowing what the problem is?
Do not treat your audience like that.

It is a typical scene when the presenter is focusing on a minor technical issue for ten minutes when 90% of the audience does not even know what exactly the problem is, or care.

One important exception is when your audience works on the same problem. Typically only a small part of your talk should be focused on these experts (see also 13).

3. Do a multi-resolution talk.

A useful guideline is: for an 18 minute talk, 7-9 minutes should go on explaining the formulation of your problem and why should anybody care. 5-6 minutes on explaining *what* the solution is and 4 minutes or so, on the actual painful technical stuff. The first part should be aimed at a first year grad student level. The second at a senior grad student in the general ISIT area and the last part to the expert working on related problems. If fewer than 90% of your audience are checking email in the last part of your talk, consider that a success.

4. Try to make things simple, not difficult.

It is a common mistake for starting grad students to think that their work is too simple. For that reason they will not mention known things (like explaining that ML decoding for the erasure channel consists of solving linear equations, because they fear this is too simple and well known).
Always mention the basic foundations while you try to explain something non-trivial. Your goal is not to sound smart but rather to have your audience walk out knowing something more.

Even when your audience hears things they already know, they get a warm fuzzy feeling, they do not think you are dumb.

5. Add redundancy, repeat a lot in words.

Do not say ‘We try to minimize d(k)’.
Say `we try to minimize the degree d which as I mentioned, is a function of the number of symbols k’. Repeat things all the time: Summarize what you will talk about, and in conclusions say the main points again.

6. Go back to basic concepts in words, repeat definitions.

Try to mention the basic mathematical components not the jargon you have introduced. Do not not say ‘Therefore, the code is MSR-optimal‘ but ‘Therefore, the code minimizes the repair communication (what we call MSR optimal)‘. Try to reduce your statements back to fundamental things like probabilities, graphs, rank of matrices, etc whenever possible. Do not just define some alpha jargon in the first slide and talk about that damn alpha throughout your talk.

7. Never go over time.

I have often seen even experienced speakers getting a warning that they have 3 minutes and still trying to go through their ten last slides. When you are running out of time, the goal is not to talk faster.

Say something like ‘Unfortunately or fortunately for you, I do not have time to go into the proof so I will have to skip it. The main ingredient involves analyzing random matchings which is done through Hall’s theorem and union bounds. Please talk to me offline if you are interested…
Then, go through your conclusions slowly, repeating your main points.
This is another example of multi-resolution: you explain the techniques at a high level first. Even if you had time, you would still first have to give a one sentence high level description and then get into the the details.

8. Draw attention to important slides.

People are probably checking the Euro final when you are at slide 4, explaining what your problem is all about. Wake them up and give a notification that this is the one slide they do not want to miss. Do this right before the critical points.

9. Every slide should have one simple message.

After you make your slides ask yourself: what is the goal of this slide, I just want to explain this part. Iteratively try to simplify your slides into smaller and smaller messages. It is easier for your audience to grasp one packet of information at a time. Do not have derivations on slides (especially for an 18 minute talk), unless there is one very simple trick you really want to show. Showing math does not make you look smarter.

10. Be minimalist.

Every word on your slides, every symbol or equation you put up there dilutes the attention of your audience. Look at each bullet/slide and ask, do I really need this part or can I remove it?

11. Be excited.

Vary the tone of your voice, it may wake up somebody. You need to entertain and perform. Think: if you are not excited with your results why should anybody else be?

12. Cite people.

When somebody has related prior work, cite them on your slide. That has the benefit of waking them up when they see or hear their name.

As Rota says: `Everyone in the audience has come to listen to your lecture with the secret hope of hearing their work mentioned.

13. Connect to what your audience cares about.

This is non-trivial and requires experience. If you are giving a talk in a fountain codes session, you do not have to spend ten minutes defining things your audience knows already. Still define it quickly to make sure everybody is on the same page on notation. Knowing how to be at the right resolution for your audience becomes easier in time.

14. Prepare your logistics.

Know the room (go there before), know who your session chair is, have your macbook projector dongle, pre-load your slides on a USB. Bring your charger, disconnect from the internet (fun Skype messages pop-up during talks). If you are using a different machine, test your Powerpoint slides (hint: they look completely different).

15. Talk to people afterwards.

Talk to people about their work and your work. Remember that this is a professional networking event. Do not hang out with your friends, you have plenty of time for that after you go back home. Networking with other students and faculty is very important, in my case I learn more by talking to people offline than in talks.

16. Engineering theory is essentially story-telling.

Our papers and talks are essentially story-telling: Here is a model for a wireless channel, here is a proof about this model. A good story has an intellectual message that will hopefully help people think about a real engineering problem in a cleaner way.
The other aspect of our job is creating algorithms that are hopefully useful in real systems. Think: what is your story and how will you present it in your talk.

17. Read the brilliant Ten Lessons I Wish I Had Been Taught by Gian-Carlo Rota.

A proof that P is not equal to NP?

Today we had the last day of the information theory summer school and Rudiger Urbanke gave a lecture on LDPC codes, random graphs, polar codes and other ways to approach Shannon Capacity.

After it was over, I received a few emails about a potential proof that P is not equal to NP from Vinay Deolalikar.

A quick look at the proof seems (to my ignorant eyes) like a serious, well-written attempt, and some people who know more seem to agree.

Surprisingly, Replica symmetry breaking, random graphs (and how they are locally tree-like) and inference on graphical models (things we discussed in the school) seem to be central in the arguments.

Let’s hope the proof is correct!

Explicit constructions of Expanders and LDPC codes

After seeing Nisim’s ISIT talk on LP Decoding of Regular LDPC Codes in Memoryless Channels, I remembered one issue that has been puzzling me for a while: if there are any explicit constructions of LDPC codes that have a provable nonzero threshold (for block-error probability, hence excluding density evolution arguments) i.e. correct a constant fraction of bit-flipping errors (other channels would be harder)– under any efficient decoder.
The question seems trivial: As is well-known,  a random (d_v,d_c) regular graph will be a very high quality expander (for very small sets) and therefore a (very small) constant fraction of bit-flipping errors can be corrected by Sipser & Spielman‘s bit flipping algorithm.
Also the work by Feldman et al. LP decoding corrects a constant fraction of errors establishes correction of an adversarial set of bit-flipping errors for expander graphs under LP decoding.
But here is the catch: sure, a random graph is a good expander with high probability, but how do we check if a given graph is a good expander?
The key of course is my definition of explicit construction: I mean an algorithm that terminates in polynomial time and gives me a code that certifiably works (and I think this is relevant to practice, since at the end of the day I have to use one fixed code). The only efficient way that I know to check expansion involves spectral methods ( This is from Modern Coding theory by Richardson & Urbanke) : If we have a (d_v,d_c) regular bipartite graph with n variable nodes and \frac{d_v}{ d_c} n check nodes, if we take H to be the incidence matrix of the bipartite graph (the parity check matrix of the code), we can form H^T H and look at its eigenvalues
\lambda_1 \geq \lambda_2,\geq \lambda_3 \cdots and Tanner’s spectral bound on expansion is that expansion \gamma is greater than
\gamma(\alpha) \geq \frac{ d_v}{(d_v d_c - \lambda_2) \alpha + \lambda_2} for any size of expanding sets $\alpha$. Unfortunately this bound cannot certify expansion for any \gamma >1/2, which is exactly the point where it starts being useful for coding theory. Perhaps there are stronger spectral bounds that could establish more expansion, the book Spectral Graph Theory by Fan Chung Graham contains a lot of material on that point but I have not seen any such applications to coding.
So ok, lets say we do not know how to construct (or certify) that LDPCs have high expansion, how about other graph properties that will guarantee a correctable fraction of errors in polynomial time? This started when I was working with Costas on LP decoding for LDPC codes and we were always (incorrectly) assuming that random regular bipartite graphs will have \log n girth with high probability. When we actually tried to find a proof for this, for example looking at the Modern Coding theory book we find that usual proofs establish a significantly weaker statement: in a random (d_v,d_c) regular graph, if you start from a variable and start expanding the tree, you will not loop around after a constant number of steps with 1/poly(n) probability. This is what is refered to as ‘locally-tree like’. I do not know of any stronger statements but I think it can be easily shown that for any fixed cycle length, the expected number of cycles of that length is constant for regular random graphs.
The breakthrough paper by Arora, Daskalakis and Steurer, ‘Message-Passing Algorithms and Improved LP Decoding‘ establishes that regular LDPCs with \log n girth will correct a constant fraction of random bit-flipping errors whp under LP decoding. But how do we find regular LDPCs with \log n girth ?
After searching a little I found the recent work of Bayati et al.
Generating Random Graphs with Large Girth and Generating random Tanner-graphs with large girth that talk about the related problem of generating a graph with high girth uniformly from all such graphs (and with a given degree distribution) but as far as I understood these constructions cannot guarantee a diameter scaling like \log n (but only any constant diameter). This is of course the relevant practical question but the scaling issue remains.
The only construction that I know is the one found in the appendix of Gallager’s thesis that contains a deterministic algorithm that constructs (d_v,d_c) regular Tanner graphs with \log n girth.
The same question is relevant for compressed sensing when we ask for sparsity in the measurement matrix: All the constructions that I know for sparse measurement matrices (that require the optimal number of measurements under Basis pursuit, e.g. see the survey of Gilbert and Indyk: Sparse Recovery using Sparse Matrices) are constructed from high \gamma>2/3 bipartite expander graphs. But given a fixed measurement matrix how do we certify that it is good and should be implemented in a system?
Of course one can use Message passing algorithms for compressed sensing (by Donoho, Maleki and Montanari) and obtain sparse measurement matrices with very good thresholds but under a different decoding algorithm.
The close connection of compressed sensing (under LP – Basis pursuit decoding) and channel coding (under LP decoding) gives one path for certifiable measurement matrices that are sparse):
1. The Arora, Daskalakis & Steurer paper guarantees that (3,6) regular graphs with \log n girth correct a (quite high– much higher than expansion arguments) constant fraction of errors under LP decoding.
2. Gallager’s thesis appendix contains deterministic constructions of such (3,6) regular sparse matrices.
3. Our connection result establishes that if a matrix corrects a constant fraction of bit flipping errors, the same matrix used as a compressed sensing measurement matrix will recover all sparse signals (of the same support as the bit-flipping errors).
Combining (1)+(2)+(3) we obtain that: for sensing a length n signal with m=n/2  measurements one recovers almost all sparse signals of size up to $0.05 n$ (whp over supports) and the (0,1) measurement matrix has O(n) non-zeros. (see our recent paper )
Conclusion: The Appendix C in Gallager’s thesis contains the best sparse compressed sensing measurement matrices known (under basis pursuit).
Not bad for 1963!