# NIPS visualization

I’m headed to NIPS next week. Via my (soon to be ex-) colleague Dhruv Batra comes this fun visualization by Andrej Karpathy of the topics of papers, clustered by an LDA model.

# The things we know we don’t know

As a theoretical engineer, I find myself getting lulled into the trap of what I now starting to call “lazy generalization.” It’s a form of bland motivation that you often find at the beginning of papers:

Sensor networks are large distributed collections of low-power nodes with wireless radios and limited battery power.

Really? All sensor networks are like this? I think not. Lots of sensor networks are wired (think of the power grid) but still communicate wirelessly. Others communicate through wires. This is the kind of ontological statement that metastasizes into the research equivalent of a meme — 3 years after Smart Dust appears, suddenly all papers are about dust-like networks, ignoring the vast range of other interesting problems that arise in other kinds of “sensor networks.”

Another good example is “it is well known that most [REAL WORLD THING] follows a power law,” which bugs Cosma to no end. We then get lots of papers papers which start with something about power laws and then proceed to analyze some algorithms which work well on graphs which have power law degree distributions. And the later we get statements like “all natural graphs follow power laws, so here’s a theory for those graphs, which tells us all about nature.”

Yet another example of this is sparsity. Sparsity is interesting! It lets you do a lot of cool stuff, like compressed sensing. And it’s true that some real world signals are approximately sparse in some basis. However, turn the crank and we get papers which make crazy statements approximately equal to “all interesting signals are sparse.” This is trivially true if you take the signal itself as a basis element, but in the way it’s mean (e.g. “in some standard basis”), it is patently false.

So why is are these lazy generalization? It’s a kind of fallacy which goes something like:

1. Topic A is really useful.
2. By assuming some Structure B about Topic A, we can do lots of cool/fun math.
3. All useful problems have Structure B

Pattern matching, we get A = [sensor networks, the web, signal acquisition], and B = [low power/wireless, power laws, sparsity].

This post may sound like I’m griping about these topics being “hot” — I’m not. Of course, when a topic gets hot, you get a lot of (probably incremental) papers all over the place. That’s the nature of “progress.” What I’m talking about is the third point. When we go back to our alabaster spire of theory on top of the ivory tower, we should not fall into the same trap of saying that “by characterizing the limits of Structure B I have fundamentally characterized Topic A.” Maybe that’s good marketing, but it’s not very good science, I think. Like I said, it’s a trap that I’m sure I’m guilty of stepping into on occasion, but it seems to be creeping into a number of things I’ve been reading lately.

Snuff [Terry Pratchett] : this was standard Discworld stuff, but I found it a little below-average. I’m finicky that way though.

Snakes Can’t Run [Ed Lin] : a follow-up to This Is A Bust, this book is about human smuggling in New York Chinatown in the 70s. Recommended if you like thing Asian and mysterious.

Shark’s Fin and Sichuan Pepper: A Sweet-Sour Memoir of Eating in China [Fuschia Dunlop] : I found this memoir to be engaging but it’s definitely got that feel of “Western person’s observations about China.” Dunlop is more aware of her situation as outsider/observer, but sometimes its hard to shake that narrative vibe. That being said, you should definitely read this if you want to know more about Chinese cuisines.

Among Others [Jo Walton] : it won a Nebula and a Hugo and I could see why. This is a really sharply observed and narrated coming-of-age story about a high school girl who is not part of the “main crowd” and finds her solace in voraciously reading all of the SciFi/Fantasy novels she can get her hands on. Really lovely writing.

The Lost Soul of Higher Education [Ellen Schrecker] : a pretty sobering read with a lot of historical background on the state of academic freedom, the corporatization of the university system, and possible ramifications for the future of the US. It was a bit depressing but well worth reading.

One Day in the Life of Ivan Denisovich [Aleksandr Solzhenitsyn] : this is Solzhenitsyn’s first book, a first-person narrative of one person’s life in a Stalinist labor camp. It really brings the grimness of the place alive — Cool Hand Luke’s prison camp had nothing on this. They’re worth comparing, I think.

# Linkage : Black Friday edition

This is an amazing video that makes me miss the Bay Area. (via Bobak Nazer)

Also via Bobak, we’re number 8 and 10!

Since it’s holiday season, I figured it’s time to link to some profanity-laden humor about the holidays. For the new, The Hater’s Guide to the Williams-Sonoma Catalog, and the classic It’s Decorative Gourd Season….

A Game of Food Trucks. (via MetaFilter)

Larry Wasserman takes on the Bayesian/Frequentist debate.

My friend Erik, who started the Mystery Brewing Company, has a blog called Top Fermented. He is now starting a podcast, which also has an RSS feed.

/

# Johnson-Lindenstrauss, Restricted Isometry (RIP), for-each vs for-all, and all that

The Johnson-Lindenstrauss lemma shows how to project points in much lower dimension without significant distortion of their pairwise Euclidean distances.

Say you have n arbitrary vectors $p_1, p_2 \ldots, p_n$ in $\mathbb{R}^n$.
Use a fat random (normalized) Gaussian matrix of dimensions $m \times n$ where $m= \frac{c}{\epsilon^2} \log n$,

to project the points in m dimensions.

Then, with probability at least 1/2, the projected points $A p_1, A p_2, \ldots A p_n$ have pairwise distances that satisfy: $(1-\epsilon) ||p_i -p_j ||_2 \leq ||A p_i - A p_j ||_2 \leq (1+\epsilon) ||p_i -p_j||_2$.

This is quite remarkable since we reduced the dimension from $n$ into $\log n$.

This is a very fat matrix. A number of other matrices beyond Gaussian have been shown to also work with similar guarantees.

The Johnson-Lindenstrauss lemma sounds similar to the Restricted Isometry Property (RIP).
A fat $m \times n$ matrix A has the RIP property if $(1-\epsilon) ||x||_2 \leq ||A x ||_2 \leq ||(1+\epsilon) ||x||_2$,

for all vectors $x$ that are k-sparse, i.e. $||x||_0 \leq k$.

This means that the matrix A does not distort the $\ell_2$ norm of sparse vectors.
As before, well-known results have shown that random Gaussian matrices satisfy RIP with high probability if we set $m = \Theta( k \log(n/k))$.

To understand the similarities and differences between RIP and JL, we need to see how these results are proven. The central question is how fat random matrices distort the length of vectors and with what probability.

All known proofs of JL (for each matrix family) start with a lemma that deals with one fixed vector:
Lemma 1:

For each given vector $x$, the matrix A does not distort the length too much, with good probability: $Pr[ (1-\epsilon) ||x||_2 \leq ||Ax||_2 \leq (1+\epsilon) ||x||_2] \geq 1- P_{fail}(m,n) \quad (1)$

The key issue is between for all versus for each vector quantifiers.
If we first fix the vector and throw the matrix randomly, this is much easier than promising something for a random matrix that once realized, works w.h.p. for all vectors.

Clearly, if first realize the matrix, there are certainly vectors where lemma 1 cannot hold: since the matrix is fat, it has a nullspace so there are nonzero vectors that have $||Ax||=0$. Lemma 1 simply says that this is unlikely and gives a `for each vector, a random matrix is good whp’ guarantee.

How small we can make $P_{fail}$ depends crucially on how small we make m. Since we want to achieve dimensionality reduction we want to make the matrix as fat as possible, i.e. m as small as possible. But $P_{fail}$ will become larger.

To prove JL for some family of matrices we need to make $P_{fail}\leq 1/n^2$.

If we have lemma 1 with this guarantee, we just do a union bound over all ${n \choose 2}= \Theta(n^2)$ pairs of vectors and obtain the pairwise distance distortion guarantee for all vectors $p_1, p_2, \ldots p_n$. Hence JL is Lemma 1 + union bound over all pairs.

It turns out that for Gaussian iid matrices, if you make $m=\frac{1}{\epsilon^2} \log n$, the probability of error is smaller than $P_{fail} \leq 1/n^2$. This follows by the wonderful symmetries of the Gaussian: $||Ax ||_2$ has the same distribution as $||A e_1 ||_2$ i.e. the $\ell_2$ norm of the first column of A. Therefore $||Ax ||^2_2$ has a Chi-square distribution with m degrees of freedom and standard concentration results suffice to bound $P_{fail}$.

To prove RIP, exactly the same recipe can be followed: first prove Lemma 1 for one vector and then do a union bound over all ${n \choose k}$ k-sparse vectors. The problem is now we have a much larger union bound and therefore we need $P_{fail} \leq \frac{1}{{n \choose k}}$. This enforces a much larger m compared to JL. The general recipe emerges. Say you want to prove a good property for all hairy vectors.

First show that for each fixed vector a random matrix has the good property with probability at least $1- P_{fail}$.

Then ensure $P_{fail} \leq \frac{1}{\text{No of hairy vectors}}$. This will limit how much dimensionality reduction you can achieve.

Finally, a union bound establishes the property for all hairy vectors.

# i’m in ur protocolz, jammin ur cellphonez

Krish Eswaran sent me a story about how a group at Virgina Tech described how LTE networks are susceptible to a certain kind of jamming strategy:

“An example strategy would be to target specific control or synchronization signals, in order to increase the geographic range of the jammer and better avoid detection,” the Wireless @ Virginia Tech research group said in a filing (PDF) submitted to the National Telecommunications and Information Administration. “The availability of low-cost and easy to use software-defined radios makes this threat even more realistic.”

Color me unsurprised! For my PhD, I studied arbitrarily varying channels (AVCs), which are information-theoretic models for communication against adversarial interference. There are a couple of design insights one can distill from considering the AVC model:

• Separating protocol and payload makes schemes susceptible to spoofing.
• Lack of synchronization/coordination between sender and receiver can be a real problem in adversarial settings.

Here we have a case where the protocol is easy to spoof/disrupt, essentially because the control information in unprotected.

This separation between control information and payload is often suboptimal in other senses. See, for example, Tchamkerten, Chandar and Wornell.

/