Video : “Matrices and their singular values” (1976)

Via Allie Fletcher, here is an awesome video on the SVD from Los Alamos National Lab in 1976:

From the caption by Cleve Moler (who also blogs):

This film about the matrix singular value decomposition was made in 1976 at the Los Alamos National Laboratory. Today the SVD is widely used in scientific and engineering computation, but in 1976 the SVD was relatively unknown. A practical algorithm for its computation had been developed only a few years earlier and the LINPACK project was in the early stages of its implementation. The 3-D computer graphics involved hidden line computations. The computer output was 16mm celluloid film.

The graphics are awesome. Moler blogged about some of the history of the film. Those who are particularly “attentive” may note that the SVD movie seems familiar:

The first Star Trek movie came out in 1979. The producers had asked Los Alamos for computer graphics to run on the displays on the bridge of the Enterprise. They chose our SVD movie to run on the science officer’s display. So, if you look over Spock’s shoulder as the Enterprise enters the nebula in search of V’Ger, you can glimpse a matrix being diagonalized by Givens transformations and the QR iteration.

tracks : all the diamonds (all this is yours and mine)

A mix borne of travel, monuments, and modern wonders of the world.

  1. Workers Comp (Mos Def)
  2. Madness (Deltron 3030)
  3. North Side Gal (JD McPherson)
  4. Dead Animals (The Young Evils)
  5. Little Talks (Of Monsters and Men)
  6. I Just Want To Make Love To You (Buddy Guy)
  7. Doin It To It [Disrupt The Scene Remix] (Square Frames)
  8. I Believe in a Thing Called Love (The Darkness)
  9. October (The Helio Sequence)
  10. My Warfare (Laurelyn Dossett)
  11. Yours and Mine (Billie Holiday)
  12. Nye Nummer Et (Choir Of Young Believers)
  13. Beggar in the Morning (The Barr Brothers)
  14. I Wish (Forro In The Dark feat. David Byrne)

Collisions from entropy-bounded random variables

I realized the other day that for two completely unrelated papers I’ve had to prove statements about how many unique elements appear in a sample drawn from a distribution P on a discrete set. The catch is that we only have a handle on the entropy H(P), not on the distribution itself.

Let P have entropy H(P) and consider a sample of size m drawn i.i.d. from P. Let M be the number of distinct symbols in the sample. Then
\mathbb{E}[M] \le \frac{m H(P)}{\log m} + 1

Let \{p_i\} be the probabilities of the different elements. There could be countably many, but we can write the expectation of the number of unique elements as
\mathbb{E}[M] = \sum_{i=1}^{\infty} (1 - (1 - p_i)^m)
By doing a bit of series manipulation you can get the bound above. This came up (perhaps unsurprisingly) in some work on large-alphabet probability estimation.
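
As a quick sanity check, here is a small simulation of the lemma. This is just an illustration, not part of the proof: the truncated geometric distribution and all the parameters below are arbitrary choices of mine, and I take all logs to be natural so the entropy is in nats.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary example distribution: a truncated geometric on K symbols.
K = 50
p = 0.5 ** np.arange(1, K + 1)
p /= p.sum()
H = -np.sum(p * np.log(p))  # entropy in nats (natural logs throughout)

m = 100        # sample size
trials = 2000  # Monte Carlo estimate of E[M]
M = [len(np.unique(rng.choice(K, size=m, p=p))) for _ in range(trials)]

print("empirical E[M]    :", np.mean(M))             # roughly 7 here
print("bound mH/log m + 1:", m * H / np.log(m) + 1)  # about 31 here
```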

Let P have entropy H(P) and support size K and consider a sample of size n drawn i.i.d. from P. Let M be the number of distinct symbols in the sample. Then
\mathbb{P}(M = n) \ge \left( \frac{H(P) - 1 - \log n}{\log K} \right)^{n-1}

For this result we need a bound on the support size of P as well, which makes things a little easier. This result came up when we were proving bounds on adversarial eavesdropping strategies. We needed to show that if P represents some sort of list-decoding based on what the adversary has seen over the channel so far, then the list has some structure/distance properties. This lemma came in handy.
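
Again, a quick numerical check of the lemma, with the same caveats as above: the uniform distribution and the parameters are arbitrary choices of mine, logs are natural, and the only real requirement is that the bases of the logs in H(P), \log n, and \log K match.

```python
import numpy as np

rng = np.random.default_rng(0)

# Uniform distribution: support size K, entropy H = log K (in nats).
K, n, trials = 10_000, 5, 20_000
H = np.log(K)

# The event M = n means all n sampled symbols are distinct.
hits = sum(
    len(np.unique(rng.integers(K, size=n))) == n for _ in range(trials)
)

print("empirical P(M = n):", hits / trials)  # ~ 0.999 here
print("bound             :", ((H - 1 - np.log(n)) / np.log(K)) ** (n - 1))  # ~ 0.26
```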

I’m sure there’s a better way to clean up these kinds of results to get something more “off the shelf” for general distributions. In both cases we couldn’t readily find these results elsewhere, so we ended up proving them as one-off lemmas.

Linkage

Tony Kushner responds to some of the criticism of Lincoln.

Paul Frees, the actor who played Boris Badenov, was in a ton of other things — who knew?

The Simons Foundation wrote about differential privacy.

After attending NIPS, which was held in a casino megaplex, I found that Andrew Gelman’s post on casinos had more resonance.

My friend John is blogging about his time at the Abhayagiri monastery. It’s an interesting look in on this kind of monastic life.

For those who missed the news, a package for Indiana Jones arrived at UChicago, but the truth is somewhat less ARG-like than expected.

Scholarly communication in conferences

A few weeks (!) ago I was talking with an anthropologist friend of mine about how different fields have different modes of communicating research “findings” in the conference setting. In some fields people just read their papers out loud, in others people give slide presentations, in yet others there are posters, and I imagine some people give blackboard talks. Of course, conferences have many purposes — schmoozing, job hunting, academic political wrangling, and so on. What is unclear to me is why particular academic communities have settled on particular norms for presenting their work.

One axis along which to understand this might be the degree to which the presentation is an advertisement for the written paper. In many humanities conferences, people simply read their paper out loud. You’d think that theater researchers would be able to give a more… dramatic reading of their work, but you’d be wrong much of the time. It’s very hard to sit, listen, and follow a jargon-heavy analysis of something that you have probably never read about (e.g. turn-of-the-century commercial theater in Prague), and in some sense I feel the talk-as-advertisement role is minimal here.

On the other hand, a poster session maximizes the “advertisement of the paper” aspect. People stand there for 5 minutes while you explain the ideas in the paper, and if it seems sufficiently interesting then they will go and read the actual paper. A difference from the humanities model is that here there is a paper in the proceedings to read, which is not necessarily the case at humanities conferences.

Slide presentations are somewhere in the middle — I often go to a talk at a conference and think “well, now I don’t need to read the paper.” These are the trickiest because the audience is captive but you cannot give them the full story. It’s more of a method for luring already-interested people into wanting to read the paper rather than the browsing model of a poster session.

However, even this “advertisement” categorization raises the question of why we have poster sessions, slide presentations, and paper readings. Are these the best ways to present research in those fields? Should we have more posters at ISIT and fewer talks (more like NIPS)? Should NIPS have more parallel sessions to reflect the spread of interests in the “community”? Should anthropology conferences have each panelist give an 8-minute slide presentation followed by real discussion?

I missed ITW in Lausanne this year, but I heard that they mixed up the format to great success. More posters and fewer talks meant more interaction and more discussion. I think more experimenting could be good — maybe some talks should be given as chalk talks with no slides!

tracks : even the poor are not immune

Sometimes music is a good break from the day to day. This is a mix of old stuff I’ve overused and some newer things that are in my ear.

  1. This Song Is Called Ragged (Jonathan Boulet)
  2. Lycra Mistral (El Guincho)
  3. Dance For You (Dirty Projectors)
  4. Clouds (Deep Time)
  5. Yèné Felagoté (Tlahoun Gèssèssè)
  6. On the Sunny Side of the Street (Dizzy Gillespie / Sonny Rollins)
  7. Croon Spoon (EV and SS, from The Cradle Will Rock)
  8. Pink Wine (St. Paul De Vence)
  9. Teach My Heart (Charles Leo Gebhardt IV)
  10. Curse This City (Hockey)
  11. Rebels of the Sacred Heart (Flogging Molly)
  12. Groundhog Day (Corin Tucker Band)
  13. Faithful Man (Lee Fields & The Experience)
  14. Coquette (Brandi Shearer & The Robert Nolan Trio)
  15. Laughing At Life (Billie Holiday & Lester Young)
  16. Kithkin (Ampersand)

NIPS 2012 : day two

I took it a bit easy today at the conference and managed to spend some time talking to collaborators about work, so perhaps I wasn’t 100% focused on the talks and posters. In general I find that for many posters it’s hard to tell what the motivating problem is — it’s not clear from the poster, and it’s not always clear from the explanation. Here are a few papers I thought were interesting:

W. Koolen, D. Adamskiy, M. Warmuth
Putting Bayes to sleep
Some signals look sort of jump-Markov — the distribution of the data changes over time, so there are segments with distribution A, then it switches to B, then perhaps back to A, and so on. A prediction procedure which “mixes past posteriors” works well in this setting, but it was not clear why. This paper provides a Bayesian interpretation of the predictor as a mixture in a “sleeping experts” setting.
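
To give a flavor of the algorithm, here is a minimal sketch of a mixing-past-posteriors style update. The uniform mixing over cached posteriors and the parameters eta and alpha are my own simplifications for illustration; the paper analyzes a general family of mixing schemes, not this exact one.

```python
import numpy as np

def mix_past_posteriors(losses, eta=1.0, alpha=0.1):
    """losses: (T, N) array of per-round losses for N experts.
    Each round: exponential-weights (Bayesian) update, then mix the new
    posterior with the average of all cached past posteriors, so an
    expert that was good long ago ("asleep") can regain weight quickly.
    Returns the (T, N) array of weights used for prediction each round."""
    T, N = losses.shape
    v = np.full(N, 1.0 / N)
    past = [v.copy()]
    weights = np.empty((T, N))
    for t in range(T):
        weights[t] = v
        u = v * np.exp(-eta * losses[t])  # multiplicative update
        u /= u.sum()
        past.append(u.copy())
        v = (1 - alpha) * u + alpha * np.mean(past, axis=0)
    return weights

# Example: expert 0 is best, then expert 1, then expert 0 again.
L = np.concatenate([np.tile([0.0, 1.0], (50, 1)),
                    np.tile([1.0, 0.0], (50, 1)),
                    np.tile([0.0, 1.0], (50, 1))])
print(mix_past_posteriors(L)[-1])  # expert 0 recovers most of the weight
```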

J. Duchi, M. Jordan, M. Wainwright, A. Wibisono
Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
This paper looked at stochastic gradient descent when function evaluations are cheap but gradient evaluations are expensive. The idea is to compute an unbiased approximation to the gradient by evaluating the function at \theta_t and at \theta_t + \mathrm{noise}, and then forming a discrete approximation to the gradient from the difference. Some of the attendees claimed this is similar to an approach proposed by Nesterov, but the distinction was unclear to me.
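
The estimator itself is simple enough to sketch in a few lines. This is my own toy version with a Gaussian perturbation and illustrative constants, not the paper’s exact scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def two_point_grad(f, theta, delta=1e-2):
    """Two function evaluations -> one stochastic gradient estimate.
    With z ~ N(0, I), the vector (f(theta + delta*z) - f(theta))/delta * z
    matches the gradient of f at theta in expectation, up to O(delta) bias."""
    z = rng.standard_normal(theta.shape)
    return (f(theta + delta * z) - f(theta)) / delta * z

# Plug the estimate into a plain stochastic-gradient loop on a toy quadratic.
f = lambda th: np.sum((th - 1.0) ** 2)
theta = np.zeros(3)
for t in range(1, 5001):
    theta -= (0.5 / t) * two_point_grad(f, theta)
print(theta)  # should end up close to [1, 1, 1]
```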

J. Lloyd, D. Roy, P. Orbanz, Z. Ghahramani
Random function priors for exchangeable graphs and arrays
This paper looked at Bayesian modeling for structures like undirected graphs which may represent interactions, e.g. protein-protein interaction networks. Infinite random graphs whose distributions are invariant under permutations of the vertex set can be associated with a structure called a graphon. Here they put a prior on graphons, namely a Gaussian process prior, and then try to do inference on real graphs, for example to estimate the kernel function of the process.
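
For concreteness, here is the generative model in code: sampling a finite graph from a graphon W. This is the standard construction with an arbitrary example W of my choosing; the paper’s actual contribution (the GP prior on W and the inference) is not sketched here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graph(W, n):
    """Sample an n-vertex exchangeable random graph from graphon W:
    give each vertex a uniform latent label u_i in [0, 1], then include
    edge (i, j) independently with probability W(u_i, u_j)."""
    u = rng.uniform(size=n)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            A[i, j] = A[j, i] = rng.uniform() < W(u[i], u[j])
    return A

# An arbitrary smooth graphon: vertices with nearby labels connect more often.
A = sample_graph(lambda u, v: 0.8 * np.exp(-3.0 * (u - v) ** 2), n=50)
print(A.sum() // 2, "edges")
```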

N. Le Roux, M. Schmidt, F. Bach
A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
This was a paper marked for oral presentation — the idea is that in gradient descent it is expensive to evaluate gradients if your objective function looks like \sum_{i=1}^{n} f(\theta, x_i), where the x_i are your data points and n is huge, because each step requires evaluating n gradients. On the other hand, stochastic gradient descent can be slow because at each iteration it picks a single i and takes a gradient step on f(\theta_t, x_i) alone. Here what they do at step t is pick a random point j and evaluate its gradient, but then take a gradient step using all n points: for points i \ne j they just reuse the gradient from the last time i was picked. Letting T_i(t) be the last time i was picked before time t, with T_j(t) = t, the step is taken along the averaged gradient \frac{1}{n} \sum_{i=1}^{n} \nabla f(\theta_{T_i(t)}, x_i). This works surprisingly well.
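
Here is a minimal sketch of that iteration on a toy problem. This is my own illustrative version: the step size, the zero initialization of the gradient table, and the toy objective are arbitrary choices, and the paper is more careful about all of them.

```python
import numpy as np

rng = np.random.default_rng(0)

def sag(grad_i, theta0, n, steps, eta):
    """grad_i(theta, i): gradient of the i-th term f(theta, x_i).
    Keep a table of the most recently computed gradient for each data
    point; each step, refresh one random entry and move against the
    table's average, so only one fresh gradient is computed per step."""
    theta = theta0.copy()
    table = np.zeros((n, theta0.size))
    for _ in range(steps):
        j = rng.integers(n)
        table[j] = grad_i(theta, j)
        theta = theta - eta * table.mean(axis=0)
    return theta

# Toy example: f(theta, x_i) = (theta - x_i)^2, minimized at mean(x) = 0.5.
x = np.linspace(0.0, 1.0, 100)
grad = lambda th, i: 2.0 * (th - x[i])
print(sag(grad, np.zeros(1), n=100, steps=10_000, eta=0.1))  # ~ 0.5
```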

Stephane Mallat
Classification with Deep Invariant Scattering Networks
This was an invited talk — Mallat was trying to explain why deep networks seem to learn well (it all seems a bit like black magic), but his explanation felt a bit heuristic to me in the end. His first main point was that wavelets are good at capturing geometric structure like translation and rotation, and appear to have favorable properties with respect to “distortions” in the signal. The notion of distortion is a little vague, but the idea is that if two signals (say images) are similar but one is slightly distorted, they should map to representations which are close to each other. The mathematics behind his analysis framework was group-theoretic: he wants to estimate the group of actions which manipulate images. In a sense, this is a control-theoretic view of the problem (or at least it seemed that way to me). The second point I took away was that sparsity of representation has a big role to play in building efficient, layered representations. I’d have to see the talk again to understand it better; in the end I wasn’t sure I understood why deep networks are good, but I did learn some more interesting things about wavelet representations, which is cool.

Is there an incentive for misrepresentation?

I was recently reading a paper on arXiv from the VLDB 2012 conference:

Functional Mechanism: Regression Analysis under Differential Privacy
Jun Zhang, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Marianne Winslett

The idea of the paper is to make a differentially private approximation to an optimization problem by perturbing a Taylor series expansion of the objective function, which is an interesting idea. However, what caught my eye was how they referred to an earlier paper of mine (with Kamalika Chaudhuri and Claire Monteleoni) on differentially private empirical risk minimization. What we did in that paper was look at the problem of training classifiers via ERM, and the particular examples we used for experiments were logistic regression and SVM.

In the VLDB paper, the authors write:

The algorithm, however, is inapplicable for standard logistic regression, as the cost function of logistic regression does not satisfy convexity requirement. Instead, Chaudhuri et al. demonstrate that their algorithm can address a non-standard type of logistic regression with a modified input (see Section 3 for details). Nevertheless, it is unclear whether the modified logistic regression is useful in practice.

This is just incorrect. What we look at is a fairly standard formulation of logistic regression with labels in \{-1,+1\}, and we take the standard machine learning approach, namely regularized empirical risk minimization. The objective function is, in fact, convex. We also ran experiments using that algorithm on standard datasets. Perhaps the empirical performance was not as great as they might like, but then they should make a claim of some sort instead of saying it’s “unclear.”
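
For the record, the objective in question is the usual regularized logistic loss; in my notation (not a quote from either paper), with labels y_i \in \{-1,+1\}:

J(w) = \frac{1}{n} \sum_{i=1}^{n} \log\left(1 + e^{-y_i w^T x_i}\right) + \frac{\lambda}{2} \|w\|^2

Each summand is the convex function \log(1 + e^{-z}) composed with a linear function of w, so the objective is convex, and the regularizer makes it strongly convex.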

They further claim:

In particular, they assume that for each tuple t_i, its value on Y is not a boolean value that indicates whether t_i satisfies certain condition; instead, they assume y_i equals the probability that a condition is satisfied given x_i… Furthermore, Chaudhuri et al.’s method cannot be applied on datasets where Y is a boolean attribute…

Firstly, we never make this “assumption.” Secondly, we do experiments using that algorithm on standard datasets where the label is binary. Reading this description was like being in a weird dream-world in which statements are made up and attributed to you.

Naturally, I was a bit confused by this rather blatant misrepresentation of our paper, so I emailed the authors, who essentially said that they were confused by the description in our paper and that more technical definitions are needed because we are from “different communities.” They claimed that they had emailed us questions about it, but we could not find any such emails. Sure, sometimes papers can be confusing if they are out of your area, but to go from “I don’t understand X” to “let me make things up about X” requires a level of gumption that I don’t think I could really muster.

In a sense, the publication incentives are stacked in favor of this kind of misrepresentation. VLDB is a very selective conference, so in order to make your contribution seem like a big deal, you have to make it seem that alternative approaches to the problem are severely lacking. However, rather than making a case against the empirical performance of our method, this paper just invented “facts” about our paper. The sad thing is that it seems completely unnecessary, since their method is quite different.

NIPS 2012 : day one

I am attending NIPS this year for the first time, so I figured it would be good to blog about some of it here. I totally dropped the ball on Allerton, so maybe I’ll make up for it by writing more about the actual talks here. Fortunately (or unfortunately), most of the conference is about things I have almost no experience with, so I am facing a bit of an explore/exploit tradeoff in deciding what to attend.

Every day of the conference has a poster session from 7 PM to midnight — there are 90+ posters in a single room, and people drift in and out, hanging out with friends and looking at posters. My poster (a paper with Kamalika Chaudhuri and Kaushik Sinha on differentially private approximations to PCA) was last night, so I was on the presenting end of things. I gave up at 10:30 because I was getting hoarse and tired, but even then there were a fair number of people milling about. Since I was (mostly) at my poster, I missed out on the other posters.

During the day the conference is a single-track affair with invited and highlighted talks. There are two kinds of highlighted talks — some papers are marked for oral presentation, and some are marked as “spotlights,” which means the authors get to make a 5-minute elevator pitch for their poster in front of the whole conference. Those start today, and I’m looking forward to them.

In the meantime, here is a picture from the hike I took yesterday with Erin:

Mountain Range on a hike near Lake Lily.
