Uncategorized


I’ve started doing more machine learning research lately, which means I’ve been sullying my delicate theorist’s hands testing out my algorithms on data. Perhaps the most (over) used dataset is the MNIST handwritten digits collection, which was been put into MATLAB form by Sam Roweis (RIP). As a baseline, I wanted to see how an SVM would perform after I projected the data (using PCA) into the top 100 dimensions. The primal program is

\min_{\mathbf{w},b} \frac{1}{2} \| \mathbf{w} \|_2^2 + C \sum_{i=1}^{n} z_i
s.t. y_i (\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - z_i)

I chose some “reasonable” value for C and tried to train a classifier on all pairs of points and got the following error rates on the test set (in percentages, rounded).

0
0      0
0.56   0.43   0
0.33   0.45   2.37   0
0.04   0.06   1.17   0.23   0
1.02   0.11   1.89   3.77   0.72   0
0.52   0      1.31   0.08   0.60   1.66   0
0.01   0.15   1.01   0.80   0.80   0.42   0      0
0.43   1.15   2.22   2.69   0.38   3.41   0.54   0.47   0 
0.20   0.14   0.85   1.13   3.03   1.02   0      3.82   1.27   0

This is digits from 0 to 9, so for example, the training error for classifying 0 versus 1 was zero percent, but it’s about 3.8 percent error to decide between 9 and 7. I did this to try and get a sense of which digits were “harder” for SVM to distinguish between so that I could pick a good pair for experiments, or better yet, to pick a pair based on a target error criterion. Running experiments on Gaussian synthetic examples is all fine and good, but it helps to have a range of data sets to test out how resilient an algorithm is to more noise, for example.

I’m on the program committee for the Cyber-Security and Privacy symposium, so I figured I would post this here to make more work for myself.

GlobalSIP 2013 – Call for Papers
IEEE Global Conference on Signal and Information Processing
December 3-5, 2013 | Austin, Texas, U.S.A.

GlobalSIP: IEEE Global Conference on Signal and Information Processing is a new flagship IEEE Signal Processing Society conference. The focus of this conference is on signal and information processing and up-and-coming signal processing themes.

GlobalSIP is composed of symposia selected based on responses to the call-for-symposia proposals. GlobalSIP is composed of symposia on hot topics related to signal and information processing.

The selected symposia are:

Paper submission will be online only through the GlobalSIP 2013 website Papers should be in IEEE two-column format. The maximum length varies among the symposia; be sure to check each symposium’s information page for details. Authors of Signal Processing Letters papers will be given the opportunity to present their work at GlobalSIP 2013, subject to space availability and approval by the Technical Program Chairs of GlobalSIP 2013. The authors need to specify in which symposium they wish to present their paper. Please check conference webpage for details.

Important Dates:
*New* Paper Submission Deadline – June 15, 2013
Review Results Announce – July 30, 2013
Camera-Ready Papers Due – September 7, 2013
*New* SPL request for presentation – September 7, 2013

Persi Diaconis gave the second annual Billingsley Lecture at UChicago yesterday on the topic of coincidences and what a skeptical statistician/probabilist should say about them. He started out by talking about how Jung was fascinated by paradoxes (apparently there’s one about having fish come up all the time in conversation).

It was mostly a general-audience talk (with some asides about Poisson approximation), and the first part on the birthday problem and variants. Abstracted away, the question is given n balls (people) and C bins/categories (days), how big should n be so that there’s an even chance that two balls land in the same bin? Turns out n \approx latex 1.2 \sqrt{C}, as we know, but we can expand this to deal with approximate matches (you need only 7 people to get 2 birthdays in the same week with probability around 1/2). If you want to put a graph on it you can ask social-network coincidence questions and get some scalings as a function of the number of edges and number of categories — here there are n vertices and C colors for the vertices. What these calculations show, of course, is that most coincidences are not so surprising, at least in this probabilistic sense. Some more advanced treatment might be found in Sukhada Fadnavis’s preprint (which also has something about a “shameful conjecture” on chromatic polynomials that was proved in 2000, but I don’t know why it is shameful). The second part of the talk was on problems arising in the study of ESP — namely that experimental controls are not really present, so the notion of a “trial” is hard to pin down, leading (of course) to more false perceptions of coincidences are being surprising. He closed with some remarks about how our perception of coincidence is really about how our minds work, and pointed to some work by Ruma Falk for those who are interested in that angle of things.

I was unaware of this body of Diaconis’s work, and it was nice to have a high-level talk to cap off the day.

The following email came through the ITSOC mailing list, but may be of interest to other readers of the blog.

Dear Colleagues,

We are making a proposal to the United States Postal Service for the production of a stamp honoring Claude Elwood Shannon on the 100th anniversary of his birth.

The proposal is available at: http://www.itsoc.org/about/shannons-centenary-us-postal-stamp/

Please add your endorsement on that page, and then spread the word!

We would love to have the endorsements of your friends, colleagues, department chair, dean, university president, CEO, government representatives, school-aged children, and the public at large. [Contact information for endorsing individuals will not be posted.]

Thanks for your support!
Best,
Michelle Effros

An op-ed from n+1 on the safety of being brown.

Via Mimosa (I think), a profile of photographer Nemai Ghosh, who worked with Satyajit Ray.

Via my father, the story of Indian Jewish actresses in early Bollywood.

Things seems to be heating up on the LAC. Not a good sign.

The death toll in Dhaka keeps rising. This makes Matthew Yglesias’s reaction (see a stunningly poor example of self-reflection here) a bit more that the usual brand of neoliberal odiousness.

Sudeep Kamath sent me a note about a recent result he posted on the ArXiV that relates to an earlier post of mine on the HGR maximal correlation and an inequality by Erkip and Cover for Markov chains U -- X -- Y which I had found interesting:
I(U ; Y) \le \rho_m(X,Y)^2 I(U ; X).
Since learning about this inequality, I’ve seen a few talks which have used the inequality in their proofs, at Allerton in 2011 and at ITA this year. Unfortunately, the stated inequality is not correct!

On Maximal Correlation, Hypercontractivity, and the Data Processing Inequality studied by Erkip and Cover
Venkat Anantharam, Amin Gohari, Sudeep Kamath, Chandra Nair

What this paper shows is that the inequality is not satisfied with \rho_m(X,Y)^2, but by another quantity:
I(U ; Y) \le s^*(X;Y) I(U ; X)
where s^*(X;Y) is given by the following definition.

Let X and Y be random variables with joint distribution (X, Y) \sim p(x, y). We define
s^*(X;Y) = \sup_{r(x) \ne p(x)} \frac{ D( r(y) \| p(y) ) }{ D( r(x) \| p(x) },
where r(y) denotes the y-marginal distribution of r(x, y) := r(x)p(y|x) and the supremum on the right hand side is over all probability distributions r(x) that are different from the probability distribution p(x). If either X or Y is a constant, we define s^*(X; Y) to be 0.

Suppose (X,Y) have joint distribution P_{XY} (I know I am changing notation here but it’s easier to explain). The key to showing their result is through deriving variational characterizations of \rho_m and s^* in terms of the function
t_{\lambda}( P_X ) := H( P_Y ) - \lambda H( P_X )
Rather than getting into that in the blog post, I recommend reading the paper.

The implication of this result is that the inequality of Erkip and Cover is not correct : not only is \rho_m(X,Y)^2 not the supremum of the ratio, there are distributions for which it is not an upper bound. The counterexample in the paper is the following: X \sim \mathsf{Bernoulli}(1/2), and Y is generated via this asymmetric erasure channel:

Joint distribution counterexample

Joint distribution counterexample (Fig. 2 of the paper)


How can we calculate \rho_m(X,Y)? If either X or Y is binary-valued, then
\rho_m(X,Y)^2 = -1 + \sum_{x,y} \frac{ p(x,y)^2 }{ p(x) p(y) }
So for this example \rho_m(X,Y)^2 = 0.6. However, s^*( X,Y) = \frac{1}{2} \log_2(12/5) > 0.6 and there exists a sequence of variables U_i satisfying the Markov chain such that U_i -- X -- Y such that the ratio approaches s^*.

So where is the error in the original proof? Anantharam et al. point to an explanation that the Taylor series expansion used in the proof of the inequality with \rho_m(X,Y)^2 may not be valid at all points.

This seems to just be the start of a longer story, which I look forward to reading in the future!

Assumptionless consistency of the Lasso
Sourav Chatterjee
The title says it all. Given p-dimensional data points \{ \mathbf{x}_i : i \in [n] \} the Lasso tries to fit the model \mathbb{E}( y_i | \mathbf{x_i}) = \boldsymbol{\beta} \mathbf{x}_i by minimizing the \ell^1 penalized squared error
\sum_{i=1}^{n} (y_i - \boldsymbol{\beta} \mathbf{x}_i)^2 + \lambda \| \boldsymbol{\beta} \|_1.
The paper analyzes the Lasso in the setting where the data are random, so there are n i.i.d. copies of a pair of random variables (\mathbf{X},Y) so the data is \{(\mathbf{X}_i, Y_i) : i \in [n] \}. The assumptions are on the random variables (\mathbf{X},Y) : (1) each coordinate |X_i| \le M is bounded, the variable Y = (\boldsymbol{\beta}^*)^T \mathbf{X} + \varepsilon, and \varepsilon \sim \mathcal{N}(0,\sigma^2), where \boldsymbol{\beta}^* and \sigma are unknown constants. Basically that’s all that’s needed — given a bound on \|\boldsymbol{\beta}\|_1, he derives a bound on the mean-squared prediction error.

On Learnability, Complexity and Stability
Silvia Villa, Lorenzo Rosasco, Tomaso Poggio
This is a handy survey on the three topics in the title. It’s only 10 pages long, so it’s a nice fast read.

Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression
Francis Bach
A central challenge in stochastic optimization is understanding when the convergence rate of the excess loss, which is usually O(1/\sqrt{n}), can be improved to O(1/n). Most often this involves additional assumptions on the loss functions (which can sometimes get a bit baroque and hard to check). This paper considers constant step-size algorithms but where instead they consider the averaged iterate $\latex \bar{\theta}_n = \sum_{k=0}^{n-1} \theta_k$. I’m trying to slot this in with other things I know about stochastic optimization still, but it’s definitely worth a skim if you’re interested in the topic.

On Differentially Private Filtering for Event Streams
Jerome Le Ny
Jerome Le Ny has been putting differential privacy into signal processing and control contexts for the past year, and this is another paper in that line of work. This is important because we’re still trying to understand how time-series data can be handled in the differential privacy setting. This paper looks at “event streams” which are discrete-valued continuous-time signals (think of count processes), and the problem is to design a differentially private filtering system for such signals.

Gossips and Prejudices: Ergodic Randomized Dynamics in Social Networks
Paolo Frasca, Chiara Ravazzi, Roberto Tempo, Hideaki Ishii
This appears to be a gossip version of Acemoglu et al.’s work on “stubborn” agents in the consensus setting. They show similar qualitative behavior — opinions fluctuate but their average over time converges (the process is ergodic). This version of the paper has more of a tutorial feel to it, so the results are a bit easier to parse.

A few months ago I was home visiting my parents and we had a lunch with a few other Maharashtrians. The conversation turned towards food, and in particular ingredients that are important for making authentic garam masala. Garam masalas vary widely by region in India, and the two ingredients in question were dagadful and nag kesar. I had never really heard of these spices so I did a bit of research to learn more.

Dagadful (Parmelia perlata) is a lichen, not to be confused with the stone flower Didymocarpus pedicellatus, which is a plant that grows on rocks and is called charela in Hindi, I believe. The confusing thing is that both plants are used for herbal remedies, but the former is used for culinary purposes.

If you search for “nag kesar” you may find Mesua ferrea, a hardwood tree that grows in India and surrounds. That’s not where the spice comes from, however. This sparked the most debate at lunch, but I think I’ve figured out that the spice is the bud of a different tree, Mammea longifolia. Both Mesua and Mammea are in the family Calophyllaceae, which probably led to the name clash.

Last week I was reading Active Learning via Perfect Selective Classification by El-Yaniv and Wiener, and came across a neat result due to Hug and Reitzner that they use in some of their bounds for active learning on Gaussian distributions.

The setup is the following : let X_1, X_2, \ldots, X_n be n jointly Gaussian vectors with distribution \mathcal{N}(0,I_d) in \mathbb{R}^d. The convex hull P_n of these points is called a Gaussian polytope. This is a random polytope of course, and we can ask various things about their shape : what is the distribution of the number of vertices, or the number of k-faces? Let f_k(P_n) be the number of k-faces Distributions are hard, but for general k the expected number of faces (as n \to infty) is given by

\mathbb{E}[ f_k(P_n)] = \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1}(\pi \ln n)^{(d-1)/2} (1 + o(1)),

where \beta_{k,d-1} is the internal angle of a regular (d-1)-simplex at one of its k-dimensional faces. What Hug and Reitzner show is a bound on the variance (which then El-Yaniv and Plan use in a Chebyshev bound) : there exists a constant c_d such that

\mathrm{Var}( F_k(P_n) ) \le c_d (\ln n)^{(d-1)/2}

So the variance of the number of k-faces can be upper bounded by something that does not depend at all on the actual value of k. In fact, they show that

f_k(P_n) (\ln n)^{-(d-1)/2} \to \frac{2^d}{\sqrt{d}} \binom{d}{k+1} \beta_{k,d-1} \pi^{(d-1)/2}

in probability as n \to \infty. That is, appropriately normalized, the number of faces converges to a constant.

To me this result was initially surprising, but after some more thought it makes a bit more sense. If you give me a cloud of Gaussian points, then I need k+1 points to define a k-face. The formula for the mean says that if I chose a random set of k+1 points, then approximately \frac{2^d}{\sqrt{d}} \beta_{k,d-1}(\pi \ln n)^{(d-1)/2} fraction of them will form a real k-face of the polytope. This also explains why the simplex-related quantity appears — points that are on “opposite sides” of the sphere (the level sets of the density) are not going to form a face together. As n \to \infty this fraction will change, but effectively the rate of growth in the number of faces with n is (\ln n)^{(d-1)/2}, regardless of k.

I’m not sure where this result will be useful for me (yet!) but it seemed like something that the technically-minded readers of the blog would find interesting as well.

I’ve been trying to get a camera-ready article for the Signal Processing Magazine and the instructions from IEEE include the following snippet:

*VERY IMPORTANT: All source files ( .tex, .doc, .eps, .ps, .bib, .db, .tif, .jpeg, …) may be uploaded as a single .rar archived file. Please do not attempt to upload files with extensions .shs, .exe, .com, .vbs, .zip as they are restricted file types.

While I have encountered .rar files before, I was not very familiar with the file format or its history. I didn’t know it’s a proprietary format — that seems like a weird choice for IEEE to make (although no weirder than PDF perhaps).

What’s confusing to me is that ArXiV manages to handle .zip files just fine. Is .tgz so passé now? My experience with RAR is that it is good for compressing (and splitting) large files into easier-to-manage segments. All of that efficiency seems wasted for a single paper with associated figures and bibliography files and whatnot.

I was trying to find the actual compression algorithm, but like most modern compression software, the innards are a fair bit more complex than the base algorithmic ideas. The Wikipedia article suggests it does a blend of Lempel-Ziv (a variant of LZ77) and prediction by partial matching, but I imagine there’s a fair bit of tweaking. What I couldn’t figure out is if there is a new algorithmic idea in there (like in the Burrows-Wheeler Transform (BWT)), or it’s more a blend of these previous techniques.

Anyway, this silliness means I have to find some extra software to help me compress. SimplyRAR for MacOS seems to work pretty well.

Next Page »

Follow

Get every new post delivered to your Inbox.

Join 84 other followers