More ArXiV skims

Assumptionless consistency of the Lasso
Sourav Chatterjee
The title says it all. Given p-dimensional data points \{ \mathbf{x}_i : i \in [n] \}, the Lasso tries to fit the model \mathbb{E}( y_i | \mathbf{x}_i) = \boldsymbol{\beta}^T \mathbf{x}_i by minimizing the \ell^1-penalized squared error
\sum_{i=1}^{n} (y_i - \boldsymbol{\beta}^T \mathbf{x}_i)^2 + \lambda \| \boldsymbol{\beta} \|_1.
The paper analyzes the Lasso in the setting where the data are random: there are n i.i.d. copies of a pair of random variables (\mathbf{X},Y), so the data is \{(\mathbf{X}_i, Y_i) : i \in [n] \}. The assumptions on (\mathbf{X},Y) are : (1) each coordinate of \mathbf{X} is bounded, |X_j| \le M, and (2) Y = (\boldsymbol{\beta}^*)^T \mathbf{X} + \varepsilon with \varepsilon \sim \mathcal{N}(0,\sigma^2), where \boldsymbol{\beta}^* and \sigma are unknown constants. Basically that’s all that’s needed — given a bound on \|\boldsymbol{\beta}\|_1, he derives a bound on the mean-squared prediction error.
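
To make the objective concrete, here is a minimal sketch (my own, not from the paper) that minimizes the penalized squared error above by proximal gradient descent (ISTA) in plain numpy; the toy data, the function names, and the choice of \lambda are my own assumptions.

import numpy as np

def soft_threshold(z, t):
    # Soft-thresholding: the proximal operator of t * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    # Minimize sum_i (y_i - beta^T x_i)^2 + lam * ||beta||_1 via ISTA.
    n, p = X.shape
    beta = np.zeros(p)
    L = 2 * np.linalg.norm(X, 2) ** 2   # Lipschitz constant of the smooth part
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ beta - y)
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

# Toy data mimicking the setup: bounded covariates, linear model, Gaussian noise.
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.uniform(-1, 1, size=(n, p))
beta_star = np.zeros(p)
beta_star[:5] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)
beta_hat = lasso_ista(X, y, lam=1.0)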

On Learnability, Complexity and Stability
Silvia Villa, Lorenzo Rosasco, Tomaso Poggio
This is a handy survey on the three topics in the title. It’s only 10 pages long, so it’s a nice fast read.

Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression
Francis Bach
A central challenge in stochastic optimization is understanding when the convergence rate of the excess loss, which is usually O(1/\sqrt{n}), can be improved to O(1/n). Most often this involves additional assumptions on the loss functions (which can sometimes get a bit baroque and hard to check). This paper sticks with a constant step size but analyzes the averaged iterate \bar{\theta}_n = \frac{1}{n} \sum_{k=0}^{n-1} \theta_k instead. I’m still trying to slot this in with other things I know about stochastic optimization, but it’s definitely worth a skim if you’re interested in the topic.
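
As a minimal sketch of the averaging idea (my own toy implementation, not Bach’s analysis; the step size, loss convention, and sampling scheme are my assumptions), constant step-size SGD on the logistic loss with a running average of the iterates looks like this:

import numpy as np

def averaged_sgd_logistic(X, y, step=0.1, n_passes=1, seed=0):
    # y in {-1, +1}; per-sample loss is log(1 + exp(-y_i * theta^T x_i)).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    theta_bar = np.zeros(p)
    count = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            margin = y[i] * (X[i] @ theta)
            grad = -y[i] * X[i] / (1.0 + np.exp(margin))   # logistic loss gradient
            theta -= step * grad                           # constant step size
            count += 1
            theta_bar += (theta - theta_bar) / count       # running average of the iterates
    return theta_bar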

On Differentially Private Filtering for Event Streams
Jerome Le Ny
Jerome Le Ny has been putting differential privacy into signal processing and control contexts for the past year, and this is another paper in that line of work. This is important because we’re still trying to understand how time-series data can be handled in the differential privacy setting. This paper looks at “event streams” which are discrete-valued continuous-time signals (think of count processes), and the problem is to design a differentially private filtering system for such signals.

Gossips and Prejudices: Ergodic Randomized Dynamics in Social Networks
Paolo Frasca, Chiara Ravazzi, Roberto Tempo, Hideaki Ishii
This appears to be a gossip version of Acemoglu et al.’s work on “stubborn” agents in the consensus setting. They show similar qualitative behavior — opinions fluctuate but their average over time converges (the process is ergodic). This version of the paper has more of a tutorial feel to it, so the results are a bit easier to parse.

RAR : a cry of rage

I’ve been trying to get a camera-ready article for the Signal Processing Magazine and the instructions from IEEE include the following snippet:

*VERY IMPORTANT: All source files ( .tex, .doc, .eps, .ps, .bib, .db, .tif, .jpeg, …) may be uploaded as a single .rar archived file. Please do not attempt to upload files with extensions .shs, .exe, .com, .vbs, .zip as they are restricted file types.

While I have encountered .rar files before, I was not very familiar with the file format or its history. I didn’t know it’s a proprietary format — that seems like a weird choice for IEEE to make (although no weirder than PDF perhaps).

What’s confusing to me is that ArXiV manages to handle .zip files just fine. Is .tgz so passé now? My experience with RAR is that it is good for compressing (and splitting) large files into easier-to-manage segments. All of that efficiency seems wasted for a single paper with associated figures and bibliography files and whatnot.

I was trying to find the actual compression algorithm, but like most modern compression software, the innards are a fair bit more complex than the base algorithmic ideas. The Wikipedia article suggests it does a blend of Lempel-Ziv (a variant of LZ77) and prediction by partial matching, but I imagine there’s a fair bit of tweaking. What I couldn’t figure out is whether there is a new algorithmic idea in there (like in the Burrows-Wheeler Transform (BWT)), or whether it’s more a blend of these previous techniques.

Anyway, this silliness means I have to find some extra software to help me compress. SimplyRAR for MacOS seems to work pretty well.

ICML Workshop on Machine Learning with Test-Time Budgets

Venkatesh Saligrama sent out a call for an ICML workshop he is organizing:

I wanted to bring to your attention an ICML workshop on “Machine Learning with Test-Time Budgets” that I am helping organize. The workshop will be held during the ICML week. The workshop will feature presentations both from data-driven as well as model-based perspectives and will feature researchers from machine learning and control/decision theory.

We are accepting papers related to these topics. Please let me know if you have questions about the workshop or wish to submit a paper.

Linkage

I’m sick today so here are some links.

Click That Hood, a game which asks you to identify neighborhoods. I was lousy at San Diego, but pretty decent at Chicago, even though I’ve lived here for half the time. Go figure.

For those who care about beer, there’s been some news about the blocked merger of Inbev and Modelo. I recommend Erik’s podcast post on the structure of the beer industry (the three-tier system) for those who care about craft beer, and (with reservations) Planet Money’s show on the antitrust regulatory framework that is at work here.

Remember step functions from your signals and systems course? We called them Heaviside step functions after Oliver Heaviside — you can read more about him in this Physics Today article.

Did you know that Pad Thai’s “birth and popularity came out of the nationalist campaign of Field Marshal Plaek Pibulsongkram, one of the revolutionary figures who in 1932 pushed Thailand out of an absolute monarchy?” Neither did I!

I need this album, since I love me some Kurt Weill. I can also live vicariously through NPR’s list of SXSW recommendations.

Bellairs Workshop 2013

I wanted to write a few words about the workshop at the Bellairs Research Institute. I just returned from sunny Barbados to frigid Chicago, so writing this will help me remember the sunshine and sand:

The beach at Bathsheba on the east coast of Barbados

Mike Rabbat put on a great program this year, and there were lots of talks on a range of topics in machine learning, signal processing, and optimization. The format of the workshop was to have talks with lots of room for questions and discussion. Talks were given out on the balcony where we were staying, and we had to end at about 2:30 because the sunshine would creep into our conference area, baking those of us sitting too far west.

ITA Workshop 2013 : post the first

I promised some ITA blogging, so here it is. Maybe Alex will blog a bit too. These notes will by necessity be cursory, but I hope some people will find some of these papers interesting enough to follow up on them.

A Reverse Pinsker Inequality
Daniel Berend, Peter Harremoës, Aryeh Kontorovich
Aryeh gave this talk on what we can say about bounds in the reverse direction of Pinsker’s inequality. Of course, in general you can’t say much, but what they do is give an expansion of the KL divergence in terms of the total variation distance whose coefficients involve the balance coefficient of the distribution, \beta = \inf \{ P(A) : P(A) \ge 1/2 \}.
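
(For reference, and this is just the standard statement rather than anything from the paper: Pinsker’s inequality in the usual direction controls total variation by KL divergence,

\delta(P,Q) = \sup_{A} |P(A) - Q(A)| \le \sqrt{ \tfrac{1}{2} D(P \| Q) },

so a reverse inequality has to bound D(P \| Q) by a function of \delta(P,Q), which cannot hold uniformly without some extra structure like the balance coefficient.)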

Unfolding the entropy power inequality
Mokshay Madiman, Liyao Wang
Mokshay gave a talk on the entropy power inequality. Given independent vector random variables X_1 and X_2, we know that h(X_1 + X_2) \ge h(Z_1 + Z_2), where Z_1 and Z_2 are isotropic Gaussian vectors with the same differential entropies as X_1 and X_2. The question in this paper is this : can we insert a term between the two sides of this inequality? The answer is yes! They define a spherical rearrangement of the densities of X_1 and X_2 into variables X_1^{\ast} and X_2^{\ast} with spherically symmetric decreasing densities and show that the differential entropy of their sum lies between the two sides of the regular EPI.
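
(As a reminder, and this is the standard statement rather than anything new in the paper, the EPI for independent random vectors in \mathbb{R}^n can also be written as

e^{2 h(X_1 + X_2)/n} \ge e^{2 h(X_1)/n} + e^{2 h(X_2)/n},

which is equivalent to the Gaussian-comparison form above.)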

Improved lower bounds on the total variation distance and relative entropy for the Poisson approximation
Igal Sason
The previously known lower bounds of the type mentioned in the title were based on the Chen-Stein method; this paper strengthens them by sharpening the analysis in that method.

Fundamental limits of caching
Mohammad A. Maddah-Ali, Urs Niesen
This talk was on tradeoffs in caching. If there are N files, K users, and a size-M cache at each user, how should the users cache files so as to best allow a broadcaster to share the bandwidth to them? More simply, suppose there are three people who may want to watch one of three different TV shows, and each can buffer the content of one TV show. Since a priori you don’t know which show they want to watch, the idea might be to buffer/cache the first third of each show at each user. They show that this is highly suboptimal. Because the content provider can XOR together parts of the content destined for different users, the caching strategy should not be the same at each user, and the real benefit comes from the global cache size.
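
Here is a toy illustration (my own, and much simpler than the paper’s general scheme) of the coded delivery idea for two files, two users, and a cache that holds one file’s worth of data: split each file in half, give the users complementary halves, and a single XOR-ed broadcast serves both requests.

# Two equal-length files; the labels and cache layout below are my own toy choices.
A = b"AAAAAAAA"
B = b"BBBBBBBB"
A1, A2 = A[:4], A[4:]
B1, B2 = B[:4], B[4:]

cache_user1 = {"A1": A1, "B1": B1}   # user 1 caches the first half of each file
cache_user2 = {"A2": A2, "B2": B2}   # user 2 caches the second half of each file

# User 1 requests A (missing A2); user 2 requests B (missing B1).
# The server broadcasts a single coded message of half a file.
coded = bytes(a ^ b for a, b in zip(A2, B1))

# Each user XORs out the piece it already has in its cache.
A2_recovered = bytes(a ^ b for a, b in zip(coded, cache_user1["B1"]))
B1_recovered = bytes(a ^ b for a, b in zip(coded, cache_user2["A2"]))
assert A2_recovered == A2 and B1_recovered == B1

Uncoded delivery would have needed two half-file transmissions here; the coded broadcast needs one.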

Simple outer bounds for multiterminal source coding
Thomas Courtade
This was a very cute result on using the HGR maximal correlation to get outer bounds for multiterminal source coding without first deriving a single-letterization of the outer bound. The main ideas are to use two properties of the HGR maximal correlation : it tensorizes (which takes care of the multiletter part), and it satisfies the strong data processing inequality from Elza Erkip and Tom Cover’s paper.
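
(In case it’s unfamiliar, the Hirschfeld-Gebelein-Rényi maximal correlation of a pair (X,Y) is

\rho_m(X;Y) = \sup \{ \mathbb{E}[ f(X) g(Y) ] : \mathbb{E} f(X) = \mathbb{E} g(Y) = 0, \ \mathbb{E} f(X)^2 = \mathbb{E} g(Y)^2 = 1 \},

and tensorization means \rho_m(X^n; Y^n) = \rho_m(X;Y) for i.i.d. pairs.)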

NIPS 2012 : day two

I took it a bit easy today at the conference and managed to spend some time talking to collaborators about work, so perhaps I wasn’t 100% all in to the talks and posters. In general I find that for many posters it’s hard to understand what the motivating problem is — it’s not clear from the poster, and it’s not always clear from the explanation. Here are a few papers which I thought were interesting:

W. Koolen, D. Adamskiy, M. Warmuth
Putting Bayes to sleep
Some signals look sort of jump Markov — the distribution of the data changes over time so that there are segments which have distribution A, then later it switches to B, then perhaps back to A, and so on. A prediction procedure which “mixes past posteriors” works well in this setting but it was not clear why. This paper provides a Bayesian interpretation for the predictor as mixing in a “sleeping experts” setting.

J. Duchi, M. Jordan, M. Wainwright, A. Wibisono
Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods
This paper looked at stochastic gradient descent when function evaluations are cheap but gradient evaluations are expensive. The idea is to compute an unbiased approximation to the gradient by evaluating the function at \theta_t and at \theta_t plus some noise, and then forming a finite-difference approximation to the gradient. Some of the attendees claimed this is similar to an approach proposed by Nesterov, but the distinction was unclear to me.
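
A minimal sketch of the generic two-point idea (my own toy version, with my own choices of direction distribution, smoothing parameter, and step size, not necessarily the estimator analyzed in the paper):

import numpy as np

def zero_order_sgd(f, theta0, n_iters=1000, step=0.05, delta=1e-3, seed=0):
    # f(theta) returns a (possibly noisy) scalar loss evaluation; no gradients needed.
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for t in range(n_iters):
        u = rng.standard_normal(theta.shape)                     # random search direction
        g_hat = (f(theta + delta * u) - f(theta)) / delta * u    # two-point gradient estimate
        theta -= step / (t + 1) ** 0.5 * g_hat                   # decaying step size
    return theta

# Example: minimize a noisy quadratic using only function evaluations.
rng = np.random.default_rng(1)
theta_star = np.ones(5)
noisy_loss = lambda th: np.sum((th - theta_star) ** 2) + 0.01 * rng.standard_normal()
theta_hat = zero_order_sgd(noisy_loss, np.zeros(5))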

J. Lloyd, D. Roy, P. Orbanz, Z. Ghahramani
Random function priors for exchangeable graphs and arrays
This paper looked at Bayesian modeling for structures like undirected graphs which may represent interactions, like protein-protein interactions. Infinite random graphs whose distributions are invariant under permutations of the vertex set can be associated to a structure called a graphon. Here they put a prior on graphons, namely a Gaussian process prior, and then try to do inference on real graphs to estimate the kernel function of the process, for example.
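
To make the graphon picture concrete, here is a minimal sketch (my own simplification, using a fixed hand-picked graphon rather than a Gaussian process prior over graphons) of how an exchangeable random graph is generated:

import numpy as np

def sample_graph(graphon, n, seed=0):
    # Draw u_i ~ Uniform(0,1) for each vertex; connect i and j w.p. graphon(u_i, u_j).
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)
    adj = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            if rng.uniform() < graphon(u[i], u[j]):
                adj[i, j] = adj[j, i] = 1
    return adj

# A smooth symmetric kernel on [0,1]^2 standing in for a draw from the prior.
W = lambda u, v: 0.8 * np.exp(-3 * (u - v) ** 2)
G = sample_graph(W, n=50)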

N. Le Roux, M. Schmidt, F. Bach
A Stochastic Gradient Method with an Exponential Convergence Rate for Finite Training Sets
This was a paper marked for oral presentation — the idea is that in gradient descent it is expensive to evaluate gradients if your objective function looks like \sum_{i=1}^{n} f(\theta, x_i), where the x_i are your data points and n is huge, because you have to evaluate n gradients per step. On the other hand, stochastic gradient descent can be slow because at each iteration it picks a single i and takes a gradient step only on f(\theta_t, x_i). Here what they do at step t is pick a random point j, evaluate its gradient, but then take a gradient step using all n points. For points i \ne j they just reuse the gradient from the last time i was picked. Let T_i(t) be the last time i was picked before time t, with T_j(t) = t. Then they take a step in the direction of \frac{1}{n} \sum_{i = 1}^{n} \nabla f(\theta_{T_i(t)}, x_i). This works surprisingly well.
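
Here is a minimal sketch of that bookkeeping as I read it (the function names, step size, and toy example are my own, not the paper’s recommended settings):

import numpy as np

def sag(grad_i, n, p, n_iters=1000, step=0.01, seed=0):
    # grad_i(theta, i) returns the gradient of f(theta, x_i) for a single data point.
    rng = np.random.default_rng(seed)
    theta = np.zeros(p)
    grad_table = np.zeros((n, p))       # last seen gradient for each data point
    grad_sum = np.zeros(p)
    for _ in range(n_iters):
        j = rng.integers(n)             # refresh only one gradient per iteration
        g_new = grad_i(theta, j)
        grad_sum += g_new - grad_table[j]
        grad_table[j] = g_new
        theta -= step * grad_sum / n    # step along the average of the stored gradients
    return theta

# Example: least squares, f(theta, x_i) = (y_i - theta^T x_i)^2.
rng = np.random.default_rng(1)
Xd = rng.standard_normal((200, 10))
yd = Xd @ np.ones(10)
g = lambda th, i: 2 * (Xd[i] @ th - yd[i]) * Xd[i]
theta_hat = sag(g, n=200, p=10)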

Stephane Mallat
Classification with Deep Invariant Scattering Networks
This was an invited talk — Mallat was trying to explain why deep networks seem to do learning well (it all seems a bit like black magic), but his explanation felt a bit heuristic to me in the end. The first main point he had is that wavelets are good at capturing geometric structure like translation and rotation, and appear to have favorable properties with respect to “distortions” in the signal. The notion of distortion is a little vague, but the idea is that if two signals (say images) are similar but one is slightly distorted, they should map to representations which are close to each other. The mathematics behind his analysis framework was group theoretic — he wants to estimate the group of actions which manipulate images. In a sense, this is a control-theory view of the problem (at least it seemed to me). The second point that I understood was that sparsity in representation has a big role to play in building efficient and layered representations. I think I’d have to see the talk again to understand it better, but in the end I wasn’t sure that I understood why deep networks are good, but I did understand some more interesting things about wavelet representations, which is cool.

The things we know we don’t know

As a theoretical engineer, I find myself getting lulled into the trap of what I’m now starting to call “lazy generalization.” It’s a form of bland motivation that you often find at the beginning of papers:

Sensor networks are large distributed collections of low-power nodes with wireless radios and limited battery power.

Really? All sensor networks are like this? I think not. Lots of sensor networks have wired power (think of the power grid) but still communicate wirelessly. Others communicate through wires. This is the kind of ontological statement that metastasizes into the research equivalent of a meme — 3 years after Smart Dust appears, suddenly all papers are about dust-like networks, ignoring the vast range of other interesting problems that arise in other kinds of “sensor networks.”

Another good example is “it is well known that most [REAL WORLD THING] follows a power law,” which bugs Cosma to no end. We then get lots of papers which start with something about power laws and then proceed to analyze some algorithms which work well on graphs that have power-law degree distributions. And then later we get statements like “all natural graphs follow power laws, so here’s a theory for those graphs, which tells us all about nature.”

Yet another example of this is sparsity. Sparsity is interesting! It lets you do a lot of cool stuff, like compressed sensing. And it’s true that some real world signals are approximately sparse in some basis. However, turn the crank and we get papers which make crazy statements approximately equal to “all interesting signals are sparse.” This is trivially true if you take the signal itself as a basis element, but in the way it’s meant (e.g. “in some standard basis”), it is patently false.

So why are these lazy generalizations? It’s a kind of fallacy which goes something like:

  1. Topic A is really useful.
  2. By assuming some Structure B about Topic A, we can do lots of cool/fun math.
  3. All useful problems have Structure B.

Pattern matching, we get A = [sensor networks, the web, signal acquisition], and B = [low power/wireless, power laws, sparsity].

This post may sound like I’m griping about these topics being “hot” — I’m not. Of course, when a topic gets hot, you get a lot of (probably incremental) papers all over the place. That’s the nature of “progress.” What I’m talking about is the third point. When we go back to our alabaster spire of theory on top of the ivory tower, we should not fall into the same trap of saying that “by characterizing the limits of Structure B I have fundamentally characterized Topic A.” Maybe that’s good marketing, but it’s not very good science, I think. Like I said, it’s a trap that I’m sure I’m guilty of stepping into on occasion, but it seems to be creeping into a number of things I’ve been reading lately.

Broadcast gossip revisited : companion variables and convergence to consensus

Mike Rabbat pointed out his new preprint on ArXiv:

Broadcast Gossip Algorithms for Consensus on Strongly Connected Digraphs
Wu Shaochuan, Michael G. Rabbat

The basic starting point is the unidirectional Broadcast Gossip algorithm — we have n nodes in a graph and each node i starts with a value x_i(0). At each time, a random node broadcasts its value to its neighbors in the graph, and they each update their value with a weighted average of their current value and the received value. Eventually this process converges almost surely and all nodes end up with a common value \hat{x}. However, in general \hat{x} \ne \bar{x}, where \bar{x} = \frac{1}{n} \sum_i x_i(0), but \mathbb{E}[\hat{x}] = \bar{x}. So this process achieves consensus but only computes the average in expectation.
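
A minimal toy simulation of that baseline (my own, with a single uniform mixing weight and a ring topology as assumptions) shows exactly this behavior: every run reaches consensus, but the consensus value only matches the initial average in expectation over runs.

import numpy as np

def broadcast_gossip(adj, x0, a=0.5, n_rounds=5000, seed=0):
    # adj: symmetric 0/1 adjacency matrix; x0: initial node values; a: mixing weight.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    n = len(x)
    for _ in range(n_rounds):
        k = rng.integers(n)                                  # random broadcaster
        neighbors = np.nonzero(adj[k])[0]
        x[neighbors] = (1 - a) * x[neighbors] + a * x[k]     # neighbors average in x_k
    return x

# Ring of 20 nodes: x converges to a common value, which varies from run to run.
n = 20
adj = np.zeros((n, n), dtype=int)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = 1
x_final = broadcast_gossip(adj, np.arange(n, dtype=float))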

The reason broadcast gossip doesn’t compute the average is pretty clear — since the communication and updates are unidirectional, the average of the nodes’ values changes over time. One way around this is to use companion variables \{y_i(t)\} to track the effect of the broadcasts. These have been studied before, but in this paper they set the initial values y_i(0) = 0 and perform updates as follows : if node k broadcasts at time t and node j is a neighbor, then

x_j(t+1) = (1 - a_{j,k}) x_j(t) + a_{j,k} x_k(t) + \epsilon d_j^{(k)} y_j(t)
y_j(t+1) = a_{j,k} (x_j(t) - x_k(t)) + (1 - \epsilon d_j^{(k)}) y_j(t) + b_{j,k} y_k(t)
x_k(t+1) = x_k(t)
y_k(t+1) = 0

So what’s going on here? Each time a node transmits it resets its companion variable to 0. Each time it receives a broadcast it accumulates an offset in y_j. The actual value estimate at the nodes is weighted average of the current and received value plus some of the local offset.
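
In code, one broadcast step looks something like the sketch below (my own rendering of the update equations above; the data structures and the convention that D[k][j] holds d_j^{(k)} are my assumptions):

import numpy as np

def broadcast_step(x, y, k, neighbors, A, B, D, eps):
    # Node k broadcasts (x[k], y[k]); each neighbor j updates its pair (x[j], y[j]).
    x_new, y_new = x.copy(), y.copy()
    for j in neighbors:
        x_new[j] = (1 - A[j, k]) * x[j] + A[j, k] * x[k] + eps * D[k][j] * y[j]
        y_new[j] = (A[j, k] * (x[j] - x[k])
                    + (1 - eps * D[k][j]) * y[j]
                    + B[j, k] * y[k])
    x_new[k] = x[k]    # the broadcaster keeps its value...
    y_new[k] = 0.0     # ...and resets its companion variable
    return x_new, y_new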

The parameters of the algorithm are matrices A = (a_{j,k}) and B = (b_{j,k}) and per-node matrices D_k = (d_j^{(k)}). Let

A_k = A e_k e_k^T
B_k = B e_k e_k^T
L_k = \mathrm{diag}(A_k \mathbf{1}) - A_k
S_k = I - e_k e_k^T + B_k

When node k broadcasts, we can write a matrix

W_k = \left[ \begin{array}{cc} I - L_k & \epsilon D_k \\ L_k & S_k - \epsilon D_k \end{array} \right]

such that

\left[ \begin{array}{c} \mathbf{x}(t+1) \\ \mathbf{y}(t+1) \end{array} \right] = W(t) \left[ \begin{array}{c} \mathbf{x}(t) \\ \mathbf{y}(t) \end{array} \right]

where W(t) = W_k if node k broadcasts at time t.

The main results are:

  1. If we choose \epsilon small enough, then \lim_{t \to \infty} \mathbb{E}[ \mathbf{x}(t) | \mathbf{x}(0) ] = \hat{x} \mathbf{1} and \hat{x} = \bar{x} under more conditions.
  2. A sufficient condition for the second moment of the error to go to 0.
  3. Rules on how to pick the parameters, especially \epsilon.

There are also some nice simulations. Broadcast gossip is nice because of its simplicity, and adding a little monitoring/control variable y_j(t) seems to buy a lot in terms of controlling bad sample paths for the process.

Postdoc job openings

Some people have told me about postdoctoral positions that are opening up, and I figured I’d repost some of them here as they come along. Of course, there are other places to post announcements, but I find that postdoc opportunities are a bit harder to advertise/hear about. I think a lot of systems EE people applying for academic positions right out of grad school tend to put off applying for postdocs until they hear back about their faculty interviews — I’d tend to say this is a mistake:

  • If your graduation date is a little flexible, pinging someone early on (e.g. in the fall) about possible postdoc opportunities can be a good plan. NSF grant deadlines are in the fall, and so they could write a postdoc position into a current proposal.
  • Of course you’re going to apply for faculty positions, and the people you’re talking to about postdoc positions know that. However, if you get to May and haven’t talked to anyone about postdoc options, you may find that those positions have filled up.
  • Don’t think of a postdoc as a “fallback plan” (akin to a “safety school”) — it’s an opportunity and a chance to make a strategic decision. Do you want to switch areas or learn about something new? Do you want to dig deeper into things you’ve already been working on? Do you want a springboard to get a job in a specific country? Do you want to build closer ties to industry? Do you want closer mentorship?

I went to a panel at Allerton once on “whether you should do a postdoc” starring (among other people) Aaron Wagner and Todd Coleman, I believe. Everyone was very enthusiastic about doing a postdoc. Everyone on the panel had faculty positions lined up for after their postdoc and deferred their start date to do that postdoc. This is the best of all possible worlds but is pretty unusual, so don’t count on it.

This is all dodging the issue of whether or not you should even do a postdoc. That might be a topic for a different post (or a debate for the comments) — I know people have strong feelings on both sides. I tend to think our system is broken or veering into brokenness.

However, more information is more power, so if you have a postdoc announcement (details are helpful) and want me to post it here, please do send it my way. You can also try to post to the IT Society website.