# ITA 2015: quick takes

Better late than never, I suppose. A few weeks ago I escaped the cold of New Jersey to my old haunts of San Diego. Although La Jolla was always a bit fancy for my taste, it’s hard to beat a conference which boasts views like this:

A view from the sessions at ITA 2015

I’ll just recap a few of the talks that I remember from my notes — I didn’t really take notes during the plenaries so I don’t have much to say about them. Mostly this was due to laziness, but finding the time to blog has been challenging in this last year, so I think I have to pick my battles. Here’s a smattering consisting of

$\{ \mathrm{talks\ attended} \} \cap \{ \mathrm{talks\ with\ understandable\ notes} \}$

(Information theory)
Emina Soljanin talked about designing codes that are good for fast access to the data in distributed storage. Initial work focused on how to repair codes under disk failures. She looked at how easy it is to retrieve the information afterwords to guarantee some QoS for the storage system. Adam Kalai talked about designing compression schemes that work for an “audience” of decoders. The decoders have different priors on the set of elements/messages so the idea is to design an encoder that works for this ensemble of decoders. I kind of missed the first part of the talk so I wasn’t quite sure how this relates to classical work in mismatched decoding as done in the information theory world. Gireeja Ranade gave a great talk about defining notions of capacity/rate need to control a system which as multiplicative uncertainty. That is, $x[n+1] = x[n] + B[n] u[n]$ where $B[n]$ has the uncertainty. She gave a couple of different notions of capacity, relating to the ratio $| x[n]/x[0] |$ — either the expected value of the square or the log, appropriately normalized. She used a “deterministic model” to give an explanation of how control in this setting is kind of like controlling the number of significant bits in the state: uncertainty increases this and you need a certain “amount” of control to cancel that growth.

(Learning and statistics)
I learned about active regression approaches from Sivan Sabato that provably work better than passive learning. The idea there is do to use a partition of the X space and then do piecewise constant approximations to a weight function that they use in a rejection sampler. The rejection sampler (which I thought of as sort of doing importance sampling to make sure they cover the space) helps limit the number of labels requested by the algorithm. Somehow I had never met Raj Rao Nadakuditi until now, and I wish I had gotten a chance to talk to him further. He gave a nice talk on robust PCA, and in particular how outliers “break” regular PCA. He proposed a combination of shrinkage and truncation to help make PCA a bit more stable/robust. Laura Balzano talked about “estimating subspace projections from incomplete data.” She proposed an iterative algorithm for doing estimation on the Grassmann manifold that can do subspace tracking. Constantine Caramanis talked about a convex formulation for mixed regression that gives a guaranteed solution, along with minimax sample complexity bounds showing that it is basically optimal. Yingbin Liang talked about testing approaches for understanding if there is an “anomalous structure” in a sequence of data. Basically for a sequence $Y_1, Y_2, \ldots, Y_n$, the null hypothesis is that they are all i.i.d. $\sim p$ and the (composite) alternative is that there an interval of indices which are $\sim q$ instead. She proposed a RKHS-based discrepancy measure and a threshold test on this measure. Pradeep Ravikumar talked about a “simple” estimator that was a “fix” for ordinary least squares with some soft thresholding. He showed consistency for linear regression in several senses, competitive with LASSO in some settings. Pretty neat, all said, although he also claimed that least squares was “something you all know from high school” — I went to a pretty good high school, and I don’t think we did least squares! Sanmi Koyejo talked about a Bayesian devision theory approach to variable selection that involved minimizing some KL-divergence. Unfortunately, the resulting optimization ended up being NP-hard (for reasons I can’t remember) and so they use a greedy algorithm that seems to work pretty well.

(Privacy)
Cynthia Dwork gave a tutorial on differential privacy with an emphasis on the recent work involving false discovery rate. In addition to her plenary there were several talks on differential privacy and other privacy measures. Kunal Talwar talked about their improved analysis of the SuLQ method for differentially private PCA. Unfortunately there were two privacy sessions in parallel so I hopped over to see John Duchi talk about definitions of privacy and how definitions based on testing are equivalent to differential privacy. The testing framework makes it easier to prove minimax bounds, though, so it may be a more useful view at times. Nadia Fawaz talked about privacy for time-series data such as smart meter data. She defined different types of attacks in this setting and showed that they correspond to mutual information or directed mutual information, as well as empirical results on a real data set. Raef Bassily studied a estimation problem in the streaming setting where you want to get a histogram of the most frequent items in the stream. They reduce the problem to one of finding a “unique heavy hitter” and develop a protocol that looks sort of like a code for the MAC: they encode bits into a real vector, had noise, and then add those up over the reals. It’s accepted to STOC 2015 and he said the preprint will be up soon.

# Fenchel duality, entropy, and the log partition function

[Update: As Max points out in the comments, this is really a specialized version of the Donsker-Varadhan formula, also mentioned by Mokshay in a comment here. I think the difficulty with concepts like these is that they are true for deeper reasons than the ones given when you learn them — this is a special case that requires undergraduate probability and calculus, basically.]

One of my collaborators said to me recently that it’s well known that the “negative entropy is the Fenchel dual of the log-partition function.” Now I know what these words meant, but it somehow was not a fact that I had learned elsewhere, and furthermore, a sentence like that is frustratingly terse. If you already know what it means, then it’s a nice shorthand, but for those trying to figure it out, it’s impenetrable jargon. I tried running it past a few people here who are generally knowledgeable but are not graphical model experts, and they too were unfamiliar with it. While this is just a simple thing about conjugate duality, I think it doesn’t really show up in information theory classes unless the instructor talks a bit more about exponential family distributions, maximum entropy distributions, and other related concepts. Bert Huang has a post on Jensen’s inequality as a justification.

We have a distribution in the exponential family:

$p(x; \theta) = \exp( \langle \theta, \phi(x) \rangle - A(\theta) )$

As a side note, I often find that the exponential family is not often covered in systems EE courses. Given how important it is in statistics, I think it should be a bit more of a central concept — I’m definitely going to try and work it in to the detection and estimation course.

For the purposes of this post I’m going to assume $x$ takes values in a discrete alphabet $\mathcal{X}$ (say, n-bit strings). The function $\phi(x)$ is a vector of statistics calculated from $x$, and $\theta$ is a vector of parameters. the function $A(\theta)$ is the log partition function:

$A(\theta) = \log\left( \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \right)$

Where the partition function is

$Z(\theta) = \sum_{x} \exp( \langle \theta, \phi(x) \rangle )$

The entropy of the distribution is easy to calculate:

$H(p) = \mathbb{E}[ - \log p(x; \theta) ] = A(\theta) - \langle \theta, \mathbb{E}[\phi(x)] \rangle$

The Fenchel dual of a function $f(\theta)$ is the function

$g(\psi) = \sup_{\theta} \left\{ \langle \psi, \theta \rangle - f(\theta) \right\}$.

So what’s the Fenchel dual of the log partition function? We have to take the gradient:

$\nabla_{\theta} \left( \langle \psi, \theta \rangle - A(\theta) \right) = \psi - \frac{1}{Z(\theta)} \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \phi(x) = \psi - \mathbb{E}[ \phi(x) ]$

So now setting this equal to zero, we see that at the optimum $\theta^*$:

$\langle \psi, \theta^* \rangle = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle$

And the dual function is:

$g(\psi) = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle - A(\theta^*) = - H(p)$

The standard approach seems to go the other direction by computing the dual of the negative entropy, but that seems more confusing to me (perhaps inspiring Bert’s post above). Since the log partition function and negative entropy are both convex, it seems easier to exploit the duality to prove it in one direction only.

# Data: what is it good for? (Absolutely Something): the first few weeks

So Waheed Bajwa and I have been teaching this Byrne Seminar on “data science.” At Allerton some people asked me how it was going and what we were covering in the class. These seminars are meant to be more discussion-based. This is a bit tough for us in particular:

• engineering classes are generally NOT discussion-based, neither in the US nor in Pakistan
• it’s been more than a decade since we were undergraduates, let alone 18
• the students in our class are fresh out of high school and also haven’t had discussion-based classes

My one experience in leading discussion was covering for a theater class approximately 10 years ago, but that was junior-level elective as I recall, and the dynamics were quite a bit different. So getting a discussion going and getting all of the students to participate is, on top of being tough in general, particularly challenging for us. What has helped is that a number of the students in the class are pretty engaged with the ideas and material, and we do in the end get to collectively think about the technologies around us and the role that data plays a bit differently.

What I wanted to talk about in this post was what we’ve covered in the first few weeks. If we offer this class again it would be good to revisit some of the decisions we’ve made along the way, as this is as much a learning process for us as it is for them. A Byrne Seminar meets for 10 times during the semester, so that it will end well before finals. We had some overflow from one topic to the next, but roughly speaking the class went in the following order:

• Introduction: what is data?
• Potentials and perils of data science
• The importance of modeling
• Statistical considerations
• Machine learning and algorithms
• Data and society: ethics and privacy
• Data visualizaion
• Project Presentations

I’ll talk a bit more on the blog about this class, what we covered, what readings/videos we ended up choosing, and how it went. I think it would be fun to offer this course again, assuming our evaluations pass muster. But in the meantime, the class is still on, so it’s a bit hard to pass retrospective judgement.

# MAP and ML in practice on the New Jersey Turnpike

Since I might be teaching detection and estimation next semester, I’ve been thinking a little bit about decision rules during my commute down the New Jersey Turnpike. The following question came to mind:

Suppose you see a car on the Turnpike who is clearly driving dangerously (weaving between cars, going 90+ MPH, tailgating an ambulance, and the like). You have to decide whether the car has New Jersey or New York plates [*]?

This is a hypothesis testing problem. I will assume for simplicity that New York drivers have cars with New York plates and New Jersey drivers have New Jersey plates [**]:
$H_0$: New Jersey driver
$H_1$: New York driver
Let $Y$ be a binary variable indicating whether or not I observe dangerous driving behavior. Based on my entirely subjective experience, I would say the in terms of likelihoods,
$\mathbb{P}(Y = 1 | H_1) > \mathbb{P}(Y = 1 | H_0)$
so the maximum likelihood (ML) rule would suggest that the driver is from New York.

However, if I take into account my (also entirely subjective) priors on the fraction of drivers $P(H_0), P_H(1)$ from New Jersey and New York, respectively, I would have to say
$\mathbb{P}(Y = 1 | H_1) P(H_1) < \mathbb{P}(Y = 1 | H_0) P(H_0)$
so the maximum a-posteriori probability (MAP) rule would suggest that the driver is from New Jersey.

Which is better?

[*] I am assuming North Jersey here, so Pennsylvania plates are negligible.
[**] This may be a questionable modeling assumption given suburban demographics.

# A joke for Max Raginsky

Setting: a lone house stands on a Scottish moor. The fog is dense here. It is difficult to estimate where your foot will fall. A figure in a cloak stands in front of the door.

Figure: [rapping on the door, in a Highland accent] Knock knock!

Voice from inside: Who’s there?

Figure: Glivenko!

Voice: Glivenko who?

Figure: Glivenko-Cantelli!

[The fog along the moor converges uniformly on the house, enveloping it completely in a cumulus.]

Scene.

# SPCOM 2014: some talks

Relevance Singular Vector Machine for Low-­rank Matrix Sensing
Martin Sundin; Saikat Chatterjee; Magnus Jansson; Cristian Rojas
This talk was on designing Bayesian priors for sparse-PCA problems — the key is to find a prior which induces a low-rank structure on the matrix. The model was something like $y = A \mathrm{vec}(X) + n$ where $X$ is a low-rank matrix and $n$ is noise. The previous state of the art is by Babacan et al., a paper which I obviously haven’t read, but the method they propose here (which involved some heavy algebra/matrix factorizations) appears to be competitive in several regimes. Probably more of interest to those working on Bayesian methods…

Non-Convex Sparse Estimation for Signal Processing
David Wipf
More Bayesian methods! Although David (who I met at ICML) was not trying to say that the priors are particularly “correct,” but rather that the penalty functions that they induce on the problems he is studying actually make sense. More of an algorithmist’s approach, you might say. He set up the problem a bit more generally, to minimize problems of the form
$\min_{X_i} \sum_{i} \alpha_i \mathrm{rank}[X_i] \ \ \ \ \ \ \ Y = \sum_{i} A_i(X_i)$
where $A_i$ are some operators. He made the case that convex relaxations of many of these problems, while analytically beautiful, have restrictions which are not satisfied in practice, and indeed they often have poor performance. His approach is via Empirical Bayes, but this leads to non-convex problems. What he can show is that the algorithm he proposes is competitive with any method that tries to separate the error from the “low-rank” constraint, and that the new optimization is “smoother.” I’m sure more details are in his various papers, for those who are interested.

PCA-HDR: A Robust PCA Based Solution to HDR Imaging
Vinod showed some information theoretic approaches to understanding how much communication is needed for secure computation protocols like remote oblivious transfer: Xavier has $\{X_0, X_1\}$, Yvonne has $Y \in \{0,1\}$ and Zelda wants $Z = X_Y$, but nobody should be able to infer each other’s values. Feige, Killian, and Naor have a protocol for this, which Vinod and Co. can show is communication-optimal. There were several ingredients here, including cut-set bounds, distribution switching, data processing inequalities, and special bounds for 3-party protocols. More details in his CRYPTO paper (and others).
In a MIMO wiretap setting, if the receiver has more antennas than the transmitter, then the transmitter can send noise in the nullspace of the channel matrix of the direct channel — as long as the eavesdropper has fewer antennas than the transmitter then secure transmission is possible. In this paper they show that positive secrecy capacity is possible even when the eavesdropper has more antennas, but as the number of eavesdropper antennas grows, the achievable rate goes to $0$. Perhaps a little bit of a surprise here!