ArXiV notes : July 2012 part 1

I usually tag things from ArXiV that I want to skim later, so I figured I would post the ones I found interesting in the last month or so. Some of them may deserve their own post later. Jet lag is not letting me take the nap I need right now so I figured I’d post it before conking out.

Convergence Rates for Differentially Private Statistical Estimation
K. Chaudhuri and D. Hsu
ArXiV 1206.6395
This paper is about estimation under privacy constraints. In differential privacy there are a fair number of assumptions about the data being well behaved — boundedness, for example. One way around this is to move to a relaxed notion of privacy known as (\alpha,\delta)-differential privacy, but what this paper shows is that for the stricter \alpha-differential privacy definition, statistical estimation is governed by the gross error sensitivity (GES) of the estimator.
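For reference, the gross error sensitivity is a standard quantity from robust statistics (this is the usual definition from the robustness literature, not notation lifted from the paper): for an estimator T at a distribution P,

\gamma^*(T, P) = \sup_{x} \left\| \mathrm{IF}(x; T, P) \right\|,

the worst-case magnitude of the influence function. So, roughly, estimators that are robust to outliers are also the ones that can be estimated accurately under \alpha-differential privacy.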

The Greedy Miser: Learning under Test-time Budgets
Z. Xu, K. Weinberger, and O. Chapelle
ArXiV 1206.6451
This paper is about incorporating the cost of extracting features into classification or regression. For a data point \mathbf{x}_i \in \mathbb{R}^d, feature \alpha costs c_{\alpha} to compute, and there is a total budget for the computation. The main approach is to iteratively minimize a linear combination of the loss and the cost, where each iteration consists of building a regression tree.
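Just to make the flavor concrete, here is a minimal sketch of cost-aware stage-wise boosting: each stage greedily adds one regression tree, and the objective charges a feature's cost only the first time that feature is used. This is my own toy rendering of the idea — the function name and the one-new-feature-per-stage restriction are mine, and the paper itself builds the cost term into the impurity function used to grow the trees:

```python
# Toy sketch of loss + lambda * feature-cost minimization via boosted trees.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def greedy_miser_sketch(X, y, costs, lam=0.1, n_stages=20, lr=0.1):
    """X: (n, d) data, y: (n,) targets, costs: (d,) extraction cost per feature."""
    n, d = X.shape
    pred = np.zeros(n)
    used = np.zeros(d, dtype=bool)   # features already paid for
    trees = []
    for _ in range(n_stages):
        residual = y - pred          # negative gradient of the squared loss
        best = None
        for j in range(d):           # try adding feature j this stage
            feats = np.where(used | (np.arange(d) == j))[0]
            tree = DecisionTreeRegressor(max_depth=2).fit(X[:, feats], residual)
            obj = np.mean((residual - tree.predict(X[:, feats])) ** 2)
            obj += lam * (0.0 if used[j] else costs[j])  # pay only for new features
            if best is None or obj < best[0]:
                best = (obj, j, tree, feats)
        _, j, tree, feats = best
        used[j] = True
        pred = pred + lr * tree.predict(X[:, feats])
        trees.append((tree, feats))
    return trees

# Example: 200 points, 5 features with increasing extraction costs.
X = np.random.randn(200, 5); y = X[:, 0] + 0.5 * X[:, 3]
model = greedy_miser_sketch(X, y, costs=np.array([1, 2, 3, 4, 5.0]))
```

The point is just that each stage trades off residual fit against incremental feature cost, so cheap features get used first and expensive ones only if they pay for themselves.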

Sequential detection of multiple change points in networks: a graphical model approach
A.A. Amini and X. Nguyen
ArXiV 1207.1687
This looks at the change point detection problem in the network setting, where each of the d nodes in a network has a change point \lambda_j. Data is observed on vertices and edges, and nodes have to pass messages along the graph. There's a prior distribution on the change point times, and after observing the data the nodes can use message passing (belief propagation) to estimate the posterior distribution and define a stopping time. This turns out to be asymptotically optimal, in that the procedure achieves the optimal expected delay in detecting the change points.

The Price of Privacy in Untrusted Recommendation Engines
S. Banerjee, N. Hegde, and L. Massoulié
ArXiV 1207.3269
This studies the effect on recommendation engines of individuals masking their data due to privacy concerns. The technical issues are that the dimension is high and each user only contributes a small amount of data (e.g. the movies that they rented or have seen). The question is one of sample complexity — how many users do we need to cluster successfully? If users are not too sparse, a sketch plus spectral techniques can achieve the Fano lower bound (order-wise), but the dependence is like N \log N, where N is the number of items (e.g. movies). It's worse if the users are sparse — the scaling is with N^2, which is also achievable in some settings. The analysis approach using Fano is analogous to the “information leakage” connections with differential privacy.

Surrogate Losses in Passive and Active Learning
S. Hanneke and L. Yang
ArXiV 1207.3772
This paper looks at how to use surrogate losses (which are computationally tractable) in active learning problems. “We take as our starting point that we have already committed to use a given surrogate loss, and we restrict our attention to just those scenarios in which this heuristic actually does work. We are then interested in how best to make use of the surrogate loss toward the goal of producing a classifier with relatively small error rate.” The central insight in their algorithm is that since the “real loss” is (say) the 0-1 loss, for the purposes of active learning one only really needs to get the sign right, not the magnitude. So the process will minimize the real loss but not necessarily the surrogate.

SPCOM 2012

[Photo: IISc ECE department sign]

After a fun trip to Mysore this weekend, I have landed at the Indian Institute of Science in Bangalore (Bengaluru for those more PC) for SPCOM 2012. This is a 4-day conference (yesterday was tutorials) that has been held 8 times before (sometimes skipping years) here at IISc. I've never visited here before, but my father, grandfather, and great-grandfather are all graduates, so it feels pretty special to finally see the place. My talk is dead last, so we'll see how many people are still here by the time it rolls around.

Strategies in the Iterated Prisoner’s Dilemma

Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent
William H. Press and Freeman J. Dyson
PNAS 109(26), June 26, 2012

This was an interesting little paper that I learned about and am still digesting; as I've said before, I'm not a game theory guy. They start out with a 2×2 game, the classical Prisoner's Dilemma with players X and Y — if both cooperate (C), the payoffs are (R,R), if the first (resp. second) defects (D) they are (T,S) (resp. (S,T)), and if they both defect then (P,P). The usual conditions are T > R > P > S to keep it interesting.
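For reference, here's the payoff table, with entries written as (X's payoff, Y's payoff):

\begin{array}{c|cc}
& Y: C & Y: D \\ \hline
X: C & (R,R) & (S,T) \\
X: D & (T,S) & (P,P)
\end{array}

(In the iterated setting one usually also assumes 2R > T + S, so the players can't beat mutual cooperation by taking turns exploiting each other.)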

In an iterated game the players get to play over and over again and try to develop strategies to win more than their opponent. The first interesting (to me) observation in this paper is that strategies which maintain a long memory give no advantage over those which maintain a short memory, in the sense that if X plays a short-memory strategy, then any long-memory strategy of Y earns the same payoffs as a certain short-memory strategy of Y would. So if your opponent is playing a short-memory strategy, you might as well play a short-memory strategy too.

So with that idea in mind they look at first-order Markov strategies — given the outcome of the previous round (CC, CD, DC, DD), player X randomly chooses whether to cooperate or not in the next round. Thus each player's strategy is specified by 4 probabilities of cooperating given the previous state, and you can compute a 4×4 Markov matrix of transition probabilities from the randomized strategies of the two players.
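Here's a quick sketch of that computation. The state ordering (CC, CD, DC, DD means X's previous move then Y's), the conventional payoff values, and the noisy tit-for-tat example are my choices for illustration; only the general setup is from the paper.

```python
# Minimal sketch of the first-order Markov setup: p and q are each four
# probabilities of cooperating given the previous round's outcome, with
# states ordered (CC, CD, DC, DD) from X's point of view.
import numpy as np

def transition_matrix(p, q):
    """4x4 Markov matrix over states (CC, CD, DC, DD), round to round."""
    # Y sees the state with the roles swapped, so CD and DC trade places.
    q_sw = [q[0], q[2], q[1], q[3]]
    M = np.zeros((4, 4))
    for s in range(4):
        px, py = p[s], q_sw[s]  # probabilities X and Y cooperate next round
        M[s] = [px * py, px * (1 - py), (1 - px) * py, (1 - px) * (1 - py)]
    return M

def long_run_payoffs(p, q, R=3, S=0, T=5, P=1):
    """Average payoffs (to X, to Y) under the chain's stationary distribution."""
    M = transition_matrix(p, q)
    w, v = np.linalg.eig(M.T)                      # left eigenvectors of M
    pi = np.real(v[:, np.argmin(np.abs(w - 1))])   # eigenvalue-1 eigenvector
    pi /= pi.sum()
    return pi @ np.array([R, S, T, P]), pi @ np.array([R, T, S, P])

# Example: both players use a noisy tit-for-tat (cooperate with probability
# 0.9 if the opponent cooperated last round, 0.1 otherwise).
print(long_run_payoffs([0.9, 0.1, 0.9, 0.1], [0.9, 0.1, 0.9, 0.1]))
```

The long-run average payoffs are linear in the stationary distribution of this matrix, which is why its structural properties end up controlling the iterated game.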

Now the game becomes how the payoffs behave under different choices of strategy, which boils down to spectral/structural properties of this Markov matrix. To be honest, I found the discussion in the rest of the paper more confusing than enlightening, because the mathematics was a little obscured by the discussion of implications. I realize this is a PNAS paper, but it felt a little discursive at times and the mathematical intuition was not entirely clear to me.

Still, it’s a short read and probably worth a peruse.

A comment about maximal versus average error

In my previous post I made some vague comments about maximal and average error which I think are not complicated but don’t often arise when we think about plain vanilla information theory problems. These come up in arbitrarily varying channels, but also help motivate some interesting alternative models.

We most often think of error criteria as being about the desiderata for communication — asking for the maximal error over messages to be small is more stringent than asking for the average error. However, if we have a code with small average error, we can expurgate the bad codewords to get a code with small maximal error. With that meta-argument in hand, at a first pass we're done: average error and maximal error are pretty much the same.
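To spell out the expurgation step (this is the standard calculation, not anything specific to this post): if a code with N codewords has average error \bar{\epsilon} = \frac{1}{N} \sum_{m=1}^{N} P_e(m), then by Markov's inequality at most half of the codewords can have P_e(m) > 2\bar{\epsilon}. Throwing those away leaves a code with N/2 codewords and maximal error at most 2\bar{\epsilon}, at a rate cost of only \frac{1}{n} \log 2 for blocklength n.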

However, choosing an error criterion can also be a choice about the communication model. If the noise can somehow depend on what is being transmitted, then this distinction can come up. If the error criterion is maximal error and the interference is arbitrary, then for each message and each interference signal I need to guarantee that the probability of error is small. Implicitly, this allows the interference to be chosen as a function of the message. Under average error, the interference cannot be a function of the message.

If we allow for common randomness, the transmitter can mask the dependence of the codeword on the message. Suppose we have a deterministic code with N codewords and small average error. Then we can create a randomized code by choosing a random permutation \pi of \{1, 2, \ldots, N\} and encoding message m with the \pi(m)-th codeword of the original code. In this way the maximal error becomes small, since for each fixed message the error, averaged over the common randomness, is exactly the average error of the original code.
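In symbols, if P_e(m') is the error of the m'-th codeword of the original code, then since \pi(m) is uniform on \{1, \ldots, N\} for each fixed m,

\max_m \mathbb{E}_{\pi}\left[ P_e(\pi(m)) \right] = \max_m \frac{1}{N} \sum_{m'=1}^{N} P_e(m') = \bar{\epsilon},

so the maximal error of the randomized code equals the average error of the deterministic one.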

From this it’s clear that the choice of error criterion is also a modeling choice, especially when there is no common randomness allowed. However, this is all looking at the case where the interference can depend on the message. The trickier situation is when the interference can depend on the transmitted codeword, which is what I discussed in the previous post. In that setting using common randomness to mask the message-codeword dependence doesn’t buy that much. What I was working on recently was how much it buys you when the dependence is weakened.

As a final note, this model of interference depending on the transmitted codeword appears in a stochastic form under the name “correlated jamming” and has been worked on by several researchers, including Muriel Médard, Shabnam Shafiee, and Sennur Ulukus. My goal was to get an AVC and geometric perspective on these kinds of communication models.

Recommended iPad apps for technical academics

I have an iPad now (hooray research funds) and I’ve been trying to figure out how to make it more useful for me research-wise. Here are some apps I’ve found, but I would love recommendations on others that are useful for academics who do math-y stuff.

OmniFocus is a task management app — it's a little feature-heavy and expensive, but I've been using it to reorganize (and sync) my to-dos when on planes and the like.

iAnnotate PDF is a PDF annotation tool that syncs to Dropbox (among other things). I now drop the papers I have to review into a folder there and then read and mark them up. Given that my review load has doubled recently, being able to make notes without printing is nice. I think it will get even better once I get a stylus.

In terms of media, IEEE Spectrum is available, as is Communications magazine (but not Signal Processing).

I use Prompt to SSH into machines on the few times that I need to do that from the iPad. It has saved my butt once or twice.

PDF Presenter is supposed to be nice for doing slides, but I haven’t used it yet.

There’s an old post on LaTeX for the iPad, but nothing really looked that appealing.

Anyone else have some recommendations?

ISIT 2012 : the Shannon Award!

I am pleased as punch that Katalin Marton won the 2013 Shannon Award. She made seminal contributions to information theory in its earlier years and has been active in combinatorics and stochastic processes since. Many people start doing information theory because of an interest in digital communications — my own academic background reflects this — and as a result we get blinders on regarding the role information theory plays in other fields, such as statistics (c.f. Rissanen) and combinatorics (c.f. Ahlswede). It's great to see Marton's work recognized, and I hope she talks about some of her more recent interests. I also find it inspirational that a woman has finally won the Shannon Award — a sign, I hope, of the change that is happening in engineering more broadly.