Linkage

Via Amber, a collection of grassroots feminist political posters from India.

Via John, some fun investigations on how 355/113 is an amazingly good approximation to \pi. Also related are the Stern-Brocot trees, which can give continued fraction expansions.
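A quick way to see where 355/113 comes from is to compute the convergents of the continued fraction expansion of \pi; a minimal Python sketch (my own, not from the linked post — float precision only supports the first several partial quotients):

```python
from fractions import Fraction
import math

def convergents(x, n):
    """First n continued fraction convergents of x.

    Uses float arithmetic for the expansion, so only the first several
    partial quotients are trustworthy -- enough to reach 355/113.
    """
    a, out = [], []
    t = x
    for _ in range(n):
        a.append(int(t))
        # fold the partial quotients a[0], a[1], ... back into a fraction
        frac = Fraction(a[-1])
        for q in reversed(a[:-1]):
            frac = q + 1 / frac
        out.append(frac)
        t = 1 / (t - a[-1])
    return out

for c in convergents(math.pi, 5):
    print(c, abs(float(c) - math.pi))
```

The fourth convergent is 355/113, accurate to about 2.7 × 10^{-7} — remarkable for a three-digit denominator, which is exactly the point of the 292 that appears next in the expansion.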

I had missed this speech by a 10-year-old on gay marriage when it happened (I was in India), but it’s pretty heartwarming. There’s also some background on how the principal originally deemed the speech “inappropriate.”

What is a Bayesian?

Unrelatedly, ML Hipster — tight bounds and tight jeans.

Le charme discret de l’entropie?

TTI Chicago is located on the University of Chicago campus. I am a UChicago affiliate, which means I get to use the libraries here. Unfortunately, the University is almost opposed to the idea of engineering, so things like institutional subscriptions to IEEExplore are out. Compounding this is a kind of old-fashionedness about the university which makes a lot of their collection… dated. However, since information theory is math-like, the university does have a fair number of information theory books and resources, including several textbooks and monographs which I had not seen before. One such is Stanford Goldman’s book, whose first chapter is entitled:

Le charme discret de l’entropie?

There is also a cute figure:

Moo!

ArXiV notes : July 2012 part 2

Differentially Private Filtering and Differentially Private Kalman Filtering
Jerome Le Ny, George J. Pappas
These papers apply the differential privacy model to classical signal processing and look at the effect of adding noise before filtering and summing versus adding noise after filtering and summing. The key is that the operations have to be causal, I think — we usually think about differential privacy as operating offline or online in very restricted settings, but here the signals are coming in as time series.

Finite sample posterior concentration in high-dimensional regression
Nate Strawn, Artin Armagan, Rayan Saab, Lawrence Carin, David Dunson
They study ultra high-dimensional linear regression (cue guitars and reverb) in the “large p, small n” setting. The goal is to get a Bernstein-von Mises type theorem — i.e. assuming the data comes from a linear model with true parameter \beta_0, the posterior obtained by Bayesian inference should concentrate around \beta_0. They of course need a sparsity assumption, and the prior must assign reasonable mass around the true parameter and low probability to non-sparse signals. The methods use some ideas from compressed sensing (the Dantzig selector) and should be of interest to people working in that area.

Identifying Users From Their Rating Patterns
José Bento, Nadia Fawaz, Andrea Montanari, and Stratis Ioannidis
This is a paper about recommender systems as part of the 2011 CAMRa challenge. They look at the problem of re-identification of users in this data and show that looking at the time stamps of movie ratings is much more useful than looking at the rating values. This suggests to me that people should use a masking system like Anant and Parv’s “Incentivizing Anonymous ‘Peer-to-Peer’ Reviews” (P. Venkitasubramaniam and A. Sahai, Allerton 2008) paper.

Mechanism Design in Large Games: Incentives and Privacy
Michael Kearns, Mallesh M. Pai, Aaron Roth, Jonathan Ullman
This paper proposes a variant of differential privacy, which they call joint differential privacy, and looks at mechanism designs that satisfy privacy and are incentive compatible. At first glance these should be incompatible, since the latter implies “revealing the truth.” The model is one in which each agent has a finite set of actions but its own payoff/value is private. This is somewhat out of my area, so I can’t really get the nuances of what’s going on here, but a crucial assumption is that there are a large number of agents. Joint differential privacy seems to be a form of (\epsilon,\delta) differential privacy on the utility functions of the users.

ArXiV notes : July 2012 part 1

I usually tag things from ArXiV that I want to skim later, so I figured I would post the ones I found interesting in the last month or so. Some of them may deserve their own post later. Jet lag is not letting me take the nap I need right now so I figured I’d post it before conking out.

Convergence Rates for Differentially Private Statistical Estimation
K. Chaudhuri and D. Hsu
ArXiV 1206.6395
This paper is about estimation under privacy constraints. In differential privacy there are a fair number of constraints requiring the data to be well behaved — typical assumptions include boundedness, for example. One way around this is to move to a relaxed notion of privacy known as (\alpha,\delta)-differential privacy, but what this paper shows is that for the stricter \alpha-differential privacy definition, statistical estimation is governed by the gross error sensitivity (GES) of the estimator.

The Greedy Miser: Learning under Test-time Budgets
Z. Xu, K. Weinberger, and O. Chapelle
ArXiV 1206.6451
This paper is about incorporating the cost of extracting features in classification or regression. For a data point \mathbf{x}_i \in \mathbb{R}^d, the feature \alpha costs c_{\alpha} to compute and there is a total budget for the computation. The main approach is to minimize a linear combination of the loss and the cost iteratively, where each iteration consists of building a regression tree.
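To make the loss-plus-cost idea concrete, here is a toy sketch (emphatically not the paper’s algorithm: their regression trees are replaced by depth-1 stumps, and the data, feature costs, and trade-off weight lam are all invented for illustration):

```python
import random

random.seed(0)
# Synthetic data: y depends on features 0 and 2; feature 3 is expensive noise.
n, d = 200, 4
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
y = [x[0] + 0.5 * x[2] + random.gauss(0, 0.1) for x in X]
cost = [1.0, 1.0, 1.0, 10.0]   # made-up per-feature extraction costs
lam = 1.0                      # made-up weight trading off loss vs. cost

def best_stump(xs, r):
    """Best single-threshold stump on one feature for residuals r."""
    best = None
    for t in xs:
        left = [ri for xi, ri in zip(xs, r) if xi <= t]
        right = [ri for xi, ri in zip(xs, r) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - ml) ** 2 for ri in left) \
            + sum((ri - mr) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best

used, resid = set(), y[:]
for _ in range(5):             # stagewise rounds, one stump per round
    choice = None
    for j in range(d):
        sse, t, ml, mr = best_stump([x[j] for x in X], resid)
        # a feature's extraction cost is paid only the first time it is used
        obj = sse + lam * (cost[j] if j not in used else 0.0)
        if choice is None or obj < choice[0]:
            choice = (obj, j, t, ml, mr)
    _, j, t, ml, mr = choice
    used.add(j)
    resid = [ri - (ml if x[j] <= t else mr) for ri, x in zip(resid, X)]

print("features used:", sorted(used))
```

With these numbers the greedy procedure keeps refitting the cheap informative features and never pays for the expensive noise feature, which is the qualitative behavior the paper is after.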

Sequential detection of multiple change points in networks: a graphical model approach
A.A. Amini and X. Nguyen
ArXiV 1207.1687
This looks at the change point detection problem in the network setting, where each of the d nodes in a network has a change point \lambda_j. Data is observed on vertices and edges, and the nodes have to pass messages on the graph. There is a prior distribution on the change point times; after observing the data, the nodes can use message passing to do some belief propagation, estimate the posterior distribution, and define a stopping time. This turns out to be asymptotically optimal, in the sense that the expected delay in detecting the change points achieves the optimal asymptotic scaling.

The Price of Privacy in Untrusted Recommendation Engines
S. Bannerjee, N. Hegde, and L. Massoulié
ArXiV 1207.3269
This studies the effect on recommendation engines of individuals masking their data due to privacy concerns. The technical issues are that the dimension is high and each user only has a small amount of data to control (e.g. the movies that they rented or have seen). The question is one of sample complexity — how many users do we need to cluster successfully? If users are not so sparse, they can achieve the Fano lower bound (order-wise) using a sketch and spectral techniques, but the dependence is like N \log N, where N is the number of items (e.g. movies). It’s worse if the users are sparse — the scaling is with N^2, which is also achievable in some settings. The analysis approach using Fano is analogous to the “information leakage” connections with differential privacy.

Surrogate Losses in Passive and Active Learning
S. Hanneke and L. Yang
ArXiV 1207.3772
This paper looks at how to use surrogate losses (which are computationally tractable) in active learning problems. “We take as our starting point that we have already committed to use a given surrogate loss, and we restrict our attention to just those scenarios in which this heuristic actually does work. We are then interested in how best to make use of the surrogate loss toward the goal of producing a classifier with relatively small error rate.” The central insight in their algorithm is that by taking into account the fact that the “real loss” is (say) the 0-1 loss, then for the purposes of active learning, one only really needs to get the sign right and not the magnitude. So the process will minimize the real loss but not necessarily the surrogate.

SPCOM 2012

IISc ECE department sign

After a fun trip to Mysore this weekend, I have landed at the Indian Institute of Science in Bangalore (Bengaluru for those more PC), for SPCOM 2012. This is a 4 day conference (yesterday was tutorials) that has been held 8 times before (sometimes skipping years) here at IISc. I’ve never visited here before, but my father, grandfather, and great grandfather are all graduates, so it feels pretty special to finally see the place. My talk is dead last, so we’ll see how many people are still here by the time it rolls around.

Strategies in the Iterated Prisoner’s Dilemma

Iterated Prisoner’s Dilemma contains strategies that dominate any evolutionary opponent
William H. Press and Freeman J. Dyson
PNAS 109(26), June 26, 2012

This was an interesting little paper that I am still digesting; as I’ve said before, I’m not a game theory guy. They start out with a 2×2 game, the classical Prisoner’s Dilemma with players X and Y — if both cooperate (C), the payoffs are (R,R); if the first (resp. second) player defects (D) they are (T,S) (resp. (S,T)); and if both defect, (P,P). The usual conditions are T > R > P > S to keep it interesting.

In an iterated game the players get to play over and over again and try to develop strategies to win more than their opponent. The first interesting (to me) observation in this paper is that strategies which maintain a long memory are dominated by those which maintain a short memory, in the sense that if X plays a short-memory strategy, then for any long-memory strategy of Y there is a short-memory strategy of Y which achieves the same payoff. So if your opponent is playing a short-memory strategy, you may as well play a short-memory strategy too.

So with that idea in mind they look at first-order Markov strategies — given the past state (CC, CD, DC, DD) player X randomly chooses whether to cooperate or not in the next round. Thus the problem is specified by 4 conditional probabilities of cooperating conditioned on the previous state, and you can compute a 4×4 Markov matrix of transition probabilities out of the randomized strategies of each player.
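The memory-one setup is easy to write down. Here is a minimal sketch with made-up strategy vectors (tit-for-tat versus always-cooperate, not examples from the paper); each strategy is the probability of cooperating given last round’s outcome (CC, CD, DC, DD) from that player’s own point of view:

```python
def transition_matrix(p, q):
    # States (X's move, Y's move): CC, CD, DC, DD. Y sees the same
    # history with roles swapped, hence the index permutation.
    swap = (0, 2, 1, 3)
    M = []
    for s in range(4):
        px, qy = p[s], q[swap[s]]
        # Next-state probabilities: (C,C), (C,D), (D,C), (D,D)
        M.append([px * qy, px * (1 - qy), (1 - px) * qy, (1 - px) * (1 - qy)])
    return M

def stationary(M, iters=5000):
    # Power iteration; assumes the chain mixes, which holds for generic
    # randomized strategies (and for this particular example).
    v = [0.25] * 4
    for _ in range(iters):
        v = [sum(v[i] * M[i][j] for i in range(4)) for j in range(4)]
    return v

R, S, T, P = 3, 0, 5, 1            # usual payoffs with T > R > P > S
payoff_X = (R, S, T, P)            # X's payoff in states CC, CD, DC, DD

tit_for_tat = (1, 0, 1, 0)         # cooperate iff the opponent just did
always_coop = (1, 1, 1, 1)
v = stationary(transition_matrix(tit_for_tat, always_coop))
score_X = sum(vi * g for vi, g in zip(v, payoff_X))
print(score_X)                     # long-run average payoff for X
```

Tit-for-tat against an unconditional cooperator settles into mutual cooperation, so X’s long-run payoff is R = 3; the Press-Dyson analysis is about what structural properties of this 4×4 matrix one player can force regardless of the other’s strategy vector.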

Now the game becomes how the payoffs behave under different choices of strategy, which boils down to spectral / structural properties of this Markov matrix. To be honest, I found the discussion in the rest of the paper more confusing than enlightening because the mathematics was a little obscured by the discussion of implications. I realize this is a PNAS paper but it felt a little discursive at times and the mathematical intuition was not entirely clear to me.

Still, it’s a short read and probably worth perusing.

A comment about maximal versus average error

In my previous post I made some vague comments about maximal and average error which I think are not complicated but don’t often arise when we think about plain vanilla information theory problems. These come up in arbitrarily varying channels, but also help motivate some interesting alternative models.

We most often think of error criteria as being about the desiderata for communication — asking for the maximal error over messages to be small is more stringent than asking for the average error to be small. However, if we have a code with small average error we can expurgate the bad codewords — throwing away the worst half at most doubles the error guarantee while costing a negligible amount of rate — to get a code with small maximal error. With that meta-argument in hand, at a first pass we’re done: average error and maximal error are pretty much the same.
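The expurgation step is just Markov’s inequality; a toy numerical sketch (the per-message error probabilities below are made up):

```python
import random

random.seed(0)
N = 1024
# Made-up per-message error probabilities: mostly tiny, a few terrible.
errs = [random.uniform(0, 0.01) for _ in range(N)]
for i in random.sample(range(N), 10):
    errs[i] = 0.5

avg = sum(errs) / N                  # small average error
# Markov's inequality: fewer than half the codewords can have error
# above 2 * avg, so the best half has maximal error at most 2 * avg.
kept = sorted(errs)[: N // 2]
print(avg, max(errs), max(kept))
```

Dropping half the codewords costs one message bit in total — a vanishing rate loss for long blocklengths — while the maximal error of the surviving code is at most twice the original average error.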

However, choosing an error criterion can also be a choice about the communication model. If the noise can somehow depend on what is being transmitted, then this distinction can come up. If the error criterion is maximal error and the interference is arbitrary, then for each message and each interference signal, I need to guarantee that probability of error is small. Implicitly, this allows the interference to be chosen as a function of the message. Under average error, the interference cannot be a function of the message.

If we allow for common randomness, the transmitter can mask the dependence of the codeword on the message. Suppose we have a deterministic code with N codewords which has small average error. Then we can create a randomized code by choosing a random permutation \pi of the message set \{1, \ldots, N\} and encoding message m with the \pi(m)-th codeword of the original code. In this way the maximal error becomes small, since each message’s error probability, averaged over the common randomness, is exactly the average error of the original code.
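The permutation trick can be checked exactly by enumeration on a toy example (the error probabilities are made up; any deterministic code with a skewed error profile would do):

```python
from itertools import permutations

# Toy deterministic code: one bad codeword, so average error is much
# smaller than maximal error.
err = [0.001, 0.001, 0.001, 0.4]
N = len(err)
avg = sum(err) / N

# Randomized code: shared randomness picks a permutation pi, and message
# m is sent with codeword pi(m). Average each message's error over pi:
per_message = [0.0] * N
perms = list(permutations(range(N)))
for pi in perms:
    for m in range(N):
        per_message[m] += err[pi[m]] / len(perms)

# Every message now sees error exactly avg, so the maximal error
# (averaged over the shared randomness) equals the small average error.
print(per_message)
```

Each codeword appears in each position equally often across the permutations, so every message inherits exactly the average error — which is why the adversary can no longer single out a message without common randomness being revealed.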

From this it’s clear that the choice of error criterion is also a modeling choice, especially when there is no common randomness allowed. However, this is all looking at the case where the interference can depend on the message. The trickier situation is when the interference can depend on the transmitted codeword, which is what I discussed in the previous post. In that setting using common randomness to mask the message-codeword dependence doesn’t buy that much. What I was working on recently was how much it buys you when the dependence is weakened.

As a final note, this model of interference depending on the transmitted codeword appears in a stochastic form under the name “correlated jamming” and has been worked on by several researchers, including Muriel Médard, Shabnam Shafiee, and Sennur Ulukus. My goal was to get an AVC and geometric perspective on these kinds of communication models.

Recommended iPad apps for technical academics

I have an iPad now (hooray research funds) and I’ve been trying to figure out how to make it more useful for me research-wise. Here are some apps I’ve found, but I would love recommendations on others that are useful for academics who do math-y stuff.

OmniFocus is a task management app — it’s a little heavy-featured and expensive, but I’ve been using it to help myself reorganize (and sync) when on planes and the like.

iAnnotate PDF is a PDF annotation tool that syncs to Dropbox (among other things). I now drop the papers I have to review into a folder there and then read and mark them up. Given that my review load has doubled recently, being able to make notes without printing is nice. I think it will get even better once I get a stylus.

In terms of media, IEEE Spectrum is available, as is Communications magazine (but not Signal Processing).

I use Prompt to SSH into machines on the few times that I need to do that from the iPad. It has saved my butt once or twice.

PDF Presenter is supposed to be nice for doing slides, but I haven’t used it yet.

There’s an old post on LaTeX for the iPad, but nothing really looked that appealing.

Anyone else have some recommendations?

ISIT 2012 : the Shannon Award!

I am pleased as punch that Katalin Marton won the 2013 Shannon Award. She made seminal contributions to information theory in its earlier years and has since been active in combinatorics and stochastic processes. Many people start doing information theory because of an interest in digital communications — my own academic background reflects this — so we get some blinders on regarding the role information theory plays in other fields, such as statistics (c.f. Rissanen) and combinatorics (c.f. Ahlswede). It’s great to see Marton’s work recognized, and I hope she talks about some of her more recent interests. I also find it inspirational that a woman has finally won the Shannon Award — it’s a sign, I hope, of the change that is happening in engineering more broadly.