WIFS 2014

This week I took a quick jaunt down to Atlanta to attend part of WIFS 2014 (co-located with GlobalSIP 2014). Kamalika and I were invited to give a talk on differential privacy and machine learning, based on our IEEE Signal Processing Magazine article. I’ve uploaded the slides of the tutorial to my website and we’re planning on making a video (audio over slides) version for SigView as well as on YouTube.

Much like last year, GlobalSIP had a somewhat disjointed, semi-chaotic feel (exacerbated by tiredness, I am sure) — it’s really a collection of semi-interacting workshops in the same space, and I knew people in several of the other workshops. Since I was there for a day and giving a tutorial at WIFS, I decided to stick with WIFS for the day. To give a sense of how confusing it all was, here’s a picture of the guide to deciphering the program book:

Overly-complicated rules for encoding sessions

The keynote for GlobalSIP was given by Vince Poor on information-theoretic privacy via rate distortion (this is the work with Lalitha). Vince did a good job of not over-IT-ing it I think, which was good because the audience was pretty diverse and it’s not clear that many of the people there had even taken a course on information theory. This seems to be the big challenge in multi-disciplinary conferences like GlobalSIP (or large signal processing conferences in general) — everyone is in signal processing, but it’s a big tent and it’s hard to reach everyone.

Min Wu was the keynote speaker for the WIFS workshop on the day I attended. Her talk, on “Exploring Power Network Signatures for Information Forensics,” was about how to glean information from fluctuations in the power grid, known as the electric network frequency (ENF). Different processes or operations have different power demands — by matching these signatures to an observed signal (e.g. a video), one can make inferences about the time/location/integrity of the data. For example, were the audio and visual tracks in a video taken at the same time or merged later? This whole area is quite interesting, and while I was sort of aware of this work I hadn’t really read up on much of it.

Perhaps it was the end of the semester kicking in, but I sort of took terrible notes on most of the talks and poster sessions at the conference, so I can’t really write coherently about the papers I saw. Unfortunately I had to run back to teach the penultimate lecture in my class. I guess now that I have a “real job” this is going to be the way it works from now on. Kind of sad, really.

An exercise in careful misreading

A recent article was passed along to me:

Jane Bambauer, Krishnamurty Muralidhar, and Rathindra Sarathy
Fool’s Gold: An Illustrated Critique of Differential Privacy
Vanderbilt Journal of Entertainment and Technology Law 16(4):701-755.

The article is aimed at the legal community, which has seen in differential privacy a potential technological solution for data privacy issues. The goal of the article is to throw some cold water on some law scholars’ embrace of differential privacy as a solution concept. I’m not a one-method-fixes-all kind of person, but this article is relentlessly negative about differential privacy based solely on a single mechanism: output perturbation. The authors appear to be laboring under the impression that this is really the only way to provide differential privacy, “an assumption that contorts the rest of [their] analysis,” to borrow the charge that they level at one proponent of differential privacy.

In the example with which they open the paper, they claim that “even knowing the distribution of noise that is randomly added to each cell, the internist has no hope of interpreting the response. The true values could be almost anything.” While technically true, this is quite misleading. Indeed, by knowing the distribution, one can derive bounds on the accuracy of the answer — this is, contra the authors’ claims, the “tension between utility and privacy” that differential privacy researchers do “toil” with. They manage to explain the statistics fairly reasonably in the middle of the paper but ignore that in the introduction and conclusion in favor of some acerbic bons mots. Now, perhaps to them, privacy should be an almost-sure guarantee. There is a critique in that: differential privacy can only make probabilistic guarantees, and if your legal standard is stricter than that, then it’s probably not a good way to go. But the misleading rhetoric employed here is meant to stir emotions rather than sway the intellect.

The language in the article is quite florid: “Differential privacy has been rocking the computer science world for over ten years and is fast becoming a crossover hit among privacy scholars and policymakers.” I suppose this sort of prose may be what constitutes scholarly writing in law, but it lacks the measured tones that one might want in more objective criticism. Perhaps they read academic writing in science and engineering in an equally emotional register. They use some strong language to conclude “differential privacy is either not practicable or not novel.” I find such blanket statements both puzzling and vacuous. If you set up a straw-man of what differential privacy is, I suppose you can derive such dichotomies, but is that the best argument one can make?

One thing that comes out of this reading is that most people don’t really appreciate how technology progresses from academic research to practical solutions. Perhaps some legal scholars have overstated the case for differential privacy based on the state of the technology now. But who’s to say how things will look a few years down the line? We’ll have better algorithms, different database structures, and different data sharing mechanisms and platforms. Perhaps differential privacy is not ready for prime time, although Google seems to disagree. The authors’ main point (hidden in the breathless indignation) is that it’s probably not the best solution for every data sharing problem, a statement with which I can completely agree.

In their effort to discredit differential privacy, the authors ignore both the way in which scientific and academic research works and contemporary work that seeks to address the very problems they raise: context-awareness via propose-test-release, methods for setting \epsilon in practical scenarios, and dealing with multiple disclosures via stronger composition rules. They further ignore real technical hurdles in realizing “pure” differential privacy in favor of “illustrations” with the goal of painting proponents of differential privacy as ideologues and hucksters. Of course context and judgement are important in designing query mechanisms and privacy-preserving analysis systems. Furthermore, in many cases microdata have to be released for legal reasons. I think few people believe that differential privacy is a panacea, but it at least provides a real, quantifiable approach to thinking about these privacy problems, one that we can build theories and algorithms around. The key is to figure out how to make those work on real data, and there’s a lot more research to be done on that front.

Research Linkage

I’ve been a bit bogged down upon getting back from traveling, but here are a few interesting technical tidbits that came through.

Aaron Roth and Cynthia Dwork’s Foundations and Trends monograph on differential privacy is now available.

Speaking of differential privacy, Shiva Kasiviswanathan and Adam Smith have a paper in the Journal of Privacy and Confidentiality on Bayesian interpretations of differential privacy risk.

Deborah Mayo has a post up on whether p-values are error probabilities.

Raymond Yeung is offering a Coursera course on information theory (via the IT Society).

A CS Theory take on Fano’s inequality from Suresh over at the GeomBlog.

Yet more not-so-recent hits from ArXiV

Some shorter takes on these papers, some of which I should read in more detail later. I figure I’ll use the blog for some quick notes and to see if any readers have any comments/ideas about these:

Differentially Private Convex Optimization with Piecewise Affine Objectives (Shuo Han, Ufuk Topcu, George J. Pappas) — arXiv:1403.6135 [math.OC]. The idea here is to look at minimizing functions of the form
f(x) = \max_{i = 1,2, \ldots, m} \{ a_i^{\top} x + b_i \}
subject to x belonging to some convex polytope \mathcal{P}. This is a bit different from the kind of convex programs I’ve been looking at (which are more ERM-like). Such programs occur often in resource allocation problems. Here the users’ private information consists of the offsets b_i. They propose a number of methods for generating differentially private approximations to this problem. Analyzing the sensitivity of this optimization is tricky, so they use an upper bound based on the diameter of the feasible set \mathcal{P} to find an appropriate noise variance. The exponential mechanism also yields a feasible mechanism, although the exact dependence of the suboptimality gap on \epsilon is unclear. They also propose a noisy subgradient method where, instead of using SGD, they alter the sampling distribution using the exponential mechanism to choose a gradient step. Some preliminary experiments are also given (although none exploring the dependence on \epsilon, which would also be very interesting)!
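To fix ideas, the underlying non-private problem is just a linear program in epigraph form. Here is a minimal sketch of that baseline (my own illustration, not the paper's mechanism), with hypothetical numpy inputs A and b for the affine pieces and C, d describing \mathcal{P}:

```python
# Minimal non-private sketch: minimize max_i (a_i^T x + b_i) over the polytope
# {x : C x <= d} by introducing an epigraph variable t and solving the LP
#   min t  subject to  a_i^T x - t <= -b_i  and  C x <= d.
import numpy as np
from scipy.optimize import linprog

def minimize_piecewise_affine(A, b, C, d):
    """A: (m, n) slopes a_i; b: (m,) private offsets; C, d: the polytope."""
    m, n = A.shape
    c = np.zeros(n + 1)                                 # variable z = (x, t)
    c[-1] = 1.0                                         # minimize t
    A_epi = np.hstack([A, -np.ones((m, 1))])            # a_i^T x - t <= -b_i
    A_poly = np.hstack([C, np.zeros((C.shape[0], 1))])  # C x <= d (t is free)
    res = linprog(c,
                  A_ub=np.vstack([A_epi, A_poly]),
                  b_ub=np.concatenate([-b, d]),
                  bounds=[(None, None)] * (n + 1))      # no sign constraints
    return res.x[:n], res.x[-1]
```

An output-perturbation version would then add noise calibrated to the diameter-based sensitivity bound mentioned above.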

Assisted Common Information with an Application to Secure Two-Party Sampling (Vinod M. Prabhakaran, Manoj M. Prabhakaran) — arXiv:1206.1282 [cs.IT]. This is the final journal version of a few conference papers by Vinod and Manoj on an interesting variant of the Gács-Körner problem. The motivation is from secure multiparty computation — the problem also touches on some work Vinod and I started, which is sadly languishing due to the utter overwhelmingness of starting a new job. Hopefully I can get back to it this summer.

Analysis of Distributed Stochastic Dual Coordinate Ascent (Tianbao Yang, Shenghuo Zhu, Rong Jin, Yuanqing Lin) — arXiv:1312.1031 [cs.DC]. The title pretty much sums it up. I’m interested in looking a bit more at the analysis method, since I had a similar algorithm bouncing around my head that I would like to analyze. The main idea is to also update the primal variables to achieve a speedup / use a larger step size.

Convergence of Stochastic Proximal Gradient Algorithm (Lorenzo Rosasco, Silvia Villa, Bang Công Vũ) — arXiv:1403.5074 [math.OC]. This is a similar setup to my last post: a convex objective with a smooth and a non-smooth component. They show convergence in expectation and almost surely. The key here is that they show convergence in an infinite-dimensional Hilbert space instead of, say, \mathbb{R}^d.
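For reference, the kind of update being analyzed takes a stochastic gradient step on the smooth component and then applies the proximal operator of the non-smooth one. Here is a rough sketch for an \ell^1-penalized least-squares instance; the objective and step-size schedule are my own illustrative choices, not the paper's:

```python
# Stochastic proximal gradient for 0.5*(x_i^T theta - y_i)^2 + lam*||theta||_1.
# The prox of the l1 norm is coordinatewise soft-thresholding.
import numpy as np

def soft_threshold(v, t):
    # prox operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def stochastic_prox_grad(X, y, lam, steps=10000, gamma0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for k in range(steps):
        i = rng.integers(n)                    # sample one data point
        grad = (X[i] @ theta - y[i]) * X[i]    # gradient of the smooth part
        gamma = gamma0 / np.sqrt(k + 1)        # one common step-size choice
        theta = soft_threshold(theta - gamma * grad, gamma * lam)
    return theta
```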

Differential privacy and the AUC

One of the things I’m always asked when giving a talk on differential privacy is “how should we interpret \epsilon?” There are a lot of ways of answering this, but one way that seems to make more sense to people who actually think about risk, hypothesis testing, and prediction error is through the “area under the curve” metric, or AUC. This post came out of a discussion from a talk I gave recently at Boston University, and I’d like to thank Clem Karl for the more detailed questioning.


More ArXiV skims

Assumptionless consistency of the Lasso
Sourav Chatterjee
The title says it all. Given p-dimensional data points \{ \mathbf{x}_i : i \in [n] \}, the Lasso tries to fit the model \mathbb{E}( y_i | \mathbf{x}_i ) = \boldsymbol{\beta}^{\top} \mathbf{x}_i by minimizing the \ell^1-penalized squared error
\sum_{i=1}^{n} (y_i - \boldsymbol{\beta}^{\top} \mathbf{x}_i)^2 + \lambda \| \boldsymbol{\beta} \|_1.
The paper analyzes the Lasso in the setting where the data are random: there are n i.i.d. copies of a pair of random variables (\mathbf{X},Y), so the data are \{(\mathbf{X}_i, Y_i) : i \in [n] \}. The assumptions on (\mathbf{X},Y) are that (1) each coordinate is bounded, |X_j| \le M, and (2) Y = (\boldsymbol{\beta}^*)^{\top} \mathbf{X} + \varepsilon with \varepsilon \sim \mathcal{N}(0,\sigma^2), where \boldsymbol{\beta}^* and \sigma are unknown constants. Basically that’s all that’s needed — given a bound on \|\boldsymbol{\beta}\|_1, he derives a bound on the mean-squared prediction error.
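As a quick sanity check on the setup, here is a hedged sketch using scikit-learn. Note that sklearn's Lasso minimizes \frac{1}{2n} \|y - X\boldsymbol{\beta}\|^2 + \alpha \|\boldsymbol{\beta}\|_1, so the penalized squared error above corresponds, up to an overall scaling by 1/(2n), to \alpha = \lambda/(2n); the data-generation choices here are mine, not the paper's:

```python
# Illustrative only: bounded features (M = 1), a sparse beta*, Gaussian noise,
# matching the assumptions in the text.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.uniform(-1, 1, size=(n, p))            # |X_j| <= M = 1
beta_star = np.zeros(p)
beta_star[:5] = 1.0                            # sparse ground truth
y = X @ beta_star + rng.normal(0, 0.5, size=n)

lam = 2.0
model = Lasso(alpha=lam / (2 * n)).fit(X, y)   # alpha = lambda/(2n), see above
pred_mse = np.mean((X @ model.coef_ - X @ beta_star) ** 2)
```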

On Learnability, Complexity and Stability
Silvia Villa, Lorenzo Rosasco, Tomaso Poggio
This is a handy survey on the three topics in the title. It’s only 10 pages long, so it’s a nice fast read.

Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression
Francis Bach
A central challenge in stochastic optimization is understanding when the convergence rate of the excess loss, which is usually O(1/\sqrt{n}), can be improved to O(1/n). Most often this involves additional assumptions on the loss functions (which can sometimes get a bit baroque and hard to check). This paper considers constant step-size algorithms whose output is the averaged iterate \bar{\theta}_n = \frac{1}{n} \sum_{k=0}^{n-1} \theta_k. I’m still trying to slot this in with other things I know about stochastic optimization, but it’s definitely worth a skim if you’re interested in the topic.
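To make the scheme concrete, here is a quick sketch of constant step-size SGD with iterate averaging for logistic regression. The step size, single pass, and sampling scheme are my own illustrative choices rather than Bach's exact settings:

```python
# Constant step-size SGD on the logistic loss log(1 + exp(-y_i x_i^T theta)),
# returning the averaged iterate bar(theta)_n rather than the last iterate.
import numpy as np

def averaged_sgd_logistic(X, y, gamma=0.1, seed=0):
    """X: (n, d) features; y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for k in range(n):
        i = rng.integers(n)
        margin = y[i] * (X[i] @ theta)
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))  # logistic loss gradient
        theta = theta - gamma * grad                  # constant step size
        theta_bar += (theta - theta_bar) / (k + 1)    # running average
    return theta_bar
```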

On Differentially Private Filtering for Event Streams
Jerome Le Ny
Jerome Le Ny has been putting differential privacy into signal processing and control contexts for the past year, and this is another paper in that line of work. This is important because we’re still trying to understand how time-series data can be handled in the differential privacy setting. This paper looks at “event streams” which are discrete-valued continuous-time signals (think of count processes), and the problem is to design a differentially private filtering system for such signals.

Gossips and Prejudices: Ergodic Randomized Dynamics in Social Networks
Paolo Frasca, Chiara Ravazzi, Roberto Tempo, Hideaki Ishii
This appears to be a gossip version of Acemoglu et al.’s work on “stubborn” agents in the consensus setting. They show similar qualitative behavior — opinions fluctuate but their average over time converges (the process is ergodic). This version of the paper has more of a tutorial feel to it, so the results are a bit easier to parse.
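A toy simulation conveys the flavor (this is my own sketch, not the paper's exact model): regular agents average pairwise, stubborn agents never budge, and while individual opinions keep fluctuating, the running time-averages settle down:

```python
# Randomized gossip with two stubborn agents pinned at opposite opinions.
import numpy as np

rng = np.random.default_rng(0)
n, steps = 20, 50000
x = rng.uniform(0, 1, size=n)
stubborn = {0: 0.0, 1: 1.0}          # agent index -> fixed opinion
x[0], x[1] = 0.0, 1.0
time_avg = np.zeros(n)
for t in range(steps):
    i, j = rng.choice(n, size=2, replace=False)  # random gossip pair
    avg = 0.5 * (x[i] + x[j])
    if i not in stubborn:
        x[i] = avg
    if j not in stubborn:
        x[j] = avg
    time_avg += (x - time_avg) / (t + 1)         # ergodic running average
```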

Generating vector-valued noise for differential privacy

A distribution that appears frequently in differential privacy is the Laplace distribution. While in the scalar case we have seen that Laplace noise may not be the best, it’s still the easiest example to start with. Suppose we have n scalars x_i \in [0,1] and we want to compute the average \frac{1}{n} \sum_{i} x_i in a differentially private way. One way to do this is to release Y = \frac{1}{n} \sum_{i} x_i + Z, where Z has a Laplace distribution:

p(z) = \frac{\lambda}{2} \exp( - \lambda |z| ).

To see that this is differentially private, note that by changing one value of x_i the average can change by at most \pm \frac{1}{n}. Let \bar{x} and \bar{x}' be the averages of the original data and of the data with one element changed. The output density in these two cases is p( y - \bar{x}) and p(y - \bar{x}') respectively, so for any y,

\frac{ p(y - \bar{x}) }{ p(y - \bar{x}') } = \exp( \lambda ( |y - \bar{x}'| - |y - \bar{x}| ) ) \le \exp( \lambda |\bar{x} - \bar{x}'| ) \le \exp( \lambda/n )

So by choosing \lambda = n \epsilon we get an \epsilon-differentially private approximation to the average.
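In code this is a one-liner, with one wrinkle: numpy parameterizes the Laplace density as \frac{1}{2b} \exp( - |z|/b ), so the scale parameter should be b = 1/\lambda = 1/(n \epsilon).

```python
# epsilon-DP release of the mean of scalars in [0,1] via Laplace noise.
import numpy as np

def private_mean(x, epsilon, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    n = len(x)
    # numpy's scale b corresponds to 1/lambda in the density above
    return np.mean(x) + rng.laplace(scale=1.0 / (n * epsilon))
```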

What if we now have n vectors \mathbf{x}_i \in [0,1]^d? Well, one candidate is to release a differentially private version of the mean by computing \mathbf{Y} = \frac{1}{n} \sum_{i} \mathbf{x}_i + \mathbf{Z}, where \mathbf{Z} has a distribution that looks Laplace-like but in higher dimensions:

p(\mathbf{z}) \propto \exp( - \lambda \| \mathbf{z} \| )

Now we can do the same calculation with means \bar{\mathbf{x}} and \bar{\mathbf{x}}'

\frac{ p(\mathbf{y} - \bar{\mathbf{x}}) }{ p(\mathbf{y} - \bar{\mathbf{x}}') } = \exp( \lambda ( \|\mathbf{y} - \bar{\mathbf{x}}'\| - \|\mathbf{y} - \bar{\mathbf{x}}\| ) ) \le \exp( \lambda \|\bar{\mathbf{x}} - \bar{\mathbf{x}}'\| )

Now the Euclidean norm of the average vector can change by at most \sqrt{d}/n (by replacing \mathbf{x}_i = \mathbf{0} with \mathbf{x}_i' = \mathbf{1}, for example), so the overall bound is \exp(\lambda \sqrt{d}/n), which means that choosing \lambda = n \epsilon / \sqrt{d} gives \epsilon-differential privacy.

Sampling from exponentials is easy, but what about sampling from this distribution? Here’s where people can fall into a trap because they are not careful about transformations of random variables. It’s tempting (if you are rusty on your probability) to say that

p(\mathbf{z}) = C(\lambda) \exp( - \lambda \| \mathbf{z} \| )

and then say “well, the norm looks exponentially distributed and the direction is isotropic, so we can just sample the norm with an exponential distribution and the direction uniformly by taking i.i.d. Gaussians and normalizing them.” But that’s totally wrong, because it implicitly does a change of variables without properly adjusting the density function. The correct thing to do is to change from Euclidean coordinates to spherical coordinates. We have a map T : (z_1, z_2, \ldots, z_d) \to (r, \phi_1, \phi_2, \ldots, \phi_{d-1}) whose Jacobian determinant is

J(r, \phi_1, \phi_2, \ldots, \phi_{d-1}) = r^{d-1} \sin^{d-2}(\phi_1) \sin^{d-3}(\phi_2) \cdots \sin(\phi_{d-2}).

Plugging this in and noting that r = \|\mathbf{z}\| we get

p(r, \phi_1, \phi_2, \ldots, \phi_{d-1}) = C'(\lambda, \phi_1, \ldots, \phi_{d-1}) \, r^{d-1} \exp( - \lambda r ).

So now we can see that the distribution factorizes and indeed the radius and direction are independent. The radius is not exponentially distributed; it’s Erlang with parameters (d,\lambda). We can generate it by taking the sum of d exponential variables with parameter \lambda. The direction we can pick uniformly by sampling d i.i.d. Gaussians and normalizing the resulting vector.
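Putting this together, here is a sketch of the correct sampler (Erlang radius, direction from normalized Gaussians), with \lambda = n\epsilon/\sqrt{d} as computed above; the function names are mine:

```python
# Sample Z with density proportional to exp(-lambda * ||z||): the radius is
# Erlang(d, lambda), i.e. Gamma(shape=d, scale=1/lambda), and the direction is
# uniform on the unit sphere.
import numpy as np

def sample_vector_laplace(d, lam, rng):
    r = rng.gamma(shape=d, scale=1.0 / lam)   # Erlang radius, NOT exponential
    g = rng.standard_normal(d)
    return r * g / np.linalg.norm(g)          # uniform direction, scaled by r

def private_vector_mean(X, epsilon, rng=None):
    """X: (n, d) array with rows in [0,1]^d."""
    rng = rng if rng is not None else np.random.default_rng()
    n, d = X.shape
    lam = n * epsilon / np.sqrt(d)            # calibration from the bound above
    return X.mean(axis=0) + sample_vector_laplace(d, lam, rng)
```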

In general, sampling distributions for differentially private mechanisms can be complicated — for example, in our work on PCA we had to use an MCMC procedure to sample from the distribution in our algorithm, which means we could really only approximate the algorithm in our experiments. There are also places to slip up in sampling from simple-looking distributions, and I’d be willing to bet that in some implementations out there people are not sampling from the correct distribution.

(Thanks to Kamalika Chaudhuri for inspiring this post.)

Some not-so-recent ArXiV skims

I tend to flag papers on ArXiV that I want to take a look at in (soon to be defunct, *sniff*) Google Reader. Here are some papers from the last month that I found interesting. I’ll post a few more of these as I work through my backlog…

Local Privacy and Statistical Minimax Rates (John C. Duchi, Michael I. Jordan, Martin J. Wainwright) — this is a paper proving minimax lower bounds under local differential privacy constraints. The approach is based on the Fano/Le Cam style of getting minimax bounds by constructing a packing of instances of the problem.

Bernstein – von Mises Theorem for growing parameter dimension (Vladimir Spokoiny) — I’m generally interested in the consistency properties of Bayesian procedures, and this looks at the effect of asymptotically growing the problem size to see how fast the problem can grow while still getting the same consistency from the BvM theorem.

On the problem of reversibility of the entropy power inequality (Sergey G. Bobkov, Mokshay M. Madiman) — More results on the EPI. Reversing it is the same as reversing the Brunn-Minkowski inequality (consider uniform distributions), but there is an interesting impossibility result here (Theorem 1.3): “For any constant C, there is a convex probability distribution \mu on the real line with a finite entropy, such that \min \{ H(X+Y), H(X-Y) \} \ge C H(X), where X and Y are independent random variables, distributed according to \mu.” The distribution they use is a truncated Pareto distribution but the calculations seem hairy.

A universal, operational theory of unicast multi-user communication with fidelity criteria (Mukul Agarwal, Sanjoy Mitter, Anant Sahai) — This is the culmination of Mukul’s work starting from a very nice paper I cite all the time from Allerton. There are several results and commentary in here — there’s a fair bit of philosophy, so it’s worth a more patient read than I could give it so far (only so many hours in the day, after all!).

The Convergence Rate of Majority Vote under Exchangeability (Miles E. Lopes) — The title says it all, really. The bounds are actually in terms of the mixture distribution of the exchangeable sequence of Bernoulli votes.
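The exchangeable setting is easy to simulate via de Finetti: draw the vote probability p from a mixing distribution, then cast i.i.d. Bernoulli votes given p. A quick sketch (the Beta mixture here is an arbitrary choice of mine):

```python
# Majority-vote accuracy for exchangeable Bernoulli votes: the accuracy is an
# average over the mixture distribution of p, not a function of a single p.
import numpy as np

rng = np.random.default_rng(0)
trials, n = 100000, 101
p = rng.beta(4, 2, size=trials)       # de Finetti mixing measure (hypothetical)
correct = rng.binomial(n, p)          # number of correct votes per trial
accuracy = np.mean(correct > n / 2)   # P(majority is correct)
```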