Retraction for Symmetric Matrix Perturbation for Differentially-Private Principal Component Analysis, ICASSP 2016

This is somewhat difficult to write, so I’ll cut to the chase. The proposed differentially private PCA method from the ICASSP 2016 paper by my student Hafiz Imtiaz and myself is incorrect and so should be retracted:

H. Imtiaz, A.D. Sarwate, Symmetric Matrix Perturbation for Differentially-Private Principal Component Analysis, Proceedings of the 2016 International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2339–2343, March 2016.

As a side-result, the paper of Jiang et al. from AAAI 2016 has as similar incorrect result. I’d like to thank my collaborator Kamalika Chaudhuri, who brought this up to me after a conversation with Kunal Talwar. The fact that I have taken so long to make this public announcement was due to a misguided effort to adapt the algorithm and run new experiments. Unfortunately, this was not to be.

What is the algorithm? Well, the basic idea in all of these papers is to add Wishart noise to produce a noisy estimate of the second-moment matrix of a data set. We start with some data vectors \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n \in \mathbb{R}^d and compute the second moment matrix A = \sum_{i} \mathbf{x}_i \mathbf{x}_i^{\top}. If we want to compute the SVD in a differentially private way, one approach is to add noise E to the second moment matrix to form \hat{A} = A + E such that \hat{A} is a differentially private version of A. Then by postprocessing invariance, the SVD of \hat{A} is also differentially private.

The motivation for our work was twofold: 1) we wanted to see how theoretically “optimal” algorithms performed on real data sets, especially for suggested settings of the parameters, and 2) we wanted to propose a good (\epsilon,0)-differentially private algorithm? What is wrong with \delta > 0? In some application domains, I have found that the notion of a “failure probability” is not compatible with practitioners’t (often lawyer-driven) policy requirements. Concentrated DP and newer concepts still keep the \delta > 0, so they might not fly either. Plus, it’s a slightly more challenging and interesting mathematical problem.

A very good (\epsilon,\delta) choice for E is to choose the entries to be i.i.d. Gaussian on the diagonal and above and copy entries below the diagonal such that E is symmetric. This is the Analyze Gauss (AG) algorithm of Dwork et al. Although \hat{A} is symmetric, it’s no longer positive semidefinite (PSD). However, this gives order-optimal guarantees on utility. Taking that as a cue, we wanted to preserve the PSD property in \hat{A}, so we proposed adding noise with a Wishart distribution, which corresponds to generating an i.i.d. rectangular d \times p Gaussian matrix $\latex Z$ and settting E = Z Z^{\top}. This generates a PSD perturbation fo \hat{A} and has pretty good utility guarantees as well.

Unfortunately, the Wishart perturbation is blatantly non-private because the image under the perturbation is a PSD cone starting from A, and for A and A' coming from neighboring databases, the cones are not the same. As a trivial example, suppose A = \mathbf{x} \mathbf{x}^{\top} and A' = \mathbf{0}. Hafiz has written up a more formal description. Because differential privacy requires the support of the output distribution under different inputs to coincide, it is difficult to create a (data-independent) additive perturbation that preserves the PSD property. The exponential mechanism works, but is not additive.

In the end, I was rather excited about the empirical performance (c.f. goal 1) above) and we were close to the deadline so I did not do a proper job checking the technical results in our rush to submit. I’m making this a teachable moment for myself but wanted to make sure there was a public, findable version of this retraction so that people don’t continue to build on our results. What we learned from 1) is that designing practical PCA methods for reasonable values of (\epsilon,\delta) is still challenging — many data sets do not have the sample sizes needed to make differential privacy practical. The solution may be to design more complex multi-stage pipeline, revisit what we can assume about the data space and neighborhood relations, or come up with better (\epsilon,0) algorithms that pass muster for the applications when nonzero \delta is unacceptable.

Advertisements

DIMACS Workshop on Distributed Optimization, Information Processing, and Learning

My colleague Waheed Bajwa, Alejandro Ribeiro, and Alekh Agarwal are organizing a Workshop on Distributed Optimization, Information Processing, and Learning from August 21 to August 23, 2017 at Rutgers DIMACS. The purpose of this workshop is to bring together researchers from the fields of machine learning, signal processing, and optimization for cross-pollination of ideas related to the problems of distributed optimization, information processing, and learning. All in all, we are expecting to have 20 to 26 invited talks from leading researchers working in these areas as well as around 20 contributed posters in the workshop.

Registration is open from now until August 14 — hope to see some of you there!

Postdoctoral Associate at DIMACS

DIMACS, the Center for Discrete Mathematics and Theoretical Computer Science, invites applications for various postdoctoral associate positions for 2017-18. Applicants should be recent Ph.D.’s with interest in DIMACS areas, such as computer science, discrete mathematics, statistics, physics, operations research, and their applications. There are four positions available:

  1. a one-year postdoctoral associateship investigating modeling of anomaly detection in multi-layer networks,
  2. a two-year associateship in collaboration with the Institute for Advanced Study (IAS) in Princeton, NJ emphasizing theoretical computer science and discrete mathematics,
  3. a position associated with the Simons Collaboration on Algorithms and Geometry which also emphasizes theoretical computer science and discrete mathematics and could be hosted at Rutgers/DIMACS,
  4. a two-year associateship in theoretical machine learning in the Department of Computer Science at Rutgers.

See the DIMACS website for application information.

Applications have various deadlines, beginning December 1, 2016. See website for details.
DIMACS Center, Rutgers University, 96 Frelinghuysen Road, Piscataway, NJ 08854-8018;
Tel: 848-445-5928; Email: postdoc at dimacs.rutgers.edu. DIMACS is an EO/AA employer.

Problems with the KDDCup99 Data Set

I’ve used the KDDCup99 data set in a few papers for experiments, primarily because it has a large sample size and preprocessing is not too onerous. However, I recently learned (from Rebecca Wright) that for applications to network security, this data set has been discredited as unrepresentative. The paper by John McHugh from ACM TISSEC details the charges. Essentially there was little validation done with regards to checking how representative the data set is.

Why do I bring this up? Firstly, I suppose I should stop using this data set to make claims about anomaly detection (which may be a problem for AISec coming up at the end of the month). However, it’s not clear, from a machine learning perspective, whether the claims one can make about a particular application will generalize within an application domain, given the lack of standardization of data sets even within a particular application. I could do a bunch of experiments on mixtures of Gaussians which might tell me that the convergence rate is what the theory said it should be, but validating on a variety of “non-synthetic” data sets can at least show how performance varies with data sets properties (regardless of the accuracy with respect to the application). So should I stop using the data set entirely?

Secondly, if we want to develop new models and algorithms for machine learning on security applications, we need data sets, and preferably public data sets. This is a real challenge for anyone trying to develop theoretical frameworks that don’t sound too bogus: practice could drive theory, but there is a kind of security through obscurity model in the data gathering/sharing world which makes it hard to understand what the problems are.

Signal boost: DPCOMP.ORG is live

I got the following email from Gerome Miklau:

Dear colleagues:

We are writing to inform you of the launch of DPCOMP.ORG.

DPCOMP.ORG is a public website designed with the following goals in mind: (1) to increase the visibility and transparency of state-of-the-art differentially private algorithms and (2) to present a principled and comprehensive empirical evaluation of these algorithms. The intended audience is both researchers who study privacy algorithms and practitioners who might deploy these algorithms.

Currently DPComp includes algorithms for answering 1- and 2-dimensional range queries. We thoroughly study algorithm accuracy and the factors that influence it and present our findings using interactive visualizations. We follow the evaluation methodology from the paper “Principled Evaluation of Differentially Private Algorithms using DPBench”. In the future we plan to extend it to cover other analysis tasks (e.g., higher dimensional data, private regression).

Our hope is that the research community will contribute to improving DPCOMP.ORG so that practitioners are exposed to emerging research developments. For example: if you have datasets which you believe would distinguish the performance of tested algorithms, new algorithms that could be included, alternative workloads, or even a new error metric, please let us know — we would like to include them.

Please share this email with interested colleagues and students. And we welcome any feedback on the website or findings.

Sincerely,

Michael Hay (Colgate University)
Ashwin Machanavajjhala (Duke University)
Gerome Miklau (UMass Amherst)

Multiple Postdoc Openings at USC

Prof. Urbashi Mitra is looking for multiple postdocs. Given that this is the time of year when the future looks murkiest, these are great opportunities!

I am seeking multiple post-doctoral researchers are sought with expertise in one or more areas: Communication Theory, (Statistical) Signal Processing, Controls, Information Theory, and Machine Learning. In particular, the following expertises are of interest: structured inference (sparse approximation, low rank matrix completion, tensor signal processing, graph signal processing); multi-terminal information theory, or information theory at the boundaries of control or signal processing; distributed control, consensus methods and partially observable Markov Decision Process modeling and algorithms; modern optimization methods; or biological communications, signal processing or information theory.

The successful applicants will be expected to perform innovative translational research, mentor PhD students, give oral presentations, write journal papers, and participate in grant writing and project management. There will be significant opportunities for research leadership and interaction with funding agencies.

Ideally, the successful applicants will start in Summer 2016.

Please have your interested graduate students apply using the following portal:

https://jobs.usc.edu/postings/63539

In addition to a cv and research statement, the applicants are requested to have three letters of reference uploaded to the system as well.

UCSD Data Science Postdocs

A bit of a delayed posting due to pre-spring break crunch time, but my inimitable collaborator and ex-colleague Kamalika Chaudhuri passed along the following announcement.

I write with the exciting news that UCSD has up to four postdoctoral fellowship openings in data science and machine learning.

The fellowships will prepare outstanding researchers for academic careers. The fellows will be affiliated with the CSE or ECE Departments, will enjoy broad freedom to work with any of or faculty, they will be allocated a research budget, and will teach one class per year.

If you know anyone who might be interested, please encourage them to apply!

The program is co-sponsored by UCSD’s CSE and ECE departments, the Interdisciplinary Qualcomm Institute, and the Information Theory and Applications Center.

More information is available at the UCSD Data Science site. Review begins March 21, so get your applications in!