Linkage

Some interesting stuff has passed my way while I’ve been in India (and one or two things from before). Might as well post them before I forget, no?

Slavoj Žižek may be a curmudgeonly Marxist, but the animation helps soften it, I think. I don’t think I fully agree with him, but there’s stuff in there to chew on.

The Purdue anonymization project won a big NSF award.

Tips for tasks related to graduating (h/t Bobak).

Some interesting news about the future of the textbook market. It’s doubly interesting since I am in Pune, a treasure-trove of cheaper editions of technical books.

Apparently I sometimes wear a lab coat.

ISIT 2010: databases

I attended one talk on databases and unfortunately missed the second one, so the writeup for that one is based on the paper rather than the talk…

IDENTIFICATION AND LOSSY RECONSTRUCTION IN NOISY DATABASES
(Ertem Tuncel; University of California, Riverside and Deniz Gunduz; Centre Tecnologic de Telecomunicacions de Catalunya)
This was the talk I did attend. The goal was to information-theoretize the notion of compressing databases. You get M i.i.d. individuals with characteristics distributed according to P_X and features Y conditioned on the individuals according to P_{Y | X}. The identification problem is this: given a noisy feature vector Z generated from a particular individual X(i), retrieve the index of the user corresponding to Z. This paper looked at the case where you also want to recover a distorted version of Y(i), so they are trading off identification and distortion. The solution is to code Wyner-Zivly, and the results recover the existing results on the identification problem alone as well as on the Wyner-Ziv problem.
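To fix notation (this is my paraphrase of the setup from the talk, not necessarily the paper’s exact formulation): the database holds pairs (X(i), Y(i)) for i = 1, …, M with X(i) drawn i.i.d. from P_X and Y(i) from P_{Y | X(i)}; an index i is chosen, the decoder sees a noisy observation, say Z \sim P_{Z | Y(i)}, and it must produce a pair (\hat{i}, \hat{Y}) with

P( \hat{i} \ne i ) \approx 0 \qquad \text{and} \qquad \mathbb{E}\left[ d( Y(i), \hat{Y} ) \right] \le D

so the question is which distortion levels D are compatible with reliable identification, given how the database is compressed.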

A THEORY OF UTILITY AND PRIVACY OF DATA SOURCES
Lalitha Sankar; Princeton University, S. Raj Rajagopalan; HP Labs, H. Vincent Poor; Princeton University
This talk I missed, which is all the more sad because I work on this stuff. But I did get to see a version of it at ITA and also at the IPAM workshop on privacy (as a poster). This is again an attempt to information-theoretize the problem of data disclosure: given a database d of n individuals, release a noisy version of it. In doing so, they assume that each individual has K features and that the n rows are generated i.i.d. according to some distribution P. So, as they put it in the paper, “the primary contribution is demonstrating the equivalence between the database privacy problem and a source coding problem with additional privacy constraints.” The notion of utility is captured by a distortion measure: the encoder maps each database to a string of bits W, and a user with side information Z^n can attempt to reconstruct the database using the decoder of a lossy source code. The notion of privacy is captured by an equivocation constraint:
\frac{1}{n} H( X^n | W, Z^n) \ge E - \epsilon
They then set about finding the best utility-privacy (= distortion-equivocation) tradeoff. This is a different model from differential privacy, and rests on a number of information-theoretic assumptions that may be relaxed in the future. For example, assuming that individuals are i.i.d., that distributions are known, that the side information sequence is known, and so on would give pause to the differential privacy crowd. Much like the epsilon in differential privacy, the equivocation has a meaning, but its practical effect is hard to parse. I guess this is why syntactic definitions of privacy like k-anonymity are popular — it’s easy to see what they mean “on the ground.”
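To spell out the tradeoff in one place (again in my notation, which may differ from the paper’s): a distortion-equivocation pair (D, E) is achievable if for every \epsilon > 0 and all sufficiently large n there is an encoder-decoder pair satisfying

\frac{1}{n} \mathbb{E}\left[ d( X^n, \hat{X}^n ) \right] \le D + \epsilon \qquad \text{and} \qquad \frac{1}{n} H( X^n | W, Z^n ) \ge E - \epsilon

and the utility-privacy tradeoff is the boundary of the set of achievable (D, E) pairs.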

Differentially Private ERM

Right after Memorial Day, I submitted a paper with Kamalika Chaudhuri and Claire Monteleoni to the Journal of Machine Learning Research on differential privacy and empirical risk minimization. This work looks at how to learn a classifier from training data in such a way that an adversary with access to the classifier and full knowledge of all but one of the training data points would still have a difficult time inferring the values of the last data point.
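To give a flavor of what a differentially private ERM procedure can look like, here is a minimal output-perturbation sketch: fit a regularized classifier, then add noise scaled to the L2 sensitivity of the minimizer. This is a toy illustration under stated assumptions (features with norm at most 1, labels in {-1, +1}, the standard 2/(nλ) sensitivity bound for strongly convex regularized ERM, and an exact minimizer), not the algorithm from our paper; the function names are mine.

```python
import numpy as np

def sample_l2_laplace(dim, scale, rng):
    """Noise with density proportional to exp(-||b|| / scale):
    a uniformly random direction with a Gamma(dim, scale) magnitude."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=dim, scale=scale) * direction

def private_erm_output_perturbation(X, y, lam, eps, rng=None):
    """Toy output perturbation for L2-regularized logistic regression.
    Assumes ||x_i|| <= 1 and y_i in {-1, +1}, so the L2 sensitivity of the
    regularized minimizer is at most 2 / (n * lam)."""
    rng = rng or np.random.default_rng()
    n, d = X.shape

    # Non-private regularized ERM via plain gradient descent.  (The formal
    # eps-DP guarantee is stated for the exact minimizer; this loop is an
    # approximation for illustration.)
    w = np.zeros(d)
    for _ in range(2000):
        margins = np.clip(y * (X @ w), -30, 30)
        grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0) + lam * w
        w -= 0.5 * grad

    # Add noise calibrated to the sensitivity to mask any one person's influence.
    sensitivity = 2.0 / (n * lam)
    return w + sample_l2_laplace(d, scale=sensitivity / eps, rng=rng)
```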

Ben Rubinstein has a nice post on the differential privacy model, and Adam Smith has more to say on sample secrecy. Adam and his colleagues (Dan Kifer and Abhradeep Guha Thakurta) gave us useful feedback on an earlier draft, which prompted me to learn some new facts about matrix perturbations, symmetric functions, and eigenvalues. Perhaps I’ll blog about this a bit more in the future, but I seem to be going in fits and starts here, so I don’t want to promise anything.

On a related note, ArXiV has dramatically changed its interface for submissions, and it is soooooo much better than before.

Talk at USC Wednesday

In case you’re at USC or in the area, I’m giving a talk there tomorrow on some of the work I’ve been doing with Kamalika Chaudhuri (whose website seems to have moved) and Claire Monteleoni on privacy-preserving machine learning.

Learning from sensitive data – balancing accuracy and privacy

Wednesday, March 24th, 2010
2:00pm-3:00pm
EEB 248

The advent of electronic databases has made it possible to perform data mining and statistical analyses of populations with applications from public health to infrastructure planning. However, the analysis of individuals’ data, even for aggregate statistics, raises questions of privacy which in turn require formal mathematical analysis. A recent measure called differential privacy provides a rigorous statistical privacy guarantee to every individual in the database. We develop privacy-preserving support vector machines (SVMs) that give an improved tradeoff between misclassification error and the privacy level. Our techniques are an application of a more general method for ensuring privacy in convex optimization problems.

Joint work with Kamalika Chaudhuri (UCSD) and Claire Monteleoni (Columbia)
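For reference, the differential privacy guarantee mentioned in the abstract: a randomized algorithm A is \epsilon-differentially private if, for every pair of databases D and D' differing in one individual’s record and every measurable set S of outputs,

\Pr[ A(D) \in S ] \le e^{\epsilon} \Pr[ A(D') \in S ]

Smaller \epsilon means the output distribution is nearly unchanged when any single person’s record is swapped out, which is the sense in which each individual gets a guarantee.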

Privacy and Google Web History

Posted on ArXiV last night: Private Information Disclosure from Web Searches. (The case of Google Web History), by Claude Castelluccia, Emiliano De Cristofaro, Daniele Perito.

Our report was sent to Google on February 23rd, 2010. Google is investigating the problem and has decided to temporarily suspend search suggestions from Search History. Furthermore, Google Web History page is now offered over HTTPS only. Updated information about this project is available at: this http URL

The link above has some more details of their back and forth with Google on the matter, and at least it looks like Google’s on the losing end of it.

Search histories have a lot of information in them, since searches correlate with local events, such as disease spread (related and interesting: Twitter’s tracking of earthquakes). Since user sessions can be compromised by someone hijacking the cookies that maintain the session, Google requires HTTPS for many services, like GMail, but not for the “automatic suggestion” feature for searches. The authors implemented an attack called The Historiographer:

The Historiographer uses the fact that users signed in any Google service receive personalized suggestions for their search queries based on previously-searched keywords. Since Google Web Search transmits authentication cookies in clear, the Historiographer monitoring the network can capture such a cookie and exploit the search suggestions to reconstruct a user’s search history.

This attack is not limited to a short time-window of browsing history; it recovers essentially the entire search history as stored by Google. They did real experiments, and found:

Results show that almost one third of monitored users were signed in their Google accounts and, among them, a half had Web History enabled, thus being vulnerable to our attack. Finally, we show that data from several other Google services can be collected with a simple session hijacking attack.

So how does it work? The program hijacks the SID cookie from the user by eavesdropping, and then issues prefixes to the suggestion service; that is, it simulates a user typing in the first few letters of a search query. Prefixes have to be at least 2-3 letters to trigger the suggestions, and the top 3 completions are returned. Of course 26^3 is a lot of prefixes to try, so the system has to sample effectively. It queries only the top 10% most frequent 3-letter prefixes (based on the statistics of English), which amounts to 121 queries to the system. If a particular 2-letter prefix (e.g. “pr”) is a prefix for many 3-letter prefixes (e.g. “pre”, “pra”, “pro”) that each return 3 completions, the system proceeds greedily to look at longer prefixes in that direction. Note that this is the same principle behind Dasher (or arithmetic coding, really).
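Here is a minimal sketch of that greedy prefix expansion, just to make the idea concrete. The function query_suggestions is a hypothetical stand-in for the cookie-authenticated request to the suggestion service; none of this is the authors’ actual code.

```python
MAX_SUGGESTIONS = 3  # the suggestion service returns at most 3 completions

def query_suggestions(prefix, sid_cookie):
    """Hypothetical: send `prefix` along with the hijacked SID cookie and
    return the list of personalized completions (at most MAX_SUGGESTIONS)."""
    raise NotImplementedError

def harvest_history(seed_prefixes, sid_cookie, max_depth=6):
    """Greedily expand prefixes that return a full set of suggestions, since
    a saturated response hints at more history hiding underneath."""
    history = set()
    frontier = list(seed_prefixes)  # e.g. the most frequent 3-letter prefixes
    while frontier:
        prefix = frontier.pop()
        suggestions = query_suggestions(prefix, sid_cookie)
        history.update(suggestions)
        if len(suggestions) == MAX_SUGGESTIONS and len(prefix) < max_depth:
            # Saturated: drill down, zooming into the high-probability part of
            # the prefix tree (the Dasher / arithmetic coding analogy above).
            frontier.extend(prefix + c for c in "abcdefghijklmnopqrstuvwxyz")
    return history
```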

Based on this, the system can reconstruct the search history for the hijacked user. By using Google’s personalized results service, they can also get more information about the user’s preferences. A little more worrying is this observation:

In fact, a malicious entity could set up a Tor exit node to hijack cookies and reconstruct search histories. The security design underlying the Tor network guarantees that the malicious Tor exit node, although potentially able to access unencrypted traffic, is not able to learn the origin of such traffic. However, it may take the malicious node just one Google SID cookie to reconstruct a user’s search history, the searched locations, the default location, etc., thus significantly increasing the probability of identifying a user.

It’s an interesting paper, and worth a read if you are interested in these issues.

Netflix Prize II is cancelled

Via John Langford, I learned that the sequel to the Netflix Prize has been cancelled due to privacy concerns. The paper by Narayanan and Shmatikov (also at the Oakland Security and Privacy Conference, 2008) showed that by combining the public information available via IMDB and the Netflix data, certain individuals could be re-identified. Netflix was sued over the privacy problems, and they’ve settled the suits and decided not to release the new dataset (which was to have demographic information).

Demographic information is known to be pretty valuable in re-identification. The most famous example is Latanya Sweeney’s re-identification of the Governor of Massachusetts by linking (free) hospital discharge records and ($20) voter registration records. In the healthcare field, these kinds of disclosures violate HIPAA, but this Netflix case raises an interesting question with regard to privacy promises. When a company assures you that your private information will only be used internally for quality control purposes, what are they actually promising? If they issue summary statistics and give those to third parties, is that privacy preserving?

The answer is no. However, people seem quite loath to worry about these kinds of disclosures unless there is a public (and dangerous) privacy breach. This is why Narayanan and Shmatikov’s paper is important — the way to get the public (and hence policymakers) to take privacy seriously is to demonstrate that existing methods are insufficient.
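To make the linkage idea concrete, here is a toy version of the kind of join Sweeney’s attack relies on. The file and column names are hypothetical, and real attacks have to cope with much messier data.

```python
import pandas as pd

# Hypothetical inputs: an "anonymized" medical release that keeps quasi-identifiers,
# and a public voter roll that pairs the same quasi-identifiers with names.
discharges = pd.read_csv("hospital_discharges.csv")  # zip, birth_date, sex, diagnosis
voters = pd.read_csv("voter_registration.csv")       # zip, birth_date, sex, name, address

# Join on the quasi-identifiers.
linked = discharges.merge(voters, on=["zip", "birth_date", "sex"])

# Anyone whose (zip, birth_date, sex) combination matches exactly one record
# on each side is re-identified: a name is now attached to a diagnosis.
match_size = linked.groupby(["zip", "birth_date", "sex"])["name"].transform("size")
reidentified = linked[match_size == 1]
print(reidentified[["name", "diagnosis"]])
```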

some privacy humor

Ever since I started working on privacy problems (better living through statistics!), I have been struck by the generally fatalistic view most people have about privacy. “The credit card companies know everything about me already,” “Google could easily steal my identity,” and so on. When sentiments like these become so widespread, they are fodder for humorists. (via Celeste LeCompte).

Privacy Workshop at IPAM

I’m at the Institute for Pure and Applied Math at a workshop on Statistical and Learning-Theoretic Challenges in Data Privacy. It’s a rather diverse group of computer scientists, statisticians, medical informatics researchers, and policy researchers, and I feel a bit like I’m the only electrical engineer here. It’s been pretty educational in the “learning about new problems” way, but I think by the time Friday morning rolls around I’ll be suffering from information overload. The nice thing is that the slides are posted online so I can refresh my memory when I get back. There’s also a poster session for some more recent results.

Most of the speakers have been talking about (a) applications of the differential privacy model to particular problems (e.g. data release, function computation, classification, PAC-learning, auctions, and systems like Google’s MapReduce, the Census Bureau’s OnTheMap, and PINQ), (b) areas in which privacy is a real problem (hospital discharge data and the dangers of re-identification, genome-wide association studies (GWAS), query logs from search engines), or (c) bridges between fields and their privacy definitions and models.

I’ve just started working in this area, so I’m still processing things (the talks range from high level to technical, and I often lack the background to understand fully what’s going on). I might blog a bit more about it as things come up.

NPR on the end of privacy

NPR’s All Things Considered is running a 4-part series on privacy issues in the modern digital era. Since I started working on privacy research (specifically related to privacy in machine learning problems and in medical databases), these popular news stories give a good sense of how people in general think about privacy. The first segment was on data companies and their increasing lack of disclosure, and today’s was mostly about Facebook. I’m looking forward to the rest of it — I’ve already had one or two nebulous research ideas, which is always a nice feeling.

Privacy for prescriptions

The NY Times has an article on how the information on our prescriptions is “a commodity bought and sold in a murky marketplace, often without the patients’ knowledge or permission.” I was informed by UC Berkeley in the spring that some of my information may have been compromised, although only “Social Security numbers, health insurance information and non-treatment medical information,” and not “diagnoses, treatments and therapies.” But in that case it was theft, not out-and-out sale. The Times article suggests that the new health care bill will tighten up some of the information leakage, but I am unconvinced.

Of more interest is the second half of the article, on privacy in the data mining of medical information, a topic that is a strong motivator for some of the research I’m working on now. I’m not too comforted by pronouncements from industry people:

“Data stripped of patient identity is an important alternative in health research and managing quality of care,” said Randy Frankel, an IMS vice president. As for the ability to put the names back on anonymous data, he said IMS has “multiple encryptions and various ways of separating information to prevent a patient from being re-identified.”

IMS Health reported operating revenue of $1.05 billion in the first half of 2009, down 10.6 percent from the period a year earlier. Mr. Frankel said he did not expect growing awareness of privacy issues to affect the business.

There’s no incentive to develop real privacy-preservation systems if you make money like that and don’t think that pressure is going to change your model. As for the vague handwaving about “multiple encryptions and… separating information,” color me unconvinced again.

I think it’s time for a new take on privacy laws and technologies.