Posted on arXiv last night: Private Information Disclosure from Web Searches (The Case of Google Web History), by Claude Castelluccia, Emiliano De Cristofaro, and Daniele Perito.
Our report was sent to Google on February 23rd, 2010. Google is investigating the problem and has decided to temporarily suspend search suggestions from Search History. Furthermore, Google Web History page is now offered over HTTPS only. Updated information about this project is available at: this http URL
The link above has some more details of the authors’ back and forth with Google on the matter, and it looks like Google is on the losing end of it.
Search histories contain a lot of information, since searches correlate with local events such as disease spread (related and interesting is Twitter’s tracking of earthquakes). Since user sessions can be compromised by someone hijacking the cookies that maintain the session, Google requires HTTPS for many services, like Gmail, but not for the automatic suggestions offered for searches. The authors implemented an attack called The Historiographer:
The Historiographer uses the fact that users signed in any Google service receive personalized suggestions for their search queries based on previously-searched keywords. Since Google Web Search transmits authentication cookies in clear, the Historiographer monitoring the network can capture such a cookie and exploit the search suggestions to reconstruct a user’s search history.
This attack does not look at a short time window of browsing history, but at essentially the entire search history as stored by Google. The authors ran real experiments and found:
Results show that almost one third of monitored users were signed in their Google accounts and, among them, a half had Web History enabled, thus being vulnerable to our attack. Finally, we show that data from several other Google services can be collected with a simple session hijacking attack.
So how does it work? The program hijacks the SID cookie from the user by eavesdropping, then issues prefixes to the suggestion service; that is, it simulates a user typing the first few letters of a search query. Prefixes have to be at least 2–3 letters long to trigger a suggestion, and the top 3 completions are returned. Of course 26^3 = 17,576 is a lot of prefixes to try, so the system has to sample effectively. It queries only the top 10% most frequent 3-letter prefixes (based on the statistics of English), which amounts to 121 queries to the system. If a particular 2-letter prefix (e.g. “pr”) is a prefix of many 3-letter prefixes (e.g. “pre”, “pra”, “pro”) that each return 3 completions, the system proceeds greedily to look at longer prefixes in that direction. Note that this is the same principle behind Dasher (or arithmetic coding, really).
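The greedy expansion can be sketched roughly as follows. Here `suggest` is just a stand-in for Google’s suggestion endpoint (which the attacker would query with the hijacked SID cookie); for illustration it is faked with a small local search history, and the seed prefixes and history entries are made up:

```python
# Hypothetical sketch of the greedy prefix enumeration described above.
HISTORY = ["privacy", "private browsing", "prefix trees", "earthquake map"]
MAX_SUGGESTIONS = 3  # the suggestion service returns at most 3 completions

def suggest(prefix):
    """Stand-in for the suggestion endpoint: up to MAX_SUGGESTIONS matches."""
    return [q for q in HISTORY if q.startswith(prefix)][:MAX_SUGGESTIONS]

def reconstruct(seed_prefixes):
    """Greedily expand any prefix whose suggestion list came back full,
    since a full list may be hiding further completions."""
    found = set()
    stack = list(seed_prefixes)
    while stack:
        p = stack.pop()
        hits = suggest(p)
        found.update(hits)
        if len(hits) == MAX_SUGGESTIONS:
            # Descend one letter deeper, Dasher-style, only where it pays off.
            stack.extend(p + c for c in "abcdefghijklmnopqrstuvwxyz")
    return found

print(sorted(reconstruct(["pr", "ea"])))
# → ['earthquake map', 'prefix trees', 'privacy', 'private browsing']
```

The point of the greedy step is that the attacker spends extra queries only on prefixes that are provably “dense” in the victim’s history, which is what keeps the total query count small.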
Based on this, the system can reconstruct the search history for the hijacked user. By using Google’s personalized results service, they can also get more information about the user’s preferences. A little more worrying is this observation:
In fact, a malicious entity could set up a Tor exit node to hijack cookies and reconstruct search histories. The security design underlying the Tor network guarantees that the malicious Tor exit node, although potentially able to access unencrypted traffic, is not able to learn the origin of such traffic. However, it may take the malicious node just one Google SID cookie to reconstruct a user’s search history, the searched locations, the default location, etc., thus significantly increasing the probability of identifying a user.
It’s an interesting paper, and worth a read if you are interested in these issues.
7 thoughts on “Privacy and Google Web History”
Remember a few months ago (December 30, to be exact — a week before the Technology Review example) when you IMed me with “EARTHQUAKE!” and then proceeded to wait a few minutes before you got around to telling me that you were okay, if a little shaken (literally)? In those minutes, while I was trying to figure out if you were in danger or what, I totally used Twitter to determine that you were probably okay.
That was the first (and so far only) time I saw a point to Twitter.
Nice! Thanks for posting this.
I find it shocking that Google could care about this kind of security issue and yet not have thought of it themselves – there really is not much to it. I think I was instinctively scared when I saw my search history first start popping up in Google and immediately clicked the button that deactivates that feature.
Seriously, though, isn’t it time for some kind of a secure cookie protocol? My garage door opener has a rotating key system – why is that not in my web browser? Is there something out there that I am not aware of?
Well, if you look at their responses to the authors, it’s kind of clear that they either don’t care or don’t understand the problem.
When I read the abstract I thought the attack would be rather sophisticated, but it’s (at least from the theory perspective) ridiculously simple.
I get the impression that most search companies (Google, Yahoo) are only just beginning to become aware of privacy issues. I’m just as baffled as you are.
Another issue with search suggestion is that when you type in a person’s name, it is suggested with a suffix which may not be
I think that someone might be able to launch a similar attack on Facebook unless their sessions are secure. When I type in a name I get all the autocompletions, which would allow a snooper to read my friends list even if I have not made it public.
The authors make the point that Google tries to keep that information only available on the HTTPS part of their site. Facebook does not restrict your friend list to an HTTPS page, so I think anyone sniffing packets on your network can see it.
On the Pololu.com web site, we use an insecure session cookie for normal operations but require a secure (HTTPS-only) cookie for some parts of the site.
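That two-cookie pattern can be sketched with Python’s standard-library `http.cookies`; the cookie names here are illustrative, not the site’s actual ones:

```python
from http.cookies import SimpleCookie

cookies = SimpleCookie()
cookies["session"] = "abc123"          # plain cookie: sent over HTTP and HTTPS
cookies["secure_session"] = "def456"   # gate for the sensitive parts of the site
cookies["secure_session"]["secure"] = True    # browser sends this over HTTPS only
cookies["secure_session"]["httponly"] = True  # and hides it from page scripts

# Because of the Secure attribute, a sniffer on an HTTP connection can
# capture the plain session cookie but never sees the sensitive one.
print(cookies["secure_session"].OutputString())
```

The Secure and HttpOnly attributes are standard cookie mechanics, so this limits exactly the kind of passive SID-cookie capture the paper exploits, at least for the pages gated behind the HTTPS-only cookie.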