# ResearchGate: spam scam, or…?

I’ve been getting fairly regular automated emails lately from ResearchGate, which has pull-quotes from Forbes and NPR saying it’s changing the way we do research blah blah blah. However, all empirical reports I have heard indicate that once you join, it repeatedly spams all of your co-authors with requests to join, which makes it feel a bit more like Heaven’s Gate.

On a less grim note, the site’s promise to make your research “more visible” sounds a bit like SEO spam. Given the existence of Google Scholar, which is run by the SE that one would like to O, it seems slightly implausible.

Any readers want to weigh in on whether ResearchGate has been useful to them? Or is this mostly for people who don’t know how to make their own homepage with their papers on it (which is probably most faculty).

# B-log on IT

Via Tara Javidi I heard about a new blog on information theory: the Information Theory b-log, which has been going for a few months now but I guess in more “stealth mode.” It’s mostly posts by Sergio Verdú, with some initial posting by Thomas Courtade, but the most recent post is by Tara on how to compare random variables from a decision point of view. However, as Max noted:

All researchers work­ing on infor­ma­tion the­ory are invited to par­tic­i­pate by post­ing items to the blog. Both orig­i­nal mate­r­ial and point­ers to the web are welcome.

Most of these are stolen from MetaFilter.

Welcome back to public blogging, Dan.

Musical instrument samples. My first UROP at MIT was at the Media Lab, where I helped record instrumentalists as part of a musical instrument identification system. Paris Smaragdis was there at the time, and now he is at UIUC where he has a lot of cool audio demos. There are also some great clips Inside the Music Library at the BBC.

Ridiculous computer interfaces from movies.

I’m blogging from Chicago, where it is a balmy 42 degrees but sunny. Whither spring, I ask! Actually, I’m not blogging so much as linking to a bunch of stuff.

For San Diegans, the SD Asian Film Festival Spring Showcase is going on. It looks like I’ll miss a lot of it but I might try to catch something at the end of the week.

Less Pretentious & More Accurate Titles For Literary Masterworks — funny but possibly NSFW.

This home-scanning program seems creepy, regardless of the constitutionality issues.

I really like scallion pancakes. I’ll have to try this out when I get back to San Diego.

I agree that this video is awesome. Yo-Yo Ma and Lil Buck. I think that dude is made of rubber. And steel.

Tom Waits was induced into the Rock and Roll Hall of Fame. I just hope I get to see him live some day.

Some things to skim or read from ArXiV when I get the chance:
Sequential Analysis in High Dimensional Multiple Testing and Sparse Recovery (Matt Malloy, Robert Nowak)
Differential Privacy: on the trade-off between Utility and Information Leakage (Mário S. Alvim, Miguel E. Andrés, Konstantinos Chatzikokolakis, Pierpaolo Degano, Catuscia Palamidessi)
Capacity of Byzantine Consensus with Capacity-Limited Point-to-Point Links (Guanfeng Liang, Nitin Vaidya)
Settling the feasibility of interference alignment for the MIMO interference channel: the symmetric square case (Guy Bresler, Dustin Cartwright, David Tse)
Decentralized Online Learning Algorithms for Opportunistic Spectrum Access (Yi Gai, Bhaskar Krishnamachari)
Online and Batch Learning Algorithms for Data with Missing Features (Afshin Rostamizadeh, Alekh Agarwal, Peter Bartlett)
Nonuniform Coverage Control on the Line (Naomi Ehrich Leonard, Alex Olshevsky)
Degree Fluctuations and the Convergence Time of Consensus Algorithms (Alex Olshevsky, John Tsitsiklis)

# Privacy and entropy (needs improvement)

A while ago, Alex Dimakis sent me an EFF article on information theory and privacy, which starts out with an observation of Latanya Sweeney’s that gender, ZIP code, birthdate are uniquely identifying for a large portion of the population (an updated observation was made in 2006).

What’s weird is that the article veers into “how many bits of do you need to uniquely identify someone” based on self-information or surprisal calculations. It paints a bit of a misleading picture about how to answer the question. I’d probably start with taking $\log_2(6.625 \times 10^9)$ and then look at the variables in question.

However, the mere existence of this article raises a point : here is a situation where ideas from information theory and probability/statistics can be made relevant to a larger population. It’s a great opportunity to popularize our field (and demonstrate good ways of thinking about it). Why not do it ourselves?

# A new scam?

I received a scam email recently asking me if I wanted to “take up a job” for a few months while someone is goes on leave. The salary is not huge but pretty decent ($3.5k/month + some sort of commission) and I’m instructed to contact someone in Taiwan to make the arrangements. I assume the next step is that I will give them my bank account etc (for direct deposit), which then will get cleared out. This seems like a particularly pernicious type of internet fraud because it doesn’t involve egregious amounts of money (vs. Nigerian bank scams) and there is so much unemployment in the US right now that people would be happy to take that job for 3 months. Is this a new kind of scam? It seems different from the “work from home and make$70k” things I’ve seen before.

# Privacy and Google Web History

Posted on ArXiV last night: Private Information Disclosure from Web Searches. (The case of Google Web History), by Claude Castelluccia, Emiliano De Cristofaro, Daniele Perito.

Our report was sent to Google on February 23rd, 2010. Google is investigating the problem and has decided to temporarily suspend search suggestions from Search History. Furthermore, Google Web History page is now offered over HTTPS only. Updated information about this project is available at: this http URL

The link above has some more details of their back and forth with Google on the matter, and at least it looks like Google’s on the losing end of it.

Search histories have a lot of information in them, since searches correlated with local events, such as disease spread (related and interesting is Twitter’s tracking of earthquakes). Since user sessions can be compromised by someone hijacking the cookies that maintain the session, Google requires HTTPS for many services, like GMail, but not for the “automatic suggestion” for searches. The authors implemented an attack called The Historiographer:

The Historiographer uses the fact that users signed in any Google service receive personalized suggestions for their search queries based on previously-searched keywords. Since Google Web Search transmits authentication cookies in clear, the Historiographer monitoring the network can capture such a cookie and exploit the search suggestions to reconstruct a user’s search history.

This attack is not looking at a short time-window of browsing history, but essentially the entire search history as stored by Google. They did real experiments, and found:

Results show that almost one third of monitored users were signed in their Google accounts and, among them, a half had Web History enabled, thus being vulnerable to our attack. Finally, we show that data from several other Google services can be collected with a simple session hijacking attack.

So how does it work? The program hijacks the SID cookie from the user by eavesdropping, and then issues prefixes to the suggestion services; that is, it simulates a user typing in the first few letters of a search query. Prefixes have to be at least 2-3 letters to trigger the suggestion, and the top 3 completions are given. Of course 26^3 is a lot of prefixes to try, so the system has to sample effectively. The system just queries the top 10% of most frequent 3-letter prefixes (based on the statistics of English), which amounts to 121 queries to the system. If a particular 2-letter prefix (e.g. “pr”) is a prefix for many 3-letter prefixes (e.g. “pre”, “pra”, “pro”) which result in 3 completions, they will proceed greedily to look at longer prefixes in that direction. Note that this is the same principle behind Dasher (or arithmetic coding, really).

Based on this, the system can reconstruct the search history for the hijacked user. By using Google’s personalized results service, they can also get more information about the user’s preferences. A little more worrying is this observation:

In fact, a malicious entity could set up a Tor exit node to hijack cookies and reconstruct search histories. The security design underlying the Tor network guarantees that the malicious Tor exit node, although potentially able to access unencrypted traffic, is not able to learn the origin of such traffic. However, it may take the malicious node just one Google SID cookie to reconstruct a user’s search history, the searched locations, the default location, etc., thus signi cantly increasing the probability of identifying a user.

It’s an interesting paper, and worth a read if you are interested in these issues.