# Is there an incentive for misrepresentation?

I was recently reading a paper on arXiv from the VLDB 2012 conference:

Functional Mechanism: Regression Analysis under Differential Privacy
Jun Zhang, Zhenjie Zhang, Xiaokui Xiao, Yin Yang, Marianne Winslett

The idea of the paper is to make a differentially private approximation to an optimization by perturbing a Taylor series expansion of the objective function, which is an interesting idea. However, what caught my eye was how they referred to an earlier paper of mine (with Kamalika Chaudhuri and Claire Monteleoni) on differentially private empirical risk minimization. In that paper we looked at the problem of training classifiers via ERM, and the particular examples we used for experiments were logistic regression and SVM.
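To make the perturbed-expansion idea concrete, here is a minimal sketch of the mechanism for ordinary least-squares regression, where the objective is exactly a quadratic polynomial in the parameters: write the objective's polynomial coefficients down, add Laplace noise to them, and minimize the noisy polynomial. The sensitivity constant and the data-scaling assumption below are illustrative, not the paper's exact values.

```python
import numpy as np

def perturbed_objective_linreg(X, y, epsilon, rng):
    """Sketch of a functional-mechanism-style approach for least squares.
    The objective f(w) = ||y - Xw||^2 is quadratic in w, so it is fully
    described by the coefficients A = X^T X and b = X^T y. We perturb
    those coefficients with Laplace noise and minimize the noisy quadratic.
    Assumes rows of X and entries of y are scaled to [-1, 1]; the
    sensitivity bound here is illustrative only."""
    n, d = X.shape
    A = X.T @ X            # coefficient of the degree-2 monomials
    b = X.T @ y            # coefficient of the degree-1 monomials
    sensitivity = 2 * (d ** 2 + d)   # illustrative coefficient sensitivity
    scale = sensitivity / epsilon
    A_noisy = A + rng.laplace(0.0, scale, size=A.shape)
    b_noisy = b + rng.laplace(0.0, scale, size=b.shape)
    # Symmetrize and regularize so the noisy quadratic stays solvable.
    A_noisy = (A_noisy + A_noisy.T) / 2 + 1e-3 * np.eye(d)
    # Minimizer of the noisy quadratic solves A_noisy w = b_noisy.
    return np.linalg.solve(A_noisy, b_noisy)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
w_true = np.array([0.5, -0.3, 0.2])
y = np.clip(X @ w_true + 0.05 * rng.normal(size=200), -1, 1)
w_hat = perturbed_objective_linreg(X, y, epsilon=10.0, rng=rng)
```

The point is that the noise goes into the objective's coefficients rather than into the output of the optimizer, which is what distinguishes this style of mechanism from output perturbation.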

In the VLDB paper, the authors write:

The algorithm, however, is inapplicable for standard logistic regression, as the cost function of logistic regression does not satisfy convexity requirement. Instead, Chaudhuri et al. demonstrate that their algorithm can address a non-standard type of logistic regression with a modified input (see Section 3 for details). Nevertheless, it is unclear whether the modified logistic regression is useful in practice.

This is just incorrect. What we look at is a fairly standard formulation of logistic regression with labels in {-1,+1}, using the standard machine learning approach of regularized empirical risk minimization. The objective function is, in fact, convex. We also ran experiments with that algorithm on standard datasets. Perhaps the empirical performance was not as great as they might like, but then they should make a specific claim instead of saying it’s “unclear.”
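For readers outside machine learning, here is what that standard formulation looks like: the average logistic loss over examples with ±1 labels plus an L2 penalty, which is convex in the weights. This is a generic sketch of the textbook objective, not any paper's exact code; the learning rate and regularization constant are arbitrary choices for illustration.

```python
import numpy as np

def reg_logistic_loss(w, X, y, lam):
    """Regularized logistic-regression ERM objective with labels
    y in {-1, +1}: mean log-loss plus an L2 penalty. Convex in w."""
    margins = y * (X @ w)
    loss = np.mean(np.logaddexp(0.0, -margins))  # stable log(1 + e^{-m})
    return loss + 0.5 * lam * np.dot(w, w)

def fit(X, y, lam=0.1, lr=0.5, steps=500):
    """Plain gradient descent on the convex objective above."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w)
        sigma = 1.0 / (1.0 + np.exp(margins))     # -d(loss)/d(margin)
        grad = -(X.T @ (y * sigma)) / n + lam * w
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([1.0, -1.0]))
y[y == 0] = 1.0                                   # binary labels in {-1, +1}
w = fit(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
```

Note that the labels here are plain boolean ±1 indicators, which is exactly the setting the quoted passage claims the method cannot handle.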

They further claim:

In particular, they assume that for each tuple t_i, its value on Y is not a boolean value that indicates whether t_i satisfies certain condition; instead, they assume y_i equals the probability that a condition is satisfied given x_i… Furthermore, Chaudhuri et al.’s method cannot be applied on datasets where Y is a boolean attribute…

Firstly, we never make this “assumption.” Secondly, we do experiments using that algorithm on standard datasets where the label is binary. Reading this description was like being in a weird dream-world in which statements are made up and attributed to you.

Naturally, I was a bit confused by this rather blatant misrepresentation of our paper, so I emailed the authors, who essentially said that they had been confused by the description in our paper and that more technical definitions were needed because we are from “different communities.” They claimed to have emailed us questions about it, but we could not find any such emails. Sure, papers outside your area can sometimes be confusing, but going from “I don’t understand X” to “let me make things up about X” requires a level of gumption that I don’t think I could really muster.

In a sense, the publication incentives are stacked in favor of this kind of misrepresentation. VLDB is a very selective conference, so in order to make your contribution seem like a big deal, you have to make it seem that alternative approaches to the problem are severely lacking. However, rather than making a case against the empirical performance of our method, this paper just invented “facts” about our paper. The sad thing is that it seems completely unnecessary, since their method is quite different.

# NIPS 2012: day one

I am attending NIPS this year for the first time, so I figured it would be good to blog about some of it here. I totally dropped the ball on Allerton, so maybe I’ll make up for it by writing more about the actual talks here. Fortunately (or unfortunately), most of the conference is about things I have almost no experience with, so I am facing a bit of an explore/exploit tradeoff in my selection process.

Every day of the conference has a poster session from 7 to midnight — there are 90+ posters in a single room, and people drift in and out, hanging out with friends and looking at posters. My poster (a paper with Kamalika Chaudhuri and Kaushik Sinha on differentially private approximations to PCA) was last night, so I was on the presenting end of things. I gave up at 10:30 because I was getting hoarse and tired, but even then there were a fair number of people milling about. Since I was (mostly) at my poster, I missed out on the other work.

During the day the conference is a single-track affair with invited and highlighted talks. There are two kinds of highlighted talks — some papers are marked for oral presentation, and some are marked as “spotlights,” which means the authors get to make a 5-minute elevator pitch for their poster in front of the whole conference. Those start today, and I’m looking forward to it.

In the meantime, here is a picture from the hike I took yesterday with Erin:

Mountain Range on a hike near Lake Lily.

An animation of integer factorizations. Goes well with music. (h/t BK).

Graphics from the Chicago L (via Chicagoist)

Tony Kushner is kind of a tool. I find this unfortunate. But I still want to see Lincoln.

Aaron Roth reports that the DIMACS tutorial videos have been posted. A perfect time to brush up on your differential privacy!

An analysis of the Thai government’s menu served to President Obama.

A Choose Your Own Adventure version of Hamlet, from the creator of Dinosaur Comics.