Differential privacy and the AUC

One of the things I’m always asked when giving a talk on differential privacy is “how should we interpret \epsilon?” There a lot of ways of answering this but one way that seems to make more sense to people who actually think about risk, hypothesis testing, and prediction error is through the “area under the curve” metric, or AUC. This post came out of a discussion from a talk I gave recently at Boston University, and I’d like to thank Clem Karl for the more detailed questioning.

For the purposes of this post, a randomized algorithm A is \epsilon-differentially private if

\mathbb{P}( A(\mathcal{D}) \in \mathcal{S} ) = e^{\epsilon} \mathbb{P}( A(\mathcal{D}')  \in \mathcal{S})

for all pairs of data sets \mathcal{D} = (x_1, x_2, \ldots, x_{n-1}, x_n) and \mathcal{D}' = (x_1, x_2, \ldots, x_{n-1},x'_n) containing n individuals and differing in a single individual, and all measurable sets \mathcal{S}.

For cases where the output of A has a density, we can interpret this as saying the log-likelihood ratio for the output of the distribution is bounded by \epsilon:

\frac{ f( s | \mathcal{D}) }{ f(s | \mathcal{D}' ) } \le \epsilon.

For those familiar with hypothesis testing, this guarantee is saying something about the hypothesis test between \mathcal{D} and \mathcal{D}' being “hard.” One interpretation of differential privacy is that an adversary observing the output of the algorithm will have a difficult time inferring if the data of individual n is x_n or x'_n, even if they know all n-1 other data points.

Wasserman and Zhou showed that the parameter \epsilon controls power of this hypothesis test. Oh and Viswanath write this more explicitly as a pair of inequalities governing the tradeoff between the false-alarm and missed-detection probabilities:

P_{\mathrm{FA}} + e^{\epsilon} P_{\mathrm{MD}} \ge 1
e^{\epsilon} P_{\mathrm{FA}} + P_{\mathrm{MD}} \ge 1.

If we plot these against each other we get a picture like this:

Probability of missed detection vs. probability of false alarm

Probability of missed detection vs. probability of false alarm

The receiver operating characteristic (ROC) is defined as the true positive rate as a function of the false positive rate. That is, it’s 1 - P_{\mathrm{MD}} on the y-axis and P_{\mathrm{FA}} on the x-axis. So this is the same plot as above, only flipped along the y-axis. To calculate the AUC we just integrate. The point where equality holds in both of the above inequalities is where P_{\mathrm{FA}} = P_{\mathrm{MD}}, or

P_{\mathrm{FA}} = \frac{1}{1 + e^{\epsilon}}.

That’s the “corner point” in the previous figure. So the AUC is just

(1 - \frac{1}{1 + e^{\epsilon}})^2 + 2 \int_{0}^{\frac{1}{1 + e^{\epsilon}}} e^{\epsilon} P_{\mathrm{FA}} d P_{\mathrm{FA}} = \frac{ e^{\epsilon} + e^{2 \epsilon}}{ (1 + e^{\epsilon})^2}.

We can plot the AUC as a function of \epsilon pretty easily:

ROC curve for epsilon-differential privacy

AUC versus epsilon for differential privacy

So we can now see how the privacy parameter affects the AUC for this hypothesis test. Depending on how comfortable you are with the risk (and also the threat model), you can assess for yourself what kind of \epsilon you would prefer. Oh and Viswanath also calculate the tradeoff for (\epsilon,\delta)-differential privacy, but maybe I’ll leave that for another post. In the end, I don’t find this AUC plot so illuminating, but then again, I don’t have a visceral “feel” for how the AUC corresponds to the quality of the prediction.