# ISIT 2010 : databases

I attended one talk on databases and unfortunately missed the second one, so the writeup I have is based on the paper…

IDENTIFICATION AND LOSSY RECONSTRUCTION IN NOISY DATABASES
(Ertem Tuncel; University of California, Riverside and Deniz Gunduz; Centre Tecnologic de Telecomunicacions de Catalunya)
This was the talk I did attend. The goal was to information-theoritize the notion of compressing databases. You get $M$ i.i.d. individuals with characteristics distributed according to $P_X$ and features $Y$ conditioned on the individuals according to $P_{Y | X}$. The identification problem is this: given a noisy feature vector $Z$ generated from a particular individual $X(i)$ retrieve the index of the user corresponding to $Z$. This paper looked at the case where you want to recover a distorted version of $Y(i)$, so they are trading off identification and distortion. The solution is to code Wyner-Zivly, and the result recover the existing results on the identification problem alone as well as the Wyner-Ziv problem.

A THEORY OF UTILITY AND PRIVACY OF DATA SOURCES
Lalitha Sankar; Princeton University, S. Raj Rajagopalan; HP Labs, H. Vincent Poor; Princeton University
This talk I missed, which is all the more sad because I work on this stuff. But I did get to see a version of it at ITA and also at the IPAM workshop on privacy (as a poster). This is again an attempt to information-theoretize the problem of data disclosure: given a database $d$ of $n$ individuals, release a noisy version of this database. In doing so, they assume that each individual has $K$ features and the $n$ rows are generated iid according to some distribution $P$. So, as they put in the paper, “the primary contribution is demonstrating the equivalence between the database privacy problem and a source coding problem with additional privacy constraints.” The notion of utility is captured by a distortion measure: the encoder maps each database to a string of bits $W$. A user with side information $Z^n$ can attempt to reconstruct the database using the a lossy source code decoder. The notion of privacy by an equivocation constraint:
$\frac{1}{n} H( X^n | W, Z^n) \ge E - \epsilon$
They then set about finding the best utility-privacy (= distortion-equivocation) tradeoff. This is a different model than differential privacy, and rests on a number of information-theoretic assumptions that may be relaxed in the future. For example, assuming individuals are i.i.d., that distributions are known, that the side information sequence is known, and so on would give pause to the differential privacy crowd. Much like the epsilon in differential privacy, the equivocation has a meaning but its syntactic effect is hard to parse. I guess this is why syntactic definitions of privacy like k-anonymity are popular — it’s easy to see what they mean “on the ground.”

Advertisements