Problems with the KDDCup99 Data Set

I’ve used the KDDCup99 data set in a few papers for experiments, primarily because it has a large sample size and preprocessing is not too onerous. However, I recently learned (from Rebecca Wright) that for applications to network security, this data set has been discredited as unrepresentative. The paper by John McHugh in ACM TISSEC details the charges. Essentially, little validation was done to check how representative the data set actually is.

Why do I bring this up? Firstly, I suppose I should stop using this data set to make claims about anomaly detection (which may be a problem for AISec coming up at the end of the month). However, from a machine learning perspective, it’s not clear whether claims made on one data set will generalize within an application domain, given the lack of standardized data sets even within a particular application. I could do a bunch of experiments on mixtures of Gaussians which might tell me that the convergence rate is what the theory says it should be, but validating on a variety of “non-synthetic” data sets can at least show how performance varies with data set properties (regardless of accuracy with respect to the application). So should I stop using the data set entirely?

Secondly, if we want to develop new models and algorithms for machine learning on security applications, we need data sets, and preferably public data sets. This is a real challenge for anyone trying to develop theoretical frameworks that don’t sound too bogus: practice could drive theory, but there is a kind of security through obscurity model in the data gathering/sharing world which makes it hard to understand what the problems are.

IHP “Nexus” Workshop on Privacy and Security: Days 4-5

Wrapping this up, finally. Maybe conference blogging has to go by the wayside for me… my notes got a bit sketchier so I’ll just do a short rundown of the topics.

Days 4-5 were a series of “short” talks by Moni Naor, Kobbi Nissim, Lalitha Sankar, Sewoong Oh, Delaram Kahrobaei, Joerg Kliewer, Jon Ullman, and Sasho Nikolov on a rather eclectic mix of topics.

Moni’s talk was on secret sharing in an online setting — parties arrive one by one, and the qualified sets (those who can decode the secret) are revealed as parties arrive. The shares have to be generated online as well. Since the access structure is evolving, what kinds of systems can we support? As I understood it, the idea is to use something similar to a threshold scheme together with a “doubling trick”-like argument that divides the users/parties into generations. It’s a bit out of area for me, so I had a hard time keeping up with the connections to other problems. Kobbi talked about reconstruction attacks based on observing traffic from outsourced database systems. A user queries the database, and the server shouldn’t be able to reconstruct the records; however, it knows how many records were returned by each query and whether the same record was sent on subsequent queries — a sort of access pattern leakage. He presented attacks based on this information, and also attacks based only on volume (e.g., the total size of the response to each query).

Lalitha talked about mutual information privacy, which was quite different from the differential privacy models on the CS side, but more in line with Ye Wang’s talk earlier in the week. Although she didn’t get to spend as much time on it, the work on interactive communication and privacy might have been interesting to the folks earlier in the workshop studying communication complexity. In general, the connections between communication complexity problems and MPC, for example, remain elusive to me (probably from lack of trying).

Sewoong talked about optimal mechanisms for differentially private composition — I had to miss his talk, unfortunately. Delaram talked about cryptosystems based on group theory, and I had to dredge up everything I learned in 18.701/702 and the graduate algebra class I (mistakenly) took my first year of graduate school. I am not sure I could do it justice, but I took a lot of notes. Joerg talked about using polar codes to enable private function computation — initially privacy was measured by equivocation, but towards the end he made a connection to differential privacy. Since most folks (myself included) are not experts on polar codes, he gave a rather nice tutorial (I thought) on polar coding. It being the last day of the workshop, the audience had unfortunately thinned out a bit.

Jon spoke about estimating marginal distributions for high-dimensional problems. There were some nice connections to composite hypothesis testing problems that came out of the discussion during the talk — the model seems a bit complex to get into based on my notes, but I think readers who are experts on hypothesis testing might want to check out his work. Sasho rounded off the workshop with a talk about the sensitivity polytope of linear queries on a statistical database and connections to Gaussian widths. The main result was on the sample complexity of answering the queries in time polynomial in the number of individuals, size of the universe, and size of the query set.

IHP “Nexus” Workshop on Privacy and Security: Day 2

Verrrrrry belated blogging on the rest of the workshop, more than a month later. Day 2 had 5 talks instead of the tutorial plus talks, and the topics were a bit more varied (this was partly because of scheduling issues that prevented us from being strictly thematic).

Amos Beimel started out with a talk on secret sharing, which had a very nice tutorial/introduction to the problem, including the connection between Reed-Solomon codes and Shamir’s t-out-of-n scheme. For professional (and perhaps personal) reasons I found myself wondering how deep the connection between secret sharing and coding theory runs — after all, this was a workshop about communication between information theory and theoretical CS. Not being a coding theory expert myself, I could only speculate. What I didn’t know about were the more general secret sharing structures and the Ito-Saito-Nishizeki scheme (published in Globecom!). Amos also talked about monotone span programs, which were new to me, and how to prove lower bounds. He concluded with more recent work on the related distribution design problem: how can we construct a distribution on n variables given constraints that specify subsets which should have identical marginals and subsets which should have disjoint support? The results appeared in ITCS.
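Since the Reed-Solomon connection is so clean, here is a minimal sketch of Shamir's t-out-of-n scheme over a prime field (my own toy illustration, not from the talk; the prime and the secret encoding are arbitrary choices). The shares are evaluations of a random polynomial, i.e., positions of a Reed-Solomon codeword, and reconstruction is Lagrange interpolation:

```python
import random

P = 2**127 - 1  # a Mersenne prime; any prime larger than the secret works

def share(secret, t, n):
    """Split `secret` into n shares, any t of which reconstruct it.

    Shares are evaluations of a random degree-(t-1) polynomial with
    constant term `secret` -- exactly a Reed-Solomon codeword."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    f = lambda x: sum(c * pow(x, k, P) for k, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange-interpolate the polynomial at 0 to recover the secret."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        # pow(den, P-2, P) is the modular inverse of den (Fermat)
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

shares = share(secret=42, t=3, n=5)
assert reconstruct(shares[:3]) == 42   # any 3 shares suffice
assert reconstruct(shares[1:4]) == 42  # ...any 3, not just the first
```

With fewer than t shares, the remaining polynomial coefficients are uniform, so every candidate secret is equally consistent: that is the information-theoretic security of the scheme.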

Ye Wang talked about his work on common information and how it appears in privacy and security problems from an information theoretic perspective. In particular he talked about secure sampling, multiparty computation, and data release problems. The MPC and sampling results were pretty technical, in terms of notions of completeness of primitives (conditional distributions) and triviality (a way of categorizing sources). For the data release problem he focused on settings where a sanitizer has access to a pair $(X,Y)$, where $X$ is private and $Y$ is “useful” — the goal is to produce a version of the data which reveals less about $X$ (privacy) and more about $Y$ (utility). Since the two are correlated, there is a tension. The question he addressed is when having access to $Y$ alone is as good as having both $X$ and $Y$.

Manoj, after giving his part of the tutorial (and covering for Vinod), gave his own talk on what he called “cryptographic complexity,” which is an analogy to computational complexity, but for multiparty functions. This was also a talk about definitions and reductions: if you can build a protocol for securely computing $f(\cdot)$ using a protocol for $g(\cdot)$, then $f(\cdot)$ reduces to $g(\cdot)$. A complete function is one to which everything reduces, and a trivial function is one that reduces to everything. With these concepts you can start to classify and partition functions — for example, characterizing all complete functions for 2 parties, or finding trivial functions under different security notions. He presented some weird facts, such as that an $n$-bit XOR doesn’t reduce to an $(n-1)$-bit XOR. It was a pretty interesting talk, and I learned quite a bit!

Elette Boyle gave a great talk on Oblivious RAM, a topic about which I was completely oblivious myself. The basic idea in oblivious RAM is (as I understood it) that an adversary can observe the accesses to a RAM and thereby infer what program is being executed (and on what input). To obfuscate that, you introduce a bunch of spurious accesses. So if you have a program $\Pi$ whose access pattern is fixed prior to execution, you can randomize the accesses and gain some security. The overhead is the ratio of the total accesses to the required accesses. After this introduction to the problem, she talked about lower bounds on the overhead (i.e., you need at least this much overhead) in a setting with parallel processing. I admit that I didn’t quite understand the arguments, but the problem was pretty interesting.
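To make the overhead notion concrete, here is a toy sketch (mine, not from the talk) of the trivial scheme: touch every physical cell on every logical access, so the physical access pattern carries no information about the program. The overhead is $N$; tree-based ORAM constructions get this down to polylogarithmic in $N$.

```python
class TrivialORAM:
    """Toy ORAM: every logical access touches all N physical cells, so
    the physical access pattern is independent of the program being run.
    Overhead (physical accesses per logical access) is N."""

    def __init__(self, n):
        self.mem = [0] * n

    def access(self, op, index, value=None):
        result = None
        for i in range(len(self.mem)):  # scan every cell, always
            word = self.mem[i]          # physical read
            if i == index:
                result = word
                if op == "write":
                    word = value
            self.mem[i] = word          # physical write (usually a no-op)
        return result

oram = TrivialORAM(8)
oram.access("write", 3, 99)
assert oram.access("read", 3) == 99
```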

Hoeteck Wee gave the last (but quite energetic) talk of the afternoon, on what he called “functional encryption.” The idea is that Alice has $(x,M)$ and Bob has $y$. They both send messages to a third party, Charlie. There is a 0-1 function (predicate) $P(x,y)$ such that if $P(x,y) = 1$ then Charlie can decode the message $M$; otherwise, he cannot. An example would be the predicate $P(x,y) = \mathbf{1}(x = y)$. In this case, Alice can send $h(x) \oplus M$ and Bob can send $h(y)$ for some 2-wise independent hash function $h$, and then Charlie can recover $M$ if the hashes match. I think there is a question in this scheme about whether Charlie can tell that he got the right message, but I guess I can read the paper for that. The kinds of questions they want to ask are: what kinds of predicates have nice encoding schemes, and how large do the messages from Alice and Bob have to be? He made a connection/reduction to a communication complexity problem to get a bound on the message sizes in terms of the communication complexity of computing the predicate $P$. It really was a very nice talk and pretty understandable even with my own limited background.
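Here is a runnable toy version of the equality example (my own sketch, not from the talk): I use the pairwise independent hash $h(x) = (ax+b) \bmod p$, with the key $(a,b)$ acting as shared randomness between Alice and Bob, and modular addition in place of XOR for simplicity.

```python
import random

P = 2**61 - 1  # prime modulus for the hash family h(x) = (a*x + b) mod P

def keygen():
    """Shared randomness between Alice and Bob: the key of the hash h.
    Forcing a != 0 means distinct inputs never hash to the same value here."""
    return random.randrange(1, P), random.randrange(P)

def h(key, x):
    a, b = key
    return (a * x + b) % P

def alice(key, x, message):
    return (h(key, x) + message) % P   # one-time-pad M with h(x)

def bob(key, y):
    return h(key, y)

def charlie(ct, tag):
    return (ct - tag) % P              # equals M iff x == y

key = keygen()
msg = 1234
assert charlie(alice(key, 7, msg), bob(key, 7)) == msg  # x == y: decodes
assert charlie(alice(key, 7, msg), bob(key, 8)) != msg  # x != y: garbage
```

When $x \neq y$, Charlie's output is $M + a(x-y) \bmod P$, which is uniform over the key choice, so nothing about $M$ leaks; this also illustrates the question raised above, since Charlie cannot tell by himself whether his output is the real message.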

At a DARPA PI meeting recently, I met some folks from Cybernetica who told me about the hot new startup CountryOS! (EDIT: it’s not their startup).

A recent 99% Invisible episode describes the history of SIGSALY, a secure communication system developed during WWII that used one-time pads of white noise pressed onto vinyl records to encrypt voice traffic in analog.

Thanks to The Allusionist, I learned about EuroSpeak and discovered this guide on Misused English words and expressions in EU publications, which is hilarious.

Mathematical Tools of Information-Theoretic Security Workshop: Days 2-3

I took sketchier notes as the workshop progressed, partly due to the ICASSP deadline, but also because jet lag started to hit me. The second day was a half day, which started with Zhenjie Zhang giving a tutorial on differential privacy from a databases/data mining perspective, followed by my talk on the more machine learning aspects. In between was a talk by Ben Smyth on building automatic verification for security protocols. Basically, you write the protocol as a program, and then the ProVerif verifier goes and tries to break your protocol. As an example, it can automatically find/generate a man-in-the-middle attack if one exists. I thought it was pretty neat, especially after having recently talked to someone about automatic proof systems. It’s based on something called the applied pi calculus, which I did not understand at all, but hey, I learned something new, which was great. The last two talks of the day were by Lalitha Sankar and Mari Kobayashi. Lalitha talked about mutual information based measures of privacy leakage in an interactive communication setting that is the information-theoretic analogue of communication complexity models in CS. Mari talked about the broadcast channel with state feedback, looking for secure analogues of opportunistic multicast settings in which one also needs to generate a secret key.
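As a toy bridge between the two privacy views in these talks (my own illustration, not from either one): for a single bit passed through $\epsilon$-differentially-private randomized response, the mutual-information leakage $I(X;Z)$ has a simple closed form that you can tabulate against $\epsilon$.

```python
import math

def rr_leakage(eps, p=0.5):
    """Mutual information I(X;Z) in bits between a Bernoulli(p) bit X and
    the output Z of eps-DP randomized response, which reports X truthfully
    with probability e^eps / (1 + e^eps) and flips it otherwise."""
    q = math.exp(eps) / (1.0 + math.exp(eps))   # probability of truth-telling
    px = {0: 1 - p, 1: p}
    pz_x = lambda z, x: q if z == x else 1 - q  # channel p(z|x)
    pz = {z: sum(px[x] * pz_x(z, x) for x in (0, 1)) for z in (0, 1)}
    return sum(
        px[x] * pz_x(z, x) * math.log2(pz_x(z, x) / pz[z])
        for x in (0, 1) for z in (0, 1)
    )

for eps in (0.1, 0.5, 1.0, 2.0):
    print(f"eps = {eps}: I(X;Z) = {rr_leakage(eps):.4f} bits")
```

For small $\epsilon$ the leakage grows roughly quadratically in $\epsilon$, so the two measures order mechanisms similarly in this regime even though the worst-case (DP) and average-case (MI) guarantees are conceptually different.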

The last day was on quantum! I learned a lot and took few notes, unfortunately. Andreas Winter gave a tutorial on quantum (the slides for most talks, including his, are online) and Ciara Morgan discussed the challenges in proving a strong converse for the capacity of quantum channels. Damian Markham talked about secret sharing in quantum systems. Masahito Hayashi gave a very densely-packed talk surveying a large number of results on secure randomness extraction and hash functions using Rényi information measures. Privacy amplification is really interesting, but I need a tutorial on it before I can really absorb the research results. The last non-overview talk I have notes on was by David Elkouss (apologies to the remaining speakers): this was a really interesting presentation on how to decide which of two channels is better in a quantum communication sense. The slides are a little enigmatic, but the papers are online.

Shlomo Shamai made it to the last day of the workshop (the intersection with the High Holidays was unfortunate) — he talked about the layered-secrecy view of the broadcast channel: rather than thinking of only the secret message as carrying information, one can think of certain layers (cf. superposition coding) as being secured based on the channel to the non-legitimate receiver. For example, in a degraded broadcast channel, the strong receiver’s message can sometimes be thought of as secret from the weak receiver. This leads to a raft of models and setups based on who wants to keep what secret from whom, shedding some light on standard superposition, rate splitting, binning, and embedding constructions. The talk was largely based on a paper in the current issue of the Proceedings of the IEEE.

All in all, this was a really great workshop, and the organizers were wonderfully generous hosts.

Mathematical Tools of Information-Theoretic Security Workshop: Day 1

It’s been a while since I have conference-blogged but I wanted to set aside a little time for it. Before going to Allerton I went to a lovely workshop in Paris on the Mathematical Tools of Information-Theoretic Security thanks to a very kind invitation from Vincent Tan and Matthieu Bloch. This was a 2.5 day workshop covering a rather wide variety of topics, which was good for me since I learned quite a bit. I gave a talk on differential privacy and machine learning with a little more of a push on the mathematical aspects that might be interesting from an information-theory perspective. Paris was appropriately lovely, and it was great to see familiar and new faces there. Now that I am at Rutgers I should note especially our three distinguished alumnae, Şennur Ulukuş, Aylin Yener, and Lalitha Sankar.

Postdoc in privacy and security at Imperial College London

Deniz Gündüz is looking for a postdoctoral researcher in the areas of privacy and security in cyber-physical systems, particularly for smart metering applications in smart grids. The position is in the Intelligent Systems and Networks Group within the Electrical and Electronic Engineering Department of Imperial College London.

Previous research experience and a strong track record in information theory, signal processing, and/or optimisation theory are required. This position will be supported through an international project, and will provide an excellent opportunity to work within an interdisciplinary team spanning top European institutions: Imperial College London, KTH, ETHZ, and INRIA.
The position is available immediately for one year, with the potential to be extended for another year depending on the candidate’s performance.

Contact Dr. Gündüz directly if interested.

WIFS 2014

This week I took a quick jaunt down to Atlanta to attend part of WIFS 2014 (co-located with GlobalSIP 2014). Kamalika and I were invited to give a talk on differential privacy and machine learning, based on our IEEE Signal Processing Magazine article. I’ve uploaded the slides of the tutorial to my website and we’re planning on making a video (audio over slides) version for SigView as well as on YouTube.

Much like last year, GlobalSIP had a somewhat disjointed, semi-chaotic feel (exacerbated by tiredness, I am sure) — it’s really a collection of semi-interacting workshops in the same space, and I knew people in several of the other workshops. Since I was there for a day and giving a tutorial at WIFS, I decided to stick with WIFS for the day. To give a sense of how confusing it all was, here’s a picture of the guide to deciphering the program book:

Overly-complicated rules for encoding sessions

The keynote for GlobalSIP was given by Vince Poor on information-theoretic privacy via rate distortion (this is the work with Lalitha). Vince did a good job of not over-IT-ing it, I think, which was good because the audience was pretty diverse and it’s not clear that many of the people there had even taken a course on information theory. This seems to be the big challenge in multi-disciplinary conferences like GlobalSIP (or large signal processing conferences in general) — everyone is in signal processing, but it’s a big tent and it’s hard to reach everyone.

Min Wu was the keynote speaker for the WIFS workshop on the day I attended. Her talk, on “Exploring Power Network Signatures for Information Forensics,” was about how to glean information from power fluctuations in networks, i.e., the electric network frequency (ENF). Different processes or operations have different power demands — by matching these signatures to an observed signal (e.g. a video), one can make inferences about the time/location/integrity of the data. For example, were the audio and visual tracks in a video taken at the same time or merged later? This whole area is quite interesting, and while I was sort of aware of this work I hadn’t really read up on much of it.
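A toy sketch of the matching step (my own illustration, not from the talk): slide a short extracted ENF trace along a reference recording of the grid frequency and pick the lag with the highest normalized correlation, which dates the recording.

```python
import numpy as np

def enf_match(reference, snippet):
    """Return the offset where `snippet` best matches `reference`,
    using normalized cross-correlation over a sliding window."""
    n = len(snippet)
    s = (snippet - snippet.mean()) / snippet.std()
    best_offset, best_corr = 0, -np.inf
    for k in range(len(reference) - n + 1):
        w = reference[k:k + n]
        w = (w - w.mean()) / w.std()
        corr = float(np.dot(s, w)) / n
        if corr > best_corr:
            best_offset, best_corr = k, corr
    return best_offset, best_corr

# synthetic reference ENF: nominal 60 Hz with a slow random drift
rng = np.random.default_rng(0)
reference = 60 + np.cumsum(rng.normal(0, 0.005, 3600))   # one reading/sec, 1 hour
snippet = reference[1200:1320] + rng.normal(0, 0.001, 120)  # noisy 2-minute clip
offset, corr = enf_match(reference, snippet)
print(offset, corr)  # offset comes out near 1200
```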

Perhaps it was the end of the semester kicking in, but I sort of took terrible notes on most of the talks and poster sessions at the conference, so I can’t really write coherently about the papers I saw. Unfortunately I had to run back to teach the penultimate lecture in my class. I guess now that I have a “real job” this is going to be the way it works from now on. Kind of sad, really.

SPCOM 2014: some talks

Relevance Singular Vector Machine for Low-rank Matrix Sensing
Martin Sundin; Saikat Chatterjee; Magnus Jansson; Cristian Rojas
This talk was on designing Bayesian priors for low-rank matrix recovery problems — the key is to find a prior which induces a low-rank structure on the matrix. The model was something like $y = A \mathrm{vec}(X) + n$, where $X$ is a low-rank matrix and $n$ is noise. The previous state of the art is by Babacan et al., a paper which I obviously haven’t read, but the method proposed here (which involved some heavy algebra/matrix factorizations) appears to be competitive in several regimes. Probably more of interest to those working on Bayesian methods…
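For concreteness, a little numpy sketch of the observation model (my own illustration; dimensions, rank, and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, r, m = 20, 20, 2, 300          # matrix size, rank, number of measurements

# low-rank ground truth X and a random Gaussian sensing operator A
X = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))   # rank-r matrix
A = rng.normal(size=(m, p * q)) / np.sqrt(m)
noise = 0.01 * rng.normal(size=m)

y = A @ X.ravel() + noise            # the model y = A vec(X) + n
```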

Non-Convex Sparse Estimation for Signal Processing
David Wipf
More Bayesian methods! David (whom I met at ICML) was not trying to say that the priors are particularly “correct,” but rather that the penalty functions they induce actually make sense for the problems he is studying. More of an algorithmist’s approach, you might say. He set up the problem a bit more generally, as minimizing problems of the form
$\min_{\{X_i\}} \sum_{i} \alpha_i \, \mathrm{rank}[X_i] \quad \mathrm{subject\ to} \quad Y = \sum_{i} A_i(X_i)$
where $A_i$ are some operators. He made the case that convex relaxations of many of these problems, while analytically beautiful, have restrictions which are not satisfied in practice, and indeed they often have poor performance. His approach is via Empirical Bayes, but this leads to non-convex problems. What he can show is that the algorithm he proposes is competitive with any method that tries to separate the error from the “low-rank” constraint, and that the new optimization is “smoother.” I’m sure more details are in his various papers, for those who are interested.
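For reference, the convex relaxation he was arguing against replaces rank with the nuclear norm. A sketch for a single block, assuming a recent cvxpy (with its bundled conic solver) is installed:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
p, q, r, m = 10, 10, 2, 80
X_true = rng.normal(size=(p, r)) @ rng.normal(size=(r, q))
A = rng.normal(size=(m, p * q)) / np.sqrt(m)
y = A @ X_true.flatten(order="F")    # vec() here is column-major stacking

X = cp.Variable((p, q))
# nuclear norm as the convex surrogate for rank
prob = cp.Problem(cp.Minimize(cp.normNuc(X)), [A @ cp.vec(X, order="F") == y])
prob.solve()
print(np.linalg.norm(X.value - X_true) / np.linalg.norm(X_true))
```

When the sensing operator fails the usual incoherence/RIP-type conditions, this relaxation can recover the wrong matrix entirely, which is the gap his non-convex Empirical Bayes approach targets.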

PCA-HDR: A Robust PCA Based Solution to HDR Imaging
My apologies for taking fewer notes on this one, but I don’t know much about HDR imaging, so this was mostly me learning about HDR image processing. There are several different ways of doing HDR, from multiple exposures to flash/no-flash, and so on. The idea is that artifacts introduced by the camera can be modeled using the robust PCA framework, and that denoising in HDR imaging may work better using robust PCA. I think that looking at some of the approaches David mentioned may be good in this domain, since it seems unlikely to me that these images will satisfy the conditions necessary for convex relaxations to work…
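For readers new to robust PCA: the decomposition is $M = L + S$ with $L$ low rank and $S$ sparse. Below is a compact ADMM sketch of principal component pursuit (the standard convex formulation, my own hedged implementation using the usual default parameters, with no convergence checks):

```python
import numpy as np

def robust_pca(M, lam=None, mu=None, n_iter=200):
    """ADMM sketch of principal component pursuit:
    min ||L||_* + lam * ||S||_1  subject to  L + S = M."""
    if lam is None:
        lam = 1.0 / np.sqrt(max(M.shape))          # standard PCP weight
    if mu is None:
        mu = 0.25 * M.size / np.abs(M).sum()       # common heuristic step size
    L, S, Y = (np.zeros_like(M) for _ in range(3))
    for _ in range(n_iter):
        # L-step: singular value thresholding at level 1/mu
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = U @ np.diag(np.maximum(s - 1.0 / mu, 0)) @ Vt
        # S-step: entrywise soft thresholding at level lam/mu
        R = M - L + Y / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0)
        # dual ascent on the constraint L + S = M
        Y = Y + mu * (M - L - S)
    return L, S
```

For HDR, the columns of $M$ might be vectorized exposures, with $S$ absorbing the camera artifacts; whether those artifacts are actually sparse enough is exactly the modeling question above.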

On Communication Requirements for Secure Computation
Vinod M Prabhakaran
Vinod showed some information theoretic approaches to understanding how much communication is needed for secure computation protocols like remote oblivious transfer: Xavier has $\{X_0, X_1\}$, Yvonne has $Y \in \{0,1\}$, and Zelda wants $Z = X_Y$, but no party should be able to infer the others’ values. Feige, Kilian, and Naor have a protocol for this, which Vinod and co-authors can show is communication-optimal. There were several ingredients here, including cut-set bounds, distribution switching, data processing inequalities, and special bounds for 3-party protocols. More details are in his CRYPTO paper (and others).
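To get a feel for what an OT protocol looks like and where the communication goes, here is a toy sketch of two-party 1-out-of-2 OT with precomputed correlated randomness from a trusted dealer (Beaver-style precomputed OT; my own illustration, not the Feige-Kilian-Naor three-party protocol from the talk, and only passively secure):

```python
import secrets

def dealer():
    """Trusted setup: correlated randomness for sender and receiver.
    Sender gets (r0, r1); receiver gets a random bit d and r_d."""
    r0, r1 = secrets.randbits(32), secrets.randbits(32)
    d = secrets.randbits(1)
    return (r0, r1), (d, r1 if d else r0)

def receiver_query(choice, setup):
    d, _ = setup
    return choice ^ d                       # a single masked choice bit

def sender_reply(m0, m1, e, setup):
    r0, r1 = setup
    # mask each message so the receiver can unmask only m_choice
    return m0 ^ (r1 if e else r0), m1 ^ (r0 if e else r1)

def receiver_decode(choice, reply, setup):
    _, rd = setup
    return reply[choice] ^ rd               # the other message stays masked

sender_setup, receiver_setup = dealer()
m0, m1, choice = 1111, 2222, 1
e = receiver_query(choice, receiver_setup)
reply = sender_reply(m0, m1, e, sender_setup)
assert receiver_decode(choice, reply, receiver_setup) == (m0, m1)[choice]
```

The online phase costs one bit up and two masked messages down, which is the kind of accounting the cut-set-style lower bounds in the talk make precise.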

Artificial Noise Revisited: When Eve Has More Antennas Than Alice
Shuiyin Liu; Yi Hong; Emanuele Viterbo
In a MIMO wiretap setting, if the transmitter has more antennas than the legitimate receiver, then the transmitter can send artificial noise in the nullspace of the direct channel’s matrix — as long as the eavesdropper has fewer antennas than the transmitter, secure transmission is possible. In this paper they show that positive secrecy capacity is possible even when the eavesdropper has more antennas, but as the number of eavesdropper antennas grows, the achievable rate goes to $0$. Perhaps a little bit of a surprise here!
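A quick numpy illustration of the nullspace trick (my own sketch; antenna counts are arbitrary, and power constraints and symbol alphabets are ignored):

```python
import numpy as np

rng = np.random.default_rng(0)
nt, nr, ne = 4, 2, 2           # transmit / legit-receive / eavesdrop antennas
H = rng.normal(size=(nr, nt))  # Alice -> Bob channel (known to Alice)
G = rng.normal(size=(ne, nt))  # Alice -> Eve channel

# orthonormal basis for the nullspace of H (exists because nt > nr)
_, _, Vt = np.linalg.svd(H)
Z = Vt[nr:].T                  # nt x (nt - nr) matrix with H @ Z == 0

w = np.linalg.pinv(H)[:, 0]    # beamformer landing the symbol on Bob's antenna 1
s = 1.0                        # information symbol
v = rng.normal(size=nt - nr)   # artificial noise coefficients

x = w * s + Z @ v              # transmitted vector
print(np.round(H @ x, 6))      # Bob: the artificial noise cancels exactly
print(np.round(G @ x, 6))      # Eve: the symbol is buried in the noise
```

Since Eve's channel $G$ almost surely does not share $H$'s nullspace, the noise hits her receiver at full strength while Bob sees a clean symbol.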

Help researchers figure out how to make better passwords

S. Raj Rajagopalan and collaborators at Honeywell are doing some security research on making better passwords. They are looking for some people to do a quick study on password design.

Along with a couple of Honeywell security researchers I am running a study on a rather familiar problem for most of us – creating memorable but secure passwords, i.e. how to generate passwords that are both suitably random and memorable. We have just launched a simple user study that asks volunteers to participate in an interactive session that lets them choose password candidates and see how well they remember them. Needless to say, these are not actual passwords used by any system, only strings that could be used as passwords.

No personal information is collected in the study, and the system only stores the data that is actually provided by the user. To that end, you may choose not to provide any piece of information. The study takes only a couple of minutes to finish. You may run it multiple times if you wish (and you will likely get different use cases), but you will have to clear your browser’s cache to get a fresh configuration.

We need at least 300 participants to get statistical significance, so we would appreciate it if you could participate in the study.