Problems with the KDDCup99 Data Set

I’ve used the KDDCup99 data set in a few papers for experiments, primarily because it has a large sample size and preprocessing is not too onerous. However, I recently learned (from Rebecca Wright) that for applications to network security, this data set has been discredited as unrepresentative. The paper by John McHugh from ACM TISSEC details the charges. Essentially there was little validation done with regards to checking how representative the data set is.

Why do I bring this up? Firstly, I suppose I should stop using this data set to make claims about anomaly detection (which may be a problem for AISec coming up at the end of the month). However, it’s not clear, from a machine learning perspective, whether the claims one can make about a particular application will generalize within an application domain, given the lack of standardization of data sets even within a particular application. I could do a bunch of experiments on mixtures of Gaussians which might tell me that the convergence rate is what the theory said it should be, but validating on a variety of “non-synthetic” data sets can at least show how performance varies with data sets properties (regardless of the accuracy with respect to the application). So should I stop using the data set entirely?

Secondly, if we want to develop new models and algorithms for machine learning on security applications, we need data sets, and preferably public data sets. This is a real challenge for anyone trying to develop theoretical frameworks that don’t sound too bogus: practice could drive theory, but there is a kind of security through obscurity model in the data gathering/sharing world which makes it hard to understand what the problems are.

IHP “Nexus” Workshop on Privacy and Security: Days 4-5

Wrapping this up, finally. Maybe conference blogging has to go by the wayside for me… my notes got a bit sketchier so I’ll just do a short rundown of the topics.

Days 4-5 were a series of “short” talks by Moni Naor, Kobbi Nissim, Lalitha Sankar, Sewoong Oh, Delaram Kahrobaie, Joerg Kliewer, Jon Ullman, and Sasho Nikolov on a rather eclectic mix of topics.

Moni’s talk was on secret sharing in an online setting — parties arrive one by one and the qualified sets (who can decode the secret) is revealed by all parties. The shares have to be generated online as well. Since the access structure is evolving, what kinds of systems can we support? As I understood it, the idea is to use something similar to threshold scheme and a “doubling trick”-like argument by dividing the users/parties into generations. It’s a bit out of area for me so I had a hard time keeping up with the connections to other problems. Kobbi talked about reconstruction attacks based on observing traffic from outsourced database systems. A user wants to get the records but the server shouldn’t be able to reconstruct: it knows how many records were returned from a query and knows if the same record was sent on subsequent queries — this is a sort of access pattern leakage. He presented attacks based on this information and also based on just knowing the volume (e.g. total size of response) from the queries.

Lalitha talked about mutual information privacy, which was quite a bit different than the differential privacy models from the CS side, but more in line with Ye Wang’s talk earlier in the week. Although she didn’t get to spend as much time on it, the work on interactive communication and privacy might have been interesting to folks earlier in the workshop studying communication complexity. In general, the connection between communication complexity problems and MPC, for example, are elusive to me (probably from lack of trying).

Sewoong talked about optimal mechanisms for differentially private composition — I had to miss his talk, unfortunately. Delaram talked about cryptosystems based on group theory and I had to try and check back in all the things I learned in 18.701/702 and the graduate algebra class I (mistakenly) took my first year of graduate school. I am not sure I could even do justice to it, but I took a lot of notes. Joerg talked about using polar codes to enable private function computation — initially privacy was measured by equivocation but towards the end he made a connection to differential privacy. Since most folks (myself included) are not experts on polar codes, he gave a rather nice tutorial (I thought) on polar coding. It being the last day of the workshop, the audience had unfortunately thinned out a bit.

Jon spoke about estimating marginal distributions for high-dimensional problems. There were some nice connections to composite hypothesis testing problems that came out of the discussion during the talk — the model seems a bit complex to get into based on my notes, but I think readers who are experts on hypothesis testing might want to check out his work. Sasho rounded off the workshop with a talk about the sensitivity polytope of linear queries on a statistical database and connections to Gaussian widths. The main result was on the sample complexity of answering the queries in time polynomial in the number of individuals, size of the universe, and size of the query set.

IHP “Nexus” Workshop on Privacy and Security: Day 2

Verrrrrry belated blogging on the rest of the workshop, more than a month later. Day 2 had 5 talks instead of the tutorial plus talks, and the topics were a bit more varied (this was partly because of scheduling issues that prevented us from being strictly thematic).

Amos Beimel started out with a talk on secret sharing, which had a very nice tutorial/introduction to the problem, including the connection between Reed-Solomon codes and Shamir’s t-out-of-n scheme. For professional (and perhaps personal) reasons I found myself wondering how much more the connection between secret sharing and coding theory was — after all, this was a workshop about communication between information theory and theoretical CS. Not being a coding theory expert myself, I could only speculate. What I didn’t know about was the more general secret sharing structures and the results of Ito-Saito-Nishizeki scheme (published in Globecom!). Amos also talked about monotone span programs, which were new to me, and how to prove lower bounds. He concluded with more recent work on the related distribution design problem: how can we construct a distribution on n variables given constraints that specify subsets which should have identical marginals and subsets which should have disjoint support? The results appeared in ICTS.

Ye Wang talked about his work on common information and how it appears in privacy and security problems from an information theoretic perspective. In particular he talked about secure sampling, multiparty computation, and data release problems. The MPC and sampling results were pretty technical in terms of notions of completeness of primitives (conditional distributions) and triviality (a way of categorizing sources). For the data release problem he focused on problems where a sanitizer has access to a pair (X,Y) where X is private and Y is “useful” — the goal is to produce a version of the data which reveals less about X (privacy) and more about Y (utility). Since they are correlated, there is a tension. The question he addressed is when having access to Y alone as as good as both X and Y.

Manoj, after giving his part of the tutorial (and covering for Vinod), gave his own talk on what he called “cryptographic complexity,” which is an analogy to computational complexity, but for multiparty functions. This was also a talk about definitions and reductions: if you can build a protocol for securely computing f(\cdot) using a protocol for g(\cdot), then f(\cdot) reduces to g(\cdot). A complete function is one for which everything reduces to it, and a trivial function reduces to everything. So with the concepts you can start to classify and partition out functions like characterizing all complete functions for 2 parties, or finding trivial functions under different security notions. He presented some weird facts, like an n bit XOR doesn’t reduce to an (n-1) bit XOR. It was a pretty interesting talk, and I learned quite a bit!

Elette Boyle gave a great talk on Oblivious RAM, a topic about which I was completely oblivious myself. The basic idea in oblivious RAM is (as I understood it) that an adversary can observe the accesses to a RAM and therefore infer what program is being executed (and the input). To obfuscate that, you introduce a bunch of spurious accesses. So if you have a program $\latex \Pi$ whose access pattern is fixed prior to execution, you can randomize the accesses and gain some security. The overhead is the ratio of the total accesses to the required accesses. After this introduction to the problem, she talked about lower bounds on the overhead (e.g. you need this much overhead) for a case where you have parallel processing. I admit that I didn’t quite understand the arguments, but the problem was pretty interesting.

Hoeteck Wee gave the last (but quite energetic) talk of the afternoon, on what he called “functional encryption.” The ideas is that Alice has (x,M) and Bob has y. They both send messages to a third party, Charlie. There is a 0-1 function (predicate) P(x,y) such that if P(x,y) = 1 then Charlie can decode the message M. Otherwise, they cannot. An example would be the predicate P(x,y) = \mathbf{1}(x = y). In this case, Alice can send h(x) \oplus M and Bob can send h(y) for some 2-wise independent hash function, and then Charlie can recover M if the hashes match. I think there is a question in this scheme about whether Charlie needs to know that they got the right message, but I guess I can read the paper for that. The kinds of questions they want to ask are what kinds of predicates have nice encoding schemes? What is the size of message that Alice and Bob have to send? He made a connection/reduction to a communication complexity problem to get a bound on the message sizes in terms of the communication complexity of computing the predicate P. It really was a very nice talk and pretty understandable even with my own limited background.

Linkage

Badass women cartographers!

Looking back on Shoes.

At a DARPA PI meeting recently, I met some folks from Cybernetica who told me about the hot new startup CountryOS! (EDIT: it’s not their startup).

A recent 99% Invisible episode describes the history of the SIGSALY, a secure communication system developed during WWII that used white noise one-time pads printed on vinyl to analog-encrypt communications lines.

Thanks to The Allusionist, I learned about EuroSpeak and discovered this guide on Misused English words and expressions in EU publications, which is hilarious.

Mathematical Tools of Information-Theoretic Security Workshop: Days 2-3

I took sketchier notes as the workshop progressed, partly due to the ICASSP deadline, but also because jet lag started to hit me. The second day was a half day, which started with Zhenjie Zhang giving a tutorial on differential privacy from a databases/data mining perspective and my talk on more machine learning aspects. In between us was a talk by Ben Smyth on building automatic verification for security protocols. Basically you write the protocol as a program and then the ProVerif verifier will go and try to break your protocol. As an example, it can automatically find/generate a man-in-the-middle attack if one exists. I thought it was pretty neat, especially after having recently talked to someone about automatic proof systems. It’s based on something called the applied pi calculus, which I did not understand at all, but hey, I learned something new, which was great. The last two talks of the day were by Lalitha Sankar and Mari Kobayashi. Lalitha talked about mutual information based measures of privacy leakage in an interactive communication setting that is the information-theoretic analogue of communication complexity models in CS. Mari talked about the broadcast channel with state feedback. This is trying to find secure analogues of these opportunistic multicast settings where you need to also generate a secret key.

The last day was on quantum! I learned a lot and took few notes, unfortunately. Andreas Winter gave a tutorial on quantum (the slides for most talks are online and his are as well) and Ciara Morgan discussed the challenges in proving a strong converse for the the capacity of quantum channels. Damian Markham talked about secret sharing in quantum systems. Masahito Hayashi gave a very densely-packed talk surveying a large number of results based on secure randomness extraction and hash functions using Rényi information measures. I think privacy amplification is really interesting but I think I need a tutorial on it before I can really get the research results. The last non-overview talk I have notes on was by David Elkouss (apologies to the remaining speakers): this was a really interesting presentation on how to decide which of two channels is better from a quantum communication sense. The slides are a little engimatic, but the papers are online.

Shlomo Shamai made it to the last day of the workshop (the intersection with High Holidays was unfortunate) — he talked about the layered secrecy view of the broadcast channel: rather than thinking only of the secret message as carrying information, one can think of certain layers (c.f. superposition coding) as being secured based on the channel to the non-legitimate receiver. For example, in a degraded broadcast channel, the strong receiver’s message can sometimes be thought of as secret from the weak receiver. This leads to a raft of models and setups based on who wants to keep what secret from whom, shedding some light on standard superposition, rate splitting, binning, and embedding constructions. The talk was largely based on a paper in the current issues of the Proceedings of the IEEE.

All in all, this was a really great workshop, and the organizers were very generous in the organization.

Mathematical Tools of Information-Theoretic Security Workshop: Day 1

It’s been a while since I have conference-blogged but I wanted to set aside a little time for it. Before going to Allerton I went to a lovely workshop in Paris on the Mathematical Tools of Information-Theoretic Security thanks to a very kind invitation from Vincent Tan and Matthieu Bloch. This was a 2.5 day workshop covering a rather wide variety of topics, which was good for me since I learned quite a bit. I gave a talk on differential privacy and machine learning with a little more of a push on the mathematical aspects that might be interesting from an information-theory perspective. Paris was appropriately lovely, and it was great to see familiar and new faces there. Now that I am at Rutgers I should note especially our three distinguished alumnae, Şennur Ulukuş, Aylin Yener, and Lalitha Sankar.

Continue reading

Postdoc in privacy and security at Imperial College London

Denis Gündüz is looking for a postdoctoral researcher in the areas of privacy and security in cyber-physical systems, particularly for smart metering applications in smart grids. The position is in the Intelligent Systems and Networks Group within the Electrical and Electronic Engineering Department of Imperial College London.

Previous research experience and a strong track record in information theory, signal processing, and/or optimisation theory is required. This position will be supported through an international project, and will provide an excellent opportunity to work within an interdisciplinary team spanning top European institutions: Imperial College London, KTH, ETHZ and INRIA.
The position is available immediately for one year, with a potential to be extended another year depending on candidate’s performance.

Contact Dr. Gündüz directly if interested.