I’ve used the KDDCup99 data set for experiments in a few papers, primarily because it has a large sample size and the preprocessing is not too onerous. However, I recently learned (from Rebecca Wright) that for applications to network security, this data set has been discredited as unrepresentative. John McHugh’s paper in ACM TISSEC details the charges. Essentially, little was done to validate how representative the data set is.
Why do I bring this up? Firstly, I suppose I should stop using this data set to make claims about anomaly detection (which may be a problem for AISec coming up at the end of the month). However, from a machine learning perspective, it’s not clear that the claims one can make about a particular application will generalize within that application domain, given the lack of standardized data sets even within a single application. I could do a bunch of experiments on mixtures of Gaussians, which might tell me that the convergence rate is what the theory said it should be, but validating on a variety of “non-synthetic” data sets can at least show how performance varies with data set properties (regardless of how accurately those data sets reflect the application). So should I stop using the data set entirely?
Secondly, if we want to develop new models and algorithms for machine learning on security applications, we need data sets, and preferably public ones. This is a real challenge for anyone trying to develop theoretical frameworks that don’t sound too bogus: practice could drive theory, but there is a kind of security-through-obscurity model in the data gathering and sharing world that makes it hard to understand what the problems are.