A (now not-so-recent) paper by Homer et al. made a splash by showing that one could take a DNA sample from a person and detect whether they were part of the Human Genome Project (HGP) based on looking at the SNP variations from that individual together with the reported allele variations in the HGP data. More recently, a paper in PNAS by Franzosa et al. showed reidentification of individuals in the Human Microbiome Project.
Color me unsurprised. Given the richness of the data, from a purely informational point of view it seems pretty clear that people should be identifiable. As with many machine learning problems, however, the secret is in the feature encoding. Many approaches to comparing metagenomes, especially for bacterial ecologies, try to assess the variability in the population of bacteria, perhaps through mapping the to known strains. As mentioned in the Methods section, “reads were additionally mapped to a database of 649 microbial reference genomes using the Burrows-Wheeler aligner.” However, in addition to these mapping statistics, they used a few other more complicated features to help gain some additional robustness in their identification procedure.
Somehow being able to be identified by your microbiome seems less scary than being able to be identified by your genome, perhaps because we have a sense that genes are more “determining” than microbiomes. After all, you could get a fecal transplant and change your gut flora significantly. Is it the same as burning off your fingerprints? Probably not. But perhaps in the future, perpetrators of certain campus shenanigans may be easier to catch.