Allerton 2012: David Tse’s plenary on sequencing

I’ll follow up with some blogging about the talks at Allerton a bit later — my notes are somewhat scattershot, so I might give a more cursory view than usual. Overall, I thought the conference was more fun than in previous years: the best Allerton to date. For now though, I’ll blog about the plenary.

David Tse gave the “director’s cut” of his recent work (with Motahari, Bresler, Bresler, and Ramchandran) on an information-theoretic model for next-generation sequencing (NGS). In NGS, many copies of a single genome are chopped up, and short chunks (say 100 base pairs) are read off. The information theory model is a very simplified abstraction of this process: a read is generated by choosing a location in the genome uniformly at random and producing the 100 bases following that position. In NGS the reads overlap, so each nucleotide of the original genome may appear in many reads; the number of reads in which a given base appears is called the coverage.
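To make the abstraction concrete, here is a toy sketch (mine, not code from the talk) of the read model and the resulting coverage; I treat the genome as circular purely to avoid edge effects:

```python
import random

def sample_reads(genome, L, N, seed=0):
    """Generate N noiseless length-L reads from uniform random start positions."""
    rng = random.Random(seed)
    G = len(genome)
    # Wrap around so every start position yields a full-length read.
    doubled = genome + genome[:L - 1]
    return [doubled[s:s + L] for s in (rng.randrange(G) for _ in range(N))]

def coverage(G, L, N, seed=0):
    """Per-base coverage: the number of reads containing each position."""
    rng = random.Random(seed)  # same seed => same start positions as above
    cov = [0] * G
    for _ in range(N):
        s = rng.randrange(G)
        for i in range(L):
            cov[(s + i) % G] += 1
    return cov

# An i.i.d. uniform genome, as in the model analyzed in the talk.
rng = random.Random(1)
genome = ''.join(rng.choice('ACGT') for _ in range(10_000))
reads = sample_reads(genome, L=100, N=500)
cov = coverage(len(genome), L=100, N=500)
print(min(cov), sum(cov) / len(cov), max(cov))  # mean is near N*L/G = 5
```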

So there are three parameters: G, the length of the genome; L, the length of a read; and N, the number of reads. The question is how these should depend on each other. David presented a theoretical analysis of the reconstruction problem under rather restrictive assumptions (bases are i.i.d., reads are noiseless) and showed that there is a threshold on the number of reads for successful reconstruction with high probability. That is, there is a number C such that N = G/C reads suffice to reconstruct the sequence with high probability, where C depends on L via the ratio L/\log G.
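The exact form of C is in the paper, so I won’t butcher it here, but the classical Lander-Waterman coverage calculation already gives a feel for the scaling. A back-of-the-envelope sketch (this ignores repeats, which the paper’s threshold accounts for):

```python
import math

def coverage_stats(G, L, N):
    """Lander-Waterman style estimates under uniform random read positions."""
    c = N * L / G                 # expected coverage depth
    p_uncovered = math.exp(-c)    # Poisson approximation to P(base is in no read)
    return c, G * p_uncovered     # depth, expected number of uncovered bases

# A roughly E. coli-sized genome (5 Mb) with 100 bp reads.
for N in (100_000, 500_000, 1_000_000):
    c, gaps = coverage_stats(5_000_000, 100, N)
    print(f"N = {N}: depth {c:.0f}x, expected uncovered bases ~ {gaps:.2g}")
```

Note that merely covering every base already forces N on the order of (G/L) \log G, which matches the N = G/C form with C proportional to L/\log G; the interesting part of the analysis is that repeat structure, not just coverage, determines when reconstruction is possible.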

This model is very simple and is clearly not an accurate model for genome sequencing. David began his talk by drawing a grand analogy to the DMC: his proposition is that this approach will be the information theory of genome sequencing. I have to say that this sounds appealing, but it addresses a rather specific problem that arises in NGS, namely assembly. It is a first step towards building an abstract theory of sequencing, and while the model may be simple, the results are non-trivial. David also presented some evidence that real DNA sequences have features (long repeats) which cause problems for greedy assemblers but can be handled by more complex assemblers based on de Bruijn graphs; these can also handle noise in the form of i.i.d. erasures. What this analysis seems to do is identify features of the data that are problematic for assembly from an information-theoretic standpoint; it is an analysis of the technological process of NGS rather than saying much about the biology.
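For a sense of what the de Bruijn approach does with reads, here is a minimal sketch (again mine, for illustration; real assemblers are far more involved) that builds the graph whose edges are the k-mers observed in the reads:

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers.

    Each length-k substring of a read contributes an edge from its
    (k-1)-prefix to its (k-1)-suffix. Assembly then corresponds to
    finding an Eulerian-style walk through this graph.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Tiny example: overlapping noiseless reads from "AACCTTGG".
reads = ["AACCT", "ACCTT", "CCTTG", "CTTGG"]
for node, succs in sorted(de_bruijn_graph(reads, k=4).items()):
    print(node, "->", succs)
```

Repeats longer than k-1 show up as branch points in this graph; a greedy overlap-merging assembler makes irreversible choices at exactly those spots, which is (roughly) why the repeat statistics of real genomes matter so much.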

I’ve been working for the last year (more off than on, alas) on how to understand certain types of NGS data from a statistical viewpoint. I’ll probably write more about that later, once I have some actual understanding. But a central lesson I’ve taken from this is that the situation is quite different from when Shannon built a theory of communication by abstracting existing communication systems. We don’t have nearly as good an understanding of NGS data from an engineering standpoint, and the questions we want to answer from this data are also unclear. Assembly is one such question, but if nothing else, this theoretical analysis shows that the kind of data we have is often insufficient for “real” assembly. This is consistent with practice: many assemblers produce large chunks of DNA, called contigs, rather than the organism’s full genome. There are many interesting statistical questions to explore in this data: what can we answer without assembling whole genomes?

Allerton margin woes

Apparently the specifications for the Allerton camera-ready copy insist on different page margins for the first page: 1 inch on top and 0.75 inches on the sides and bottom, whereas it’s 0.75 inches all around for the other pages. I submitted my paper and was told I had margin errors, so I downloaded the draft copy, and lo and behold, I got back an annotated PDF with gray boxes and arrows showing that every page had margin violations. How could that be, I thought?

These margins seem to be different from those produced by the default \documentclass[conference,letterpaper]{IEEEtran} options. After hacking around a bit I came across this hack, but when I used \usepackage[showframe,paper=letterpaper,margin=0.75in]{geometry}, the frames drawn around each page showed that all of my text was inside the margins.

So either PaperCept is altering my PDF, or it is not capable of calculating margins correctly, or the geometry package is not working. Given the problems I’ve had with PaperCept in the past, I’m guessing it’s not the last of these. My only recourse was a hack: use the geometry package to set the margins of all pages to 1 inch.
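For anyone hitting the same issue, the workaround amounts to something like this in the preamble (my hack, not an officially sanctioned fix):

```latex
\documentclass[conference,letterpaper]{IEEEtran}
% Force 1 inch margins on every page so the PaperCept checker passes.
\usepackage[paper=letterpaper,margin=1in]{geometry}
% For debugging, the showframe option draws the text block on each page:
% \usepackage[showframe,paper=letterpaper,margin=1in]{geometry}
```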

Did anyone else have these weird margin issues?

Update: Upon further checking, it appears that the margins enforced by PaperCept are not those of \documentclass[conference,letterpaper]{IEEEtran} but most likely those of IEEEconf, which has (I believe) been deprecated. Yikes!

Update 2: This is, in fact, my fault, since the submission documentation says to use IEEEconf.cls. I am still confused as to why that’s the standard for Allerton. Also, the specified class file is the 2002 version of IEEEconf.cls, and newer versions exist. Sigh.