About a year ago I started collaborating with a friend in the Armbrust Lab at the University of Washington on some bioinformatics problems, and as a part of that I am trying to give myself a primer on sequencing technologies and how they work. I came across this video recently, and despite its atrocious music and jargonized description, I actually found it quite helpful in thinking about how this particular sequencing technology works:
- Acoustic waves shatter the DNA.
- Adapters (short known sequences) get attached to the ends of the fragments by ligases.
- The fragments get washed over a “lawn” of oligos in the flowcell, and the adapters stick the fragments to the lawn.
- The strands get amplified into larger spots.
- Single nucleotides with fluorescent tags are washed on, they are hit with a laser to reveal the color, the tag is cleaved off, and then the next nucleotide is washed on.
A simple model for the data we get is to say that a position in the genome is selected uniformly at random, and then the read is the sequence of some fixed size starting from that position. Even a brief glance at the physical process above shows how simplified that model is. For the purposes of statistics, it may be enough, but here are some complications that I can see, from knowing almost no physics and biology:
- The places at which the DNA breaks are not uniformly distributed; in fact, they are probably sequence-dependent.
- The ligases may have some preferential attachment characteristics. Ditto for the oligos on the lawn in the flowcell.
- The amplification may be variable, spot by spot. This will affect the brightness of the flash and therefore the reliability of the read assessment.
- The ability of single nucleotides to bind will vary as more and more bases are read, so the gain in the optical signal (or noise) will vary as the read goes on.
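To make the gap between the simple model and these complications concrete, here is a minimal sketch in Python. It is only an illustration: the function names `sample_read` and `add_cycle_errors`, the linear growth of the error rate along the read, and all the parameter values are my own assumptions, not something taken from the video or from real instrument data.

```python
import random

def sample_read(genome, read_len, rng):
    """Simple model: uniform start position, fixed-length, error-free read."""
    start = rng.randrange(len(genome) - read_len + 1)
    return start, genome[start:start + read_len]

def add_cycle_errors(read, base_error, growth, rng):
    """Toy complication: substitution probability rises linearly with
    position in the read (an assumed shape, chosen only for illustration)."""
    out = []
    for i, base in enumerate(read):
        p = min(1.0, base_error + growth * i)
        if rng.random() < p:
            # Replace the base with one of the three other nucleotides.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(1000))  # made-up genome
start, read = sample_read(genome, 50, rng)
noisy = add_cycle_errors(read, base_error=0.001, growth=0.002, rng=rng)
mismatches = sum(a != b for a, b in zip(read, noisy))
print(start, mismatches)
```

The first function is the whole of the simple model. The second adds just one of the effects from the list; biased fragmentation could be sketched in the same spirit by replacing the uniform `randrange` draw with a weighted one.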
Some of these effects are easier to model than others, but the real data make one thing clear: these variations in the technology can cause noticeable deviations from the simple model. More fun work to do!