Problems in modeling Illumina sequencing

About a year ago I started collaborating with a friend in the Armbrust Lab at the University of Washington on some bioinformatics problems, and as a part of that I am trying to give myself a primer on sequencing technologies and how they work. I came across this video recently, and despite its atrocious music and jargonized description, I actually found it quite helpful in thinking about how this particular sequencing technology works:

  • Acoustic waves shatter the DNA.
  • Things (ligases) get attached to the end.
  • The fragments get washed over a “lawn” and the ligases stick the sequences to the lawn.
  • The strands get amplified into larger spots.
  • Single nucleotides with phosphorescent tags are washed on, they are hit with a laser to reveal the color, the tag is sheared, and then the next nucleotide is washed on.

A simple model for the data we get is to say that a position in the genome is selected uniformly at random, and then the read is the sequence of size L, starting from that position. Just a brief glance at the physical process above shows how simple that model is. For the purposes of statistics, it may be enough, but here are some complications that I can see, from knowing almost no physics and biology:

  • The places at which the DNA fragments are not uniformly distributed — in fact, they should be sequence-dependent.
  • The ligases may have some preferential attachment characteristics. Ditto for the oligos on the lawn in the flowcell.
  • The amplification may be variable, spot by spot. This will affect the brightness of the flash and therefore the reliability of the read assessment.
  • The ability of single nucleotides to bind will vary as more and more bases are read, so the gain in the optical signal (or noise) will vary as the read goes on.

Some of these effects are easier to model than others, but what is true from the real data is that these variations in the technology can cause noticeable effects in the data that deviate from the simple model. More fun work to do!


ITA Workshop 2012 : Talks

The ITA Workshop finished up today, and I know I promised some blogging, but my willpower to take notes kind of deteriorated during the week. For today I’ll put some pointers to talks I saw today which were interesting. I realize I am heavily blogging about Berkeley folks here, but you know, they were interesting talks!

Nadia Fawaz talked about differential privacy for continuous observations : in this model you see x_1, x_2, x_3, \ldots causally and have to estimate the running sum. She had two modifications, one in which you only want a windowed running sum, say for W past values, and one in which the privacy constraint decays and expires after a window of time W, so that values W time steps in the past do not have to be protected at all. This yields some differences in the privacy-utility tradeoff in terms of the accuracy of computing the function.

David Tse gave an interesting talk about sequencing DNA via short reads as a communication problem. I had actually had some thoughts along these lines earlier because I am starting to collaborate with my friend Tony Chiang on some informatics problems around next generation sequencing. David wanted to know how many (noiseless) reads N you need to take of a genome of of length G using reads of length L. It turns out that the correct scaling in this model is L/\log G. Some scaling results were given in a qualitative way, but I guess the quantitative stuff is being written up still.

Michael Jordan talked about the “big data bootstrap” (paper here). You have n data points, where n is huge. The idea is to subsample a set of size b and then do bootstrap estimates of size n on the subsample. I have to read the paper on this but it sounds fascinating.

Anant Sahai talked about how to look at some decentralized linear control problems as implicitly doing some sort of network coding in the deterministic model. One way to view this is to identify unstable modes in the plant as communicating with each other using the controllers as relays in the network. By structurally massaging the control problem into a canonical form, they can make this translation a bit more formal and can translate results about linear stabilization from the 80s into max-flow min-cut type results for network codes. This is mostly work by Se Yong Park, who really ought to have a more complete webpage.

Paolo Minero talked about controlling a linear plant over a rate-limited communication link whose capacity evolves according to a Markov chain. What are the conditions on the rate to ensure stability? He made a connection to Markov jump linear systems that gives the answer in the scalar case, but the necessary and sufficient conditions in the vector case don’t quite match. I always like seeing these sort of communication and control results, even though I don’t work in this area at all. They’re just cool.

There were three talks on consensus in the morning, which I will only touch on briefly. Behrouz Touri gave a talk about part of his thesis work, which was on the Hegselman-Krause opinion dynamics model. It’s not possible to derive a Lyapunov function for this system, but he found a time-varying Lyapunov function, leading to an analysis of the convergence which has some nice connections to products of random stochastic matrices and other topics. Ali Jadbabaie talked about work with Pooya Molavi on non-Bayesian social learning, which combines local Bayesian updating with DeGroot consensus to do distributed learning of a parameter in a network. He had some new sufficient conditions involving disconnected networks that are similar in flavor to his preprint. José Moura talked about distributed Kalman filtering and other consensus meets sensing (consensing?) problems. The algorithms are similar to ones I’ve been looking at lately, so I will have to dig a bit deeper into the upcoming IT Transactions paper.