eduroam is awesome

At Allerton I finally set up the eduroam network on my phone and laptop. It was great — with the UIUC system you had to log in with a special temporary ID and password each time, but with eduroam it would automatically connect and authenticate like any other normal wireless network.

Basically you use the same login/password as for other authenticated services on your campus. At Chicago it’s called your CNetID, but the credentials will be different from place to place. The key is that you authenticate to the network with your home institution’s credentials rather than some locally issued account.

It seems that the system has been expanding — if your institution doesn’t support it then you should ask them to do so. Of course, maybe we should just have more open networks, but at least with this you can get wifi on many campuses without having to deal with the bureaucratic overhead of the IT services.

Allerton 2012 : David Tse’s plenary on sequencing

I’ll follow up with some blogging about the talks at Allerton a bit later — my notes are somewhat scattershot, so I might give a more cursory view than usual. Overall, I thought the conference was more fun than in previous years: the best Allerton to date. For now though, I’ll blog about the plenary.

David Tse gave the “director’s cut” of his recent work (with Motahari, Bresler, Bresler, and Ramchandran) on an information-theoretic model for next-generation sequencing (NGS). In NGS, many copies of a single genome are chopped up into short reads (say 100 base pairs each). The information-theoretic model is a very simplified abstraction of this process — a read is generated by choosing a location in the genome uniformly at random and producing the 100 bases starting at that position. In NGS the reads overlap, so each nucleotide of the original genome may appear in many reads. The number of reads in which a base appears is called the coverage.

So there are three parameters: G, the length of the genome; L, the length of a read; and N, the number of reads. The question is how these should depend on each other. David presented a theoretical analysis of the reconstruction question under rather restrictive assumptions (bases are i.i.d., reads are noiseless) and showed that there is a threshold on the number of reads for successful reconstruction with high probability. That is, there is a number C such that N = G/C reads suffice to reconstruct the sequence with high probability, where C depends on L through the ratio L/\log G.
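To make the coverage parameter concrete (a standard calculation, not from the talk itself): a given base is covered by any read starting at one of the L positions at or before it, and each read starts uniformly over the genome, so the expected coverage and the normalized read length are

\[
\mathbb{E}[\text{coverage}] \;=\; \frac{NL}{G},
\qquad
\bar{L} \;:=\; \frac{L}{\log G}.
\]

The threshold result can then be read as: reconstruction succeeds with high probability once the coverage exceeds a critical value determined by \(\bar{L}\); the exact form of that critical value is in the paper and not reproduced here.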

This model is very simple and is clearly not an accurate model of genome sequencing. David began his talk by drawing a grand analogy to the DMC (discrete memoryless channel) — his proposition is that this approach will become the information theory of genome sequencing. I have to say that this sounds appealing, but it addresses a rather specific problem that arises in NGS, namely assembly. Still, it is a first step towards an abstract theory of sequencing, and while the model may be simple, the results are non-trivial. David also presented evidence that real DNA sequences have features (long repeats) which cause problems for greedy assemblers but can be handled by more complex assemblers based on de Bruijn graphs; these can also handle noise in the form of i.i.d. erasures. What this analysis seems to do is identify features of the data that are problematic for assembly from an information-theoretic standpoint — it is an analysis of the technological process of NGS rather than saying much about the biology.
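As a toy illustration of the de Bruijn graph approach mentioned above, here is a minimal assembler sketch (my own illustrative code, not from the talk): nodes are (k-1)-mers, each k-mer observed in a read contributes an edge, and assembly reduces to finding an Eulerian path. It assumes error-free reads, full k-mer coverage, and no repeats longer than k-1 — exactly the regime that long repeats in real genomes break.

```python
from collections import defaultdict

def de_bruijn_assemble(reads, k):
    """Toy de Bruijn assembler: nodes are (k-1)-mers, edges are k-mers.
    Assumes error-free reads and a unique Eulerian path (no long repeats)."""
    edges = defaultdict(list)
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:  # collapse duplicate k-mers from overlapping reads
                continue
            seen.add(kmer)
            u, v = kmer[:-1], kmer[1:]
            edges[u].append(v)
            outdeg[u] += 1
            indeg[v] += 1
    # Start at a node with out-degree > in-degree, if one exists.
    start = next((u for u in list(edges) if outdeg[u] > indeg[u]),
                 next(iter(edges)))
    # Hierholzer's algorithm for an Eulerian path.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if edges[u]:
            stack.append(edges[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    # Spell out the sequence: first node plus the last character of each successor.
    return path[0] + "".join(v[-1] for v in path[1:])

# Toy example: four reads tiling the genome "ATGGCGTGCA"
print(de_bruijn_assemble(["ATGGC", "GGCGT", "CGTGC", "GTGCA"], k=4))
# → ATGGCGTGCA
```

A long repeat shows up here as a node with multiple outgoing edges, so the Eulerian path is no longer unique — which is the information-theoretic obstruction the talk pointed to.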

I’ve been working for the last year (more off than on, alas) on how to understand certain types of NGS data from a statistical viewpoint. I’ll probably write more about that later when I get some actual understanding. But a central lesson I’ve taken from this is that the situation is quite a bit different than it was when Shannon made a theory of communication by abstracting existing communication systems. We don’t have nearly as good an understanding of NGS data from an engineering standpoint, and the questions we want to answer from this data are also unclear. Assembly is one thing, but if nothing else, this theoretical analysis shows that the kind of data we have is often insufficient for “real” assembly. This is borne out in practice: many assemblers produce large chunks of DNA, called contigs, rather than the full genome of the organism. There are many interesting statistical questions to explore here — what can we answer from the data without assembling whole genomes?

Allerton margin woes

Apparently the specifications for the Allerton camera-ready copy insist on different page margins for the first page: 1 inch on top and 0.75 on the sides and bottom, whereas it’s 0.75 all around for the other pages. I submitted the paper and was told I had margin errors, so I downloaded the draft copy, and lo and behold, I got an annotated PDF with gray boxes and arrows showing that every page had margin violations. How could that be, I thought?

This seemed to be different from the defaults for \documentclass[conference,letterpaper]{IEEEtran}. After hacking around a bit I came across this hack, but when I used \usepackage[showframe,paper=letterpaper,margin=0.75in]{geometry}, the frames drawn around each page showed that all of my text was inside the margins.

So either PaperCept is altering my PDF, or it is not calculating margins correctly, or the geometry package is not working. Given the problems I’ve had with PaperCept in the past, I’m guessing it’s not the geometry package. My workaround was to use the geometry package to set the margins of all pages to 1 inch.
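For reference, a sketch of that workaround (assuming the stock IEEEtran class; adjust to taste):

```latex
\documentclass[conference,letterpaper]{IEEEtran}
% Force 1in margins on every page, so even the stricter first-page
% requirement is satisfied; add showframe to visualize the text block.
\usepackage[letterpaper,margin=1in]{geometry}
% \usepackage[letterpaper,margin=1in,showframe]{geometry} % for debugging
```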

Did anyone else have these weird margin issues?

Update: Upon further checking, it appears that the margins enforced by PaperCept are not the margins for \documentclass[conference,letterpaper]{IEEEtran} but instead most likely for IEEEconf, which has (I believe) been deprecated. Yikes!

Update 2: This is, in fact, my fault, since the documentation says to use IEEEconf.cls. I am still confused as to why that’s the standard for Allerton. Also, this is the 2002 version of IEEEconf.cls, and there are even newer versions than that. Sigh.

Domingos on what you should know about machine learning

Dhruv Batra forwarded this Communications of the ACM article by Pedro Domingos, entitled “A Few Useful Things to Know about Machine Learning” [free version]. The main point from the abstract is:

However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

The article focuses on the classification problem to illustrate these “key lessons.” It’s well worth reading, especially for people who don’t work on machine learning, because it explains a number of important issues:

  1. It illustrates the gap between what the theory/research works on and the nitty-gritty of applying these algorithms to real data.
  2. It gives people who want to implement an ML method important fundamental questions to ask before starting: how do I represent my data? How do I evaluate performance? How do I do things efficiently? These have to get squared away first.
  3. Domain knowledge and feature engineering are the keys to success.

Since I’m guessing there are 2 machine learners who read this blog, go read it (unless you are one of my friends who doesn’t care about all of these technical posts).

Tracks: language is overrated

  1. Cliché Intro — Prefuse 73
  2. Nanorobot Tune — Tomas Dvorak, Machinarium Soundtrack
  3. Endorphin — Burial
  4. Missionary Ridge — William Tyler
  5. Hey-Hee-Hi-Ho — Medeski, Martin & Wood
  6. Soutoukou — Mamadou Diabate
  7. Rustem — Taraf de Haidouks
  8. Snowden’s Jig — Carolina Chocolate Drops
  9. Hashmal — Masada
  10. Captain Hook — Mar Caribe
  11. Black Unstoppable — Nicole Mitchell
  12. Stop Time — Horace Silver
  13. Pickin’ Up The Cabbage — Cab Calloway
  14. Smedley’s Melody — Squarepusher
  15. Baraat To Nowhere — Red Baraat
  16. Lou courut — Véronique Gens w/Orchestre National de Lille-Région Nord
  17. Saudade Dada — Arrigo Barnabé
  18. Watermelon Man — Mongo Santamaria
  19. Greensleeves — Matthew Shipp
  20. Clapping Music — Steve Reich/The Sixteen
  21. music for morning people — Kid Koala

Allerton 2012 : Karl J. Åström’s Jubilee Lecture

It’s the fall again, and this year it is the 50th anniversary of the Allerton Conference. Tonight was a special Golden Jubilee lecture by Karl Johan Åström of Lund University. He gave an engaging view of the pre-history, history, present, and future of control systems. Control is a “hidden technology,” he said — it’s everywhere and is what makes all the technology that we use work, but it remains largely unknown and unnoticed except during catastrophic failures. He exhorted the young’uns to do a better job of letting people know how important control systems are in everyday life.

The main message of Åström’s talk is that control theory and control practice need to get back together so that we can develop new control theories for emerging areas, including biology and physics. He called this the “holistic” view and pointed out that it really emerged out of the war effort during WWII, when control systems had to be developed for all sorts of military tasks. This got the mathematicians in the same room as the “real” engineers, and led to a lot of new theory. I had always known that the war was a big driver, but I hadn’t thought about how control really was the glue that tied things together.