Postdoc opportunity at Princeton/ASU on smart grid

There is a joint postdoc opportunity between Princeton and Arizona State University (working with H. Vincent Poor and Lalitha Sankar). They are looking for a postdoc with a strong background in information theory and/or statistical signal processing. The postdoctoral position provides two distinct opportunities:

(i) to work on mathematical models for the large data sets generated in the smart grid along with communication and compression algorithms for secure and privacy-guaranteed distributed processing of data. This opportunity will include working closely with NAE members in power systems at ASU as well as information systems researchers.

(ii) to work with researchers at Princeton University for the other half of the postdoctoral tenure.

If interested, please contact Lalitha Sankar with a CV (lalithasankar@asu.edu) for more details.

The Data Map : a map of information flows

One of the things Latanya Sweeney mentioned during her talk at the iDash workshop is a new project called theDataMap, which is trying to visualize how personal information about individuals flows through institutions. One of the starting points is an older map which shows how a putative hospital patient Alice’s personal information is accessed and used by a number of entities of whom she is likely unaware, including pharma companies, her employer, and medical researchers.

This is analogous to a map Lee Tien sent me, also from a report a few years ago, on how private medical information flows look in California.

It’s worth looking at and thinking a bit about how we balance privacy and utility/profit at the moment, and whether generally erring on the side of sharing is the best way to go.

ICML reviewing absurdity

I’m a reviewer for ICML 2013, which has a novel submission format this year. Papers for the first cycle were due October 1. They received more submissions than they expected (by a significant factor), but I was only assigned papers to review today, more than 2 weeks later. We have been given 2 weeks to submit reviews; given my stack, that’s 2 weeks’ notice to review ~60 pages of material.

I may be going out on a limb here, but I think that the review quality is not going to be that high this time. Perhaps this is a Mechanical Turk approach to the problem — get a bunch of cheap noisy labels and then hope that you can get a good label by majority vote?

Update: We’ve been given another week, hooray.

Linkage

An initiative to prevent irreproducible science.

A video about Graham’s number.

I don’t tweet, but all of this debate seems ridiculous to me. I think the real issue is who follows twitter? I know Sergio is on Twitter, but is anyone else?

Food : An Atlas is a book project on kickstarter by people who do “guerrilla cartography.” It is about food, broadly construed. $25 gets you a copy of the book, and it looks awesome, especially if you like maps. And who doesn’t like maps?

I remember reading about the demise of the American Chestnut tree, but apparently it may make a comeback!

eduroam is awesome

At Allerton I finally set up the eduroam network on my phone and laptop. It was great — with the UIUC system you had to log in with a special temporary ID and password each time, but with eduroam it would automatically connect and authenticate like any other normal wireless network.

Basically you use the same login/password as for other authenticated services on your campus. At Chicago it’s called your CNetID, but the credentials will be different from place to place. The key is that you validate to the network using those credentials and not some locally-given account.

It seems that the system has been expanding — if your institution doesn’t support it then you should ask them to do so. Of course, maybe we should just have more open networks, but at least with this you can get wifi on many campuses without having to deal with the bureaucratic overhead of the IT services.

Allerton 2012 : David Tse’s plenary on sequencing

I’ll follow up with some blogging about the talks at Allerton a bit later — my notes are somewhat scattershot, so I might give a more cursory view than usual. Overall, I thought the conference was more fun than in previous years: the best Allerton to date. For now though, I’ll blog about the plenary.

David Tse gave the “director’s cut” of his recent work (with Motahari, Bresler, Bresler, and Ramchandran) on an information-theoretic model for next-generation sequencing (NGS). In NGS, many copies of a single genome are chopped up into short chunks (say 100 base pairs) called reads. The information theory model is a very simplified abstraction of this process: a read is generated by choosing a location in the genome uniformly at random and producing the 100 bases following that position. In NGS, the reads overlap, and each nucleotide of the original genome may appear in many reads. The number of reads in which a base appears is called the coverage.
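To make the read model concrete, here is a minimal simulation sketch. It is my own illustration, not anything from the talk: the function names (sample_reads, coverage) and the toy parameters are made up, and I assume a circular genome so that a read starting near the end wraps around instead of being truncated.

```python
import random


def sample_reads(genome, read_len, num_reads, rng):
    """Draw noiseless reads with start positions uniform over the genome.

    Assumption (not from the talk): the genome is treated as circular,
    so a read that starts near the end wraps around to the beginning.
    """
    g = len(genome)
    starts = [rng.randrange(g) for _ in range(num_reads)]
    doubled = genome + genome[:read_len]  # lets wrapped reads be plain slices
    reads = [doubled[s:s + read_len] for s in starts]
    return starts, reads


def coverage(genome_len, read_len, starts):
    """Count, for every base position, how many reads contain it."""
    cov = [0] * genome_len
    for s in starts:
        for k in range(read_len):
            cov[(s + k) % genome_len] += 1
    return cov


rng = random.Random(0)        # fixed seed so the run is repeatable
G, L, N = 10_000, 100, 1_000  # toy values for genome length, read length, read count
genome = "".join(rng.choice("ACGT") for _ in range(G))  # i.i.d. bases

starts, reads = sample_reads(genome, L, N, rng)
cov = coverage(G, L, starts)

mean_cov = sum(cov) / G       # every read covers exactly L positions
uncovered = sum(1 for c in cov if c == 0)
print("mean coverage:", mean_cov)
print("bases covered by no read:", uncovered)
```

With wrap-around, every read covers exactly L positions, so the mean coverage is N*L/G regardless of where the reads land; the interesting quantity is how unevenly that coverage is spread.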

So there are three parameters: G, the length of the genome; L, the length of a read; and N, the number of reads. The question is how these should depend on each other. David presented a theoretical analysis of the reconstruction question under rather restrictive assumptions (bases are i.i.d., reads are noiseless) and showed that there is a threshold on the number of reads for successful reconstruction with high probability. That is, there is a number C such that N = G/C reads can reconstruct the sequence with high probability. The number C depends on L via L/\log G.
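As a back-of-the-envelope check on why a quantity like L/\log G shows up, here is a coverage calculation. To be clear, this is not the threshold from the talk: it only asks that every base be seen by at least one read, which is necessary for reconstruction but not sufficient. I am assuming a circular genome (so each read covers a fixed base with probability exactly L/G) and natural logarithms.

```latex
\begin{align*}
\Pr[\text{a given read covers a fixed base}] &= \frac{L}{G}, \\
\mathbb{E}[\text{coverage of a fixed base}] &= \frac{NL}{G}, \\
\Pr[\text{all } N \text{ reads miss a fixed base}] &= \left(1 - \frac{L}{G}\right)^{N} \approx e^{-NL/G}, \\
G\, e^{-NL/G} \le 1 \quad &\Longleftrightarrow \quad N \ge \frac{G \ln G}{L} = \frac{G}{L / \ln G}.
\end{align*}
```

The left side of the last line approximates the expected number of uncovered bases, so under this heuristic the number of reads needs to scale like G divided by L/\ln G, which has the same form as N = G/C.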

This model is very simple and is clearly not an accurate model for genome sequencing. David began his talk by drawing a grand analogy to the DMC: his proposition is that this approach will be the information theory of genome sequencing. I have to say that this sounds appealing, but it looks at a rather specific problem that arises in NGS, namely assembly. This is a first step towards building one abstract theory for sequencing and while the model may be simple, the results are non-trivial. David also presented some evidence about how real DNA sequences have features (long repeats) which cause problems for greedy assemblers but can be handled by more complex assemblers based on de Bruijn graphs. They can also handle noise in the form of i.i.d. erasures. What this analysis seems to do is point to features of the data that are problematic for assembly from an information-theoretic standpoint. It is an analysis of the technological process of NGS rather than saying that much about biology.

I’ve been working for the last year (more off than on, alas) on how to understand certain types of NGS data from a statistical viewpoint. I’ll probably write more about that later when I get some actual understanding. But a central lesson I’ve taken from this is that the situation is quite a bit different than it was when Shannon made a theory of communication that abstracted existing communication systems. We don’t have nearly as good an understanding of NGS data from an engineering standpoint, and the questions we want to answer from this data are also unclear. Assembly is one thing, but if nothing else, this theoretical analysis shows that the kind of data we have is often insufficient for “real” assembly. This agrees with practice, as many assemblers produce large chunks of DNA, called contigs, rather than the full organism genome. There are many interesting statistical questions to explore in this data: what can we answer from the data without assembling organisms?

Allerton margin woes

Apparently the specifications for the Allerton camera-ready copy insist on different page margins for the first page: 1 inch on top and 0.75 inches on the sides and bottom, whereas it’s 0.75 inches all around for the other pages. I submitted the paper and was told I had margin errors, so I downloaded the draft copy and, lo and behold, got an annotated PDF with gray boxes and arrows showing that every page had margin violations. How could that be, I thought?

It seemed that this is different from the default \documentclass[conference,letterpaper]{IEEEtran} options. After hacking around a bit I came across this hack, but when I used \usepackage[showframe,paper=letterpaper,margin=0.75in]{geometry} the line-frames around each page showed that all of my text was inside the margins.

Either PaperCept is altering my PDF or it is not capable of calculating margins correctly, or the geometry package is not working. Given the problems I’ve had with PaperCept in the past, I’m guessing that it’s not the latter. My only hack was to use the geometry package to set the margins of all pages to 1 inch.

Did anyone else have these weird margin issues?

Update: Upon further checking, it appears that the margins enforced by PaperCept are not the margins for \documentclass[conference,letterpaper]{IEEEtran} but instead most likely for IEEEconf, which has (I believe) been deprecated. Yikes!

Update 2: This is, in fact, my fault, since the documentation says to use IEEEconf.cls. I am still confused as to why that’s the standard for Allerton. Also, this is the 2002 version of IEEEconf.cls, and there are even newer versions than that. Sigh.

Domingos on what you should know about machine learning

Dhruv Batra forwarded this Communications of the ACM article by Pedro Domingos, entitled “A Few Useful Things to Know about Machine Learning” [free version]. The main point from the abstract is:

However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

The article focuses on the classification problem to illustrate these “key lessons.” It’s well worth reading, especially for people who don’t work on machine learning, because it explains a number of important issues.

  1. It illustrates the gap between what the theory/research works on and the nitty-gritty of applying these algorithms to real data.
  2. It gives people who want to implement an ML method important fundamental questions to ask before starting : how do I represent my data? How do I evaluate performance? How do I do things efficiently? These have to get squared away first.
  3. Domain knowledge and feature engineering are the keys to success.

Since I’m guessing there are 2 machine learners who read this blog, go read it (unless you are one of my friends who doesn’t care about all of these technical posts).

Tracks : language is overrated

  1. Cliché Intro — Prefuse 73
  2. Nanorobot Tune — Tomas Dvorak, Machinarium Soundtrack
  3. Endorphin — Burial
  4. Missionary Ridge — William Tyler
  5. Hey-Hee-Hi-Ho — Medeski, Martin & Wood
  6. Soutoukou — Mamadou Diabate
  7. Rustem — Taraf de Haidouks
  8. Snowden’s Jig — Carolina Chocolate Drops
  9. Hashmal — Masada
  10. Captain Hook — Mar Caribe
  11. Black Unstoppable — Nicole Mitchell
  12. Stop Time — Horace Silver
  13. Pickin’ Up The Cabbage — Cab Calloway
  14. Smedley’s Melody — Squarepusher
  15. Baraat To Nowhere — Red Baraat
  16. Lou courut — Véronique Gens w/Orchestre National de Lille-Région Nord
  17. Saudade Dada — Arrigo Barnabé
  18. Watermelon Man — Mongo Santamaria
  19. Greensleeves — Matthew Shipp
  20. Clapping Music — Steve Reich/The Sixteen
  21. music for morning people — Kid Koala