October 2012


At DIMACS, I got a notice about a workshop coming up here in November, with a registration deadline of November 5: the DIMACS Workshop on Information-Theoretic Network Security, organized by Yingbin Liang and Prakash Narayan. It should be worth checking out — they have a nice slate of talks.

If you do come though, don’t stay at the Holiday Inn — go for The Heldrich or a Hyatt or somewhere within walking distance of restaurants. I think I almost got run over going to Walgreens yesterday in this land of strip malls…

The Harvard Center for Research on Computation and Society (CRCS) solicits applications for its Postdoctoral Fellows and Visiting Scholars Programs for the 2013-2014 academic year. Postdoctoral Fellows are given an annual salary of approximately $60,000 for one year (with the possibility of renewal) to engage in a program of original research, and are provided with additional funds for travel and research support. Visiting Scholars often come with their own support, but CRCS can occasionally offer supplemental funding.

We seek researchers who wish to interact with both computer scientists and colleagues from other disciplines, and have a demonstrated interest in connecting their research agenda with societal issues.  We are particularly interested in candidates with interests in Economics and Computer Science, Health Care Informatics, Privacy & Security, and/or Technology & Accessibility, and those who may be interested in engaging in one of our ongoing/upcoming projects:

  • Intelligent, Adaptive Systems for Health Care Informatics
  • Language-Based Security
  • Personalized Accessibility
  • Privacy and Security in Targeted Advertising
  • Privacy Tools for Sharing Research Data
  • Trustworthy Crowdsourcing

Harvard University is an Affirmative Action/Equal Opportunity Employer. We are particularly interested in attracting women and underrepresented groups to participate in CRCS.  For further information about the Center and its activities, see http://crcs.seas.harvard.edu/.

Application Procedure

A cover letter, CV, research statement, copies of up to three research papers, and up to three letters of reference should be sent to:

Postdoctoral Fellows and Visiting Scholars Programs
Center for Research on Computation and Society
crcs-apply@seas.harvard.edu

Referees for postdoctoral fellow applicants should send their letters directly; Visiting Scholar applicants may instead provide a list of references rather than having letters sent. The application deadline for full consideration is December 16, 2012.

There is a joint postdoc opportunity between Princeton and Arizona State University (working with H. Vincent Poor and Lalitha Sankar). They are looking for a postdoc with a strong background in information theory and/or statistical signal processing. The position provides two distinct opportunities:

(i) to work on mathematical models for the large data sets generated in the smart grid along with communication and compression algorithms for secure and privacy-guaranteed distributed processing of data. This opportunity will include working closely with NAE members in power systems at ASU as well as information systems researchers.

(ii) to work with researchers at Princeton University for the other half of the postdoctoral tenure.

If interested, please contact Lalitha Sankar with a CV (lalithasankar@asu.edu) for more details.

I’m at DIMACS for the Workshop on Differential Privacy. Given the lack of blogging about Allerton talks that I did, we’ll see what I manage to write about here, but stay tuned…

One of the things Latanya Sweeney mentioned during her talk at the iDash workshop is a new project called theDataMap, which is trying to visualize how personal information about individuals flows through institutions. One of the starting points is an older map which shows how a putative hospital patient Alice’s personal information is accessed and used by a number of entities of whom she is likely unaware, including pharma companies, her employer, and medical researchers.

This is analogous to a map Lee Tien sent me, also from a report a few years ago, on how private medical information flows look in California.

It’s worth looking at and thinking a bit about how we balance privacy and utility/profit at the moment, and whether generally erring on the side of sharing is the best way to go.

I’m a reviewer for ICML 2013, which has a novel submission format this year. Papers for the first cycle were due October 1. They received significantly more submissions than anticipated, but I was only assigned papers to review today, more than two weeks later. We have been given two weeks to submit reviews — given my stack, that’s two weeks to review ~60 pages of material.

I may be going out on a limb here, but I think that the review quality is not going to be that high this time. Perhaps this is a Mechanical Turk approach to the problem — get a bunch of cheap noisy labels and then hope that you can get a good label by majority vote?
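For what it’s worth, the Turk analogy is easy to simulate: if each reviewer were an independent noisy labeler, majority vote really would clean up the aggregate label. This is a toy sketch, not a claim about actual reviewer behavior — the 0.7 accuracy figure is made up:

```python
import random

def majority_label(true_label, n_reviewers, p_correct):
    """Each reviewer reports the true binary label with probability p_correct;
    aggregate the noisy labels by majority vote (ties count as wrong)."""
    votes = [true_label if random.random() < p_correct else 1 - true_label
             for _ in range(n_reviewers)]
    return int(sum(votes) > n_reviewers / 2)

# Monte Carlo estimate of how often majority vote recovers the true label.
random.seed(0)
trials = 10000
for k in (1, 3, 5):
    acc = sum(majority_label(1, k, 0.7) for _ in range(trials)) / trials
    print(f"{k} reviewers: accuracy ~ {acc:.3f}")
```

Of course, this only helps if reviewer errors are independent, which seems optimistic.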

Update: We’ve been given another week, hooray.

An initiative to prevent irreproducible science.

A video about Graham’s number.

I don’t tweet, but all of this debate seems ridiculous to me. I think the real issue is: who follows Twitter? I know Sergio is on Twitter, but is anyone else?

Food : An Atlas is a book project on kickstarter by people who do “guerrilla cartography.” It is about food, broadly construed. $25 gets you a copy of the book, and it looks awesome, especially if you like maps. And who doesn’t like maps?

I remember reading about the demise of the American Chestnut tree, but apparently it may make a comeback!

At Allerton I finally set up the eduroam network on my phone and laptop. It was great — with the UIUC system you had to log in with a special temporary ID and password each time, but with eduroam it would automatically connect and authenticate like any other normal wireless network.

Basically you use the same login/password as for other authenticated services on your campus. At Chicago it’s called your CNetID, but the credentials will be different from place to place. The key is that you validate to the network using those credentials and not some locally-given account.

It seems that the system has been expanding — if your institution doesn’t support it then you should ask them to do so. Of course, maybe we should just have more open networks, but at least with this you can get wifi on many campuses without having to deal with the bureaucratic overhead of the IT services.
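(For the curious: eduroam is just WPA2-Enterprise/802.1X under the hood. On Linux, a wpa_supplicant stanza might look something like the following — the EAP method and inner authentication vary by campus, and the identity and password below are placeholders for your own campus credentials.)

```
network={
    ssid="eduroam"
    key_mgmt=WPA-EAP
    eap=PEAP
    phase2="auth=MSCHAPV2"
    identity="myCNetID@uchicago.edu"
    password="my-campus-password"
}
```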

I’ll follow up with some blogging about the talks at Allerton a bit later — my notes are somewhat scattershot, so I might give a more cursory view than usual. Overall, I thought the conference was more fun than in previous years: the best Allerton to date. For now though, I’ll blog about the plenary.

David Tse gave the “director’s cut” of his recent work (with Motahari, Bresler, Bresler, and Ramchandran) on an information-theoretic model for next-generation sequencing (NGS). In NGS, many copies of a single genome are chopped up into short chunks, called reads, of say 100 base pairs each. The information theory model is a very simplified abstraction of this process — a read is generated by choosing a location in the genome uniformly at random and producing the 100 bases following that position. In NGS the reads overlap, and each nucleotide of the original genome may appear in many reads. The number of reads in which a base appears is called the coverage.

So there are three parameters: G, the length of the genome; L, the length of a read; and N, the number of reads. The question is how these should depend on each other. David presented a theoretical analysis of the reconstruction question under rather restrictive assumptions (bases are i.i.d., reads are noiseless) and showed that there is a threshold on the number of reads for successful reconstruction with high probability. That is, there is a number C such that N = G/C reads can reconstruct the sequence with high probability. The number C depends on L via L/\log G.
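As a sanity check on the model, it is trivial to simulate: draw N read positions uniformly and count how many reads hit each base. Since each read covers exactly L bases, the mean coverage is exactly NL/G by construction. A toy sketch with made-up parameter values:

```python
import random

random.seed(0)
G, L, N = 10_000, 100, 500   # genome length, read length, number of reads (made up)

# i.i.d. genome, per the model's assumption
genome = "".join(random.choice("ACGT") for _ in range(G))

# each read starts at a uniform position and covers the next L bases
starts = [random.randrange(G - L + 1) for _ in range(N)]

# coverage of a base = number of reads containing it
coverage = [0] * G
for s in starts:
    for i in range(s, s + L):
        coverage[i] += 1

# each read covers exactly L bases, so the mean coverage is exactly N*L/G
print(sum(coverage) / G)  # -> 5.0
```

The interesting question, of course, is not the mean but whether the random gaps and repeats let you stitch the reads back together.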

This model is very simple and is clearly not an accurate model for genome sequencing. David began his talk by drawing a grand analogy to the DMC — his proposition is that this approach will be the information theory of genome sequencing. I have to say that this sounds appealing, but it looks at a rather specific problem that arises in NGS, namely assembly. This is a first step towards building one abstract theory for sequencing, and while the model may be simple, the results are non-trivial. David also presented some evidence about how real DNA sequences have features (long repeats) which cause problems for greedy assemblers but can be handled by more complex assemblers based on de Bruijn graphs. These can also handle noise in the form of i.i.d. erasures. What this analysis seems to do is point to features of the data that are problematic for assembly from an information-theoretic standpoint — this is an analysis of the technological process of NGS rather than saying much about the biology.
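For intuition, the de Bruijn approach just chains overlapping k-mers: each k-mer becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and assembly amounts to finding an Eulerian path. A toy sketch — the reads and k are made up, and real assemblers are far more involved:

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Collect the k-mers appearing in the reads as edges from each
    (k-1)-mer prefix to the corresponding (k-1)-mer suffix."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

# Made-up overlapping reads from the sequence ACGTGCA. A long repeat in the
# genome would show up as a node with several distinct outgoing edges, which
# is exactly the kind of ambiguity that trips up greedy assemblers.
g = de_bruijn_edges(["ACGTG", "CGTGC", "GTGCA"], k=3)
print(dict(g))
```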

I’ve been working for the last year (more off than on, alas) on how to understand certain types of NGS data from a statistical viewpoint. I’ll probably write more about that later when I get some actual understanding. But a central lesson I’ve taken from this is that the situation is quite a bit different than it was when Shannon made a theory of communication that abstracted existing communication systems. We don’t have nearly as good an understanding of NGS data from an engineering standpoint, and the questions we want to answer from these data are also unclear. Assembly is one thing, but if nothing else, this theoretical analysis shows that the kind of data we have is often insufficient for “real” assembly. This is consistent with practice: many assemblers produce large chunks of DNA, called contigs, rather than the full genome of the organism. There are many interesting statistical questions to explore in these data — what can we answer without assembling whole genomes?

Apparently the specifications for the Allerton camera-ready copy insist on different page margins for the first page: 1 inch on top and 0.75 inches on the sides and bottom, whereas it’s 0.75 inches all around for the other pages. I submitted the paper and was told I had margin errors, so I downloaded the draft copy, and lo and behold, I got an annotated PDF with gray boxes and arrows showing that every page had margin violations. How could that be, I thought?

This seemed different from the default \documentclass[conference,letterpaper]{IEEEtran} options. After hacking around a bit I came across this hack, but when I used \usepackage[showframe,paper=letterpaper,margin=0.75in]{geometry}, the line-frames around each page showed that all of my text was inside the margins.

Either PaperCept is altering my PDF, or it is not capable of calculating margins correctly, or the geometry package is not working. Given the problems I’ve had with PaperCept in the past, I’m guessing it’s not the last of these. My only hack was to use the geometry package to set the margins of all pages to 1 inch.
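Concretely, the hack amounted to a preamble along these lines (a workaround, not a sanctioned fix):

```latex
\documentclass[conference,letterpaper]{IEEEtran}
% Override IEEEtran's layout and force 1in margins on every page,
% since that is what PaperCept's checker appeared to accept.
\usepackage[letterpaper,margin=1in]{geometry}
```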

Did anyone else have these weird margin issues?

Update: Upon further checking, it appears that the margins enforced by PaperCept are not the margins for \documentclass[conference,letterpaper]{IEEEtran} but instead most likely for IEEEconf, which has (I believe) been deprecated. Yikes!

Update 2: This is, in fact, my fault, since the documentation says to use IEEEconf.cls. I am still confused as to why that’s the standard for Allerton. Also, this is the 2002 version of IEEEconf.cls, and there are even newer versions than that. Sigh.
