Some old links I meant to post a while back, but which may still be of interest to some…

I prefer my okra less slimy, but to each their own.

Via Erin, A tour of the old homes of the Mission.

Also via Erin, Women and Crosswords and Autofill.

A statistician rails against computer science’s intellectual practices.

Nobel Laureate Randy Schekman is boycotting Nature, Science, and Cell. Retraction Watch is skeptical.

Harvard Business Review’s underhanded game

For our first-year seminar, we wanted to get the students to read some of the hyperbolic articles on data science. A classic example is the Harvard Business Review’s Data Scientist: The Sexiest Job of the 21st Century. However, when we downloaded the PDF version through the library proxy, we were informed:

Harvard Business Review and Harvard Business Publishing Newsletter content on EBSCOhost is licensed for the private individual use of authorized EBSCOhost users. It is not intended for use as assigned course material in academic institutions nor as corporate learning or training materials in businesses. Academic licensees may not use this content in electronic reserves, electronic course packs, persistent linking from syllabi or by any other means of incorporating the content into course resources

Harvard Business Publishing will be pleased to grant permission to make this content available through such means. For rates and permission, contact

So it seems that for a single article we’d have to pay extra, and since “any other means of incorporating the content” is also a violation, we couldn’t tell the students that they can go to the library website and look up an article in a publication whose name sounds like “Schmarbard Fizzness Enqueue” on sexy data science.

My first thought on seeing this restriction is that it would definitely not pass the fair use test, but then the fine folks at the American Library Association say that it’s a little murky:

Is There a Fair Use Issue? Despite any stated restrictions, fair use should apply to the print journal subscriptions. With the database however, libraries have signed a license that stipulates conditions of use, so legally are bound by the license terms. What hasn’t really been fully tested is whether federal law (i.e. copyright law) preempts a license like this. While librarians may like to think it does, there is very little case law. Also, it is possible that if Harvard could prove that course packs and article permission fees are a major revenue source for them, it would be harder to declare fair use as an issue and fail the market effect factor. In other cases as in Georgia State, the publishers could not prove their permissions business was that significant which worked against them. Remember that if Harvard could prove that schools were abusing the restrictions on use, they could sue.

Part of the ALA’s advice is to use “alternate articles to the HBR 500 supplied by other vendors that do not have these restrictions.” Luckily for us, there is no absence of hype on data science, so we could avoid it.

Given Harvard’s well-publicized open access policy and general commitment to sharing scholarly materials, the educational restriction on using materials strikes me as rank hypocrisy. Of course, maybe HBR is not really a venue for scholarly articles. Regardless, I would urge anyone considering including HBR material in their class to think twice before playing their game. Or to indulge in some civil disobedience, but this might end up hurting the libraries and not HBR, so it’s hard to figure out what to do.


A taste test for fish sauces.

My friend Ranjit is working on this Crash Course in Psychology. Since I’ve never taken psychology, I am learning a lot!

Apparently the solution for lax editorial standards is to scrub away the evidence. (via Kevin Chen).

Some thoughts on high performance computing vs. MapReduce. I think about this a fair bit, since some of my colleagues work on HPC, which feels like a different beast than a lot of the problems I’ve been thinking about.

A nice behind-the-scenes on Co-Op Sauce, a staple at Chicagoland farmers’ markets.


A map of racial segregation in the US.

Vi Hart explains serial music (h/t Jim CaJacob).

More adventures in trolling scam journals with bogus papers (h/t my father).

Brighten does some number crunching on his research notebook.

Jerry takes “disruptive innovation” to task.

Vladimir Horowitz plays a concert at the Carter White House. Also Jim Lehrer looks very young. The program (as cribbed from YouTube)

  • The Star-Spangled Banner
  • Chopin: Sonata in B-flat minor, opus 35, n°2
  • Chopin: Waltz in A minor, opus 34, n°2
  • Chopin: Waltz in C-sharp minor, opus 64, n°2
  • Chopin: Polonaise in A-flat major, opus 53, Héroïque
  • Schumann: Träumerei, Kinderszenen n°7
  • Rachmaninoff: Polka de W.R
  • Horowitz: Variations on a theme from Bizet’s Carmen

The Simons Institute is going strong at Berkeley now. Moritz Hardt has some opinions about what CS theory should say about “big data,” and how it might require some adjustments to ways of thinking. Suresh responds in part by pointing out some of the successes of the past.

John Holbo is reading Appiah and makes me want to read Appiah. My book queue is already a bit long though…

An important thing to realize about performance art that makes a splash is that it can often be exploitative.

Mimosa shows us what she sees.

Eisen’s comments on the future of scholarly publishing

Michael Eisen gave a talk at the Commonwealth Club in San Francisco recently. Eisen is the founder of the Public Library of Science (PLoS), which publishes a large number of open-access journals in the biosciences, including the amazingly named PLoS Neglected Tropical Diseases. His remarks begin with the background on the “stranglehold existing journals have on academic publishing.” But he also has this throwaway remark:

One last bit of introduction. I am a scientist, and so, for the rest of this talk, I am going to focus on the scientific literature. But everything I will say holds equally true for other areas of scholarship.

This is simply not true: one cannot generalize from one domain of scholarship to all areas of scholarship. In fact, it is in the differences between dysfunctions of academic communication across areas that we can understand what to do about it. It’s not just that this is a lazy generalization, but rather that, as Eisen paints it, in science the journals are parasitic entities more or less separate from the researchers. As such, there are no reasons that people should publish with academic publishers except for some kind of Stockholm syndrome.

In electrical engineering and computer science the situation is a bit different. IEEE and ACM are not just publishing conglomerates, but are supposed to be the professional societies for their respective fields. People gain professional brownie points for winning IEEE or ACM awards, they can “level up” by becoming Senior Members, and so on. Because disciplinary boundaries are a little more fluid, there are several different Transactions in which a given researcher may publish. At least on paper, IEEE and ACM are not-for-profit corporations. This is not to say that engineering researchers are not suffering from a Stockholm syndrome effect with these professional societies. It’s just that the nature of the beast is different, and when we talk about how IEEE Xplore or the ACM Digital Library is overpriced, that critique should be coupled with a critique of IEEE’s policy requiring conferences to have a certain profit level. These things are related.

The second issue I had is with Eisen’s proposed solution:

There should be no journal hierarchy, only broad journals like PLOS ONE. When papers are submitted to these journals, they should be immediately made available for free online – clearly marked to indicate that they have not yet been reviewed, but there to be used by people in the field capable of deciding on their own if the work is sound and important.

So… this already exists for large portions of mathematics and mathematical sciences and engineering in the form of ArXiV. The added suggestion is a layer of peer-review on top, so maybe ArXiV plus a StackExchange thing. Perhaps this notion is a radical shift for life sciences where Science and Nature are so dominant, but what I learn myself from looking at the ArXiV RSS feed is that the first drafts of papers that get put up there are usually not the clearest exposition of the work, and without some kind of community sanction (in the form of rejection), there is little incentive for authors to actually go back and make a cleaner version of their proof. If someone has a good idea or result but a confusing presentation they are not going to get downvoted. If someone is famous they are unlikely to get downvoted.

In the end what PLoS ONE and the ArXiV-only model for publishing do is reify and retrench the existing tit-for-tat “clubbiness” that exists in smaller academic communities. In a lot of CS conferences reviewing is double-blind as a way to address this very issue. When someone says “all academic publishing has the same problems” this misses the point, because the problem is not always with publishing but with communication. We need to understand how the way we communicate the products of scholarly knowledge is broken. In some fields, I bet you could argue that papers are inefficient and bad ways of communicating results. In this sense, academic publishing and its rapacious nature are just symptoms of a larger problem.

RAR : a cry of rage

I’ve been trying to prepare a camera-ready article for the Signal Processing Magazine and the instructions from IEEE include the following snippet:

*VERY IMPORTANT: All source files ( .tex, .doc, .eps, .ps, .bib, .db, .tif, .jpeg, …) may be uploaded as a single .rar archived file. Please do not attempt to upload files with extensions .shs, .exe, .com, .vbs, .zip as they are restricted file types.

While I have encountered .rar files before, I was not very familiar with the file format or its history. I didn’t know it’s a proprietary format — that seems like a weird choice for IEEE to make (although no weirder than PDF perhaps).

What’s confusing to me is that ArXiV manages to handle .zip files just fine. Is .tgz so passé now? My experience with RAR is that it is good for compressing (and splitting) large files into easier-to-manage segments. All of that efficiency seems wasted for a single paper with associated figures and bibliography files and whatnot.

I was trying to find the actual compression algorithm, but like most modern compression software, the innards are a fair bit more complex than the base algorithmic ideas. The Wikipedia article suggests it does a blend of Lempel-Ziv (a variant of LZ77) and prediction by partial matching, but I imagine there’s a fair bit of tweaking. What I couldn’t figure out is if there is a new algorithmic idea in there (like in the Burrows-Wheeler Transform (BWT)), or it’s more a blend of these previous techniques.
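For what it’s worth, the base LZ77 idea mentioned above fits in a few lines. This is a toy sketch of textbook LZ77 and emphatically not RAR’s actual algorithm, which is proprietary and (going by the Wikipedia description) layers a lot more on top. The function names, the window size, the maximum match length, and the (offset, length, next byte) triple format are all my own assumptions for illustration.

```python
def lz77_compress(data: bytes, window: int = 255, max_len: int = 15):
    """Toy LZ77: return a list of (offset, length, next_byte) triples."""
    out = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        # Search the sliding window before position i for the longest match.
        for j in range(max(0, i - window), i):
            length = 0
            # Leave at least one byte unmatched so every triple has a literal.
            # A match may run past position i (overlap), which encodes repeats.
            while (length < max_len
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples) -> bytes:
    """Invert lz77_compress by replaying back-references byte by byte."""
    buf = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            buf.append(buf[-off])  # copying one byte at a time handles overlap
        buf.append(nxt)
    return bytes(buf)


if __name__ == "__main__":
    text = b"abracadabra abracadabra abracadabra"
    triples = lz77_compress(text)
    assert lz77_decompress(triples) == text
    print(len(text), "bytes ->", len(triples), "triples")
```

Real compressors follow this step with entropy coding of the triples and use much cleverer match search; the sketch only shows the “point back at something you have already seen” trick.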

Anyway, this silliness means I have to find some extra software to help me compress. SimplyRAR for MacOS seems to work pretty well.

Results on petition for Increasing Public Access to the Results of Scientific Research

I signed a petition to the White House a while ago about increasing public access to government-funded research. If a petition gets 100,000 signatures then the White House will draft a response. Some of the petitions are silly, but generate amusing responses, cf. This Isn’t the Petition Response You’re Looking For on government construction of a Death Star. The old threshold was 60K, which the petition I signed passed. On Friday I got the official response from John Holdren, the Director of the White House Office of Science and Technology Policy. The salient bit is this one: