# Harvard Business Review’s underhanded game

For our first-year seminar, we wanted to get the students to read some of the hyperbolic articles on data science. A classic example is the Harvard Business Review’s Data Scientist: The Sexiest Job of the 21st Century. However, when we downloaded the PDF version through the library proxy, we were informed:

Harvard Business Review and Harvard Business Publishing Newsletter content on EBSCOhost is licensed for the private individual use of authorized EBSCOhost users. It is not intended for use as assigned course material in academic institutions nor as corporate learning or training materials in businesses. Academic licensees may not use this content in electronic reserves, electronic course packs, persistent linking from syllabi or by any other means of incorporating the content into course resources

Harvard Business Publishing will be pleased to grant permission to make this content available through such means. For rates and permission, contact permissions@harvardbusiness.org.

So it seems that for a single article we’d have to pay extra, and since “any other means of incorporating the content” is also a violation, we couldn’t tell the students that they can go to the library website and look up an article in a publication whose name sounds like “Schmarbard Fizzness Enqueue” on sexy data science.

My first thought on seeing this restriction was that it would definitely not pass the fair use test, but the fine folks at the American Library Association say that it’s a little murky:

Is There a Fair Use Issue? Despite any stated restrictions, fair use should apply to the print journal subscriptions. With the database however, libraries have signed a license that stipulates conditions of use, so legally are bound by the license terms. What hasn’t really been fully tested is whether federal law (i.e. copyright law) preempts a license like this. While librarians may like to think it does, there is very little case law. Also, it is possible that if Harvard could prove that course packs and article permission fees are a major revenue source for them, it would be harder to declare fair use as an issue and fail the market effect factor. In other cases as in Georgia State, the publishers could not prove their permissions business was that significant which worked against them. Remember that if Harvard could prove that schools were abusing the restrictions on use, they could sue.

Part of the ALA’s advice is to use “alternate articles to the HBR 500 supplied by other vendors that do not have these restrictions.” Luckily for us, there is no shortage of hype on data science, so we could avoid it.

Given Harvard’s well-publicized open access policy and general commitment to sharing scholarly materials, the educational restriction on using materials strikes me as rank hypocrisy. Of course, maybe HBR is not really a venue for scholarly articles. Regardless, I would urge anyone considering including HBR material in their class to think twice before playing their game. Or to indulge in some civil disobedience, but this might end up hurting the libraries and not HBR, so it’s hard to figure out what to do.

# Ethical questions in research funding: the case of ethics centers

I read a piece in Inside Higher Ed today on the ethics of accepting funds from different sources. In engineering, this is certainly an important issue, but the article focused on Cynthia Jones, an ethics professor at UT-Pan American who directs the PACE ethics center. Jones had this stunningly ignorant thing to say about Department of Defense funding:

“What the hell are we going to use lasers for except to kill people?” Jones said. “But scientists get cut the slack.”

I’m flabbergasted that someone who works on philosophy applied to a technological field, namely biomedical ethics, believes that the only use of lasers is to kill people. Perhaps she thinks that using lasers in surgery is unethical. Or, more likely, she is unaware of how basic research in science is actually funded in this country.

Certainly, there’s been a definite shift over time in how defense-related agencies have targeted their funds — they fund much less basic research (or basic applied research) and have focused more on deliverables and technologies that more directly support combat, future warriors, and the like. This presents important ethical questions for researchers who may oppose the use of military force (or how it has been used recently) but who are interested in problems that could be “spun” towards satisfying these new objectives from DARPA, ARO, ONR, and AFOSR. Likewise, there are difficult questions about the line between independent research and consulting work for companies who may fund your graduate students. Drawing sharp distinctions in these situations is hard — everybody has their own comfort zone.

Jones wrote an article on “Dirty Money” that tries to develop rules for when money is tainted and when it is not. She comes up with a checklist at the end of the article that says funds should not be accepted if they

1- are illegal or that operate illegally in one’s country, or when the funding violates a generally accepted doctrine signed by one’s country (keeping in mind there is sometimes a distinction between legally acceptable and morally acceptable); or
2- originate from a donor who adds controls that would conflict with the explicit or implicit goals of the project to be funded or that would conflict with the proper functioning of the project or the profession’s ethical guidelines.

This, she says, is “the moral minimum.” This framing, and the general problem of funding centers that she addresses, sidesteps the ethical questions around research that is funded by writing proposals, and indeed the question of soliciting funds. Even in the world of charitable giving, the idea that funders wander through the desert with bags of money searching for fundees seems odd. I think the more difficult ethical quandary is that of solicitation. At a “moral minimum” the fundee has to think about these questions, but I think point 2 needs a lot more unpacking because of the chicken-and-egg question of matching proposed research to program goals.

I don’t want to sound so super-negative! I think it’s great that someone is looking at the ethics of the economics of how we fund research. It’s just that there’s a whole murkier lake beyond the murky pond of funding centers, and the moral issues of science/engineering funding are not nearly as simple as Jones’s remark indicates.

# Towards multi-sensor characterizations of pianos

As an undergraduate I became interested in how timbre can be used to identify musical instruments. This was largely due to my first UROP (undergraduate research gig) with Keith Martin at the MIT Media Lab. Keith’s thesis was on identifying musical instruments from spectral features, and I worked a bit on this under Ryan Rifkin in a later UROP. I’ve been catching up on podcasts during my commute to campus this week, and a semi-recent Science Friday piece on the Steinway factory was on deck for this morning.

The piece talks about work in Agnieszka Roginska’s lab at NYU, and in particular work from a paper from last year on measuring radiation patterns in piano soundboards. The radiation patterns are pretty but a bit hard to interpret, largely because I’m way out of the acoustical signal processing world. However, what’s interesting to me is that we’re still largely focused on overtones/cepstral coefficients. I wonder about how one might discover more interesting features to characterize this data. (I know someone will suggest deep learning, but color me a little skeptical.)

As a side note, one of the recent popular articles from JASA is on the acoustics of coffee roasting.

I’ve been a bit bogged down upon getting back from traveling, but here are a few interesting technical tidbits that came through.

Aaron Roth and Cynthia Dwork’s Foundation and Trends monograph on differential privacy is now available.

Speaking of differential privacy, Shiva Kasiviswanathan and Adam Smith have a paper in the Journal of Privacy and Confidentiality on Bayesian interpretations of differential privacy risk.

Deborah Mayo has a post up on whether p-values are error probabilities.

Raymond Yeung is offering a Coursera course on information theory (via the IT Society).

A CS Theory take on Fano’s inequality from Suresh over at the GeomBlog.

# Teaching bleg: articles on “data” suitable for first-year undergraduates

My colleague Waheed Bajwa and I are teaching a Rutgers Byrne Seminar for first-year undergraduates this fall. The title of the course is Data: What is it Good For? (Absolutely Something), a reference which I am sure will be completely lost on the undergrads. The point of the course is to talk about “data” (what is it, exactly?), how it gets turned into “information,” and then perhaps even “knowledge,” with all of the pitfalls along the way. So it’s a good opportunity to talk about philosophy (e.g. epistemology), mathematics/statistics (e.g. undersampling, bias, analysis), engineering (e.g. storage, transmission), science (e.g. replication, retraction), and policy (e.g. privacy). It’s supposed to be a seminar class with lots of discussion, and the students can be expected to do a little reading outside of class. We have a full roster of 20 signed up, so managing the discussion might be a bit tricky, of course.

We’re in the process of collecting reading materials — magazine articles, book chapters, blog posts, etc. for the students to read. We explicitly didn’t want it to be for “technical” students only. Do any readers of the blog have great articles suitable for first-year undergrads across all majors?

As the class progresses I will post materials here, as well as some snapshot of the discussion. It’s my first time teaching a class of this type (or indeed any undergraduates at Rutgers) so I’m excited (and perhaps a bit nervous).

On a side note, Edwin Starr’s shirt is awesome and I want one.

# ResearchGate: spam scam, or…?

I’ve been getting fairly regular automated emails lately from ResearchGate, which has pull-quotes from Forbes and NPR saying it’s changing the way we do research blah blah blah. However, all empirical reports I have heard indicate that once you join, it repeatedly spams all of your co-authors with requests to join, which makes it feel a bit more like Heaven’s Gate.

On a less grim note, the site’s promise to make your research “more visible” sounds a bit like SEO spam. Given the existence of Google Scholar, which is run by the SE that one would like to O, it seems slightly implausible.

Any readers want to weigh in on whether ResearchGate has been useful to them? Or is this mostly for people who don’t know how to make their own homepage with their papers on it (which is probably most faculty)?

Inverted World [Christopher Priest]. A science-fiction novel, but of a piece with a writer like M. John Harrison — there’s a kind of disconnect and a focus on the conceptual world building rather than the nitty-gritty you get with Iain M. Banks. To avoid spoilers, I’ll just say it’s set in a city which moves through the world, always trying to be at a place called optimum. The city is on rails — it constantly builds fresh tracks ahead of it and winches itself forward a tenth of a mile per day. The city is run by a guild system of track layers, traction experts, bridge builders, surveyors, and the like. The protagonist, Helward Mann, takes an oath and joins a guild as an apprentice. The book follows his progress as he learns, and we learn, more about the strange world through which the city moves. Recommended if you like heady, somewhat retro, post-apocalyptic conceptual fiction.

Luka and the Fire of Life [Salman Rushdie]. A re-read for me, this didn’t hold up as well the second time around. I much prefer Haroun and the Sea of Stories, which I can read over and over again.

Boxers and Saints [Gene Luen Yang]. A great two-part graphic novel about the Boxer Rebellion in China. Chances are you don’t know much about this history. You won’t necessarily get a history lesson from this book, but you will want to learn more about it.

The Adventures of Augie March [Saul Bellow]. After leaving Chicago I have decided to read more books set in Chicago so that I can miss it more. I had read this book before but it was a rushed job. This time I let myself linger a bit more over Bellow’s language. It’s epic in scope and gave me a view of Chicago and the Great Depression that I hadn’t had before. Indeed, given our current economic woes, it was an interesting comparison to see the similarities (the rich are still pretty rich, and if you can get employed by them, you may do ok) and the dissimilarities.

The Idea Factory: Bell Labs and the Great Age of American Innovation [Jon Gertner]. A history of Bell Labs and a must-read for researchers who work on anything related to computing, communications, or applied physics and chemistry. It’s not all rah-rah, and while Gertner takes the “profiles of the personalities” approach to writing about the place, I am sure there will be things in there that would surprise even the die-hard Shannonistas who may read this blog…

# SPCOM 2014: some more talks (and a plenary)

I did catch Greg Wornell’s plenary at SPCOM, which was called When Bits Absolutely, Positively, Have to be There as Soon as Possible, a riff on this FedEx commercial, which is older than I am. The talk was on link-aware PHY-layer design: basically looking at how ARQ enables incremental redundancy, and how to do a sort of layered superposition + incremental redundancy scheme in the sequential setting as well as a “multi-path” setting where blocks can arrive out of order. This was really digging into the signal issues in a way that a lot of non-communication engineering information theorists may get squeamish about. The nice thing is that I think the engineering problem is approachable without knowing a lot of heavy-duty math, but still requires some careful analysis.

Communication and Compression Via Sparse Linear Regression
Ramji Venkataramanan
This was on building codewords and codebooks out of a lower-complexity code dictionary $A \in \mathbb{R}^{n \times ML}$ where each codeword is a superposition of $L$ columns, one each from groups of size $M$. Thus encoding is $A \beta$ where $\beta$ is a sparse vector. I saw a talk by Barron and Joseph from a previous ISIT about this, but the framework extends to rate distortion (achieving the rate distortion function), and channel coding. The main point is to lower the complexity of the code at the expense of the gap to optimal rate — encoding and decoding are polynomial time but the rate gap for rate-distortion goes to zero as $1/\log n$. Ramji gave a really nice and clear talk on this — I hope he puts the slides up!
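
To make the codebook structure concrete, here is a minimal sketch of the encoder only, in plain Python. The Gaussian dictionary, the unit value of the nonzero entries, and the helper names (`make_dictionary`, `make_beta`, `encode`) are my own assumptions for illustration; the talk's actual design and decoder are not reproduced here.

```python
import random

def make_dictionary(n, M, L, seed=0):
    """Return an n x (M*L) matrix with i.i.d. Gaussian entries (an assumed design)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(M * L)] for _ in range(n)]

def make_beta(choices, M, value=1.0):
    """Build the sparse vector beta: section l has one nonzero at offset choices[l]."""
    beta = [0.0] * (M * len(choices))
    for section, offset in enumerate(choices):
        if not 0 <= offset < M:
            raise ValueError("offset must lie in [0, M)")
        beta[section * M + offset] = value
    return beta

def encode(A, beta):
    """Compute the codeword A * beta using plain lists."""
    return [sum(a * b for a, b in zip(row, beta)) for row in A]

n, M, L = 8, 4, 3
A = make_dictionary(n, M, L)
choices = [2, 0, 3]            # the message: one column index per group of size M
beta = make_beta(choices, M)
codeword = encode(A, beta)

# One column per group gives M**L codewords, i.e. a rate of L*log2(M)/n bits per sample.
num_codewords = M ** L
```

The dictionary has only $nML$ entries while indexing $M^L$ codewords, which is where the complexity savings come from.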

An Optimal Varentropy Bound for Log-Concave Distributions
Mokshay’s talk was also really clear and excellent. For a distribution $f(X)$ on $\mathbb{R}^n$, we can define $\tilde{h}(X) = - \log f(X)$. The entropy is the expectation of this random variable, and the varentropy is the variance. Their main result is an upper bound on the varentropy of log-concave distributions $f(X)$. To wit, $\mathrm{Var}(\tilde{h}) \le n$. This bound doesn’t depend on the distribution and is sharp if $f$ is a product of exponentials. They then use this to prove a universal bound on the deviation of $\tilde{h}$ from its expectation, which gives an AEP that doesn’t really assume anything about the joint distribution of the variables except for log-concavity. There was more in the talk, but I eagerly await the paper.
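
As a sanity check on the bound, here is a small Monte Carlo sketch (my own, not from the talk) that estimates the varentropy for two log-concave densities: a product of unit-rate exponentials, where $-\log f(x) = \sum_i x_i$ and the varentropy is exactly $n$, and a standard Gaussian, where it works out to $n/2$. The sample size, seed, and function names are arbitrary choices.

```python
import math
import random
import statistics

def varentropy_estimate(neg_log_density, sampler, trials, rng):
    """Monte Carlo estimate of Var(-log f(X)) from `trials` independent samples."""
    values = [neg_log_density(sampler(rng)) for _ in range(trials)]
    return statistics.pvariance(values)

n = 5
rng = random.Random(1)

def exp_sampler(r):
    """Draw from a product of n unit-rate exponentials."""
    return [r.expovariate(1.0) for _ in range(n)]

def exp_neg_log_density(x):
    """-log f(x) = sum of coordinates for the product of unit-rate exponentials."""
    return sum(x)

def gauss_sampler(r):
    """Draw from a standard Gaussian on R^n."""
    return [r.gauss(0.0, 1.0) for _ in range(n)]

def gauss_neg_log_density(x):
    """-log f(x) = (n/2) log(2 pi) + |x|^2 / 2 for the standard Gaussian."""
    return 0.5 * n * math.log(2 * math.pi) + 0.5 * sum(v * v for v in x)

v_exp = varentropy_estimate(exp_neg_log_density, exp_sampler, 50_000, rng)
v_gauss = varentropy_estimate(gauss_neg_log_density, gauss_sampler, 50_000, rng)
```

Up to sampling noise, `v_exp` should sit near $n$ (the sharp case) and `v_gauss` near $n/2$, comfortably inside the bound.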

Event-triggered Sampling and Reconstruction of Sparse Real-valued Trigonometric Polynomials
Neeraj Sharma; Thippur V. Sreenivas
This was on non-uniform sampling where the sampler tries to detect level crossings of the analog signal and samples at those points. The resulting sample times may not be uniform enough to use existing nonuniform sampling techniques. They come up with a method for reconstructing signals which are real-valued trigonometric polynomials with only a few nonzero coefficients (i.e. sparse), and it seems to work pretty decently in experiments.
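
Here is a rough sketch of the sampling side only, to show what event-triggered samples look like for a sparse trigonometric polynomial. The grid-plus-bisection crossing detector, the example coefficients, and the level are all my own assumptions; the reconstruction method from the paper is not attempted.

```python
import math

def trig_poly(t, coeffs):
    """Real trigonometric polynomial: sum of a_k cos(2 pi k t) + b_k sin(2 pi k t)."""
    return sum(a * math.cos(2 * math.pi * k * t) + b * math.sin(2 * math.pi * k * t)
               for k, (a, b) in coeffs.items())

def level_crossings(signal, level, t_start, t_end, grid=10_000, tol=1e-10):
    """Locate times where signal(t) crosses `level`: scan a dense grid, then bisect.

    Crossings closer together than one grid cell can be missed, and a tangential
    touch from below may be reported twice.
    """
    def below(t):
        return signal(t) < level

    times = []
    step = (t_end - t_start) / grid
    for i in range(grid):
        lo, hi = t_start + i * step, t_start + (i + 1) * step
        lo_below = below(lo)
        if lo_below == below(hi):
            continue                      # same side of the level at both ends
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if below(mid) == lo_below:
                lo = mid
            else:
                hi = mid
        times.append(0.5 * (lo + hi))
    return times

# A sparse example: only frequencies 3 and 7 carry nonzero coefficients.
coeffs = {3: (0.0, 1.0), 7: (0.5, 0.0)}
level = 0.25

def x(t):
    return trig_poly(t, coeffs)

times = level_crossings(x, level, 0.0, 1.0)
samples = [(t, level) for t in times]     # event-triggered samples: (time, value)
```

Every sample carries the same amplitude, so all of the information is in the crossing times, and their spacing is dictated by the signal rather than by a clock.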

Removing Sampling Bias in Networked Stochastic Approximation
Vivek Borkar; Raaz Dwivedi
In networked stochastic approximation, the intermittent communication between nodes may mean that the system tracks a different ODE than the one we want. By modifying the method to account for “local clocks” on each edge, we can correct for this, but we end up with new conditions on the step size to make things work. I am pretty excited about this paper, but as usual, my notes were not quite up to getting the juicy bits. That’s what paper reading is for.

On Asymmetric Insertion and Deletion Errors
Ankur A. Kulkarni
The insertion/deletion channel model is notoriously hard. Ankur proposed a new model where $0$’s are “indestructible”: they cannot be inserted or deleted. This asymmetric model leads to new asymptotic bounds on the capacity. I don’t really work on this channel model so I can’t get the finer points of the results, but one nice takeaway was that asymptotically, each indestructible $0$ in the codeword lets us correct around $1/2$ a deletion more.

# A teaser for ITAVision 2015

As part of ITAVision 2015 we are soliciting individuals and groups to submit videos documenting their love of information theory and/or its applications. During ISIT we put together a little example with our volunteers (it sounded better in rehearsal than at the banquet, alas). The song was Entropy is Awesome based on this, obviously. If you want to sing along, here is the Karaoke version:

The lyrics (so far) are:

Entropy is awesome!
Entropy is sum minus p log p
Entropy is awesome!
When you work on I.T.

Blockwise error vanishes as n gets bigger
Maximize I X Y
Polarize forever
Let’s party forever

I.I.D.
I get you, you get me
Communicating at capacity

Entropy is awesome…

This iteration of the lyrics is due to a number of contributors — truly a group effort. If you want to help flesh out the rest of the song, please feel free to email me and we’ll get a group effort going.

More details on the contest will be forthcoming!

# SPCOM 2014: some talks

Relevance Singular Vector Machine for Low-rank Matrix Sensing
Martin Sundin; Saikat Chatterjee; Magnus Jansson; Cristian Rojas
This talk was on designing Bayesian priors for sparse-PCA problems — the key is to find a prior which induces a low-rank structure on the matrix. The model was something like $y = A \mathrm{vec}(X) + n$ where $X$ is a low-rank matrix and $n$ is noise. The previous state of the art is by Babacan et al., a paper which I obviously haven’t read, but the method they propose here (which involved some heavy algebra/matrix factorizations) appears to be competitive in several regimes. Probably more of interest to those working on Bayesian methods…
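
For readers who have not seen the matrix sensing setup, here is a minimal sketch of the measurement model $y = A \mathrm{vec}(X) + n$ in plain Python. The Gaussian factors, Gaussian sensing matrix, column-major `vec`, and all sizes are assumptions of mine; the Bayesian estimator from the talk is not implemented.

```python
import random

def low_rank_matrix(p, q, r, rng):
    """X = U V^T with Gaussian U (p x r) and V (q x r), so rank(X) <= r."""
    U = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(p)]
    V = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(q)]
    return [[sum(U[i][k] * V[j][k] for k in range(r)) for j in range(q)]
            for i in range(p)]

def vec(X):
    """Stack the columns of X into one long list (column-major order)."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]

def sense(A, X, noise_std, rng):
    """Return y = A vec(X) + n with i.i.d. Gaussian noise of the given std."""
    x = vec(X)
    return [sum(a * v for a, v in zip(row, x)) + rng.gauss(0.0, noise_std)
            for row in A]

rng = random.Random(0)
p, q, r, m = 4, 3, 1, 6        # m < p*q: fewer measurements than matrix entries
X = low_rank_matrix(p, q, r, rng)
A = [[rng.gauss(0.0, 1.0) for _ in range(p * q)] for _ in range(m)]
y = sense(A, X, noise_std=0.1, rng=rng)
```

The interesting regime is $m < pq$, where recovering $X$ from $y$ is only possible because of the low-rank structure.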

Non-Convex Sparse Estimation for Signal Processing
David Wipf
More Bayesian methods! David (who I met at ICML) was not trying to say that the priors are particularly “correct,” but rather that the penalty functions they induce on the problems he is studying actually make sense. More of an algorithmist’s approach, you might say. He set up the problem a bit more generally, considering optimization problems of the form
$\min_{X_i} \sum_{i} \alpha_i \mathrm{rank}[X_i] \quad \text{subject to} \quad Y = \sum_{i} A_i(X_i)$
where $A_i$ are some operators. He made the case that convex relaxations of many of these problems, while analytically beautiful, have restrictions which are not satisfied in practice, and indeed they often have poor performance. His approach is via Empirical Bayes, but this leads to non-convex problems. What he can show is that the algorithm he proposes is competitive with any method that tries to separate the error from the “low-rank” constraint, and that the new optimization is “smoother.” I’m sure more details are in his various papers, for those who are interested.

PCA-HDR: A Robust PCA Based Solution to HDR Imaging
Vinod showed some information theoretic approaches to understanding how much communication is needed for secure computation protocols like remote oblivious transfer: Xavier has $\{X_0, X_1\}$, Yvonne has $Y \in \{0,1\}$ and Zelda wants $Z = X_Y$, but nobody should be able to infer each other’s values. Feige, Kilian, and Naor have a protocol for this, which Vinod and Co. can show is communication-optimal. There were several ingredients here, including cut-set bounds, distribution switching, data processing inequalities, and special bounds for 3-party protocols. More details in his CRYPTO paper (and others).
In a MIMO wiretap setting, if the receiver has more antennas than the transmitter, then the transmitter can send noise in the nullspace of the channel matrix of the direct channel — as long as the eavesdropper has fewer antennas than the transmitter then secure transmission is possible. In this paper they show that positive secrecy capacity is possible even when the eavesdropper has more antennas, but as the number of eavesdropper antennas grows, the achievable rate goes to $0$. Perhaps a little bit of a surprise here!