Truth in surveying

A few weeks ago I attended Scott Kominers’s class on Market Design. They were talking about mechanism design and differential privacy, so I felt it would be fun to attend that session. In the class Scott mentioned some interesting work by Nicolas Lambert and Yoav Shoham on truthful surveys that appeared at WINE 2008. There’s also some recent work by Aaron Roth and Grant Schoenebeck up on the arXiv.

In Lambert and Shoham’s setup, the opinion distribution of a population is given by a CDF F(x) (with a density) on the unit interval [0,1]. We can think of x as a level of approval (say, of a politician) and F(x) as the proportion of the population with approval less than x. A surveyor selects n agents \{x_i\} i.i.d. from F and asks them to report their opinions. They can report anything they like, however, so they will report \{r_i\}. To incentivize them, the surveyor issues a payment \Pi_i( r_1, \ldots, r_n ) to each agent i. How should we structure the payments to incentivize truthful reporting? In particular, can we design a mechanism in which being truthful is a Nash equilibrium (“accurate”) or the only Nash equilibrium (“strongly accurate”)?

Let A_i = |\{j : r_i < r_j \}| and B_i = |\{j : r_i > r_j \}|. They propose partitioning the agents into k groups, with \mathcal{G}_i denoting the group of agent i, and \tilde{F}_i(x) an unbiased estimator of F(x) built from the points \{r_j : \mathcal{G}_j \ne \mathcal{G}_i \}. The payments are:

\Pi_i(\{r_j\}) = \frac{1}{|\mathcal{G}_i| - 1} \left[ A_i - B_i \right] + 2 \tilde{F}_i(r_i) - \frac{2}{|\mathcal{G}_i| - 1} \sum_{j \in \mathcal{G}_i \setminus \{i\} } \tilde{F}_j(r_j)

This mechanism is accurate and also permutation-invariant with respect to the agents (“anonymous”) and the sum of the payments is 0 (“budget-balanced”).
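As a rough sanity check, here is a small NumPy sketch of the payment rule. The function name is mine, I use the empirical CDF of out-of-group reports as the unbiased estimator \tilde{F}_i, and I count A_i and B_i over all agents as in the definition above — the paper’s exact conventions may differ.

```python
import numpy as np

def survey_payments(r, groups):
    """Sketch of the group-based survey payment rule described above.

    r      : reports in [0, 1], one per agent
    groups : group label for each agent
    """
    r = np.asarray(r, dtype=float)
    groups = np.asarray(groups)
    n = len(r)

    def F_tilde(i, x):
        # empirical CDF built only from agents outside i's group;
        # unbiased for F(x) when those agents report truthfully
        return np.mean(r[groups != groups[i]] <= x)

    pi = np.zeros(n)
    for i in range(n):
        mates = [j for j in range(n) if groups[j] == groups[i] and j != i]
        g = len(mates) + 1                   # |G_i|
        A = np.sum(r[i] < r)                 # reports above r_i
        B = np.sum(r[i] > r)                 # reports below r_i
        pi[i] = ((A - B) / (g - 1)
                 + 2 * F_tilde(i, r[i])
                 - (2 / (g - 1)) * sum(F_tilde(j, r[j]) for j in mates))
    return pi
```

Simulating many draws with truthful reports r_i = x_i versus shifted reports is a quick way to eyeball the incentive property.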

This is an instance of a more general mechanism for truthfully inducing samples from a collection of distributions that are known — each agent has a distribution F_i and you want to get their sample of that distribution. Here what they do is replace the known distributions with empirical estimates, in a sense. Why is this only accurate and not strongly accurate? It is possible that the agents could collude and pick a different common distribution G and report values from that. Essentially, each group has an incentive to report from the same distribution and then globally the optimal thing is for all the groups to report from the same distribution, but that distribution need not be F if there is global collusion. How do we get around this issue? If there is a set of “trusted” agents \mathcal{T}, then the estimators in the payment model can be built using the trusted data and the remaining untrusted agents can be put in a single group whose optimal strategy is now to follow the trusted agents. That mechanism is strongly accurate. In a sense the trusted agents cause the population to “gel” under this payment strategy.

It seems that Roth and Schoenebeck are not aware of Lambert and Shoham’s work, or consider it sufficiently unrelated (they certainly don’t cite it). They also look at truth in surveying from a mechanism design perspective. Their model is somewhat more involved (and has Bayesian bits), but may be of interest to readers who like auction design.


Congratulations to my fellow Beast Amitha Knight on being a co-winner of the 2012 PEN New England Susan P. Bloom Children’s Book Discovery Award!

Speaking of children’s books, some people who saw The Hunger Games movie are upset that Rue is black. Unsurprising but sad.

And speaking of friends, my friend Amber is slumming it in Antarctica and is writing some fascinating blog posts from down there.

Can Ellen Do More Push-Ups Than Michelle Obama? They both seem to be able to do more pushups than me. Time to hit the gym I think.

I’ve been eating this spicy peanut noodle salad for lunch this week and boy is it delicious.

Typical review loads

Since becoming faculty at TTI, I’ve started to better appreciate the tensions of service commitments, and I can see how many people come to view reviewing as a chore, a burden they must bear to maintain goodwill in the “community.” Since I work in a few different communities now, I end up reviewing papers from a lot of different areas: information theory and signal processing of course, but also machine learning, security, and networks. There’s been a distinct uptick in my reviewing queue, which I find somewhat alarming.

Looking back, I did a quick calculation and in the almost 6 months I’ve been here, I’ve either finished or committed to reviewing 9 journal papers and 16 conference papers. These numbers don’t really mean too much, because some journal papers are shorter (e.g. a correspondence) and some conference papers are long (40+ pages including supplementary material). Page numbers also don’t really help because of formatting differences. I’m hoping my new iPad (ooh, shiny!) will let me pack in some reviewing time during my commute and stop me from killing so many trees.

However, I have no idea if these numbers are typical. I’ve turned down review requests because I felt like I didn’t have enough time as it is. So readers: what’s a typical review load like? Should I just suck it up and accept more reviews?

Note that I’m not asking about what’s “fair” in terms of I submit N papers and therefore should review 3N or something like that. Those games are fine and all, but I really wonder what the distribution of review load is across individuals for a given journal. More on that point later…

Update: I should be clear that being on a PC will clearly cause your review load to go up. I am on 2 PCs but for smaller conferences; having 10+ ISIT reviews would add significantly to one’s total load.

Updated Perl script for merging TeX files for arXiv

Manu Sridharan (blog) left a comment the other day on my old post about my script to merge multiple TeX files (and strip the comments) for posting to the arXiv. He’s created a git repository for it, which seems so much more official and stuff. It’s at:

Thanks a bunch, Manu!

As a side note, Péter Gács has a de-macro script to eliminate all of your private macros if you’re so inclined.


I’ve been refraining from talking about the Dharun Ravi case, because it’s pretty complicated. On the one hand, after reading the New Yorker article and other material, it’s pretty clear Dharun is a grade-A jerk. And Tyler Clementi’s death was a terrible tragedy. But on the other hand, 10 years in prison is a serious thing, as Ta-Nehisi Coates points out. Ashvin shared a link to a blog post on “Deporting Homophobia”:

I have been Tyler and Dharun in a post 9/11 U.S. that accuses white men of exploiting the rest of the world and accuses brown men of destroying it. I have been Tyler and Dharun in a post 9/11 world where white men advocate for homosexual rights and advance homophobia and where brown men are understood as always homophobic. I am being presumptuous, so let me stop.

It’s an interesting take on things, and has made me think about the media coverage of the event and if and how Dharun’s race has played into how the story has been told.

Via Kamalika I learned about a lawsuit against IMDB.

A gem from SMBC via Cosma. The Beef Tensors are a nice touch.

Sepia Mutiny is shutting down, and Amardeep has some closing thoughts.

We always get to hear these stories about how service providers need differential pricing for network traffic because they can’t make money, but then stories like this make me question the integrity of the complainers.

I heard Of Monsters and Men on KEXP and their show is sold out in Chicago, boo. Here’s their crazy video though:

CISS 2012 : day 1

I’m at CISS right now on the magnolia-filled Princeton campus. The last time I came here was in 2008, when I was trying to graduate and was horribly ill, so this year was already a marked improvement. CISS bears some similarities to Allerton — there are several invited sessions in which the talks are a little longer than the submitted sessions. However, the session organizers get to schedule the entire morning or afternoon (3 hours) as they see fit, so hopping between sessions is not usually possible. I actually find this more relaxing — I know where I’m going to be for the afternoon, so I just settle down there instead of watching the clock so I don’t miss talk X in the other session.

Because of these invited slots, I’ve begun to realize that I’ve seen some of the material before in other venues such as ITA. This is actually a good thing — in general, I seem to need to see things 3 times to wrap my brain around them.

In the morning I went to Wojciech Szpankowski‘s session on the Science of Information, a sort of showcase for the new multi-university NSF Center. Peter Shor gave an overview of quantum information theory, ending with comments on the additivity conjecture. William Bialek discussed how improvements in array sensors for multi-neuron recording and other measurement technologies are allowing experimental verification of some theoretical/statistical approaches to neuroscience and communication in biological systems. In particular, he discussed an interesting example of how segmentation appears in the embryonic development of fruit flies and how they can track the propagation of chemical markers during development.

David Tse gave a slightly longer version of his ITA talk on DNA sequencing, with more of the proof details. It’s a cute version of the genome assembly problem, but I am not entirely sure what it tells us about the host of other questions biologists have about this data. I’m trying to wrestle with some short-read sequencing data to understand it (and learning some Bioconductor in the process), and the real data is pretty darn messy.

Madhu Sudan talked about his work with Brendan Juba (and now Oded Goldreich) on Semantic Communication — it’s mostly trying to come up with definitions of what it means to communicate meaning using computer science, and somehow feels like some of these early papers in Information and Control which tried to mathematize linguistics or other fields. This is the magical 3rd time I’ve seen this material, so maybe it’s starting to make sense to me.

Andrea Goldsmith gave a whirlwind tour of the work on backing away from asymptotic studies in information theory, and how insights we get from asymptotic analyses often don’t translate into the finite-parameter regime. This is of a piece with her stand a few years ago on cross-layer design. High-SNR assumptions in MIMO and relaying imply that certain tradeoffs (such as diversity-multiplexing) or certain protocols (such as amplify-and-forward) are fundamental, but at moderate SNR the optimal strategies are different or unknown. Infinite blocklengths are the bread and butter of information theory, but now there are more results on what we can do at finite blocklength. She ended with some comments on infinite processing power and trying to consider transmit and processing power jointly, which caused some debate in the audience.

Alas, I missed Tsachy Weissman’s talk, but at least I saw it at ITA? Perhaps I will get to see it two more times in the future!

In the afternoon I went to the large alphabets session, which was organized by Aaron Wagner. Unfortunately, Aaron couldn’t make it, so I ended up chairing the session. Venkat Chandrasekaran didn’t really talk about large alphabets, but instead about estimating high-dimensional covariance matrices when you have symmetry assumptions on the matrix. These are represented by invariance of the true covariance under actions of a subgroup of the symmetric group — taking these into account can greatly improve sample complexity bounds. Mesrob Ohannessian talked about his canonical estimation framework for large alphabet problems and summarized a lot of other work before (too briefly!) mentioning his own work on the consistency of estimators under some assumptions on the generating distribution.
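As a toy illustration of the symmetry idea (my own example, not from the talk): if the true covariance is invariant under cyclic shifts, you can project the sample covariance onto the circulant matrices by averaging over the group, which never moves you away from the truth and tends to reduce estimation error.

```python
import numpy as np

def symmetrize_cov(S):
    """Average a sample covariance over the cyclic-shift group,
    projecting it onto the circulant (shift-invariant) matrices.
    The talk covered general subgroups of the symmetric group;
    cyclic shifts are just the easiest case to write down.
    """
    d = S.shape[0]
    out = np.zeros_like(S)
    for k in range(d):
        # conjugate S by the k-step cyclic permutation: P_k S P_k^T
        out += np.roll(np.roll(S, k, axis=0), k, axis=1)
    return out / d
```

The output is invariant under any further shift, and the trace is preserved since each term is a permutation-similarity of S.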

Prasad Santhanam talked about the insurance problem that he worked on with Venkat Anantharam, and I finally understood it a bit better. Suppose you are observing i.i.d. samples X_t from a distribution P on \mathbb{R}^{+} that represent losses paid out by an insurer. The insurer gets to observe the losses for a while and then has to start setting premiums Y_t. The question is this: when can we guarantee that Y_t remains bounded and \mathbb{P}( Y_t > X_t \ \forall t ) > 1 - \eta? In this case we say the distribution is insurable.
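A toy simulation of the setup (the premium rule below is my own naive guess, not the scheme from the paper): watch losses for a while, then charge a multiple of the largest loss seen so far, and check whether the premium ever fails to cover a loss.

```python
import random

def insurable_run(dist, t_observe=50, t_live=500, slack=1.5, seed=0):
    """Toy check of the insurability condition: observe losses for a
    while, then set premiums Y_t and see whether Y_t > X_t for all
    later t. The premium rule (slack times the max loss so far) is
    illustrative only.
    """
    rng = random.Random(seed)
    history = [dist(rng) for _ in range(t_observe)]
    y = slack * max(history)
    for _ in range(t_live):
        x = dist(rng)
        if x >= y:
            return False          # premium failed to cover a loss
        y = max(y, slack * x)     # premiums stay bounded if losses do
    return True

# bounded losses (uniform on [0,1]) should be easy to insure this way
frac = sum(insurable_run(lambda rng: rng.uniform(0, 1), seed=s)
           for s in range(100)) / 100
```

Heavy-tailed loss distributions are where this naive rule breaks down, which is roughly where the interesting part of their characterization lives.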

To round out the session, Wojciech Szpankowski gave a talk on analytic approaches to bounding minimax redundancy under different scaling assumptions on the alphabet and sample sizes. There was a fair bit of generatingfunctionology and Lambert W-functions. The end part of the talk was on scaling when you know part of the distribution exactly (perhaps through offline simulation or training) but then there is part which is unknown. The last talk was by Greg Valiant, who talked about his papers with Paul Valiant on estimating properties of distributions on n elements using only \Theta(n/\log n) samples. It was a variant of the talk he gave at Banff, but I think I understood the lower bound CLT results a bit better (using Stein’s Method).

I am not sure how much blogging I will do about the rest of the conference, but probably another post or two. Despite the drizzle, the spring is rather beautiful here — la joie du printemps.

ICITS Deadline Extension

Due to conflicts with other deadlines and conferences, the submission
deadline for the “conference” track of ICITS 2012 — the International
Conference on Information-Theoretic Security — has been moved back
ten days to Thursday, March 22, 2012.

The “conference” deadline is now Thursday, March 22 (3pm EDT / 19:00 GMT).
The “workshop” deadline is Monday, April 9.

ICITS will have two tracks this year, one which will act as a regular
computer science-style conference (published proceedings, original
work only) and the other which will behave more like a workshop,
without proceedings, where presentations on previously published work
or work in progress are welcome.

For more information, see the conference website.