# A week later, a response from Ventra

I wrote to Ventra a week ago to ask them if there was any way for me to get an automated notice when it would charge my bank account. This was a nice feature of the Chicago Card — by letting me know when it added another $20 I could get a rough sense of my transit usage. A week after asking, I got the following response: Dear Customer, I am very sorry, but at this time, we do not have any alerts set up to notify customers when their card is auto loaded with value. If you have transit auto load set up, you just have to be aware that when your balance gets to$10.00 and below, that is when the funding source you have on your acct is charged for what ever value preference you have selected to be added to your card. Or if you have Pass auto load set up, then you just need to be aware that it will always charge your funding source for the pass you have chosen and add it to your acct, three days prior to your existing pass expiring. You can keep track of your balance via your web acct on the internet or can call us to check your balance, and can also check your balance at any Ventra Vending Machine as well. We do apologize for any inconvenience this may cause, but if you have any further concerns, please feel free to contact us for assistance. Thank You And Have A Nice Day.

The old CTA system would even flash your current balance and how much was being debited when you went through. On each transaction you could tell your balance, at least at train stations. Given that I have had to tap my Ventra card 10+ times (no hyperbole) to get through a turnstile at a CTA station, I have almost no confidence that the system is charging me the correct amount. Add to that Ventra’s penchant for charging debit cards in people’s wallets without informing them, there’s only one conclusion: Ventra is wholly set up like a scam. They say “give us your bank account information and then log into our website (with its awful UI) periodically to verify that we are not overcharging you.”

I guess it’s all moot since I’m leaving Chicago, but still… arrrrrrrrrgh.

Sorry Please Thank You: Stories [Charles Yu] — collection of short stories by the author of How to Live Safely in a Science Fictional Universe. Cute, heart wrenching, cutting, it’s all in here. They are more like snapshots than anything else, or fiction experiments. People may think that Yu is writing science fiction but space opera this is not. Some of the best ones are like Cindy Sherman’s Untitled Film Stills — you wonder what the rest of the world is that engendered the slice you saw.

After Dark [Haruki Murakami] — this novel felt a bit more spare and ephemeral than some of Murakami’s earlier works. It’s certainly not as sprawling as 1Q84 or Kafka on the Shore, but weeks later I still feel it’s a little haunting. Fans of Murakami will enjoy it because it’s different but I’m not sure it would make a good introduction to his work.

That Thing Around Your Neck [Chinamanda Ngozi Adichie] — a collection of short stories set in both Nigeria and the US. It’s a glimpse into a certain slice of the Nigerian immigrant community which I found both familiar (from similarities to the South Asian experience) and different (how Biafra looms large). I’m reading Americanah, her latest novel, right now, but this was available from the library earlier. I recommend it highly.

The Amazing Maurice and His Educated Rodents [Terry Pratchett] — a cute twist on the Pied Piper thing. It’s Discworld without being too Discworld, and would have been totally up my alley when I was a kid. It’s a little less serious than Diana Wynne Jones, for example, but the humor adds to the adventure, rather than distracting from it (that may be a matter of taste though). A good present for kids at that early-YA reading level.

The Ocean at the End of the Lane [Neil Gaiman] — like the Murakami, this rural English weirding fairytale felt familiar and wistful. Unlike the Murakami, however, I found it a bit spare and… predictable I guess. That’s not wholly a bad thing, but I don’t find myself wanting to read it again, like, say, Neverwhere.

The Girl From Foreign [Sadia Shepherd] — a lovely memoir by a Pakistani-American woman who discovers her grandmother was originally from the Bene Israel community of the Konkan coast. She goes to rediscover her Jewish roots in India and indeed, to discover this community. Definitely recommended for those who like historical memoir travelogues (think of Ghosh’s In An Antique Land, which is also a wonderful book).

# the MAP perturbation framework

I started working this fall on an interesting problem (shameless plug!) with Francesco Orabona, Tamir Hazan, and Tommi Jaakkola. What we do there is basically a measure concentration result, but I wanted to write a bit about the motivation for the paper. It’s nicely on the edge of that systems EE / CS divide, so I thought it might be a nice topic for the blog. One name for this idea is “MAP perturbations” so the first thing to do is explain what that means. The basic idea is to take a posterior probability distribution (derived from observed data) and do a random perturbation of the probabilities, and then take the maximum of that perturbed distribution. Sometimes this is called “perturb-and-MAP” but as someone put it, that sounds a bit like “hit-and-run.”

The basic problem is to sample from a particular joint distribution on $n$ variables. For simplicity, let’s consider an $n$-bit vector $\mathbf{x} \in \{0,1\}^n$. There are $2^n$ possible values, so explicitly maximizing the distribution could be computationally tricky. However we are often aided by the fact probability model has some structure. For example, the $n$ bits may be identified with labels {foreground,background} and correspond to $n$ pixels in an image, and there may be some local constraints which make it more likely for adjacent pixels to have the same label. In general, these local constraints get lumped together into a potential function $\theta(\mathbf{x})$ which assigns a score to each $\mathbf{x}$. The distribution on $\mathbf{x}$ is a Gibbs distribution:

$p(\mathbf{x}) = \frac{1}{Z} \exp(\theta( \mathbf{x} ))$

where the normalizing constant $Z$ is the infamous partition function:

$Z = \sum_{\mathbf{x}} \exp(\theta(\mathbf{x}))$

It’s infamous because it’s often hard to compute explicitly. This also makes sampling from the Gibbs distribution hard.

The MAP rule chooses the $\mathbf{x}$ that maximizes this distribution. Doing this means you don’t need to calculate $Z$ since you can maximize the potential function instead:

$\mathbf{X}_{\mathsf{MAP}} = \mathrm{argmax} \left\{ \theta(\mathbf{x}) \right\}$.

This isn’t any easier in general, computationally, but people have put lots of blood, sweat, and tears into creating MAP solvers that use tricks to do this maximization. In some cases, these solvers work pretty well. Our goal will be to use the good solvers as a black box.

Unfortunately, in a lot of applications, we really would like to sample from he Gibbs distribution, because the number-one best configuration $\mathbf{x}$ may not be the only “good” one. In fact, there may be many almost-maxima, and sampling will let you produce a list of those. One way to do this is via Markov-chain Monte Carlo (MCMC), but the problem with all MCMC methods is you have to know how long to run the chain.

The MAP perturbation approach is different — it adds a random function $\gamma(\mathbf{x})$ to the potential function and solves the resulting MAP problem:

$\mathbf{X}_{\mathsf{R-MAP}} = \mathrm{argmax} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\}$

The random function $\gamma(\cdot)$ associates a random variable to each $\mathbf{x}$. The simplest approach to designing a perturbation function is to associate an independent and identically distributed (i.i.d.) random variable $\gamma(\mathbf{x})$ for each $\mathbf{x}$. We can find the distribution of the randomized MAP predictor when each $\gamma(\mathbf{x})$ a Gumbel random variable with zero mean, variance $\pi^2/6$, and cumulative distribution function

$G( y ) = \exp( - \exp( - (y + c)))$,

where $c \approx 0.5772$ is the Euler-Mascheroni constant.

So what’s the distribution of the output of the randomized predictor $\mathbf{X}_{\mathsf{R-MAP}}$? It turns out that the distribution is exactly that of the Gibbs distribution we want to sample from:

$\mathbb{P}_{\gamma}\left( \mathbf{X}_{\mathsf{R-MAP}} = \mathbf{x} \right) = p(\mathbf{x})$

and the expected value of the maximal value is the log of the partition function:

$\mathbb{E}_{\gamma}\left[ \max_{\mathbf{x}} \left\{ \theta(\mathbf{x}) + \gamma(\mathbf{x}) \right\} \right] = \log Z$

This follows from properties of the Gumbel distribution.

Great – we’re done. We can just generate the $\gamma$ random variables, shove the perturbed potential function into our black-box MAP solver, and voilà: samples from the Gibbs distribution. Except that there’s a little problem here – we have to generate $2^n$ Gumbel variables, one for each $\mathbf{x}$, so we’re still a bit sunk.

The trick is to come up with lower-complexity perturbation (something that Tamir and Tommi have been working on for a bit, among others), but I will leave that for another post…

# expected time for an optimal game of memory

Via Dan Katz, I learned about a recent problem (warning: JSTOR) from the American Mathematical Monthly (a publication of the Mathematics Association of America, making shaded icosahedrons look cool for almost 100 years):

What to Expect in a Game of Memory
Author(s): Daniel J. Velleman, Gregory S. Warrington
The American Mathematical Monthly, Vol. 120, No. 9 (November), pp. 787-805

The game of memory is played with a deck of $n$ pairs of cards. The cards in each pair are identical. The deck is shuffled and the cards laid face down. A move consists of flipping over first one card and then another. The cards are removed from play if they match. Otherwise, they are flipped back over and the next move commences. A game ends when all pairs have been matched. We determine that, when the game is played optimally, as $n \rightarrow \infty$
• The expected number of moves is $(3 - 2 \ln 2) n + 7/8 - 2 \ln 2 \approx 1.61 n$.
• The expected number of times two matching cards are unwittingly flipped over is $\ln 2$.
• The expected number of flips until two matching cards have been seen is $2^{2n} / \binom{2n}{n} \approx \sqrt{\pi n}$.

This is not a competitive game of memory, but the singe player version. It’s a kind of explore-exploit tradeoff with a simple structure — if you know how to exploit, do it. Note that one could do $2 n$ moves by flipping every card over once (there are $2n$ cards) to learn all of their identities and then removing all of the pairs one by one. The better strategy is

1. Remove any known pair.
2. If no known pair is known, flip a random unknown card and match it if you can.
3. If the first card is not matchable, flip another random unknown card to learn its value (and remove the pair if it matches.

This strategy exploits optimally when it can and explores optimally when it can’t. The second bullet point in the abstract is the gain from getting lucky, i.e. two randomly drawn cards matching.

The paper is an interesting read, but the arguments are all combinatorial. Since the argument is a limiting one as $n \to \infty$, I wonder if there is a more “probabilistic” argument (this is perhaps a bit fuzzy) for the results.

# dismissing research communities is counterproductive

I recently saw that Andrew Gelman hasn’t really heard of compressed sensing. As someone in the signal processing/machine learning/information theory crowd, it’s a little flabbergasting, but I think it highlights two things that aren’t really appreciated by the systems EE/algorithms crowd: 1) statistics is a pretty big field, and 2) the gulf between much statistical practice and what is being done in SP/ML research is pretty wide.

The other aspect of this is a comment from one of his readers:

Meh. They proved L1 approximates L0 when design matrix is basically full rank. Now all sparsity stuff is sometimes called ‘compressed sensing’. Most of it seems to be linear interpolation, rebranded.

I find such dismissals disheartening — there is a temptation to say that every time another community picks up some models/tools from your community that they are reinventing the wheel. As a short-hand, it can be useful to say “oh yeah, this compressed sensing stuff is like the old sparsity stuff.” However, as a dismissal it’s just being parochial — you have to actually engage with the use of those models/tools. Gelman says it can lead to “better understanding one’s assumptions and goals,” but I think it’s more important to “understand what others’ goals.”

I could characterize rate-distortion theory as just calculating some large deviations rate functions. Dembo and Zeitouni list RD as an application of the LDP, but I don’t think they mean “meh, it’s rebranded LDP.” For compressed sensing, the goal is to do the inference in a computationally and statistically efficient way. One key ingredient is optimization. If you just dismiss all of compressed sensing as “rebranded sparsity” you’re missing the point entirely.