expected time for an optimal game of memory

Via Dan Katz, I learned about a recent problem (warning: JSTOR) from the American Mathematical Monthly (a publication of the Mathematics Association of America, making shaded icosahedrons look cool for almost 100 years):

What to Expect in a Game of Memory
Author(s): Daniel J. Velleman, Gregory S. Warrington
The American Mathematical Monthly, Vol. 120, No. 9 (November), pp. 787-805

The game of memory is played with a deck of n pairs of cards. The cards in each pair are identical. The deck is shuffled and the cards laid face down. A move consists of flipping over first one card and then another. The cards are removed from play if they match. Otherwise, they are flipped back over and the next move commences. A game ends when all pairs have been matched. We determine that, when the game is played optimally, as n \rightarrow \infty
• The expected number of moves is (3 - 2 \ln 2) n + 7/8 - 2 \ln 2 \approx 1.61 n.
• The expected number of times two matching cards are unwittingly flipped over is \ln 2.
• The expected number of flips until two matching cards have been seen is 2^{2n} / \binom{2n}{n} \approx \sqrt{\pi n}.

This is not a competitive game of memory, but the singe player version. It’s a kind of explore-exploit tradeoff with a simple structure — if you know how to exploit, do it. Note that one could do 2 n moves by flipping every card over once (there are 2n cards) to learn all of their identities and then removing all of the pairs one by one. The better strategy is

  1. Remove any known pair.
  2. If no known pair is known, flip a random unknown card and match it if you can.
  3. If the first card is not matchable, flip another random unknown card to learn its value (and remove the pair if it matches.

This strategy exploits optimally when it can and explores optimally when it can’t. The second bullet point in the abstract is the gain from getting lucky, i.e. two randomly drawn cards matching.

The paper is an interesting read, but the arguments are all combinatorial. Since the argument is a limiting one as n \to \infty, I wonder if there is a more “probabilistic” argument (this is perhaps a bit fuzzy) for the results.

estimating probability metrics from samples

I took a look at this interesting paper by Sriperumbudur et al., On the empirical estimation of integral probability metrics (Electronic Journal of Statistics Vol. 6 (2012) pp.1550–1599). The goal of the paper is to estimate a distance or divergence between two distributions P and Q based on samples from each distribution. This sounds pretty vague at first… what kind of distributions? How many samples? This paper looks at integral probability metrics, which have the form

\gamma(P,Q) = \sup_{f \in \mathcal{F}} \left| \int_{S} f dP - \int_{S} f dQ \right|

where S is a measurable space on which P and Q are defined, and \mathcal{F} is a class of real-valued bounded measurable functions on S. This class doesn’t contain Csiszár \phi-divergences (also known as Ali-Silvey distances), but does contain the total variational distance.

Different choices of the function class give rise to different measures of difference used in so-called two-sample tests, such as the Kolmogorov-Smirnov test. The challenge in practically using these tests is that it’s hard to get bounds on how fast an estimator of \gamma(P,Q) converges if we have to estimate it from samples of P and Q. The main result of the paper is to provide estimators with consistency and convergence guarantees. In particular, they estimators are based on either linear programming or (in the case of kernel tests) in closed form.

The second section of the paper connects tests based on IPMs with the risk associated to classification rules for separating P and Q when the separation rule is restricted to come from the function class \mathcal{F} associated to the rule. This is a nice interpretation of these two-sample tests — they are actually doing similar things for restricted classes of classifiers/estimators.

Getting back to KL divergence and non-IPM measures, since total variation gives a lower bound on the KL divergence, they also provide lower bounds on the total variation distance in terms of other IPM metrics. This is important since the total variation distance can’t be estimated itself in a strongly consistent way. This could be useful for algorithms which need to estimate the total variation distance for continuous data. In general, estimating distances between multivariate continuous distributions can become a bit of a mess when you have to use real data — doing a plug-in estimate using, e.g. a kernel density estimator is not always the best way to go, and instead attacking the distance you want to measure directly could yield better results.

Postdoc at INRIA-ENS Paris on Graphs, Algorithms and Probability

Applications are invited for a Postdoc position (full-time, up to 2 years) at INRIA-ENS in Paris. The position is funded by the ANR GAP grant “Graphs, Algorithms and Probability.”

Requirements are a PhD degree in Computer Science or Mathematics and a strong background in some of the following topics:

  • discrete probability
  • statistical learning
  • combinatorial optimization
  • stochastic networks

Applications must include a research statement, a CV and the names and contacts of references. All material should be sent by email to Marc Lelarge. Please indicate in the subject POSTDOC GAP.

Important dates:

  • Intention of application (short email): as soon as possible
  • Deadline for application: December 1st, 2013
  • Suggested starting dates: Jan.-Feb. 2014

ISIT Blogging, part 2

Logarithmic Sobolev inequalities and strong data processing theorems for discrete channels
Maxim Raginsky (University of Illinois at Urbana-Champaign, USA)
Max talked about how the strong data processing inequality (DPI) is basically a log-Sobolev inequality (LSI) that is used in measure concentration. The strong DPI says that
D(QW \| PW) \le \alpha D(Q \| P)
for some \alpha < 1, so the idea is to get bounds on
\delta^*(P,W) = \sup_{Q} \frac{D(QW \| PW)}{D(Q \| P)}.
What he does is construct a hierarchy of LSIs in which the strong DPI fits and then gets bounds on this ratio in terms of best constants for LSIs. The details are a bit hairy, and besides, Max has his own blog so he can write more about it if he wants…

Building Consensus via Iterative Voting
Farzad Farnoud (University of Illinois, Urbana-Champaign, USA); Eitan Yaakobi (Caltech, USA); Behrouz Touri (University of Illinois Urbana-Champaign, USA); Olgica Milenkovic (University of Illinois, USA); Jehoshua Bruck (California Institute of Technology, USA)
This paper was about rank aggregation, or how to take a bunch of votes expressed as permutations/rankings of options to produce a final option. The model is one in which people iteratively change their ranking based on the current ranking. For example, one could construct the pairwise comparison graph (a la Condorcet) and then have people change their rankings when they disagree with the majority on an edge. They show conditions under which this process converges (the graph should not have a cycle) and show that if there is a Condorcet winner, then after this process everyone will rank the Condorcet winner first. They also look at a Borda count version of this problem but to my eye that just looked like an average consensus method, but it was at the end of the talk so I might have missed something.

Information-Theoretic Study of Voting Systems
Eitan Yaakobi (Caltech, USA); Michael Langberg (Open University of Israel, Israel); Jehoshua Bruck (California Institute of Technology, USA)
Eitan gave this talk and the preceding talk. This one was about looking at voting through the lens of coding theory. The main issue is this — what sets of votes or distribution of vote profiles will lead to a Condorcet winner? Given a set of votes, one could look at the fraction of candidates who rank candidate j in the i-th position and then try to compute entropies of the resulting distributions. The idea is somehow to characterize the existence or lack of a Condorcet winner in terms of distances (Kendall tau) and these entropy measures. This is different than looking at probability distributions on permutations and asking about the probability of there existing a Condorcet cycle.

Brute force searching, the typical set and Guesswork
Mark Chirstiansen (National University of Ireland Maynooth, Ireland); Ken R Duffy (National University of Ireland Maynooth, Ireland); Flávio du Pin Calmon (Massachusetts Institute of Technology, USA); Muriel Médard (MIT, USA)
Suppose an item X is chosen \sim P from a list and we want to guess the choice that is made. We’re only allowed to ask questions of the form “is the item Y?” Suppose now that the list is a list of codewords of blocklength k drawn i.i.d. according to Q. This paper looks at the number of guesses one needs if P is uniform on the typical set according to Q versus when P is distributed according the distribution Q^k conditioned on X being in the typical set. The non-uniformity of the latter turns out to make the guessing problem a lot easier.

Rumor Source Detection under Probabilistic Sampling
Nikhil Karamchandani (University of California Los Angeles, USA); Massimo Franceschetti (University of California at San Diego, USA)
This paper looked at an SI model of infection on a graph — nodes are either Susceptible (S) or Infected (I), and there is a probability of transitioning from S to I based on your neighbors’ states. There in exponential waiting time \tau_{ij} for the i to infect j if i is infected. The idea is that the rumor starts somewhere and infects a bunch of people and then you get to observe/measure the network. You want to find the source. This was studied by Zaman and Shah under the assumption of perfect observation of all nodes. This work looked at the case where nodes randomly report their infection state, so you only get an incomplete picture of the infection state. They characterize the effect of the reporting probability on the excess error and show that for certain tree graphs, incomplete reporting is as good as full reporting.


David McAllester, my department chair at TTI, has a started a new blog.

I thought it was pretty well known that people are fairly unique by ZIP code, but Forbes has an article about it now (h/t Raj). Of course, stores can also ping a smartphone’s WiFi to get more accurate location information about your activity within the store — when you check out they can tag your the MAC address of your device to all the other information about you. Creeptastic!

Bradley Efron’s perspective on the impact of Bayes’ Theorem from Science (h/t Kevin).

Some discussion on what makes a popular philosophy book. I wonder what, if anything, transfers over to a popular mathematical book?

Some thoughts from Larry Laudan on the mathematization of the presumption of innocence.

a cute paper: finding a most biased coin with fewest flips

I saw this paper on ArXiV a while back and figured it would be a fun read, and it was. Post-ISIT blogging may have to wait for another day or two.

Finding a most biased coin with fewest flips
Karthekeyan Chandrasekaran, Richard Karp
arXiv:1202.3639 [cs.DS]

The setup of the problem is that you have n coins with biases \{p_i : i \in [n]\}. For some given p \in [\epsilon,1-\epsilon] and \epsilon \in (0,1/2), each coin is “heavy” (p_i = p + \epsilon) with probability \alpha and “light” (p_i = p - \epsilon) with probability 1 - \alpha. The goal is to use a sequential flipping strategy to find a heavy coin with probability at least 1 - \delta.

Any such procedure has three components, really. First, you have to keep track of some statistics for each coin i. On the basis of that, you need a rule to pick which coin to flip. Finally, you need a stopping criterion.

The algorithm they propose is a simple likelihood-based scheme. If I have flipped a particular coin i a bunch of times and gotten h_i heads and t_i tails, then the likelihood ratio is
L_i = \left( \frac{p+\epsilon}{p - \epsilon} \right)^{h_i} \left( \frac{ 1 - p - \epsilon }{ 1 -p + \epsilon} \right)^{t_i}
So what the algorithm does is keep track of these likelihoods for the coins that it has flipped so far. But what coin to pick? It is greedy and chooses a coin i which has the largest likelihood L_i so far (breaking ties arbitrarily).

Note that up to now the prior probability \alpha of a coin being heavy has not been used at all, nor has the failure probability \delta. These appear in the stopping criterion. The algorithm keeps flipping coins until there exists at least one $i$ for which
L_i \ge \frac{1 - \alpha}{\alpha} \cdot \frac{ 1 - \delta }{\delta}
It then outputs the coin with the largest likelihood. It’s a pretty quick calculation to see that given (h_i, t_i) heads and tails for a coin $i$,
\mathbb{P}(\mathrm{coin\ }i\mathrm{\ is\ heavy}) = \frac{\alpha L_i}{ \alpha L_i + (1 - \alpha) },
from which the threshold condition follows.

This is a simple-sounding procedure, but to analyze it they make a connection to something called a “multitoken Markov game” which models the corresponding mutli-armed bandit problem. What they show is that for the simpler case given by this problem, the corresponding algorithm is, in fact optimal in the sense that it makes the minimum expected number of flips:
\frac{16}{\epsilon^2} \left( \frac{1 - \alpha}{\alpha} + \log\left( \frac{(1 -\alpha)(1 - \delta)}{\alpha \delta} \right) \right)

The interesting thing here is that the prior distribution on the heavy/lightness plays a pretty crucial role here in designing the algorithm. part of the explore-exploit tradeoff in bandit problems is the issue of hedging against uncertainty in the distribution of payoffs — if instead you have a good handle on what to expect in terms of how the payoffs of the arms should vary, you get a much more tractable problem.

Yet more skims from ArXiV

I’m still catching up on my backlog of reading everything, but I’ve decided to set some time aside to take a look at a few papers from ArXiV.

  • Lecture Notes on Free Probability by Vladislav Kargin, which is 100 pages of notes from a course at Stanford. Pretty self-explanatory, except for the part where I don’t really know free probability. Maybe reading these will help.
  • Capturing the Drunk Robber on a Graph by Natasha Komarov and Peter Winkler. This is on a simple pursuit-evasion game in which the robber (evader) is moving according to a random walk. On a graph with n vertices:

    the drunk will be caught with probability one, even by a cop who oscillates on an edge, or moves about randomly; indeed, by any cop who isn’t actively trying to lose. The only issue is: how long does it take? The lazy cop will win in expected time at most 4 n^3/27 (plus lower-order terms), since that is the maximum possible expected hitting time for a random walk on an n-vertex graph [2]; the same bound applies to the random cop [4]. It is easy to see that the greedy cop who merely moves toward the drunk at every step can achieve O(n^2); in fact, we will show that the greedy cop cannot in general do better. Our smart cop, however, gets her man in expected time n + o(n).

    How do you make a smarter cop? In this model the cop can tell where the robber is but has to get there by walking along the graph. Strategies which try to constantly “retarget” are wasteful, so they propose a strategy wherein the cop periodically retargets to eventually meet the robber. I feel like there is a prediction/learning algorithm or idea embedded in here as well.

  • Normalized online learning by Stephane Ross, Paul Mineiro, John Langford. Normalization and data pre-processing is the source of many errors and frustrations in machine learning practice. When features are not normalized with respect to each other, procedures like gradient descent can behave poorly. This paper looks at dealing with data normalization in the algorithm itself, making it “unit free” in a sense. It’s the same kind of weights-update rule that we see in online learning but with a few lines changed. They do an adversarial analysis of the algorithm where the adversary gets to scale the features before the learning algorithm gets the data point. In particular, the adversary gets to choose the covariance of the data.
  • On the Optimality of Treating Interference as Noise, by Chunhua Geng, Navid Naderializadeh, A. Salman Avestimehr, and Syed A. Jafar. Suppose I have a K-user interference channel with gains \alpha_{ij} between transmitter i and receiver j. Then if
    \alpha_{ii} \ge \max_{j \ne i} \alpha_{ij} + \max_{k \ne i} \alpha_{ki}
    then treating interference as noise is optimal in terms of generalized degrees of freedom. I don’t really work on this kind of thing, but it’s so appealing from a sense of symmetry.
  • Online Learning under Delayed Feedback, byPooria Joulani, András György, Csaba Szepesvári. This paper is on forecasting algorithms which receive the feedback (e.g. the error) with a delay. Since I’ve been interested in communication with delayed feedback, this seems like a natural learning analogue. They provide ways of modifying existing algorithms to work with delayed feedback — one such method is to run a bunch of predictors in parallel and update them as the feedback is returned. They also propose methods which use partial monitoring and an approach to UCB for bandit problems in the delayed feedback setting.

Persi Diaconis on coincidence

Persi Diaconis gave the second annual Billingsley Lecture at UChicago yesterday on the topic of coincidences and what a skeptical statistician/probabilist should say about them. He started out by talking about how Jung was fascinated by paradoxes (apparently there’s one about having fish come up all the time in conversation).

It was mostly a general-audience talk (with some asides about Poisson approximation), and the first part on the birthday problem and variants. Abstracted away, the question is given n balls (people) and C bins/categories (days), how big should n be so that there’s an even chance that two balls land in the same bin? Turns out n \approx latex 1.2 \sqrt{C}, as we know, but we can expand this to deal with approximate matches (you need only 7 people to get 2 birthdays in the same week with probability around 1/2). If you want to put a graph on it you can ask social-network coincidence questions and get some scalings as a function of the number of edges and number of categories — here there are n vertices and C colors for the vertices. What these calculations show, of course, is that most coincidences are not so surprising, at least in this probabilistic sense. Some more advanced treatment might be found in Sukhada Fadnavis’s preprint (which also has something about a “shameful conjecture” on chromatic polynomials that was proved in 2000, but I don’t know why it is shameful). The second part of the talk was on problems arising in the study of ESP — namely that experimental controls are not really present, so the notion of a “trial” is hard to pin down, leading (of course) to more false perceptions of coincidences are being surprising. He closed with some remarks about how our perception of coincidence is really about how our minds work, and pointed to some work by Ruma Falk for those who are interested in that angle of things.

I was unaware of this body of Diaconis’s work, and it was nice to have a high-level talk to cap off the day.

HGR maximal correlation revisited : a corrected reverse inequality

Sudeep Kamath sent me a note about a recent result he posted on the ArXiV that relates to an earlier post of mine on the HGR maximal correlation and an inequality by Erkip and Cover for Markov chains U -- X -- Y which I had found interesting:
I(U ; Y) \le \rho_m(X,Y)^2 I(U ; X).
Since learning about this inequality, I’ve seen a few talks which have used the inequality in their proofs, at Allerton in 2011 and at ITA this year. Unfortunately, the stated inequality is not correct!

On Maximal Correlation, Hypercontractivity, and the Data Processing Inequality studied by Erkip and Cover
Venkat Anantharam, Amin Gohari, Sudeep Kamath, Chandra Nair

What this paper shows is that the inequality is not satisfied with \rho_m(X,Y)^2, but by another quantity:
I(U ; Y) \le s^*(X;Y) I(U ; X)
where s^*(X;Y) is given by the following definition.

Let X and Y be random variables with joint distribution (X, Y) \sim p(x, y). We define
s^*(X;Y) = \sup_{r(x) \ne p(x)} \frac{ D( r(y) \| p(y) ) }{ D( r(x) \| p(x) },
where r(y) denotes the y-marginal distribution of r(x, y) := r(x)p(y|x) and the supremum on the right hand side is over all probability distributions r(x) that are different from the probability distribution p(x). If either X or Y is a constant, we define s^*(X; Y) to be 0.

Suppose (X,Y) have joint distribution P_{XY} (I know I am changing notation here but it’s easier to explain). The key to showing their result is through deriving variational characterizations of \rho_m and s^* in terms of the function
t_{\lambda}( P_X ) := H( P_Y ) - \lambda H( P_X )
Rather than getting into that in the blog post, I recommend reading the paper.

The implication of this result is that the inequality of Erkip and Cover is not correct : not only is \rho_m(X,Y)^2 not the supremum of the ratio, there are distributions for which it is not an upper bound. The counterexample in the paper is the following: X \sim \mathsf{Bernoulli}(1/2), and Y is generated via this asymmetric erasure channel:

Joint distribution counterexample

Joint distribution counterexample (Fig. 2 of the paper)

How can we calculate \rho_m(X,Y)? If either X or Y is binary-valued, then
\rho_m(X,Y)^2 = -1 + \sum_{x,y} \frac{ p(x,y)^2 }{ p(x) p(y) }
So for this example \rho_m(X,Y)^2 = 0.6. However, s^*( X,Y) = \frac{1}{2} \log_2(12/5) > 0.6 and there exists a sequence of variables U_i satisfying the Markov chain such that U_i -- X -- Y such that the ratio approaches s^*.

So where is the error in the original proof? Anantharam et al. point to an explanation that the Taylor series expansion used in the proof of the inequality with \rho_m(X,Y)^2 may not be valid at all points.

This seems to just be the start of a longer story, which I look forward to reading in the future!