The sun always rises?

I went to a workshop at the end of the summer at the American Institute of Mathematics on Permanents and modeling probability distributions. The main question we looked at was how to estimate a probability distribution from samples when you don’t know, for example, how many possible values there are. A simple version of this, often mentioned, is estimating the number of butterfly species from a sample in which many species appear only once. As C.B. Williams wrote:

About 1940, a closer cooperation in a new field was started by a letter from Dr Steven Corbet, who had collected butterflies for many years in Malaya. On studying his collections after returning to England, he found that he had approximately 9000 specimens which included 316 species. Of these 118 were represented by only a single individual; 74 by 2 individuals, 44 by 3 and so on; the greater the number of individuals per species, the fewer were the number of species at that level. From this skew distribution it followed that the 37% rarer species accounted for only 1.3% of the total sample.

R.A. Fisher, Corbet and Williams wrote a paper in the Journal of Animal Ecology in 1943 to estimate the number of species (“The relation between the number of species and the number of individuals in a random sample of an animal population”). Another famous example is estimating the number of words Shakespeare knew, as in the work of Thisted and Efron (popularized a bit more by the application to determining if a new poem is by Shakespeare or not).

At the workshop I learned from Susan Holmes about this collection of essays by S.L. Zabell called Symmetry and its Discontents, which I have been enjoying tremendously. Alon Orlitsky gave a short presentation orienting the workshop participants, and started out with a “simpler” problem of inductive inference — given that you saw a trial succeed p times and fail q times, what is the bias of the coin? Bayes thought about this a lot, as did Laplace, who wrote this infamous example (excerpted from Zabell’s book):

Thus we find that an event having occurred successively any number of times, the probability that it will happen again the next time is equal to this number increased by unity, divided by the same number, increased by two units. Placing the most ancient epoch of history at five thousand years ago, or at least 1826213 days, and the sun having risen constantly in the interval at each revolution of twenty-four hours, it is a bet of 1826214 to one that it will rise again tomorrow. [Essai philosophique p. xvii]
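In modern notation, Laplace’s rule of succession says that after p successes and q failures, a uniform prior on the unknown bias gives probability (p+1)/(p+q+2) that the next trial succeeds. A quick sketch checking the sunrise arithmetic (the function name is mine):

```python
from fractions import Fraction

def rule_of_succession(successes, failures=0):
    """Laplace's rule of succession: posterior probability that the next
    trial succeeds, under a uniform prior on the unknown bias."""
    return Fraction(successes + 1, successes + failures + 2)

# Laplace's figure: 5000 years of history, i.e. 1826213 sunrises.
p = rule_of_succession(1826213)
odds = p / (1 - p)  # exactly 1826214, i.e. "a bet of 1826214 to one"
```

With no failures observed, the odds in favor come out to exactly (p+1) to one, matching the quote.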

Laplace gets a lot of flak for this estimate, and it’s an infamous example of the “excesses of probabilistic reasoning.” But as I learned this morning, Laplace immediately went on to say:

But this number is incomparably greater for him who, recognizing in the totality of the phenomena the regulatory principle of days and seasons, sees that nothing at the present moment can arrest the course of it.

Laplace was advocating this method of calculation formally, as the correct way to compute a probability based only on the information from the sample (e.g. the number of repeated successes/events). As Persi Diaconis pointed out at the workshop, Laplace would be “turning over in his grave” at some of the things people have accused him of mathematically.

And that is your historical nugget for the day.

Lazy random walks on the torus via Dirichlet forms

One of the simple example graphs I’ve used in some of my research on gossip algorithms is the 2-dimensional torus with n vertices, which looks like a \sqrt{n} \times \sqrt{n} grid with the top and left edges wrapped around to connect with the bottom and right edges. Every vertex has 4 neighbors. Now imagine a very lazy random walk on this graph, in which a random walker moves from vertex u to each of its neighbors with probability 1/n and stays at u otherwise. It’s “well known” that this random walk takes around n^2 steps to mix. That is, if P is the matrix of transition probabilities, then

T_{r} = \frac{1}{1 - \lambda_2(P)} \approx n^2

Here \lambda_2(P) is the second largest eigenvalue of P and the relaxation time T_{r} is the inverse of the spectral gap of the matrix. One way of characterizing T_{r} for reversible Markov chains is via the Dirichlet form. For a function f : V \to \mathbb{R} on the states of the chain define the Dirichlet form D(f,f) by

D(f,f) = \frac{1}{2} \sum_{u,v \in V} \pi_u P_{uv} (f(u) - f(v))^2

In our example the stationary distribution is \pi_u = 1/n, and P_{uv} = 1/n for all edges (u,v) in the graph. We write f \perp \pi if

\sum_{v \in V} \pi_v f(v) = 0

Define a norm associated with \pi via

\|f\|^2 = \sum_{v \in V} \pi_v f(v)^2

Then the characterization is

T_{r} = \sup_{f \perp \pi, f \ne 0} \frac{ \|f\|^2 }{ D(f,f) }

One question I asked myself today was whether it was “easy” to see what f you should choose in the grid example to get the scaling of n^2. Here’s one choice that gives the correct scaling. We’ll set f to be constant on each column. Assume without loss of generality that \sqrt{n} is divisible by 4 and set m = \sqrt{n}/4. The values for f on the columns will be like two triangles:

\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{m}{n},  \frac{m}{n}, \frac{m-1}{n}, \ldots, \frac{1}{n}, \frac{-1}{n}, \frac{-2}{n}, \ldots, \frac{-m}{n}, \frac{-m}{n}, \frac{-m+1}{n}, \ldots, \frac{-1}{n} \}

Now we can evaluate the norm, noting that there are \sqrt{n} vertices per column:

\|f\|^2 = 4 \sum_{i =1}^{m} \frac{1}{n} \sqrt{n} \frac{i^2}{n^2} = c \frac{1}{n}

This is because the sum of the first m squares scales like m^3 and m = \sqrt{n}/4. Now turning to the Dirichlet form, note that each difference between columns is at most 2/n and there are fewer than n edges for which f(u) \ne f(v). Thus:

D(f,f) \le \frac{1}{2} n \frac{1}{n} \frac{1}{n} \frac{4}{n^2} = c' \frac{1}{n^3}

Taking the ratio gives the lower bound T_r \ge c'' n^2, as desired.

The first f I tried was just equal to +1 on the first half of the columns and -1 on the second half. This ends up giving a suboptimal bound: the norm is \|f\|^2 = 1, but the denominator picks up 2 \sqrt{n} terms of constant size where the two halves meet, so the ratio only scales like n^{3/2}. The key is to make all the differences (f(u) - f(v))^2 in the denominator small while keeping \|f\|^2 large enough. Even though you sum over n small differences in the denominator, the total stays small enough to pay for the \|f\|^2 = c/n in the numerator.
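As a sanity check, one can compute T_r exactly by eigendecomposition for small tori and compare it with the Dirichlet-form ratio for the two-triangle f above. A sketch assuming numpy (function names are mine):

```python
import numpy as np

def torus_walk(s):
    """Transition matrix of the lazy walk on the s-by-s torus (n = s^2
    vertices): probability 1/n to each of the 4 neighbors, else stay put."""
    n = s * s
    P = np.zeros((n, n))
    for i in range(s):
        for j in range(s):
            u = i * s + j
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                v = ((i + di) % s) * s + (j + dj) % s
                P[u, v] += 1.0 / n
            P[u, u] = 1.0 - 4.0 / n
    return P

def relaxation_time(s):
    """T_r = 1/(1 - lambda_2); P is symmetric, so eigvalsh applies."""
    lam = np.sort(np.linalg.eigvalsh(torus_walk(s)))
    return 1.0 / (1.0 - lam[-2])

def dirichlet_ratio(s):
    """||f||^2 / D(f,f) for the two-triangle column function (s % 4 == 0)."""
    n, m = s * s, s // 4
    tri = list(range(1, m + 1)) + list(range(m, 0, -1))
    cols = [c / n for c in tri] + [-c / n for c in tri]  # one value per column
    f = np.array([cols[j] for i in range(s) for j in range(s)])
    norm2 = np.sum(f ** 2) / n  # pi_v = 1/n for every vertex
    D = 0.5 * np.sum((1.0 / n) * torus_walk(s) * np.subtract.outer(f, f) ** 2)
    return norm2 / D
```

For s = 8 (n = 64) the ratio comes out quite close to T_r itself, and doubling s multiplies both quantities by roughly 16, consistent with the n^2 scaling.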

While doing this calculation, I noticed that the book Markov Chains and Mixing Times is also online — it makes a handy reference and is a little easier to use than my old go-to, the Aldous-Fill book.

another amusing footnote

I seem to have a penchant for picking books with amusing footnotes. Or maybe most math books have them and I’ve been remarkably unlucky. Here’s one from Random Fields and Geometry, by R.J. Adler and J.E. Taylor:

The use of T comes from the prehistory of Gaussian processes, and probably stands for “time.” While the whole point of this book is to get away from the totally ordered structure of \mathbb{R}, the notation is too deeply entombed into the collective psyche of probabilists to change it now. Later on, however, when we move to manifolds as parameter spaces, we shall emphasize this by replacing T by M. Nevertheless, points in M will still be denoted by t. We hereby make the appropriate apologies to geometers.

It’s a good book so far, and may help me solve some pesky technical point in a new problem I’ve been working on. Hooray for new problems!

Following the Perturbed Leader

I’m reading Cesa-Bianchi and Lugosi’s Prediction, Learning, and Games in more detail now, and in Chapter 4 they discuss randomized prediction and forecasting strategies. The basic idea is that nature has a sequence of states or outcomes (y_1, y_2, \ldots, y_n). You have a set of possible actions \{1, 2, \ldots, N\} to choose from at each time. If you choose action i at time t then you incur a loss \ell(i, y_t). After you choose your action the state of nature is revealed to you. The goal is to come up with a strategy for choosing actions at each time t, given the past states (y_1, \ldots, y_{t-1}).

One simple strategy that you might consider is to choose at time t to take the action i which would have resulted in the minimum loss so far:

I_t = \arg\min_{i} \sum_{k=1}^{t-1} \ell(i, y_k).

This seems like a good idea at first glance but can easily be “fooled” by a bad state sequence (y_1, \ldots, y_n). This strategy is called fictitious play because you are pretending to play each action over the past and choosing the one which would have given you the best performance to date. Unfortunately, the strategy is not Hannan consistent: loosely, Hannan consistency requires that the gap between the average loss under the strategy and the average loss under the best fixed action goes to 0, and fictitious play cannot guarantee this.

So what to do? One option, proposed by Hannan, is to add noise to all of the empirical losses \sum_{k=1}^{t-1} \ell(i, y_k) and then choose the minimum of the noisy ones:

I_t = \arg\min_{i} \left( \sum_{k=1}^{t-1} \ell(i, y_k) + Z_{i,t} \right).

This strategy is Hannan consistent and is called follow the perturbed leader since the action you choose is the best action after a perturbation.
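Here is a minimal sketch of fictitious play with and without perturbations, run on the classic loss sequence that fools the unperturbed version (two actions; the uniform noise scaled like \sqrt{t} is one standard choice, and all names and constants here are mine, not the book’s):

```python
import random

def play(losses, n_actions, perturb=False, seed=0):
    """Run fictitious play (optionally perturbed); return total loss.
    losses: list of per-round loss vectors, one entry per action."""
    rng = random.Random(seed)
    cum = [0.0] * n_actions
    total = 0.0
    for t, loss in enumerate(losses):
        if perturb:
            # fresh noise each round, scaled like sqrt(t)
            scores = [cum[a] + rng.uniform(0.0, (t + 1) ** 0.5)
                      for a in range(n_actions)]
        else:
            scores = cum
        i = min(range(n_actions), key=lambda a: scores[a])
        total += loss[i]
        cum = [cum[a] + loss[a] for a in range(n_actions)]
    return total

# A bad sequence: after round 1, the current "leader" always loses next.
T = 2000
seq = [(0.5, 0.0)] + [(float(t % 2), float(1 - t % 2)) for t in range(T - 1)]
best_fixed = min(sum(l[a] for l in seq) for a in (0, 1))
ftl_regret = play(seq, 2) - best_fixed                  # about T/2
ftpl_regret = play(seq, 2, perturb=True) - best_fixed   # much smaller
```

On this sequence the unperturbed strategy tracks the loser every round, while the perturbed version’s regret grows sublinearly.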

If you got this far, I bet you were thinking I’d make some comment on the election. We’ve been following a perturbed leader for a while now. So here is an exercise: how is this prediction model a bad model for politics?

ITA 2008 : Part the First

I’ve spent this week at ITA 2008 at UCSD. There have been lots of great talks, but my brain is fairly saturated, so my ability to take more than cursory notes decreased as the workshop progressed. I’ve met a number of people who apparently read this blog (hi!), which is nice. One thing I’ll try to do is put in some more links to things as I think of them and also to raise some (possibly trivial) questions that came up as I wrote about the papers.

Iterative coding for network coding
Andrea Montanari, Stanford, Vishwambar Rathi, EPFL, and Rudiger Urbanke, EPFL

I think of this paper as the Shannon-theory version of the Koetter-Kschischang coding-theory analysis of network coding. There is a preprint of some of the results on arXiv that I read a little while ago (and unsuccessfully presented in our group meeting). The channel model is a rank metric channel (see Gabidulin), for which a notion of distance between subspaces can be defined. The idea is the following — messages are encoded into subspaces of a vector space over a finite field (taken to be F2 here). The set of all subspaces is the projective space, and the set of all subspaces of dimension k is the Grassmann manifold. So we can look at a codebook as a set of binary matrices of dimension m x l. In this model, headers are preserved, so the decoder in a network code knows the transfer matrix of the channel exactly. However, errors can still be introduced in the packet payload in the following sense. Each column can be written as

y_j = x_j + z_j

where z_j is uniformly chosen from an unknown subspace of limited rank. This is a perfectly good discrete memoryless channel model, and one can calculate its capacity. The interesting thing is that you can then define an LDPC code (with all variable nodes of degree 2 and a non-soliton distribution for the check nodes) across the columns of the codeword (which is a matrix, remember), with matrix-valued weights on the edges of the Tanner graph. The messages that are propagated in the BP algorithm correspond to information about affine subspace memberships of the rows. This code achieves capacity on the rank metric channel, and the error probability for erasures also goes to 0. Since I have been concerned about jamming and interference and malicious attacks, the assumption that the header is preserved feels like a limitation of the approach, but I think that concatenating codes may be able to preserve something about the header information. The reason I call it sort of Shannon-theoretic is that the error model is random and the density evolution arguments are a bit like the probabilistic method.

On rate-distortion function of Poisson processes with a queueing distortion measure
Todd P. Coleman, UIUC, and Negar Kiyavash, UIUC, and Vijay G. Subramanian, Hamilton Institute, NUIM

The main idea here is to find a rate-distortion problem corresponding to the bits-through-queues result on timing channels. So given a set of Poisson arrivals {Xi}, how can we define a distortion measure and rate-distortion function that make sense? The choice they make is to require that the reconstruction be causal, so that the packet times in the reconstructed version are delayed with respect to the originals, and the objective is then to minimize the induced service time of the reconstruction. This makes the test channel into a first-come first-served queue, and in particular the optimal test channel is the exponential timing channel. The rate-distortion function turns out to be -a log(D), where a is the intensity of the Poisson process. The benefit of their definition is that the proofs look as clean as the AWGN and BSC rate-distortion proofs.

Network tomography via network coding
G. Sharma, IIT Bombay, S. Jaggi, CUHK, and B. K. Dey, IIT Bombay

Network tomography means using input-output probing of a network to infer internal structural properties. In their model, all nodes have unique IDs and share some common randomness. Inferring the presence or absence of a single edge in the graph is relatively easy, and then the process can be repeated for every edge. The more interesting problem is structural inference in an adversarial setting. They show that the graph can be recovered if the min-cut from the source to the destination is greater than twice the min-cut from the adversary to the destination (I think), which squares with the Byzantine Generals problem. The tomography problem is interesting, and I wonder if one can also embed adversary-detection methods to create a protocol to quarantine misbehaving internal nodes.

Coding theory in projective spaces
Tuvi Etzion, Technion, and Alexander Vardy, UCSD

This talk also dealt with the rank-metric channel. Previous work of Koetter and Kschischang had found analogues of the Singleton and Johnson bounds, and here they find Delsarte, iterated Johnson, and packing bounds, as well as a “linear programming bound.” They show there are no nontrivial perfect codes in the projective space of dimension n over Fq, and define the notion of a quasilinear code, which is an abelian group in the projective space. A code is quasilinear iff its size is a power of two. Honestly, I found myself a bit lost, since I’m not really a coding theorist, but it seems that the rank metric channel has lots of interesting properties, some old and some new, and exciting new algebraic codes can be constructed for it.

On capacity scaling in arbitrary wireless networks
Urs Niesen, MIT, Piyush Gupta, Bell Labs, and Devavrat Shah, MIT

The goal of this paper is to relax some of the random-placement assumptions made in a lot of cooperative schemes for large wireless networks. They look at n nodes scattered in [0, n^{1/2}]^2 with a standard path loss model for the gains, full CSI, and equal rate demands for each source-destination pair. There are three schemes to consider — multihop, hierarchical cooperation, and a new scheme proposed here, cooperative multihop. For large path loss exponent the multihop strategy is optimal, and in other regimes hierarchical cooperation is optimal. The cooperative multihop scheme works by dividing the network into squarelets and then doing multihop on the squarelet-level overlay graph. This strategy is optimal in some cases, and has the benefit of not relying on the node placement model, unlike the hierarchical cooperation scheme. These network models are interesting, but the full CSI assumption seems pretty darn restrictive. I wonder if cooperative multihop would be more attractive also because channel estimation and ranging need only be done at a local level.

Percolation processes and wireless network resilience
Zhenning Kong and Edmund M. Yeh, Yale

Consider a random geometric graph with density chosen above the percolation threshold. This is an often (ab?)used model for wireless sensor networks. First off, the graph obtained by throwing n points in the unit square converges, in distribution, to the percolation model given by an infinite Poisson process in the plane with the Gilbert disc model. If we make nodes “fail” by flipping a coin for each node, this just thins the process, so it’s easy to find a threshold on the failure probability, in terms of the percolation threshold, that guarantees a giant connected component will still exist. So far so good, but what happens if the failure probability depends on the degree of the node? High-degree nodes may serve more traffic and hence fail sooner, so this is a reasonable model. This work finds necessary conditions and sufficient conditions on the failure probability distribution p(k), where k is the node degree, for a giant connected component to still exist. The two conditions don’t match, but they are interesting; in particular, the sufficient condition is very clean. One application of these results is to a cascading failures model, where you want to stop one node going down from taking down the whole network (cf. the power failures in August 1996). It would be interesting to see if a real load redistribution could be integrated into this — nodes have weights (load) on them, proportional to degree, say, and then dynamically nodes are killed and push their load onto their ex-neighbors. How fast does the whole network collapse? Or perhaps a sleeping-waking model where a node dumps its traffic on neighbors while sleeping and then picks up the slack when it wakes up again…
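The coin-flip version of the failure model is easy to simulate: throw n points in the unit square, connect pairs within distance r, delete each node independently, and look at the largest surviving component. A rough sketch (all parameter choices and names here are mine):

```python
import math
import random

def giant_fraction(n=1500, fail_prob=0.2, mean_degree=10.0, seed=1):
    """Largest-component fraction among surviving nodes of a random
    geometric graph in the unit square after i.i.d. node failures."""
    rng = random.Random(seed)
    r2 = mean_degree / (math.pi * n)  # radius^2 giving the target mean degree
    pts = [(rng.random(), rng.random()) for _ in range(n)]
    alive = [i for i in range(n) if rng.random() > fail_prob]
    parent = list(range(n))  # union-find with path halving
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for idx, i in enumerate(alive):
        xi, yi = pts[i]
        for j in alive[idx + 1:]:
            dx, dy = xi - pts[j][0], yi - pts[j][1]
            if dx * dx + dy * dy <= r2:
                parent[find(i)] = find(j)
    sizes = {}
    for i in alive:
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / len(alive)
```

Well above the threshold (mean degree around 10 with 20% of nodes failing), a giant component containing most of the survivors persists; pushing fail_prob up thins the process toward criticality and the giant component disappears.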

You say Tschebyscheff, I say Chebyshev

As I reread the Burnashev-Zigangirov paper on interval estimation today, I came across a new (to me) spelling of the mathematician Chebyshev’s name. I found a page with variant spellings, including

  • Chebyshev
  • Chebyshov
  • Chebishev
  • Chebysheff
  • Tschebischeff
  • Tschebyshev
  • Tschebyscheff
  • Tschebyschef
  • Tschebyschew

I know that “Tsch” comes from the German transliteration. But today I saw “Chebysgev,” which is a totally new one to me. Where does the “g” come in? The name is actually Чебышев, which may or may not show up depending on your Unicode support.

UPDATE: Hari found “Tchebichev” in Loève’s Probability Theory book.

Smirnov’s Theorem : dependence flowchart

This last semester we had a reading group on percolation theory, using the new book by Bollobás and Riordan. The crowning moment of our discussions was a 3-week trek through the proof of Smirnov’s theorem, which shows the conformal invariance of crossing probabilities for the triangular lattice. The book apparently contains the first “complete” proof in print. It’s quite involved, and for my part of the presentation I made the following flowchart of the structure of the proof:

Smirnov’s Theorem Flowchart

a little puzzle

Prasad brought work to a halt earlier this week by issuing the following little conundrum. A fair 6-sided die is rolled independently for each of n people. Each person’s number is written on a card and stuck on their head so that they can see everyone else’s number but not their own. The people are not allowed to communicate after the numbers are assigned, and from just viewing the others’ numbers they must guess their own. They can come up with any rule that they like for doing this.

What rule should they choose so that the probability that all of them guess correctly is maximized, and what is that probability?

This is a variation on the “colored hats game,” but the criterion you want to maximize is slightly different than in some instances of the game.
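For small n you can evaluate candidate rules exhaustively. Here is a brute-force check of one natural sum-based rule, where everyone pretends the grand total is congruent to the same fixed value mod 6; I’ll leave it as part of the puzzle whether this is actually optimal (the code and names are mine):

```python
from itertools import product

def all_correct_prob(n, target=0):
    """P(everyone guesses right) when each person, seeing the other n-1
    dice (values 0..5), guesses the value that would make the grand
    total congruent to `target` mod 6."""
    wins = 0
    for rolls in product(range(6), repeat=n):
        total = sum(rolls)
        # the person holding r sees total - r
        if all((target - (total - r)) % 6 == r for r in rolls):
            wins += 1
    return wins / 6 ** n
```

Note that independent random guessing succeeds with probability only 6^{-n}; this rule does far better because everyone is right or wrong together.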

paper a day : approximating high-dimensional pdfs by low-dimensional ones

Asymptotically Optimal Approximation of Multidimensional pdf’s by Lower Dimensional pdf’s
Steven Kay
IEEE Transactions on Signal Processing, V. 55 No. 2, Feb. 2007, pp. 725–729

The title kind of says it all. The main idea is that if you have a sufficient statistic, then you can create the true probability density function (pdf) of the data from the pdf of the sufficient statistic. However, if there is no sufficient statistic, you’re out of luck, and you’d like to create a low-dimensional pdf that somehow best captures the features you want from the data. This paper proves that a certain pdf created by a projection operation is optimal in that it minimizes the Kullback-Leibler (KL) divergence. Since the KL divergence dictates the error in many hypothesis tests, this projection operation is good in that decisions based on the projected pdf will be close to decisions based on the true pdf.

This is a correspondence item, so it’s short and sweet — equations are given for the projection and it is proved to minimize the KL divergence to the true distribution. Examples are given for cases in which sufficient statistics exist and do not exist, and an application to feature selection for discrimination is given. The benefit is that this theorem provides a way of choosing a “good” feature set based on the KL divergence, even when the true pdf is not known. This is done by estimating an expectation from the observed data (the performance then depends on the convergence speed of the empirical mean to the true mean, which should be exponentially fast in the number of data points).
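That parenthetical point is easy to illustrate: a KL divergence is an expectation under the true distribution, so it can be estimated by an empirical mean over samples. A toy sketch for two unit-variance Gaussians, where KL(N(0,1) || N(1,1)) = 1/2 in closed form (this example is mine, not from the paper):

```python
import math
import random

def empirical_kl(samples, log_p, log_q):
    """Monte Carlo estimate of D(p||q) = E_p[log p(X) - log q(X)]."""
    return sum(log_p(x) - log_q(x) for x in samples) / len(samples)

rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(100000)]
log_p = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)
log_q = lambda x: -0.5 * (x - 1.0) ** 2 - 0.5 * math.log(2 * math.pi)
est = empirical_kl(samples, log_p, log_q)
# closed form: (mu_p - mu_q)^2 / 2 = 0.5; est should land close to that
```

The convergence is the usual law-of-large-numbers story, with large-deviation bounds giving the exponential rate mentioned above.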

The formulas are sometimes messy, but it looks like it could be a useful technique. I have this niggling feeling that a “bigger picture” view would be forthcoming from an information-geometry/differential-geometry viewpoint, but my fluency in those techniques is lacking at the moment.

Update: My laziness prevented me from putting up the link. Thanks, Cosma, for keeping me honest!