platonic solids and surfaces of constant norm

Warning : this is all tenuous connections in my head and probably has some horrible misunderstanding in it.

In high dimensions (> 4) there are only 3 regular polyhedra — the n-simplex, n-octahedron, and n-cube. The latter two are also surfaces of constant l1-norm and l∞-norm respectively. Those two can be thought of as “dual” to each other since 1/1 + 1/∞ = 1 (by analogy with p-norms). The sphere is also a regular body that exists in every dimension, and is a surface of constant 2-norm, and 2 is self-dual. Is there some crazy norm or metric such that the simplex is also a surface of constant norm?

contradiction injunction

I came across the following in Royden’s Real Analysis:

All students are enjoined in the strongest possible terms to eschew proofs by contradiction!  There are two reasons for this prohibition : First, such proofs are very often fallacious, the contradiction on the final page arising from an erroneous deduction on an earlier page… Second, even when correct, such a proof gives little insight into the connection between A and B, whereas both the direct proof and the proof by contraposition construct a chain of argument connecting A with B.  One reason that mistakes are so much more likely in proofs by contradiction than in direct proofs or proofs by contraposition is that in a direct proof (assuming the hypothesis is not always false) all deductions from the hypothesis are true in those cases where the hypothesis holds, and similarly for proofs by contraposition (if the conclusion is not always true), the deductions from the negation of the conclusion are true in those cases where the conclusion is false.  Either way, one is dealing with true statements, and one’s intuition and knowledge about what is true help to keep one from making erroneous statements.  In proofs by contradiction, however, you are (assuming the theorem is true) in the unreal world where any statement can be derived, and so the falsity of a statement is no indication of an erroneous deduction.

This impassioned attack reminds me of Ionesco’s The Lesson or Stoppard’s Jumpers.  It’s tempting to psychoanalyze Royden and say that he must have an irrational fear of an unreal “make-believe” world or an overwhelming need to have certainty and truth.  But that’s just silly, isn’t it?

paper a day : pseudodifferential operators for wireless communications

It should really be “paper every two weeks” or something — mostly it’s laziness in the blogging half of the paper-a-day routine rather than the reading half. This one was pretty heavy going though.

Pseudodifferential operators and Banach algebras in mobile communications
Thomas Strohmer
Appl. Comp. Harm. Anal., to appear.

This paper is pretty heavy on the math but does a nice job of explaining the engineering concepts as well. There are two main thrusts here : modeling wireless channels and OFDM pulse design. Multipath fading and Doppler shift in wireless communication channels can be mathematized via pseudodifferential operators, and designing OFDM pulses can be viewed as trying to make a certain matrix that represents the channel equalization sparse or concentrated on its diagonal. The tools Strohmer uses are pretty advanced (for me), but that’s because I’m not up on my harmonic analysis.

Wireless fading channels are typically modeled by a convolution with a time-varying filter. If the filter were time-invariant we could take Fourier transforms and write the channel as a multiplication by a transfer function. In the time-varying case we can make a similar representation in terms of a time-varying transfer function, which makes the channel law into a pseudodifferential operator whose Kohn-Nirenberg symbol is precisely the time-varying transfer function. Thus modeling constraints on the symbol can be translated into constraints on the operator.
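To make this concrete (my notation, not necessarily the paper’s), the Kohn-Nirenberg quantization writes the channel output in terms of the symbol σ(t, f) as

(H x)(t) = \int \sigma(t, f) \, \hat{x}(f) \, e^{2\pi i f t} \, df

so when σ does not depend on t this collapses to ordinary multiplication by a transfer function in the frequency domain.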

The decay profile from multipath fading and the effect of Doppler shift provide constraints on the localization of the symbol in the time-frequency plane. The engineering constraints don’t give a nice characterization of the symbol per se, but we can embed the class of channels into a Banach algebra of operators with certain weight functions. We can also embed the symbol into a specific modulation space called a Sjöstrand class.

Turning to the equalization problem, OFDM pulses form a Gabor system, which is a kind of time-frequency basis for a function space. We would like to choose a basis so that recovering the data modulated on these pulses is easy. It turns out that the whole equalization operation can be written as a matrix that is related to the set of pulses, so the condition we want is for this matrix and its inverse to be nearly diagonal or sparse.
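To fix notation (again mine), the OFDM pulses are time-frequency shifts of a single prototype pulse g with symbol period T and carrier spacing F,

g_{k,l}(t) = g(t - kT) e^{2\pi i l F t}

and, roughly speaking, the matrix in question is built from inner products between these pulses after they have passed through the channel.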

The big theorem (Theorem 4.1) in the paper essentially states that for pulses with good time-frequency localization, if the channel’s K-N symbol is invertible, then the inverse of the equalization matrix belongs to a certain algebra. This is the mathematical statement of the pulses being a “good system for communication.” This, plus some deeper relationships between the various function spaces, gives a way of actually engineering a pulse design that trades off spectral efficiency against inter-symbol and inter-channel interference.

All in all, this was a good paper to read but I don’t think I have the background to go and use these tools and techniques on anything because the math is pretty far above my head.

paper a day : games people don’t play

Games People Don’t Play
Peter Winkler
Puzzlers’ Tribute, David Wolfe and Tom Rodgers, eds., A K Peters Ltd. (2001)

This is a short note on 4 little games that people don’t really play. Some are unfair, some are violent, but the interesting question arises for each — what is the best strategy for each player? I’ll describe two of them (with variants) — the other two are equally interesting, but there’s no point in repeating everything!

In “Larger Or Smaller”, Paula writes two numbers on different slips of paper. Victor picks one and has to guess whether it’s the larger or the smaller of the two, winning $1 if he’s right. In one version the numbers are just different integers, and in the other the numbers are drawn from a uniform distribution on [0,1] and Paula picks which one Victor sees.

Victor can do better than even (if only by a tiny amount) by adopting a clever thresholding strategy proposed by Cover. He picks a random threshold, guesses “larger” or “smaller” based on comparing the revealed number to it, and will win slightly more than half the time (think about it). In the random version, the game is completely fair, with a simple strategy for Paula to choose which number to reveal.
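Here is a quick simulation of the thresholding idea (my own sketch; the threshold distribution and Paula’s numbers are arbitrary choices, since Cover’s argument only needs a threshold density with full support):

```python
import random

def play_round(rng):
    # Paula writes two distinct integers; random here just for illustration
    # (the strategy's edge holds for any fixed pair of distinct numbers).
    x, y = rng.sample(range(-100, 100), 2)
    # Victor picks one of the two slips at random.
    revealed, hidden = (x, y) if rng.random() < 0.5 else (y, x)
    # He draws a threshold from a distribution with full support on the reals
    # and guesses "larger" exactly when the revealed number exceeds it.
    t = rng.gauss(0, 50)
    return (revealed > t) == (revealed > hidden)

rng = random.Random(0)
trials = 200_000
print(sum(play_round(rng) for _ in range(trials)) / trials)
# Strictly above 1/2: Victor wins outright whenever the threshold lands
# between the two numbers, and breaks even otherwise.
```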

In “Colored Hats” a dictator gives blue and red hats to a roomful of dissidents. They cannot communicate. Each can see the hats of the others, and they simultaneously guess their own hat color. If they are wrong they are executed. How do we maximize the number of survivors? As a variant, the dissidents are lined up and guess from the back of the line to the front, so they can see all the hats ahead of them and hear the guesses from those behind them.

It turns out you can save floor(n/2) of the dissidents by pairing them up: in each pair, one guesses the color they see on their partner and the other guesses the opposite, so exactly one person in each pair is always right. In the sequential version you can save all but the first guesser (who survives only by luck) by having that person announce a guess encoding the parity of the red hats in front of them. This provides enough information for everyone else to deduce their own hat color.
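Here’s a sketch of the sequential parity strategy (my own implementation, with “red”/“blue” strings and the back of the line speaking first):

```python
import random

def sequential_hats(hats):
    """hats[0] is the front of the line; hats[-1] is the back, who guesses first
    and encodes the parity of red hats ahead of them.  Everyone else combines
    that parity with what they see in front and what they've heard from behind."""
    n = len(hats)
    guesses = [None] * n
    parity = sum(h == "red" for h in hats[:n - 1]) % 2
    guesses[n - 1] = "red" if parity else "blue"
    heard = 0  # parity of red hats announced by people behind (excluding the encoder)
    for i in range(n - 2, -1, -1):
        seen = sum(h == "red" for h in hats[:i]) % 2
        mine = (parity - seen - heard) % 2
        guesses[i] = "red" if mine else "blue"
        heard = (heard + mine) % 2
    return guesses

rng = random.Random(0)
hats = [rng.choice(["red", "blue"]) for _ in range(10)]
guesses = sequential_hats(hats)
# Everyone except possibly the back person guesses correctly.
print(sum(g == h for g, h in zip(guesses, hats)), "correct out of", len(hats))
```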

This hat problem is particularly interesting because of its relation to some of the work in my Master’s thesis. So this paper is actually relevant (albeit tangentially) to my research. Plus it’s a fun and entertaining read. Recreational math and little games like this were what really got me interested in mathematics when I was younger. That and Raymond Smullyan, of course.

paper a day : the Gale-Ryser theorem

By request, today’s paper is kind of about the mathematics of paint-by-numbers puzzles. Merry Christmas, Erin.

A Simple Proof of the Gale-Ryser Theorem
Manfred Krause
The American Mathematical Monthly, Vol. 103, No. 4. (Apr., 1996), pp. 335-337.

The Gale-Ryser Theorem is one of those beautiful little gems in combinatorics. The problem is rather simple to state. Suppose you have a k by n matrix A of zeros and ones. Let r be the vector of sums along each row and c be the vector of sums along each column. The question is this : given an r and a c, when does there exist a matrix A with those row and column sums?

First off, note that r and c satisfy

\| r \|_1 = \|c\|_1

since the total number of ones has to be constant. We need a few more concepts to understand the result. A partition of N is a nonincreasing sequence of positive integers r1 ≥ r2 ≥ … ≥ rk whose sum is N. It turns out that it’s sufficient to consider only r and c that are partitions (hint : start swapping rows and columns).

Given two partitions r and c of N, we say r is dominated by c if

\sum_{i=1}^{m} r_i \le \sum_{i=1}^{m} c_i \qquad \forall m

where the summands become 0 if we exhaust the partition. As it turns out, dominance provides a partial order on the set of partitions that is particularly useful in combinatorics. The whole theory of majorization is based around the dominance order.
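For example, r = (2, 2, 1) is dominated by c = (3, 1, 1): the partial sums satisfy 2 ≤ 3, 4 ≤ 4, and 5 ≤ 5.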

The last concept we need is a Ferrers board and the conjugate partition. Given a partition r of N, imagine a matrix of 0’s and 1’s that has rj 1’s in row j, packed to the left. This is the Ferrers matrix of the partition. The conjugate partition of r, denoted by r*, is the partition whose Ferrers matrix is the transpose of the Ferrers matrix for r.
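As a small example, the Ferrers matrix of r = (3, 2, 1, 1) is

\begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{pmatrix}

and its transpose has row sums 4, 2, 1, so r* = (4, 2, 1).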

The Gale-Ryser theorem says that a matrix A exists with row and column sums r and c if and only if r is dominated by c*.

To see that the dominance relation is necessary, suppose a matrix A exists but r is not dominated by c*. Then look for gaps in A — that is, rows in which you get a 0 and then a 1 — and induct on the number of gaps. If there are no gaps we have a Ferrers matrix, so we’re done. If there is a gap, you can swap two entries in a column to reduce the number of gaps.

The novelty in the paper is a new proof of the “if” (sufficiency) half of the statement. It’s basically an algorithm that provably constructs a matrix satisfying the given row and column sums. Clearly this matrix is not unique in general. You start with the Ferrers matrix A0 for c and do a series of swaps until the row sums equal r. Let Aj be the matrix after j swaps. The algorithm guarantees the following:

\| r(A_j) - r \|_2 < \| r(A_{j-1}) - r \|_2

where r(A_j) denotes the vector of row sums of A_j. Since this distance strictly decreases at every swap, the algorithm eventually terminates with a valid matrix.
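For the construction itself, here is a sketch of the classical greedy realization (not Krause’s swap-based algorithm from the paper): each row places its ones in the columns that still need the most ones.

```python
def gale_ryser_matrix(r, c):
    """Greedily build a 0-1 matrix with row sums r and column sums c, or return
    None if the sums are not realizable.  Each row puts its ones into the
    columns with the largest remaining demand (classical greedy construction,
    not the swap algorithm from Krause's proof)."""
    if sum(r) != sum(c):
        return None
    k, n = len(r), len(c)
    remaining = list(c)
    A = [[0] * n for _ in range(k)]
    for i in sorted(range(k), key=lambda i: -r[i]):  # larger rows first
        if r[i] > n:
            return None  # a row cannot have more ones than there are columns
        cols = sorted(range(n), key=lambda j: -remaining[j])[:r[i]]
        if any(remaining[j] == 0 for j in cols):
            return None  # ran out of column capacity: not realizable
        for j in cols:
            A[i][j] = 1
            remaining[j] -= 1
    return A

# Row sums (3, 2, 1) and column sums (2, 2, 1, 1) are realizable:
for row in gale_ryser_matrix([3, 2, 1], [2, 2, 1, 1]):
    print(row)
```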

The application to paint-by-numbers should be pretty clear. A PBN is just a 0-1 matrix of pixels. If the image in the PBN is convex, then the puzzle is kind of trivial, but you can use the theorem to decide if it's solvable. An interesting extension to work out would be a theorem or fast algorithm for deciding if a general PBN is solvable or not. If I had the time to think about this, I would. Maybe it's something to do on a rainy afternoon after my deadlines have passed.

flags

Does anyone know why a maximal chain in the lattice of subspaces of a vector space is called a flag? A chain of increasing subspaces doesn’t look anything like a flag!

a paper a day : neuron polarity

I’ll keep ’em shorter from now on I think.

A probabilistic model for the establishment of neuron polarity
Kostya Khanin and Raya Khanin
J. Math. Biol. 42, 26–40 (2001)

This paper seeks to answer a puzzling question in neuron development — why does only one neurite (outgrowth on the neuron) out of many grow into an axon? The answer they have lies in a particular kind of inhomogeneous Markov chain or random walk that has also been called a “balls-in-bins process with feedback.” This kind of model has also been used in analyzing competitive scenarios in other fields, as well as in preferential attachment models for networks. They model neurites as bins and neurite length as the number of balls in a bin.

The basic model is this: suppose we have k bins which contain balls. Let a(t) be a length-k vector whose entries are the number of balls in the bins at time t, and let f(n) be a function that we will call the value of having n balls. At each time step, a ball is added to one of the bins at random. The probability that bin j is chosen is

\frac{f(a_j(t))}{\sum_{i=1}^{k} f(a_i(t))}

That is, the probability is equal to the value of bin j over the sum of the values of all the bins.

In this paper they look at value functions of the form

f(n) = n^p

and show that the behavior falls into the following regimes:

  • Critical regime (p = 1) : all bins grow linearly with time, but there is one winner.
  • Ballistic growth (p > 1) : after a certain time, only one neurite grows. Another way of viewing this is that all but finitely many balls end up in one bin.
  • Subcritical + winner (1/2 < p < 1) : In this case the ratio of the number of balls in any two bins tends to one. So they all grow linearly, but if b(t) = a(t) – t/k is the fluctuation around the linear growth, then

    b_j(t) / t^p \to c_j

    That is, the fluctuations are on the order of f(n). The bin with the largest c_j will always be leading.

  • Subcritical (0 < p <= 1/2) : In this case there is no persistent leader. That is, for each bin there exists an infinite sequence of times at which that bin is the leader. The fluctuations, appropriately normalized, tend to a Gaussian. That is, some sort of central limit behavior kicks in.

In order to prove these cool and weird results, they embed the arrival process of balls in each bin into a sum of exponential random variables whose rate changes over time (according to how many balls are in the bin). That is, if a bin has r balls, the rate of the next term in the sum is f(r). The bin whose exponential random variable is smallest gets the next ball — the min operation on exponential random variables gives the probability ratio above. That is, the probability that the smallest exponential random variable was the j-th one is exactly that probability above. Thus the probability distribution for the number of balls is identical for the discrete-time model as well as the continuous time embedding. Once they have the embedding it’s a bunch of other tricks to get the results in nice form.
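Here’s a toy simulation of the discrete-time process (my own, with f(n) = n^p and one ball per bin to start) that makes the regimes easy to see:

```python
import random

def balls_in_bins(k, p, steps, seed=0):
    """Feedback urn: bin j gets the next ball with probability proportional to
    f(a_j) = a_j ** p.  Starts with one ball in each bin."""
    rng = random.Random(seed)
    a = [1] * k
    for _ in range(steps):
        j = rng.choices(range(k), weights=[x ** p for x in a])[0]
        a[j] += 1
    return a

for p in (0.3, 1.0, 1.5):
    print(p, sorted(balls_in_bins(k=5, p=p, steps=20_000), reverse=True))
# p > 1: essentially all balls end up in one bin (ballistic / monopoly regime);
# p <= 1: every bin keeps growing, with fluctuations that depend on p.
```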

A more recent paper by Oliveira has a nice extension of these arguments for more general f(n) by approximating the embedding with Brownian Motion. As it turns out, the critical thing to check is the summability and square summability of 1/f(n).

another quick test

Since I was updating the old blog anyway, I figured I’d try out some nifty new features that people have developed, like MimeTeX, which is a cgi script that renders LaTeX on the fly for you and doesn’t have any other dependencies. So now I can write about things like the Gaussian distribution:

p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

Although it might be a bit hard to use expressions that are too fancy and use all the crazy squiggles and arrows that come with the AMS packages, this is definitely a step up in the world of blogging math.

kriging

In trying to figure out if this problem I’m working on has been addressed by the statistics community, I found myself forming Google queries on “semiparametric Gaussian estimation.” The problem I’m looking at is the following. Suppose X(t) is an iid Gaussian vector. I get to observe Y(t) = A(t) X(t) + W(t), where W(t) is an iid Gaussian noise vector and A(t) is a matrix-valued random variable taking values in a finite set, iid across time. I want to form a minimum mean-squared error (MMSE) estimate of X(t) from Y(t). If A(t) is known at all times t then this is easy, since I can make a linear estimator for each value that A(t) can take. Instead, I’ll make the crazy assumption that the estimator has to be linear and designed off-line (i.e. not data dependent), and that the distribution p(A) is not known. What’s the best estimator and worst-case error?
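To make the “easy” case concrete, here is a sketch of the per-realization linear MMSE estimator when A(t) is known (my own numerical example; the dimension, noise level, and the two-element set of matrices are all made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4              # dimension of X (assumed)
sigma_w = 0.5      # noise standard deviation (assumed)

def lmmse(A, sigma_w):
    """Linear MMSE estimator matrix F for Y = A X + W, with X ~ N(0, I) and
    W ~ N(0, sigma_w^2 I):  x_hat = F y,  F = A^T (A A^T + sigma_w^2 I)^{-1}."""
    m = A.shape[0]
    return A.T @ np.linalg.inv(A @ A.T + sigma_w**2 * np.eye(m))

# A hypothetical two-element set of channel matrices.
possible_A = [np.eye(n), 0.5 * np.eye(n)]
A = possible_A[rng.integers(len(possible_A))]

x = rng.standard_normal(n)
y = A @ x + sigma_w * rng.standard_normal(n)
x_hat = lmmse(A, sigma_w) @ y
print(np.linalg.norm(x - x_hat))  # estimation error for this realization
```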

In my searching the web, however, I turned up some crazy things on long memory parameters in nonstationary time series. I also came across the term kriging, which looked like a typo for something. Instead, it really means Gaussian process regression — yet another instance of jargon standing in the way of understanding. Unfortunately, I don’t think it’s quite what I want. Back to the ol’ search engine…

Chernoff bounds and Bernstein’s trick

I came across some references to “Bernstein’s trick” in some papers I was reading, but had to do a little digging to find out what it really meant. Suppose you have independent and identically distributed random variables Xi for i = 1, 2, … n, taking values in [-1, 1], and with E[Xi] = 0. Our goal is to bound

Pr[ (1/n) ∑ Xi > a ] .

The Bernstein trick is to multiply both sides of the inequality inside the probability by n and by a positive dummy variable t, and then exponentiate:

Pr[ (1/n) ∑ Xi > a ] = Pr[ exp(t ∑ Xi) > exp(a n t) ] .

Now we use the Markov inequality on the random variable exp(t ∑ X_i):

Pr[ exp(t ∑ Xi) > exp(a n t) ] ≤ exp(- a n t) E[ exp(t ∑ Xi) ],

Now we can leverage the fact that the X_i‘s are independent:

exp(- a n t) E[ exp(t ∑ Xi) ] = exp(- a n t) E[ exp(tX) ]ⁿ,

where X has the same distribution as Xi. Then we can expand exp(tX) as a Taylor series:

exp(- a n t) E[ exp(tX) ]ⁿ = exp(- a n t) E[ 1 + t X + (1/2!) (t X)² + … ]ⁿ .

We now leverage the fact that X is in the interval [-1,1] and that E[X] = 0 to bound the sum of the higher-order terms in the Taylor series:

exp(- a n t) E[ 1 + t X + (1/2!) (t X)² + … ]ⁿ ≤ exp(- a n t) exp(b n t²) ,

for some constant b, which you can pick based on how much you know about the distribution of X.

Now the more probability-savvy may say “wait a minute, E[ exp(tX) ] is just the moment generating function m_X(t), so isn’t this just a fancy Chernoff bound?” The answer is basically yes, but the term “Chernoff bound” is used for a lot of related bounds. The one I’m most familiar with requires you to know the moment generating function E[ exp(tX) ]:

Pr[ exp(t X) > exp(a t) ] ≤ exp(-a t) m_X(t) .

Since this holds for all t > 0, you minimize the right-hand side over t (by differentiating) to get the tightest bound.

The main difference in these two Chernoff-style bounds is the lack of knowledge needed to apply the former to sums of iid random variables. Of course, you can get all of these bounds from large deviations theory or concentration inequalities, but sometimes a rough trick is all you need to get a sufficiently fast exponential decay so that you can move on to the next step.
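As a sanity check (my own numerical example, not from any of the referenced papers), here is the bound specialized to Rademacher variables, where E[exp(tX)] = cosh(t) ≤ exp(t²/2); optimizing exp(-ant) exp(nt²/2) over t gives t = a and the bound exp(-na²/2):

```python
import math
import random

n, a = 200, 0.2
bound = math.exp(-n * a**2 / 2)   # optimized Chernoff bound for Rademacher X_i

rng = random.Random(0)
trials = 20_000
hits = sum(
    sum(rng.choice((-1, 1)) for _ in range(n)) > a * n
    for _ in range(trials)
)
print(f"empirical: {hits / trials:.4f}   Chernoff bound: {bound:.4f}")
# The empirical tail probability sits comfortably below the bound.
```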

Note : updated with nicer characters, thanks to Erin.

[1] Ahlswede, R. “Elimination of Correlation in Random Codes for Arbitrarily Varying Channels,” Zeitschr. Wahr. und Verw. Geb. V.44, 1978, pp. 159–175.
[2] Wigderson, A. and Xiao, D. “A Randomness-Efficient Sampler for Matrix-valued Functions and Applications.” FOCS 2005.
[3] Gallager, R. Discrete Stochastic Processes, Boston : Kluwer, 1996.