paper a day : periodic bloodsucking rates for vampires

Alex forwarded me this reference:

R. F. Hartl, A. Mehlmann, and A. Novak
Journal of Optimization Theory and Applications, Vol. 75, No. 3, 1992

From the abstract:

In this paper, we present a new approach for modeling the dynamic intertemporal confrontation between vampires and humans. It is assumed that the change of the vampiristic consumption rate induces costs and that the vampire community also derives some utility from possessing humans and not only from consuming them. Using the Hopf bifurcation theorem it can be shown that cyclical bloodsucking strategies are optimal. These results are in accordance with empirical evidence.

Keywords: maximum principle; limit cycles; economics of human resources; vampire myth

To the feather-fool and lobcock, the pseudo-scientist and materialist, these deeper and obscurer things must, of course, appear a grandam’s tale.
Montague Summers. The Vampire in Europe

This paper analyzes a mathematical model of bloodsucking rates for vampires using control theory. However, as they note in the introduction,

To a traditional vampirologist, the use of optimal control theory against vampires, as exercised in Ref. 6, seems highly questionable. This is due to the fact that the application of Pontryagin’s principle requires the derivation of a shadow price for vampires. Such a price is, however, nonexistent since vampires do not have a shadow.

As a predator-prey scenario, we can model the population dynamics using some differential equations. The problem for the vampires is to set a bloodsucking rate (humans per vampire) so as to maximize a utility function subject to the dynamics. However, the model has to be made more sophisticated to account for the cyclical bloodsucking patterns found in real vampires. The modifications are twofold: first, vampires also derive some utility from possessing humans rather than just sucking blood from them, and second, changing the consumption rate penalizes the utility. In this push-and-pull framework they can derive cycles, appropriately named “cycles of fear,” in which the bloodsucking rate is modulated over time to achieve a stable tradeoff and net utility.

The full version, which is not to be missed, can be found via SpringerLink.
For some earlier comments on the optimal destruction of vampires and macroeconomic policy (which involves the shadow price), see this related JSTOR article.

My research is waaaaaay too boring.


paper a day : importance sampling in rare-event simulations

Introduction to importance sampling in rare-event simulations
Mark Denny
European Journal of Physics, 22 (2001) : 403–411

This paper is about importance sampling (IS), which is a method to improve the error behavior of Monte Carlo (MC) methods. In engineering systems, getting good simulation results for rare events (such as decoding errors) on the order of 10^-10 would require an obscene amount of computation if you just did things the naive way. For example, the quality of a numerical bound on the tail probability of a random variable gets worse and worse as you look farther and farther out. Importance sampling is a method of reweighting the distribution to get a smaller error in the regime of interest, to uniformize the estimation error, or both. This paper gives some motivation, a simple IS algorithm, analysis, and some simulations. It’s pretty readable, and I went from knowing nothing about importance sampling to getting a decent idea of how to use it in practice, along with its potential problems and benefits.
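To make the idea concrete, here’s a toy sketch (my own, not from the paper) that estimates the Gaussian tail probability P(X > 5), which is about 2.9 × 10^-7. Naive MC would need billions of samples to see a handful of hits; shifting the sampling distribution so its mass sits right at the rare region, and reweighting by the likelihood ratio, fixes that.

```python
import math
import random

def tail_prob_is(a, n_samples=100_000, seed=0):
    """Estimate P(X > a) for X ~ N(0, 1) by importance sampling: draw from
    the proposal N(a, 1), so about half the samples land in the tail, and
    reweight each one by the likelihood ratio of target to proposal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        y = rng.gauss(a, 1.0)              # sample from the shifted proposal
        if y > a:
            # density ratio phi(y) / phi(y - a) = exp(-a*y + a^2/2)
            total += math.exp(-a * y + a * a / 2.0)
    return total / n_samples
```

With 100,000 samples the relative error here is around a percent, versus needing on the order of 10^9 naive samples just to see one success.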

paper a day : Fountain Capacity

This is one of the papers I saw at ISIT. At the time I thought I understood it, but reading the actual paper made it much clearer.

Fountain Capacity
S. Shamai, E. Telatar, and S. Verdú
Proc. Int’l Symp. on Information Theory

The background for this paper is fountain codes. The basic idea is pretty simple. Suppose you have k packets of data that you want to send. Then one way of sending them is to pick a random subset of the packets, XOR them, and send that, along with some header information about which packets are in the XOR. This is just a linear combination of the packets, so a receiver getting a sufficient number of coded packets will be able to invert the set of linear equations to recover the original data. The encoder acts like a fountain spewing out coded packets. Suppose the packets go over an erasure channel, for which packets either arrive intact or are erased. If there are many users with varying link qualities, the ones with better links will be able to decode early and disconnect. If a user connects in the middle, they don’t lose anything from missing the “earlier” coded packets. All that matters is that each user/decoder gets enough unerased packets to enable them to decode.
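Since the encoding is just random XORs, a tiny sketch makes this concrete. This is my own toy version (random half-density subsets and brute-force Gaussian elimination over GF(2)), not an actual LT or Raptor code:

```python
import random

def encode(packets, rng):
    """One coded packet: XOR of a random subset of the sources, plus the
    header saying which sources went into the XOR."""
    k = len(packets)
    subset = [i for i in range(k) if rng.random() < 0.5] or [rng.randrange(k)]
    coded = 0
    for i in subset:
        coded ^= packets[i]
    return subset, coded

def decode(k, received):
    """Gaussian elimination over GF(2); each (subset, value) pair is one XOR
    equation on the k source packets.  Returns the packets, or None if the
    equations received so far don't have full rank yet."""
    pivots = {}                              # pivot index -> (index set, value)
    for subset, value in received:
        s, v = set(subset), value
        while s:
            j = max(s)
            if j in pivots:
                ps, pv = pivots[j]
                s ^= ps                      # reduce by the stored pivot row
                v ^= pv
            else:
                pivots[j] = (s, v)
                break
    if len(pivots) < k:
        return None
    out = [0] * k
    for j in sorted(pivots):                 # back-substitute, low indices first
        s, v = pivots[j]
        for i in s:
            if i != j:
                v ^= out[i]
        out[j] = v
    return out

# demo: keep collecting coded packets until the receiver can decode
rng = random.Random(1)
packets = [7, 42, 13, 99]
received, msg = [], None
while msg is None:
    received.append(encode(packets, rng))
    msg = decode(len(packets), received)
```

Any user who collects enough independent equations decodes; which particular coded packets they caught, and in which order, doesn’t matter.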

The purpose of this paper is to see if and how this decoder-centric view of communication can be put on a more traditional information-theoretic (read : Shannon Theory) foundation. The idea is that the encoder will take M messages and encode them into infinitely long strings. The encoding is randomized, and the randomization is known to the decoder. The encoded symbols go through a channel, but the decoder is pay-per-view : it can only see n outputs of the channel, but it can’t control which n outputs. An adversary, without knowing which codebook is being used, picks the schedule of n outputs observed by the decoder. The probability of decoding error is then given by the worst case over subsets of n outputs. The capacity is the supremum of rates for which the decoding error can be driven to 0.

The main results of the paper are

  • Randomized coding is necessary, and an infinite amount of randomness is required.
  • Fountain capacity is always upper bounded by Shannon capacity.
  • For stationary memoryless channels the fountain capacity is equal to the Shannon capacity.
  • For channels with memory, the fountain capacity may be less than Shannon capacity.

What’s a little unsatisfying about this paper is precisely the same thing that makes it interesting — the channel model they’re using is a slippery one, so it’s difficult to see how “reasonable” it is. For example, I think the infinite randomization thing is a real drawback, but maybe the channel assumptions can be relaxed to take that away. I’m interested in this paper because it is very related to arbitrarily varying channels (AVCs), which is what my thesis will likely revolve around. There are some comments at the end which relate it to AVCs, but I should think a little bit more about the real underlying problem and how to connect it back to AVCs.

paper a day (month?) : isomap

A Global Geometric Framework for Nonlinear Dimensionality Reduction
J. B. Tenenbaum, V. de Silva, J. C. Langford
Science, vol. 290, pp. 2319–2323, 22 December 2000.

I have to present this paper for a computational neuroscience reading group that I can (finally) attend. The basic problem is something called manifold learning. Imagine you have a very large data set in a huge number of dimensions — for example, 64 x 64 pixel images of faces, which live in a 4096-dimensional space. Furthermore, suppose that all of the pictures are of the same person, with only two parameters changing — the angle of rotation of the face, and the illumination. The data has only two degrees of freedom, so you would think it would live on a 2-dimensional subspace.

Unfortunately, the data you have doesn’t occupy a linear subspace in the observed variables. Instead, it lives on a manifold, which is like a surface in your high-dimensional space that may be strangely curved or twisted, and so may be very poorly approximated by a linear subspace. However, the manifold does have its own coordinate system, and you can calculate distances between points on the manifold. The shortest path between two points is called a geodesic.

Another way to visualize this is a ribbon that is curled into a spiral. A point A on the ribbon might look close to a point B that is on an outer ring, but if you unwrapped the ribbon they would be far apart. So one thing you might think to do is somehow figure out what the real distance (on the surface of the ribbon) between A and B is. You might be able to do that by somehow hopping from data-point to data-point in short hops that would hopefully follow the contour of the ribbon.

That is exactly what the Isomap algorithm, described in this paper, does to perform its dimensionality reduction. Given a set of points {xk : k = 1, 2,…, K} in n-dimensional space X, they first make a graph G with vertex set {xk} by putting an edge between two points if the distance between them in X is less than some threshold. The edge has a weight given by the distance. Then they find a K x K matrix D whose (i,j)-th entry is the minimum-weight (distance) path in the graph G between xi and xj. Assuming the threshold is not too large and there are a lot of data points, these lengths should closely approximate the true (geodesic) distance on the manifold.
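The first two steps (the neighborhood graph and the shortest-path distances) fit in a few lines of numpy. This is a minimal sketch assuming an epsilon-neighborhood rule and plain Floyd-Warshall, not the authors’ implementation:

```python
import numpy as np

def geodesic_distances(X, eps):
    """Steps 1-2 of an Isomap-style procedure: build the eps-neighborhood
    graph, then approximate geodesic distances by shortest paths in it."""
    K = len(X)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))          # ambient Euclidean distances
    G = np.where(D <= eps, D, np.inf)         # keep only the short edges
    np.fill_diagonal(G, 0.0)
    for m in range(K):                        # Floyd-Warshall relaxation
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    return G

# three points on an "L": going around the corner is length 2, even though
# points 0 and 2 are only sqrt(2) apart in the ambient space
X = np.array([[0., 0.], [1., 0.], [1., 1.]])
G = geodesic_distances(X, eps=1.2)
```

The “L” example is the ribbon in miniature: the graph distance follows the shape of the data rather than cutting across the ambient space.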

Armed with this matrix of distances between points, we can try to embed the manifold into a d-dimensional Euclidean space without distorting the distances too much. This is an easy problem to visualize: just imagine taking a globe, fixing a few cities, and trying to make a flat map in which the distances between those cities are preserved. There is an algorithm called multidimensional scaling (MDS) that can do this. You can trade off the embedding distortion against the number of dimensions.
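Classical MDS itself is short: double-center the squared distances to get a Gram matrix, then read coordinates off its top eigenvectors. A minimal numpy sketch (mine, not the authors’ code):

```python
import numpy as np

def classical_mds(D, d):
    """Classical multidimensional scaling: recover d-dimensional coordinates
    whose pairwise distances approximate the given distance matrix D."""
    K = len(D)
    J = np.eye(K) - np.ones((K, K)) / K       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # Gram matrix of centered points
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:d]             # keep the top d components
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

For a distance matrix that really is Euclidean in d dimensions, this recovers the configuration exactly (up to rotation and translation); for the graph distances from the previous step it gives the least-distorted d-dimensional picture.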

This paper comes with its own homepage, which has some data sets and MATLAB code. If only all practitioners were so generous — too often the algorithm implementation is kept under wraps, which makes me wonder if there are some dirty secrets hiding behind the pretty plots.

One thing about reading this paper that annoyed me is that all of the technical details (which I care about) are hidden in tiny-print footnotes. Furthermore, the citations do not include paper titles, so you can’t tell what the cited papers are actually about. I know that page space is precious, but it’s just plain stupid. Shame on you, Science. I expected better.

As an amusing postscript, the commentary on this paper and the locally linear embedding paper (Roweis and Saul), written by Seung and Lee, has pictures of Bush and Gore in the print edition, but due to copyright issues the online version had to be changed.

paper a day : a counterexample for Gaussian feedback capacity

This is kind of a cop-out, since it’s a two-page “note” rather than a real paper, but since I’m so out of practice of reading and synthesizing papers I think that it’ll be a baby-step back into the swing of things. This paper is another information-theory one, but it uses a more general strategy that is easily explained to a general audience.

A Counterexample to Cover’s 2P Conjecture on Gaussian Feedback Capacity
Young-Han Kim
arXiv : cs.IT/0511019

Imagine you have a communication channel that can be modeled as

Y = X + Z

where X is the input, Y is the output, and Z is a stationary Gaussian process (not necessarily white noise, but possibly shaped with some spectrum). The approach information theory takes is to say that we’ll limit ourselves to a block of n time steps and then pick a set of 2^{nR} sequences of length n that will encode 2^{nR} messages. These sequences are called codewords and R is the rate of the code, measured in bits per time step. If you use n = 10 time steps at rate R, you get a total of 10R bits, which can stand for one of 2^{10R} messages. A coding system for this model picks a set of codewords and a decoder that will take the noisy output Y = X + Z and guess which codeword was sent. Shannon’s noisy coding theorem says that there’s a number C, called the capacity of the channel, such that for any rate R below C, the probability of making a decoding error goes to 0 as the blocklength n goes to infinity.

Of course, you have constrained resources, and in the above Gaussian problem we assume that the power (squared Euclidean norm) of the codewords is constrained to nP. If we fix the noise power as well, the capacity is neatly parameterized in terms of P and we can denote it by C(P). The paper here is about the role of feedback: if the encoder knows exactly what the decoder receives, then picking codewords “in advance” may not be a good idea, since the encoder can adapt what it sends based on “how confused” the decoder is. There’s a longstanding conjecture that feedback can’t improve the actual capacity by more than doubling the effective power, so that C_FB(P) is smaller than C(2P).

It turns out this is false, and the scheme used to show it is a modification of an old one by Schalkwijk and Kailath (Transactions on Information Theory, 1966). The intuition is that with perfect feedback, the encoder knows exactly what the decoder does, and therefore can mimic the decoder’s operation perfectly. Imagine Alice is talking to Bob, but Alice knows exactly what Bob thinks she is saying. Using this knowledge, she can tailor the next thing she says to correct Bob’s misunderstanding as best she can. Bob will still not quite get it, but will be closer, so Alice can iteratively zoom in on the precise meaning until she’s satisfied that Bob has gotten close enough.

Another way of thinking about it is helping someone hang a picture. You can see if the picture is level or not, and you issue commands: “two inches higher on the left,” “one inch on the right,” “half an inch on the left,” “a nudge more on the right,” until it’s hung straight. This is precisely the intuition behind the Schalkwijk-Kailath coding scheme. With this feedback you can get more precise than you could without, and similarly in this Gaussian communication example, you can communicate more bits reliably with feedback than without.
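Here’s a toy numerical version of that zooming-in idea. It’s my own simplification, not the actual Schalkwijk-Kailath scheme (which manages the power accounting much more carefully), but it shows the geometric shrinking of the receiver’s error under a per-transmission power limit:

```python
import random

def sk_feedback(theta, n_rounds=20, noise_std=0.1, power=1.0, seed=0):
    """Toy feedback iteration: each round, Alice (who knows Bob's current
    estimate via the noiseless feedback link) transmits Bob's error scaled
    up to the power limit; Bob rescales the received signal and corrects.
    The error shrinks by roughly a factor noise_std/power per round."""
    rng = random.Random(seed)
    estimate = 0.5                         # Bob's initial guess, known to Alice
    for _ in range(n_rounds):
        err = theta - estimate             # Alice learns this from feedback
        scale = power / max(abs(err), 1e-12)
        y = scale * err + rng.gauss(0.0, noise_std)   # received signal
        estimate += y / scale              # Bob undoes the scaling
    return estimate

theta = 0.3141592653589793
estimate = sk_feedback(theta)
```

After each round the residual error is just the rescaled noise, so it contracts geometrically; a message point in [0,1] is pinned down to many digits in a handful of uses.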

The result is not earth-shattering, since there’s another result that says feedback can buy you at most half a bit per channel use over the no-feedback capacity. But it’s a nice application of the coding scheme and a cool demonstration of the usefulness of concrete examples.
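For the Gaussian channel with unit noise power, C(P) = (1/2) log2(1 + P), and it’s easy to check that the gap between C(2P) and C(P) never exceeds half a bit, so the two bounds are consistent with each other:

```python
import math

def gaussian_capacity(P):
    """Shannon capacity, in bits per channel use, of an additive Gaussian
    noise channel with signal power P and unit noise power."""
    return 0.5 * math.log2(1.0 + P)

# the gap C(2P) - C(P) = (1/2) log2((1 + 2P) / (1 + P)) stays below half
# a bit, approaching it only as P grows large
```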

paper a day : exponential error bounds for AVCs

Exponential Error Bounds for Random Codes in the Arbitrarily Varying Channel
Thomas Ericson
IEEE Transactions on Information Theory, 31(1) : 42–48

I really wish I had read this paper fully before writing my ISIT submission, because the vocabulary and terminology are less confusing here. The title is pretty self-explanatory, as long as you know information theory. An arbitrarily varying channel (AVC) is a probabilistic/mathematical model for a communication scenario in which there is a sender, receiver, and jammer. The jammer wants to minimize the communication rate between the sender and receiver. Assuming there exists some nonzero communication rate that is achievable over all possible strategies of the jammer, Ericson wants to find how the probability of making a decoding error decays as a function of the blocklength of the code. In a heuristic sense, if we make the packets of data larger and larger, how does the error probability scale?

AVCs have a lot of technical junk around them, but one important thing is the distinction between random and deterministic codes. A deterministic code C is a fixed mapping from messages {1, 2, 3, …, N} into codewords {x1, x2, …, xN}. The jammer knows this mapping and so can tailor its strategy to be as bad as possible for this codebook. A random code is a random variable C taking values in the set of deterministic codes. If the number of values that C can take on is exp(n Rz), Ericson calls Rz the key rate of the code. That is, for a blocklength of n, we have nRz bits securely shared between the sender and receiver in order to choose a codebook without the jammer’s knowledge.

Ericson wants to find the largest attainable E such that the probability of decoding error is upper bounded by exp(-n (E – d)) for some small d. This is called the reliability function of the channel. Clearly it will depend on the code rate Rx = (1/n) log N and the key rate Rz. Let F(Rx) be the reliability function when Rz is infinite. The first result is that E(Rx, Rz) >= min[ F(Rx), Rz ].

In some cases the AVC capacity for deterministic codes is zero. This happens intuitively when the jammer can “pretend” to be the sender, so that the receiver can’t tell how to decode the message. In this case we say the channel is symmetrizable. If the channel is symmetrizable, then in fact E(Rx, Rz) = min[ F(Rx), Rz]. This basically says that the best decay in the error probability is limited by the key rate.

Interestingly enough, this is a similar result to the one that I proved in my ISIT paper, only in my case I did it for the Gaussian channel and I only have an upper bound on the error decay. Strict equality is a bit of a bear to prove.

Anyway, this paper is highly relevant to my current research and probably of interest to about 15 people in the world.

paper a day : pseudodifferential operators for wireless communications

It should really be “paper every two weeks” or something — mostly it’s laziness in the blogging half of the reading a paper a day rather than the reading part. This one was pretty heavy going though.

Pseudodifferential operators and Banach algebras in mobile communications
Thomas Strohmer
Appl. Comp. Harm. Anal., to appear.

This paper is pretty heavy on the math but does a nice job of explaining the engineering concepts as well. There are two main thrusts here : modeling wireless channels and OFDM pulse design. Multipath fading and Doppler shift in wireless communication channels can be mathematized via pseudodifferential operators, and designing OFDM pulses can be viewed as trying to make a certain matrix that represents the channel equalization sparse or concentrated on its diagonal. The tools Strohmer uses are pretty advanced (for me), but that’s because I’m not up on my harmonic analysis.

Wireless fading channels are typically modeled by a convolution with a time-varying filter. If the filter were time-invariant we could take Fourier transforms and write the channel as a multiplication by a transfer function. In the time-varying case we can make a similar representation in terms of a time-varying transfer function, which makes the channel law into a pseudodifferential operator whose Kohn-Nirenberg symbol is precisely the time-varying transfer function. Thus modeling constraints on the symbol can be translated into constraints on the operator.

The decay profile from multipath fading and the effect of Doppler shift provide constraints on the localization of the symbol in the time-frequency plane. The engineering constraints don’t give a nice characterization of the symbol per se, but we can embed the class of channels into a Banach algebra of operators with certain weight functions. We can also embed the symbol into a specific modulation space called a Sjöstrand class.

Turning to the equalization problem, OFDM pulses form a Gabor system, which is a kind of time-frequency basis for a function space. We would like to choose a basis so that recovering the data modulated on these pulses is easy. It turns out that the whole equalization operation can be written as a matrix that is related to the set of pulses, so the condition we want is for this matrix and its inverse to be nearly diagonal or sparse.

The big theorem (Theorem 4.1) in the paper essentially states that for pulses with good time-frequency localization, if the channel’s K-N symbol is invertible, then the inverse of the equalization matrix belongs to a certain algebra. This is the mathematical statement of the pulses being a “good system for communication.” This plus some more advanced relationship between different spaces gives a way of actually engineering a pulse design that can trade off the spectral efficiency versus inter-symbol and inter-channel interference.

All in all, this was a good paper to read but I don’t think I have the background to go and use these tools and techniques on anything because the math is pretty far above my head.

paper a day : games people don’t play

Games People Don’t Play
Peter Winkler
Puzzlers’ Tribute, David Wolfe and Tom Rodgers, eds., A K Peters Ltd. (2001)

This is a short note on 4 little games that people don’t really play. Some are unfair, some are violent, but the interesting question arises for each — what is the best strategy for each player? I’ll describe two of them (with variants) — the other two are equally interesting, but there’s no point in repeating everything!

In “Larger Or Smaller”, Paula writes two numbers on different slips of paper. Victor picks one and has to guess whether it’s the larger or the smaller of the two, winning $1 if he’s right. In one version the numbers are just distinct integers; in the other the numbers are drawn from a uniform distribution on [0,1] and Paula picks which one Victor sees.

Victor can do better than even (if only by a tiny amount) by adopting a clever thresholding strategy proposed by Cover. He draws a random threshold from a distribution with full support, guesses “larger” if the number he sees exceeds the threshold and “smaller” otherwise, and wins slightly more than half the time (think about what happens when the threshold lands between the two numbers). In the random version, the game is completely fair, with a simple strategy for Paula to choose which number to reveal.
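A quick simulation shows the edge. This is my own toy setup with Paula’s numbers drawn at random; a spiteful Paula could pick adjacent integers and shrink Victor’s advantage to almost nothing, but it never hits zero:

```python
import random

def victor_wins(rng):
    """One round of "Larger or Smaller" with a randomized threshold."""
    a, b = rng.sample(range(-100, 101), 2)        # Paula's two distinct integers
    seen, hidden = (a, b) if rng.random() < 0.5 else (b, a)
    t = rng.gauss(0.0, 50.0)                      # random threshold, full support
    # if t falls between the two numbers the comparison is always right;
    # otherwise Victor's guess doesn't depend on which number he saw,
    # so he wins exactly half the time
    return (seen > t) == (seen > hidden)

rng = random.Random(0)
rate = sum(victor_wins(rng) for _ in range(200_000)) / 200_000
```

The win probability is 1/2 plus half the probability that the threshold lands strictly between Paula’s numbers, which is positive no matter what she writes.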

In “Colored Hats” a dictator gives blue and red hats to a roomful of dissidents. They cannot communicate. Each can see the hats of the others, and they all simultaneously guess their own hat color. If they are wrong they are executed. How do we maximize the number of survivors? As a variant, the dissidents are lined up and guess from the back of the line to the front, so they can see all the hats ahead of them and hear the guesses from those behind them.

It turns out you can guarantee floor(n/2) survivors by having the dissidents pair up: one partner guesses the color he sees on the other, while the other guesses the opposite of the color he sees, so exactly one member of each pair is always right. In the sequential version you can save everyone except (possibly) the first guesser by having the first person guess based on the parity of the red hats in front of them. That parity gives everyone else enough information to deduce their own hat color.
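The sequential parity strategy is easy to code up and check; this is my own transcription of the argument, with red = 1 and blue = 0:

```python
def sequential_hats(hats):
    """Line of dissidents guessing from the back; hats[i] is the hat of the
    i-th person from the front (1 = red, 0 = blue).  The backmost person
    "guesses" the parity of the red hats ahead; everyone else combines that
    parity with the hats they still see and the guesses they have heard to
    deduce their own color exactly.  Returns the number of survivors."""
    n = len(hats)
    announced = sum(hats[:n - 1]) % 2                # parity of hats in front
    correct = 1 if announced == hats[n - 1] else 0   # backmost may be wrong
    heard = 0                                        # parity of guesses heard so far
    for i in range(n - 2, -1, -1):
        seen = sum(hats[:i]) % 2                     # red hats person i can see
        guess = (announced - seen - heard) % 2       # must be person i's own hat
        correct += guess == hats[i]
        heard = (heard + guess) % 2
    return correct
```

Everyone after the parity announcement is always right, so at least n − 1 survive; the first guesser survives with probability 1/2.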

This hat problem is particularly interesting because of its relation to some of the work in my Master’s thesis. So this paper is actually relevant (albeit tangentially) to my research. Plus it’s a fun and entertaining read. Recreational math and little games like this were what really got me interested in mathematics when I was younger. That and Raymond Smullyan, of course.

paper a day : the Gale-Ryser theorem

By request, today’s paper is kind of about the mathematics of paint-by-numbers puzzles. Merry Christmas, Erin.

A Simple Proof of the Gale-Ryser Theorem
Manfred Krause
The American Mathematical Monthly, Vol. 103, No. 4. (Apr., 1996), pp. 335-337.

The Gale-Ryser Theorem is one of those beautiful little gems in combinatorics. The problem is rather simple to state. Suppose you have a k by n matrix A of zeros and ones. Let r be the vector of sums along each row and c be the vector of sums along each column. The question is this : given an r and a c, when does there exist a matrix A with those row and column sums?

First off, note that r and c satisfy

\| r \|_1 = \|c\|_1

since the total number of ones has to be the same either way. We need a few more concepts to understand the result. A partition of N is a nonincreasing sequence of positive integers r1 ≥ r2 ≥ … ≥ rk whose sum is N. It turns out that it’s sufficient to consider only r and c that are partitions (hint : start swapping rows and columns).

Given two partitions r and c of N, we say r is dominated by c if

\sum_{i=1}^{m} r_i \le \sum_{i=1}^{m} c_i \qquad \forall m

where the summands become 0 if we exhaust the partition. As it turns out, dominance provides a partial order on the set of partitions that is particularly useful in combinatorics. The whole theory of majorization is based around the dominance order.

The last concept we need is a Ferrers board and the conjugate partition. Given a partition r of N, imagine a matrix of 0’s and 1’s that has rj 1’s in row j. This is the Ferrers matrix of the partition. The conjugate partition of r, denoted by r*, is the partition whose Ferrers matrix is the transpose of the Ferrers matrix for r.

The Gale-Ryser theorem says that a matrix A exists with row and column sums r and c if and only if r is dominated by c*.
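The theorem turns directly into a feasibility test: conjugate c, then compare prefix sums. A small Python sketch (my own, with r and c sorted into partitions first):

```python
def conjugate(p):
    """Conjugate partition: column sums of the Ferrers matrix of p."""
    return [sum(1 for x in p if x > j) for j in range(max(p, default=0))]

def gale_ryser(r, c):
    """True iff some 0-1 matrix has row sums r and column sums c."""
    r = sorted(r, reverse=True)
    c = sorted(c, reverse=True)
    if sum(r) != sum(c):                         # total number of ones must match
        return False
    cs = conjugate(c)
    pr = pc = 0
    for m in range(len(r)):
        pr += r[m]
        pc += cs[m] if m < len(cs) else 0        # exhausted partitions add 0
        if pr > pc:                              # dominance fails
            return False
    return True
```

For example, row sums (3,3) with column sums (2,2,1,1) are feasible, while (4,1,1) with (3,3) fails at the very first prefix, since a row of four ones doesn’t fit in two columns.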

To prove that we need the dominance relation, suppose that r is not dominated by c*. Then look for gaps in A — that is, rows in which you get a 0 and then a 1. Induction on the number of gaps proves it. If there are no gaps we have a Ferrers matrix, so we’re done. If there is a gap, you can swap two entries in a column to reduce the number of gaps.

The novelty in the paper is a new proof of the “only if” half of the statement. It’s basically an algorithm that provably constructs a matrix satisfying the row and column sums given. Clearly this matrix is not unique in general. You start with the Ferrers matrix A0 for c and do a series of swaps until the row sums equal r. Let Aj be the matrix after j swaps. The algorithm guarantees the following:

\| r(A_j) - r \|_2 < \| r(A_{j-1}) - r \|_2

where r(A) denotes the vector of row sums of A. Since this distance strictly decreases with every swap, the algorithm eventually converges to a valid matrix.

The application to paint-by-numbers should be pretty clear. A PBN is just a 0-1 matrix of pixels. If the image in the PBN is convex, then the puzzle is kind of trivial, but you can use the theorem to decide if it's solvable. An interesting extension to work out would be a theorem or fast algorithm for deciding if a general PBN is solvable or not. If I had the time to think about this, I would. Maybe it's something to do on a rainy afternoon after my deadlines have passed.

paper a day : marton’s broadcast region

This is pretty heavy on the information theory I guess…

A Coding Theorem for the Discrete Memoryless Broadcast Channel
Katalin Marton
IEEE Trans. Info. Theory, 25(3), 1979

This is a short paper that had a reputation (in my mind at least) for being hard to figure out. I think after my 3rd pass through it I kind of get it. The ingredients are: strongly typical sequences and auxiliary random variables that mimic the broadcast.

Suppose you want to send two different messages to two different people. Unfortunately, you can only broadcast, so that if you send Y then one person gets X via a communication channel F(X|Y) and the other gets Z via G(Z|Y). Given F and G, how should you design a Y so that the messages are decodable by their respective recipients, and so that the number of bits in the message is maximized? If we consider data rates, what are the maximum reliable rates of communication to the two users?

It turns out that the converse for this problem is not tight in many cases. That is, the set of rates known to be achievable is smaller than the outer bound given by the converse. This paper expands the region of achievable rates a little bit beyond what was known before 1979.

If Rx and Rz are the rates to the two users, then Marton shows that for random variables (U, V, W) which form Markov chains (U, V, W) – Y – X and (U, V, W) – Y – Z :

Rx ≤ I(W,U ; X)
Rz ≤ I(W,V ; Z)
Rx + Rz ≤ min{ I(W ; X), I(W ; Z) } + I(U ; X | W) + I(V ; Z | W) – I(U ; V | W)

The main benefit of this theorem is that before U, V, and W had to be independent, and now they can be dependent.

The interesting part of this paper is of course the code construction (plus some applications at the end to the Blackwell channel that I won’t bother with here). The sketch is as follows — generate codewords wi iid with respect to W, then U codewords iid with P(u | wi), and V codewords iid with P(v | wi). Then you look at triples of (u, v, w) that are strongly typical, and for each one choose a random Y that is strongly typical conditioned on (u, v, w). Then the decoding is via strong typicality.

The details of the proof are interesting, but what I think it’s a great example of is how the pre-encoding can mimic broadcasting (W sends to U and V), and how this mimicry can be carried over to the channels at hand and is provably achievable. Marton writes that she thinks the novelty lies in the case when W is deterministic. In any case, it was an interesting read, and has given me a few more ideas to think about with respect to my own research.