computer science publication habits are deadly

Over at Lance Fortnow's Computational Complexity Blog there is a discussion of the IEEE Symposium on Foundations of Computer Science (FOCS), which is one of the two "major" conferences in theoretical computer science. Apparently an undergrad at MIT has 3 papers in FOCS, which is somewhat unprecedented. This sparked a long argument about the nature of conference publications and how they are used to measure grad students vis-à-vis applying for faculty jobs, and so on.

What is interesting (to an outsider) is that CS theory, and CS in general, has this "conference publication is everything" culture. In almost every other academic field (from various kinds of engineering to performance studies), conferences are expensive to attend and the real results appear in journals. One exception is the Modern Language Association (MLA) conference, to which grad students go for job interviews. Personally, I think a lot of this FOCS/STOC vs. SODA pissing contest would get sidelined if journal publication became the norm rather than the exception in CS. Or am I missing something here?

paper a day (month?) : isomap

A Global Geometric Framework for Nonlinear Dimensionality Reduction
J. B. Tenenbaum, V. de Silva, J. C. Langford
Science, vol. 290, pp. 2319–2323, 22 December 2000.

I have to present this paper for a computational neuroscience reading group that I can (finally) attend. The basic problem is something called manifold learning. Imagine you have a very large data set in a huge number of dimensions — for example, 64 x 64 pixel images of faces, which live in a 4096-dimensional space. Furthermore, suppose that all of the pictures are of the same person, with only two parameters changing — the angle of rotation of the face, and the illumination. The data has only two degrees of freedom, so you would think it would live on a 2-dimensional subspace.

Unfortunately, the data you have doesn't occupy a linear subspace in the observed variables. Instead, it lives on a manifold, which is like a surface in your high-dimensional space that may be strangely curved or twisted, and so may be very poorly approximated by a linear subspace. However, the manifold does have its own coordinate system, and you can calculate distances between points on the manifold. The shortest path between two points is called a geodesic.

Another way to visualize this is a ribbon that is curled into a spiral. A point A on the ribbon might look close to a point B on an outer ring, but if you unwrapped the ribbon they would be far apart. So one thing you might try to do is figure out the real distance (along the surface of the ribbon) between A and B. You might be able to do that by hopping from data point to data point in short hops that follow the contour of the ribbon.

That is exactly what the Isomap algorithm, described in this paper, does to perform its dimensionality reduction. Given a set of points {x_k : k = 1, 2, …, K} in the n-dimensional space X, they first make a graph G with vertex set {x_k} by putting an edge between two points if the distance between them in X is less than some threshold, with each edge weighted by that distance. Then they find a K x K matrix D whose (i,j)-th entry is the weight of the minimum-weight (shortest-distance) path in the graph G between x_i and x_j. Assuming the threshold is not too large and there are a lot of data points, these path lengths should closely approximate the true (geodesic) distances on the manifold.

Armed with this matrix of distances between points, we can try to embed the manifold into a d-dimensional Euclidean space without distorting the distances too much. This is an easy problem to visualize — just imagine taking a globe, fixing a few cities, and trying to make a flat map in which the distances between those cities are preserved. There is an algorithm called multidimensional scaling (MDS) that can do this. You can trade off the embedding distortion against the number of dimensions.
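To make the pipeline concrete, here is a minimal Python sketch of the three steps (neighborhood graph, shortest paths, classical MDS). This is my own toy reconstruction with made-up parameters, not the authors' implementation, and it assumes the neighborhood graph comes out connected.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import pdist, squareform

def isomap_sketch(X, eps, d=2):
    """Toy Isomap: X is a (K, n) array of K points in n dimensions,
    eps is the neighborhood radius, d is the target dimension."""
    # Step 1: neighborhood graph -- keep an edge (weighted by Euclidean
    # distance) only between points closer than eps.
    D_X = squareform(pdist(X))              # K x K ambient-space distances
    W = np.where(D_X < eps, D_X, np.inf)    # np.inf marks "no edge"

    # Step 2: approximate geodesics by shortest paths in the graph.
    D_G = shortest_path(W, method="D")      # Dijkstra; assumes connectivity

    # Step 3: classical MDS on the geodesic distance matrix.
    K = X.shape[0]
    J = np.eye(K) - np.ones((K, K)) / K     # centering matrix
    B = -0.5 * J @ (D_G ** 2) @ J           # double-centered squared distances
    evals, evecs = np.linalg.eigh(B)
    top = np.argsort(evals)[::-1][:d]       # top-d eigenpairs
    return evecs[:, top] * np.sqrt(evals[top])

# Toy usage: a narrow 2-D "ribbon" rolled up in 3-D, like the spiral above.
t = np.linspace(0, 3 * np.pi, 400)
X = np.column_stack([t * np.cos(t), np.random.rand(400), t * np.sin(t)])
embedding = isomap_sketch(X, eps=2.0)       # should roughly unroll the ribbon
```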

This paper comes with its own homepage, which has some data sets and MATLAB code. If only all practitioners were so generous — too often the algorithm implementation is kept under wraps, which makes me wonder if there are some dirty secrets hiding behind the pretty plots.

One thing that annoyed me about reading this paper is that all of the technical details (which I care about) are hidden in tiny-print footnotes. Furthermore, the citations don't include paper titles, so you can't tell what the cited papers are actually about. I know that page space is precious, but it's just plain stupid. Shame on you, Science. I expected better.

As an amusing postscript, the commentary on this paper and the locally linear embedding paper (Roweis and Saul), written by Seung and Lee, has pictures of Bush and Gore in the print edition, but due to copyright issues the online version had to be changed.

paper a day : a counterexample for Gaussian feedback capacity

This is kind of a cop-out, since it's a two-page "note" rather than a real paper, but since I'm so out of practice at reading and synthesizing papers I think it'll be a baby step back into the swing of things. This paper is another information-theory one, but it uses a more general strategy that is easily explained to a general audience.

A Counterexample to Cover’s 2P Conjecture on Gaussian Feedback Capacity
Young-Han Kim
arXiv: cs.IT/0511019

Imagine you have a communication channel that can be modeled as

Y = X + Z,

where X is the input, Y is the output, and Z is a stationary Gaussian process (not necessarily white noise, but possibly shaped with some spectrum). The approach information theory takes is to say that we'll limit ourselves to a block of n time steps and then pick a set of 2^{nR} sequences of length n that will encode 2^{nR} messages. These sequences are called codewords and R is the rate of the code, measured in bits per time step. If you use n = 10 time steps at rate R, you get a total of 10R bits, which can stand for one of 2^{10R} messages. A coding system for this model picks a set of codewords and a decoder that will take the noisy output Y = X + Z and guess which codeword was sent. Shannon's noisy coding theorem says that there is a number C, called the capacity of the channel, such that for any rate R below C, the probability of making a decoding error goes to 0 as the blocklength n goes to infinity.
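In symbols (just restating the paragraph above in my own notation), a blocklength-n code of rate R is an encoder/decoder pair, and the coding theorem says reliable communication is possible whenever R is below the capacity C:

```latex
% A blocklength-n, rate-R code: 2^{nR} codewords indexed by the messages
\mathrm{enc} : \{1, \dots, 2^{nR}\} \to \mathcal{X}^n, \qquad
\mathrm{dec} : \mathcal{Y}^n \to \{1, \dots, 2^{nR}\}

% Shannon's noisy coding theorem: for any rate below capacity, errors vanish
R < C \quad \Longrightarrow \quad \Pr\big[\mathrm{dec}(Y^n) \neq M\big] \to 0
\quad \text{as } n \to \infty
```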

Of course, you have constrained resources, and in the above Gaussian problem we assume that the power (squared Euclidean norm) of the codewords is constrained to nP. If we fix the noise power as well, the capacity is neatly parameterized in terms of P and we can denote it by C(P). The paper here is about the role of feedback — if the encoder knows exactly what the decoder receives, then picking codewords "in advance" may not be a good idea, since the encoder can adapt what it sends based on "how confused" the decoder is. There is a longstanding conjecture, due to Cover, that feedback can't improve the capacity by more than a doubling of the effective power, so that C_FB(P) is no bigger than C(2P).
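For reference, these are standard formulas rather than anything taken from the note itself: in the special case of white noise with variance σ² the non-feedback capacity has the familiar closed form below (for colored noise one has to water-fill over the noise spectrum), and Cover's 2P conjecture is the second line.

```latex
% AWGN capacity under power constraint P and noise variance \sigma^2 (white-noise case)
C(P) = \tfrac{1}{2}\log\!\left(1 + \frac{P}{\sigma^2}\right)

% Cover's 2P conjecture: feedback helps at most as much as doubling the power
C_{\mathrm{FB}}(P) \le C(2P)
```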

It turns out this is false, and the scheme used to show it is a modification of an old one by Schalkwijk and Kailath (IEEE Transactions on Information Theory, 1966). The intuition is that with perfect feedback, the encoder knows exactly what the decoder does, and therefore can mimic the decoder's operation perfectly. Imagine Alice is talking to Bob, but Alice knows exactly what Bob thinks she is saying. Using this knowledge, she can tailor the next thing she says to correct Bob's misunderstanding as best she can. Bob will still not quite get it, but he will be closer, so Alice can iteratively zoom in on the precise meaning until she's satisfied that Bob has gotten close enough.

Another way of thinking about it is helping someone hang a picture. You can see if the picture is level or not, and you issue commands: “two inches higher on the left,” “one inch on the right,” “half an inch on the left,” “a nudge more on the right,” until it’s hung straight. This is precisely the intuition behind the Schalkwijk-Kailath coding scheme. With this feedback you can get more precise than you could without, and similarly in this Gaussian communication example, you can communicate more bits reliably with feedback than without.
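Here is a toy numerical sketch of that zooming-in idea, written in Python. It is my own simplified illustration, not the actual Schalkwijk–Kailath construction from the paper: the transmitter learns the receiver's current estimation error through the noiseless feedback and spends each channel use sending a scaled version of that error, and the error variance shrinks by a factor of 1/(1 + P/σ²) per use.

```python
import numpy as np

rng = np.random.default_rng(0)
P, sigma2, n_rounds = 1.0, 0.1, 20     # power constraint, noise variance, channel uses
theta = 0.42                           # the "message": a real number the receiver must learn

# Round 1: send a scaled version of the message; after normalization the
# receiver's estimation error has variance sigma2 / P.
var = sigma2 / P
est = theta + rng.normal(0.0, np.sqrt(var))

for _ in range(n_rounds - 1):
    err = est - theta                        # known to the transmitter via noiseless feedback
    x = np.sqrt(P / var) * err               # scale the error to meet the power constraint
    y = x + rng.normal(0.0, np.sqrt(sigma2)) # forward channel adds Gaussian noise
    gain = np.sqrt(P * var) / (P + sigma2)   # MMSE estimate of err given y
    est -= gain * y                          # receiver corrects its estimate
    var *= sigma2 / (P + sigma2)             # error variance shrinks by 1/(1 + P/sigma2)

print(f"true theta = {theta}, final estimate = {est:.6f}, error std = {np.sqrt(var):.2e}")
```

In the real scheme the message is one of 2^{nR} equally spaced points, and this geometric shrinkage of the estimation error is what famously gives the Schalkwijk–Kailath scheme its doubly exponential decay of the error probability with blocklength.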

The result is not earth-shattering, since there is another result that says feedback can buy you at most half a bit per unit time over the non-feedback capacity. But it's a nice application of the coding scheme and a cool demonstration of the usefulness of concrete examples.

we’re jammin’

I’ve been meaning to write some notes towards a contextualization of the research I’ve been doing lately, as part of an effort to find a better angle on things and also spur myself to actually tackle some harder/more interesting problems than I’ve looked at so far. At the moment I’m looking at communication scenarios in which there is some unknown interference. The crux of the matter is that the communication strategy must be designed knowing that the interference is there, but not what it is like. There is a disconnect between two different strands of research on this subject (so much so that the papers in one don’t cite the other as relevant). In both the interference is viewed as coming from a jammer who wants to screw over your communication.

In the world of arbitrarily varying channels (Blackwell, Breiman and Thomasian; Ahlswede; Csiszar and Narayan; Hughes and Narayan; etc.), mostly populated by mathematicians, the interference is treated as malicious but not listening in. Perhaps a better way to view it is that you want to provide guarantees even if the interference behaves in the worst way possible. This worst-case analysis is often no different from the case when you have a statistical description of the interference — the average and worst-case scenarios are quite similar. However, in some situations the worst case is significantly worse than the average case. Here, if a secret key is shared between the transmitter and receiver, we can recover the average-case behavior, although the reliability of the communication may be worse.

On the other side, we have wire-tap situations in which the jammer not only knows your encoding scheme, but can also try to figure out what you're sending and actively cancel it out. Most of these analyses initially took the form of game-theoretic set-ups in which one player tries to maximize a function (the capacity) and the other tries to minimize it. The space of strategies for each player takes the form of choosing probability distributions. In this highly structured framework you can prove minimax theorems about the jammer and encoder strategies. Extensions of this view take into account channel fading for wireless, or interference for multiple access (Shafiee and Ulukus).

But never the twain shall meet, as Kipling would say, and it’s hard to sort out the connections between interference, jamming, statistical knowledge, and robustness. Part of what I have to do is sort out what is what and see if I can make any more statements stitching the two together and finding real scenarios in which the unreasonableness of the models can be made more reasonable.

paper a day : exponential error bounds for AVCs

Exponential Error Bounds for Random Codes in the Arbitrarily Varying Channel
Thomas Ericson
IEEE Transactions on Information Theory, 31(1): 42–48, 1985.

I really wish I had read this paper fully before writing my ISIT submission, because the vocabulary and terminology are less confusing here. The title is pretty self-explanatory, as long as you know information theory. An arbitrarily varying channel (AVC) is a probabilistic/mathematical model for a communication scenario in which there is a sender, a receiver, and a jammer. The jammer wants to minimize the communication rate between the sender and receiver. Assuming there exists some nonzero communication rate that is achievable over all possible strategies of the jammer, Ericson wants to find how the probability of making a decoding error decays as a function of the blocklength of the code. In a heuristic sense: if we make the packets of data larger and larger, how does the error probability scale?

AVCs have a lot of technical junk around them, but one important thing is the distinction between random and deterministic codes. A deterministic code C is a fixed mapping from messages {1, 2, 3, …, N} into codewords {x_1, x_2, …, x_N}. The jammer knows this mapping and so can tailor its strategy to be as bad as possible for this codebook. A random code is a random variable C taking values in the set of deterministic codes. If the number of values that C can take on is exp(nR_z), Ericson calls R_z the key rate of the code. That is, for a blocklength of n, we have nR_z bits securely shared between the sender and receiver in order to choose a codebook without the jammer's knowledge.

Ericson wants to find the largest attainable E such that the probability of decoding error is upper bounded by exp(-n(E - δ)) for some small δ. This is called the reliability function of the channel. Clearly it will depend on the code rate R_x = n^{-1} log N and the key rate R_z. Let F(R_x) be the reliability function when R_z is infinite. The first result is that E(R_x, R_z) ≥ min[F(R_x), R_z].

In some cases the AVC capacity for deterministic codes is zero. This happens intuitively when the jammer can "pretend" to be the sender, so that the receiver can't tell how to decode the message. In this case we say the channel is symmetrizable. If the channel is symmetrizable, then in fact E(R_x, R_z) = min[F(R_x), R_z]. This basically says that the best decay in the error probability is limited by the key rate.
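Written out in display form (just restating the two results above, with R_x the code rate, R_z the key rate, and δ > 0 small):

```latex
% Error probability decays exponentially with exponent E(R_x, R_z):
P_e \le \exp\{-n\,(E(R_x, R_z) - \delta)\}

% In general (F(R_x) is the exponent available with unlimited key rate):
E(R_x, R_z) \ge \min\{F(R_x),\, R_z\}

% ...with equality when the channel is symmetrizable:
E(R_x, R_z) = \min\{F(R_x),\, R_z\}
```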

Interestingly enough, this is a similar result to the one that I proved in my ISIT paper, only in my case I did it for the Gaussian channel and I only have an upper bound on the error decay. Strict equality is a bit of a bear to prove.

Anyway, this paper is highly relevant to my current research and probably of interest to about 15 people in the world.

paper a day : pseudodifferential operators for wireless communications

It should really be "paper every two weeks" or something — mostly the laziness is in the blogging half of reading a paper a day rather than the reading half. This one was pretty heavy going, though.

Pseudodifferential operators and Banach algebras in mobile communications
Thomas Strohmer
Applied and Computational Harmonic Analysis, to appear.

This paper is pretty heavy on the math but does a nice job of explaining the engineering concepts as well. There are two main thrusts here: modeling wireless channels and designing OFDM pulses. Multipath fading and Doppler shift in wireless communication channels can be mathematized via pseudodifferential operators, and designing OFDM pulses can be viewed as trying to make a certain matrix that represents the channel equalization sparse or concentrated on its diagonal. The tools Strohmer uses are pretty advanced (for me), but that's because I'm not up on my harmonic analysis.

Wireless fading channels are typically modeled by a convolution with a time-varying filter. If the filter were time-invariant we could take Fourier transforms and write the channel as a multiplication by a transfer function. In the time-varying case we can make a similar representation in terms of a time-varying transfer function, which makes the channel law into a pseudodifferential operator whose Kohn-Nirenberg symbol is precisely the time-varying transfer function. Thus modeling constraints on the symbol can be translated into constraints on the operator.
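Concretely, the representation has the standard Kohn–Nirenberg form (my transcription of a textbook formula, not a quote from the paper), where σ is the time-varying transfer function and x̂ is the Fourier transform of the input:

```latex
% Time-varying channel H as a pseudodifferential operator with symbol \sigma:
(Hx)(t) = \int \sigma(t, \nu)\, \hat{x}(\nu)\, e^{2\pi i \nu t}\, d\nu

% Time-invariant special case: \sigma(t, \nu) = \hat{h}(\nu), i.e., ordinary
% multiplication by a transfer function in the frequency domain.
```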

The decay profile from multipath fading and the effect of Doppler shift provide constraints on the localization of the symbol in the time-frequency plane. The engineering constraints don't give a nice characterization of the symbol per se, but we can embed the class of channels into a Banach algebra of operators with certain weight functions. We can also embed the symbol into a specific modulation space called a Sjöstrand class.

Turning to the equalization problem, OFDM pulses form a Gabor system, which is a kind of time-frequency basis for a function space. We would like to choose a basis so that recovering the data modulated on these pulses is easy. It turns out that the whole equalization operation can be written as a matrix that is related to the set of pulses, so the condition we want is for this matrix and its inverse to be nearly diagonal or sparse.
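For the record, a Gabor system is just the family of time-frequency shifts of a single pulse g (standard definition; the time spacing T and frequency spacing F are notation I'm introducing for illustration), and OFDM puts one data symbol on each shift:

```latex
% Gabor system generated by the pulse g:
g_{k,l}(t) = g(t - kT)\, e^{2\pi i l F t}, \qquad k, l \in \mathbb{Z}

% Transmitted OFDM signal: data symbols c_{k,l} modulated onto the pulses
s(t) = \sum_{k,l} c_{k,l}\, g_{k,l}(t)
```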

The big theorem (Theorem 4.1) in the paper essentially states that for pulses with good time-frequency localization, if the channel's K-N symbol is invertible, then the inverse of the equalization matrix belongs to a certain algebra. This is the mathematical statement of the pulses being a "good system for communication." This, plus some more advanced relationships between different spaces, gives a way of actually engineering a pulse design that can trade off spectral efficiency against inter-symbol and inter-channel interference.

All in all, this was a good paper to read but I don’t think I have the background to go and use these tools and techniques on anything because the math is pretty far above my head.

am I a peer?

I’ve actually been invited to review a paper, which is a first for me. I guess all that stuff on peer review I’ve been reading during my procrastinating moments will actually come in use now. They’ve asked me to turn in my review in two months, which is good for me, since I have too many looming deadlines. I guess this is why publishing takes so long — reviewers take two months, then editorial checking and a decision, revisions, another round of reviewing, more revisions, and final submission. It’s no wonder that in computer science they put all their energy into conferences — it keeps the pace of research higher.

paper a day : marton’s broadcast region

This is pretty heavy on the information theory I guess…

A Coding Theorem for the Discrete Memoryless Broadcast Channel
Katalin Marton
IEEE Transactions on Information Theory, 25(3), 1979

This is a short paper that had a reputation (in my mind at least) for being hard to figure out. I think after my 3rd pass through it I kind of get it. The ingredients are: strongly typical sequences and auxiliary random variables that mimic the broadcast.

Suppose you want to send two different messages to two different people. Unfortunately, you can only broadcast, so that if you send Y then one person gets X via a communication channel F(X|Y) and the other gets Z via G(Z|Y). Given F and G, how should you design a Y so that the messages are decodable by their respective recipients, and so that the number of bits in the message is maximized? If we consider data rates, what are the maximum reliable rates of communication to the two users?

It turns out that the converse for this problem is not tight in many cases. That is, the set of data rates known to be achievable is smaller than the set that the converse allows. This paper expands the region of achievable rates a bit beyond what was known before 1979.

If R_x and R_z are the rates to the two users, then Marton shows that the following rates are achievable for any random variables (U, V, W) which form the Markov chains (U, V, W) – Y – X and (U, V, W) – Y – Z:

R_x ≤ I(W,U ; X)
R_z ≤ I(W,V ; Z)
R_x + R_z ≤ min{ I(W ; X), I(W ; Z) } + I(U ; X | W) + I(V ; Z | W) – I(U ; V | W)

The main benefit of this theorem is that U, V, and W previously had to be independent, and now they can be dependent.

The interesting part of this paper is of course the code construction (plus some applications at the end to the Blackwell channel, which I won't bother with here). The sketch is as follows — generate codewords w_i i.i.d. according to W, then U-codewords i.i.d. according to P(u | w_i), and V-codewords i.i.d. according to P(v | w_i). Then you look at triples (u, v, w) that are strongly typical, and for each one choose a random Y that is strongly typical conditioned on (u, v, w). The decoding is then via strong typicality.
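In symbols, the construction goes roughly like this (my paraphrase of the sketch above, with i.i.d. draws over the n coordinates):

```latex
% Generate W-codewords i.i.d., then U- and V-codewords conditioned on each w^n(i):
w^n(i) \sim \prod_{t=1}^{n} P_W(w_t), \qquad
u^n(j) \sim \prod_{t=1}^{n} P_{U|W}\big(u_t \mid w_t(i)\big), \qquad
v^n(k) \sim \prod_{t=1}^{n} P_{V|W}\big(v_t \mid w_t(i)\big)

% For each strongly typical triple (u^n, v^n, w^n), pick an input sequence y^n
% that is strongly typical conditioned on it; decoding is by strong typicality.
```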

The details of the proof are interesting, but what I think this is a great example of is how the pre-encoding can mimic broadcasting (W sends to U and V), and how that mimicry can be carried over to the channels at hand and be provably achievable. Marton writes that she thinks the novelty lies in the case when W is deterministic. In any case, it was an interesting read, and has given me a few more ideas to think about with respect to my own research.