Shannon theory helps decipher Pictish?

Well, if not decipher, at least claim that there is something to read. A recent paper claims that Pictish inscriptions are a form of written language:

Lo and behold, the Shannon entropy of Pictish inscriptions turned out to be what one would expect from a written language, and not from other symbolic representations such as heraldry.

The full paper has more details. From reading the popular account I thought it was just a simple hypothesis test using the empirical entropy as a test statistic and “heraldry” as the null hypothesis, but it is a little more complicated than that.

After identifying the set of symbols in Pictish inscriptions, the question is how related adjacent symbols are to each other. That is, can the symbols be read sequentially? What they do is renormalize Shannon’s F_2 statistic (from the paper “Prediction and entropy of printed English”), which is essentially the empirical conditional entropy of the current symbol conditioned on the past symbols. They compute:

U_r = F_2 / \log\left( \frac{N_d}{N_u} \right)

where N_d and N_u are the number of di-grams and un-grams, respectively. Why normalize? The statistic F_2 by itself does not discriminate well between semasiographic (symbolic systems like heraldry) and lexigraphic (e.g. alphabets or syllabaries) systems.

Another feature which the authors think is important is the number of digrams which are repeated in the text. If S_d is the number of digrams appearing once and T_d is the total number of digrams, they use a “di-gram repetition factor”

C_r = \frac{N_d}{N_u} + a \cdot \frac{S_d}{T_d}

where the tradeoff factor a is chosen via cross-validation on known corpora.

They then propose a two-step decision process. First they compare C_r to a threshold — if it is small then they deem the system to be more “heraldic”. If C_r is large then then do a three-way decision based on U_r. If U_r is small then the text corresponds to letters, if larger, syllables, and larger still, words.

In this paper “entropy” is being used here as some statistic with discriminatory value. It is not clear a priori that human writing systems should display empirical entropies with certain values, but since it works well on other known corpora, it seems like reasonable evidence. I think the authors are relatively careful about this, which is nice, since popular news might make one think that purported alien transmissions could easily fall to a similar analysis. Maybe that’s how Jeff Goldblum mnanaged to get his Mac to reprogram the alien ship in Independence Day

Update: I forgot to link to a few related things. The statistics in this paper are a little more convincing than the work on the Indus script (see Cosma’s lengthy analysis. In particular, they do a little better job of justifying their statistic as discriminating in known corpora. Pictish would seem to be woefully undersampled, so it is important to justify the statistic as discriminatory for small data sets.

The “parity check” on credit card numbers

Via Lifehacker, I came across this short description of how credit card numbers are coded, and how the last digit is a parity check. It’s a cute example of “real world” error-detection that pretty much anyone could understand. Cute extra-credit problem: how many valid credit card numbers are there out there? (This reminds me of a USAMTS problem from ages past).

ITA Workshop Aftermath

The 2010 ITA Workshop is over, and now that I have almost caught up on sleep, I can give a short report. I think this year was the largest workshop so far, with over 300 speakers.

One of the additions this year was a poster session to give those students who didn’t get a chance to speak at the “Graduation Day” talks on Wednesday an opportunity to present their research. The posters went up on Monday and many stayed up most of the week. I am usually dubious of poster sessions for more theoretical topics; from past experience I have had a hard time conveying anything useful in the poster and they are poorly attended. However, this poster session seemed to be a rousing success, and I hope they continue to do it in future years.

The theme this year was “networks,” broadly construed, and although I didn’t end up going to a lot of the “pure” networking sessions, the variety of topics was nice to see, from learning in networks to network models. Jon Kleinberg gave a great plenary lecture on cascading phenomena. I particularly enjoyed Mor Harchol-Balter’s tutorial on job scheduling in server farms. I learned to re-adjust my intuition for what good scheduling policies might be. The tutorials should be online sometime and I’ll post links when that happens.

The “Senseless Decompression” session organized by Rudi Urbanke and Ubli Mitra should also be online sometime. Apparently 20+ schools contributed short videos on the theme of \frac{1}{2} \log(1 + \mathrm{SNR}). Maybe we can make a YouTube channel for them.

Perhaps later I’ll touch on specific talks that I went to, but this time around I didn’t take too many notes, probably because I was a little exhausted from the organizing. I may post a more technical thing on the talk I gave about my recent work on privacy-preserving machine learning, but that will have to wait a bit in the queue. Mor’s tutorial suggests I should use Shortest Remaining Processing Time (SRPT) to make sure things don’t wait too long, and I have some low-hanging tasks that I can dispatch first.

Another take on matched sources and channels

Ehsan Ardestanizadeh posed a little problem in Young-Han Kim‘s group meeting before break and Paolo Minero presented a cute solution which I thought might be of interest. It’s related to ideas my advisor, Michael Gastpar worked on in his thesis.

Suppose I have a Gaussian variable S that is \mathcal{N}(0,1), so zero-mean and unit variance. If I observe Y = S + Z where Z is also \mathcal{N}(0,1) and independent of S, the minimum mean-square error (MMSE) estimate of S given Y is just

\mathsf{MMSE}( S | Y) = \frac{1}{2} Y

and the MSE is \mathbb{E}[ (S - \hat{S})^2 ] = 1/2. The question is this: can we somehow get an MSE of less than 1/2 by “encoding” S? Suppose now we can take any function f and let X = f(S), and Y = X + Z but with the restriction that \mathsf{Var}(X) \le 1. That is, the encoding cannot take any more power.

The answer is no, and comes via an “information theoretic” argument. Consider the vector version of the problem where you have n iid unit variance Gaussians S^n and you want to estimate S^n from Y^n = f(S^n) + Z^n where Z^n is iid unit variance Gaussian as well. The goal is to minimize the average per-letter distortion:

d(\hat{S}^n, S^n) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}[ (S_i - \hat{S}_i)^2 ]

This is just the problem of joint source-channel coding of a Gaussian source over an AWGN channel with quadratic distortion and encoding function f must satisfy a unit power constraint. For this problem the rate distortion function is

R(D) = \frac{1}{2} \log \frac{1}{D}

and the capacity of the channel is \frac{1}{2} \log(1 + 1) = \frac{1}{2}. Since separate source coding (compression) followed by channel coding (error control) is optimal, in order to get distortion D the rate R(D) \le 1/2 so D \ge 1/2. Furthermore, this is achievable with no coding at all by just setting f(S^n) = S^n.

Now if there was a scheme for the single-letter case which got MSE less than 1/2, we could concatenate it to get a vector scheme with distortion less than 1/2. But since D \ge 1/2 in the optimal code, we get a contradiction. Thus encoding does not help in the single-letter case either. If S isn’t Gaussian the whole story changes though.

paper a (long time period) : Assouad, Fano, and Le Cam

Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. People who read this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.

Suppose we have a parametric family of distributions \mathcal{P} = \{P_{\theta} : \theta \in \mathcal{D}\}, where \mathcal{D} is a metric space with metric d(\cdot,\cdot). For two distributions P_1 and P_2 define the affinity \|P_1 \wedge P_2 \| by:

\|P_1 \wedge P_2 \| = 1 - \frac{1}{2} \| P_1 - P_2 \|_1

Let \mathop{\rm co}(\cdot) denote the convex hull. Then Le Cam’s lemmas is the following.

Le Cam’s Lemma. Let \hat{\theta} be an estimator of \theta(P) on \mathcal{P}. Suppose D_1 and D_2 be two sets such that d(s_1,s_2) \ge 2 \delta for all (s_1,s_2) \in D_1 \times D_2, and \mathcal{P}_1 and \mathcal{P}_2 be two subsets of \mathcal{P} such that \theta(P) \in D_i when P \in \mathcal{P}_i. Then

\sup_{P \in \mathcal{P}} \mathbb{E}_P[ d(\hat{\theta}, \theta(P)) ] \ge \delta \cdot \sup_{P_i \in \mathop{\rm co}(\mathcal{P}_i)} \| P_1 \wedge P_2 \|

This lemma gives a lower bound on the error of parameter estimates in terms of the total variational distance between the distributions associated to different parameter sets. It’s a bit different than the bounds we usually think of like Stein’s Lemma, and also a bit different than bounds like the Cramer-Rao bound.

Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by the H = \{-1, 1\}^m, the vertices of the hypercube. We’ll write t \sim_j t' for t,t' \in H if they differ in the j-th coordinate.

Assouad’s Lemma. Let \mathcal{F}_m = \{P_{t} : t \in H\} be a set of 2^m probability measures indexed by H, and suppose there are m pseudo-distances d_m on \mathcal{D} such that for any pair (x,y)

d(x,y) = \sum_{j}^m d_j(x,y)

and that if t \sim_j t'

d_j( \theta(P_t), \theta(P_{t'}) ) \ge \alpha_m

Then

\max_{P_t \in \mathcal{F}_m} \mathbb{E}_{t}[ d(\hat{\theta},\theta(P_t))] \ge m \cdot \frac{\alpha_m}{2} \min\{ \|P_t \wedge P_{t'} \| : t \sim_j t', j \le m\}

The min comes about because it is the weakest over all neighbors (that is, over all j) of P_t in the hypercube. Assouad’s Lemma has been used in various different places, from covariance estimation, learning, and other minimax problems.

Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.

Fano’s Lemma. Let \mathcal{M}_r \subset \mathcal{P} contain r probability measures such that for all j \ne j' with j,j' \le r

d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r

and

D(P_j~\|~P_{j'}) \le \beta_r

Then

\max_j \mathbb{E}_j[ d(\hat{\theta},\theta(P_j)) ] \ge \frac{\alpha_r}{2}\left( 1 - \frac{\beta_r + \log 2}{\log r} \right)

Here D(\cdot\|\cdot) is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message W uniformly in \{1, 2, \ldots, r\} and then setting the output Y to have the distribution P_j conditioned on W = j.

The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.

Allerton 2009 : quick takes

I was at the Allerton conference last week for a quick visit home — I presented a paper I wrote with Can Aysal and Alex Dimakis on analyzing broadcast-based consensus algorithms when channels vary over time. The basic intuition is that if you can get some long-range connections “enough of the time,” then you’ll get a noticeable speed up in reaching a consensus value, but with high connectivity the variance of the consensus value from the true average increases. It was a fun chance to try out some new figures that I made in OmniGraffle.

The plenary was given by Abbas El Gamal on a set of new lecture notes on network information theory that he has been developing with Young-Han Kim. He went through the organization of the material and tried to show how much of the treatment could be developed using robust typicality (see Orlitsky-Roche) and a few basic packing and covering lemmas. This in turn simplifies many of the proofs structurally and can give a unified view. Since I’ve gotten an inside peek at the notes, I can say they’re pretty good and clear, but definitely dense going at times. They are one way to answer the question of “well, what do we teach after Cover and Thomas.” They’re self-contained and compact .

What struck me is that these are really notes on network information theory and not on advanced information theory, which is the special topics class I took at Berkeley. It might be nice to have some of the “advanced but non-network” material in a little set of notes. Maybe the format of Körner’s Fourier Analysis book would work : it could cover things like strong converses, delay issues, universality, wringing lemmas, AVCs, and so on. Almost like a wiki of little technical details and so on that could serve as a reference, rather than a class. The market’s pretty small though…

I felt a bit zombified during much of the conference, so I didn’t take particularly detailed notes, but here were a few talks that I found interesting.

  • Implementing Utility-Optimal CSMA (Lee, Jinsung (KAIST), Lee, Junhee (KAIST), Yi, Yung (KAIST), Chong, Song (KAIST), Proutiere, Alexandre (Microsoft Research), Chiang, Mung (Princeton University)) — There are lots of different models and algorithms for distributed scheduling in interference-limited networks. Many of these protocols involve message passing and the overhead from the messages may become heavy. This paper looked at how to use the implicit feedback from CSMA. They analyze a simple two-link system with two parameters (access and hold) and then use what I though was a “effective queue size” scheduling method. Some of the analysis was pretty tricky, using stochastic approximation tools. There were also extensive simulation results from a real deployment.
  • LP Decoding Meets LP Decoding: A Connection between Channel Coding and Compressed Sensing (Dimakis, Alexandros G. (USC), Vontobel, Pascal O. (Hewlett-Packard Laboratories)) — This paper proved some connections between Linear Programming (LP) decoding of LDPC codes and goodness of compressed sensing matrices. A simple example : if a parity check matrix is good for LP decoding then it is also good for compressed sensing. In particular, if it can correct k errors it can detect k-sparse signals, and if it can correct a specific error pattern \vec{e} it can detect all signals whose support is the same as \vec{e}. There were many other extensions of this results as well, to other performance metrics, and also the Gaussian setting.
  • The Compound Capacity of Polar Codes (Hassani, Hamed (EPFL), Korada, Satish Babu (EPFL), Urbanke, Ruediger (EPFL)) — Rudi gave an overview of polar codes from two different perspectives and then showed that in general polar codes do not achieve the compound channel capacity, but it’s not clear if the problem is with the code or with the decoding algorithm (so that’s still an open question).
  • Source and Channel Simulation Using Arbitrary Randomness (Altug, Yucel (Cornell University), Wagner, Aaron (Cornell University)) — The basic source simulation problem is to simulate an arbitrary source Y_1^n with an iid process (say equiprobable coin flips) X_1^n. Using the information spectrum method, non-matching necessary and sufficient conditions can be found for when this can be done. These are in terms of sup-entropy rates and so on. Essentially though, it’s a theory built on the limits or support of the spectra of the two processes X and Y. If the entropy spectrum in X dominates that of Y then you’re in gravy. This paper proposed a more refined notion of dominance which looks a bit like majorization of some sort. I think they showed that was sufficient, but maybe it’s also necessary too. Things got a bit rushed towards the end.
  • Channels That Die (Varshney, Lav R. (MIT), Mitter, Sanjoy K. (MIT), Goyal, Vivek K (MIT)) — This paper looked at a channel which has two states, alive (a) and dead (d), and at some random time during transmission, the channel will switch from alive to dead. For example, it might switch according to a Markov chain. Once dead, it stays dead. If you want to transmit over this channel in a Shannon-reliable sense you’re out of luck, but what you can do is try to maximize the expected number of bits you get through before the channel dies. I think they call this the “volume.” So you try to send a few bits at a time (say b_1, b_2, \ldots b_k are the number of bits in each chunk). How do you size the chunks to maximize the volume? This can be solved with dynamic programming. There are still several open questions left, but the whole construction relies on using “optimal codes of finite blocklength” which also needs to be solved. Lav has some interesting ideas along these lines that he told me about when I was in Boston two weeks ago…
  • The Feedback Capacity Region of El Gamal-Costa Deterministic Interference Channel (Suh, Changho (UC Berkeley), Tse, David (UC Berkeley)) — This paper found a single letter expression for the capacity region, which looks like the Han-Kobayashi achievable region minus the two weird 2 R_i + R_j constraints. Achievability uses Han-Kobayashi in a block-Markov encoding with backward decoding, and the converse uses the El Gamal-Costa argument with some additional Markov properties. For the “bit-level” deterministic channel model, there is an interpretation of the feedback as filling a “hole” in the interference graph.
  • Upper Bounds to Error Probability with Feedback (Nakiboglu, Baris (MIT), Zheng, Lizhong (MIT)) — This is a follow-on to their ISIT paper, which uses a kind of “tilted posterior matching” (a phrase which means a lot to people who really know feedback and error exponents in and out) in the feedback scheme. Essentially there’s a tradeoff between getting close to decoding quickly and minimizing the error probability, and so if you change the tilting parameter in Baris’ scheme you can do a little better. He analyzes a two phase scheme (more phases would seem to be onerous).
  • Infinite-Message Distributed Source Coding for Two-Terminal Interactive Computing (Ma, Nan (Boston University), Ishwar, Prakash (Boston University)) — This looked at the interactive function computation problem, in which Alice has X^n, Bob has Y^n and they want to compute some functions \{f_A(X_i,Y_i) : i \le n\} and \{f_B(X_i,Y_i) : i \le n\}. The pairs (X_i, Y_i) are iid and jointly distributed according to some distribution p_{XY}. As an example, (X_i,Y_i) could be correlated bits and f_A and f_B could be the AND function. In previous work the authors characterized the the rate region (allowing vanishing probability of error) for t rounds of communication, and here they look at the case when there are infinitely many back-and-forth messages. The key insight is to characterize the sum-rate needed as a function of p_{XY} — that is, to look at the function R : \mathcal{P}_{XY} \to \mathbb{R} and look at that surface’s analytic properties. In particular, they show it’s the minimum element of a class \mathcal{F} of functions which have some convexity properties. There is a preprint on ArXiV.

Information theory, emotion, music

From The Information Theory of Emotions of Musical Chords, posted on ArXiV today:

My basic hypothesis is as follows. While perceiving a separate musical chord, the value of a relative
objective function L that is directly related to the main proportion of pitches of its constituent
sounds is generated in the mind of the subject. The idea of an increasing value of the objective
function ( L>1 ) accompanied by positive utilitarian emotions corresponds to a major chord, while
the idea of a decreasing value of the objective function ( L<1 ) accompanied by negative utilitarian
emotions corresponds to a minor chord.

I naturally thought of The Mathematics of Love:

ISIT 2009 : talks part four

The Gelfand-Pinsker Channel: Strong Converse and Upper Bound for the Reliability Function
Himanshu Tyagi, Prakash Narayan
Strong Converse for Gel’fand-Pinsker Channel
Pierre Moulin

Both of these papers proved the strong converse for the Gel’fand-Pinsker channel, e.g. the discrete memoryless channel with iid state sequence P_S, where the realized state sequence is known ahead of time at the encoder. The first paper proved a technical lemma about the image size of “good codeword sets” which are jointly typical conditioned on a large subset of the typical set of S^n sequences. That is, given a code and a set of almost \exp(n H(P_S)) typical sequences in S^n for which the average probability of error is small, then they get a bound on the rate of the code. They then derive bounds on error exponents for the channel. The second paper has a significantly more involved argument, but one which can be extended to multiaccess channels with states known to the encoder.

Combinatorial Data Reduction Algorithm and Its Applications to Biometric Verification
Vladimir B. Balakirsky, Anahit R. Ghazaryan, A. J. Han Vinck

The goal of this paper was to compute short fingerprints f(\mathbf{x}) from long binary strings \mathbf{x} so that a verifier can look at a new long vector \mathbf{y} and tell whether or not \mathbf{y} is close to \mathbf{x} based on f(\mathbf{x}). This is a little different from hashing, where we could first compute f(\mathbf{y}). They develop a scheme which stores the index of a reference vector \mathbf{c} that is “close” to \mathbf{x} and the distance d(\mathbf{x},\mathbf{c}). This can be done with low complexity. They calculated false accept and reject rates for this scheme. Since the goal is not reconstruction or approximation, but rather a kind of classification, they can derive a reference vector set which has very low rate.

Two-Level Fingerprinting Codes
N. Prasanth Anthapadmanabhan, Alexander Barg

This looks at a variant of the fingerprinting problem, in which you a content creator makes several fingerprinted versions of an object (e.g. a piece of software) and then a group of pirates can take their versions and try to create a new object with a valid fingerprint. The marking assumption mean that the pirates can only alter the positions in their copies which are different. The goal is to build a code such that a verifier looking at an object produced by t pirates can identify at least one of the pirates. In the two-level problem, the objects are coarsely classified into groups (e.g. by geographic region) and the verifier wants to be able to identify one of the groups of one of the pirates when there are more than t pirates. They provide some conditions for traceability and constructions, This framework can also be extended to multiple levels.