Varshneys’ got a brand new blog.
(to be sung to the James Brown tune of the similar name)
Well, if not decipher, at least claim that there is something to read. A recent paper claims that Pictish inscriptions are a form of written language:
Lo and behold, the Shannon entropy of Pictish inscriptions turned out to be what one would expect from a written language, and not from other symbolic representations such as heraldry.
The full paper has more details. From reading the popular account I thought it was just a simple hypothesis test using the empirical entropy as a test statistic and “heraldry” as the null hypothesis, but it is a little more complicated than that.
After identifying the set of symbols in Pictish inscriptions, the question is how related adjacent symbols are to each other. That is, can the symbols be read sequentially? What they do is renormalize Shannon's $F_2$ statistic (from the paper "Prediction and entropy of printed English"), which is essentially the empirical conditional entropy of the current symbol conditioned on the previous symbol. They compute

$U_r = \frac{F_2}{\log_2(N_d/N_u)}$

where $N_d$ and $N_u$ are the number of distinct di-grams and uni-grams, respectively. Why normalize? The statistic $F_2$ by itself does not discriminate well between semasiographic (symbolic systems like heraldry) and lexigraphic (e.g. alphabets or syllabaries) systems.
Another feature which the authors think is important is the number of digrams which are repeated in the text. If $S_d$ is the number of digrams appearing only once and $T_d$ is the total number of digrams, they use a "di-gram repetition factor"

$C_r = \frac{S_d}{T_d} + a \frac{N_d}{T_d}$

where the tradeoff factor $a$ is chosen via cross-validation on known corpora.
They then propose a two-step decision process. First they compare $C_r$ to a threshold: if it is small, they deem the system to be more "heraldic". If $C_r$ is large, then they do a three-way decision based on $U_r$. If $U_r$ is small then the text corresponds to letters; if larger, syllables; and larger still, words.
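To make the statistic concrete, here is a toy version of the computation (my own sketch; the variable names mirror the quantities above, and the normalization $U_r = F_2/\log_2(N_d/N_u)$ follows my reading of the paper):

```python
import math
from collections import Counter

def digram_stats(symbols):
    """Compute F2 (empirical conditional entropy of a symbol given its
    predecessor, in bits), the number of distinct uni-grams Nu and
    di-grams Nd, and the normalized statistic Ur = F2 / log2(Nd / Nu)."""
    digrams = list(zip(symbols, symbols[1:]))
    uni = Counter(symbols[:-1])   # counts of each symbol as a digram prefix
    di = Counter(digrams)
    total = len(digrams)
    # F2 = -sum p(i,j) log2 p(j|i), with p(j|i) = n(i,j) / n(i)
    f2 = -sum((n / total) * math.log2(n / uni[a]) for (a, b), n in di.items())
    nu, nd = len(set(symbols)), len(di)
    ur = f2 / math.log2(nd / nu) if nd > nu else float("inf")
    return f2, nu, nd, ur

# A deterministic sequence has zero conditional entropy:
print(digram_stats(list("ababab"))[0])  # 0.0
print(digram_stats(list("the cat and the hat")))
```

On real corpora one would compare these values against known writing systems, which is exactly what the authors do.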
In this paper "entropy" is being used as a statistic with discriminatory value. It is not clear a priori that human writing systems should display empirical entropies with certain values, but since the statistic works well on other known corpora, it seems like reasonable evidence. I think the authors are relatively careful about this, which is nice, since the popular coverage might make one think that purported alien transmissions could easily fall to a similar analysis. Maybe that's how Jeff Goldblum managed to get his Mac to reprogram the alien ship in Independence Day…
Update: I forgot to link to a few related things. The statistics in this paper are a little more convincing than the work on the Indus script (see Cosma's lengthy analysis). In particular, they do a better job of justifying their statistic as discriminating in known corpora. Pictish would seem to be woefully undersampled, so it is important to justify the statistic as discriminatory for small data sets.
Via Lifehacker, I came across this short description of how credit card numbers are coded, and how the last digit is a parity check. It’s a cute example of “real world” error-detection that pretty much anyone could understand. Cute extra-credit problem: how many valid credit card numbers are there out there? (This reminds me of a USAMTS problem from ages past).
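For the curious, the check digit is computed with the Luhn algorithm. Here is a short sketch (the function names are mine):

```python
def luhn_checksum(digits):
    """Luhn mod-10 checksum: double every second digit from the right,
    subtract 9 from any double above 9, and sum everything mod 10."""
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10

def is_valid(number):
    """A card number is valid when the checksum over all its digits is 0."""
    return luhn_checksum([int(c) for c in str(number)]) == 0

print(is_valid(4111111111111111))  # True: a classic valid test number
print(is_valid(4111111111111112))  # False: last digit altered
```

Since the last digit is chosen to make the checksum zero, exactly one in ten numbers of a given length passes, which is a start on the extra-credit problem.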
The 2010 ITA Workshop is over, and now that I have almost caught up on sleep, I can give a short report. I think this year was the largest workshop so far, with over 300 speakers.
One of the additions this year was a poster session to give those students who didn’t get a chance to speak at the “Graduation Day” talks on Wednesday an opportunity to present their research. The posters went up on Monday and many stayed up most of the week. I am usually dubious of poster sessions for more theoretical topics; from past experience I have had a hard time conveying anything useful in the poster and they are poorly attended. However, this poster session seemed to be a rousing success, and I hope they continue to do it in future years.
The theme this year was “networks,” broadly construed, and although I didn’t end up going to a lot of the “pure” networking sessions, the variety of topics was nice to see, from learning in networks to network models. Jon Kleinberg gave a great plenary lecture on cascading phenomena. I particularly enjoyed Mor Harchol-Balter’s tutorial on job scheduling in server farms. I learned to re-adjust my intuition for what good scheduling policies might be. The tutorials should be online sometime and I’ll post links when that happens.
The “Senseless Decompression” session organized by Rudi Urbanke and Ubli Mitra should also be online sometime. Apparently 20+ schools contributed short videos on the theme. Maybe we can make a YouTube channel for them.
Perhaps later I’ll touch on specific talks that I went to, but this time around I didn’t take too many notes, probably because I was a little exhausted from the organizing. I may post a more technical thing on the talk I gave about my recent work on privacy-preserving machine learning, but that will have to wait a bit in the queue. Mor’s tutorial suggests I should use Shortest Remaining Processing Time (SRPT) to make sure things don’t wait too long, and I have some low-hanging tasks that I can dispatch first.
I’ve been working hard with the organizing committee putting together the 2010 ITA Workshop, which promises to be a blast. With that and job applications I’m probably going to eschew blogging the workshop this year, but I’ll try to post some highlights. If nothing else, it will get me back to blogging again.
Ehsan Ardestanizadeh posed a little problem in Young-Han Kim’s group meeting before break, and Paolo Minero presented a cute solution which I thought might be of interest. It’s related to ideas my advisor, Michael Gastpar, worked on in his thesis.
Suppose I have a Gaussian variable $X$ that is $\mathcal{N}(0,1)$, so zero-mean and unit variance. If I observe $Y = X + Z$, where $Z$ is also $\mathcal{N}(0,1)$ and independent of $X$, the minimum mean-square error (MMSE) estimate of $X$ given $Y$ is just $\hat{X} = \mathbb{E}[X \mid Y] = Y/2$, and the MSE is $1/2$. The question is this: can we somehow get an MSE of less than 1/2 by “encoding” $X$? Suppose now we can take any function $f$ and let $W = f(X)$, and $Y = W + Z$, but with the restriction that $\mathbb{E}[W^2] \le 1$. That is, the encoding cannot take any more power.
The answer is no, and comes via an “information theoretic” argument. Consider the vector version of the problem where you have $n$ iid unit variance Gaussians $X^n = (X_1, \ldots, X_n)$ and you want to estimate $X^n$ from $Y^n = f(X^n) + Z^n$, where $Z^n$ is iid unit variance Gaussian as well. The goal is to minimize the average per-letter distortion:

$D = \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\left[ (X_i - \hat{X}_i)^2 \right].$

This is just the problem of joint source-channel coding of a Gaussian source over an AWGN channel with quadratic distortion, and the encoding function must satisfy a unit power constraint. For this problem the rate distortion function is $R(D) = \frac{1}{2} \log \frac{1}{D}$ and the capacity of the channel is $C = \frac{1}{2} \log 2$. Since separate source coding (compression) followed by channel coding (error control) is optimal, in order to get distortion $D$ the rate must satisfy $R(D) \le C$, so $D \ge 1/2$. Furthermore, this is achievable with no coding at all by just setting $f(X^n) = X^n$ and $\hat{X}_i = Y_i / 2$.
Now if there was a scheme for the single-letter case which got MSE less than 1/2, we could concatenate it to get a vector scheme with distortion less than 1/2. But since $D \ge 1/2$ for any code, we get a contradiction. Thus encoding does not help in the single-letter case either. If $X$ isn’t Gaussian the whole story changes, though.
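As a quick sanity check on the uncoded scheme (my own sketch, not from the group meeting), a Monte Carlo simulation of the estimator $Y/2$ should give an empirical MSE near 1/2:

```python
import random

def empirical_mmse(n=200_000, seed=1):
    """Monte Carlo estimate of the MSE of the estimator Y/2 for
    X ~ N(0,1) observed through Y = X + Z with Z ~ N(0,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = rng.gauss(0, 1)
        y = x + z
        xhat = y / 2  # MMSE estimate E[X|Y] when X and Z have unit variance
        total += (x - xhat) ** 2
    return total / n

print(empirical_mmse())  # close to 0.5
```

Of course the simulation only checks the no-coding point; the argument above is what rules out doing better with any encoder.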
Kamalika pointed me to this paper by Bin Yu in a Festschrift for Lucien Le Cam. People who read this blog who took information theory are undoubtedly familiar with Fano’s inequality, and those who are more on the CS theory side may have heard of Assouad (but not for this lemma). This paper describes the relationship between several lower bounds on hypothesis testing and parameter estimation.
Suppose we have a parametric family of distributions $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$, where $\Theta$ is a metric space with metric $d(\cdot, \cdot)$. For two distributions $P_1$ and $P_2$ define the affinity $\|P_1 \wedge P_2\|$ by:

$\|P_1 \wedge P_2\| = \int \min(dP_1, dP_2).$
Let $\mathrm{co}(\cdot)$ denote the convex hull. Then Le Cam’s lemma is the following.

Le Cam’s Lemma. Let $\hat{\theta}$ be an estimator of $\theta(P)$ on $\mathcal{P}$. Let $D_1$ and $D_2$ be two sets such that $d(\theta_1, \theta_2) \ge 2\delta$ for all $\theta_1 \in D_1$ and $\theta_2 \in D_2$, and let $\mathcal{P}_1$ and $\mathcal{P}_2$ be two subsets of $\mathcal{P}$ such that $\theta(P) \in D_i$ when $P \in \mathcal{P}_i$. Then

$\sup_{P \in \mathcal{P}} \mathbb{E}_P\, d(\hat{\theta}, \theta(P)) \ge \delta \, \| \mathrm{co}(\mathcal{P}_1) \wedge \mathrm{co}(\mathcal{P}_2) \|.$
This lemma gives a lower bound on the error of parameter estimates in terms of the total variation distance between the distributions associated to different parameter sets. It’s a bit different from the bounds we usually think of, like Stein’s Lemma, and also a bit different from bounds like the Cramér-Rao bound.
Le Cam’s lemma can be used to prove Assouad’s lemma, which is a statement about a more structured set of distributions indexed by $\tau \in \{-1,+1\}^m$, the vertices of the hypercube. We’ll write $\tau \sim_j \tau'$ for $\tau, \tau' \in \{-1,+1\}^m$ if they differ in the $j$-th coordinate.
Assouad’s Lemma. Let $\mathcal{F}_m = \{P_\tau\}$ be a set of $2^m$ probability measures indexed by $\tau \in \{-1,+1\}^m$, and suppose there are $m$ pseudo-distances $d_1, \ldots, d_m$ on $\Theta$ such that for any pair $(\theta, \theta')$,

$d(\theta, \theta') = \sum_{j=1}^{m} d_j(\theta, \theta'),$

and that if $\tau \sim_j \tau'$,

$d_j(\theta(P_\tau), \theta(P_{\tau'})) \ge \alpha_m.$

Then

$\max_{\tau} \mathbb{E}_\tau\, d(\hat{\theta}, \theta(P_\tau)) \ge m \cdot \frac{\alpha_m}{2} \cdot \min_{\tau \sim_j \tau'} \| P_\tau \wedge P_{\tau'} \|.$
The min comes about because it is the weakest over all neighbors (that is, over all $j$) of $\tau$ in the hypercube. Assouad’s Lemma has been used in various places, from covariance estimation to learning and other minimax problems.
Yu then shows how to prove Fano’s inequality from Assouad’s inequality. In information theory we see Fano’s Lemma as a statement about random variables and then it gets used in converse arguments for coding theorems to bound the entropy of the message set. Note that a decoder is really trying to do a multi-way hypothesis test, so we can think about the result in terms of hypothesis testing instead. This version can also be found in the Wikipedia article on Fano’s inequality.
Fano’s Lemma. Let $\mathcal{F}$ contain $r$ probability measures such that for all $j \ne j'$ with $j, j' \in \{1, \ldots, r\}$,

$d(\theta(P_j), \theta(P_{j'})) \ge \alpha_r$

and

$D(P_j \| P_{j'}) \le \beta_r.$

Then

$\max_j \mathbb{E}_j\, d(\hat{\theta}, \theta(P_j)) \ge \frac{\alpha_r}{2} \left( 1 - \frac{\beta_r + \log 2}{\log r} \right).$

Here $D(\cdot \| \cdot)$ is the KL-divergence. The proof follows from the regular Fano’s inequality by choosing a message $W$ uniformly in $\{1, 2, \ldots, r\}$ and then setting the output $Y$ to have the distribution $P_j$ conditioned on $W = j$.
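For those who want the missing step, here is a sketch of how the hypothesis-testing version follows from the usual coding-theoretic Fano inequality (my paraphrase of the standard argument, with natural logarithms):

```latex
\begin{align*}
% W uniform on {1,...,r}, observe Y ~ P_W; Fano for any test \hat{W}(Y):
P_e &\ge 1 - \frac{I(W;Y) + \log 2}{\log r}, \\
% convexity of divergence bounds I(W;Y) by the worst pairwise divergence:
I(W;Y) &= \frac{1}{r} \sum_{j=1}^{r}
  D\Big(P_j \,\Big\|\, \frac{1}{r}\sum_{j'=1}^{r} P_{j'}\Big)
  \le \max_{j \ne j'} D(P_j \| P_{j'}) \le \beta_r. \\
% the \theta(P_j) are \alpha_r-separated, so an estimate within \alpha_r/2
% of the truth identifies j; estimation error inherits the testing bound:
\max_j \mathbb{E}_j\, d(\hat{\theta}, \theta(P_j))
  &\ge \frac{\alpha_r}{2}\Big(1 - \frac{\beta_r + \log 2}{\log r}\Big).
\end{align*}
```

The middle inequality is where the pairwise divergence bound $\beta_r$ enters, and the separation $\alpha_r$ converts testing error into estimation error.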
The rest of the paper is definitely worth reading, but to me it was nice to see that Fano’s inequality is interesting beyond coding theory, and is in fact just one of several kinds of lower bound for estimation error.
I was at the Allerton conference last week for a quick visit home — I presented a paper I wrote with Can Aysal and Alex Dimakis on analyzing broadcast-based consensus algorithms when channels vary over time. The basic intuition is that if you can get some long-range connections “enough of the time,” then you’ll get a noticeable speed up in reaching a consensus value, but with high connectivity the variance of the consensus value from the true average increases. It was a fun chance to try out some new figures that I made in OmniGraffle.
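To illustrate the kind of algorithm in the paper, here is a minimal broadcast gossip sketch (my own toy code, not the paper's model: there are no time-varying channels here, and the mixing weight gamma update is the standard broadcast gossip rule). Nodes reach consensus, but broadcasts do not preserve the sum, which is the source of the variance from the true average mentioned above:

```python
import random

def broadcast_gossip(values, neighbors, steps=2000, gamma=0.5, seed=0):
    """Toy broadcast gossip: at each step a random node wakes up and
    broadcasts its value; each neighbor moves a fraction gamma toward it.
    The broadcaster keeps its own value. Returns the final values."""
    rng = random.Random(seed)
    x = list(values)
    n = len(x)
    for _ in range(steps):
        i = rng.randrange(n)
        for j in neighbors[i]:
            x[j] = (1 - gamma) * x[j] + gamma * x[i]
    return x

# Hypothetical example: a ring of 8 nodes holding values 0..7
n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
final = broadcast_gossip(list(range(n)), ring)
print(max(final) - min(final))  # small: the nodes agree
# ...though the consensus value need not equal the true mean 3.5
```

Adding long-range edges to `ring` speeds up the agreement, which is the intuition behind the speed-up claim, at the cost of more drift in the consensus value.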
The plenary was given by Abbas El Gamal on a set of new lecture notes on network information theory that he has been developing with Young-Han Kim. He went through the organization of the material and tried to show how much of the treatment could be developed using robust typicality (see Orlitsky-Roche) and a few basic packing and covering lemmas. This in turn simplifies many of the proofs structurally and can give a unified view. Since I’ve gotten an inside peek at the notes, I can say they’re pretty good and clear, but definitely dense going at times. They are one way to answer the question of “well, what do we teach after Cover and Thomas?” They’re self-contained and compact.
What struck me is that these are really notes on network information theory and not on advanced information theory, which is the special topics class I took at Berkeley. It might be nice to have some of the “advanced but non-network” material in a little set of notes. Maybe the format of Körner’s Fourier Analysis book would work: it could cover things like strong converses, delay issues, universality, wringing lemmas, AVCs, and so on. Almost like a wiki of little technical details that could serve as a reference, rather than a class. The market’s pretty small though…
I felt a bit zombified during much of the conference, so I didn’t take particularly detailed notes, but here were a few talks that I found interesting.
From The Information Theory of Emotions of Musical Chords, posted on the arXiv today:
My basic hypothesis is as follows. While perceiving a separate musical chord, the value of a relative
objective function L that is directly related to the main proportion of pitches of its constituent
sounds is generated in the mind of the subject. The idea of an increasing value of the objective
function ( L>1 ) accompanied by positive utilitarian emotions corresponds to a major chord, while
the idea of a decreasing value of the objective function ( L<1 ) accompanied by negative utilitarian
emotions corresponds to a minor chord.
I naturally thought of The Mathematics of Love:
The Gelfand-Pinsker Channel: Strong Converse and Upper Bound for the Reliability Function
Himanshu Tyagi, Prakash Narayan
Strong Converse for Gel’fand-Pinsker Channel
Pierre Moulin
Both of these papers proved the strong converse for the Gel’fand-Pinsker channel, i.e. the discrete memoryless channel with an iid state sequence $S^n$, where the realized state sequence is known ahead of time (noncausally) at the encoder. The first paper proved a technical lemma about the image size of “good codeword sets” which are jointly typical conditioned on a large subset of the typical set of state sequences. That is, given a code and a set of almost all typical state sequences in $\mathcal{S}^n$ for which the average probability of error is small, they get a bound on the rate of the code. They then derive bounds on error exponents for the channel. The second paper has a significantly more involved argument, but one which can be extended to multiaccess channels with states known to the encoder.
Combinatorial Data Reduction Algorithm and Its Applications to Biometric Verification
Vladimir B. Balakirsky, Anahit R. Ghazaryan, A. J. Han Vinck
The goal of this paper was to compute short fingerprints $f(x)$ from long binary strings $x$ so that a verifier can look at a new long vector $y$ and tell whether or not $y$ is close to $x$ based only on $f(x)$. This is a little different from hashing, where we would compare $f(y)$ to $f(x)$. They develop a scheme which stores the index of a reference vector $c$ that is “close” to $x$, together with the distance $d(x, c)$. This can be done with low complexity. They calculated false accept and false reject rates for this scheme. Since the goal is not reconstruction or approximation, but rather a kind of classification, they can derive a reference vector set which has very low rate.
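Here is a toy version of the idea as I understand it (my own sketch; the names and the acceptance rule are guesses, not the authors' construction). The stored fingerprint is the index of the nearest reference vector plus the distance to it, and the triangle inequality gives a necessary condition for $y$ being close to $x$:

```python
def hamming(u, v):
    """Hamming distance between two equal-length binary tuples."""
    return sum(a != b for a, b in zip(u, v))

def enroll(x, refs):
    """Fingerprint = (index of the nearest reference vector, distance to it)."""
    i = min(range(len(refs)), key=lambda k: hamming(x, refs[k]))
    return i, hamming(x, refs[i])

def verify(y, fingerprint, refs, threshold):
    """Accept y if it is consistent with being within `threshold` of x.
    By the triangle inequality d(x, y) >= |d(x, c) - d(y, c)|, so a large
    gap lets us reject without ever seeing x (a necessary test, not a
    sufficient one -- hence nonzero false accept rates)."""
    i, dx = fingerprint
    return abs(hamming(y, refs[i]) - dx) <= threshold

# Hypothetical example with two reference vectors of length 8
refs = [(0,) * 8, (1,) * 8]
x = (0, 0, 0, 1, 0, 0, 0, 0)
fp = enroll(x, refs)                 # nearest ref is all-zeros, distance 1
print(verify(x, fp, refs, 2))        # True: x itself passes
print(verify((1,) * 8, fp, refs, 2)) # False: |8 - 1| = 7 > 2
```

The fingerprint is tiny (an index plus one distance), which matches the low-rate reference set the authors aim for.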
Two-Level Fingerprinting Codes
N. Prasanth Anthapadmanabhan, Alexander Barg
This looks at a variant of the fingerprinting problem, in which a content creator makes several fingerprinted versions of an object (e.g. a piece of software) and then a group of pirates can take their versions and try to create a new object with a valid fingerprint. The marking assumption means that the pirates can only alter the positions in their copies which are different. The goal is to build a code such that a verifier looking at an object produced by at most $t$ pirates can identify at least one of the pirates. In the two-level problem, the objects are coarsely classified into groups (e.g. by geographic region), and the verifier wants to be able to identify at least the group of one of the pirates when there are more than $t$ pirates. They provide some conditions for traceability and some constructions. This framework can also be extended to multiple levels.