### April 2010

April 26, 2010

## Natural frequencies instead of Bayes

Posted by Anand Sarwate under Uncategorized | Tags: probability, teaching |6 Comments

Steven Strogatz has a column up on why it’s easier to think in terms of natural frequencies than conditional probabilities.

April 23, 2010

## swimming in such extravagant grammatical constructions as dependent clauses

Posted by Anand Sarwate under Uncategorized | Tags: academia, humor, science |1 Comment

A hilarious account by Rick Trebino of trying to publish a Comment. Via Aaron Clauset, via Cosma.

April 15, 2010

## truly random numbers?

Posted by Anand Sarwate under Uncategorized | Tags: randomness, science |5 Comments

I heard this interesting story on All Things Considered on random number generation via quantum entanglement. The result was reported in Nature (the full paper is also available). I bet Scott will have something more to say about it (eventually), but it seems interesting to me, at least.

Perhaps I should go learn some quantum physics…

April 6, 2010

## Big setback for net neutrality

Posted by Anand Sarwate under Uncategorized | Tags: internet, politics |Leave a Comment

April 2, 2010

## Shannon theory helps decipher Pictish?

Posted by Anand Sarwate under Uncategorized | Tags: information theory, language, statistics |1 Comment

Well, if not decipher, at least claim that there is something to read. A recent paper claims that Pictish inscriptions are a form of written language:

Lo and behold, the Shannon entropy of Pictish inscriptions turned out to be what one would expect from a written language, and not from other symbolic representations such as heraldry.

The full paper has more details. From reading the popular account I thought it was just a simple hypothesis test using the empirical entropy as a test statistic and “heraldry” as the null hypothesis, but it is a little more complicated than that.

After identifying the set of symbols in Pictish inscriptions, the question is how related adjacent symbols are to each other. That is, can the symbols be read sequentially? What they do is renormalize Shannon’s $F_2$ statistic (from the paper “Prediction and entropy of printed English”), which is essentially the empirical conditional entropy of the current symbol conditioned on the previous symbol. They compute:

$$U_r = \frac{F_2}{\log_2 (N_d / N_u)}$$

where $N_d$ and $N_u$ are the number of di-grams and uni-grams, respectively. Why normalize? The statistic $F_2$ by itself does not discriminate well between semasiographic (symbolic systems like heraldry) and lexigraphic (e.g. alphabets or syllabaries) systems.

Another feature which the authors think is important is the number of di-grams which are repeated in the text. If $S_d$ is the number of di-grams appearing once and $T_d$ is the total number of di-grams, they use a “di-gram repetition factor”

$$C_r = \frac{N_d}{N_u} + a \frac{S_d}{T_d}$$

where the tradeoff factor $a$ is chosen via cross-validation on known corpora.

They then propose a two-step decision process. First they compare $C_r$ to a threshold: if it is small then they deem the system to be more “heraldic”. If $C_r$ is large then they do a three-way decision based on $U_r$. If $U_r$ is small then the text corresponds to letters, if larger, syllables, and larger still, words.
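To make the two statistics and the decision rule concrete, here is a rough Python sketch. It is my reading of the description above, not the authors' code. I assume the normalized statistic is the di-gram conditional entropy F2 divided by log2(N_d / N_u), and that the repetition factor is N_d / N_u + a * S_d / T_d, with N_d and N_u counting *distinct* di-grams and uni-grams. The function names, the default value of `a`, and every threshold are hypothetical placeholders; the paper fits its own values on known corpora.

```python
import math
from collections import Counter


def digram_statistics(symbols, a=1.0):
    """Return (F2, U_r, C_r) for a sequence of symbols.

    Assumed definitions (my reading, not verified against the paper):
      F2  = empirical entropy of a symbol given the previous symbol (bits)
      U_r = F2 / log2(N_d / N_u)
      C_r = N_d / N_u + a * S_d / T_d
    The default a=1.0 is a placeholder, not a fitted value.
    """
    symbols = list(symbols)
    digrams = list(zip(symbols, symbols[1:]))
    if not digrams:
        raise ValueError("need at least two symbols")

    digram_counts = Counter(digrams)
    first_counts = Counter(first for first, _ in digrams)

    T_d = len(digrams)                 # total di-grams
    N_d = len(digram_counts)           # distinct di-grams
    N_u = len(set(symbols))            # distinct uni-grams
    S_d = sum(1 for c in digram_counts.values() if c == 1)  # seen once

    # Conditional entropy: -sum p(x, y) * log2 p(y | x)
    F2 = 0.0
    for (first, _), count in digram_counts.items():
        F2 -= (count / T_d) * math.log2(count / first_counts[first])

    ratio = N_d / N_u
    # The normalization is undefined when N_d == N_u (log2(1) == 0).
    U_r = F2 / math.log2(ratio) if ratio != 1 else math.nan
    C_r = ratio + a * S_d / T_d
    return F2, U_r, C_r


def classify(C_r, U_r, c_threshold, u_letters, u_syllables):
    """Two-step decision; all three thresholds are hypothetical inputs."""
    if C_r < c_threshold:
        return "heraldic"
    if U_r < u_letters:
        return "letters"
    if U_r < u_syllables:
        return "syllables"
    return "words"
```

Calling `digram_statistics("aabb")` gives the three statistics for a toy four-symbol string, which can then be passed to `classify` along with whatever thresholds one has fit. On a corpus as small as the Pictish one, the empirical counts are noisy, which is exactly why the choice of thresholds matters.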

In this paper “entropy” is being used as some statistic with discriminatory value. It is not clear a priori that human writing systems should display empirical entropies with certain values, but since it works well on other known corpora, it seems like reasonable evidence. I think the authors are relatively careful about this, which is nice, since popular news might make one think that purported alien transmissions could easily fall to a similar analysis. Maybe that’s how Jeff Goldblum managed to get his Mac to reprogram the alien ship in *Independence Day*…

**Update**: I forgot to link to a few related things. The statistics in this paper are a little more convincing than the work on the Indus script (see Cosma’s lengthy analysis). In particular, they do a slightly better job of justifying their statistic as discriminating in known corpora. Pictish would seem to be woefully undersampled, so it is important to justify the statistic as discriminatory for small data sets.

April 2, 2010

## Everyone hates NCLB

Posted by Anand Sarwate under Uncategorized | Tags: education, news, politics |8 Comments

Via Kevin Drum, I read this Economist poll about the popularity of No Child Left Behind. A rather overwhelming plurality of those surveyed said that it has hurt our schools. I don’t think I’ve met a single person who likes the law, although I chalked that up to the general political leanings of my friends. Perhaps repealing it would be something that can get “bipartisan support.”

On another note, the Wikipedia article says that people pronounce NCLB as “nicklebee.” Really? I have never heard that before. (Brandy, I’m looking at you).