I will post more about Allerton soon (I’m still on the road), but I wanted to clear out some old links before doing that. I’m starting my new gig at TTIC this week, and the last few weeks have been a whirlwind of travel and internetlessness, so blogging has been curtailed.

And a (not-so-recent) tour around the ArXiV — I haven’t had a chance to read these yet, but maybe once I am settled…

Well, if not decipher, at least claim that there is something to read. A recent paper claims that Pictish inscriptions are a form of written language:

Lo and behold, the Shannon entropy of Pictish inscriptions turned out to be what one would expect from a written language, and not from other symbolic representations such as heraldry.

The full paper has more details. From reading the popular account I thought it was just a simple hypothesis test using the empirical entropy as a test statistic and “heraldry” as the null hypothesis, but it is a little more complicated than that.

After identifying the set of symbols in Pictish inscriptions, the question is how related adjacent symbols are to each other. That is, can the symbols be read sequentially? What they do is renormalize Shannon’s $F_2$ statistic (from the paper “Prediction and entropy of printed English”), which is essentially the empirical conditional entropy of the current symbol conditioned on the previous symbol. They compute:

$U_r = F_2 / \log\left( \frac{N_d}{N_u} \right)$

where $N_d$ and $N_u$ are the number of di-grams and uni-grams, respectively. Why normalize? The statistic $F_2$ by itself does not discriminate well between semasiographic (symbolic systems like heraldry) and lexigraphic (e.g. alphabets or syllabaries) systems.
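To make the statistic concrete, here is a rough sketch of how one might compute $F_2$ and $U_r$ from a sequence of symbols. It rests on assumptions of mine rather than on the paper: I take $N_d$ and $N_u$ to be the number of distinct di-grams and distinct symbols, I use base-2 logarithms, and the function names `f2` and `u_r` are made up for illustration.

```python
from collections import Counter
from math import log2


def f2(seq):
    """Empirical conditional entropy (bits) of a symbol given the previous one."""
    pairs = list(zip(seq, seq[1:]))          # adjacent symbol pairs (di-grams)
    total = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)  # how often each symbol starts a pair
    # -sum over pairs of p(a, b) * log2 p(b | a)
    return -sum(
        (c / total) * log2(c / first_counts[a])
        for (a, _b), c in pair_counts.items()
    )


def u_r(seq):
    """F_2 divided by log2(N_d / N_u); counts are distinct types (an assumption)."""
    n_d = len(set(zip(seq, seq[1:])))  # distinct di-grams
    n_u = len(set(seq))                # distinct symbols
    denom = log2(n_d / n_u)
    if denom == 0:
        raise ValueError("N_d equals N_u, so the normalization is undefined")
    return f2(seq) / denom
```

For a toy sequence like `"aabb"` there are three distinct di-grams and two distinct symbols, so the normalizer is $\log_2 1.5$. Real inscriptions are of course sequences of symbol identifiers rather than characters, but any hashable items will do.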

Another feature which the authors think is important is the number of digrams which are repeated in the text. If $S_d$ is the number of digrams appearing once and $T_d$ is the total number of digrams, they use a “di-gram repetition factor”

$C_r = \frac{N_d}{N_u} + a \cdot \frac{S_d}{T_d}$

where the tradeoff factor $a$ is chosen via cross-validation on known corpora.
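The repetition factor is easy to sketch in the same style. Again this is only an illustration under my own assumptions: $N_d$ and $N_u$ are counted as distinct types, $T_d$ counts every di-gram occurrence, and the value of `a` has to be supplied by the caller since I have not given the fitted value here. The name `c_r` is hypothetical.

```python
from collections import Counter


def c_r(seq, a):
    """Di-gram repetition factor N_d/N_u + a * S_d/T_d for a symbol sequence."""
    pairs = list(zip(seq, seq[1:]))
    counts = Counter(pairs)
    n_d = len(counts)                                  # distinct di-grams (assumed meaning)
    n_u = len(set(seq))                                # distinct symbols (assumed meaning)
    s_d = sum(1 for c in counts.values() if c == 1)    # di-grams that appear exactly once
    t_d = len(pairs)                                   # total di-gram occurrences
    return n_d / n_u + a * s_d / t_d
```

In `"abab"` the di-gram `ab` appears twice and `ba` once, so $S_d = 1$, $T_d = 3$ and $N_d / N_u = 1$.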

They then propose a two-step decision process. First they compare $C_r$ to a threshold; if it is small then they deem the system to be more “heraldic”. If $C_r$ is large then they do a three-way decision based on $U_r$. If $U_r$ is small then the text corresponds to letters, if larger, syllables, and larger still, words.
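The two-step rule amounts to a small decision tree. The sketch below is mine, not the authors' code, and the three thresholds are placeholders: I have not quoted the cut-off values, so they are passed in as parameters with hypothetical names.

```python
def classify(c_r_value, u_r_value, c_threshold, u_letters, u_syllables):
    """Two-step decision: heraldic vs. written, then letters/syllables/words.

    All three thresholds are hypothetical inputs; they would have to come
    from calibration on known corpora.
    """
    if c_r_value < c_threshold:
        return "heraldic"      # small C_r: more like heraldry
    if u_r_value < u_letters:
        return "letters"       # smallest U_r
    if u_r_value < u_syllables:
        return "syllables"     # intermediate U_r
    return "words"             # largest U_r
```

Note that $U_r$ is only consulted once $C_r$ clears its threshold, which matches the ordering of the two steps described above.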

In this paper, “entropy” is being used as some statistic with discriminatory value. It is not clear a priori that human writing systems should display empirical entropies with certain values, but since it works well on other known corpora, it seems like reasonable evidence. I think the authors are relatively careful about this, which is nice, since popular news might make one think that purported alien transmissions could easily fall to a similar analysis. Maybe that’s how Jeff Goldblum managed to get his Mac to reprogram the alien ship in Independence Day.

Update: I forgot to link to a few related things. The statistics in this paper are a little more convincing than the work on the Indus script (see Cosma’s lengthy analysis). In particular, they do a little better job of justifying their statistic as discriminating in known corpora. Pictish would seem to be woefully undersampled, so it is important to justify the statistic as discriminatory for small data sets.

I came across this blog post today while trying to figure out how to write the Romanian breve (the symbol ă) in a document, and it was an amusingly angry rant about Romanian orthography. The fact that the Romanian currency even got it wrong is pretty funny. But it seems a bit like a futile battle; things always change and I bet the orthography gets merged eventually. I, for one, miss the ess-zett (ß) in German, but it’s gone the way of the dinosaurs.

That would be a great name for a diacritic mark: a dinosaur. A stegosaurus sitting on top of a U. But how would it be pronounced?

As I reread the Burnashev-Zigangirov paper on interval estimation today, I came across a new (to me) spelling of the mathematician Chebyshev’s name. I found a page with variant spellings, including

• Chebyshev
• Chebyshov
• Chebishev
• Chebysheff
• Tschebischeff
• Tschebyshev
• Tschebyscheff
• Tschebyschef
• Tschebyschew

I know that “Tsch” comes from French/German transliterations. But today I saw “Chebysgev,” which is a totally new one to me. Where does the “g” come in? The name is actually Чебышев, which may or may not show up depending on your Unicode support.

UPDATE: Hari found “Tchebichev” in Loève’s Probability Theory book.

I saw an instance of the dreaded loose/lose error in the latest issue of the Transactions on Information Theory. Of course, for many authors, English is their second (or third, or fourth!) language, so errors will happen. But whither copy editing, I ask?

Chris Bertram has a post on speech regulation with which I’m not sure I agree, but I do wholeheartedly agree with this sentiment:

The Americans have a long tradition of trying to discuss these things using the language of an 18th-century document. Given the difficulties of shoehorning a lot of real-world problems into that frame, that gives them a long history of acrobatic hermeneutics somewhere in the vague area of free speech. Some of it is even relevant. The trouble is that many Americans (at least the ones who comment on blogs!) can’t tell the difference between discussing the free speech and discussing the application of their constitution.

Not only true on blogs, but in person as well.

I went to the keynote for the Global Conversations conference, sponsored by the UC Irvine International Center for Writing and Translation, this morning. It was given by Ngugi wa Thiong’o, whose books I have always meant to read but never have. The theme of the conference is how to address marginalized languages, and his keynote made a number of points that I thought were interesting.

Firstly, he had to address the issue of the rich body of literature, especially postcolonial literature, that is written in the language of the colonizers. It’s not just a colonial issue, so the appropriate binary here is dominant/marginalized. The overarching point was that writing in the language of the dominant impoverishes the local: it enables access to the world stage but disables the home culture by taking away new cultural products. “Visibility in the dominant becomes invisibility in the marginalized,” he said. What, then, is the place of conversation between different marginalized communities? While not outright calling for an activism or solidarity movement, he posed as a goal of the conference kickstarting the interactions that might initiate one.

A second smaller point had to do with paralleling the language of technology transfer from industrialized to developing nations to more general knowledge and cultural production. While it’s true that strategies for preservation and revitalization can be transferred, the “working together” is what’s really interesting. Can different marginalized linguistic communities work together without losing something?

In a news story about the terrible fires raging across SoCal, I read that the flames forced “more than 265,000 evacuations from Malibu to San Diego, including a jail, a hospital and nursing homes.” Is there a subtle comparison going on here? Is the author suggesting they are more similar or dissimilar to each other?

Let’s call the whole thing off. One might say the enemy army was “routed,” but do we ever use the word “routing” in that sense? It sounds wrong to me — does it only appear as a participle?

In the networking context though, people say “routing” either to rhyme with “pouting” or with “tooting.” I’d use the latter for “route 66” but I usually use the former for networking.

In case these linguistic musings bore people, fear not — I will write about other things soon.

I think that learning math has colored my ideas about the connotations of words. In particular, the words “sequence” and “series” have the meanings “a set of related things that follow each other in a particular order” and “a number of things of a similar nature coming one after the other.” They appear to be mostly interchangeable. But consider the phrase “in a sequence/series of papers, Csiszár and Narayan proved…” Is there a difference in meaning?

To me, “series” connotes a cumulative effect: the papers build upon each other, as in the summation of a series encountered in calculus. The word “sequence” is milder: these are a set of related papers that follow chronologically, but may look at different angles of the same problem rather than building on each other. Clearly this difference is leaking in from the technical definitions into my writing. Does this happen to anyone else?