Linkage

Posted on April 18, 2016 by Anand Sarwate

At a DARPA PI meeting recently, I met some folks from Cybernetica who told me about the hot new startup CountryOS! (EDIT: it’s not their startup).

A recent 99% Invisible episode describes the history of the SIGSALY, a secure communication system developed during WWII that used white noise one-time pads printed on vinyl to analog-encrypt communications lines.

Thanks to The Allusionist, I learned about EuroSpeak and discovered this guide on Misused English words and expressions in EU publications, which is hilarious.

“The needs of the many,” privilege, and power

Posted on December 14, 2014 by Anand Sarwate

There’s a certain set of sentiments which undergird a lot of thinking in engineering, and especially engineering about data. You want a method which has good performance “on average” over the population. The other extreme is worst-case, but there are things you can only do in the average case. By focusing on average-case gain, you get a kind of “the needs of the many outweigh the needs of the few” way of thinking about the world.

The needs of the many outweigh the needs of the one… or the few?

Now in the abstract land of mathematical models and algorithms, this might seem like a reasonable principle — if you have to cram everything into a single population utility function you might as well then optimize that. However, this gets messier when you start implementing it in the real world (unless of course you’re an economist of a certain stripe). The needs of the many are often the needs of the more powerful or dominant groups in society. The needs of the few are perhaps those who have been historically marginalized or victimized. Extolling the benefits to the many is often taking a stand for the powerful against the weak. It’s at best deeply insensitive.

Two instances of this have appeared on the blogosphere recently. Scott Aaronson blogged recently about MIT’s decision to take down Walter Lewin’s online videos after Lewin was found to have sexually harassed students in connection with the course. Scott believes that depriving students of Lewin’s materials is a terrible outcome, even (possibly) if he were a murderer. Ignoring the real hurt and trauma felt by those who are affected by Lewin’s actions is an exercise in privilege — because he is not hurt by it, he values “the good of the many” trumping the “good of the few.”

The whole downplaying of sexual harassment as being somehow “not serious” enough to warrant a serious response (or that the response “makes the most dramatic possible statement about a broader social issue”) in fact trivializes the whole experience of sexual violence. Indeed, by this line of argument, because the content created by Lewin is so valuable, it may be ok to keep online even “had [he] gone on a murder spree.” The subtext of this is “as opposed to merely harassed some women.” I recommend reading Priya Phadnis on this case — she comes to a very different conclusion, namely that the special pedestal that we put Walter Lewin on is itself the problem. Being able to downplay the female victims’ claims is exercising the sort of privilege that members of the male professoriat (myself included) indulge in overtly, covertly, and inadvertently. If STEM has a gender problem, it’s in large part because we do not pay attention to the ways in which our words and actions reinforce existing tropes.

The second post was by Lance Fortnow on dying languages in response to an op-ed by John McWhorter on why we should care about language diversity. Lance thinks that speaking a common language is a good thing:

I understand the desire of linguists and social scientists to want to keep these languages active, but to do so may make it harder for them to take advantage of our networked society. Linguists should study languages but they shouldn’t interfere with the natural progression. Every time a language dies, the world gets more connected and that’s not a bad thing.

I guess those poor bleeding-heart social scientists don’t understand that those languages are dying for a good reason. The good of the many — everyone speaking English, the dominant language — outweighs the good of the few. This attitude again speaks from a place of privilege and power, and it reinforces a kind of cultural superiority (although I am sure Lance doesn’t think of it that way). Indeed, in many parts of the world, there is and continues to be “a strong reason to learn multiple languages.” By casually (and incorrectly) dismissing the importance of linguistic diversity, such a statement reinforces a chauvinist view of the relationship between language and technology.

We start with desirable outcomes: free quality educational materials that lower the barrier to access or speaking a common language to help facilitate communication and cooperation. By choosing to focus on those outcomes and their benefits to the many, we value their well-being and delegitimize the harm done to others. If we furthermore are speaking from a position of power, our privilege reinforces stigmas, casting a value judgement on the rights, experiences, and beliefs of the few. It’s something to be careful about.

Linkage

Posted on October 2, 2011 by Anand Sarwate

I will post more about Allerton soon (I’m still on the road), but I wanted to clear out some old links before doing that. I’m starting my new gig at TTIC this week, and the last few weeks have been a whirlwind of travel and internetlessness, so blogging has been curtailed.

The 12 coolest libraries in the world (via MeFi).
Todd Coleman shows you how to peel garlic efficiently via “shaking the dickens out of it.” No, not that Todd Coleman!
Some scraps from an exterminated language have been found.
Florence Nightingale’s statistical diagrams (via MeFi), but also covered by the BBC (part 1, part 2).

And a (not-so-recent) tour around the ArXiV — I haven’t had a chance to read these yet, but maybe once I am settled…

Active Ranking using Pairwise Comparisons by Kevin G. Jamieson and Robert D. Nowak — this is related to a talk given by Constantine Caramanis at Allerton. Instead of looking at how to learn from total orderings, we have to learn the total ordering from pairwise orderings (I like chocolate more than vanilla).
Distributed Algorithms for Consensus and Coordination in the Presence of Packet-Dropping Communication Links – Part I and Part II by Nitin H. Vaidya, Christoforos N. Hadjicostis, and Alejandro D. Dominguez-Garcia (in different orders). These papers look at consensus in asymmetric communication settings with packet drops and modify the update rule to achieve almost sure convergence. The analysis seems to rely on the “coefficient of ergodicity” approach for inhomogeneous Markov chains. It’s doubly appropriate for the blog!
Distributed Algorithms for Optimal Power Flow Problem by Albert Y.S. Lam, Baosen Zhang, and David Tse. Power networks are hot and this paper studies an interesting problem of cost minimization in power flow networks. I found it a bit weird that the abstract and introduction assume you already know what the problem is… but that’s what happens when you are an outsider.
Optimal Sensor Placement for Intruder Detection by Waseem A. Malik, Nuno C. Martins, and Ananthram Swami
The Projection Method for Reaching Consensus and the Regularized Power Limit of a Stochastic Matrix by R. P. Agaev, P. Yu. Chebotarev
Tropical Algebraic approach to Consensus over Networks, by Joel George Manathara, Ambedkar Dukkipati, Debasish Ghose
Fundamentals of Stein’s method by Nathan Ross
A Learning Theory Approach to Non-Interactive Database Privacy by Avrim Blum, Katrina Ligett, Aaron Roth
Bandits with an Edge by Dotan Di Castro, Claudio Gentile, Shie Mannor
State-of-the-Art in Sequential Change-Point Detection by Aleksey S. Polunchenko, Alexander G. Tartakovsky
Wasserstein distances for discrete measures and convergence in nonparametric mixture models by XuanLong Nguyen
High-dimensional regression with noisy and missing data: Provable guarantees with non-convexity by Po-Ling Loh, Martin J. Wainwright
Canonical Estimation in a Rare-Events Regime by Mesrob I. Ohannessian, Vincent Y. F. Tan, Munther A. Dahleh

Shannon theory helps decipher Pictish?

Posted on April 2, 2010 by Anand Sarwate

Well, if not decipher, at least claim that there is something to read. A recent paper claims that Pictish inscriptions are a form of written language:

Lo and behold, the Shannon entropy of Pictish inscriptions turned out to be what one would expect from a written language, and not from other symbolic representations such as heraldry.

The full paper has more details. From reading the popular account I thought it was just a simple hypothesis test using the empirical entropy as a test statistic and “heraldry” as the null hypothesis, but it is a little more complicated than that.

After identifying the set of symbols in Pictish inscriptions, the question is how related adjacent symbols are to each other. That is, can the symbols be read sequentially? What they do is renormalize Shannon’s $F_2$ statistic (from the paper “Prediction and entropy of printed English”), which is essentially the empirical conditional entropy of the current symbol conditioned on the previous symbol. They compute:

$U_r = F_2 / \log\left( \frac{N_d}{N_u} \right)$

where $N_d$ and $N_u$ are the number of di-grams and uni-grams, respectively. Why normalize? The statistic $F_2$ by itself does not discriminate well between semasiographic (symbolic systems like heraldry) and lexigraphic (e.g. alphabets or syllabaries) systems.
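To make the normalization concrete, here is a rough Python sketch of how one might compute $F_2$ and $U_r$ from a sequence of symbols. This is my own illustration, not the authors' code. It assumes that $N_d$ and $N_u$ count distinct di-grams and uni-grams, that logarithms are base 2, and that the whole corpus is one sequence (a real analysis would avoid forming di-grams across inscription boundaries). The function names are made up for the example.

```python
import math
from collections import Counter


def entropy_bits(counts):
    """Empirical Shannon entropy (in bits) of a Counter of observations."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def f2_statistic(symbols):
    """Empirical conditional entropy of a symbol given the previous one.

    Computed as H(adjacent pairs) - H(first symbol of each pair).
    """
    pairs = list(zip(symbols, symbols[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(first for first, _ in pairs)
    return entropy_bits(pair_counts) - entropy_bits(first_counts)


def u_r(symbols):
    """Normalized statistic U_r = F_2 / log2(N_d / N_u)."""
    n_d = len(set(zip(symbols, symbols[1:])))  # assumption: distinct di-grams
    n_u = len(set(symbols))                    # assumption: distinct uni-grams
    if n_d <= n_u:
        # log(N_d / N_u) would be zero or negative, so U_r is not meaningful
        raise ValueError("need more distinct di-grams than uni-grams")
    return f2_statistic(symbols) / math.log2(n_d / n_u)
```

A perfectly alternating sequence like "abababab" has $F_2 = 0$, since each symbol is determined by the one before it, and the sketch refuses to compute $U_r$ there because $N_d = N_u$.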

Another feature which the authors think is important is the number of digrams which are repeated in the text. If $S_d$ is the number of digrams appearing once and $T_d$ is the total number of digrams, they use a “di-gram repetition factor”

$C_r = \frac{N_d}{N_u} + a \cdot \frac{S_d}{T_d}$

where the tradeoff factor $a$ is chosen via cross-validation on known corpora.

They then propose a two-step decision process. First they compare $C_r$ to a threshold — if it is small then they deem the system to be more “heraldic”. If $C_r$ is large then they do a three-way decision based on $U_r$. If $U_r$ is small then the text corresponds to letters, if larger, syllables, and larger still, words.
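The two-step rule amounts to a small decision tree. Here is a sketch of it in Python; the threshold arguments are hypothetical placeholders, since I am not reproducing the paper's actual cutoff values, and the labels are just my shorthand for the four outcomes.

```python
def classify_script(c_r, u_r, c_threshold, letter_max, syllable_max):
    """Two-step decision: first C_r against a threshold, then U_r in three bands.

    All three cutoffs are placeholders supplied by the caller, not values
    taken from the paper.
    """
    if c_r < c_threshold:
        return "heraldic"    # small C_r: more like a semasiographic system
    if u_r < letter_max:
        return "letters"     # small U_r
    if u_r < syllable_max:
        return "syllables"   # larger U_r
    return "words"           # larger still


# Example with made-up numbers, purely to show the call shape
label = classify_script(c_r=5.0, u_r=0.7, c_threshold=4.0,
                        letter_max=0.5, syllable_max=1.0)
```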

In this paper “entropy” is being used as a statistic with discriminatory value. It is not clear a priori that human writing systems should display empirical entropies with certain values, but since it works well on other known corpora, it seems like reasonable evidence. I think the authors are relatively careful about this, which is nice, since popular news might make one think that purported alien transmissions could easily fall to a similar analysis. Maybe that’s how Jeff Goldblum managed to get his Mac to reprogram the alien ship in Independence Day…

Update: I forgot to link to a few related things. The statistics in this paper are a little more convincing than the work on the Indus script (see Cosma’s lengthy analysis). In particular, they do a little better job of justifying their statistic as discriminating in known corpora. Pictish would seem to be woefully undersampled, so it is important to justify the statistic as discriminatory for small data sets.

Romanian diacritics

Posted on July 15, 2009 by Anand Sarwate

I came across this blog post today while trying to figure out how to write the Romanian breve (the symbol ă) in a document, and it was an amusingly angry rant about Romanian orthography. The fact that the Romanian currency even got it wrong is pretty funny. But it seems a bit like a futile battle; things always change and I bet the orthography gets merged eventually. I, for one, miss the ess-zett (ß) in German, but it’s gone the way of the dinosaurs.

That would be a great name for a diacritic mark — a dinosaur. A stegosaurus sitting on top of a U. But how would it be pronounced?

You say Tschebyscheff, I say Chebyshev

Posted on November 7, 2007 by Anand Sarwate

As I reread the Burnashev-Zigangirov paper on interval estimation today, I came across a new (to me) spelling of the mathematician Chebyshev’s name. I found a page with variant spellings, including

Chebyshev
Chebyshov
Chebishev
Chebysheff
Tschebischeff
Tschebyshev
Tschebyscheff
Tschebyschef
Tschebyschew

I know that “Tsch” comes from French/German transliterations. But today I saw “Chebysgev,” which is a totally new one to me. Where does the “g” come in? The name is actually Чебышев, which may or may not show up depending on your Unicode support.

UPDATE : Hari found “Tchebichev” in Loève’s Probability Theory book.

IT Transactions, lose the loose “loose!”

Posted on November 7, 2007 by Anand Sarwate

I saw an instance of the dreaded loose/lose error in the latest issue of the Transactions on Information Theory. Of course, for many authors, English is their second (or third, or fourth!) language, so errors will happen. But whither copy editing, I ask?

free speech and America-centrism

Posted on October 29, 2007 by Anand Sarwate

Chris Bertram has a post on speech regulation with which I’m not sure I agree, but I do wholeheartedly agree with this sentiment:

The Americans have a long tradition of trying to discuss these things using the language of an 18th-century document. Given the difficulties of shoehorning a lot of real-world problems into that frame, that gives them a long history of acrobatic hermeneutics somewhere in the vague area of free speech. Some of it is even relevant. The trouble is that many Americans (at least the ones who comment on blogs!) can’t tell the difference between discussing the free speech and discussing the application of their constitution.

Not only true on blogs, but in person as well.

writing in the language of the dominant

Posted on October 24, 2007 by Anand Sarwate

I went to the keynote for the Global Conversations conference, sponsored by the UC Irvine International Center for Writing and Translation, this morning. It was given by Ngugi wa Thiong’o, whose books I have always meant to read but never have. The theme of the conference is how to address marginalized languages, and his keynote made a number of points that I thought were interesting.

Firstly, he had to address the issue of the rich body of literature, especially postcolonial literature, that is written in the language of the colonizers. It’s not just a colonial issue, so the appropriate binary here is dominant/marginalized. The overarching point was that writing in the language of the dominant impoverishes the local — it enables access to the world stage but disables the home culture by taking away new cultural products. “Visibility in the dominant becomes invisibility in the marginalized,” he said. What then, is the place of conversation between different marginalized communities? While not outright calling for an activism or solidarity movement, he posed as one goal of the conference kickstarting the interactions that might initiate one.

A second smaller point had to do with paralleling the language of technology transfer from industrialized to developing nations to more general knowledge and cultural production. While it’s true that strategies for preservation and revitalization can be transferred, the “working together” is what’s really interesting. Can different marginalized linguistic communities work together without losing something?

are the three the same, or different?

Posted on October 22, 2007 by Anand Sarwate

In a news story about the terrible fires raging across SoCal, I read that the flames forced “more than 265,000 evacuations from Malibu to San Diego, including a jail, a hospital and nursing homes.” Is there a subtle comparison going on here? Is the author suggesting they are more similar or dissimilar to each other?


An Ergodic Walk

a process whose average over time converges to the true average
