Better late than never

Posted on May 6, 2010 by Anand Sarwate

(to be sung to the James Brown tune of the similar name)

Natural frequencies instead of Bayes

Posted on April 26, 2010 by Anand Sarwate

Steven Strogatz has a column up on why it’s easier to think about natural frequencies rather than conditional probabilities.

swimming in such extravagant grammatical constructions as dependent clauses

Posted on April 23, 2010 by Anand Sarwate

A hilarious account of trying to publish a Comment written by Rick Trebino. Via Aaron Clauset, via Cosma.

truly random numbers?

Posted on April 15, 2010 by Anand Sarwate

I heard this interesting story on All Things Considered on random number generation via quantum entanglement. The result was reported in Nature (the full paper is also available). I bet Scott will have something more to say about it (eventually), but it seems interesting to me, at least.

Perhaps I should go learn some quantum physics…

Big setback for net neutrality

Posted on April 6, 2010 by Anand Sarwate

The U.S. Court of Appeals for the District of Columbia ruled that the FCC lacks the authority to require broadband providers to give equal treatment to all Internet traffic flowing over their networks.

Shannon theory helps decipher Pictish?

Posted on April 2, 2010 by Anand Sarwate

Well, if not decipher, at least claim that there is something to read. A recent paper claims that Pictish inscriptions are a form of written language:

Lo and behold, the Shannon entropy of Pictish inscriptions turned out to be what one would expect from a written language, and not from other symbolic representations such as heraldry.

The full paper has more details. From reading the popular account I thought it was just a simple hypothesis test using the empirical entropy as a test statistic and “heraldry” as the null hypothesis, but it is a little more complicated than that.

After identifying the set of symbols in Pictish inscriptions, the question is how related adjacent symbols are to each other. That is, can the symbols be read sequentially? What they do is renormalize Shannon’s $F_2$ statistic (from the paper “Prediction and entropy of printed English”), which is essentially the empirical conditional entropy of the current symbol conditioned on the past symbols. They compute:

$U_r = F_2 / \log\left( \frac{N_d}{N_u} \right)$

where $N_d$ and $N_u$ are the number of di-grams and un-grams, respectively. Why normalize? The statistic $F_2$ by itself does not discriminate well between semasiographic (symbolic systems like heraldry) and lexigraphic (e.g. alphabets or syllabaries) systems.

Another feature which the authors think is important is the number of digrams which are repeated in the text. If $S_d$ is the number of digrams appearing once and $T_d$ is the total number of digrams, they use a “di-gram repetition factor”

$C_r = \frac{N_d}{N_u} + a \cdot \frac{S_d}{T_d}$

where the tradeoff factor $a$ is chosen via cross-validation on known corpora.

They then propose a two-step decision process. First they compare $C_r$ to a threshold — if it is small then they deem the system to be more “heraldic”. If $C_r$ is large then then do a three-way decision based on $U_r$ . If $U_r$ is small then the text corresponds to letters, if larger, syllables, and larger still, words.

In this paper “entropy” is being used here as some statistic with discriminatory value. It is not clear a priori that human writing systems should display empirical entropies with certain values, but since it works well on other known corpora, it seems like reasonable evidence. I think the authors are relatively careful about this, which is nice, since popular news might make one think that purported alien transmissions could easily fall to a similar analysis. Maybe that’s how Jeff Goldblum mnanaged to get his Mac to reprogram the alien ship in Independence Day…

Update: I forgot to link to a few related things. The statistics in this paper are a little more convincing than the work on the Indus script (see Cosma’s lengthy analysis. In particular, they do a little better job of justifying their statistic as discriminating in known corpora. Pictish would seem to be woefully undersampled, so it is important to justify the statistic as discriminatory for small data sets.

Everyone hates NCLB

Posted on April 2, 2010 by Anand Sarwate

Via Kevin Drum, I read this Economist poll about the popularity of No Child Left Behind. A rather overwhelming plurality of those surveyed said that it has hurt our schools. I don’t think I’ve met a single person who likes the law, although I chalked that up to the general political leanings of my friends. Perhaps repealing it would be something that can get “bipartisan support.”

On another note, the Wikipedia article says that people pronounce NCLB as “nicklebee.” Really? I have never heard that before. (Brandy, I’m looking at you).

The 2008 measles outbreak in San Diego

Posted on March 23, 2010 by Anand Sarwate

A pediatrician friend of mine pointed out this bit of news in Pediatrics on the January 2008 outbreak of measles in San Diego:

The outbreak began in January 2008 when a 7-year-old boy whose parents refused to vaccinate him returned to the U.S. from Switzerland. Before symptoms appeared, he infected his 3-year-old brother and 9-year-old sister. Neither was vaccinated.

Neither were 11% of the boy’s classmates, whose parents shared similar beliefs that a healthy lifestyle protected against disease while vaccines were riskier than the illnesses they prevented.

In the end, 839 people were exposed to measles. Eleven were infected, and 48 exposed kids too young to be vaccinated were quarantined — forbidden to leave their homes — for 21 days. Jane Seward, MBBS, MPH, was the CDC’s senior investigator for the outbreak.

…

Despite the extraordinary efforts of health workers, what really ended the San Diego outbreak wasn’t quarantine or post-exposure vaccination. It was the high vaccination rate in the rest of the community that kept the outbreak from becoming an epidemic.

This is the summary of the study:

The importation resulted in 839 exposed persons, 11 additional cases (all in unvaccinated children), and the hospitalization of an infant too young to be vaccinated. Two-dose vaccination coverage of 95%, absence of vaccine failure, and a vigorous outbreak response halted spread beyond the third generation, at a net public-sector cost of $10376 per case. Although 75% of the cases were of persons who were intentionally unvaccinated, 48 children too young to be vaccinated were quarantined, at an average family cost of $775 per child. Substantial rates of intentional undervaccination occurred in public charter and private schools, as well as public schools in upper-socioeconomic areas. Vaccine refusal clustered geographically and the overall rate seemed to be rising. In discussion groups and survey responses, the majority of parents who declined vaccination for their children were concerned with vaccine adverse events.

CONCLUSIONS Despite high community vaccination coverage, measles outbreaks can occur among clusters of intentionally undervaccinated children, at major cost to public health agencies, medical systems, and families. Rising rates of intentional undervaccination can undermine measles elimination.

The medical and public health community needs to really get going on this. The article ends by saying the researchers met parents with “real fears” about the risk of autism from vaccines. I’m sure their fears are real, but how on earth do you convince them otherwise?

Talk at USC Wednesday

Posted on March 23, 2010 by Anand Sarwate

In case you’re at USC or in the area, I’m giving a talk tomorrow there on some of the work I’ve been doing with Kamalika Chaudhuri (whose website seems to have moved) and Claire Monteleoni on privacy-preserving machine learning.

Learning from sensitive data – balancing accuracy and privacy

Wednesday, March 24th, 2010
2:00pm-3:00pm
EEB 248

The advent of electronic databases has made it possible to perform data mining and statistical analyses of populations with applications from public health to infrastructure planning. However, the analysis of individuals’ data, even for aggregate statistics, raises questions of privacy which in turn require formal mathematical analysis. A recent measure called differential privacy provides a rigorous statistical privacy guarantee to every individual in the database. We develop privacy-preserving support vector machines (SVMs) that give an improved tradeoff between misclassification error and the privacy level. Our techniques are an application of a more general method for ensuring privacy in convex optimization problems.

Joint work with Kamalika Chaudhuri (UCSD) and Claire Monteleoni (Columbia)

As I spent most of today making slides for a talk

Posted on March 22, 2010 by Anand Sarwate

This felt apropos.

I’m using LaTeX, not PowerPoint, but I don’t think Tufte makes these distinctions.

	Zonghong Liu on A story about Canvas
	anonymousskimmer on “The needs of the many,…
	Chanterelle Recipes… on Broiled shrimp with chanterell…
	kvarsh on ICML 2019 encouraged code subm…
	Pulkit Grover on gender inclusivity in communic…

S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

An Ergodic Walk

a process whose average over time converges to the true average

Author Archives: Anand Sarwate