Quote of the day: the ethical “success” of probability

I am reading Ian Hacking’s The Taming of Chance, which I picked up from the Seminary Co-op upon arriving here. They just had it on the shelf! The book, as he puts it, is a way to understand why probability has been “an incredible success story” in the realms of metaphysics, epistemology, logic, and ethics. By success he means probabilistic ideas have radically changed these areas. On the last point:

Ethics is in part the study of what to do. Probability cannot dictate values, but it now lies at the basis of all reasonable choice made by officials. No public decision, no risk analysis, no environmental impact, no military strategy can be conducted without decision theory couched in terms of probabilities. By covering opinion with a veneer of objectivity, we replace judgement by computation.

Allerton 2011

Despite Yury’s attempts to get me to “stop blogging,” here is my much-delayed recap of Allerton. Because of moving to TTI-Chicago right after Allerton and then almost immediately shuttling back to UCSD for the iDASH Privacy Workshop, things have been a bit delayed. I could only attend for two days, but I wanted to highlight a few interesting talks that I saw. More Allerton blogging was done by Michael Mitzenmacher (part 1, part 2) and a bit by Maxim Raginsky about his talk on causal calculus (since he blogged about it I don’t have to, ha!). The conference has gotten too big and the rooms are too small to hold the audience, so it probably is time to move the thing. We have similar issues at ITA, and the 2012 ITA Workshop is moving off campus next year (you heard it here first, folks!)

But here are some interesting talks I saw:

Continue reading

Linkage

I will post more about Allerton soon (I’m still on the road), but I wanted to clear out some old links before doing that. I’m starting my new gig at TTIC this week, and the last few weeks have been a whirlwind of travel and internetlessness, so blogging has been curtailed.

And a (not-so-recent) tour around the ArXiV — I haven’t had a chance to read these yet, but maybe once I am settled…

Banach spaces, super-reflexivity, and martingale type

I started reading Convex Games in Banach Spaces, by Sridharan and Tewari and then got interested in the work of G. Pisier on martingales in Banach spaces, which studies how probability and geometry interact. Some of the definitions differ a bit between these papers (the former does things in terms of martingale difference sequences).

A Banach space {\mathcal{B}} has Martingale-type {p} (or M-type {p}) if there is a constant {C} such that for any {\mathcal{B}}-valued martingale {Y_1, Y_2, \ldots} (with the convention {Y_0 = 0}), we have

\displaystyle  	\sup_{t} \mathop{\mathbb E}\left[ \left\| Y_t \right\| \right] \le C \left( \sum_{t=1}^{\infty} \mathop{\mathbb E}[ \left\| Y_t - Y_{t-1} \right\| ^p ] \right)^{1/p}.
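
As a quick sanity check (my aside, not from these papers): in a Hilbert space, martingale differences are orthogonal, so with {Y_0 = 0},

\displaystyle  	\mathop{\mathbb E}[ \| Y_T \|^2 ] = \sum_{t=1}^{T} \mathop{\mathbb E}[ \| Y_t - Y_{t-1} \|^2 ],

and since {\mathop{\mathbb E}[ \|Y_t\| ] \le ( \mathop{\mathbb E}[ \|Y_t\|^2 ] )^{1/2}}, every Hilbert space has M-type {2} with {C = 1}.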

A Banach space {\mathcal{B}} has Martingale-cotype {q} (or M-cotype {q}) if there is a constant {C} such that for any {\mathcal{B}}-valued martingale {Y_1, Y_2, \ldots}, we have

\displaystyle  	 \left( \sum_{t=1}^{\infty} \mathop{\mathbb E}[ \left\| Y_t - Y_{t-1} \right\|^q ] \right)^{1/q} 	 \le 	 C \sup_{t} \mathop{\mathbb E}\left[ \left\| Y_t \right\| \right].

So far these are statements about how martingales behave in these spaces, and you can see a little geometry in the {(\sum (\cdot)^p )^{1/p}} kind of stuff, but how does this relate to overall geometric properties of the space?

A Banach space {\mathcal{B}} is reflexive if the canonical embedding of {\mathcal{B}} into its double dual (the dual of its dual space) is an isomorphism. The space {\mathcal{B}} is reflexive if and only if its dual {\mathcal{B}^{\ast}} is reflexive. A space {\mathcal{C}} is finitely representable in {\mathcal{B}} if for any finite-dimensional subspace {E \subseteq \mathcal{C}} and any {\lambda > 1} there exists an isomorphism {\phi : E \rightarrow \phi(E) \subseteq \mathcal{B}} such that for all {x \in E},

\displaystyle  	\lambda^{-1} \|x\| \le \|\phi(x)\| \le \lambda \|x\|.

A Banach space {\mathcal{B}} is super-reflexive if no non-reflexive Banach space is finitely representable in {\mathcal{B}}. A result of James shows that a space is super-reflexive if and only if its dual is super-reflexive. Reflexivity means a space is “nice” and super-reflexivity is a kind of “uniform niceness.”

The norm for a Banach space is uniformly convex if for every {\epsilon > 0} there exists a {\delta > 0} such that for any two unit vectors {x} and {y},

\displaystyle  	\|x + y\| > 2 - \delta

implies

\displaystyle  	\|x - y\| < \epsilon.

For example, for {1 < p < \infty} the spaces {L_p} and {\ell_p} are uniformly convex. Another way to picture it is that if the midpoint of the chord between {x} and {y} is close to the surface of the unit ball, then the chord has to be short. Per Enflo showed that if a Banach space {\mathcal{B}} is super-reflexive, then it has an equivalent norm which is uniformly convex.

It turns out that the M-cotype is related to the degree of convexity of a space, namely how uniformly convex the norm is (and in turn connected to super-reflexivity). A space is {q}-uniformly convex if

\displaystyle  	\left\| \frac{x + y}{2} \right\|^q \le \frac{1}{2} \left( \|x\|^q + \|y\|^q \right) 		- C^{-q} \left\| \frac{x - y}{2} \right\|^q.
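
For example (again my aside): in a Hilbert space the parallelogram law gives exactly this with {q = 2} and {C = 1},

\displaystyle  	\left\| \frac{x + y}{2} \right\|^2 = \frac{1}{2} \left( \|x\|^2 + \|y\|^2 \right) 		- \left\| \frac{x - y}{2} \right\|^2,

so Hilbert spaces are {2}-uniformly convex.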

Gilles Pisier made the connection to M-types: a Banach space is super-reflexive if and only if it has M-type {p} for some {p > 1} (equivalently, M-cotype {q} for some {q < \infty}). Furthermore, a Banach space has M-cotype {q} if and only if it has an equivalent {q}-uniformly convex norm.

So the big picture is this: super-reflexivity is enough to guarantee a uniformly convex norm, and the M-cotype characterizes the degree of convexity of that norm.

References:

Sridharan, Kartik and Tewari, Ambuj, Convex Games in Banach Spaces, Proceedings of COLT, 2010.

Enflo, Per, Banach spaces which can be given an equivalent uniformly convex norm, Israel Journal of Mathematics 13 (1972), 281–288.

James, Robert C., Super-reflexive Banach spaces, Canadian Journal of Mathematics 24 (1972), 896–904.

Pisier, Gilles, Probabilistic Methods in the Geometry of Banach Spaces, in G. Letta and M. Pratelli, eds., Probability and Analysis, v. 1206 of Lecture Notes in Mathematics, pp. 167–241, Springer, 1986.

Pisier, Gilles, Martingales with values in uniformly convex spaces, Israel Journal of Mathematics 20 (1975), no. 3–4, 326–350.

Pisier, Gilles, Martingales in Banach Spaces (in connection with Type and Cotype), IHP Winter Course online lecture notes, 2011.

Lovaglia, A. R., Locally uniformly convex Banach spaces, Transactions of the American Mathematical Society 78 (1955), 225–238.

Readings

Shing-Tung Yau and Steve Nadis, The Shape of Inner Space — This book was about the Calabi conjecture, Calabi-Yau manifolds, string theory, and all that jazz. It’s supposed to be for a general/lay audience, but I found it rather daunting and often confusing. Perhaps I know just enough math to get confused, whereas other readers might gloss over things. I definitely would not recommend it to those without some serious mathematical background (like a few college classes). That being said, I found it pretty interesting, and now I know (kind of) what a Calabi-Yau space is.

Donald G. Saari, Decisions and Elections: Explaining the Unexpected — This sums up a large chunk of the analysis of social choice problems and voting systems done by Donald Saari. It’s a bit overwritten for my taste and veers between some mathematical formalism and a chatty form of argumentation. I don’t think I fit in the right “audience” for this book, which is part of the problem. It discusses Arrow’s Theorem and Sen’s Theorem via a bunch of examples and spends a fair bit of time on the “paradoxes” and perversities of different choice systems. The chattiness makes it feel less than systematic. Towards the end Saari puts on more of an advocate hat, arguing that symmetry (in a particular sense) is a desirable property of election systems and making a case for the Borda count. That, in a sense, is the least convincing part of the book. This might be a good present for a precocious high school student, since the math is not so complicated but there are a lot of ideas to chew on in there.

Hannu Nurmi, Voting Procedures under Uncertainty — This also fits into the “slightly math-y books for political scientists” genre, so I found it insufficiently math-y. It’s a survey of different models of uncertainty in voting procedures and a survey of the work in that area. As such, it discusses alternatives to “traditional” social choice theory including Euclidean models of preference and so on. There’s a little more “survey” and less “integrating the different perspectives” but that’s ok. I am not sure who should read it, but it did point out some new literatures of which I had previously been unaware.

Moez Draief and Laurent Massoulié, Epidemics and Rumors in Complex Networks — A nice and compact introduction to rumor-spreading processes, including branching processes, small-world graphs, SIS/SIR-type models, and concluding with some models for “viral marketing.” I really liked this book because it was concise and to the point, but others may find that it lacks some context and connections to literature with which they are familiar. It doesn’t feel like a tutorial in that respect, but it’s self-contained and great for someone who has seen some of the material before but not all of it.

John Mortimer, Rumpole and the Primrose Path — Reading Rumpole short stories is kind of like relaxing in a pair of old slippers. Enjoyable, but probably not his best work.

Readings

The Gangster We Are All Looking For (lê thi diem thúy) — This is a fragmented and short narrative of a young Vietnamese immigrant to the US and her time growing up in various neighborhoods in San Diego. It’s the KPBS One Book, One San Diego selection so there were 25 copies at the library. The little vignettes are fleeting but touching, but in a sense you don’t feel that the narrator is particularly introspective, at least not in a direct way. However, I think it was definitely worth reading, if for no other reason than to hear her unique perspective.

The Terrible Privacy of Maxwell Sim (Jonathan Coe) — A satirical novel which came recommended but which in the end I felt cheated by. Maxwell Sim embarks on a new job after a recent divorce and a few months off for depression, and ends up learning new things about himself and his family. He’s a bit of a loser, to be honest, but in the end you kind of feel for him as he muddles through emails, old letters, facebook, and the like. What is a big cheat is the ending, in which the author (!) appears. Blech.

Symmetry and Its Discontents (Sheridan Zabell) — A lovely collection of essays on the philosophy, history, and mathematics of symmetry assumptions in problems of induction. The last two chapters are especially good as they discuss a bit of the history and background of such things as Good-Turing estimators and exchangeable partition processes. I learned about this book a while ago from Susan Holmes at the AIM Workshop on estimating probability distributions.

Electronic Elections (R. Michael Alvarez and Thad E. Hall) — A short but dense book that makes the case for a “risk management” approach to assessing the value of electronic voting machines. Electronic voting machines have all sorts of benefits, including better accessibility for the disabled, no “hanging chads,” and so on. But they are also woefully insecure and hackable, as has been demonstrated time and again by computer security folks. Alvarez and Hall feel that the CS folks are being unfair and think (in a very nebulous way) that the benefits outweigh the risks. I found the data about voter confusion, error rates, and so on interesting, but I think the authors completely miss the point of the security community’s critique of electronic voting systems. Designing a shoddy electronic voting system is bad, regardless of the purported benefits.

Two takes on preferential attachment graphs

I recently read a paper from STOC 2011 by Doerr, Fouz, and Friedrich on rumor spreading in preferential attachment graphs: Social networks spread rumors in sublogarithmic time, and it cites a 2004 paper by Bollobás and Riordan from Combinatorica on The diameter of a scale-free random graph (i.e. a preferential attachment graph). The latter paper has a characterization of the graph growth process which is fun and geometric, so I thought it might make a good topic for a post.

Continue reading

The strong converse for the MAC, part 3

This post was written using Luca’s LaTeX2WP tool.

The topic of this (late) post is “wringing” techniques, which are a way of converting a statement like “{ P(x^n)} and { Q(x^n)} are close” into a statement about per-letter probabilities. Such a statement looks like “we can find some indices and values at those indices such that if we condition on those values, the marginals of { P} and { Q} are close.” I’m going to skip straight to the simpler of the two “Wringing Lemmas” in this paper and sketch its proof. Then I’ll go back and state the first one (without proof). There are some small typos in the original which I will try to correct here.

Lemma 1 Let { P} and { Q} be probability distributions on { \mathcal{X}^n} such that for a positive constant { c},

\displaystyle  	P(x^n) \le (1 + c) Q(x^n) \qquad \forall x^n \in \mathcal{X}^n 	 	\ \ \ \ \ (1)

then for any { 0 < \gamma < c} and {0 \le \epsilon < 1} there exist { t_1, t_2, \ldots, t_k \in \{1, 2, \ldots, n\}}, where

\displaystyle  	0 \le k \le c/\gamma

such that for some { \bar{x}_{t_1}, \bar{x}_{t_2}, \ldots, \bar{x}_{t_k}},

\displaystyle  	P(x_t | \bar{x}_{t_1}, \bar{x}_{t_2}, \ldots, \bar{x}_{t_k}) 		\le \max\{ (1 + \gamma) Q(x_t | \bar{x}_{t_1}, 			\bar{x}_{t_2}, \ldots, \bar{x}_{t_k}), \epsilon \}

for all { x_t \in \mathcal{X}} and all { t = 1, 2, \ldots, n}. Furthermore,

\displaystyle  	P(\bar{x}_{t_1}, \bar{x}_{t_2}, \ldots, \bar{x}_{t_k}) \ge \epsilon^{k}.

The proof works pretty mechanically. Fix a {\gamma} and {\epsilon}. Suppose that the conclusion doesn’t hold for {k = 0}. Then there is a {t_1} and {\bar{x}_{t_1}} such that

\displaystyle  	(1 + c) Q(\bar{x}_{t_1}) \ge P(\bar{x}_{t_1}) > \max( (1 + \gamma) Q(\bar{x}_{t_1}), \epsilon),

where the first bound is from the assumption (1) and the second holds because the conclusion fails. In particular, {P(\bar{x}_{t_1}) > \epsilon}. Now restricting (1) to sequences with {x_{t_1} = \bar{x}_{t_1}}, dividing by {P(\bar{x}_{t_1})}, and using the lower bound {P(\bar{x}_{t_1}) > (1 + \gamma) Q(\bar{x}_{t_1})}, we get for all {x^n \in \mathcal{X}^n},

\displaystyle  	P(x^n | \bar{x}_{t_1}) \le \frac{1 + c}{1 + \gamma} Q(x^n | \bar{x}_{t_1}) 	 	\ \ \ \ \ (2)

Now, either the conclusion is true or we use (2) in place of (1) to find a {t_2} and a {\bar{x}_{t_2}} such that

\displaystyle  	P(x^n | \bar{x}_{t_1},\bar{x}_{t_2}) \le \frac{1 + c}{(1 + \gamma)^2} Q(x^n | \bar{x}_{t_1},\bar{x}_{t_2}) 	 	\ \ \ \ \ (3)

We can keep going in this way, picking up another factor of {(1 + \gamma)} in the denominator each time, until after {k} steps we have {(1 + \gamma)^k}. This yields the condition on {k}.
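
To spell out the arithmetic behind the bound on {k} (my gloss, not in the paper): after {k} rounds the hypothesis has constant {(1+c)/(1+\gamma)^k}, and once this drops below {1} the conditional version of {P} is dominated by {Q} itself, so the conclusion holds trivially. By Bernoulli’s inequality,

\displaystyle  	(1 + \gamma)^k \ge 1 + k \gamma \ge 1 + c \qquad \mbox{when } k \ge c/\gamma,

so the process must terminate after at most {c/\gamma} steps.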

Rather than describe how to apply this to our coding problem, I want to contrast this rather simple statement to the other wringing lemma in the paper, which is a generalization of the lemma developed by Dueck. This one is more “information theoretic” in the sense that it says if the block mutual information is small, then we can find positions to condition on such that the per-letter mutual informations are small too.

Lemma 2 Let {X^n} and {Y^n} be {n}-tuples of random variables taking values in {\mathcal{X}^n} and {\mathcal{Y}^n}, where {\mathcal{X}} and {\mathcal{Y}} are discrete finite alphabets, and assume that

\displaystyle  	I(X^n; Y^n) \le \sigma.

Then for any { 0 < \delta < \sigma} there exist { t_1, t_2, \ldots, t_k \in \{1, 2, \ldots, n\}}, where

\displaystyle  	0 \le k \le 2 \sigma/\delta

such that for some {\{ (\bar{x}_{t_i}, \bar{y}_{t_i}) : i = 1, 2, \ldots, k\}},

\displaystyle  	I(X_t ; Y_t ~|~ (X_{t_i},Y_{t_i}) = (\bar{x}_{t_i}, \bar{y}_{t_i}), i = 1, 2, \ldots, k) \le \delta

for all { t = 1, 2, \ldots, n}. Furthermore,

\displaystyle  	\mathbb{P}((X_{t_i},Y_{t_i}) = (\bar{x}_{t_i}, \bar{y}_{t_i}), i = 1, 2, \ldots, k) \ge \left( \frac{\delta}{|\mathcal{X}||\mathcal{Y}| (2 \sigma - \delta)} \right)^{k}.

This lemma is proved in basically the same way as the previous one, but it’s a bit messier. Dueck’s original version was similar but gave a bound on the conditional mutual information, rather than this bound where we condition on particular values. This way of taking a small mutual information and getting small per-letter mutual informations can be useful outside of this particular problem, I imagine.

Next time I’ll show how to apply the first lemma to our codebooks in the converse for the MAC.

Readings

Robert Tavernor, Smoot’s Ear: The Measure of Humanity – This is an interesting, albeit dry, history of measurement in Europe, starting more or less from the Greeks and Romans and running up through the development of the metric system. It’s chock-full of interesting little facts and also highlights the real problems that arise when there is no standard, as well as the difficulty of trying to define one.

Naguib Mahfouz, Palace Walk – The first in Mahfouz’s masterpiece trilogy, this novel follows the very traditional family of an Egyptian merchant, who spends his time partying every night while terrorizing his family during the day. It’s set during the end of the British occupation at the end of WWI, and the protests against the British that start at the end of the novel seem eerily relevant today.

Nell Irvin Painter, The History of White People – This is a popular history of theories of race, beauty, and intelligence and how they became entwined with skin color, head shape, and other measurable quantities. It was an interesting read but felt a little incomplete somehow. Also, she uses the phrase “pride of place” too many times. It was distracting!

Vivek Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint – This slim book gives a concise, albeit very technical, introduction to the basic methods and results in stochastic approximation. It’s fairly mathematically challenging, but because it’s to-the-point, I found it easier going than the book by Kushner and Yin.

Sampling one distribution out of another

I am reading this nice paper by Harsha et al. on The Communication Complexity of Correlation, and some of their results rely on the following cute lemma (slightly reworded).

Let P and Q be two distributions on a finite set \mathcal{X} such that the KL-divergence D(P \| Q) < \infty. Then there exists a sampling procedure which, when given a sequence of samples x_1, x_2, \ldots drawn i.i.d. according to Q, will output (with probability 1) an index i^{\ast} such that the sample x_{i^{\ast}} is distributed according to the distribution P and the expected encoding length of the index i^{\ast} is at most

D(P \| Q) + 2 \log( D(P \| Q) + 1 ) + O(1)

where the expectation is taken over the sample sequence and the internal randomness of the procedure.

So what is the procedure? It’s a rejection sampler that does the following (a Python sketch follows the list). First set p_0(x) = 0 for all x \in \mathcal{X} and set p_0^{\ast} = 0 as well. Then for each i = 1, 2, \ldots do the following:

  1. \alpha_i(x) = \min\{ P(x) - p_{i-1}(x), \ (1 - p^{\ast}_{i-1}) Q(x) \}
  2. p_i(x) = p_{i-1}(x) + \alpha_i(x)
  3. p_i^{\ast} = \sum_{x} p_{i}(x)
  4. \beta_i = \alpha_i(x_i)/( (1 - p_{i-1}^{\ast}) Q(x_i) )
  5. Output i with probability \beta_i; otherwise continue to step i+1.
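
To make the bookkeeping concrete, here is a minimal Python sketch of the procedure. This is my own transcription, not code from the paper: the function name and the dict-based representation of P and Q are illustrative choices, and it assumes the support of P is contained in the support of Q.

    import random
    from itertools import count

    def greedy_rejection_sampler(P, Q, rng=None):
        """Sample from P using i.i.d. proposals from Q.
        P and Q are dicts mapping outcomes to probabilities, with
        supp(P) contained in supp(Q). Returns (i_star, x) where x
        is distributed according to P."""
        rng = rng or random.Random()
        support = list(Q)
        weights = [Q[x] for x in support]
        p = dict.fromkeys(support, 0.0)  # p_{i-1}(x): prob. of halting with x by step i-1
        p_star = 0.0                     # p_{i-1}^*: prob. of halting by step i-1
        for i in count(1):
            remaining = 1.0 - p_star     # prob. of reaching step i
            # alpha_i(x) = min{ P(x) - p_{i-1}(x), (1 - p_{i-1}^*) Q(x) },
            # clamped at 0 to guard against floating-point rounding
            alpha = {x: max(0.0, min(P.get(x, 0.0) - p[x], remaining * Q[x]))
                     for x in support}
            x_i = rng.choices(support, weights=weights)[0]  # draw x_i ~ Q
            # halt with prob. beta_i = alpha_i(x_i) / ((1 - p_{i-1}^*) Q(x_i)),
            # written multiplicatively to avoid dividing by a vanishing term
            if rng.random() * remaining * Q[x_i] < alpha[x_i]:
                return i, x_i
            p = {x: p[x] + alpha[x] for x in support}  # p_i(x)
            p_star = sum(p.values())                   # p_i^*

For instance, with P = {'a': 0.5, 'b': 0.5} and Q = {'a': 0.9, 'b': 0.1}, the empirical distribution of the returned samples over many runs should converge to P, while the returned indices stay small on average, consistent with the bound on \mathbb{E}[\log i^{\ast}] discussed below.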

Ok, so what is going on here? It turns out that \alpha_i(x) is tracking the probability that the procedure halts at i with x_i = x (this is not entirely clear at first). Thus p_i(x) is the probability that we halt at some time j \le i with sample x, and p_i^{\ast} is the probability that we halt by time i. Thus (1 - p_{i-1}^{\ast}) is the probability that the procedure gets to step i. Getting back to \alpha_i(x), we see that x_i = x with probability Q(x) and we halt with probability \beta_i, so indeed, \alpha_i(x) is the probability that we halt at time i with x_i = x.

What we want then is that P(x) = \sum_{i=1}^{\infty} \alpha_i(x). So how should we define \alpha_i(x)? It’s the minimum of two terms: in order for the halting probability \beta_i to be at most 1, the factor \alpha_i(x) can be at most (1 - p_{i-1}^{\ast}) Q(x). However, we don’t want to let p_{i}(x) = \sum_{j=1}^{i} \alpha_j(x) exceed the target probability P(x), so \alpha_i(x) must also be at most P(x) - p_{i-1}(x). The authors say that in this sense the procedure is greedy, since it tries to get p_i(x) as close to P(x) as it can.

In order to show that the sampling is correct, i.e. that the output distribution is P, we want to show that p_i(x) \to P(x). This follows by induction. To get a bound on the description length of the output index, they have to show that

\mathbb{E}[ \log i^{\ast} ] \le D(P \| Q) + O(1).

This is an interesting little fact on its own. It comes out because

i \le \frac{1}{1 - p_{i-1}^{\ast}} \cdot \frac{P(x)}{Q(x)} + 1

and then expanding out \mathbb{E}[ \log i^{\ast} ] = \sum_{x} \sum_{i} \alpha_i(x) \cdot \log i.
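
To see where that inequality comes from (my gloss): if the sampler halts at step i with sample x, then at every earlier step j < i the second term must have achieved the minimum, i.e. \alpha_j(x) = (1 - p_{j-1}^{\ast}) Q(x) (otherwise p_j(x) would already have reached P(x) and \alpha_i(x) would vanish). Since p_j^{\ast} is nondecreasing in j, summing over j < i gives

P(x) \ge p_{i-1}(x) = \sum_{j=1}^{i-1} \alpha_j(x) \ge (i-1) (1 - p_{i-1}^{\ast}) Q(x),

which rearranges to the bound above.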

One example application of this Lemma in the paper is the following problem of simulation. Suppose Alice and Bob share an unlimited amount of common randomness. Alice has some X drawn according to P(x) and wants Bob to generate a Y according to the distribution P(y | x). So ideally, (X,Y) have joint distribution P(x) P(y | x). They can both generate a common sequence y_1, y_2, \ldots of i.i.d. samples from the marginal distribution of Y. Then Alice runs the sampling procedure above with target distribution P(y | x) and proposal distribution P(y), and sends the index i^{\ast} to Bob, who chooses the corresponding y_{i^{\ast}}. How many bits does Alice have to send on average? It’s just

\mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) + 2 \log( D( P(y|x) \| P(y) ) + 1) + O(1) ],

which, after some love from Jensen’s inequality, turns out to be upper bounded by

I(X ; Y) + 2 \log( I(X; Y) + 1) + O(1).
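
To spell out the Jensen step (my gloss): the expected divergence is exactly the mutual information,

\mathbb{E}_{P(x)}[ D( P(y|x) \| P(y) ) ] = I(X; Y),

and since \log is concave, Jensen’s inequality gives

\mathbb{E}_{P(x)}[ 2 \log( D( P(y|x) \| P(y) ) + 1 ) ] \le 2 \log( I(X; Y) + 1 ).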

Pretty cute, eh?