More like an old paper… this (finally) a journal version of some older work that we did on analyzing Bayesian nonparametric estimators form an information-theoretic (redundancy) perspective.

Exchangeable random partition processes are the basis for Bayesian approaches to statistical inference in large alphabet settings. On the other hand, the notion of the pattern of a sequence provides an information-theoretic framework for data compression in large alphabet scenarios. Because data compression and parameter estimation are intimately related, we study the redundancy of Bayes estimators coming from Poisson-Dirichlet priors (or “Chinese restaurant processes”) and the Pitman-Yor prior. This provides an understanding of these estimators in the setting of unknown discrete alphabets from the perspective of universal compression. In particular, we identify relations between alphabet sizes and sample sizes where the redundancy is small, thereby characterizing useful regimes for these estimators.

In the large alphabet setting, one thing we might be interested in is sequential prediction: I observe a sequence of butterfly species and want to predict whether the next butterfly I collect will be new or one that I have seen before. One simple way to do this prediction is to put a prior on the set of all distributions on infinite supports and do inference on that model given the data. This corresponds to the so-called Chinese Restaurant Process (CRP) approach to the problem. The information-theoretic view is that sequential prediction is equivalent to compression: the estimator is assigning a probability $q(x^n)$ to the sequence $x^n$ seen so far. An estimator is good if for any distribution $p$, if $x^n$ is drawn i.i.d. according to $p$, then the divergence between $p(x^n)$ and $q(x^n)$ is “small.” The goal of this work is to understand when CRP estimators are good in this sense.

This sort of falls in with the “frequentist analysis of Bayesian procedures” thing which some people work on.

While trying to show a student a generic example of a paper’s structure, I came across this gem:

A sample from the IT Transactions of 1992

I feel like I am reading a MacWrite document while wearing a flannel shirt.

Some shorter takes on these papers, some of which I should read in more detail later. I figure I’ll use the blog for some quick notes and to see if any readers have any comments/ideas about these:

Differentially Private Convex Optimization with Piecewise Affine Objectives (Shuo Han, Ufuk Topcu, George J. Pappas) — arXiv:1403.6135 [math.OC]. The idea here is to look at minimizing functions of the form
$f(x) = \max_{i = 1,2, \ldots, m} \{ a_i^{\top} x + b_i \}$
subject to $x$ belonging to some convex polytope $\mathcal{P}$. This is a bit different than the kind of convex programs I’ve been looking at (which are more ERM-like). Such programs occur often in resource allocation problems. Here the private information of users are the offsets $b_i$. They propose a number of methods for generating differentially private approximations to this problem. Analyzing the sensitivity of this optimization is tricky, so they use an upper bound based on the diameter of the feasible set $\mathcal{P}$ to find an appropriate noise variance. The exponential mechanism also gives a feasible mechanism, although the exact dependence of the suboptimality gap on $\epsilon$ is unclear. They also propose a noisy subgradient method where, instead of using SGD, they alter the sampling distribution using the exponential mechanism to choose a gradient step. Some preliminary experiments are also given (although none exploring the dependence on $\epsilon$, which would also be very interesting)!

Assisted Common Information with an Application to Secure Two-Party Sampling (Vinod M. Prabhakaran, Manoj M. Prabhakaran) — arXiv:1206.1282 [cs.IT]. This is the final version of the journal version of a few conference papers that Vinod and Manoj have done on an interesting variant of the Gács-Körner problem. The motivation is from secure multiparty computation — the problem also touches on some work Vinod and I started but is sadly languishing due to the utter overwhelmingness of starting a new job. Hopefully I can get back to it this summer.

Analysis of Distributed Stochastic Dual Coordinate Ascent (Tianbao Yang, Shenghuo Zhu, Rong Jin, Yuanqing Lin) — arXiv:1312.1031 [cs.DC]. The title pretty much sums it up. I’m interested in looking a bit more at the analysis method, since I had a similar algorithm bouncing around my head that I would like to analyze. The main idea is also update the primal variables to achieve a speedup/use a larger step size.

Convergence of Stochastic Proximal Gradient Algorithm (Lorenzo Rosasco, Silvia Villa, Bang Công Vũ) — arXiv:1403.5074 [math.OC]. This is a similar setup as my last post, with a convex objective that has a smooth and non-smooth component. They show convergence in expectation and almost surely. The key here is that they show convergence in an infinite-dimentionsal Hilbert space instead of, say, $\mathbb{R}^d$.

There’s this nice calculation in a paper of Shannon’s on optimal codes for Gaussian channels which essentially provides a “back of the envelope” way to understand how noise that is correlated with the signal can affect the capacity. I used this as a geometric intuition in my information theory class this semester, but when I described it to other folks I know in the field, they said they hadn’t really thought of capacity in that way. Perhaps it’s all of the AVC-ing I did in grad school.

Suppose I want to communicate over an AWGN channel

$Y = X + Z$

where $Z \sim \mathcal{N}(0,N)$ and $X$ satisfies a power constraint $P$. The lazy calculation goes like this. For any particular message $m$, the codeword $X^n(m)$ is going to be i.i.d. $\mathcal{N}(0,P)$, so with high probability it has length $\sqrt{n P}$. The noise is independent and $\mathbb{E}[ X Z ] = 0$ so $\langle X^n, Z^n \rangle \approx 0$ with high probability, so $Z^n$ is more-or-less orthogonal to $X^n(m)$ with high probability and it has length $\sqrt{nN}$ with high probability. So we have the following right triangle:

Geometric picture of the AWGN channel

Looking at the figure, we can calculate $\sin \theta$ using basic trigonometry:

$\sin \theta = \sqrt{ \frac{N}{N + P} }$,

so

$\log \frac{1}{\sin \theta} = \frac{1}{2} \log \left( 1 + \frac{P}{N} \right)$,

which is the AWGN channel capacity.

We can do the same thing for rate-distortion (I learned this from Mukul Agarwal and Anant Sahai when they were working on their paper with Sanjoy Mitter). There we have Gaussian source $X^n$ with variance $\sigma^2$, distortion $D$ and a quantization vector $\hat{X}^n$. But now we have a different right triangle:

Geometry of the rate-distortion problem

Here the distortion is the “noise” but it’s dependent on the source $X^n$. The “test channel” view says that the quantization $\hat{X}^n$ is corrupted by independent (approximately orthogonal) noise to form the original source $X^n$. Again, basic trigonometry shows us

$\log \frac{1}{\sin \theta} = \frac{1}{2} \log \left( \frac{\sigma^2}{D} \right)$.

Turning back to channel coding, what if we have some intermediate picture, where the noise slightly negatively correlated with the signal, so $\mathbb{E}[ X Z ] = - \rho$? Then the cosine of the angle between $X$ and $Z$ in the picture is $\frac{\rho}{\sqrt{NP}}$ and we have a general triangle like this:

Geometry for channels with correlated noise

Where we’ve calculated the length of $Y^n$ using the law of cosines:

$\|Y^n\|^2 = n (P + N) - 2 n \sqrt{ N P } \cdot \frac{\rho}{\sqrt{NP}}$.

So now we just need to calculate $\theta$ again. The cosine is easy to find:

$\cos \theta = \frac{ P + N - 2 \rho + P - N }{ 2 \sqrt{ P (P + N - 2 \rho) } } = \frac{P - \rho}{ \sqrt{P (P + N - 2 \rho) } }$.

Then solving for the sine:

$\sin^2 \theta = 1 - \frac{ (P - \rho)^2 }{ P( P + N - 2 \rho) }$

and applying our formula, for $\rho < \sqrt{PN}$,

$\log \frac{1}{\sin \theta} = \frac{1}{2} \log\left( \frac{ P ( P + N - 2 \rho) }{ P (P + N - 2 \rho) - (P - \rho)^2 } \right) = \frac{1}{2} \log\left( \frac{ P (P + N - 2 \rho) }{ PN - \rho^2 } \right)$

If we plug in $\latex \rho = 0$ we get back the AWGN capacity and if we plug in $\rho = N$ we get the rate distortion function, but this formula gives the capacity for a range of correlated noise channels.

I like this geometric interpretation because it's easy to work with and I get a lot of intuition out of it, but your mileage may vary.

I saw a paper on ArXiV yesterday called Kalman meets Shannon, which got me thinking: in how many papers has someone met Shannon, anyway? Krish blogged about this a few years ago, but since then Shannon has managed to meet some more people. I plugged “meets Shannon” into Google Scholar, and out popped:

Sometimes people are meeting Shannon, and sometimes he is meeting them, but each meeting produces at least one paper.

Via Serdar Yüksel and the IT Society, the 2014 IEEE North American School on Information Theory will be held at the Fields Institute in Toronto this June. The lecturers for this school are:

Some of my office furniture is on backorder, like the standing desk unit and my actual desktop, but in the meantime I have found a use for the hardcopy IEEE Transactions that I’ve been carting around with me from job to job:

A way to use IEEE Transactions

Zoltán Szabó of the Gatsby Unit forwarded me a link to his Information Theoretical Estimators Toolbox, which has MATLAB-friendly estimators for standard information-theoretic quantities. It might be of interest to readers of the blog.

I took a look at this interesting paper by Sriperumbudur et al., On the empirical estimation of integral probability metrics (Electronic Journal of Statistics Vol. 6 (2012) pp.1550–1599). The goal of the paper is to estimate a distance or divergence between two distributions $P$ and $Q$ based on samples from each distribution. This sounds pretty vague at first… what kind of distributions? How many samples? This paper looks at integral probability metrics, which have the form

$\gamma(P,Q) = \sup_{f \in \mathcal{F}} \left| \int_{S} f dP - \int_{S} f dQ \right|$

where $S$ is a measurable space on which $P$ and $Q$ are defined, and $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. This class doesn’t contain Csiszár $\phi$-divergences (also known as Ali-Silvey distances), but does contain the total variational distance.

Different choices of the function class give rise to different measures of difference used in so-called two-sample tests, such as the Kolmogorov-Smirnov test. The challenge in practically using these tests is that it’s hard to get bounds on how fast an estimator of $\gamma(P,Q)$ converges if we have to estimate it from samples of $P$ and $Q$. The main result of the paper is to provide estimators with consistency and convergence guarantees. In particular, they estimators are based on either linear programming or (in the case of kernel tests) in closed form.

The second section of the paper connects tests based on IPMs with the risk associated to classification rules for separating $P$ and $Q$ when the separation rule is restricted to come from the function class $\mathcal{F}$ associated to the rule. This is a nice interpretation of these two-sample tests — they are actually doing similar things for restricted classes of classifiers/estimators.

Getting back to KL divergence and non-IPM measures, since total variation gives a lower bound on the KL divergence, they also provide lower bounds on the total variation distance in terms of other IPM metrics. This is important since the total variation distance can’t be estimated itself in a strongly consistent way. This could be useful for algorithms which need to estimate the total variation distance for continuous data. In general, estimating distances between multivariate continuous distributions can become a bit of a mess when you have to use real data — doing a plug-in estimate using, e.g. a kernel density estimator is not always the best way to go, and instead attacking the distance you want to measure directly could yield better results.

I’m at EPFL right now on my way back from a trip to India. My last work-related stop there was the Tata Institute of Fundamental Research, where I am working with Vinod Prabhakaran on some follow-up work to our ISIT paper this year on correlated sampling. TIFR has a lovely campus in Navy Nagar, right next to the ocean in the south of Mumbai. It’s a short walk across the lawn to the ocean:

The beach at TIFR, looking north to the haze-obscured skyline of Mumbai

The grounds themselves are pretty rigorously landscaped and manicured:

The main building seen across the lawn in front of the ocean.

Earlier in my trip I visited IIT-Madras, hosted by Andrew Thangaraj and Radha Krishna Ganti and IIT-Bombay, where I was working with Bikash Dey on extensions to some of our AVCs-with-delay work. It’s been a great trip and it’s nice to visit so many different schools. The challenges facing Indian higher education are very different than those in the US, and it’s a good change of perspective from my usual US-centric view of things. Maybe I’ll write more on that later.

But for now, I wanted to get back to some actual technical stuff, so I thought I’d mention something that came up in the course of my discussions with Vinod today. For a joint distribution $P(X,Y)$, Wyner defined the common information in the the following way:

$\mathsf{CI}(X;Y) = \min_{X--U--Y} I(XY ; U)$

One fun fact about the common information is that it’s larger than the mutual information, so $I(X;Y) \le \mathsf{CI}(X;Y)$. One natural question that popped up as part of a calculation we had to do was whether for doubly-symmetric binary sources we could have a bound like

$\mathsf{CI}(X;Y) \le \alpha I(X;Y)$

for some “nice” $\alpha$. In particular, it would have been nice for us if the inequality held with $\alpha = 2$ but that turns out to not be the case.

Suppose $(X,Y)$ and are a doubly-symmetric binary source, where $Y$ is formed from $X \sim \mathsf{Bern}(1/2)$ by passing it through a binary symmetric channel (BSC) with crossover probability $\alpha$. Clearly the mutual information is $I(X;Y) = 1 - h(\alpha)$. For the common information, we turn to Wyner’s paper, which says $\mathsf{CI}(X;Y) = 1 + h(\alpha) - 2 h( \frac{1}{2}( 1 - \sqrt{1 - 2 \alpha} ) )$, which is a bit of a weird expression. Plotting the two for $\alpha$ between 0 and 1/2 we get:

Mutual and Common Information for a DSBS

If we plot the ratio $\mathsf{CI}(X;Y)/I(X;Y)$ versus $\alpha$, we get the following:

Ratio of Common to Mutual Information for a DSBS

The ratio blows up! So not only is Common Information larger than mutual information, it can be an arbitrarily large factor larger than the mutual information.

It might seem like this is bad news for us, but it just made the problem we’re working on more interesting. If we get any results on that I might be less engimatic and blog about it.