Via Serdar Yüksel and the IT Society, the 2014 IEEE North American School on Information Theory will be held at the Fields Institute in Toronto this June. The lecturers for this school are:

Some of my office furniture is on backorder, like the standing desk unit and my actual desktop, but in the meantime I have found a use for the hardcopy IEEE Transactions that I’ve been carting around with me from job to job:

A way to use IEEE Transactions

Zoltán Szabó of the Gatsby Unit forwarded me a link to his Information Theoretical Estimators Toolbox, which has MATLAB-friendly estimators for standard information-theoretic quantities. It might be of interest to readers of the blog.

I took a look at this interesting paper by Sriperumbudur et al., On the empirical estimation of integral probability metrics (Electronic Journal of Statistics Vol. 6 (2012) pp.1550–1599). The goal of the paper is to estimate a distance or divergence between two distributions $P$ and $Q$ based on samples from each distribution. This sounds pretty vague at first… what kind of distributions? How many samples? This paper looks at integral probability metrics, which have the form

$\gamma(P,Q) = \sup_{f \in \mathcal{F}} \left| \int_{S} f dP - \int_{S} f dQ \right|$

where $S$ is a measurable space on which $P$ and $Q$ are defined, and $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. This class doesn’t contain Csiszár $\phi$-divergences (also known as Ali-Silvey distances), but does contain the total variational distance.

Different choices of the function class give rise to different measures of difference used in so-called two-sample tests, such as the Kolmogorov-Smirnov test. The challenge in practically using these tests is that it’s hard to get bounds on how fast an estimator of $\gamma(P,Q)$ converges if we have to estimate it from samples of $P$ and $Q$. The main result of the paper is to provide estimators with consistency and convergence guarantees. In particular, they estimators are based on either linear programming or (in the case of kernel tests) in closed form.

The second section of the paper connects tests based on IPMs with the risk associated to classification rules for separating $P$ and $Q$ when the separation rule is restricted to come from the function class $\mathcal{F}$ associated to the rule. This is a nice interpretation of these two-sample tests — they are actually doing similar things for restricted classes of classifiers/estimators.

Getting back to KL divergence and non-IPM measures, since total variation gives a lower bound on the KL divergence, they also provide lower bounds on the total variation distance in terms of other IPM metrics. This is important since the total variation distance can’t be estimated itself in a strongly consistent way. This could be useful for algorithms which need to estimate the total variation distance for continuous data. In general, estimating distances between multivariate continuous distributions can become a bit of a mess when you have to use real data — doing a plug-in estimate using, e.g. a kernel density estimator is not always the best way to go, and instead attacking the distance you want to measure directly could yield better results.

I’m at EPFL right now on my way back from a trip to India. My last work-related stop there was the Tata Institute of Fundamental Research, where I am working with Vinod Prabhakaran on some follow-up work to our ISIT paper this year on correlated sampling. TIFR has a lovely campus in Navy Nagar, right next to the ocean in the south of Mumbai. It’s a short walk across the lawn to the ocean:

The beach at TIFR, looking north to the haze-obscured skyline of Mumbai

The grounds themselves are pretty rigorously landscaped and manicured:

The main building seen across the lawn in front of the ocean.

Earlier in my trip I visited IIT-Madras, hosted by Andrew Thangaraj and Radha Krishna Ganti and IIT-Bombay, where I was working with Bikash Dey on extensions to some of our AVCs-with-delay work. It’s been a great trip and it’s nice to visit so many different schools. The challenges facing Indian higher education are very different than those in the US, and it’s a good change of perspective from my usual US-centric view of things. Maybe I’ll write more on that later.

But for now, I wanted to get back to some actual technical stuff, so I thought I’d mention something that came up in the course of my discussions with Vinod today. For a joint distribution $P(X,Y)$, Wyner defined the common information in the the following way:

$\mathsf{CI}(X;Y) = \min_{X--U--Y} I(XY ; U)$

One fun fact about the common information is that it’s larger than the mutual information, so $I(X;Y) \le \mathsf{CI}(X;Y)$. One natural question that popped up as part of a calculation we had to do was whether for doubly-symmetric binary sources we could have a bound like

$\mathsf{CI}(X;Y) \le \alpha I(X;Y)$

for some “nice” $\alpha$. In particular, it would have been nice for us if the inequality held with $\alpha = 2$ but that turns out to not be the case.

Suppose $(X,Y)$ and are a doubly-symmetric binary source, where $Y$ is formed from $X \sim \mathsf{Bern}(1/2)$ by passing it through a binary symmetric channel (BSC) with crossover probability $\alpha$. Clearly the mutual information is $I(X;Y) = 1 - h(\alpha)$. For the common information, we turn to Wyner’s paper, which says $\mathsf{CI}(X;Y) = 1 + h(\alpha) - 2 h( \frac{1}{2}( 1 - \sqrt{1 - 2 \alpha} ) )$, which is a bit of a weird expression. Plotting the two for $\alpha$ between 0 and 1/2 we get:

Mutual and Common Information for a DSBS

If we plot the ratio $\mathsf{CI}(X;Y)/I(X;Y)$ versus $\alpha$, we get the following:

Ratio of Common to Mutual Information for a DSBS

The ratio blows up! So not only is Common Information larger than mutual information, it can be an arbitrarily large factor larger than the mutual information.

It might seem like this is bad news for us, but it just made the problem we’re working on more interesting. If we get any results on that I might be less engimatic and blog about it.

I am traveling all over India at the moment so I’m not really able to write contentful posts. Here are even more links instead, sigh. Maybe later I’ll talk about log-Sobolev inequalities so I can be cool like Max.

Speaking of Max, he posted this hilarious bad lip reading version of Game of Thrones. Probably NSFW. I don’t even like the series but it’s pretty funny.

For those who are fans of Rejected, Don Hertzfeldt’s new film is available on Vimeo.

Those who were at Berkeley may remember seeing Ed Reed perform at the Cheeseboard. His album (which I helped fund via indiegogo, was named a Downbeat Editors’ Pick. It’s a great album.

In light of the Snowden leaks, some doubt has been cast on NIST’s crypto standards.

I’m super late to this, but I endorse Andrew’s endorsement of Sergio‘s interview with Robert Fano in the IT Newsletter. Here’s just the article, if you want that.

I’ll round out the end of my ISIT blogging with very brief takes on a few more papers. I took it pretty casually this year in terms of note taking, and while I attended many more talks, my notes for most of them consist of a title and a star next to the ones where I want to look at the paper more closely. That’s probably closer to how most people attend conferences, only they probably use the proceedings book. I actually ended up shredding the large book of abstracts to use as bedding for my vermicompost (I figured they might appreciate eating a little Turkish paper for a change of diet).

On Connectivity Thresholds in Superposition of Random Key Graphs on Random Geometric Graphs
B Santhana Krishnan (Indian Institute of Technology, Bombay, India); Ayalvadi Ganesh (University of Bristol, United Kingdom); D. Manjunath (IIT Bombay, India)
This looked at a model where you have a random geometric graph (RGG) together with a uniformly chosen random subset $S_i$ of $\{ 1, 2, \ldots, P_n\}$ of size $K_n$ at each node. The subset is the set of keys available at each node; two nodes can talk (securely) if they share a key in common. We keep the edge in the RGG is if the link can be secured. The question is whether the secure-link graph is connected. It turns out that the important scaling is in terms of $r_n^2 K_n^2/P_n$, where $r_n$ is the connectivity radius of the RGG. This sort of makes sense, as the threshold is more or less $\Theta(\log n/n)$, so the keys provide a kind of discount factor on effective radius needed for connectivity — if the number of keys per node is small then you need a larger radius to compensate.

Secure Network Coding for Distributed Secret Sharing with Low Communication Cost
Nihar B Shah (University of California at Berkeley, USA); K. v. Rashmi (University of California at Berkeley, USA); Kannan Ramchandran (University of California at Berkeley, USA)
This paper was on secret sharing — a dealer wants to distribute $n$ shares of a secret such that any $k$ of them can be used to reconstruct the secret but $k-1$ or fewer cannot. The idea here is that the dealer has to distribute these shares over the network, which means that if a receiver is not connected directly to the dealer then the share will be passed insecurely through another node. Existing approaches based on pairwise agreement protocols are communication intensive. The idea here is use ideas from network coding to share masked versions of shares so that intermediate nodes will not get valid shares from others. To do this the graph needs to satisfy a particular condition ($k$-propagating), which is defined in the paper. A neat take on the problem, and worth looking at if you’re interested in that sort of thing.

Conditional Equivalence of Random Systems and Indistinguishability Proofs
Ueli Maurer (ETH Zurich, Switzerland)
This was scheduled to be in the same session as my paper with Vinod, but was moved to an earlier session. Maurer’s “programme” as it were, is to think about security via three kinds of systems — real systems with real protocols and pseudorandomness, idealized systems with real protocols but real randomness, and perfect systems which just exist on paper. The first two are trivially indistinguishable from a computational perspective, and the goal is to show that the last two are information-theoretically indistinguishable. This conceptual framework is actually useful for me to separate out the CS and IT sides of the security design question. This paper tried to set up a framework in which there is a distinguisher D which tries to make queries to two systems and based on the answers has to decide if they are different or not. I think if you’re interested in sort of a systems-theoretic take on security you should take a look at this.

Tight Bounds for Universal Compression of Large Alphabets
Jayadev Acharya (University of California, San Diego, USA); Hirakendu Das (University of California San Diego, USA); Ashkan Jafarpour (UCSD, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
The main contribution of this paper was to derive bounds on compression of patterns of sequences over unknown/large alphabets. The main result is that the worst case pattern redundancy for i.i.d. distributions is basically $n^{1/3}$ where $n$ is the blocklength. The main result is a new upper bound which uses some tricks like sampling a random number of points, where the number of samples is Poisson distributed, and a partition of the set of distributions induced by Poisson sampling.

To Surprise and Inform
Lav R. Varshney (IBM Thomas J. Watson Research Center, USA)
Lav talked about communication over a channel where the goal is to communicate subject to a constraint on the Bayesian surprise $s(x) = D( p(Y|x) \| P(Y) )$ where $X$ and $Y$ are the input and output of the channel. He gets a single-letter expression for the capacity under a bound on the max surprise and gives an example for which the same distribuion maximizes mutual information and achieves the minimax surprise. The flip side is to ask for capacity when each output should be surprising (or “attention seeking”). He gets a single letter capacity here as well, but the structure of the solution seems to be a bit more complicated.

Logarithmic Sobolev inequalities and strong data processing theorems for discrete channels
Maxim Raginsky (University of Illinois at Urbana-Champaign, USA)
Max talked about how the strong data processing inequality (DPI) is basically a log-Sobolev inequality (LSI) that is used in measure concentration. The strong DPI says that
$D(QW \| PW) \le \alpha D(Q \| P)$
for some $\alpha < 1$, so the idea is to get bounds on
$\delta^*(P,W) = \sup_{Q} \frac{D(QW \| PW)}{D(Q \| P)}$.
What he does is construct a hierarchy of LSIs in which the strong DPI fits and then gets bounds on this ratio in terms of best constants for LSIs. The details are a bit hairy, and besides, Max has his own blog so he can write more about it if he wants…

Building Consensus via Iterative Voting
Farzad Farnoud (University of Illinois, Urbana-Champaign, USA); Eitan Yaakobi (Caltech, USA); Behrouz Touri (University of Illinois Urbana-Champaign, USA); Olgica Milenkovic (University of Illinois, USA); Jehoshua Bruck (California Institute of Technology, USA)
This paper was about rank aggregation, or how to take a bunch of votes expressed as permutations/rankings of options to produce a final option. The model is one in which people iteratively change their ranking based on the current ranking. For example, one could construct the pairwise comparison graph (a la Condorcet) and then have people change their rankings when they disagree with the majority on an edge. They show conditions under which this process converges (the graph should not have a cycle) and show that if there is a Condorcet winner, then after this process everyone will rank the Condorcet winner first. They also look at a Borda count version of this problem but to my eye that just looked like an average consensus method, but it was at the end of the talk so I might have missed something.

Information-Theoretic Study of Voting Systems
Eitan Yaakobi (Caltech, USA); Michael Langberg (Open University of Israel, Israel); Jehoshua Bruck (California Institute of Technology, USA)
Eitan gave this talk and the preceding talk. This one was about looking at voting through the lens of coding theory. The main issue is this — what sets of votes or distribution of vote profiles will lead to a Condorcet winner? Given a set of votes, one could look at the fraction of candidates who rank candidate j in the i-th position and then try to compute entropies of the resulting distributions. The idea is somehow to characterize the existence or lack of a Condorcet winner in terms of distances (Kendall tau) and these entropy measures. This is different than looking at probability distributions on permutations and asking about the probability of there existing a Condorcet cycle.

Brute force searching, the typical set and Guesswork
Mark Chirstiansen (National University of Ireland Maynooth, Ireland); Ken R Duffy (National University of Ireland Maynooth, Ireland); Flávio du Pin Calmon (Massachusetts Institute of Technology, USA); Muriel Médard (MIT, USA)
Suppose an item $X$ is chosen $\sim P$ from a list and we want to guess the choice that is made. We’re only allowed to ask questions of the form “is the item $Y$?” Suppose now that the list is a list of codewords of blocklength $k$ drawn i.i.d. according to $Q$. This paper looks at the number of guesses one needs if $P$ is uniform on the typical set according to $Q$ versus when $P$ is distributed according the distribution $Q^k$ conditioned on $X$ being in the typical set. The non-uniformity of the latter turns out to make the guessing problem a lot easier.

Rumor Source Detection under Probabilistic Sampling
Nikhil Karamchandani (University of California Los Angeles, USA); Massimo Franceschetti (University of California at San Diego, USA)
This paper looked at an SI model of infection on a graph — nodes are either Susceptible (S) or Infected (I), and there is a probability of transitioning from S to I based on your neighbors’ states. There in exponential waiting time $\tau_{ij}$ for the $i$ to infect $j$ if $i$ is infected. The idea is that the rumor starts somewhere and infects a bunch of people and then you get to observe/measure the network. You want to find the source. This was studied by Zaman and Shah under the assumption of perfect observation of all nodes. This work looked at the case where nodes randomly report their infection state, so you only get an incomplete picture of the infection state. They characterize the effect of the reporting probability on the excess error and show that for certain tree graphs, incomplete reporting is as good as full reporting.

I added the slides from the ISIT plenaries to the ITSOC website. We’re still not sure when the video will arrive — figuring out media storage is one of the goals of the online committee (as reported to me by Matthieu.

Here are my much-belated post-ISIT notes. I didn’t do as good a job of taking notes this year, so my points may be a bit cursory. Also, the offer for guest posts is still open! On a related note the slides from the plenary lectures are now available on Dropbox, and are also linked to from the ISIT website.

From compression to compressed sensing
Shirin Jalali (New York University, USA); Arian Maleki (Rice University, USA)
The title says it, mostly. Both data compression and compressed sensing use special structure in the signal to achieve a reduction in storage, but while all signals can be compressed (in a sense), not all signals can be compressively sensed. Can one get a characterization (with an algorithm) that can take a lossy source code/compression method, and use it to recover a signal via compressed sensing? They propose an algorithm called compressible signal pursuit to do that. The full version of the paper is on ArXiV.

Dynamic Joint Source-Channel Coding with Feedback
Tara Javidi (UCSD, USA); Andrea Goldsmith (Stanford University, USA)
This is a JSSC problem with a Markov source, which can be used to model a large range of problems, including some sequential search and learning problems (hence the importance of feedback). The main idea is to map the problem in to a partially-observable Markov decision problem (POMDP) and exploit the structure of the resulting dynamic program. They get some structural properties of the solution (e.g. what are the sufficient statistics), but there are a lot of interesting further questions to investigate. I usually have a hard time seeing the difference between finite and infinite horizon formulations, but here the difference was somehow easier for me to understand — in the infinite horizon case, however, the solution is somewhat difficult to compute.

Unsupervised Learning and Universal Communication
Vinith Misra (Stanford University, USA); Tsachy Weissman (Stanford University, USA)
This paper was about universal decoding, sort of. THe idea is that the decoder doesn’t know the codebook but it knows the encoder is using a random block code. However, it doesn’t know the rate, even. The question is really what can one say in this setting? For example, symmetry dictates that the actual message label will be impossible to determine, so the error criterion has to be adjusted accordingly. The decoding strategy that they propose is a partition of the output space (or “clustering”) followed by a labeling. They claim this is a model for clustering through an information theoretic lens, but since the number of clusters is exponential in the dimension of the space, I think that it’s perhaps more of a special case of clustering. A key concept in their development is something they call the minimum partition information, which takes the place of the maximum mutual information (MMI) used in universal decoding (c.f. Csiszár and Körner).

This paper was on graph-based codes where the encoder makes errors, but the channel is ideal and the decoder makes no errors. That is, given a generator matrix $G$ for a code, the encoder wiring could be messed up and bits could be flipped or erased when parities are being computed. The resulting error model can’t just be folded into the channel. Furthermore, a small amount of error in the encoder (in just the right place) could be catastrophic. They focus just on edge erasures in this problem and derive a new distance metric between codewords that helps them characterize the maximum number of erasures that an encoder can tolerate. They also look at a random erasure model.