# CFP: 2015 Information Theory Workshop (ITW), Jeju Island

I am on the TPC for ITW 2015 in Jeju Island, South Korea.

The 2015 IEEE Information Theory Workshop will take place in Jeju Island, Korea, from October 11 to October 15, 2015. Jeju Island is the largest island in Korea and is located in the Pacific Ocean just off the south-western tip of the Korean peninsula. Jeju Island is a volcanic island with a mountainous terrain, a dramatic rugged coastline and spectacular watershed courses. The Island has a unique culture as well as natural beauty. It is a living folk village, with approximately 540,000 people. As a result of its isolated location and romantic tropical image, Jeju Island has become a favorite retreat with honeymooners and tourists. The tour programs of the conference will also provide participants with the opportunity to feel and enjoy some of the island’s fascinating attractions.

Special topics of emphasis include:

• Big data
• Coding theory
• Communication theory
• Computational biology
• Interactive communication
• Machine learning
• Network information theory
• Privacy and security
• Signal processing

# ISIT Deadline Extended to Monday

Apparently not everyone got this email, so here it is. I promise this blog will not become PSA-central.

Dear ISIT-2015-Submission Reviewers:

In an effort to ensure that each paper has an appropriate number of reviews, the deadline for the submission of all reviews has been extended to March 2nd. If you have not already done so, please submit your review by March 2nd as we are working to a very tight deadline.

(a) all submissions are eligible to be considered for presentation in a semi-plenary session — Please ensure that your review provides an answer to Question 11
(b) in the case of a submission that is eligible for the 2015 IEEE Jack Keil Wolf ISIT Student Paper Award, the evaluation form contains a box at the top containing the text:
Notice: This paper is to be considered for the 2015 IEEE Jack Keil Wolf ISIT Student Paper Award, even if the manuscript itself does not contain a statement to that effect.
– Please ensure that your review provides an answer to Question 12 if this is the case.

Thanks very much for helping out with the review process for ISIT, your inputs are of critical importance in ensuring that the high standards of an ISIT conference are maintained. We know that reviewing a paper takes much effort and we are grateful for all the time you have put in!

With regards,

Pierre, Suhas and Vijay
(TPC Co-Chairs, ISIT 2015)

# ITA 2015: quick takes

Better late than never, I suppose. A few weeks ago I escaped the cold of New Jersey to my old haunts of San Diego. Although La Jolla was always a bit fancy for my taste, it’s hard to beat a conference which boasts views like this:

A view from the sessions at ITA 2015

I’ll just recap a few of the talks that I remember from my notes — I didn’t really take notes during the plenaries so I don’t have much to say about them. Mostly this was due to laziness, but finding the time to blog has been challenging in this last year, so I think I have to pick my battles. Here’s a smattering consisting of

$\{ \mathrm{talks\ attended} \} \cap \{ \mathrm{talks\ with\ understandable\ notes} \}$

(Information theory)
Emina Soljanin talked about designing codes that are good for fast access to the data in distributed storage. Initial work focused on how to repair codes under disk failures. She looked at how easy it is to retrieve the information afterwards to guarantee some QoS for the storage system. Adam Kalai talked about designing compression schemes that work for an “audience” of decoders. The decoders have different priors on the set of elements/messages so the idea is to design an encoder that works for this ensemble of decoders. I kind of missed the first part of the talk so I wasn’t quite sure how this relates to classical work in mismatched decoding as done in the information theory world. Gireeja Ranade gave a great talk about defining notions of the capacity/rate needed to control a system which has multiplicative uncertainty. That is, $x[n+1] = x[n] + B[n] u[n]$ where $B[n]$ has the uncertainty. She gave a couple of different notions of capacity, relating to the ratio $| x[n]/x[0] |$: either the expected value of the square or the log, appropriately normalized. She used a “deterministic model” to give an explanation of how control in this setting is kind of like controlling the number of significant bits in the state: uncertainty increases this and you need a certain “amount” of control to cancel that growth.
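To get a feel for the model $x[n+1] = x[n] + B[n] u[n]$, here is a rough Monte Carlo sketch. Everything specific in it is my own assumption and not something from the talk: a memoryless linear controller $u[n] = -k x[n]$, an i.i.d. Gaussian $B[n]$ with made-up mean and standard deviation, and the hypothetical function name `growth_rates`. Under those assumptions the per-step factors $x[n+1]/x[n] = 1 - k B[n]$ are i.i.d., so the normalized log and second-moment growth of $| x[n]/x[0] |$ reduce to per-step averages. This only illustrates the two ways of measuring growth; it is not the capacity definition itself.

```python
import math
import random


def growth_rates(gain, mean_b=1.0, std_b=0.5, trials=100_000, seed=0):
    """Per-step growth of |x| under the assumed controller u[n] = -gain * x[n].

    B[n] is drawn i.i.d. Gaussian (an assumption made for this sketch).
    Returns (E log|f|, 0.5 * log E[f^2]) where f = x[n+1] / x[n].
    """
    rng = random.Random(seed)
    log_sum = 0.0
    square_sum = 0.0
    for _ in range(trials):
        factor = 1.0 - gain * rng.gauss(mean_b, std_b)  # x[n+1] / x[n]
        log_sum += math.log(abs(factor))
        square_sum += factor * factor
    log_rate = log_sum / trials
    second_moment_rate = 0.5 * math.log(square_sum / trials)
    return log_rate, second_moment_rate


# For this linear controller the second moment is minimized at
# gain = mean / (mean^2 + std^2), which is 0.8 for the defaults above.
for gain in (0.4, 0.8, 1.2):
    log_rate, second_moment_rate = growth_rates(gain)
    print(f"gain={gain:.1f}  log rate={log_rate:+.3f}  "
          f"second-moment rate={second_moment_rate:+.3f}")
```

Negative rates mean the state shrinks on average. By Jensen's inequality the log rate can never exceed the second-moment rate, so the two notions can disagree about how hard the system is to stabilize.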

(Learning and statistics)
I learned about active regression approaches from Sivan Sabato that provably work better than passive learning. The idea there is to use a partition of the X space and then do piecewise constant approximations to a weight function that they use in a rejection sampler. The rejection sampler (which I thought of as sort of doing importance sampling to make sure they cover the space) helps limit the number of labels requested by the algorithm. Somehow I had never met Raj Rao Nadakuditi until now, and I wish I had gotten a chance to talk to him further. He gave a nice talk on robust PCA, and in particular how outliers “break” regular PCA. He proposed a combination of shrinkage and truncation to help make PCA a bit more stable/robust. Laura Balzano talked about “estimating subspace projections from incomplete data.” She proposed an iterative algorithm for doing estimation on the Grassmann manifold that can do subspace tracking. Constantine Caramanis talked about a convex formulation for mixed regression that gives a guaranteed solution, along with minimax sample complexity bounds showing that it is basically optimal. Yingbin Liang talked about testing approaches for understanding if there is an “anomalous structure” in a sequence of data. Basically for a sequence $Y_1, Y_2, \ldots, Y_n$, the null hypothesis is that they are all i.i.d. $\sim p$ and the (composite) alternative is that there is an interval of indices which are $\sim q$ instead. She proposed an RKHS-based discrepancy measure and a threshold test on this measure. Pradeep Ravikumar talked about a “simple” estimator that was a “fix” for ordinary least squares with some soft thresholding. He showed consistency for linear regression in several senses, competitive with LASSO in some settings. Pretty neat, all said, although he also claimed that least squares was “something you all know from high school”; I went to a pretty good high school, and I don’t think we did least squares!
Sanmi Koyejo talked about a Bayesian decision theory approach to variable selection that involved minimizing some KL-divergence. Unfortunately, the resulting optimization ended up being NP-hard (for reasons I can’t remember) and so they use a greedy algorithm that seems to work pretty well.

(Privacy)
Cynthia Dwork gave a tutorial on differential privacy with an emphasis on the recent work involving false discovery rate. In addition to her plenary there were several talks on differential privacy and other privacy measures. Kunal Talwar talked about their improved analysis of the SuLQ method for differentially private PCA. Unfortunately there were two privacy sessions in parallel so I hopped over to see John Duchi talk about definitions of privacy and how definitions based on testing are equivalent to differential privacy. The testing framework makes it easier to prove minimax bounds, though, so it may be a more useful view at times. Nadia Fawaz talked about privacy for time-series data such as smart meter data. She defined different types of attacks in this setting and showed that they correspond to mutual information or directed mutual information, as well as empirical results on a real data set. Raef Bassily studied an estimation problem in the streaming setting where you want to get a histogram of the most frequent items in the stream. They reduce the problem to one of finding a “unique heavy hitter” and develop a protocol that looks sort of like a code for the MAC: they encode bits into a real vector, add noise, and then add those up over the reals. It’s accepted to STOC 2015 and he said the preprint will be up soon.

# Feature Engineering for Review Times

The most popular topic of conversation among information theory aficionados is probably the long review times for the IEEE Transactions on Information Theory. Everyone has a story of a very delayed review, either for their own paper or for a friend’s. The Information Theory Society Board of Governors and Editor-in-Chief have presented charts of “sub-to-pub” times and other statistics and are working hard on ways to improve the speed of reviews without impairing their quality. These are all laudable. But it occurs to me that there is room for social engineering on the input side of things as well. That is, if we treat the process as a black box, with inputs (papers) and outputs (decisions), what would a machine-learning approach to predicting decision time do?

Perhaps the most important (and in some cases overlooked) aspect of learning a predictor from real data is figuring out what features to measure about each of the inputs. Off the top of my head, things which may be predictive include:

• length
• number of citations
• number of equations
• number of theorems/lemmas/etc.
• number of previous IT papers by the authors
• h-index of authors
• membership status of the authors (student members to Fellows)
• associate editor handling the paper — although for obvious reasons we may not want to include this

I am sure I am missing a bunch of relevant measurable quantities here, but you get the picture.

I would bet that paper length is a strong predictor of review time, not because it takes a longer time to read a longer paper, but because the activation energy of actually picking up the paper to review it is a nonlinear function of the length.

Doing a regression analysis might yield some interesting suggestions on how to pick coauthors and paper length to minimize the review time. This could also help make the system go faster, no? Should we request these sorts of statistics from the EiC?
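Just to make the idea concrete, here is a minimal sketch of what such a regression could look like. I don't have any actual review data, so the "submissions" below are entirely synthetic: the feature ranges, the generating coefficients, and the helper names (`solve_linear_system`, `fit_least_squares`) are all made up for illustration. It fits ordinary least squares via the normal equations using only two of the features listed above (page count and number of theorems).

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in reversed(range(n)):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def fit_least_squares(features, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    dim = len(features[0])
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(dim)]
           for i in range(dim)]
    xty = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(dim)]
    return solve_linear_system(xtx, xty)


# Synthetic data only: these coefficients are invented, not estimated
# from real Transactions review times.
rng = random.Random(0)
TRUE_WEIGHTS = [60.0, 8.0, 3.0]  # intercept (days), days/page, days/theorem
features, review_days = [], []
for _ in range(500):
    pages = rng.uniform(10, 50)
    theorems = rng.randint(1, 15)
    row = [1.0, pages, float(theorems)]
    noise = rng.gauss(0.0, 20.0)
    features.append(row)
    review_days.append(sum(w * v for w, v in zip(TRUE_WEIGHTS, row)) + noise)

weights = fit_least_squares(features, review_days)
for name, w in zip(["intercept", "pages", "theorems"], weights):
    print(f"{name:>9}: {w:7.2f}")
```

With real data one would obviously want more features and a nonlinear term in the length, but even a linear fit like this would say something about which inputs matter.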

# Fenchel duality, entropy, and the log partition function

[Update: As Max points out in the comments, this is really a specialized version of the Donsker-Varadhan formula, also mentioned by Mokshay in a comment here. I think the difficulty with concepts like these is that they are true for deeper reasons than the ones given when you learn them — this is a special case that requires undergraduate probability and calculus, basically.]

One of my collaborators said to me recently that it’s well known that the “negative entropy is the Fenchel dual of the log-partition function.” Now, I know what these words mean, but it somehow was not a fact that I had learned elsewhere, and furthermore, a sentence like that is frustratingly terse. If you already know what it means, then it’s a nice shorthand, but for those trying to figure it out, it’s impenetrable jargon. I tried running it past a few people here who are generally knowledgeable but are not graphical model experts, and they too were unfamiliar with it. While this is just a simple thing about conjugate duality, I think it doesn’t really show up in information theory classes unless the instructor talks a bit more about exponential family distributions, maximum entropy distributions, and other related concepts. Bert Huang has a post on Jensen’s inequality as a justification.

We have a distribution in the exponential family:

$p(x; \theta) = \exp( \langle \theta, \phi(x) \rangle - A(\theta) )$

As a side note, I find that the exponential family is not often covered in systems EE courses. Given how important it is in statistics, I think it should be a bit more of a central concept; I’m definitely going to try and work it into the detection and estimation course.

For the purposes of this post I’m going to assume $x$ takes values in a discrete alphabet $\mathcal{X}$ (say, $n$-bit strings). The function $\phi(x)$ is a vector of statistics calculated from $x$, and $\theta$ is a vector of parameters. The function $A(\theta)$ is the log partition function:

$A(\theta) = \log\left( \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \right)$

where the partition function is

$Z(\theta) = \sum_{x} \exp( \langle \theta, \phi(x) \rangle )$

The entropy of the distribution is easy to calculate:

$H(p) = \mathbb{E}[ - \log p(x; \theta) ] = A(\theta) - \langle \theta, \mathbb{E}[\phi(x)] \rangle$

The Fenchel dual of a function $f(\theta)$ is the function

$g(\psi) = \sup_{\theta} \left\{ \langle \psi, \theta \rangle - f(\theta) \right\}$.

So what’s the Fenchel dual of the log partition function? We have to take the gradient:

$\nabla_{\theta} \left( \langle \psi, \theta \rangle - A(\theta) \right) = \psi - \frac{1}{Z(\theta)} \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \phi(x) = \psi - \mathbb{E}[ \phi(x) ]$

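Setting this equal to zero, we see that at the optimum $\theta^*$ the dual variable matches the mean of the statistics, $\psi = \mathbb{E}[ \phi(x) ]$, where the expectation is taken under $p(x; \theta^*)$. In particular:

$\langle \psi, \theta^* \rangle = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle$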

And the dual function is:

$g(\psi) = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle - A(\theta^*) = - H(p)$

The standard approach seems to go the other direction by computing the dual of the negative entropy, but that seems more confusing to me (perhaps inspiring Bert’s post above). Since the log partition function and negative entropy are both convex, it seems easier to exploit the duality to prove it in one direction only.
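For the skeptical, here is a small numerical sanity check of the identity, as a sketch under assumptions of my own choosing: a toy model on 3-bit strings with statistics $\phi(x) = (\text{number of ones}, x_0 x_1)$, a parameter vector picked arbitrarily, and plain gradient ascent with a fixed step size to approximate the supremum. None of these choices are special; the function names are just for this example.

```python
import itertools
import math

# Hypothetical toy model: x ranges over 3-bit strings and the statistics
# phi(x) = (number of ones, x[0] * x[1]) are chosen only for illustration.
ALPHABET = list(itertools.product([0, 1], repeat=3))


def phi(x):
    return (float(sum(x)), float(x[0] * x[1]))


def inner(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def log_partition(theta):
    """A(theta) = log sum_x exp(<theta, phi(x)>)."""
    return math.log(sum(math.exp(inner(theta, phi(x))) for x in ALPHABET))


def probabilities(theta):
    """p(x; theta) for every x in the alphabet."""
    a = log_partition(theta)
    return [math.exp(inner(theta, phi(x)) - a) for x in ALPHABET]


def mean_statistics(theta):
    """E[phi(x)] under p(x; theta), which is also the gradient of A."""
    probs = probabilities(theta)
    return tuple(
        sum(p * phi(x)[i] for p, x in zip(probs, ALPHABET))
        for i in range(len(theta))
    )


def entropy(theta):
    """H(p) = -sum_x p(x) log p(x), computed directly from the definition."""
    return -sum(p * math.log(p) for p in probabilities(theta))


def fenchel_dual(psi, steps=5000, step_size=0.2):
    """Approximate sup_theta <psi, theta> - A(theta) by gradient ascent."""
    theta = [0.0] * len(psi)
    for _ in range(steps):
        grad = [s - m for s, m in zip(psi, mean_statistics(theta))]
        theta = [t + step_size * g for t, g in zip(theta, grad)]
    return inner(psi, theta) - log_partition(theta), tuple(theta)


theta_true = (0.5, -1.0)            # arbitrary parameters for the check
psi = mean_statistics(theta_true)   # the matching mean statistics
dual_value, theta_star = fenchel_dual(psi)
print("dual value      :", dual_value)
print("negative entropy:", -entropy(theta_true))
print("recovered theta :", theta_star)
```

The two printed values should agree up to numerical error, and the maximizer should land back on the parameters that generated $\psi$, which is the gradient condition from the derivation above.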

# Allerton 2014: hypercontracting interference channels while noisily feeding back what you detected on the graph

Before the expiration window passes, here are a few more short takes from Allerton… for some talks I couldn’t take notes because I didn’t get a seat or I missed half the talk shuttling between sessions.

The Gaussian Channel with Noisy Feedback: Near-Capacity Performance Via Simple Interaction
Assaf Ben-Yishai, Ofer Shayevitz
This was a really nice talk by Ofer on trying to get practical codes for AWGN channels with noisy feedback by using the intuition given by the Schalkwijk-Kailath scheme plus some tricks from using the mod operation. This is reminiscent of lattices (which may be an interesting future direction). The SK scheme has a problem with noise accumulation, which they deal with using these mod operations, and can get to errors around $10^{-6}$ with around 19 rounds, or blocklength 19 at reasonable SNRs. Blocklength is misleading here since there is feedback every symbol. The other catch is that the feedback link must have much higher SNR than the forward link, but this is true in applications such as sensing, where the receiver may be plugged into the wall, but the transmitter may be on a swallowable medical monitoring device.

Point-To-Point Codes for Interference Channels: A Journey Toward High Performance at Low Complexity
Young-Han Kim
Continuing with my UCSD bias, I also wanted to mention Young-Han’s talk, which was on using COTS (commercial, off-the-shelf) coding schemes on the interference channel (in particular, the 2 user IC). He talked about rate splitting approaches and block Markov schemes. Much of this work is with Lele Wang, who may be graduating soon…

Signal Detection on Graphs
Venkatesh Saligrama
This was a hypothesis testing problem where the observations come from nodes on a graph. Under the null, they are Gaussian noise, and under the other hypothesis, there is a connected subgraph with an elevated mean. How should we do detection in this scenario? This is a compound hypothesis testing problem because there are (too) many possible connected subgraphs to consider. He gets around this by looking at a convex program parameterized by a measure of the size/shape of the connected component. This is where my notes get messy though, so you might want to look at the paper if it sounds interesting to you…

Hypercontractivity in Hamming Space
Yury Polyanskiy
I’ve written about hypercontractivity before, and Yury talked about his paper on arXiv, which is about functions on the binary hypercube. This talk felt more like a tour of results on hypercontractivity and less like a “here is my new result” talk, which I actually appreciated because I felt it tied together ideas well and made me realize how strange the hypercontractivity parameter of an operator is.