# Fenchel duality, entropy, and the log partition function

One of my collaborators said to me recently that it’s well known that the “negative entropy is the Fenchel dual of the log-partition function.” Now, I knew what these words meant, but it somehow was not a fact that I had learned elsewhere, and furthermore, a sentence like that is frustratingly terse. If you already know what it means, then it’s a nice shorthand, but for those trying to figure it out, it’s impenetrable jargon. I tried running it past a few people here who are generally knowledgeable but are not graphical model experts, and they too were unfamiliar with it. While this is just a simple fact about conjugate duality, I think it doesn’t really show up in information theory classes unless the instructor talks a bit more about exponential family distributions, maximum entropy distributions, and other related concepts. Bert Huang has a post giving a justification via Jensen’s inequality.

We have a distribution in the exponential family:

$p(x; \theta) = \exp( \langle \theta, \phi(x) \rangle - A(\theta) )$

As a side note, I find that the exponential family is often not covered in systems EE courses. Given how important it is in statistics, I think it should be a bit more of a central concept — I’m definitely going to try to work it into the detection and estimation course.

For the purposes of this post I’m going to assume $x$ takes values in a discrete alphabet $\mathcal{X}$ (say, n-bit strings). The function $\phi(x)$ is a vector of statistics calculated from $x$, and $\theta$ is a vector of parameters. The function $A(\theta)$ is the log partition function:

$A(\theta) = \log\left( \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \right)$

where the partition function is

$Z(\theta) = \sum_{x} \exp( \langle \theta, \phi(x) \rangle )$

The entropy of the distribution is easy to calculate:

$H(p) = \mathbb{E}[ - \log p(x; \theta) ] = A(\theta) - \langle \theta, \mathbb{E}[\phi(x)] \rangle$

The Fenchel dual of a function $f(\theta)$ is the function

$g(\psi) = \sup_{\theta} \left\{ \langle \psi, \theta \rangle - f(\theta) \right\}$.

So what’s the Fenchel dual of the log partition function? We have to take the gradient:

$\nabla_{\theta} \left( \langle \psi, \theta \rangle - A(\theta) \right) = \psi - \frac{1}{Z(\theta)} \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \phi(x) = \psi - \mathbb{E}[ \phi(x) ]$

Setting this equal to zero, we see that the optimum $\theta^*$ must satisfy the moment-matching condition $\psi = \mathbb{E}[ \phi(x) ]$, where the expectation is with respect to $p(x; \theta^*)$. In particular:

$\langle \psi, \theta^* \rangle = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle$

And the dual function is:

$g(\psi) = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle - A(\theta^*) = - H(p)$
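
To make the identity concrete, here is a small numerical check (a sketch of my own, not from the original discussion). It assumes one-hot sufficient statistics $\phi(x) = e_x$ on a 5-letter alphabet, so that $A(\theta)$ is a log-sum-exp and the mean parameter $\psi = \mathbb{E}[\phi(x)]$ is just the distribution itself:

```python
# Numerically verify that the Fenchel dual of the log partition function
# evaluates to the negative entropy (one-hot statistics, 5-letter alphabet).
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def A(theta):
    # log partition function for phi(x) = e_x
    return logsumexp(theta)

rng = np.random.default_rng(0)
psi = rng.dirichlet(np.ones(5))  # a mean vector, i.e., a pmf on 5 symbols

# g(psi) = sup_theta { <psi, theta> - A(theta) }, computed numerically
res = minimize(lambda th: -(psi @ th - A(th)), x0=np.zeros(5))
g_psi = -res.fun

neg_entropy = np.sum(psi * np.log(psi))  # -H(p), in nats
print(g_psi, neg_entropy)  # these agree to optimizer precision
```

The optimizer recovers $\theta^* = \log \psi$ up to an additive constant, since shifting $\theta$ by a constant vector changes $\langle \psi, \theta \rangle$ and $A(\theta)$ by the same amount.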

The standard approach seems to go the other direction by computing the dual of the negative entropy, but that seems more confusing to me (perhaps inspiring Bert’s post above). Since the log partition function and the negative entropy are both convex, biconjugate duality means it suffices to prove the identity in one direction, which seems easier.

# Data: what is it good for? (Absolutely Something): the first few weeks

So Waheed Bajwa and I have been teaching this Byrne Seminar on “data science.” At Allerton some people asked me how it was going and what we were covering in the class. These seminars are meant to be more discussion-based. This is a bit tough for us in particular:

• engineering classes are generally NOT discussion-based, either in the US or in Pakistan
• it’s been more than a decade since we were undergraduates, let alone 18 years old
• the students in our class are fresh out of high school and also haven’t had discussion-based classes

My one experience in leading discussion was covering for a theater class approximately 10 years ago, but that was a junior-level elective as I recall, and the dynamics were quite a bit different. So getting a discussion going and getting all of the students to participate is, on top of being tough in general, particularly challenging for us. What has helped is that a number of the students in the class are pretty engaged with the ideas and material, and we do, in the end, get to collectively think a bit differently about the technologies around us and the role that data plays.

What I wanted to talk about in this post was what we’ve covered in the first few weeks. If we offer this class again it would be good to revisit some of the decisions we’ve made along the way, as this is as much a learning process for us as it is for them. A Byrne Seminar meets 10 times during the semester, so it ends well before finals. We had some overflow from one topic to the next, but roughly speaking the class went in the following order:

• Introduction: what is data?
• Potentials and perils of data science
• The importance of modeling
• Statistical considerations
• Machine learning and algorithms
• Data and society: ethics and privacy
• Data visualization
• Project Presentations

I’ll talk a bit more on the blog about this class, what we covered, what readings/videos we ended up choosing, and how it went. I think it would be fun to offer this course again, assuming our evaluations pass muster. But in the meantime, the class is still on, so it’s a bit hard to pass retrospective judgement.

# MAP and ML in practice on the New Jersey Turnpike

Since I might be teaching detection and estimation next semester, I’ve been thinking a little bit about decision rules during my commute down the New Jersey Turnpike. The following question came to mind:

Suppose you see a car on the Turnpike that is clearly being driven dangerously (weaving between cars, going 90+ MPH, tailgating an ambulance, and the like). You have to decide whether the car has New Jersey or New York plates [*].

This is a hypothesis testing problem. I will assume for simplicity that New York drivers have cars with New York plates and New Jersey drivers have New Jersey plates [**]:
$H_0$: New Jersey driver
$H_1$: New York driver
Let $Y$ be a binary variable indicating whether or not I observe dangerous driving behavior. Based on my entirely subjective experience, I would say that, in terms of likelihoods,
$\mathbb{P}(Y = 1 | H_1) > \mathbb{P}(Y = 1 | H_0)$
so the maximum likelihood (ML) rule would suggest that the driver is from New York.

However, if I take into account my (also entirely subjective) priors $P(H_0), P(H_1)$ on the fraction of drivers from New Jersey and New York, respectively, I would have to say
$\mathbb{P}(Y = 1 | H_1) P(H_1) < \mathbb{P}(Y = 1 | H_0) P(H_0)$
so the maximum a-posteriori probability (MAP) rule would suggest that the driver is from New Jersey.
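
To make the comparison concrete, here is a toy version in code, with entirely made-up numbers chosen to match my subjective assessments above: the likelihoods favor New York, but the prior mass on New Jersey drivers flips the MAP decision.

```python
# ML vs. MAP with hypothetical numbers: P(Y=1 | H) and P(H) are made up.
p_y1_given = {"NJ": 0.10, "NY": 0.30}  # NY drivers more likely to drive dangerously
prior      = {"NJ": 0.80, "NY": 0.20}  # but most cars on the Turnpike are from NJ

ml_decision  = max(p_y1_given, key=lambda h: p_y1_given[h])
map_decision = max(prior, key=lambda h: p_y1_given[h] * prior[h])

print("ML :", ml_decision)   # NY, since 0.30 > 0.10
print("MAP:", map_decision)  # NJ, since 0.10 * 0.80 = 0.08 > 0.30 * 0.20 = 0.06
```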

Which is better?

[*] I am assuming North Jersey here, so Pennsylvania plates are negligible.
[**] This may be a questionable modeling assumption given suburban demographics.

# A joke for Max Raginsky

Setting: a lone house stands on a Scottish moor. The fog is dense here. It is difficult to estimate where your foot will fall. A figure in a cloak stands in front of the door.

Figure: [rapping on the door, in a Highland accent] Knock knock!

Voice from inside: Who’s there?

Figure: Glivenko!

Voice: Glivenko who?

Figure: Glivenko-Cantelli!

[The fog along the moor converges uniformly on the house, enveloping it completely in a cumulus.]

Scene.

# SPCOM 2014: some talks

Relevance Singular Vector Machine for Low-rank Matrix Sensing
Martin Sundin; Saikat Chatterjee; Magnus Jansson; Cristian Rojas
This talk was on designing Bayesian priors for sparse-PCA problems — the key is to find a prior which induces a low-rank structure on the matrix. The model was something like $y = A \mathrm{vec}(X) + n$ where $X$ is a low-rank matrix and $n$ is noise. The previous state of the art is by Babacan et al., a paper which I obviously haven’t read, but the method they propose here (which involved some heavy algebra/matrix factorizations) appears to be competitive in several regimes. Probably more of interest to those working on Bayesian methods…

Non-Convex Sparse Estimation for Signal Processing
David Wipf
More Bayesian methods! David (whom I met at ICML) was not trying to say that the priors are particularly “correct,” but rather that the penalty functions they induce on the problems he is studying actually make sense. More of an algorithmist’s approach, you might say. He set up the problem a bit more generally, as minimizing problems of the form
$\min_{X_i} \sum_{i} \alpha_i \, \mathrm{rank}[X_i] \quad \text{s.t.} \quad Y = \sum_{i} A_i(X_i)$
where $A_i$ are some operators. He made the case that convex relaxations of many of these problems, while analytically beautiful, have restrictions which are not satisfied in practice, and indeed they often have poor performance. His approach is via Empirical Bayes, but this leads to non-convex problems. What he can show is that the algorithm he proposes is competitive with any method that tries to separate the error from the “low-rank” constraint, and that the new optimization is “smoother.” I’m sure more details are in his various papers, for those who are interested.

PCA-HDR: A Robust PCA Based Solution to HDR Imaging
My apologies for taking fewer notes on this one, but I don’t know much about HDR imaging, so this was mostly me learning about HDR image processing. There are several different ways of doing HDR, from multiple exposures to flash/no-flash, and so on. The idea is that artifacts introduced by the camera can be modeled using the robust PCA framework, and that denoising in HDR imaging may work better using robust PCA. I think that looking at some of the approaches David mentioned may be good in this domain, since it seems unlikely to me that these images will satisfy the conditions necessary for convex relaxations to work…

On Communication Requirements for Secure Computation
Vinod M Prabhakaran
Vinod showed some information-theoretic approaches to understanding how much communication is needed for secure computation protocols like remote oblivious transfer: Xavier has $\{X_0, X_1\}$, Yvonne has $Y \in \{0,1\}$, and Zelda wants $Z = X_Y$, but no party should be able to infer the others’ values. Feige, Kilian, and Naor have a protocol for this, which Vinod and co. can show is communication-optimal. There were several ingredients here, including cut-set bounds, distribution switching, data processing inequalities, and special bounds for 3-party protocols. More details in his CRYPTO paper (and others).

Artificial Noise Revisited: When Eve Has More Antennas Than Alice
Shuiyin Liu; Yi Hong; Emanuele Viterbo
In a MIMO wiretap setting, if the transmitter has more antennas than the receiver, then the transmitter can send artificial noise in the nullspace of the direct channel’s matrix — as long as the eavesdropper has fewer antennas than the transmitter, secure transmission is possible. In this paper they show that positive secrecy capacity is possible even when the eavesdropper has more antennas, but as the number of eavesdropper antennas grows, the achievable rate goes to $0$. Perhaps a little bit of a surprise here!
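
The nullspace trick itself is easy to see numerically. Below is a toy sketch of my own (generic Gaussian channel matrices, not the paper’s scheme): noise sent in the nullspace of the direct channel vanishes at the intended receiver but still hits the eavesdropper.

```python
# Artificial noise in the nullspace of the direct channel (toy illustration).
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(1)
nt, nr, ne = 4, 2, 3                    # transmit / receive / eavesdropper antennas
H = rng.standard_normal((nr, nt))       # direct channel
G = rng.standard_normal((ne, nt))       # eavesdropper's channel

V = null_space(H)                       # orthonormal basis for null(H)
noise = V @ rng.standard_normal(V.shape[1])

print(np.linalg.norm(H @ noise))        # ~0: the receiver is unaffected
print(np.linalg.norm(G @ noise))        # > 0: the eavesdropper is jammed
```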

# ISIT 2014: a few more talks

Identification via the Broadcast Channel
Annina Bracher (ETH Zurich, Switzerland); Amos Lapidoth (ETHZ, Switzerland)
The title pretty much describes it — there are two receivers which are both looking out for a particular message. This is the identification problem, in which the receiver only cares about a particular message (but we don’t know which one) and we have to design a code such that they can detect the message. The number of messages is $2^{2^{nC}}$ where $C$ is the Shannon capacity of the DMC. In the broadcast setting we run into the problem that the errors for the two receivers are entangled. However, their message sets are disjoint. The way out is to look at the average error for each (averaged over the other user’s message). The main result is that the rates only depend on the conditional marginals, and they have a strong converse.

Efficient compression of monotone and m-modal distributions
Jayadev Acharya (University of California, San Diego, USA); Ashkan Jafarpour (University of California, San Diego, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
A monotone distribution is a distribution on $\mathbb{N}$ such that the probabilities are non-increasing. The redundancy for this class is infinite, alas, so they restrict the support to size $k$ (where $k$ can be large). They propose a two-step compression scheme in which the first step is to approximate the true distribution with a piecewise constant step distribution, and then use a compression scheme for step distributions.
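
Here is a rough sketch of the first step as I understood it from the talk; the dyadic block choice is my guess at the flavor of the construction, not necessarily the paper’s exact scheme.

```python
# Approximate a monotone pmf on {1, ..., k} by a piecewise-constant pmf
# that is constant on geometrically growing (dyadic) blocks.
import numpy as np

def step_approximation(p):
    q = np.empty_like(p)
    i = 1
    while i <= len(p):
        block = slice(i - 1, min(2 * i - 1, len(p)))  # block [i, 2i)
        q[block] = p[block].mean()                    # average within the block
        i *= 2
    return q

rng = np.random.default_rng(0)
k = 1000
p = np.sort(rng.exponential(size=k))[::-1]
p = p / p.sum()                                       # a monotone pmf
q = step_approximation(p)
print(np.abs(p - q).sum())                            # small L1 approximation error
```

Since the blocks grow geometrically, the approximating distribution has only $O(\log k)$ distinct levels, which is what makes the second (compression) step cheap.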

Writing on a Dirty Paper in the Presence of Jamming
Amitalok J Budkuley (Indian Institute of Technology, Bombay, India); Bikash K Dey (Indian Institute of Technology Bombay, India); Vinod M Prabhakaran (Tata Institute of Fundamental Research, India)
Ahh, jamming. A topic near and dear to my heart. This paper takes a game-theoretic approach to jamming in a DPC setup: “the capacity of the channel in the presence of the jammer is the unique Nash equilibrium utility of the zero sum communication game between the user and the jammer.” This is a mutual information game, and they show that i.i.d. Gaussian jamming and dirty paper coding are a Nash equilibrium. I looked at an AVC version of this problem in my thesis, and the structure is quite a bit different, so this was an interesting different take on the same problem — how can we use the state information to render adversarial interference as harmless as noise?

Stable Grassmann Manifold Embedding via Gaussian Random Matrices
Hailong Shi (Tsinghua University & Department of Electronic Engineering, P.R. China); Hao Zhang (Tsinghua University, P.R. China); Gang Li (Tsinghua University, P.R. China); Xiqin Wang (Tsinghua University, P.R. China)
This was in the session I was chairing. The idea is that you are given a subspace (i.e., a point on the Grassmann manifold) and you want to see what happens when you randomly project it into a lower-dimensional space using an i.i.d. Gaussian matrix, a la Johnson-Lindenstrauss. The JL Lemma says that such projections are approximately length-preserving. Are they also volume-preserving? It turns out that they are (no surprise). The main tools are measure concentration results together with a union bound over a covering set.
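
As a quick illustration of the length preservation at the heart of this (a generic JL-style check of my own, not the paper’s construction): an i.i.d. Gaussian matrix scaled by $1/\sqrt{m}$ approximately preserves norms with high probability.

```python
# JL-style length preservation under a scaled i.i.d. Gaussian projection.
import numpy as np

rng = np.random.default_rng(0)
n, m = 1000, 200                        # ambient and projected dimensions
A = rng.standard_normal((m, n)) / np.sqrt(m)

x = rng.standard_normal(n)
print(np.linalg.norm(x), np.linalg.norm(A @ x))  # should be close
```

The paper’s contribution is the analogous statement for volumes of subspaces, which (roughly) amounts to controlling all of the singular values of the projected basis simultaneously.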

Is “Shannon capacity of noisy computing” zero?
Pulkit Grover (Carnegie Mellon University, USA)
Yes. I think. Maybe? Pulkit set up a physical model for computation and used a cut-set argument to show that the total energy expenditure is high. I started looking at the paper in the proceedings and realized that it’s significantly different than the talk though, so I’m not sure I really understood the argument. I should read the paper more carefully. You should too, probably.