Fenchel duality, entropy, and the log partition function

[Update: As Max points out in the comments, this is really a specialized version of the Donsker-Varadhan formula, also mentioned by Mokshay in a comment here. I think the difficulty with concepts like these is that they are true for deeper reasons than the ones given when you learn them -- this is a special case that requires undergraduate probability and calculus, basically.]

One of my collaborators said to me recently that it’s well known that the “negative entropy is the Fenchel dual of the log-partition function.” Now, I knew what these words meant, but it somehow was not a fact that I had learned elsewhere, and furthermore, a sentence like that is frustratingly terse. If you already know what it means, then it’s a nice shorthand, but for those trying to figure it out, it’s impenetrable jargon. I tried running it past a few people here who are generally knowledgeable but are not graphical model experts, and they too were unfamiliar with it. While this is just a simple thing about conjugate duality, I think it doesn’t really show up in information theory classes unless the instructor talks a bit more about exponential family distributions, maximum entropy distributions, and other related concepts. Bert Huang has a post on Jensen’s inequality as a justification.

We have a distribution in the exponential family:

p(x; \theta) = \exp( \langle \theta, \phi(x) \rangle - A(\theta) )

As a side note, I find that the exponential family is often not covered in systems EE courses. Given how important it is in statistics, I think it should be a bit more of a central concept; I’m definitely going to try to work it into the detection and estimation course.

For the purposes of this post I’m going to assume x takes values in a discrete alphabet \mathcal{X} (say, n-bit strings). The function \phi(x) is a vector of statistics calculated from x, and \theta is a vector of parameters. The function A(\theta) is the log partition function:

A(\theta) = \log\left( \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \right)

where the partition function is

Z(\theta) = \sum_{x} \exp( \langle \theta, \phi(x) \rangle )

The entropy of the distribution is easy to calculate: since -\log p(x; \theta) = A(\theta) - \langle \theta, \phi(x) \rangle, taking expectations gives

H(p) = \mathbb{E}[ - \log p(x; \theta) ] = A(\theta) - \langle \theta, \mathbb{E}[\phi(x)] \rangle
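
As a quick numerical sanity check, here is a short Python sketch; the alphabet, sufficient statistic \phi, and parameter values are arbitrary toy choices, picked only to illustrate the identity:

    import numpy as np

    alphabet = np.arange(8)                       # x in {0, ..., 7}, a toy alphabet
    phi = np.stack([alphabet, alphabet ** 2], 1)  # phi(x) = (x, x^2), shape (8, 2)
    theta = np.array([0.3, -0.1])                 # arbitrary natural parameters

    scores = phi @ theta                          # <theta, phi(x)> for each x
    A = np.log(np.sum(np.exp(scores)))            # log partition function A(theta)
    p = np.exp(scores - A)                        # p(x; theta)

    H = -np.sum(p * np.log(p))                    # entropy computed directly
    H_via_A = A - theta @ (p @ phi)               # A(theta) - <theta, E[phi(x)]>

    print(H, H_via_A)                             # the two values agree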

The Fenchel dual of a function f(\theta) is the function

g(\psi) = \sup_{\theta} \left\{ \langle \psi, \theta \rangle - f(\theta) \right\}.

So what’s the Fenchel dual of the log partition function? We have to take the gradient:

\nabla_{\theta} \left( \langle \psi, \theta \rangle - A(\theta) \right) = \psi - \frac{1}{Z(\theta)} \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \phi(x) = \psi - \mathbb{E}[ \phi(x) ]

Setting this equal to zero, we see that at the optimum \theta^* we must have \psi = \mathbb{E}[ \phi(x) ], where the expectation is with respect to p(x; \theta^*). In particular,

\langle \psi, \theta^* \rangle = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle

And the dual function is:

g(\psi) = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle - A(\theta^*) = - H(p)
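
As a concrete sanity check, take the standard Bernoulli example with \phi(x) = x \in \{0, 1\}, so that

A(\theta) = \log( 1 + \exp(\theta) )

Setting the gradient to zero gives \psi = \exp(\theta^*) / (1 + \exp(\theta^*)), i.e. \theta^* = \log( \psi / (1 - \psi) ), and substituting back,

g(\psi) = \psi \log\left( \frac{\psi}{1 - \psi} \right) + \log( 1 - \psi ) = \psi \log \psi + (1 - \psi) \log( 1 - \psi )

which is exactly the negative entropy of a Bernoulli(\psi) distribution.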

The standard approach seems to go the other direction by computing the dual of the negative entropy, but that seems more confusing to me (perhaps inspiring Bert’s post above). Since the log partition function and negative entropy are both convex, it seems easier to exploit the duality to prove it in one direction only.

Data: what is it good for? (Absolutely Something): the first few weeks

So Waheed Bajwa and I have been teaching this Byrne Seminar on “data science.” At Allerton some people asked me how it was going and what we were covering in the class. These seminars are meant to be more discussion-based. This is a bit tough for us in particular:

  • engineering classes are generally NOT discussion-based, either in the US or in Pakistan
  • it’s been more than a decade since we were undergraduates, let alone 18
  • the students in our class are fresh out of high school and also haven’t had discussion-based classes

My one experience in leading discussion was covering for a theater class approximately 10 years ago, but that was a junior-level elective, as I recall, and the dynamics were quite a bit different. So getting a discussion going and getting all of the students to participate is, on top of being tough in general, particularly challenging for us. What has helped is that a number of the students in the class are pretty engaged with the ideas and material, and in the end we do get to think, collectively and a bit differently, about the technologies around us and the role that data plays.

What I wanted to talk about in this post is what we’ve covered in the first few weeks. If we offer this class again it would be good to revisit some of the decisions we’ve made along the way, as this is as much a learning process for us as it is for them. A Byrne Seminar meets 10 times during the semester, so that it ends well before finals. We had some overflow from one topic to the next, but roughly speaking the class went in the following order:

  • Introduction: what is data?
  • Potentials and perils of data science
  • The importance of modeling
  • Statistical considerations
  • Machine learning and algorithms
  • Data and society: ethics and privacy
  • Data visualization
  • Project Presentations

I’ll talk a bit more on the blog about this class, what we covered, what readings/videos we ended up choosing, and how it went. I think it would be fun to offer this course again, assuming our evaluations pass muster. But in the meantime, the class is still on, so it’s a bit hard to pass retrospective judgement.

Allerton 2014: hypercontracting interference channels while noisily feeding back what you detected on the graph

Before the expiration window passes, here are a few more short takes from Allerton… for some talks I couldn’t take notes because I didn’t get a seat or I missed half the talk shuttling between sessions.

The Gaussian Channel with Noisy Feedback: Near-Capacity Performance Via Simple Interaction
Assaf Ben-Yishai, Ofer Shayevitz
This was a really nice talk by Ofer on trying to get practical codes for AWGN channels with noisy feedback by using the intuition given by the Schalkwijk-Kailath scheme plus some tricks using the mod operation. This is reminiscent of lattices (which may be an interesting future direction). The SK scheme has a problem with noise accumulation, which they deal with using these mod operations, and they can get to errors around 10^(-6) with around 19 rounds, or blocklength 19, at reasonable SNRs. Blocklength is misleading here since there is feedback after every symbol. The other catch is that the feedback link must have much higher SNR than the forward link, but this is true in applications such as sensing, where the receiver may be plugged into the wall but the transmitter may be on a swallowable medical monitoring device.

Point-To-Point Codes for Interference Channels: A Journey Toward High Performance at Low Complexity
Young-Han Kim
Continuing with my UCSD bias, I also wanted to mention Young-Han’s talk, which was on using COTS (commercial, off-the-shelf) coding schemes on the interference channel (in particular, the 2 user IC). He talked about rate splitting approaches and block Markov schemes. Much of this work is with Lele Wang, who may be graduating soon…

Signal Detection on Graphs
Venkatesh Saligrama
This was a hypothesis testing problem where the observations come from nodes on a graph. Under the null, they are Gaussian noise, and under the alternative, there is a connected subgraph with an elevated mean. How should we do detection in this scenario? This is a compound hypothesis testing problem because there are (too) many possible connected subgraphs to consider. He gets around this by looking at a convex program parameterized by a measure of the size/shape of the connected component. This is where my notes get messy though, so you might want to look at the paper if it sounds interesting to you…

Hypercontractivity in Hamming Space
Yury Polyanskiy
I’ve written about hypercontractivity before, and Yury talked about his paper on ArXiV, which is about functions on the binary hypercube. This talk felt more like a tour of results on hypercontractivity and less like a “here is my new result” talk, which I actually appreciated because I felt it tied the ideas together well and made me realize how strange the hypercontractivity parameter of an operator is.

Detection and Estimation: book recommendations?

It’s confirmed that I will be teaching Detection and Estimation next semester so I figured I would use the blog to conjure up some book recommendations (or even debate, if I can be so hopeful). Some of the contenders:

  • Steven M. Kay, Fundamentals of Statistical Signal Processing – Estimation Theory (Vol. 1), Prentice Hall, 1993.
  • H. Vincent Poor, An Introduction to Signal Detection and Estimation, 2nd Edition, Springer, 1998.
  • Harry L. Van Trees, Detection, Estimation, and Modulation Theory (in 4 parts), Wiley, 2001 (a reprint).
  • M. D. Srinath, P. K. Rajasekaran, and R. Viswanathan, Introduction to Statistical Signal Processing with Applications, Prentice Hall, 1996.

Detection and estimation is a fundamental class in the ECE graduate curriculum, but these “standard” textbooks are around 20 years old, and I can’t help but think there might be a more “modern” take on the subject (no, I’m not volunteering). Venu Veeravalli’s class doesn’t use a book, but just has notes. However, I think the students at Rutgers (majority MS students) would benefit from a textbook, at least as a grounding.

Srinath et al. is what my colleague Narayan Mandyam uses. Kay is what I was leaning toward before (because it seems to be the most widely used), but Poor’s book is the one I read. I think I am putting up the Van Trees as a joke, mostly. I mean, it’s a great book, but I think it’s a bit much for a textbook. So what do the rest of you use? Also, if you are teaching this course next semester, perhaps we can share some ideas. I think the curriculum might be ripe for some shaking up, if not in the core material, then at least in the kinds of examples we use. For example, I’m certainly going to cover differential privacy as a connection to hypothesis testing.

MAP and ML in practice on the New Jersey Turnpike

Since I might be teaching detection and estimation next semester, I’ve been thinking a little bit about decision rules during my commute down the New Jersey Turnpike. The following question came to mind:

Suppose you see a car on the Turnpike that is clearly being driven dangerously (weaving between cars, going 90+ MPH, tailgating an ambulance, and the like). You have to decide: does the car have New Jersey or New York plates [*]?

This is a hypothesis testing problem. I will assume for simplicity that New York drivers have cars with New York plates and New Jersey drivers have New Jersey plates [**]:
H_0: New Jersey driver
H_1: New York driver
Let Y be a binary variable indicating whether or not I observe dangerous driving behavior. Based on my entirely subjective experience, I would say that, in terms of likelihoods,
\mathbb{P}(Y = 1 | H_1) > \mathbb{P}(Y = 1 | H_0)
so the maximum likelihood (ML) rule would suggest that the driver is from New York.

However, if I take into account my (also entirely subjective) priors P(H_0), P(H_1) on the fraction of drivers from New Jersey and New York, respectively, I would have to say
\mathbb{P}(Y = 1 | H_1) P(H_1) < \mathbb{P}(Y = 1 | H_0) P(H_0)
so the maximum a-posteriori probability (MAP) rule would suggest that the driver is from New Jersey.
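
To put some numbers on it, here is a small Python sketch; the likelihoods and priors are made-up stand-ins for my subjective estimates, chosen only to show how the two rules can disagree:

    # Made-up numbers standing in for my subjective likelihoods and priors.
    p_y1_given_ny = 0.30   # P(Y = 1 | H_1): dangerous driving given a New York driver
    p_y1_given_nj = 0.10   # P(Y = 1 | H_0): dangerous driving given a New Jersey driver
    prior_ny = 0.20        # P(H_1): fraction of Turnpike drivers from New York
    prior_nj = 0.80        # P(H_0): fraction of Turnpike drivers from New Jersey

    # ML rule: compare the likelihoods alone.
    ml_decision = "NY" if p_y1_given_ny > p_y1_given_nj else "NJ"

    # MAP rule: weight each likelihood by its prior before comparing.
    map_decision = "NY" if p_y1_given_ny * prior_ny > p_y1_given_nj * prior_nj else "NJ"

    print(ml_decision, map_decision)   # prints NY NJ: the two rules disagree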

Which is better?

[*] I am assuming North Jersey here, so Pennsylvania plates are negligible.
[**] This may be a questionable modeling assumption given suburban demographics.

Postdoctoral position at University of Michigan

A postdoctoral position is available at the University of Michigan Electrical Engineering and Computer Science Department for a project related to anomaly detection in networked cyber-physical systems. The successful applicant will have knowledge in one or more of the following topics: convex optimization and relaxations, compressed sensing, distributed optimization, submodularity, control and dynamical systems or system identification. The project will cover both theory and algorithm development and some practical applications in fault and attack detection in transportation and energy networks. The position can start anytime in 2014 or early 2015. This is a one year position, renewable for a second year. Interested candidates should contact Necmiye Ozay at necmiye@umich.edu with a CV and some pointers to representative publications.

Linkage

Some old links I meant to post a while back but still may be of interest to some…

I prefer my okra less slimy, but to each their own.

Via Erin, A tour of the old homes of the Mission.

Also via Erin, Women and Crosswords and Autofill.

A statistician rails against computer science’s intellectual practices.

Nobel Laureate Randy Schekman is boycotting Nature, Science, and Cell. Retraction Watch is skeptical.