# Fenchel duality, entropy, and the log partition function

[Update: As Max points out in the comments, this is really a specialized version of the Donsker-Varadhan formula, also mentioned by Mokshay in a comment here. I think the difficulty with concepts like these is that they are true for deeper reasons than the ones given when you learn them — this is a special case that requires undergraduate probability and calculus, basically.]

One of my collaborators said to me recently that it’s well known that the “negative entropy is the Fenchel dual of the log-partition function.” Now I know what these words meant, but it somehow was not a fact that I had learned elsewhere, and furthermore, a sentence like that is frustratingly terse. If you already know what it means, then it’s a nice shorthand, but for those trying to figure it out, it’s impenetrable jargon. I tried running it past a few people here who are generally knowledgeable but are not graphical model experts, and they too were unfamiliar with it. While this is just a simple thing about conjugate duality, I think it doesn’t really show up in information theory classes unless the instructor talks a bit more about exponential family distributions, maximum entropy distributions, and other related concepts. Bert Huang has a post on Jensen’s inequality as a justification.

We have a distribution in the exponential family: $p(x; \theta) = \exp( \langle \theta, \phi(x) \rangle - A(\theta) )$

As a side note, I often find that the exponential family is not often covered in systems EE courses. Given how important it is in statistics, I think it should be a bit more of a central concept — I’m definitely going to try and work it in to the detection and estimation course.

For the purposes of this post I’m going to assume $x$ takes values in a discrete alphabet $\mathcal{X}$ (say, n-bit strings). The function $\phi(x)$ is a vector of statistics calculated from $x$, and $\theta$ is a vector of parameters. the function $A(\theta)$ is the log partition function: $A(\theta) = \log\left( \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \right)$

Where the partition function is $Z(\theta) = \sum_{x} \exp( \langle \theta, \phi(x) \rangle )$

The entropy of the distribution is easy to calculate: $H(p) = \mathbb{E}[ - \log p(x; \theta) ] = A(\theta) - \langle \theta, \mathbb{E}[\phi(x)] \rangle$

The Fenchel dual of a function $f(\theta)$ is the function $g(\psi) = \sup_{\theta} \left\{ \langle \psi, \theta \rangle - f(\theta) \right\}$.

So what’s the Fenchel dual of the log partition function? We have to take the gradient: $\nabla_{\theta} \left( \langle \psi, \theta \rangle - A(\theta) \right) = \psi - \frac{1}{Z(\theta)} \sum_{x} \exp( \langle \theta, \phi(x) \rangle ) \phi(x) = \psi - \mathbb{E}[ \phi(x) ]$

So now setting this equal to zero, we see that at the optimum $\theta^*$: $\langle \psi, \theta^* \rangle = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle$

And the dual function is: $g(\psi) = \langle \mathbb{E}[ \phi(x) ], \theta^* \rangle - A(\theta^*) = - H(p)$

The standard approach seems to go the other direction by computing the dual of the negative entropy, but that seems more confusing to me (perhaps inspiring Bert’s post above). Since the log partition function and negative entropy are both convex, it seems easier to exploit the duality to prove it in one direction only.

# Data: what is it good for? (Absolutely Something): the first few weeks

So Waheed Bajwa and I have been teaching this Byrne Seminar on “data science.” At Allerton some people asked me how it was going and what we were covering in the class. These seminars are meant to be more discussion-based. This is a bit tough for us in particular:

• engineering classes are generally NOT discussion-based, neither in the US nor in Pakistan
• it’s been more than a decade since we were undergraduates, let alone 18
• the students in our class are fresh out of high school and also haven’t had discussion-based classes

My one experience in leading discussion was covering for a theater class approximately 10 years ago, but that was junior-level elective as I recall, and the dynamics were quite a bit different. So getting a discussion going and getting all of the students to participate is, on top of being tough in general, particularly challenging for us. What has helped is that a number of the students in the class are pretty engaged with the ideas and material, and we do in the end get to collectively think about the technologies around us and the role that data plays a bit differently.

What I wanted to talk about in this post was what we’ve covered in the first few weeks. If we offer this class again it would be good to revisit some of the decisions we’ve made along the way, as this is as much a learning process for us as it is for them. A Byrne Seminar meets for 10 times during the semester, so that it will end well before finals. We had some overflow from one topic to the next, but roughly speaking the class went in the following order:

• Introduction: what is data?
• Potentials and perils of data science
• The importance of modeling
• Statistical considerations
• Machine learning and algorithms
• Data and society: ethics and privacy
• Data visualizaion
• Project Presentations

I’ll talk a bit more on the blog about this class, what we covered, what readings/videos we ended up choosing, and how it went. I think it would be fun to offer this course again, assuming our evaluations pass muster. But in the meantime, the class is still on, so it’s a bit hard to pass retrospective judgement.

# Allerton 2014: hypercontracting interference channels while noisily feeding back what you detected on the graph

Before the expiration window passes, here are few more short takes from Allerton… for some talks I couldn’t take notes because I didn’t get a seat or I missed half the talk shuttling between sessions.

The Gaussian Channel with Noisy Feedback: Near-Capacity Performance Via Simple Interaction
Assaf Ben-Yishai, Ofer Shayevitz
This was a really nice talk by Ofer on trying to get practical codes for AWGN channels with noisy feedback by using the intuition given by the Schalkwijk-Kailath scheme plus some tricks from using the mod operation. This is reminiscent of lattices (which may be an interesting future direction). The SK scheme has a problem with noise accumulation, which they deal with using these mode operations, and can get to errors around 10^(-6) with around 19 rounds, or blocklength 19 at reasonable SNRs. Blocklength is misleading here since there is feedback every symbol. The other catch is that the feedback link must have much higher SNR than the forward link, but this is true in applications such as sensing, where the receiver may be plugged into the wall, but the transmitter may be on a swallowable medical monitoring device.

Point-To-Point Codes for Interference Channels: A Journey Toward High Performance at Low Complexity
Young-Han Kim
Continuing with my UCSD bias, I also wanted to mention Young-Han’s talk, which was on using COTS (commercial, off-the-shelf) coding schemes on the interference channel (in particular, the 2 user IC). He talked about rate splitting approaches and block Markov schemes. Much of this work is with Lele Wang, who may be graduating soon…

Signal Detection on Graphs
Venkatesh Saligrama
This was a hypothesis testing problem where the observations come from nodes on graph. Under the null, they are Gaussian noise, and under the other hypothesis, there is a connected subgraph with an elevated mean. How should we do detection in this scenario? This is a compound hypothesis testing problem because there are (too) many possible connected subgraphs to consider. He gets around this by looking at a convex program parameterized by a measure of the size/shape of the connected component. This is where my notes get messy though, so you might want to look at the paper if it sounds interesting to you…

Hypercontractivity in Hamming Space
Yury Polyanskiy
I’ve hypercontractivity before, and Yury talked about his paper on ArXiV, which is about functions on the binary hypercube. This talk felt more like a tour of results on hypercontractivity and less like a “here is my new result” talk, which I actually appreciated because I felt it tied together ideas well and made me realize how strange the hypercontractivity parameter of an operator is.

# Detection and Estimation: book recommendations?

It’s confirmed that I will be teaching Detection and Estimation next semester so I figured I would use the blog to conjure up some book recommendations (or even debate, if I can be so hopeful). Some of the contenders:

• Steven M. Kay, Fundamentals of Statistical Signal Processing – Estimation Theory (Vol. 1), Prentice Hall, 1993.
• H. Vincent Poor, An Introduction to Signal Detection and Estimation, 2nd Edition, Springer, 1998.
• Harry L. Van Trees, Detection, Estimation, and Modulation Theory (in 4 parts), Wiley, 2001 (a reprint).
• M.D. Srinath, P.K. Rajasekaran, P. K. and R. Viswanathan, Introduction to Statistical Signal Processing with Applications, Prentice Hall, 1996.

Detection and estimation is a fundamental class for the ECE graduate curriculum, but these “standard” textbooks are around 20 years old, and I can’t help but think there might be more “modern” take on the subject (no I’m not volunteering). Venu Veeravalli‘s class doesn’t use a book, but just has notes. However, I think the students at Rutgers (majority MS students) would benefit from a textbook, at least as a grounding.

Srinath et al. is what my colleague Narayan Mandyam uses. Kay is what I was leaning to before (because it seems to be the most widely used), but Poor’s book is the one I read. I think I am putting up the Van Trees as a joke, mostly. I mean, it’s a great book but I think a bit much for a textbook. So what do the rest of you use? Also, if you are teaching this course next semester, perhaps we can share some ideas. I think the curriculum might be ripe for some shaking up. If not in core material, at least in the kinds of examples we use. For example, I’m certainly going to cover differential privacy as a connection to hypothesis testing.

# MAP and ML in practice on the New Jersey Turnpike

Since I might be teaching detection and estimation next semester, I’ve been thinking a little bit about decision rules during my commute down the New Jersey Turnpike. The following question came to mind:

Suppose you see a car on the Turnpike who is clearly driving dangerously (weaving between cars, going 90+ MPH, tailgating an ambulance, and the like). You have to decide whether the car has New Jersey or New York plates [*]?

This is a hypothesis testing problem. I will assume for simplicity that New York drivers have cars with New York plates and New Jersey drivers have New Jersey plates [**]: $H_0$: New Jersey driver $H_1$: New York driver
Let $Y$ be a binary variable indicating whether or not I observe dangerous driving behavior. Based on my entirely subjective experience, I would say the in terms of likelihoods, $\mathbb{P}(Y = 1 | H_1) > \mathbb{P}(Y = 1 | H_0)$
so the maximum likelihood (ML) rule would suggest that the driver is from New York.

However, if I take into account my (also entirely subjective) priors on the fraction of drivers $P(H_0), P_H(1)$ from New Jersey and New York, respectively, I would have to say $\mathbb{P}(Y = 1 | H_1) P(H_1) < \mathbb{P}(Y = 1 | H_0) P(H_0)$
so the maximum a-posteriori probability (MAP) rule would suggest that the driver is from New Jersey.

Which is better?

[*] I am assuming North Jersey here, so Pennsylvania plates are negligible.
[**] This may be a questionable modeling assumption given suburban demographics.

# Postdoctoral position at University of Michigan

A postdoctoral position is available at the University of Michigan Electrical Engineering and Computer Science Department for a project related to anomaly detection in networked cyber-physical systems. The successful applicant will have knowledge in one or more of the following topics: convex optimization and relaxations, compressed sensing, distributed optimization, submodularity, control and dynamical systems or system identification. The project will cover both theory and algorithm development and some practical applications in fault and attack detection in transportation and energy networks. The position can start anytime in 2014 or early 2015. This is a one year position, renewable for a second year. Interested candidates should contact Necmiye Ozay at necmiye@umich.edu with a CV and some pointers to representative publications.

# Allerton 2014: privately charging your information odometer while realizing your group has pretenders

Here are a few papers that I saw at Allerton — more to come later.

Group Testing with Unreliable Elements
Arya Mazumdar, Soheil Mohajer
This was a generalization of the group testing problem: items are either positive or null, and can be tested in groups such that if any element of the group is positive, the whole group will test positive. The item states can be thought of as a binary column vector $\mathbf{x}$ and each test is the row or a matrix $A$: the $(i,j)$-the entry of $A$ is 1 if the $i$-th item is part of the $j$-th group. The multiplication $A \mathbf{x}$ is taken using Boolean OR. The twist in this paper is that they consider a situation where $\tau$ elements can “pretend” to be positive in each test, with a possibly different group in each test. This is different than “noisy group testing” which was considered previously, which is more like $A \mathbf{x} + \mathrm{noise}$. They show achievable rates for detecting the positives using random coding methods (i.e. a random $A$ matrix). There was some stuff at the end about the non-i.i.d. case but my notes were sketchy at that point.

The Minimal Realization Problems for Hidden Markov Models
Qingqing Huang, Munther Dahleh, Rong Ge, Sham Kakade
The realization problem for an HMM is this: given the exact joint probability distribution on length $N$ strings $(Y_1, Y_2, \ldots, Y_N)$ from an HMM, can we create a multilinear system whose outputs have the same joint distribution? A multilinear system looks like this: $w_{t+1} = A^{(\ell_t)} w_t$ for $w_t \in \mathbb{R}^k$ with $w_0 = v$ $z_{t} = u^t w_t$
We think of $A^{(\ell_t)} \in \mathbb{R}^{k \times k}$ as the transition for state $\ell_t$ (e.g. a stochastic matrix) and we want the output to be $z_t$. This is sort of at the nexus of control/Markov chains and HMMs and uses some of the tensor ideas that are getting hot in machine learning. As the abstract puts it, the results are that they can efficiently construct realizations if $N > max(4 log_d(k) + 1; 3)$, where $d$ is the size of the output alphabet size, and $k$ is the minimal order of the realization.”

Differentially Private Distributed Protocol for Electric Vehicle Charging
Shuo Han, Ufuk Topcu, George Pappas
This paper was about a central aggregator trying to get multiple electric vehicle users to report their charging needs. The central allocator has a utility maximization problem over the rates of charging: $\min_{r_i} U(\sum r_i)$ $\mathrm{s.t.}\ \le r_i \le \bar{r}_i$ $\qquad \mathbf{1}^{\top} r_i = z_i$
They focus on the case where the charge demands $z_i$ are private but the charging rates $r_i$ can be public. This is in contrast to the mechanism design literature popularized by Aaron Roth and collaborators. They do an analysis of a projected gradient descent procedure with Laplace noise added for differential privacy. It’s a stochastic gradient method but seems to converge in just a few time steps — clearly the number of iterations may impact the privacy parameter, but for the examples they showed it was only a handful of time steps.

An Interactive Information Odometer and Applications
Mark Braverman and Omri Weinstein
This was a communication complexity talk about trying to maintain an estimate of the information complexity of a protocol $\Pi$ for computing a function $f(X,Y)$ interactively, where Alice has $X$, Bob has $Y$, and $(X,Y) \sim \mu$: $\mathsf{IC}_{\mu}(\Pi) = I( \Pi ; X | Y ) + I( \Pi ; Y | X)$
Previous work has shown that the (normalized) communication complexity of computing $f$ on $n$-tuples of variables $\epsilon$-accurately approaches the minimum information complexity over all protocols for computing $f$ once $\epsilon$-accurately. At least I think this is what was shown previously — it’s not quite my area. The result in this paper is a strong converse for this — the goal is to maintain an online estimate of $\mathsf{IC}$ during the protocol. The mechanism seeems a bit like communication with feedback a la Horstein, but I got a bit confused as to what was going on — basically it seems that Alice and Bob need to be able to agree on the estimate during the protocol and use a few extra bits added to the communication to maintain this estimate. If I had an extra $n$ hours in the week I would read up more about this. Maybe on my next plane ride…

# Greetings from Allerton 2014

This is not a weeping angel. I hope. “In the battle of ideas for metaphors for explaining these phenomena, graphs are doing pretty well for themselves.” — Jon Kleinberg (at the plenary).

Greetings from Allerton! I know blogging has been light since the semester started. I chalk it up to the whole “starting as an assistant professor is time-consuming” thing. I really hope there isn’t a strong converse because I definitely feel like I am operating above capacity.

Regardless, posts to continue again soon. There have been lots of interesting talks here, and lots to follow up on, time permitting.