Concert bleg : White Horses

As a bit of a break from ISIT blogging, I am singing in this concert at the end of the week — the repertoire is a bit different from my normal fare of “high classical,” so if you’re in the area and like Bono more than Bach, this might be more up your alley.

WHITE HORSES: Heroes & Hope
Saturday June 26th, 2010
8 p.m.
Copley Auditorium, San Diego Museum of Art
$20 General Admission ($15 in advance) / $12 Students & Museum Members / $10 Seniors & Military

Following the success of their sold-out concert in November of 2009, SACRA/PROFANA returns to the San Diego Museum of Art with a program exploring the unique synergy of poetry, art and music. In conjunction with the Museum’s exhibition Heroes: Mortals and Myths, this genre-bending vocal ensemble will perform modern musical settings of timeless verse by such beloved poets as e.e. cummings, Emily Dickinson and John Keats. The poignant words of the great poets are given new life by the dynamic voices of modern composers, including Minnesota composer Joshua Shank, winner of the 2009 SACRA/PROFANA Choral Composition Contest.

ISIT 2010: Bethe Entropy and the Edge Zeta Function

Connecting the Bethe Entropy and the Edge Zeta Function of a Cycle Code
by Pascal Vontobel

Pascal talked about a connection between the Bethe entropy and the edge zeta function of a cycle code, building on previous work by Koetter, Li, Vontobel, and Walker. Starting with the first quantity: as is well known, Yedidia, Freeman, and Weiss established that sum-product (if it converges) can only converge to stationary points of the Bethe free energy. The Bethe entropy is the non-linear term added to the ‘average energy’ (the linear objective optimized by the LP decoder), and is therefore a fundamental quantity for understanding sum-product and its connections to LP relaxations. A cycle code is a linear code in which every variable has degree two (so variables can be represented as edges of a graph).
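
For reference (this is just the standard textbook form of the Bethe free energy, not anything specific to the paper), the free energy splits into an average energy term and the Bethe entropy:

F_{\mathrm{Bethe}}(b) = U_{\mathrm{Bethe}}(b) - H_{\mathrm{Bethe}}(b), \qquad H_{\mathrm{Bethe}}(b) = \sum_{a} H(b_a) - \sum_{i} (d_i - 1) H(b_i)

where the b_a are factor (check) beliefs, the b_i are variable beliefs, and d_i is the degree of variable i; for a cycle code every d_i = 2.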

Now what is the second quantity in this connection, the edge zeta function of a cycle code? As far as I understood (and this is probably inaccurate), for each edge (=variable of the code) define an indeterminate quantity U_{e}.

For a primitive cycle c_1 of this graph (no backtracking allowed), consisting of edges e_1, e_2, \cdots, define the monomial \gamma(c_1) = U_{e_1} U_{e_2} \cdots, the product of the U_e over all the edges of the cycle.

The edge zeta function is the product, over all such primitive cycles, of these monomials inverted: \prod_{c_i} \frac{1}{1-\gamma(c_i)}.

The edge zeta function can therefore be evaluated at every point of the fundamental polytope by setting the U_{e} appropriately. Previous work by Koetter, Li, Vontobel, and Walker established that each monomial in the Taylor expansion of this function corresponds to an (unscaled) pseudo-codeword of the cycle code. In particular, the powers appearing for each edge (=variable of the code) correspond to re-scaled pseudo-codeword coordinates. As far as I understood, this paper shows that the coefficients of the monomials associated with a given pseudo-codeword have a meaning: their asymptotic growth rate equals the directional derivative of the (induced) Bethe entropy in the direction of that pseudo-codeword.
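
A toy sanity check (my own example, and glossing over the fact that each cycle can be traversed in two directions): for the cycle code whose graph is a single triangle with edges e_1, e_2, e_3, there is essentially one primitive cycle, so

\zeta(U) = \frac{1}{1 - U_{e_1} U_{e_2} U_{e_3}} = \sum_{k \ge 0} (U_{e_1} U_{e_2} U_{e_3})^k

and the exponent vectors (k, k, k) are scalings of the all-ones codeword 111, which is consistent with the pseudo-codewords of this code being multiples of 111.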

I was wondering about connections to the self-avoiding walks of Dror Weitz [Counting independent sets up to the tree threshold] (also Local approximate inference algorithms), and about how random walks relate to the weights in the Arora, Daskalakis, and Steurer paper on message-passing algorithms and improved LP decoding.

ISIT 2010 : gossip and consensus

THE MISSING PIECE SYNDROME IN PEER-TO-PEER COMMUNICATION (Bruce Hajek, Ji Zhu; University of Illinois at Urbana-Champaign)
This paper proposes a model for peer-to-peer content distribution in a BitTorrent-like setup where there is a seed node and everybody wants to get K pieces of a file held by the seed. Users arrive according to a Poisson process and peers randomly collect and transfer (instantaneously) one piece. The paper provides a queueing-based stability analysis for this system. It’s a cool model, and the talk had some rather amusing moments for those who were there…

WEIGHTED GOSSIP: DISTRIBUTED AVERAGING USING NON-DOUBLY STOCHASTIC MATRICES (Florence Bénézit; Ecole Normale Supérieure-INRIA, Vincent Blondel; UC Louvain, Patrick Thiran; Ecole polytechnique fédérale de Lausanne, John Tsitsiklis; Massachusetts Institute of Technology, Martin Vetterli; Ecole polytechnique fédérale de Lausanne)
Florence presented convergence results for an algorithm based on one-way path averaging. Inspired by the Push-Sum protocol of Kempe et al., she described a one-way method in which a node “gives away” a fraction of its estimate and pushes it along a random direction in the network. The receiving node takes some of the value and passes the rest along; it’s kind of like passing a plate of food around a table while keeping a little (or a lot) for yourself. It’s a cool algorithm, and it works really well in experiments. However, the rate of convergence is still an open question — it seems related to the convergence of non-homogeneous Markov chains.
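
To make the bookkeeping concrete, here is a minimal sketch (my own simplification, not the authors’ protocol) of the push-sum-style update along one path: each node keeps a (sum, weight) pair, holds on to a fraction of the mass it sees, and forwards the rest; the local estimates are the ratios s[i] / w[i].

    def weighted_gossip_along_path(s, w, path, keep=0.5):
        # One one-way averaging step along a list of node indices `path`.
        # Each node keeps a fraction `keep` of the (sum, weight) mass it
        # sees and forwards the rest; the last node absorbs the leftover.
        # Both sum(s) and sum(w) are conserved by this update.
        carry_s, carry_w = 0.0, 0.0
        for i in path:
            total_s, total_w = s[i] + carry_s, w[i] + carry_w
            s[i], w[i] = keep * total_s, keep * total_w
            carry_s, carry_w = (1 - keep) * total_s, (1 - keep) * total_w
        s[path[-1]] += carry_s
        w[path[-1]] += carry_w

    # Usage: start with s[i] = x[i] and w[i] = 1.0 for each node, call this
    # repeatedly on random paths; each local estimate s[i] / w[i] then
    # tracks the network average of the x[i].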

TIGHT BOUNDS FOR ALGEBRAIC GOSSIP ON GRAPHS (Michael Borokhovich, Chen Avin, Zvi Lotker; Ben Gurion University of the Negev)
This paper was more discrete in nature. There are n nodes in a network and each has a value in a finite field. They pass linear combinations of their symbols around, and the goal is for every node to learn all the information, or equivalently to gather a full-rank set of equations. Nodes can communicate according to a graph structure — they presented upper and lower bounds of order n d_{\max}, where d_{\max} is the maximum degree in the graph. They also showed that the barbell graph is very, very slow.
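
As a concrete (if simplistic) illustration of the setup, here is a toy GF(2) simulation of algebraic gossip; this is my own sketch of the general idea, not the paper’s exact model.

    import random

    def algebraic_gossip_gf2(adj, rng=random.Random(0)):
        # Toy simulation: node i starts with the unit vector e_i (stored as
        # the bitmask 1 << i).  In each step a uniformly random node sends a
        # random GF(2) combination of the coded packets it holds to a
        # uniformly random neighbor.  Returns the number of steps until
        # every node holds a full-rank set of packets and can decode.
        # adj maps each node to the list of its neighbors (connected graph).
        nodes = list(adj)
        n = len(nodes)

        def insert(pivots, vec):
            # maintain a GF(2) basis keyed by leading bit; zero or linearly
            # dependent packets are silently discarded
            while vec:
                lead = vec.bit_length() - 1
                if lead not in pivots:
                    pivots[lead] = vec
                    return
                vec ^= pivots[lead]

        knowledge = {v: {} for v in nodes}
        for i, v in enumerate(nodes):
            insert(knowledge[v], 1 << i)

        steps = 0
        while any(len(knowledge[v]) < n for v in nodes):
            steps += 1
            sender = rng.choice(nodes)
            receiver = rng.choice(adj[sender])
            packet = 0
            for vec in knowledge[sender].values():
                if rng.random() < 0.5:
                    packet ^= vec
            insert(knowledge[receiver], packet)
        return steps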

DISTRIBUTED CONSENSUS WITH FINITE MESSAGING (Debashis Dash, Ashutosh Sabharwal; Rice University)
This was on distributed vertex coloring in which each node gets to know something about the colors in its local neighborhood. This is hard in general (which they prove), but the authors allow themselves a little slack: rather than forcing the number of defects (nodes with an adjacent node of the same color) to zero, they try to minimize it. A number of algorithms were presented, many of them based on an initial random assignment followed by a refinement step using the local information.
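
Just to illustrate the flavor of “random assignment plus local refinement” (a generic sketch of my own, ignoring the finite-messaging constraint and not matching any of the paper’s algorithms):

    import random

    def random_then_refine(adj, n_colors, rounds=5, seed=0):
        # adj[v] lists the neighbors of v (each undirected edge appears in
        # both directions).  Start from a uniformly random coloring, then
        # let each node switch to the color least used among its neighbors.
        # Returns the coloring and the number of defective edges.
        rng = random.Random(seed)
        color = {v: rng.randrange(n_colors) for v in adj}
        for _ in range(rounds):
            for v in adj:
                counts = [0] * n_colors
                for u in adj[v]:
                    counts[color[u]] += 1
                color[v] = min(range(n_colors), key=counts.__getitem__)
        defects = sum(color[u] == color[v] for v in adj for u in adj[v]) // 2
        return color, defects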

A NEAR-OPTIMAL ALGORITHM FOR NETWORK-CONSTRAINED AVERAGING WITH NOISY LINKS (Nima Noorshams, Martin J. Wainwright; University of California, Berkeley)
This paper was essentially about packing routes in a “gossip along the way” paradigm — if a node wakes up and starts a path (say horizontally), it can also send a message vertically to trigger path-averaging along parallel paths. This gives a two-phase algorithm and the number of rounds ends up looking like the diameter of the graph. However, the number of one-hop messages scales in the same way. Thus the gain is through parallelization.

ISIT 2010 : some talks on statistics and probability

For many of the talks I attended I didn’t take notes — partly this is because I didn’t feel expert enough to note things down correctly, and partly because I am

RÉNYI DIVERGENCE AND MAJORIZATION (Tim van Erven; Centrum Wiskunde & Informatica, Peter Harremoës; Copenhagen Business College)
Peter gave a talk on some properties of the Rényi divergence

D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \int p^{\alpha} q^{1 - \alpha} d \mu

where \mu is a dominating measure, and a related geometric quantity, the Lorenz diagram. This is the set of all points in \mathbb{R}^2 of the form (\int f dP, \int f dQ) for some measurable function f taking values in [0,1]. It turns out that subset relationships of Lorenz diagrams are the same as majorization relationships between the measures. This is related to something called Markov ordering from a recent paper by Gorban, Gorban, and Judge. This was one of those nice talks which pointed me towards some results and tools that are new to me but that I could not fully understand during the talk.
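
Two standard special cases of D_{\alpha}, just for orientation (these are textbook facts, not specific to the talk): the limit \alpha \to 1 recovers the Kullback-Leibler divergence, and \alpha = 1/2 gives D_{1/2}(P \| Q) = -2 \log \int \sqrt{p q} \, d\mu, twice the negative log of the Bhattacharyya affinity.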

MINIMAX LOWER BOUNDS VIA F-DIVERGENCES (Adityanand Guntuboyina; Yale University)
This talk was on minimax lower bounds for parameter estimation that can be calculated in terms of generalized f-divergences, which are of the form

D_f(P \| Q) = \int f \left( \frac{dP}{dQ} \right) dQ
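
For example (a standard instance, not something specific to the talk), taking f(t) = t \log t recovers the usual Kullback-Leibler divergence:

D_f(P \| Q) = \int \frac{dP}{dQ} \log \frac{dP}{dQ} \, dQ = D(P \| Q)

and other common choices of convex f give the total variation, chi-squared, and squared Hellinger distances.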

Suppose T is an estimator for some parameter \theta. We want universal (over T) lower bounds on \mathbb{P}(T(X^n) \ne \theta). This gives a generalization of Fano’s inequality:

\mathbb{P}(T(X^n) \ne \theta) \ge \inf_Q \sum_{\theta} D_f(P_{\theta} \| Q)

This has a very geometric feel to it but I had a bit of a hard time following all of the material since I didn’t know the related literature very well (much of it was given in relation to Yang and Barron’s 1999 paper).

MUTUAL INFORMATION SADDLE POINTS IN CHANNELS OF EXPONENTIAL FAMILY TYPE (Todd Coleman; University of Illinois, Maxim Raginsky; Duke University)
In some cases where we have a class of channels and a class of channel inputs, there is a saddle point. The main example is that Gaussian noise is the worst noise for a given power constraint and Gaussian inputs are the best inputs. Another example is the “bits through queues” channel. This paper considered a more general class of channels of “exponential family type” and gave a general condition under which you get a saddle point for the mutual information. The channel models are related to the Gibbs variational principle, and the arguments had a kind of “free energy” interpretation. Ah, statistical physics rears its head again.
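
For context, the standard Gibbs variational principle (in its textbook form, not necessarily the paper’s notation) says that for an exponential family with sufficient statistic \phi and natural parameter \theta,

\log Z(\theta) = \sup_{P} \left\{ \mathbb{E}_P[\langle \theta, \phi(X) \rangle] + H(P) \right\}

with the supremum attained by the exponential family distribution itself, which is where the “free energy” flavor of the arguments comes from.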

INFORMATION-THEORETIC BOUNDS ON MODEL SELECTION FOR GAUSSIAN MARKOV RANDOM FIELDS (Wei Wang, Martin J. Wainwright, Kannan Ramchandran; University of California, Berkeley)
This was on two problems: inferring edges in a graphical model from iid samples from that model, and inverse covariance estimation. They are similar, but not quite the same. The goal was to prove necessary conditions for doing this; these necessary conditions match the corresponding achievable rates from polytime algorithms. The main result was that you need n = \Omega( d^2 \log p) samples, where p is the number of variables, and d is the max degree in the graphical model. The proof approach was through modeling graph selection as channel coding in which the channel outputs samples from a graphical model chosen by the encoder. If the decoder can identify the model then the problem is solvable. This is analogous to the talk Sriram Vishwanath gave today on matrix completion.
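
Schematically (my gloss, not the paper’s actual statement), the channel-coding view runs a Fano-type argument: if there are |\mathcal{G}| candidate graphs and the samples are i.i.d. given the graph, then reliable recovery requires roughly

n \, I(X_1; G) \gtrsim \log |\mathcal{G}|

so the sample complexity is governed by how many d-sparse graphs there are relative to how much information a single sample carries about the graph.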

ISIT 2010 : Anthony Ephremides

“The first phase [of life] you believe in Santa Claus, the second you don’t believe in Santa Claus, and in the third you become Santa Claus.” — Tony Ephremides

Prof. Anthony Ephremides gave the third plenary at ISIT, which included an interim report on the consummation between information theory and network coding and many connections between opera and research. I think he did a great job of explaining the difference between the throughput region, the stability region, and the capacity region (under bursty vs. non-bursty use); these are in increasing order of inclusion. Some interesting tidbits (some new, some not):

  • He had a complaint about the way the networking community handles fading: the probability that a packet is successfully received is simply taken to be \mathbb{P}(SNR > \gamma).
  • Under contention, the capacity region may not be convex, unlike in information theory where you can do time sharing.
  • For wireless network coding it is important to connect the MAC protocol and scheduling issues as well as change the notion of cut capacities. That is, you shouldn’t replace edges with hyper-edges, because that’s not a good model.
  • The information rate is the rate from the payload plus the information in the idleness and the information from the identity of the transmitter. That is, you get data from the actual packet, when the packet was sent, and who is sending the packet.
  • Extending many analyses to more than 2 users has not been done.
  • Can bursty traffic capacity be larger than non-bursty? The NNN (*) community says that would be a contradiction, because you can always then emulate bursty traffic. But this is unfair because idling the transmitter to build up packets and hence create artificial bursts is not allowed in the problem formulation, so there is no contradiction.
  • Relaying is good from a network perspective because it can partially enable a first-come first-served (FCFS) discipline. So relays bridge a gap to the optimal scheduling policy. This is different from the information theory notion of cooperation yielding diversity gain.
  • Multicast throughput is an interesting thing to think about more rigorously for the future.
  • His prescription: the information theory community has to apply its tools more broadly to networking problems and to unorthodox problems. The networking community should use more rigorous modeling and analysis methods.

(*) NNN = Nagging nabobs of negativism (really it should be “nattering nabobs of negativism,” Safire’s famous alliterative phrase wielded with such gusto by Spiro Agnew).

ISIT 2010 : Michael Jordan

Perhaps malapropos for the NBA Finals, Prof. Michael Jordan gave the first plenary talk at ISIT. It was a great overview of nonparametric Bayesian modeling. In particular, he covered his favorite Chinese restaurant process (together with the related Pitman-Yor and stick-breaking constructions), hierarchical Dirichlet priors, and all the other jargon-laden elements of modeling. At the end he covered some of the rather stunning successes of this approach in applications with lots of data to learn from. What was missing for me was a sense of how these approaches work in the data-poor regime, so I asked a question (foolishly) about sample complexity. Alas, since that is a “frequentist” question and Jordan is a “Bayesian,” I didn’t quite get the answer to the question I was trying to ask, but that’s what happens when you don’t phrase things properly.
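
For readers like me who need to see these constructions written down, here is a minimal sketch of the stick-breaking representation of a Dirichlet process (the standard truncated version, nothing specific to the talk):

    import numpy as np

    def stick_breaking_weights(alpha, n_atoms, seed=0):
        # Truncated stick-breaking (GEM) weights for a Dirichlet process
        # with concentration parameter alpha: break off a Beta(1, alpha)
        # fraction of whatever is left of the stick for each atom in turn.
        rng = np.random.default_rng(seed)
        betas = rng.beta(1.0, alpha, size=n_atoms)
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
        return betas * remaining  # sums to just under 1 for finite n_atoms

    # Pairing these weights with atoms drawn i.i.d. from a base measure
    # gives a (truncated) draw from the corresponding Dirichlet process.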

One nice thing that I learned was the connection to Kingman’s earlier work on characterizing random measures via non-homogeneous Poisson processes. Kingman has been popping up all over the place in my reading, from urn processes to exchangeable partition processes (also known as paintbox processes). When I get back to SD, it will be back to the classics for me!