Broadcast gossip revisited : companion variables and convergence to consensus

Mike Rabbat pointed out his new preprint on ArXiv:

Broadcast Gossip Algorithms for Consensus on Strongly Connected Digraphs
Wu Shaochuan, Michael G. Rabbat

The basic starting point is the unidirectional Broadcast Gossip algorithm — we have n nodes in a graph and each node i starts with a value x_i(0). At each time, a random node broadcasts its value to its neighbors in the graph and they each update their value with the weighted average of their current value and the received value. Eventually, this process converges almost surely and all nodes i will have a common value \hat{x}. However, in general \hat{x} \ne \bar{x}, where \bar{x} = \frac{1}{n} \sum x_i(0), but \mathbb{E}[\hat{x}] = \bar{x}. So this process achieves consensus but only computes the average in expectation.

The reason broadcast gossip does’t compute the average is pretty clear — since the communication and updates are unidirectional, the average of the nodes’ values changes over time. One way around this is to use companion variables \{y_i(t)\} to track the effect of the broadcast. These have been studied before, but in this paper they set the initial values y_i(0) = 0 and perform updates as follows : if node k broadcasts at time t and node j is a neighbor, then

x_j(t+1) = (1 - a_{j,k}) x_j(t) + a_{j,k} x_k(t) + \epsilon d_j^{(k)} y_j(t)
y_j(t+1) = a_{j,k} (x_j(t) - x_k(t)) + (1 - \epsilon d_j^{(k)}) y_j(t) + b_{j,k} y_k(t)
x_k(t+1) = x_k(t)
y_j(t+1) = 0

So what’s going on here? Each time a node transmits it resets its companion variable to 0. Each time it receives a broadcast it accumulates an offset in y_j. The actual value estimate at the nodes is weighted average of the current and received value plus some of the local offset.

The parameters of the algorithm are matrices A = (a_{j,k}) and B = (b_{j,k}) and per-node matrices D_k = (d_j^{(k)}). Let

A_k = A e_k e_k^T
B_j = B e_k e_k^T
L_k = \mathrm{diag}(A_k \mathbf{1}) - A_k
S_k = I - e_k e_k^T + B_k

When node k broadcasts) we can write a matrix

W_k \left[ \begin{array}{cc} I - L_k & \epsilon D_k \\ L_k & S_k - \epsilon D_k \end{array} \right]

such that

\left[ \begin{array}{c} \mathbf{x}(t+1) \\ \mathbf{y}(t+1) \end{array} \right]^T = W(t) \left[ \begin{array}{c} \mathbf{x}(t) \\ \mathbf{y}(t) \end{array} \right]^T

where W(t) = W_k if node k broadcasts at time $t$.

The main results are:

  1. If we choose \epsilon small enough, then \lim_{t \to \infty} \mathbb{E}[ \mathbf{x}(t) | \mathbf{x}(0) ] = \hat{x} \mathbf{1} and \hat{x} = \bar{x} under more conditions.
  2. A sufficient condition for the second moment of the error to go to 0.
  3. Rules on how to pick the parameters, especially \epsilon.

There are also some nice simulations. Broadcast gossip is nice because of its simplicity, and adding a little monitoring/control variable y_j(t) seems to buy a lot in terms of controlling bad sample paths for the process.

Bellairs Workshop 2012

The beach at Bellairs

The beach at Bellairs

I am spending the week at the Bellairs Research Institute in Holetown, Barbados. McGill owns this facility and faculty organize informal workshops throughout the year on various topics. There are two going on right now — one on control theory approaches in computer animation, and out workshop on signal processing in networks. The McGill side is Mark Coates and Mike Rabbat and a number of their students, both masters and PhD. Anna Scaglione and Angelia Nedich arrived recently.

The format of the workshop has been a mix of tutorials and student presentations and plenty of time for discussion and some problem formation. And of course, for the beach, which is just behind the research facility. Holetown is on the west coast of Barbados, and the Caribbean is warm and inviting. I’m having a great time, even though I am falling behind on my other projects and email and whatnot.

People from Barbados call themselves Bajans (‎/ˈbeɪdʒənz/), so one should be careful not to discuss p-values or t-tests around them.

IEEE page charges for Open Access

I just got an email saying my page proofs are ready for my paper with Alex Dimakis on mobility in gossip algorithms. If I want to make the paper open access, I have to shell out $3000. I think this is in addition to the $110 per page “voluntary” page charges. Now, I’m on the record as being a fan of Open Access, but $3k is a pretty hefty chunk of change! Has anyone else had experience with this?

ISIT 2010 : gossip and consensus

THE MISSING PIECE SYNDROME IN PEER-TO-PEER COMMUNICATION (Bruce Hajek, Ji Zhu; University of Illinois at Urbana Champaign)
This paper proposes a model for peer-to-peer content distribution in a Bit-Torrent-like setup where there is a seed node and everybody wants to get K pieces of a file held by the seed. Users arrive according to a Poisson process and peers randomly collect and transfer (instantaneously) one piece. The paper provides a stability analysis for this system based on queueing. It’s a cool model, and the talk had some rather amusing moments for those who were there…

WEIGHTED GOSSIP: DISTRIBUTED AVERAGING USING NON-DOUBLY STOCHASTIC MATRICES (Florence Bénézit; Ecole Normale Supérieure-INRIA, Vincent Blondel; UC Louvain, Patrick Thiran; Ecole polytechnique fédérale de Lausanne, John Tsitsiklis; Massachusetts Institute of Technology, Martin Vetterli; Ecole polytechnique fédérale de Lausanne)
Florence presented convergence results for an algorithm based on one-way path averaging. Inspired by the Push-Sum protocol of Kempe et al., she described a one-way method in which a node “gives away” a fraction of its estimate and pushes it along a random direction in the network. The receiving node takes some of the value and passes the rest along — It’s kind of like passing a plate of food around a table while keeping a little (or a lot) for yourself. It’s a cool algorithm, and it works really well in experiments. However, the rate of convergence is still an open question — it seems related to the convergence of non-homogeneous Markov chains.

TIGHT BOUNDS FOR ALGEBRAIC GOSSIP ON GRAPHS (Michael Borokhovich, Chen Avin, Zvi Lotker; Ben Gurion University of the Negev)
This paper was more discrete in nature. There are n nodes in a network and each has a value in a finite field. They pass linear combinations of their symbols around. The goal for every node to learn all the information, or equivalently to gather a full-rank set of equations. Nodes can communicate according to a graph structure — they presented upper and lower bounds of n d_{\max} where d_{\max} is the maximum degree in the graph. They also showed the barbell graph is very very slow.

DISTRIBUTED CONSENSUS WITH FINITE MESSAGING (Debashis Dash, Ashutosh Sabharwal; Rice University)
This was on distributed vertex coloring in which each node gets to know something about the colors in its local neighborhood. This is a bit tough (which they prove), but the authors allow themselves a little slack in that they want to minimize the number of defects (nodes with an adjacent node of the same color), rather than make it $0$. A number of algorithms were presented, many of them based on an initial random assignment followed by a refinement step using the local information.

A NEAR-OPTIMAL ALGORITHM FOR NETWORK-CONSTRAINED AVERAGING WITH NOISY LINKS (Nima Noorshams, Martin J. Wainwright; University of California, Berkeley)
This paper was essentially about packing routes in a “gossip along the way” paradigm — if a node wakes up and starts a path (say horizontally), it can also send a message vertically to trigger path-averaging along parallel paths. This gives a two-phase algorithm and the number of rounds ends up looking like the diameter of the graph. However, the number of one-hop messages scales in the same way. Thus the gain is through parallelization.

giving no credit where it is not due

Luca pointed to a paper by Chierichetti, Lattanzi, and Panconesi, which has an amusing comment in the last section (I don’t want to spoil it).

The paper itself is interesting, of course. Conductance often appears in bounds on mixing times for Markov chains, but the rumor spreading problem is a bit different than the consensus problems that I have studied in the past. A nice quote from the introduction:

Our long term goal is to characterize a set of necessary and/or suffcient conditions for rumour spreading to be fast in a given network. In this work, we provide a very general suffcient condition — high conductance. Our main motivation comes from the study of social networks. Loosely stated, we are looking after a theorem of the form “Rumour spreading is fast in social networks”. Our result is a good step in this direction because there are reasons to believe that social networks have high conductance.

Lazy random walks on the torus via Dirichlet forms

One of the simple example graphs I’ve used in some of my research on gossip algorithms has been the 2-dimensional torus with n vertices, which looks like a \sqrt{n} \times \sqrt{n} grid with the top and left edges wrapped around to connect with the bottom and right edges. Every vertex has 4 neighbors. Now imagine a very lazy random walk on this graph in which a random walker moves from vertex u to one of its neighbors with probability 1/n. It’s “well known” that this random walk takes around n^2 steps to mix. That is, if P is the matrix of transition probabilities then

T_{r} = \frac{1}{1 - \lambda_2(P)} \approx n^2

Here \lambda_2(P) is the second largest eigenvalue of P and the relaxation time T_{r} is the inverse of the spectral gap of the matrix. One way of characterizing T_{r} for reversible Markov chains is via the Dirichlet form. For a function f : V \to \mathbb{R} on the states of the chain define the Dirichlet form D(f,f) by

D(f,f) = \frac{1}{2} \sum_{u,v \in V} \pi_u P_{uv} (f(u) - f(v))^2

In our example the stationary distribution pi_u = 1/n and P_{uv} = 1/n for all edges in the graph. We write f \perp \pi if

\sum_{v \in V} \pi_v f(v) = 0

Define a norm associated with \pi via

\|f\|^2 = \sum_{v \in V} \pi_i f(v)^2

Then the characterization is

T_{r} = \sup_{f \perp \pi, f \ne 0} \frac{ \|f\|^2 }{ D(f,f) }

One question I asked myself today was whether it was “easy” to see what f you should choose in the grid example to get the scaling of n^2. Here’s one choice that gives the correct scaling. We’ll set f to be constant on each column. Assume without loss of generality that \sqrt{n} is divisible by 4 and set m = \sqrt{n}/4. The values for f on the columns will be like two triangles:

\{\frac{1}{n}, \frac{2}{n}, \ldots, \frac{m}{n},  \frac{m}{n}, \frac{m-1}{n}, \ldots, \frac{1}{n}, \frac{-1}{n}, \frac{-2}{n}, \ldots, \frac{-m}{n}, \frac{-m}{n}, \frac{-m+1}{n}, \ldots, \frac{-1}{n} \}

Now we can evaluate the norm, noting that there are \sqrt{n} vertices per column:

\|f\|^2 = 4 \sum_{i =1}^{m} \frac{1}{n} \sqrt{n} \frac{i^2}{n^2} = c \frac{1}{n}

This is because the sum of the first m squares scales like m^3 and m = \sqrt{n}/4. Now turning to the Dirichlet form, note that each difference between columns is at most 2/n and there are fewer than n edges for which f(u) \ne f(v). Thus:

D(f,f) \le \frac{1}{2} n \frac{1}{n} \frac{1}{n} \frac{4}{n^2} = c' \frac{1}{n^3}

Taking the ratio gives the lower bound of T_r \ge c''/n^2 as desired.

The first f I tried was just equal to +1 on the first half of the columns and -1 on the second half of the columns. This ends up giving a suboptimal bound, because the norm \|f\|^2 = 1 but in the denominator we get 2 \sqrt{n} positive terms. The key is to make all the differences (f(u) - f(v))^2 in the denominator small while keeping the average of \|f\|^2 large enough. Even though you sum over n small differences in the denominator, it stays small enough to pay for the \|f\|^2 = c/n in the numerator.

While doing this calculation, I noticed that the book Markov Chains and Mixing Times is also online — it makes a handy reference and is a little easier to use than my old go-to, the Aldous-Fill book.