# more not-so-recent hits from ArXiV

arXiv:1403.4696 [math.OC]
Design and Analysis of Distributed Averaging with Quantized Communication
Mahmoud El Chamie, Ji Liu, Tamer Başar

The goal of this paper is to analyze the “performance of a subclass of deterministic distributed averaging algorithms where the information exchange between neighboring nodes (agents) is subject to uniform quantization.” I was interested in the connections to Lavaei and Murray’s TAC paper. Here though, they consider a standard consensus setup with a doubly stochastic weight matrix $W$ and deterministic, rather than randomized, quantization. They consider two types — rounding and truncation (essentially the floor operation). The update rule is $x_i(t+1) = x_i(t) + \sum_{j} W_{ij} (Q(x_j(t)) - Q(x_i(t)))$ where $Q$ is the quantization operation. They shoe that in finite time the agents either reach a consensus on the floor of the average of their initial values, or that the cycle indefinitely in a neighborhood around the average. They they show how to control the size of the neighborhood in a decentralized way. There are a lot of works on quantized consensus that have appeared in the last 5 years, and to be honest I haven’t really kept up on the recent literature, so I’m not sure how to compare this to the other works that have appeared, but perhaps some of the readers of the blog have…

arXiv:1403.4699 [math.OC]
A Proximal Stochastic Gradient Method with Progressive Variance Reduction
Lin Xiao, Tong Zhang

This paper looks at convex optimization problems of the form

$\min_{x \in \mathbb{R}^d} F(x) + R(x)$

where the overall objective $P(x) + R(x)$ is strongly convex, the regularizer $R(x)$ is lower semicontinuous and convex, and the term $F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ separates into a sum of function $f_i(x)$ which are Lipschitz continuous. The proximal gradient method is an iterative procedure for solving this program that does the following:

$x_{t} = \mathrm{argmin}_{x \in \mathbb{R}^d} \left\{ \nabla F(x_{t})^{\top} x + \frac{1}{2 \eta_t} \| x - x_{t-1} \|^2 + R(x) \right\}$

If we define the $\mathrm{prox}_{R}$ function as

$\mathrm{prox}_R(y) = \mathrm{argmin}_{x \in \mathbb{R}^d} \left\{ \frac{1}{2} \|x - y\|^2 + R(x) \right\}$

then the step looks like:

$x_t = \mathrm{prox}_{\eta_t R}(x_{t-1} - \eta_{t} \nabla F(x_{t-1}))$

A stochastic gradient (SG) version of this is

$x_t = \mathrm{prox}_{\eta_t R}(x_{t-1} - \eta_{t} \nabla f_{i_t}(x_{t-1}))$

where $i_t$ is sampled uniformly from $\{1,2,\ldots, n\}$ at each time. The advantage of the SG variant is that it takes less time to do one iteration, but each iteration is much noisier. The goal of this paper is to adapt a previous method/approach to variance reduction to improve the performance of the Prox-SG algorithm. The approach is one or resampling points according to the Lipschitz constants. This sort of “sampling based adaptivity” was also used by my ex-colleague Samory Kpotufe and collaborators in their NIPS paper from 2012 (a longer version is under review). At least I think they’re related.

arXiv:1403.5341v1 [cs.LG]
An Information-Theoretic Analysis of Thompson Sampling
Daniel Russo, Benjamin Van Roy

In a multi-armed bandit problem we have a set of actions (arms) $A$ and at each time the learner picks an action $a$ and observes an outcome $Y_t(a) \in \mathcal{Y}$ which is assigned a reward by a function $R: \mathcal{Y} \to \mathbb{R}$. The rewards are assumed to be i.i.d. across time for each action with distributions $p(y | a)$ that are unknown to the learner. The goal is to maximize the reward, which is the same as finding the arm with the largest expected reward. This leads to a classical explore/exploit tradeoff where the learner has to decide whether to explore new arms which may have higher expected reward, or continue exploiting the reward offered by the current arm. Thompson sampling is a Bayesian approach where the learner starts with a prior on the best action and then samples actions at each time according to its posterior belief on the best arm. The authors here analyze the regret of such a policy in terms of what they call the information gain of the system. This gain depends on the ratio between two quantities that are functions of the outcome distributions $p(y | a)$. One is what they call the “divergence in mean,” namely the difference in expected reward between arms, and the other is the KL divergence.

# ISIT Blogging, part 3

I’ll round out the end of my ISIT blogging with very brief takes on a few more papers. I took it pretty casually this year in terms of note taking, and while I attended many more talks, my notes for most of them consist of a title and a star next to the ones where I want to look at the paper more closely. That’s probably closer to how most people attend conferences, only they probably use the proceedings book. I actually ended up shredding the large book of abstracts to use as bedding for my vermicompost (I figured they might appreciate eating a little Turkish paper for a change of diet).

On Connectivity Thresholds in Superposition of Random Key Graphs on Random Geometric Graphs
B Santhana Krishnan (Indian Institute of Technology, Bombay, India); Ayalvadi Ganesh (University of Bristol, United Kingdom); D. Manjunath (IIT Bombay, India)
This looked at a model where you have a random geometric graph (RGG) together with a uniformly chosen random subset $S_i$ of $\{ 1, 2, \ldots, P_n\}$ of size $K_n$ at each node. The subset is the set of keys available at each node; two nodes can talk (securely) if they share a key in common. We keep the edge in the RGG is if the link can be secured. The question is whether the secure-link graph is connected. It turns out that the important scaling is in terms of $r_n^2 K_n^2/P_n$, where $r_n$ is the connectivity radius of the RGG. This sort of makes sense, as the threshold is more or less $\Theta(\log n/n)$, so the keys provide a kind of discount factor on effective radius needed for connectivity — if the number of keys per node is small then you need a larger radius to compensate.

Secure Network Coding for Distributed Secret Sharing with Low Communication Cost
Nihar B Shah (University of California at Berkeley, USA); K. v. Rashmi (University of California at Berkeley, USA); Kannan Ramchandran (University of California at Berkeley, USA)
This paper was on secret sharing — a dealer wants to distribute $n$ shares of a secret such that any $k$ of them can be used to reconstruct the secret but $k-1$ or fewer cannot. The idea here is that the dealer has to distribute these shares over the network, which means that if a receiver is not connected directly to the dealer then the share will be passed insecurely through another node. Existing approaches based on pairwise agreement protocols are communication intensive. The idea here is use ideas from network coding to share masked versions of shares so that intermediate nodes will not get valid shares from others. To do this the graph needs to satisfy a particular condition ($k$-propagating), which is defined in the paper. A neat take on the problem, and worth looking at if you’re interested in that sort of thing.

Conditional Equivalence of Random Systems and Indistinguishability Proofs
Ueli Maurer (ETH Zurich, Switzerland)
This was scheduled to be in the same session as my paper with Vinod, but was moved to an earlier session. Maurer’s “programme” as it were, is to think about security via three kinds of systems — real systems with real protocols and pseudorandomness, idealized systems with real protocols but real randomness, and perfect systems which just exist on paper. The first two are trivially indistinguishable from a computational perspective, and the goal is to show that the last two are information-theoretically indistinguishable. This conceptual framework is actually useful for me to separate out the CS and IT sides of the security design question. This paper tried to set up a framework in which there is a distinguisher D which tries to make queries to two systems and based on the answers has to decide if they are different or not. I think if you’re interested in sort of a systems-theoretic take on security you should take a look at this.

Tight Bounds for Universal Compression of Large Alphabets
Jayadev Acharya (University of California, San Diego, USA); Hirakendu Das (University of California San Diego, USA); Ashkan Jafarpour (UCSD, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
The main contribution of this paper was to derive bounds on compression of patterns of sequences over unknown/large alphabets. The main result is that the worst case pattern redundancy for i.i.d. distributions is basically $n^{1/3}$ where $n$ is the blocklength. The main result is a new upper bound which uses some tricks like sampling a random number of points, where the number of samples is Poisson distributed, and a partition of the set of distributions induced by Poisson sampling.

To Surprise and Inform
Lav R. Varshney (IBM Thomas J. Watson Research Center, USA)
Lav talked about communication over a channel where the goal is to communicate subject to a constraint on the Bayesian surprise $s(x) = D( p(Y|x) \| P(Y) )$ where $X$ and $Y$ are the input and output of the channel. He gets a single-letter expression for the capacity under a bound on the max surprise and gives an example for which the same distribuion maximizes mutual information and achieves the minimax surprise. The flip side is to ask for capacity when each output should be surprising (or “attention seeking”). He gets a single letter capacity here as well, but the structure of the solution seems to be a bit more complicated.

# Postdoc at Cornell in smart grid, learning, optimization and control

Applicants are sought for postdoctoral scholar position(s) at Cornell University in the areas of smart grid, learning, optimization, and control. Topics include, but are not limited to,

1. the economics and operation of power systems with significant penetration of intermittent renewables.
2. stochastic optimization, learning, game theory, mechanism design, and their applications.
3. inference and control involving heavy tail distributions.

Successful candidates will participate in research activities led by Professors Lang Tong and/or Eilyan Bitar.

To apply, please send your CV, two recent papers, and 2-3 names references to Lang Tong (ltong@ece.cornell.edu) and Eilyan Bitar (eyb5@cornell.edu).

# Allerton 2011

Despite Yury‘s attempts to get me to “stop blogging,” here is my much-delayed recap of Allerton. Because of moving to TTI-Chicago right after Allerton and then almost immediately shuttling back to UCSD for the iDASH Privacy Workshop, things have been a bit delayed. I could only attend for two days but I wanted to highlight a few interesting talks that I saw. More Allerton blogging was done by Michael Mitzenmacher (part 1, part 2) and a bit by Maxim Raginsky about his talk on causal calculus (since he blogged about it I don’t have to, ha!). The conference has gotten too big and the rooms are too small to hold the audience, so it probably is time to move the thing. We have similar issues at ITA and the 2012 ITA Workshop is moving off campus next year (you heard it here first, folks!)

But here are some interesting talks I saw: