Design and Analysis of Distributed Averaging with Quantized Communication

Mahmoud El Chamie, Ji Liu, Tamer Başar

The goal of this paper is to analyze the “performance of a subclass of deterministic distributed averaging algorithms where the information exchange between neighboring nodes (agents) is subject to uniform quantization.” I was interested in the connections to Lavaei and Murray’s TAC paper. Here though, they consider a standard consensus setup with a doubly stochastic weight matrix and deterministic, rather than randomized, quantization. They consider two types of quantizer: rounding and truncation (essentially the floor operation). The update rule applies the quantization operation to the values that the agents exchange. They show that in finite time the agents either reach a consensus on the floor of the average of their initial values, or they cycle indefinitely in a neighborhood around the average. They then show how to control the size of the neighborhood in a decentralized way. There are a lot of works on quantized consensus that have appeared in the last 5 years, and to be honest I haven’t really kept up on the recent literature, so I’m not sure how to compare this to the other works that have appeared, but perhaps some of the readers of the blog have…
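To get a feel for this kind of iteration, here is a minimal simulation sketch. The specific update, $x_i(t+1) = x_i(t) + \sum_j w_{ij} \left( Q(x_j(t)) - Q(x_i(t)) \right)$, is my assumption about one plausible form of a quantized consensus rule and is not copied from the paper; the function name `quantized_consensus` and the ring weights in the usage lines are likewise made up for illustration.

```python
import math

def quantized_consensus(x0, W, quantize=math.floor, num_steps=50):
    """Run an ASSUMED quantized averaging rule (not taken from the paper):
    x_i <- x_i + sum_j W[i][j] * (Q(x_j) - Q(x_i)).
    W should be doubly stochastic; quantize defaults to truncation (floor)."""
    x = list(x0)
    n = len(x)
    for _ in range(num_steps):
        q = [quantize(v) for v in x]  # only quantized values are exchanged
        x = [x[i] + sum(W[i][j] * (q[j] - q[i]) for j in range(n))
             for i in range(n)]
    return x

# Hypothetical example: a ring of 4 agents with symmetric, doubly stochastic weights.
W_ring = [
    [0.5, 0.25, 0.0, 0.25],
    [0.25, 0.5, 0.25, 0.0],
    [0.0, 0.25, 0.5, 0.25],
    [0.25, 0.0, 0.25, 0.5],
]
x_start = [0.0, 3.5, 7.25, 10.0]
x_end = quantized_consensus(x_start, W_ring)
```

Because the columns of a doubly stochastic matrix sum to one, this particular rule keeps the sum (and so the average) of the states fixed even though only quantized values are exchanged. Passing `quantize=lambda v: math.floor(v + 0.5)` switches from truncation to rounding.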

arXiv:1403.4699 [math.OC]

A Proximal Stochastic Gradient Method with Progressive Variance Reduction

Lin Xiao, Tong Zhang

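This paper looks at convex optimization problems of the form

$$\min_{x \in \mathbb{R}^d} \; P(x) = F(x) + R(x),$$

where the overall objective $P$ is strongly convex, the regularizer $R$ is lower semicontinuous and convex, and the term $F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$ separates into a sum of functions $f_i$ whose gradients are Lipschitz continuous. The proximal gradient method is an iterative procedure for solving this program that does the following:

$$x_k = \arg\min_{x} \left\{ \nabla F(x_{k-1})^{\top} x + \frac{1}{2 \eta_k} \| x - x_{k-1} \|^2 + R(x) \right\}.$$

If we define the function $\mathrm{prox}_R$ as

$$\mathrm{prox}_{R}(y) = \arg\min_{x} \left\{ \frac{1}{2} \| x - y \|^2 + R(x) \right\},$$

then the step looks like:

$$x_k = \mathrm{prox}_{\eta_k R}\left( x_{k-1} - \eta_k \nabla F(x_{k-1}) \right).$$

A stochastic gradient (SG) version of this is

$$x_k = \mathrm{prox}_{\eta_k R}\left( x_{k-1} - \eta_k \nabla f_{i_k}(x_{k-1}) \right),$$

where $i_k$ is sampled uniformly from $\{1, \ldots, n\}$ at each time. The advantage of the SG variant is that it takes less time to do one iteration, but each iteration is much noisier. The goal of this paper is to adapt a previous method/approach to variance reduction to improve the performance of the Prox-SG algorithm. The approach is one of resampling points according to the Lipschitz constants. This sort of “sampling based adaptivity” was also used by my ex-colleague Samory Kpotufe and collaborators in their NIPS paper from 2012 (a longer version is under review). At least I think they’re related.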
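As a concrete instance of the plain Prox-SG step (not the variance-reduced method from the paper), here is a small sketch. I am assuming squared-loss components $f_i(x) = \frac{1}{2}(a_i^{\top} x - b_i)^2$ and an $\ell_1$ regularizer $R(x) = \lambda \|x\|_1$, for which the prox is coordinate-wise soft-thresholding; the names `soft_threshold` and `prox_sg` and the constant step size are illustrative choices of mine.

```python
import random

def soft_threshold(y, tau):
    """Prox of tau * ||.||_1: shrink each coordinate toward zero by tau."""
    return [max(abs(v) - tau, 0.0) * (1.0 if v > 0 else -1.0) for v in y]

def prox_sg(A, b, lam, eta, num_iters, seed=0):
    """Prox-SG for (1/n) * sum_i 0.5 * (a_i . x - b_i)^2 + lam * ||x||_1.
    Assumed setup for illustration: constant step size eta, uniform sampling."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    for _ in range(num_iters):
        i = rng.randrange(n)  # sample one component uniformly at random
        residual = sum(A[i][j] * x[j] for j in range(d)) - b[i]
        grad = [residual * A[i][j] for j in range(d)]  # gradient of f_i at x
        y = [x[j] - eta * grad[j] for j in range(d)]   # gradient step
        x = soft_threshold(y, eta * lam)               # prox step for eta * R
    return x

# Hypothetical noiseless data generated from x_true = [1, -2].
A_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_demo = [1.0, -2.0, -1.0]
x_hat = prox_sg(A_demo, b_demo, lam=0.0, eta=0.5, num_iters=2000)
```

Each iteration touches a single row of the data, which is the "cheap but noisy" tradeoff described above. With noisy data a constant step size would only get you to a neighborhood of the solution, which is the kind of issue variance reduction is meant to address.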

arXiv:1403.5341v1 [cs.LG]

An Information-Theoretic Analysis of Thompson Sampling

Daniel Russo, Benjamin Van Roy

In a multi-armed bandit problem we have a set of actions (arms), and at each time the learner picks an action and observes an outcome, which is assigned a reward by a reward function. The rewards are assumed to be i.i.d. across time for each action, with distributions that are unknown to the learner. The goal is to maximize the reward, which is the same as finding the arm with the largest expected reward. This leads to a classical explore/exploit tradeoff where the learner has to decide whether to explore new arms which may have higher expected reward, or continue exploiting the reward offered by the current arm. Thompson sampling is a Bayesian approach where the learner starts with a prior on the best action and then samples actions at each time according to its posterior belief on the best arm. The authors here analyze the regret of such a policy in terms of what they call the information gain of the system. This gain depends on the ratio between two quantities that are functions of the outcome distributions. One is what they call the “divergence in mean,” namely the difference in expected reward between arms, and the other is the KL divergence.
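For readers who have not seen Thompson sampling written out, here is a minimal sketch for the simplest case. The Bernoulli rewards, the independent Beta(1, 1) priors, and the function name `thompson_bernoulli` are assumptions I am making for illustration; the paper's setting is more general. Drawing one posterior sample per arm and playing the argmax is the usual way of sampling an arm according to the posterior probability that it is the best one.

```python
import random

def thompson_bernoulli(true_means, horizon, seed=0):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors.
    Returns how many times each arm was pulled."""
    rng = random.Random(seed)
    k = len(true_means)
    successes = [0] * k
    failures = [0] * k
    pulls = [0] * k
    for _ in range(horizon):
        # One draw from each arm's Beta posterior over its mean reward.
        samples = [rng.betavariate(successes[a] + 1, failures[a] + 1)
                   for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        # Simulated environment: reward is Bernoulli(true_means[arm]).
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pull_counts = thompson_bernoulli([0.1, 0.9], horizon=2000)
```

Early on the posteriors are wide, so both arms get sampled (exploration); as evidence accumulates the posterior for the better arm concentrates and it gets played almost exclusively (exploitation).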

Filed under: Uncategorized

For the purposes of this post, a randomized algorithm $\mathcal{A}$ is $\epsilon$-differentially private if

$$\mathbb{P}( \mathcal{A}(D) \in S ) \le e^{\epsilon} \, \mathbb{P}( \mathcal{A}(D') \in S )$$

for all pairs of data sets $D$ and $D'$ containing $n$ individuals and differing in a single individual, and all measurable sets $S$.

For cases where the output of $\mathcal{A}$ has a density, we can interpret this as saying the log-likelihood ratio for the output of the algorithm is bounded by $\epsilon$:

$$\left| \log \frac{ p(\mathcal{A}(D) = y) }{ p(\mathcal{A}(D') = y) } \right| \le \epsilon.$$

For those familiar with hypothesis testing, this guarantee is saying something about the hypothesis test between $D$ and $D'$ being “hard.” One interpretation of differential privacy is that an *adversary* observing the output of the algorithm will have a difficult time inferring if the data of individual $i$ is $x_i$ or $x_i'$, *even if they know all other data points*.

Wasserman and Zhou showed that the parameter $\epsilon$ controls the power of this hypothesis test. Oh and Viswanath write this more explicitly as a pair of inequalities governing the tradeoff between the false-alarm probability $P_{\mathrm{FA}}$ and the missed-detection probability $P_{\mathrm{MD}}$:

$$P_{\mathrm{FA}} + e^{\epsilon} P_{\mathrm{MD}} \ge 1, \qquad e^{\epsilon} P_{\mathrm{FA}} + P_{\mathrm{MD}} \ge 1.$$

If we plot these against each other we get a picture like this:

The receiver operating characteristic (ROC) is defined as the true positive rate as a function of the false positive rate. That is, it’s $1 - P_{\mathrm{MD}}$ on the y-axis and $P_{\mathrm{FA}}$ on the x-axis. So this is the same plot as above, only flipped along the y-axis. To calculate the AUC we just integrate. The point where equality holds in both of the above inequalities is where $P_{\mathrm{FA}} = P_{\mathrm{MD}}$, or

$$P_{\mathrm{FA}} = P_{\mathrm{MD}} = \frac{1}{1 + e^{\epsilon}}.$$

That’s the “corner point” in the previous figure. So the AUC is just

$$\mathrm{AUC} = \frac{e^{\epsilon}}{1 + e^{\epsilon}}.$$
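Here is a quick numerical sanity check of that integration. I am assuming the two inequalities take the form $P_{\mathrm{FA}} + e^{\epsilon} P_{\mathrm{MD}} \ge 1$ and $e^{\epsilon} P_{\mathrm{FA}} + P_{\mathrm{MD}} \ge 1$, so that the best achievable true positive rate at false-alarm level $P_{\mathrm{FA}}$ is $\min\{ e^{\epsilon} P_{\mathrm{FA}}, \; 1 - e^{-\epsilon}(1 - P_{\mathrm{FA}}) \}$. Under that assumption the area works out to $e^{\epsilon}/(1 + e^{\epsilon})$. The function names are mine.

```python
import math

def roc_upper_bound(p_fa, eps):
    """Largest true positive rate (1 - P_MD) allowed at false-alarm rate p_fa,
    under the two assumed inequalities for eps-differential privacy."""
    return min(math.exp(eps) * p_fa,
               1.0 - math.exp(-eps) * (1.0 - p_fa))

def auc_closed_form(eps):
    """Area under the piecewise-linear ROC bound: e^eps / (1 + e^eps)."""
    return math.exp(eps) / (1.0 + math.exp(eps))

def auc_numeric(eps, steps=100000):
    """Trapezoid-rule integral of roc_upper_bound over p_fa in [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        left = roc_upper_bound(k * h, eps)
        right = roc_upper_bound((k + 1) * h, eps)
        total += 0.5 * (left + right) * h
    return total

# Tabulate the AUC for a few privacy levels (values chosen arbitrarily).
for eps in (0.1, 0.5, 1.0, 2.0):
    print(eps, auc_closed_form(eps), auc_numeric(eps))
```

At $\epsilon = 0$ the closed form gives exactly $1/2$ (the adversary can do no better than guessing), and it approaches 1 as $\epsilon$ grows.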

We can plot the AUC as a function of $\epsilon$ pretty easily:

So we can now see how the privacy parameter affects the AUC for this hypothesis test. Depending on how comfortable you are with the risk (and also the threat model), you can assess for yourself what kind of $\epsilon$ you would prefer. Oh and Viswanath also calculate the tradeoff for $(\epsilon, \delta)$-differential privacy, but maybe I’ll leave that for another post. In the end, I don’t find this AUC plot so illuminating, but then again, I don’t have a visceral “feel” for how the AUC corresponds to the quality of the prediction.

Filed under: Uncategorized

In terms of teaching technical writing at the graduate level, the issues may be similar but the students are generally older — they may have even had some writing experience from undergraduate or masters-level research. How should the “ESL” issue affect how we teach technical writing?

Filed under: Uncategorized

- **Fourier**: Wang and Giannakis, Wireless Multicarrier Communications: Where Fourier Meets Shannon, IEEE Signal Processing Magazine, 2000.
- **Bode**: Elia, When Bode meets Shannon: control-oriented feedback communication schemes, IEEE Transactions on Automatic Control, 2004.
- **Maxwell**: Chakraborty and Franceschetti, Maxwell meets Shannon: Space-time duality in multiple antenna channels, Allerton 2006, *and* Lee and Chung, Capacity scaling of wireless ad hoc networks: Shannon meets Maxwell, IEEE Transactions on Information Theory, 2012.
- **Carnot**: Shental and Kanter, Shannon Meets Carnot: Generalized Second Thermodynamic Law, Europhysics Letters, 2009.
- **Nash**: Berry and Tse, Shannon Meets Nash on the Interference Channel, IEEE Transactions on Information Theory, 2011.
- **Walras**: Jorswieck and Mochaourab, Shannon Meets Walras on Interference Networks, ITA Workshop 2013.
- **Nyquist**: Chen, Eldar, and Goldsmith, Shannon Meets Nyquist: Capacity of Sampled Gaussian Channels, IEEE Transactions on Information Theory, 2013.
- **Strang and Fix**: Dragotti, Vetterli, and Blu, Sampling moments and reconstructing signals of finite rate of innovation: Shannon meets Strang–Fix, IEEE Transactions on Signal Processing, 2007.
- **Blackwell and LeCam**: Raginsky, Shannon meets Blackwell and Le Cam: channels, codes, and statistical experiments, ISIT 2011.
- **Wiener**: Forney, On the role of MMSE estimation in approaching the information-theoretic limits of linear Gaussian channels: Shannon meets Wiener, Allerton 2003, and Forney, Shannon meets Wiener II: On MMSE estimation in successive decoding schemes, Allerton 2004 and arXiv 2004.
- **Bellman**: Meyn and Mathew, Shannon meets Bellman: Feature based Markovian models for detection and optimization, CDC 2008.
- **Tesla**: Grover and Sahai, Shannon meets Tesla: Wireless information and power transfer, ISIT 2010.
- **Shortz**: Efron, Shannon Meets Shortz: A Probabilistic Model of Crossword Puzzle Difficulty, Journal of the American Society for Information Science and Technology, 2008.
- **Marconi**: Tse, Modern Wireless Communication: When Shannon Meets Marconi, ICASSP 2006.
- **Kalman**: Gattami, Kalman meets Shannon, arXiv 2014.

Sometimes people are meeting Shannon, and sometimes he is meeting them, but each meeting produces at least one paper.

Filed under: Uncategorized

- You Can Never Hold Back Spring (Tom Waits)
- Les gars qui vont à la fête (Stutzmann/Södergren, by Poulenc)
- Judas mercator pessimus (King’s Singers, by Gesualdo)
- Calling (Snorri Helgason)
- Hold Your Head (Hey Marseilles)
- Soutoukou (Mamadou Diabate)
- A Little Lost (Nat Baldwin)
- Gun Has No Trigger (Dirty Projectors)
- Stranger to My Happiness (Sharon Jones & The Dap-Kings)
- Dama Dam Mast Qalandar (Red Baraat)
- Libra Stripes (Polyrhythmics)
- Jaan Pehechan Ho (The Bombay Royale)
- Jolie Coquine (Caravan Palace)
- The Natural World (CYMBALS)
- Je Ne Vois Que Vous (Benjamin Schoos feat. Laetitia Sadier)
- Romance (Wild Flag)

Filed under: Uncategorized

- students often don’t have a clear line of thought before they write,
- they don’t think of who their audience is,
- they don’t know how to rewrite, or indeed how important it is.

Adding to all of this is that they don’t know how to *read* a paper. In particular, they don’t know what to be reading for in terms of content or form. This makes the experience of reading “related work” sections incredibly frustrating.

What I was thinking was a class where students learn to write a literature review (a small one) on a topic of their choosing. The first part will be how to read papers and make connections between them. What is the point of a literature review, anyway? The first objective is to develop a more systematic way of reading and processing papers. I think everyone I know professionally, myself included, learned how to do this in an ad-hoc way. I believe that developing a formula would help improve my own literature surveying. The second part of the course would be teaching about rewriting (rather than writing). That is, instead of providing rules like “don’t use the passive voice so much” we could focus on “how to revise your sentences to be more active.” I would also benefit from a systematic approach to this for my own writing.

I was thinking of a kind of once-a-week writing seminar style class. Has anyone seen a class like this in engineering programs? Are there tips/tricks from other fields/departments that do have such classes which could be useful here? Even though it is “for social scientists”, Howard Becker’s book is a really great resource.

Filed under: Uncategorized

arXiv:1403.3465v1 [cs.LG]: Analysis Techniques for Adaptive Online Learning

H. Brendan McMahan

This is a nice survey on online learning/optimization algorithms that adapt to the data. These are all variants of the Follow-The-Regularized-Leader algorithms. The goal is to provide a more unified analysis of online algorithms where the regularization is data dependent. The intuition (as I see it) is that you’re doing a kind of online covariance estimation and then regularizing with respect to the distribution as you are learning it. Examples include the McMahan and Streeter (2010) paper and the Duchi et al. (2011) paper. Such adaptive regularizers also appear in dual averaging methods, where they are called “prox-functions.” This is a useful survey, especially if, like me, you’ve kind of checked in and out with the online learning literature and so may be missing the forest for the trees. Or is that the FoReL for the trees?
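To make "data-dependent regularization" a bit more concrete, here is a sketch of the diagonal scaling from the Duchi et al. (2011) paper, written in its simple gradient-step form rather than as an explicit FTRL update. The function name `adagrad`, the default parameters, and the toy quadratic are my own choices for illustration and are not taken from the survey.

```python
import math

def adagrad(grad_fn, x0, eta=1.0, num_steps=500, eps=1e-12):
    """Diagonal adaptive-gradient sketch: coordinate j uses step size
    eta / sqrt(sum of squared past gradients in coordinate j)."""
    x = list(x0)
    sum_sq = [0.0] * len(x)
    for _ in range(num_steps):
        g = grad_fn(x)
        for j in range(len(x)):
            sum_sq[j] += g[j] * g[j]  # running per-coordinate statistic
            # eps only guards against division by zero when no gradient was seen
            x[j] -= eta * g[j] / (math.sqrt(sum_sq[j]) + eps)
    return x

# Toy objective: 0.5 * sum_j c_j * (x_j - t_j)^2 with very different curvatures.
curvatures = [1.0, 100.0]
targets = [3.0, -2.0]

def quad_grad(x):
    return [c * (v - t) for c, v, t in zip(curvatures, x, targets)]

x_final = adagrad(quad_grad, [0.0, 0.0])
```

The point of the example is that the two coordinates have curvatures differing by a factor of 100, yet no per-coordinate tuning is needed: the accumulated squared gradients rescale each coordinate automatically.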

arXiv:1403.4011 [cs.IT]: Whose Opinion to follow in Multihypothesis Social Learning? A Large Deviation Perspective

Wee Peng Tay

This is a sort of learning-from-expert-advice problem, though not in the setting that machine learners would consider it. The more control-oriented folks would recognize it as a multiple-hypothesis test. The model is that there is a single agent and a set of experts. The agent is trying to do a multihypothesis test. The experts (and the agent) have access to local (private) observations. The observations come from a family of distributions determined by the true hypothesis. The agent needs to pick one of the experts to hire; the analogy is that you are an investor picking an analyst to hire. Each expert has its own local loss function, which is a function of the amount of data it has as well as the true hypothesis and the decision it makes. This is supposed to model a “bias” for the expert: for example, they may not care to distinguish between two hypotheses. The rest of the paper looks at finding policies/decision rules for the agents that optimize the exponents with respect to their local loss functions, and then looking at how the agent should act to incorporate that advice. This paper is a little out of my wheelhouse, but it seemed interesting enough to take a look at. In particular, it might be interesting to some readers out there.

arXiv:1403.3862 [math.OC]: Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties

Ji Liu, Stephen J. Wright

This is another paper on lock-free optimization (cf. HOGWILD!). The key difference, as stated in the introduction, is that they “do not assume that the evaluation vector $\hat{x}$ is a version of $x$ that actually existed in the shared memory at some point in time.” What does this mean? It means that a local processor, when it reads the current state of the iterate, may be performing an update with respect to a point not on the sample path of the algorithm. They do assume that the delay between reading and updating the common state is bounded. To analyze this method they need to use a different analysis technique. The analysis is a bit involved and I’ll have to take a deeper look to understand it better, but from a bird’s-eye view this would make sense as long as the step size is chosen properly and the “hybrid” updates can be shown to be not too far from the original sample path. That’s the stochastic approximator in me talking though.

Filed under: Uncategorized

Wojciech Banaszczyk. *Balancing vectors and gaussian measures of n-dimensional convex bodies*. Random Structures & Algorithms, 12(4):351–360, 1998.

This result came to my attention from a talk given by Sasho Nikolov here at Rutgers on his paper with Kunal Talwar on approximating hereditary discrepancy (see Kunal’s post from last year). The result is pretty straightforward to state.

Banaszczyk’s Theorem. There exists a universal constant $C$ such that the following holds. Let $A$ be an $m \times n$ real matrix such that the $i$-th column $a_i$ satisfies $\|a_i\|_2 \le 1$ for $i = 1, \ldots, n$, and let $K$ be a convex body in $\mathbb{R}^m$ such that $\mathbb{P}(g \in K) \ge 1/2$, where $g \sim \mathcal{N}(0, I_m)$. Then there exists a vector $x \in \{-1, +1\}^n$ such that $Ax \in CK$.

This is a pretty cool result! Basically, if your convex body $K$ is big enough to capture half of the probability of a standard Gaussian, then if you blow it up by $C$ to get $CK$, then for any arbitrary collection of sub-unit-norm vectors $\{a_i\}$, I can find a way to add and subtract them from each other so that the result ends up in $CK$.

I haven’t found a use for this result, but it’s a neat fact to keep in the bucket. Maybe it would be useful in alignment/beamforming schemes? Unfortunately, as far as I can tell he doesn’t tell you how to find this mysterious $x$, so…
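Since the theorem is only an existence statement, the one thing you can always do for a handful of vectors is try every sign pattern. The sketch below does exactly that and returns the signs whose signed sum has the smallest Euclidean norm. This is brute force over all $2^n$ patterns, not an algorithm from the paper, and using the Euclidean norm (i.e. thinking of the convex body as a ball) is my choice for illustration; the name `best_signs` is made up.

```python
import itertools
import math

def best_signs(vectors):
    """Enumerate all +/-1 sign patterns and return (signs, norm) for the
    signed sum with the smallest Euclidean norm. Cost grows as 2^n."""
    dim = len(vectors[0])
    best = None
    for signs in itertools.product((-1, 1), repeat=len(vectors)):
        total = [sum(s * v[j] for s, v in zip(signs, vectors))
                 for j in range(dim)]
        norm = math.sqrt(sum(t * t for t in total))
        if best is None or norm < best[1]:
            best = (signs, norm)  # keep the first pattern achieving the minimum
    return best

# Two identical unit vectors cancel exactly when given opposite signs.
signs, norm = best_signs([(1.0, 0.0), (1.0, 0.0)])
```

This is only feasible for small collections, which is rather the point: the interesting question is how to find a good sign vector without enumerating all of them.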

Filed under: Uncategorized

This morning I had the singular experience of having a paper rejected from ICML 2014 in which *all of the reviewers* specifically marked that they *did not read and consider the response*. Based on the initial scores the paper was borderline, so the rejection is not surprising. However, we really did try to address their criticisms in our rebuttal. In particular, some misunderstood what our claims were. Had they bothered to read our response (and proposed edits), perhaps they would have realized this.

Highly selective (computer science) conferences often tout their reviews as being just as good as a journal, but in both outcomes and process, it’s a pretty ludicrous claim. I know this post may sound like sour grapes, but it’s not about the outcome, it’s about the process. Why bother with the facade of inviting authors to rebut if the reviewers are unwilling to read the response?

Filed under: Uncategorized

Along with a couple of Honeywell security researchers I am running a study on a rather familiar problem for most of us – creating memorable but secure passwords, i.e. how to generate passwords that are both suitably random and memorable. We have just launched a simple user study that asks volunteers to participate in an interactive session that lets them choose password candidates and see how well they remember them. Needless to say, these are not actual passwords used by any system, only strings that could be used as passwords.

No personal information is collected in the study and the system only stores the data that is actually provided by the user. To that end, you may decline to provide any piece of information you wish. The study takes only a couple of minutes to finish. You may run it multiple times if you wish (and you will likely get different use cases) but you will have to clear the cache on your browser to get a fresh configuration.

We need at least 300 participants to get statistical significance, so we would appreciate it if you could participate in the study.

Please click here to go to the study: http://138.91.115.120:8080/syspwd

Thanks for your help. Any questions on the study may be directed to me.

Raj

Filed under: Uncategorized