# yet more not-so-recent hits from ArXiV

Some shorter takes on these papers, some of which I should read in more detail later. I figure I’ll use the blog for some quick notes and to see if any readers have any comments/ideas about these:

Differentially Private Convex Optimization with Piecewise Affine Objectives (Shuo Han, Ufuk Topcu, George J. Pappas) — arXiv:1403.6135 [math.OC]. The idea here is to look at minimizing functions of the form
$f(x) = \max_{i = 1,2, \ldots, m} \{ a_i^{\top} x + b_i \}$
subject to $x$ belonging to some convex polytope $\mathcal{P}$. This is a bit different than the kind of convex programs I’ve been looking at (which are more ERM-like). Such programs occur often in resource allocation problems. Here the private information of users are the offsets $b_i$. They propose a number of methods for generating differentially private approximations to this problem. Analyzing the sensitivity of this optimization is tricky, so they use an upper bound based on the diameter of the feasible set $\mathcal{P}$ to find an appropriate noise variance. The exponential mechanism also gives a feasible mechanism, although the exact dependence of the suboptimality gap on $\epsilon$ is unclear. They also propose a noisy subgradient method where, instead of using SGD, they alter the sampling distribution using the exponential mechanism to choose a gradient step. Some preliminary experiments are also given (although none exploring the dependence on $\epsilon$, which would also be very interesting)!

Assisted Common Information with an Application to Secure Two-Party Sampling (Vinod M. Prabhakaran, Manoj M. Prabhakaran) — arXiv:1206.1282 [cs.IT]. This is the final version of the journal version of a few conference papers that Vinod and Manoj have done on an interesting variant of the Gács-Körner problem. The motivation is from secure multiparty computation — the problem also touches on some work Vinod and I started but is sadly languishing due to the utter overwhelmingness of starting a new job. Hopefully I can get back to it this summer.

Analysis of Distributed Stochastic Dual Coordinate Ascent (Tianbao Yang, Shenghuo Zhu, Rong Jin, Yuanqing Lin) — arXiv:1312.1031 [cs.DC]. The title pretty much sums it up. I’m interested in looking a bit more at the analysis method, since I had a similar algorithm bouncing around my head that I would like to analyze. The main idea is also update the primal variables to achieve a speedup/use a larger step size.

Convergence of Stochastic Proximal Gradient Algorithm (Lorenzo Rosasco, Silvia Villa, Bang Công Vũ) — arXiv:1403.5074 [math.OC]. This is a similar setup as my last post, with a convex objective that has a smooth and non-smooth component. They show convergence in expectation and almost surely. The key here is that they show convergence in an infinite-dimentionsal Hilbert space instead of, say, $\mathbb{R}^d$.

# Some (not-so-)recent hits from ArXiV

I always end up bookmarking a bunch of papers from ArXiV and then looking at them a bit later than I want. So here are a few notes on some papers from the last month. I have a backlog of reading to catch up on, so I’ll probably split this into a couple of posts.

arXiv:1403.3465v1 [cs.LG]: Analysis Techniques for Adaptive Online Learning
H. Brendan McMahan
This is a nice survey on online learning/optimization algorithms that adapt to the data. These are all variants of the Follow-The-Regularized-Leader algorithms. The goal is to provide a more unified analysis of online algorithms where the regularization is data dependent. The intuition (as I see it) is that you’re doing a kind of online covariance estimation and then regularizing with respect to the distribution as you are learning it. Examples include the McMahan and Streeter (2010) paper and the Duchi et al. (2011) paper. Such adaptive regularizers also appear in dual averaging methods, where they are called “prox-functions.” This is a useful survey, especially if, like me, you’ve kind of checked in and out with the online learning literature and so may be missing the forest for the trees. Or is that the FoReL for the trees?

arXiv:1403.4011 [cs.IT]: Whose Opinion to follow in Multihypothesis Social Learning? A Large Deviation Perspective
Wee Peng Tay
This is a sort of learning from expert advice problem, though not in the setting that machine learners would consider it. The more control-oriented folks would recognize it as a multiple-hypothesis test. The model is that there is a single agent (agent $0$) and $K$ experts (agents $1, 2, \ldots, K$). The agent is trying to do an $M$-ary hypothesis test. The experts (and the agent) have access to local (private) observations $Y_k[1], Y_k[2], \ldots, Y_k[n_k]$ for $k \in \{0,1,2,\ldots,K\}$. The observations come from a family of distributions determined by the true hypothesis $m$. The agent $0$ needs to pick one of the $K$ experts to hire — the analogy is that you are an investor picking an analyst to hire. Each expert has its own local loss function $C_k$ which is a function of the amount of data it has as well as the true hypothesis and the decision it makes. This is supposed to model a “bias” for the expert — for example, they may not care to distinguish between two hypotheses. The rest of the paper looks at finding policies/decision rules for the agents that optimize the exponents with respect to their local loss functions, and then looking at how agent $0$ should act to incorporate that advice. This paper is a little out of my wheelhouse, but it seemed interesting enough to take a look at. In particular, it might be interesting to some readers out there.

arXiv:1403.3862 [math.OC] Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
Ji Liu, Stephen J. Wright
This is another paper on lock-free optimization (c.f. HOGWILD!). The key difference, as stated in the introduction, is that they “do not assume that the evaluation vector $\hat{x}$ is a version of $x$ that actually existed in the shared memory at some point in time.” What does this mean? It means that a local processor, when it reads the current state of the iterate, may be performing an update with respect to a point not on the sample path of the algorithm. They do assume that the delay between reading and updating the common state is bounded. To analyze this method they need to use a different analysis technique. The analysis is a bit involved and I’ll have to take a deeper look to understand it better, but from a birds-eye view this would make sense as long as the step size is chosen properly and the “hybrid” updates can be shown to be not too far from the original sample path. That’s the stochastic approximator in me talking though.

# Broadcast gossip revisited : companion variables and convergence to consensus

Mike Rabbat pointed out his new preprint on ArXiv:

Broadcast Gossip Algorithms for Consensus on Strongly Connected Digraphs
Wu Shaochuan, Michael G. Rabbat

The basic starting point is the unidirectional Broadcast Gossip algorithm — we have $n$ nodes in a graph and each node $i$ starts with a value $x_i(0)$. At each time, a random node broadcasts its value to its neighbors in the graph and they each update their value with the weighted average of their current value and the received value. Eventually, this process converges almost surely and all nodes $i$ will have a common value $\hat{x}$. However, in general $\hat{x} \ne \bar{x}$, where $\bar{x} = \frac{1}{n} \sum x_i(0)$, but $\mathbb{E}[\hat{x}] = \bar{x}$. So this process achieves consensus but only computes the average in expectation.

The reason broadcast gossip does’t compute the average is pretty clear — since the communication and updates are unidirectional, the average of the nodes’ values changes over time. One way around this is to use companion variables $\{y_i(t)\}$ to track the effect of the broadcast. These have been studied before, but in this paper they set the initial values $y_i(0) = 0$ and perform updates as follows : if node $k$ broadcasts at time $t$ and node $j$ is a neighbor, then

$x_j(t+1) = (1 - a_{j,k}) x_j(t) + a_{j,k} x_k(t) + \epsilon d_j^{(k)} y_j(t)$
$y_j(t+1) = a_{j,k} (x_j(t) - x_k(t)) + (1 - \epsilon d_j^{(k)}) y_j(t) + b_{j,k} y_k(t)$
$x_k(t+1) = x_k(t)$
$y_j(t+1) = 0$

So what’s going on here? Each time a node transmits it resets its companion variable to $0$. Each time it receives a broadcast it accumulates an offset in $y_j$. The actual value estimate at the nodes is weighted average of the current and received value plus some of the local offset.

The parameters of the algorithm are matrices $A = (a_{j,k})$ and $B = (b_{j,k})$ and per-node matrices $D_k = (d_j^{(k)})$. Let

$A_k = A e_k e_k^T$
$B_j = B e_k e_k^T$
$L_k = \mathrm{diag}(A_k \mathbf{1}) - A_k$
$S_k = I - e_k e_k^T + B_k$

When node $k$ broadcasts) we can write a matrix

$W_k \left[ \begin{array}{cc} I - L_k & \epsilon D_k \\ L_k & S_k - \epsilon D_k \end{array} \right]$

such that

$\left[ \begin{array}{c} \mathbf{x}(t+1) \\ \mathbf{y}(t+1) \end{array} \right]^T = W(t) \left[ \begin{array}{c} \mathbf{x}(t) \\ \mathbf{y}(t) \end{array} \right]^T$

where $W(t) = W_k$ if node $k$ broadcasts at time $t$.

The main results are:

1. If we choose $\epsilon$ small enough, then $\lim_{t \to \infty} \mathbb{E}[ \mathbf{x}(t) | \mathbf{x}(0) ] = \hat{x} \mathbf{1}$ and $\hat{x} = \bar{x}$ under more conditions.
2. A sufficient condition for the second moment of the error to go to 0.
3. Rules on how to pick the parameters, especially $\epsilon$.

There are also some nice simulations. Broadcast gossip is nice because of its simplicity, and adding a little monitoring/control variable $y_j(t)$ seems to buy a lot in terms of controlling bad sample paths for the process.

# The card-cyclic to random shuffle

Mixing time of the card-cyclic to random shuffle
Ben Morris

We analyze the following method for shuffling $n$ cards. First, remove card 1 (i.e., the card with label 1) and then re-insert it randomly into the deck. Then repeat with cards 2, 3,…, $n$. Call this a round. R. Pinsky showed, somewhat surprisingly, that the mixing time is greater than one round. We show that in fact the mixing time is on the order of $\log n$ rounds.

The talk is based on a paper with Weiyang Ning (a student at UW) and Yuval Peres. The description the results is somewhat different because it’s $\log n$ rounds of $n$ moves, or $n \log n$ moves. From the intro to the paper:

To prove the lower bound we introduce the concept of a barrier between two parts of the deck that moves along with the cards as the shuffling is performed. Then we show that the trajectory of this barrier can be well-approximated by a deterministic function $f$… we relate the mixing rate of the chain to the rate at which $f$ converges to a constant.

The key is to use path coupling, a technique pioneered by Bubley and Dyer. It’s a cute result and would be a fun paper to read for a class project or something, I bet.

# ArXiV notes : July 2012 part 1

I usually tag things from ArXiV that I want to skim later, so I figured I would post the ones I found interesting in the last month or so. Some of them may deserve their own post later. Jet lag is not letting me take the nap I need right now so I figured I’d post it before conking out.

Convergence Rates for Differentially Private Statistical Estimation
K. Chaudhuri and D. Hsu
ArXiV 1206.6395
This paper is about estimation under privacy constraints. In differential privacy there are a fair number of constraints with regards to the data being well behaved — typical assumptions include boundedness, for example. One way around this is to move to a relaxed notion of privacy known as $(\alpha,\delta)$-differential privacy, but what this paper shows is that for the stricter $\alpha$-differential privacy definition, statistical estimation is governed by the gross error sensitivity (GES) of the estimator.

The Greedy Miser: Learning under Test-time Budgets
Z. Xu, K. Weinberger, and O. Chapelle
ArXiV 1206.6451
This paper is about incorporating the cost of extracting features in classification or regression. For a data point $\mathbf{x}_i \in \mathbb{R}^d$, the feature $\alpha$ costs $c_{\alpha}$ to compute and there is a total budget for the computation. The main approach is to minimize a linear combination of the loss and the cost iteratively, where each iteration consists of building a regression tree.

Sequential detection of multiple change points in networks: a graphical model approach
A.A. Amini and X. Nguyen
ArXiV 1207.1687
This looks at the change point detection problem in the network setting, where each of $d$ nodes in a network has a change point $\lambda_j$. Data is observed on vertices and edges and nodes have to pass messages on the graph. There’s a prior distribution on the change point times and after observing the data the nodes can use message passing to do some belief propagation to estimate the posterior distribution and define a stopping time. This turns out to be asymptotically optimal, in that the expected delay in detecting the change point is achieved.

The Price of Privacy in Untrusted Recommendation Engines
S. Bannerjee, N. Hegde, and L. Massoulié
ArXiV 1207.3269
This studies the effect on recommendations engines of individuals masking their data due to privacy concerns. The technical issues are the the dimension is high and each user only has a small amount of data to control (e.g. the movies that they rented or have seen). The question is one of sample complexity — how many users do we need to cluster successfully. If users are not so sparse, they can achieve the Fano lower bound (order-wise) using a sketch and spectral techniques, but the dependence is like $N \log N$, where $N$ is the number of items (e.g. movies). It’s worse if the users are sparse — the scaling is with $N^2$, which is also achievable in some settings. The analysis approach using Fano is analogous to the “information leakage” connections with differential privacy.

Surrogate Losses in Passive and Active Learning
S. Hanneke and L. Yang
ArXiV 1207.3772
This paper looks at how to use surrogate losses (which are computationally tractable) in active learning problems. “We take as our starting point that we have already committed to use a given surrogate loss, and we restrict our attention to just those scenarios in which this heuristic actually does work. We are then interested in how best to make use of the surrogate loss toward the goal of producing a classifier with relatively small error rate.” The central insight in their algorithm is that by taking into account the fact that the “real loss” is (say) the 0-1 loss, then for the purposes of active learning, one only really needs to get the sign right and not the magnitude. So the process will minimize the real loss but not necessarily the surrogate.

# Updated perl script for merging TeX files for ArXiV

Manu Sridharan (blog) left a comment the other day on my old post on my script to merge multiple TeX files (and strip the comments) for posting to ArXiV. He’s created a git repository for it, which seem so much more official and stuff. It’s at:

https://gist.github.com/2175026

Thanks a bunch, Manu!

As a side note, Péter Gács has a de-macro script to eliminate all of your private macros if you’re so inclined.

# Differentially Private ERM

Right after Memorial Day, I submitted a paper with Kamalika Chaudhuri and Claire Monteleoni to the Journal of Machine Learning Research on differential privacy and empirical risk minimization. This work looks at how to learn a classifier from training data in such a way that an adversary with access to the classifier and full knowledge of all but one of the training data points would still have a difficult time inferring the values of the last data point.

Ben Rubenstein has a nice post on the differential privacy model, and Adam Smith has more to say on sample secrecy. Adam and his colleagues (Dan Kifer and Abhradeep Guha Thakurta) gave us useful feedback on an earlier draft, which prompted me to learn some new facts about matrix perturbations, symmetric functions, and eigenvalues. Perhaps I’ll blog about this a bit more in the future, but I seem to be going in fits and starts here, so I don’t want to promise anything.

On a related note, ArXiV has dramatically changed its interface for submissions, and it is soooooo much better than before.

# ArXiV is down

I got the following from ArXiV today:

Submissions to arXiv have been disabled for maintenance. arXiv’s database is down for maintenance. It is still possible to browse, view and search papers that have already been announced but submissions and replacements disabled, as are the functions to add cross listings and journal references.

So I guess I’ll have to wait to post our new submission, “Privacy constraints in regularized convex optimization,” until the system comes back up.

In the meantime, I’ll blog a bit about ISIT!

# be careful how you name your algorithm

The Lasso is a popular algorithm for selection used in statistics. In this recent paper posted to ArXiV the authors show that the Lasso is not algorithmically stable. It’s only to be expected, really. A twirling loop of rope is hardly an image one might associate to stability!

# prove as you go or scaffold first?

I had an interesting conversation two weeks ago about the working process for doing theory work in CS and EE. We discussed two extremes of working styles. In one, you meticulously prove small statements, type them up as you go along, getting the epsilons and deltas right and not working on the next step until the current step is totally set. I call this “prove as you go.” The other is that you sketch out some proofs to convince yourself that they are probably true (in some form) and then try to chase down the implications until you have the big result. When some deadline rolls around, you then build up the proofs for real. This could be thought of as “scaffolding first.” Fundamentally, these are internal modes of working, but because of the pressure to publish in CS and EE they end up influencing how people view theory work.