# Some (not-so-)recent hits from arXiv

I always end up bookmarking a bunch of papers from arXiv and then looking at them a bit later than I want. So here are a few notes on some papers from the last month. I have a backlog of reading to catch up on, so I’ll probably split this into a couple of posts.

arXiv:1403.3465v1 [cs.LG]: Analysis Techniques for Adaptive Online Learning
H. Brendan McMahan
This is a nice survey of online learning/optimization algorithms that adapt to the data. The algorithms covered are all variants of Follow-The-Regularized-Leader (FTRL). The goal is to provide a more unified analysis of online algorithms in which the regularization is data-dependent. The intuition (as I see it) is that you’re doing a kind of online covariance estimation and then regularizing with respect to the distribution as you learn it. Examples include the McMahan and Streeter (2010) paper and the Duchi et al. (2011) paper. Such adaptive regularizers also appear in dual averaging methods, where they are called “prox-functions.” This is a useful survey, especially if, like me, you’ve checked in and out of the online learning literature and so may be missing the forest for the trees. Or is that the FoReL for the trees?
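To make the "regularize with respect to what you have seen so far" intuition concrete, here is a minimal sketch of a diagonal, AdaGrad-style update in the spirit of the per-coordinate methods cited above. It is not the survey's general FTRL analysis. The function name, the `gradient_fn` callback, and the toy quadratic are my own assumptions for illustration.

```python
import math


def adaptive_online_gradient(gradient_fn, dim, rounds, eta=1.0, eps=1e-12):
    """Online gradient steps with per-coordinate, data-dependent step sizes.

    gradient_fn(t, x) is a hypothetical callback that returns the gradient
    of the round-t loss at the current point x (a list of floats).
    """
    x = [0.0] * dim
    sq_sum = [0.0] * dim  # running sum of squared gradients, per coordinate
    for t in range(rounds):
        g = gradient_fn(t, x)
        for i in range(dim):
            sq_sum[i] += g[i] * g[i]
            # Coordinates that have seen large gradients take smaller steps.
            x[i] -= eta * g[i] / (math.sqrt(sq_sum[i]) + eps)
    return x, sq_sum


# Toy problem (an assumption): a fixed quadratic loss whose two coordinates
# differ in curvature by a factor of 10^4.
curvature = [100.0, 0.01]
target = [3.0, 3.0]


def quadratic_gradient(t, x):
    return [c * (xi - ti) for c, xi, ti in zip(curvature, x, target)]


x_final, _ = adaptive_online_gradient(quadratic_gradient, dim=2, rounds=100)
print(x_final)
```

Because each coordinate's step is normalized by its own gradient history, the two coordinates follow essentially the same trajectory despite the very different curvatures. A single global step size would have to be tuned to the steep coordinate and would crawl along the flat one.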

arXiv:1403.4011 [cs.IT]: Whose Opinion to follow in Multihypothesis Social Learning? A Large Deviation Perspective
Wee Peng Tay
This is a sort of learning-from-expert-advice problem, though not in the setting that machine learners would consider it. The more control-oriented folks would recognize it as a multiple-hypothesis test. The model is that there is a single agent (agent $0$) and $K$ experts (agents $1, 2, \ldots, K$). The agent is trying to do an $M$-ary hypothesis test. The experts (and the agent) have access to local (private) observations $Y_k[1], Y_k[2], \ldots, Y_k[n_k]$ for $k \in \{0,1,2,\ldots,K\}$. The observations come from a family of distributions determined by the true hypothesis $m$. Agent $0$ needs to pick one of the $K$ experts to hire; the analogy is that you are an investor picking an analyst. Each expert has its own local loss function $C_k$, which is a function of the amount of data it has as well as the true hypothesis and the decision it makes. This is supposed to model a “bias” for the expert: for example, they may not care to distinguish between two hypotheses. The rest of the paper looks at finding policies/decision rules for the agents that optimize the large-deviation exponents with respect to their local loss functions, and then looks at how agent $0$ should act to incorporate that advice. This paper is a little out of my wheelhouse, but it seemed interesting enough to take a look at. In particular, it might be interesting to some readers out there.
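As a rough illustration of what a "biased" expert loss can do, here is a small sketch of an $M$-ary test that picks the decision with the smallest posterior expected loss. Everything specific here is my assumption rather than the paper's model: the Gaussian observations with hypothesis-dependent means, the uniform prior, the loss matrix `loss[m][d]`, and the function names. It also ignores the dependence of $C_k$ on the amount of data.

```python
import math


def log_likelihoods(observations, means, sigma=1.0):
    """Log-likelihood (up to a constant) of i.i.d. Gaussian data under each mean."""
    return [
        sum(-0.5 * ((y - mu) / sigma) ** 2 for y in observations)
        for mu in means
    ]


def bayes_decision(observations, means, loss, prior=None):
    """Return the decision d minimizing the posterior expected loss.

    loss[m][d] is the (assumed) cost of deciding d when hypothesis m is true.
    """
    num_hyp = len(means)
    prior = prior or [1.0 / num_hyp] * num_hyp
    ll = log_likelihoods(observations, means)
    top = max(ll)  # subtract the max before exponentiating, for stability
    weights = [p * math.exp(l - top) for p, l in zip(prior, ll)]
    total = sum(weights)
    posterior = [w / total for w in weights]
    expected = [
        sum(posterior[m] * loss[m][d] for m in range(num_hyp))
        for d in range(len(loss[0]))
    ]
    return min(range(len(expected)), key=expected.__getitem__)


means = [0.0, 1.0, 2.0]          # one mean per hypothesis
data = [2.1, 1.9, 2.0]           # looks like hypothesis 2
zero_one = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
# A "biased" expert that does not care to distinguish hypotheses 1 and 2.
indifferent = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]

print(bayes_decision(data, means, zero_one))
print(bayes_decision(data, means, indifferent))
```

Under the 0-1 loss the expert reports hypothesis 2. Under the indifferent loss, decisions 1 and 2 have identical expected loss, so the report no longer tells agent $0$ which of the two is true. That is the kind of mismatch agent $0$ has to account for when choosing whom to hire.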

arXiv:1403.3862 [math.OC]: Asynchronous Stochastic Coordinate Descent: Parallelism and Convergence Properties
Ji Liu, Stephen J. Wright
This is another paper on lock-free optimization (cf. HOGWILD!). The key difference, as stated in the introduction, is that they “do not assume that the evaluation vector $\hat{x}$ is a version of $x$ that actually existed in the shared memory at some point in time.” What does this mean? It means that a local processor, when it reads the current state of the iterate, may see a mix of older and newer components, and so may be performing an update with respect to a point not on the sample path of the algorithm. They do assume that the delay between reading and updating the common state is bounded. To analyze this method they need to use a different analysis technique. The analysis is a bit involved and I’ll have to take a deeper look to understand it better, but from a bird’s-eye view this would make sense as long as the step size is chosen properly and the “hybrid” updates can be shown to be not too far from the original sample path. That’s the stochastic approximator in me talking though.
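To see what an evaluation point "not on the sample path" looks like, here is a single-threaded simulation sketch rather than real lock-free code, and not the paper's algorithm or analysis. Each component of $\hat{x}$ is read from a possibly different recent iterate, with the age bounded by `max_delay`. The quadratic objective $\frac{1}{2} x^\top A x - b^\top x$, the step-size rule, and all names are assumptions of mine.

```python
import random
from collections import deque


def async_coordinate_descent(A, b, steps, max_delay=3, gamma=0.1, seed=0):
    """Simulate coordinate descent on 0.5*x'Ax - b'x with inconsistent reads."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    # The most recent iterates; history[-1] is the current shared state.
    history = deque([x[:]], maxlen=max_delay + 1)
    for _ in range(steps):
        i = rng.randrange(n)  # coordinate to update
        # Inconsistent read: every component may come from a different
        # past iterate, so x_hat need not equal any iterate that existed.
        x_hat = [
            history[-1 - rng.randint(0, len(history) - 1)][j]
            for j in range(n)
        ]
        grad_i = sum(A[i][j] * x_hat[j] for j in range(n)) - b[i]
        # The update is applied to the current shared state, not to x_hat.
        x[i] -= gamma * grad_i / A[i][i]
        history.append(x[:])
    return x


# Toy problem (an assumption): a diagonally dominant 4x4 system whose
# exact minimizer has every coordinate equal to 1 / 2.6.
A = [[2.0 if i == j else 0.2 for j in range(4)] for i in range(4)]
b = [1.0] * 4
print(async_coordinate_descent(A, b, steps=10000))
```

The small `gamma` is deliberate. With bounded delay and a conservative step size, the stale "hybrid" reads stay close to the true iterates and the simulation still settles near the minimizer, which matches the bird's-eye intuition above. Larger steps combined with longer delays can destabilize this kind of iteration.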