# Signal boost: Postdoc in Privacy at Penn State

Sofya Raskhodnikova and Adam Smith are looking to fill a postdoc position at Penn State for a multi-year project on privacy, streaming and learning.

Qualifications: Ph.D., with expertise in the theoretical foundations of at least one of the research areas (algorithms, machine learning and statistics, data privacy). Willingness to work on a cross-disciplinary project.

Duration and compensation: At least one year, renewable. Start date is negotiable, though we slightly prefer candidates starting fall 2015. Salary is competitive.

Applicants should email a CV, short research statement and list of references directly to the project leaders ({asmith,sofya}@cse.psu.edu) with “postdoc” in the subject line.

Location: The university is located in the beautiful college town of State College in the center of Pennsylvania. The State College area has 130,000 inhabitants and offers a wide variety of cultural and outdoor recreational activities. The university offers outstanding events from collegiate sporting events to fine arts productions. Many major population centers on the East Coast (New York, Philadelphia, Pittsburgh, Washington D.C., Baltimore) are only a few hours’ drive away, and convenient air services to several major hubs are operated by three major airlines out of State College.

Penn State is an equal opportunity employer. We encourage applications from underrepresented minorities.

# a cute paper: finding a most biased coin with fewest flips

I saw this paper on arXiv a while back and figured it would be a fun read, and it was. Post-ISIT blogging may have to wait for another day or two.

Finding a most biased coin with fewest flips
Karthekeyan Chandrasekaran, Richard Karp
arXiv:1202.3639 [cs.DS]

The setup of the problem is that you have $n$ coins with biases $\{p_i : i \in [n]\}$. For some given $p \in [\epsilon,1-\epsilon]$ and $\epsilon \in (0,1/2)$, each coin is “heavy” ($p_i = p + \epsilon$) with probability $\alpha$ and “light” ($p_i = p - \epsilon$) with probability $1 - \alpha$. The goal is to use a sequential flipping strategy to find a heavy coin with probability at least $1 - \delta$.

Any such procedure has three components, really. First, you have to keep track of some statistics for each coin $i$. On the basis of that, you need a rule to pick which coin to flip. Finally, you need a stopping criterion.

The algorithm they propose is a simple likelihood-based scheme. If I have flipped a particular coin $i$ a bunch of times and gotten $h_i$ heads and $t_i$ tails, then the likelihood ratio is
$L_i = \left( \frac{p+\epsilon}{p - \epsilon} \right)^{h_i} \left( \frac{ 1 - p - \epsilon }{ 1 -p + \epsilon} \right)^{t_i}$
So what the algorithm does is keep track of these likelihoods for the coins that it has flipped so far. But what coin to pick? It is greedy and chooses a coin $i$ which has the largest likelihood $L_i$ so far (breaking ties arbitrarily).

Note that up to now the prior probability $\alpha$ of a coin being heavy has not been used at all, nor has the failure probability $\delta$. These appear in the stopping criterion. The algorithm keeps flipping coins until there exists at least one $i$ for which
$L_i \ge \frac{1 - \alpha}{\alpha} \cdot \frac{ 1 - \delta }{\delta}$
It then outputs the coin with the largest likelihood. It’s a pretty quick calculation to see that given $(h_i, t_i)$ heads and tails for a coin $i$,
$\mathbb{P}(\mathrm{coin\ }i\mathrm{\ is\ heavy}) = \frac{\alpha L_i}{ \alpha L_i + (1 - \alpha) }$,
from which the threshold condition follows.
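
Spelled out, the calculation is just Bayes’ rule, assuming the flips of a coin are independent given its bias. With prior $\alpha$ on “heavy” and $h_i$ heads and $t_i$ tails observed,
$\mathbb{P}(\mathrm{coin\ }i\mathrm{\ is\ heavy}) = \frac{\alpha (p+\epsilon)^{h_i} (1-p-\epsilon)^{t_i}}{\alpha (p+\epsilon)^{h_i} (1-p-\epsilon)^{t_i} + (1-\alpha) (p-\epsilon)^{h_i} (1-p+\epsilon)^{t_i}} = \frac{\alpha L_i}{\alpha L_i + (1-\alpha)}$,
where the second equality divides the numerator and denominator by $(p-\epsilon)^{h_i} (1-p+\epsilon)^{t_i}$. Asking for this posterior to be at least $1 - \delta$ then gives
$\alpha L_i \ge (1-\delta)\left( \alpha L_i + (1-\alpha) \right) \iff \delta \alpha L_i \ge (1-\delta)(1-\alpha) \iff L_i \ge \frac{1 - \alpha}{\alpha} \cdot \frac{1 - \delta}{\delta}$.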

This is a simple-sounding procedure, but to analyze it they make a connection to something called a “multitoken Markov game” which models the corresponding multi-armed bandit problem. What they show is that for the simpler case given by this problem, the corresponding algorithm is, in fact, optimal in the sense that it makes the minimum expected number of flips:
$\frac{16}{\epsilon^2} \left( \frac{1 - \alpha}{\alpha} + \log\left( \frac{(1 -\alpha)(1 - \delta)}{\alpha \delta} \right) \right)$
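
To get a feel for the procedure, here is a minimal Python sketch of the greedy rule and the stopping threshold described above. It is not the authors’ code. It assumes an unbounded supply of coins that are drawn lazily (a never-flipped coin has likelihood ratio 1), it assumes $\epsilon < p < 1 - \epsilon$ strictly so the logarithms are finite, and it tracks $\log L_i$ rather than $L_i$ for numerical stability. The names `find_heavy_coin` and `max_flips` are made up for illustration.

```python
import math
import random


def find_heavy_coin(p, eps, alpha, delta, rng, max_flips=1_000_000):
    """Greedy likelihood-ratio search. Returns (coin index, is_heavy, flips)."""
    # Change in log L_i caused by one head or one tail.
    head_step = math.log((p + eps) / (p - eps))
    tail_step = math.log((1 - p - eps) / (1 - p + eps))
    # Stop once some log L_i reaches log of (1-alpha)/alpha * (1-delta)/delta.
    threshold = math.log((1 - alpha) / alpha) + math.log((1 - delta) / delta)

    biases = []  # true bias of each coin drawn so far (hidden from the rule)
    log_lr = []  # log L_i for each coin drawn so far
    flips = 0
    while flips < max_flips:
        # Greedy choice: an untouched coin has log L = 0, so a new coin is
        # drawn only when every coin flipped so far has log L below 0.
        best = max(range(len(log_lr)), key=log_lr.__getitem__, default=None)
        if best is None or log_lr[best] < 0:
            heavy = rng.random() < alpha
            biases.append(p + eps if heavy else p - eps)
            log_lr.append(0.0)
            best = len(log_lr) - 1
        # Flip the chosen coin and update its log-likelihood ratio.
        heads = rng.random() < biases[best]
        log_lr[best] += head_step if heads else tail_step
        flips += 1
        if log_lr[best] >= threshold:
            return best, biases[best] > p, flips
    raise RuntimeError("flip budget exhausted")


rng = random.Random(0)
coin, is_heavy, flips = find_heavy_coin(p=0.5, eps=0.2, alpha=0.3, delta=0.05, rng=rng)
print(coin, is_heavy, flips)
```

Only the most recently flipped coin can cross the threshold, so checking that one coin after each flip is enough. Ties with an untouched coin are broken in favor of the coin already in hand, which is one of the arbitrary tie-breaking choices the description allows.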

The interesting thing here is that the prior distribution on heavy versus light coins plays a pretty crucial role in designing the algorithm. Part of the explore-exploit tradeoff in bandit problems is the issue of hedging against uncertainty in the distribution of payoffs. If instead you have a good handle on how the payoffs of the arms should vary, you get a much more tractable problem.


# If at first you don’t succeed, normalize, normalize again

My ex-groupmate and fellow Uni High graduate Galen Reeves told me about a paper a few weeks ago when I visited him at Stanford:

Successive Normalization of Rectangular Arrays
Richard A. Olshen and Bala Rajaratnam
The Annals of Statistics 38(3), pp. 1638-1664, 2010

Apparently, however, the arguments in the paper are not quite correct [1], and they recently uploaded a correction to arXiv.

This paper looks at the effect of a very common preprocessing step used to transform an $n \times k$ data matrix $\mathbf{X}$ into a form acceptable for statistical or machine learning algorithms that assume things like zero-mean or bounded vectors. Here $n$ may represent the number of individuals, and $k$ the number of features, for example. Or the data may come from a DNA microarray (their motivating example). This preprocessing is often done without much theoretical justification; it is the mathematical equivalent of tossing spilled salt over your shoulder. This paper looks at the limiting process of standardizing rows and then columns and then rows and then columns again and again. They further need that $n,k \ge 3$. “Readers will see that the process and perhaps especially the mathematics that underlies it are not as simple as we had hoped they would be.”

So what exactly is the preprocessing? I am going to describe things in pseudocode (too lazy to do real code, sorry). Given a data matrix X[i,j] they look at

    for i = 1:n {
        X[i,1:k] = X[i,1:k] - mean(X[i,1:k])
    }
    for j = 1:k {
        X[1:n,j] = X[1:n,j] - mean(X[1:n,j])
    }

They call the first a “row mean polish” and the second a “column mean polish.” They show this converges in one step.
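
For anyone who wants something runnable, here is a small Python sketch of the two mean polishes, using only the standard library and a matrix stored as a list of rows. The function names are made up for illustration, and the random test matrix is only an example. It checks numerically that after one row polish followed by one column polish, both the row means and the column means are zero.

```python
import random


def row_mean_polish(X):
    """Subtract each row's mean from that row (returns a new matrix)."""
    out = []
    for row in X:
        m = sum(row) / len(row)
        out.append([x - m for x in row])
    return out


def col_mean_polish(X):
    """Subtract each column's mean from that column (returns a new matrix)."""
    col_means = [sum(col) / len(col) for col in zip(*X)]
    return [[x - m for x, m in zip(row, col_means)] for row in X]


def max_abs_mean(vectors):
    """Largest absolute mean over a collection of vectors."""
    return max(abs(sum(v) / len(v)) for v in vectors)


rng = random.Random(0)
n, k = 5, 4
X = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]

Y = col_mean_polish(row_mean_polish(X))
row_err = max_abs_mean(Y)             # row means after both polishes
col_err = max_abs_mean(list(zip(*Y)))  # column means after both polishes
print(row_err, col_err)
```

The reason one pass suffices: after the row polish every row sums to zero, so the grand total is zero and the column means themselves sum to zero. Subtracting them changes each row sum by exactly that zero total, so the row means stay at zero.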

But what about standardizing? The more complicated polishing procedure looks like this:

    for i = 1:n {
        mu = mean(X[i,1:k])
        sigma = sqrt( mean( (X[i,1:k] - mu)^2 ) )
        X[i,1:k] = (X[i,1:k] - mu)/sigma
    }
    for j = 1:k {
        mu = mean(X[1:n,j])
        sigma = sqrt( mean( (X[1:n,j] - mu)^2 ) )
        X[1:n,j] = (X[1:n,j] - mu)/sigma
    }

This standardizes rows first, and then columns (or “lather” and “rinse,” since we are going to “repeat”). They call this Efron’s algorithm because he told them about it. So what happens if we repeat these two steps over and over again on a matrix with i.i.d. entries from some continuous distribution?

Theorem 4.1 Efron’s algorithm converges almost surely for X on a Borel set of entries with complement a set of Lebesgue measure 0.

So what does it look like in practice? How fast is this convergence? Empirically, it looks exponential, and they have some theoretical guarantees in the paper, kind of hidden in the discussion. The proofs are not terribly ornate but are tricky, and I don’t quite get all the details myself, but I figured readers of this blog would certainly be interested in this cute result.
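
One way to poke at the convergence empirically is the Python sketch below. It repeats the rows-then-columns standardization and records the largest entrywise change per round. It is an illustration under stated assumptions, not the authors’ code: the standard deviation is the population one (dividing by the number of entries), the entries are i.i.d. Gaussian, and names such as `efron_round` and `successive_normalization` are made up here.

```python
import math
import random


def standardize(v):
    """Shift v to mean 0 and scale it to unit population standard deviation."""
    m = sum(v) / len(v)
    s = math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return [(x - m) / s for x in v]


def transpose(X):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]


def efron_round(X):
    """One 'lather, rinse' pass: standardize rows, then columns."""
    rows_done = [standardize(row) for row in X]
    cols_done = [standardize(col) for col in transpose(rows_done)]
    return transpose(cols_done)


def successive_normalization(X, rounds):
    """Run several passes; record the largest entrywise change in each."""
    changes = []
    for _ in range(rounds):
        Y = efron_round(X)
        changes.append(max(abs(a - b)
                           for ra, rb in zip(X, Y)
                           for a, b in zip(ra, rb)))
        X = Y
    return X, changes


rng = random.Random(1)
n, k = 6, 5
X0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
X_final, changes = successive_normalization(X0, rounds=100)
print(changes[0], changes[-1])
```

Plotting `changes` on a log scale is a quick way to eyeball whether the decay looks exponential for a given matrix. The columns of `X_final` are exactly standardized because the column step runs last. How close the rows are to standardized is what the convergence result is about.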

[1] A fun quote from the paper “Theorem 4.1 of [2] is false. A claimed backwards martingale is NOT. Fortunately, all that seems damaged by the mistake is pride. Much is true.” I really liked this.

Robert Tavernor, Smoot’s Ear : The Measure of Humanity – This is an interesting, albeit dry, history of measurement in Europe, starting from the Greeks and Romans, more or less, up through the development of the metric system. It’s chock full of interesting little facts and also highlights the real problems that arise when there is no standard as well as when trying to define a standard.

Naguib Mahfouz, Palace Walk – The first in Mahfouz’s masterpiece trilogy, this novel follows a very traditional family of an Egyptian merchant, who spends his time partying every night while terrorizing his family during the day. It’s set during the end of the British occupation at the end of WWI and the protests against the British that start at the end of the novel seem eerily relevant today.

Nell Irvin Painter, The History of White People – This is a popular history of theories of race, beauty, and intelligence and how they became entwined with skin color, head-shape, and other measurable quantities. It was an interesting read but felt a little incomplete somehow. Also, she uses the phrase “pride of place” too many times. It was distracting!

Vivek Borkar, Stochastic Approximation : a Dynamical Systems Viewpoint – This slim book gives a concise, albeit very technical, introduction to the basic methods and results in stochastic approximation. It’s fairly mathematically challenging, but because it’s to-the-point, I found it easier going than the book by Kushner and Yin.

# ISIT 2010 : gossip and consensus

THE MISSING PIECE SYNDROME IN PEER-TO-PEER COMMUNICATION (Bruce Hajek, Ji Zhu; University of Illinois at Urbana Champaign)
This paper proposes a model for peer-to-peer content distribution in a Bit-Torrent-like setup where there is a seed node and everybody wants to get K pieces of a file held by the seed. Users arrive according to a Poisson process and peers randomly collect and transfer (instantaneously) one piece. The paper provides a stability analysis for this system based on queueing. It’s a cool model, and the talk had some rather amusing moments for those who were there…

WEIGHTED GOSSIP: DISTRIBUTED AVERAGING USING NON-DOUBLY STOCHASTIC MATRICES (Florence Bénézit; Ecole Normale Supérieure-INRIA, Vincent Blondel; UC Louvain, Patrick Thiran; Ecole polytechnique fédérale de Lausanne, John Tsitsiklis; Massachusetts Institute of Technology, Martin Vetterli; Ecole polytechnique fédérale de Lausanne)
Florence presented convergence results for an algorithm based on one-way path averaging. Inspired by the Push-Sum protocol of Kempe et al., she described a one-way method in which a node “gives away” a fraction of its estimate and pushes it along a random direction in the network. The receiving node takes some of the value and passes the rest along. It’s kind of like passing a plate of food around a table while keeping a little (or a lot) for yourself. It’s a cool algorithm, and it works really well in experiments. However, the rate of convergence is still an open question; it seems related to the convergence of non-homogeneous Markov chains.
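
For readers who have not seen it, here is a minimal Python sketch of the basic Push-Sum idea that inspired the talk. This is the plain synchronous protocol in the style of Kempe et al., not the one-way path-averaging algorithm from the paper. Each node keeps a sum and a weight, keeps half of each, and pushes the other half to a random neighbor. The ratio of sum to weight at each node is its estimate of the average. The function name, the complete-graph example, and the number of rounds are all illustrative assumptions.

```python
import random


def push_sum(values, neighbors, rounds, rng):
    """Synchronous Push-Sum: returns each node's estimate of the average."""
    n = len(values)
    s = list(values)  # running sums, initialized to the node values
    w = [1.0] * n     # running weights, initialized to 1
    for _ in range(rounds):
        new_s = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            j = rng.choice(neighbors[i])
            # Node i keeps half of its (sum, weight) pair and pushes the
            # other half to the randomly chosen neighbor j.
            for target in (i, j):
                new_s[target] += s[i] / 2
                new_w[target] += w[i] / 2
        s, w = new_s, new_w
    return [si / wi for si, wi in zip(s, w)]


rng = random.Random(0)
values = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
n = len(values)
# Complete graph: every node can push to every other node.
neighbors = [[j for j in range(n) if j != i] for i in range(n)]
estimates = push_sum(values, neighbors, rounds=200, rng=rng)
print(estimates)
```

The total sum and total weight are conserved in every round, but a node can receive from several others at once, so the update matrix is column-stochastic rather than doubly stochastic. Dividing by the weight is what corrects for that imbalance.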

TIGHT BOUNDS FOR ALGEBRAIC GOSSIP ON GRAPHS (Michael Borokhovich, Chen Avin, Zvi Lotker; Ben Gurion University of the Negev)
This paper was more discrete in nature. There are $n$ nodes in a network and each has a value in a finite field. They pass linear combinations of their symbols around. The goal is for every node to learn all the information, or equivalently to gather a full-rank set of equations. Nodes can communicate according to a graph structure. They presented upper and lower bounds of $n d_{\max}$ where $d_{\max}$ is the maximum degree in the graph. They also showed the barbell graph is very, very slow.

DISTRIBUTED CONSENSUS WITH FINITE MESSAGING (Debashis Dash, Ashutosh Sabharwal; Rice University)
This was on distributed vertex coloring in which each node gets to know something about the colors in its local neighborhood. This is a bit tough (which they prove), but the authors allow themselves a little slack in that they want to minimize the number of defects (nodes with an adjacent node of the same color), rather than make it $0$. A number of algorithms were presented, many of them based on an initial random assignment followed by a refinement step using the local information.

A NEAR-OPTIMAL ALGORITHM FOR NETWORK-CONSTRAINED AVERAGING WITH NOISY LINKS (Nima Noorshams, Martin J. Wainwright; University of California, Berkeley)
This paper was essentially about packing routes in a “gossip along the way” paradigm — if a node wakes up and starts a path (say horizontally), it can also send a message vertically to trigger path-averaging along parallel paths. This gives a two-phase algorithm and the number of rounds ends up looking like the diameter of the graph. However, the number of one-hop messages scales in the same way. Thus the gain is through parallelization.

# giving no credit where it is not due

Luca pointed to a paper by Chierichetti, Lattanzi, and Panconesi, which has an amusing comment in the last section (I don’t want to spoil it).

The paper itself is interesting, of course. Conductance often appears in bounds on mixing times for Markov chains, but the rumor spreading problem is a bit different than the consensus problems that I have studied in the past. A nice quote from the introduction:

Our long term goal is to characterize a set of necessary and/or sufficient conditions for rumour spreading to be fast in a given network. In this work, we provide a very general sufficient condition — high conductance. Our main motivation comes from the study of social networks. Loosely stated, we are looking after a theorem of the form “Rumour spreading is fast in social networks”. Our result is a good step in this direction because there are reasons to believe that social networks have high conductance.