# SPCOM 2014: some more talks (and a plenary)

I did catch Greg Wornell’s plenary at SPCOM, which was called When Bits Absolutely, Positively, Have to be There as Soon as Possible, a riff on this FedEx commercial, which is older than I am. The talk was on link-aware PHY-layer design– basically looking at how ARQ enables incremental redundancy, and how to do a sort of layered superposition + incremental redundancy scheme in the sequential setting as well as a “multi-path” setting where blocks can arrive out of order. This was really digging into the signal issues in a way that a lot of non-communication engineering information theorists may get squeamish about. The nice thing is that I think the engineering problem is approachable without knowing a lot of heavy-duty math, but still requires some careful analysis.

Communication and Compression Via Sparse Linear Regression
Ramji Venkataramanan
This was on building codewords and codebooks out of a lower-complexity code dictionary $A \in \mathbb{R}^{n \times ML}$ where each codeword is a superposition of $L$ columns, one each from groups of size $M$. Thus encoding is $A \beta$ where $\beta$ is a sparse vector. I saw a talk by Barron and Joseph from a previous ISIT about this, but the framework extends to rate distortion (achieving the rate distortion function), and channel coding. The main point is to lower the complexity of the code at the expense of the gap to optimal rate — encoding and decoding are polynomial time but the rate gap for rate-distortion goes to zero as $1/\log n$. Ramji gave a really nice and clear talk on this — I hope he puts the slides up!

An Optimal Varentropy Bound for Log-Concave Distributions
Mokshay’s talk was also really clear and excellent. For a distribution $f(X)$ on $\mathbb{R}^n$, we can define $\tilde{h}(X) = - \log f(X)$. The entropy is the expectation of this random variable, and the varentropy is the variance. Their main result is a upper bound on the varentropu of log concave distributions $f(X)$. To wit, $\mathrm{Var}(\tilde{h}) \le n$. This bound doesn’t depend on the distribution and is sharp if $f$ is a product of exponentials. They then use this to prove a universal bound on the deviation of $\tilde{h}$ from its expectation, which gives a AEP that doesn’t really assume anything about the joint distribution of the variables except for log-concavity. There was more in the talk, but I eagerly await the paper.

Event-triggered Sampling and Reconstruction of Sparse Real-valued Trigonometric Polynomials
Neeraj Sharma; Thippur V. Sreenivas
This was on non-uniform sampling where the sampler tries to detect level crossings of the analog signal and samples at that point — the rate may not be uniform enough to use existing nonuniform sampling techniques. They come up with a method for reconstructing signals which are real-valued trigonometric polynomials with a few nonzero coefficients (e.g. sparse) and it seems to work pretty decently in experiments.

Removing Sampling Bias in Networked Stochastic Approximation
Vivek Borkar; Raaz Dwivedi
In networked stochastic approximation, the intermittent communication between nodes may mean that the system tracks a different ODE than the one we want. By modifying the method to account for “local clocks” on each edge, we can correct for this, but we end up with new conditions on the step size to make things work. I am pretty excited about this paper, but as usual, my notes were not quite up to getting the juicy bits. That’s what paper reading is for.

On Asymmetric Insertion and Deletion Errors
Ankur A. Kulkarni
The insertion/deletion channel model is notoriously hard. Ankur proposed a new model where $0$‘s are “indestructible” — they cannot be inserted or deleted. This asymmetric model leads to new asymptotic bounds on the capacity. I don’t really work on this channel model so I can’t get the finer points of the results, but once nice takeaway was that asymptotically, each indestructible $0$ in the codeword lets us correct around $1/2$ a deletion more.

# ISIT 2014: a few more talks

Annina Bracher (ETH Zurich, Switzerland); Amos Lapidoth (ETHZ, Switzerland)
The title pretty much describes it — there are two receivers which are both looking out for a particular message. This is the identification problem, in which the receiver only cares about a particular message (but we don’t know which one) and we have to design a code such that they can detect the message. The number of messages is $2^{2^{nC}}$ where $C$ is the Shannon capacity of the DMC. In the broadcast setting we run into the problem that the errors for the two receivers are entangled. However, their message sets are disjoint. The way out is to look at the average error for each (averaged over the other user’s message). The main result is that the rates only depend on the conditional marginals, and they have a strong converse.

Efficient compression of monotone and m-modal distributions
Jayadev Acharya (University of California, San Diego, USA); Ashkan Jafarpour (University of California, San Diego, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
A monotone distribution is a distribution on $\mathbb{N}$ such that the probabilities are non-increasing. The redundancy for this class is infinite, alas, so they restrict the support to size $k$ (where $k$ can be large). They propose a two-step compression scheme in which the first step is to approximate the true distribution with a piecewise constant step distribution, and then use a compression scheme for step distributions.

Writing on a Dirty Paper in the Presence of Jamming
Amitalok J Budkuley (Indian Institute of Technology, Bombay, India); Bikash K Dey (Indian Institute of Technology Bombay, India); Vinod M Prabhakaran (Tata Institute of Fundamental Research, India)
Ahh, jamming. A topic near and dear to my heart. This paper takes a game-theoretic approach to jamming in a DPC setup: “the capacity of the channel in the presence of the jammer is the unique Nash equilibrium utility of the zero sum communication game between the user and the jammer.” This is a mutual information game, and they show that i.i.d. Gaussian jamming and dirty paper coding are a Nash equilibrium. I looked at an AVC version of this problem in my thesis, and the structure is quite a bit different, so this was an interesting different take on the same problem — how can we use the state information to render adversarial interference as harmless as noise?

Stable Grassmann Manifold Embedding via Gaussian Random Matrices
Hailong Shi (Tsinghua University & Department of Electronic Engineering, P.R. China); Hao Zhang (TsinghuaUniversity, P.R. China); Gang Li (Tsinghua University, P.R. China); Xiqin Wang (Tsinghua University, P.R. China)
This was in the session I was chairing. The idea is that you are given a subspace (e.g., a point on the Grassman manifold) and you want to see what happens when you randomly project this into a lower-dimensional subspace using an i.i.d. Gaussian matrix a la Johnson-Lindenstrauss. The JL Lemma says that projections are length-preserving. Are they also volume-preserving? It turns out that they are (no surprise). The main tools are measure concentration results together with a union bound over a covering set.

Is “Shannon capacity of noisy computing” zero?
Pulkit Grover (Carnegie Mellon University, USA)
Yes. I think. Maybe? Pulkit set up a physical model for computation and used a cut-set argument to show that the total energy expenditure is high. I started looking at the paper in the proceedings and realized that it’s significantly different than the talk though, so I’m not sure I really understood the argument. I should read the paper more carefully. You should too, probably.

# ISIT 2014: how many samples do we need?

Due to jetlag, my CAREER proposal deadline, and perhaps a bit of general laziness, I didn’t take as many notes at ISIT as I would have, so my posting will be somewhat light (in addition to being almost a month delayed). If someone else took notes on some talks and wants to guest-post on it, let me know!

Strong Large Deviations for Composite Hypothesis Testing
Yen-Wei Huang (Microsoft Corporation, USA); Pierre Moulin (University of Illinois at Urbana-Champaign, USA)
This talk was actually given by Vincent Tan since neither of the authors could make it (this seems to be a theme of talks I’ve attended this summer. The paper was about testing a simple hypothesis $H_1$ versus a composite hypothesis $H_0$ where under $H_0$ the observations are i.i.d. with respect to one of possibly $k$ different distributions. There are therefore $k$ different errors and the goal is to characterize these errors when we ask for the probability of true detection to be greater than $1 - \epsilon$. This is a sort of generalized Neyman-Pearson setup. They look at the vector of log-likelihood ratios and show that a threshold test is nearly optimal. At the time, I understood the idea of the proof, but I think it’s one of things where you need to really read the paper.

Randomized Sketches of Convex Programs with Sharp Guarantees
Mert Pilanci (University of California, Berkeley, USA); Martin J. Wainwright (University of California, Berkeley, USA)
This talk was about using random projections to lower the complexity of solving a convex program. Suppose we want to minimize $\| Ax - y \|^2$ over $x$ given $y$. A sketch would be to solve $\| SAx - Sy \|^2$ where $S$ is a random projection. One question is how to choose $A$. They show that choosing $S$ to be a randomized Hadamard matrix (the paper studies Gaussian matrices), then the objective value of the sketched program is at most $(1 + \epsilon)^2$ times the value of the original program as long as the the number of rows of $S$ is larger than $O( \epsilon^{-2} \mathbb{W}^2(A \mathcal{K}))$, where $\mathbb{W}(A \mathcal{K})$ is the Gaussian width of the tangent cone of the contraint set at the optimum value. For more details look at their preprint on ArXiV.

On Efficiency and Low Sample Complexity in Phase Retrieval
Youssef Mroueh (MIT-IIT, USA); Lorenzo Rosasco (DIBRIS, Unige and LCSL – MIT, IIT, USA)
This was another talk not given by the authors. The problem is recovery of a complex vector $x_0 \in \mathbb{C}^n$ from phaseless measurements of the form $b_i = |\langle a_i, x_0 \rangle|^2$ where $a_i$ are complex spherically symmetric Gaussian vectors. Recovery from such measurements is nonconvex and tricky, but an alternating minimizing algorithm can reach a local optimum, and if you start it in a “good” initial position, it will find a global optimum. The contribution of this paper is provide such a smart initialization. The idea is to “pair” the measurements to create new measurements $y_i = \mathrm{sign}( b_i^{(1)} - b_i^{(2)} )$. This leads to a new problem (with half as many measurements) which is still hard, so they find a convex relaxation of that. I had thought briefly about such sensing setups a long time ago (and by thought, I mean puzzled over it at a coffeshop once), so it was interesting to see what was known about the problem.

Sorting with adversarial comparators and application to density estimation
Jayadev Acharya (University of California, San Diego, USA); Ashkan Jafarpour (University of California, San Diego, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
Ashkan gave this talk on a problem where you have $m$ samples from an unknown distribution $p$ and a set of distributions $\{q_1, q_2, \ldots, q_n\}$ to compare against. You want to find the distribution that is closest in $\ell_1$. One way to do this is via Scheffe tournament tht compares all pairs of distributions — this runs in time $n^2$ time. They show a method that runs in $O(n)$ time by studying the structure of the comparators used in the sorting method. The motivation is that running comparisons can be expensive (especially if they involve human decisions) so we want to minimize the number of comparisons. The paper is significantly different than the talk, but I think it would definitely be interesting to those interested in discrete algorithms. The density estimation problem is really just a motivator — the sorting problem is far more general.

# ICML 2014: a few more papers

After a long stint of proposal writing, I figured I should catch up on some old languishing posts. So here’s a few quick notes on the remainder of ICML 2014.

• Fast Stochastic Alternating Direction Method of Multipliers (Wenliang Zhong; James Kwok): Most of the talks in the Optimization II session were on ADMM or stochastic optimization, or both. This was int he last category. ADMM can have rather high-complexity update rules, especially on large, complex problems, so the goal is to lower the complexity of the update step by making it stochastic. The hard part seems to be controlling the step size.
• An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar): The full version of this paper is on ArXiV. The authors look at a multi-core lock-free stochastic coordinate descent method and characterize how many cores you need to get linear speedups — this depends on the convexity properties of the objective function.
• Communication-Efficient Distributed Optimization using an Approximate Newton-type Method (Ohad Shamir; Nati Srebro; Tong Zhang): This paper looked 1-shot “average at the end” schemes where you divide the data onto multiple machines, have them each train a linear predictor (for example) using stochastic optimization and then average the results. This is just averaging i.i.d. copies of some complicated random variable (the output of an optimization) so you would expect some variance reduction. This method has been studied by a few people int the last few years. While you do get variance reduction, the bias can still be bad. On the other extreme, communicating at every iteration essentially transmits the entire data set (or worse) over the network. They propose a new method for limiting communication by computing an approximate Newton step without approximating the full Hessian. It works pretty well.
• Lower Bounds for the Gibbs Sampler over Mixtures of Gaussians (Christopher Tosh; Sanjoy Dasgupta): This was a great talk about how MCMC can be really slow to converge. The model is a mixture of Gaussians with random weights (Dirichlet) and means (Gaussian I think). Since the posterior on the parameters is hard to compute, you might want to do Gibbs sampling. They use conductance methods to get a lower bound on the mixing time of the chain. The tricky part is that the cluster labels are permutation invariant — I don’t care if you label clusters (1,2) versus (2,1), so they need to construct some equivalence classes. They also have further results on what happens when the number of clusters is misspecified. I really liked this talk because MCMC always seems like black magic to me (and I even used it in a paper!)
• (Near) Dimension Independent Risk Bounds for Differentially Private Learning (Prateek Jain; Abhradeep Guha Thakurta): Abhradeep presented a really nice paper with a tighter analysis of output and objective perturbation methods for differentially private ERM, along with a new algorithm for risk minimization on the simplex. Abhradeep really only talked about the first part. If you focus on scalar regret, they show that essentially the error comes from taking the inner product of a noise vector with a data vector. If the noise is Gaussian then the noise level is dimension-independent for bounded data. This shows that taking $(\epsilon,\delta)$-differential privacy yield better sample complexity results than $(\epsilon,)$-differential privacy. This feels similar in flavor to a recent preprint on ArXiV by Beimel, Nissim, and Stemmer.
• Near-Optimally Teaching the Crowd to Classify (Adish Singla; Ilija Bogunovic; Gabor Bartok; Amin Karbasi; Andreas Krause): This was one of those talks where I would have to go back to look at the paper a bit more. The idea is that you want to train annotators to do better in a crowd system like Mechanical Turk — which examples should you give them to improve their performance? They model the learners as doing some multiplicative weights update. Under that model, the teacher has to optimize to pick a batch of examples to give to the learner. This is hard, so they use a submodular surrogate function and optimize over that.
• Discrete Chebyshev Classifiers (Elad Eban; Elad Mezuman; Amir Globerson): This was an award-winner. The setup is that you have categorical (not numerical) features on $n$ variables and you want to do some classification. They consider taking pairwise inputs and compute for each tuple $(x_i, x_j, y)$ a marginal $\mu_{ij}(x_i, x_j, y)$. If you want to create a rule $f: \mathcal{X} \to \mathcal{Y}$ for classification, you might want to pick one that has best worst-case performance. One approach is to take the one which has best worst-case performance over all joint distributions on all variables that agree with the empirical marginals. This optimization looks hard because of the exponential number of variables, but they in fact show via convex duality and LP relaxations that it can be solved efficiently. To which I say: wow! More details are in the paper, but the proofs seem to be waiting for a journal version.

# Postdoc opportunity at Aalto University

Aalto University is a new university with over a century of experience. Created from a high-profile merger between three leading universities in Finland – the Helsinki School of Economics, Helsinki University of Technology and the University of Art and Design Helsinki – Aalto University opens up new possibilities for strong multidisciplinary education and research. The university has 20 000 students and a staff of 5,000 including 350 professors.

The stochastics research group at the Department of Mathematics and Systems Analysis is currently undergoing a period of regeneration, as new associate and assistant professors have been employed to replace previous long-term faculty, and several new young researchers are being recruited with the aim of significant growth. To strengthen this line of development, we are now seeking to hire a postdoctoral researcher with a PhD in mathematics or a related area.

The postdoctoral researcher will carry out research in collaboration with the stochastics research group, with a small amount of teaching duties included. The salary is competitive, based on experience and qualifications, and includes occupational health and a travel budget for international conferences and workshops. The position is for one year with a possible extension for another year, starting preferably in September 2014 and no later than January 2015.

Further information and instructions for applying:

Please feel free to forward this message to colleagues and potential candidates. The application deadline is 13 June 2014.

# expected time for an optimal game of memory

Via Dan Katz, I learned about a recent problem (warning: JSTOR) from the American Mathematical Monthly (a publication of the Mathematics Association of America, making shaded icosahedrons look cool for almost 100 years):

What to Expect in a Game of Memory
Author(s): Daniel J. Velleman, Gregory S. Warrington
The American Mathematical Monthly, Vol. 120, No. 9 (November), pp. 787-805

The game of memory is played with a deck of $n$ pairs of cards. The cards in each pair are identical. The deck is shuffled and the cards laid face down. A move consists of flipping over first one card and then another. The cards are removed from play if they match. Otherwise, they are flipped back over and the next move commences. A game ends when all pairs have been matched. We determine that, when the game is played optimally, as $n \rightarrow \infty$
• The expected number of moves is $(3 - 2 \ln 2) n + 7/8 - 2 \ln 2 \approx 1.61 n$.
• The expected number of times two matching cards are unwittingly flipped over is $\ln 2$.
• The expected number of flips until two matching cards have been seen is $2^{2n} / \binom{2n}{n} \approx \sqrt{\pi n}$.

This is not a competitive game of memory, but the singe player version. It’s a kind of explore-exploit tradeoff with a simple structure — if you know how to exploit, do it. Note that one could do $2 n$ moves by flipping every card over once (there are $2n$ cards) to learn all of their identities and then removing all of the pairs one by one. The better strategy is

1. Remove any known pair.
2. If no known pair is known, flip a random unknown card and match it if you can.
3. If the first card is not matchable, flip another random unknown card to learn its value (and remove the pair if it matches.

This strategy exploits optimally when it can and explores optimally when it can’t. The second bullet point in the abstract is the gain from getting lucky, i.e. two randomly drawn cards matching.

The paper is an interesting read, but the arguments are all combinatorial. Since the argument is a limiting one as $n \to \infty$, I wonder if there is a more “probabilistic” argument (this is perhaps a bit fuzzy) for the results.

# estimating probability metrics from samples

I took a look at this interesting paper by Sriperumbudur et al., On the empirical estimation of integral probability metrics (Electronic Journal of Statistics Vol. 6 (2012) pp.1550–1599). The goal of the paper is to estimate a distance or divergence between two distributions $P$ and $Q$ based on samples from each distribution. This sounds pretty vague at first… what kind of distributions? How many samples? This paper looks at integral probability metrics, which have the form

$\gamma(P,Q) = \sup_{f \in \mathcal{F}} \left| \int_{S} f dP - \int_{S} f dQ \right|$

where $S$ is a measurable space on which $P$ and $Q$ are defined, and $\mathcal{F}$ is a class of real-valued bounded measurable functions on $S$. This class doesn’t contain Csiszár $\phi$-divergences (also known as Ali-Silvey distances), but does contain the total variational distance.

Different choices of the function class give rise to different measures of difference used in so-called two-sample tests, such as the Kolmogorov-Smirnov test. The challenge in practically using these tests is that it’s hard to get bounds on how fast an estimator of $\gamma(P,Q)$ converges if we have to estimate it from samples of $P$ and $Q$. The main result of the paper is to provide estimators with consistency and convergence guarantees. In particular, they estimators are based on either linear programming or (in the case of kernel tests) in closed form.

The second section of the paper connects tests based on IPMs with the risk associated to classification rules for separating $P$ and $Q$ when the separation rule is restricted to come from the function class $\mathcal{F}$ associated to the rule. This is a nice interpretation of these two-sample tests — they are actually doing similar things for restricted classes of classifiers/estimators.

Getting back to KL divergence and non-IPM measures, since total variation gives a lower bound on the KL divergence, they also provide lower bounds on the total variation distance in terms of other IPM metrics. This is important since the total variation distance can’t be estimated itself in a strongly consistent way. This could be useful for algorithms which need to estimate the total variation distance for continuous data. In general, estimating distances between multivariate continuous distributions can become a bit of a mess when you have to use real data — doing a plug-in estimate using, e.g. a kernel density estimator is not always the best way to go, and instead attacking the distance you want to measure directly could yield better results.

# Postdoc at INRIA-ENS Paris on Graphs, Algorithms and Probability

Applications are invited for a Postdoc position (full-time, up to 2 years) at INRIA-ENS in Paris. The position is funded by the ANR GAP grant “Graphs, Algorithms and Probability.”

Requirements are a PhD degree in Computer Science or Mathematics and a strong background in some of the following topics:

• discrete probability
• statistical learning
• combinatorial optimization
• stochastic networks

Applications must include a research statement, a CV and the names and contacts of references. All material should be sent by email to Marc Lelarge. Please indicate in the subject POSTDOC GAP.

Important dates:

• Intention of application (short email): as soon as possible
• Deadline for application: December 1st, 2013
• Suggested starting dates: Jan.-Feb. 2014

# ISIT Blogging, part 2

Logarithmic Sobolev inequalities and strong data processing theorems for discrete channels
Maxim Raginsky (University of Illinois at Urbana-Champaign, USA)
Max talked about how the strong data processing inequality (DPI) is basically a log-Sobolev inequality (LSI) that is used in measure concentration. The strong DPI says that
$D(QW \| PW) \le \alpha D(Q \| P)$
for some $\alpha < 1$, so the idea is to get bounds on
$\delta^*(P,W) = \sup_{Q} \frac{D(QW \| PW)}{D(Q \| P)}$.
What he does is construct a hierarchy of LSIs in which the strong DPI fits and then gets bounds on this ratio in terms of best constants for LSIs. The details are a bit hairy, and besides, Max has his own blog so he can write more about it if he wants…

Building Consensus via Iterative Voting
Farzad Farnoud (University of Illinois, Urbana-Champaign, USA); Eitan Yaakobi (Caltech, USA); Behrouz Touri (University of Illinois Urbana-Champaign, USA); Olgica Milenkovic (University of Illinois, USA); Jehoshua Bruck (California Institute of Technology, USA)
This paper was about rank aggregation, or how to take a bunch of votes expressed as permutations/rankings of options to produce a final option. The model is one in which people iteratively change their ranking based on the current ranking. For example, one could construct the pairwise comparison graph (a la Condorcet) and then have people change their rankings when they disagree with the majority on an edge. They show conditions under which this process converges (the graph should not have a cycle) and show that if there is a Condorcet winner, then after this process everyone will rank the Condorcet winner first. They also look at a Borda count version of this problem but to my eye that just looked like an average consensus method, but it was at the end of the talk so I might have missed something.

Information-Theoretic Study of Voting Systems
Eitan Yaakobi (Caltech, USA); Michael Langberg (Open University of Israel, Israel); Jehoshua Bruck (California Institute of Technology, USA)
Eitan gave this talk and the preceding talk. This one was about looking at voting through the lens of coding theory. The main issue is this — what sets of votes or distribution of vote profiles will lead to a Condorcet winner? Given a set of votes, one could look at the fraction of candidates who rank candidate j in the i-th position and then try to compute entropies of the resulting distributions. The idea is somehow to characterize the existence or lack of a Condorcet winner in terms of distances (Kendall tau) and these entropy measures. This is different than looking at probability distributions on permutations and asking about the probability of there existing a Condorcet cycle.

Brute force searching, the typical set and Guesswork
Mark Chirstiansen (National University of Ireland Maynooth, Ireland); Ken R Duffy (National University of Ireland Maynooth, Ireland); Flávio du Pin Calmon (Massachusetts Institute of Technology, USA); Muriel Médard (MIT, USA)
Suppose an item $X$ is chosen $\sim P$ from a list and we want to guess the choice that is made. We’re only allowed to ask questions of the form “is the item $Y$?” Suppose now that the list is a list of codewords of blocklength $k$ drawn i.i.d. according to $Q$. This paper looks at the number of guesses one needs if $P$ is uniform on the typical set according to $Q$ versus when $P$ is distributed according the distribution $Q^k$ conditioned on $X$ being in the typical set. The non-uniformity of the latter turns out to make the guessing problem a lot easier.

Rumor Source Detection under Probabilistic Sampling
Nikhil Karamchandani (University of California Los Angeles, USA); Massimo Franceschetti (University of California at San Diego, USA)
This paper looked at an SI model of infection on a graph — nodes are either Susceptible (S) or Infected (I), and there is a probability of transitioning from S to I based on your neighbors’ states. There in exponential waiting time $\tau_{ij}$ for the $i$ to infect $j$ if $i$ is infected. The idea is that the rumor starts somewhere and infects a bunch of people and then you get to observe/measure the network. You want to find the source. This was studied by Zaman and Shah under the assumption of perfect observation of all nodes. This work looked at the case where nodes randomly report their infection state, so you only get an incomplete picture of the infection state. They characterize the effect of the reporting probability on the excess error and show that for certain tree graphs, incomplete reporting is as good as full reporting.

/