Paper writing as a POMDP

During a research conversation this week I came to the realization that we really should model the paper writing as a partially observed Markov decision problem (POMDP). The writer has an action space of writing SIMPLE or FANCY statements. A SIMPLE statement follows very clearly from the previous statement and so should be easy to follow. A FANCY statement is a bit more challenging but might sound more impressive the reviewer. The reviewer has several states: BORED, IMPRESSED, and CONFUSED. Based on the input being SIMPLE or FANCY, the reviewer changes state according to a Markov chain with this state space.

The writer suffers a cost (or gains a reward, depending on how we want to model it) based on the current action and reviewer state. For example, issuing a FANCY statement when the reviewer is CONFUSED might suffer a higher cost than a FANCY statement when IMPRESSED. The goal is to minimize the discounted cost. This makes sense since reviewers become more fatigued over time, so extracting reward earlier in the paper is better.

Now, conventional wisdom might suggest that clarity is best, so a pure-SIMPLE policy should be a dominant strategy for this problem. However, depending on the cost and transition structure, I am not sure this is true.

And now, back to paper writing for me…

Allerton 2014: hypercontracting interference channels while noisily feeding back what you detected on the graph

Before the expiration window passes, here are few more short takes from Allerton… for some talks I couldn’t take notes because I didn’t get a seat or I missed half the talk shuttling between sessions.

The Gaussian Channel with Noisy Feedback: Near-Capacity Performance Via Simple Interaction
Assaf Ben-Yishai, Ofer Shayevitz
This was a really nice talk by Ofer on trying to get practical codes for AWGN channels with noisy feedback by using the intuition given by the Schalkwijk-Kailath scheme plus some tricks from using the mod operation. This is reminiscent of lattices (which may be an interesting future direction). The SK scheme has a problem with noise accumulation, which they deal with using these mode operations, and can get to errors around 10^(-6) with around 19 rounds, or blocklength 19 at reasonable SNRs. Blocklength is misleading here since there is feedback every symbol. The other catch is that the feedback link must have much higher SNR than the forward link, but this is true in applications such as sensing, where the receiver may be plugged into the wall, but the transmitter may be on a swallowable medical monitoring device.

Point-To-Point Codes for Interference Channels: A Journey Toward High Performance at Low Complexity
Young-Han Kim
Continuing with my UCSD bias, I also wanted to mention Young-Han’s talk, which was on using COTS (commercial, off-the-shelf) coding schemes on the interference channel (in particular, the 2 user IC). He talked about rate splitting approaches and block Markov schemes. Much of this work is with Lele Wang, who may be graduating soon…

Signal Detection on Graphs
Venkatesh Saligrama
This was a hypothesis testing problem where the observations come from nodes on graph. Under the null, they are Gaussian noise, and under the other hypothesis, there is a connected subgraph with an elevated mean. How should we do detection in this scenario? This is a compound hypothesis testing problem because there are (too) many possible connected subgraphs to consider. He gets around this by looking at a convex program parameterized by a measure of the size/shape of the connected component. This is where my notes get messy though, so you might want to look at the paper if it sounds interesting to you…

Hypercontractivity in Hamming Space
Yury Polyanskiy
I’ve hypercontractivity before, and Yury talked about his paper on ArXiV, which is about functions on the binary hypercube. This talk felt more like a tour of results on hypercontractivity and less like a “here is my new result” talk, which I actually appreciated because I felt it tied together ideas well and made me realize how strange the hypercontractivity parameter of an operator is.

Allerton 2014: privately charging your information odometer while realizing your group has pretenders

Here are a few papers that I saw at Allerton — more to come later.

Group Testing with Unreliable Elements
Arya Mazumdar, Soheil Mohajer
This was a generalization of the group testing problem: items are either positive or null, and can be tested in groups such that if any element of the group is positive, the whole group will test positive. The item states can be thought of as a binary column vector \mathbf{x} and each test is the row or a matrix A: the (i,j)-the entry of A is 1 if the i-th item is part of the j-th group. The multiplication A \mathbf{x} is taken using Boolean OR. The twist in this paper is that they consider a situation where \tau elements can “pretend” to be positive in each test, with a possibly different group in each test. This is different than “noisy group testing” which was considered previously, which is more like A \mathbf{x} + \mathrm{noise}. They show achievable rates for detecting the positives using random coding methods (i.e. a random A matrix). There was some stuff at the end about the non-i.i.d. case but my notes were sketchy at that point.

The Minimal Realization Problems for Hidden Markov Models
Qingqing Huang, Munther Dahleh, Rong Ge, Sham Kakade
The realization problem for an HMM is this: given the exact joint probability distribution on length N strings (Y_1, Y_2, \ldots, Y_N) from an HMM, can we create a multilinear system whose outputs have the same joint distribution? A multilinear system looks like this:
w_{t+1} = A^{(\ell_t)} w_t for w_t \in \mathbb{R}^k with w_0 = v
z_{t} = u^t w_t
We think of A^{(\ell_t)} \in \mathbb{R}^{k \times k} as the transition for state \ell_t (e.g. a stochastic matrix) and we want the output to be z_t. This is sort of at the nexus of control/Markov chains and HMMs and uses some of the tensor ideas that are getting hot in machine learning. As the abstract puts it, the results are that they can efficiently construct realizations if N > max(4 log_d(k) + 1; 3), where d is the size of the output alphabet size, and k is the minimal order of the realization.”

Differentially Private Distributed Protocol for Electric Vehicle Charging
Shuo Han, Ufuk Topcu, George Pappas
This paper was about a central aggregator trying to get multiple electric vehicle users to report their charging needs. The central allocator has a utility maximization problem over the rates of charging:
\min_{r_i} U(\sum r_i)
\mathrm{s.t.}\  \le r_i \le \bar{r}_i
\qquad \mathbf{1}^{\top} r_i = z_i
They focus on the case where the charge demands z_i are private but the charging rates r_i can be public. This is in contrast to the mechanism design literature popularized by Aaron Roth and collaborators. They do an analysis of a projected gradient descent procedure with Laplace noise added for differential privacy. It’s a stochastic gradient method but seems to converge in just a few time steps — clearly the number of iterations may impact the privacy parameter, but for the examples they showed it was only a handful of time steps.

An Interactive Information Odometer and Applications
Mark Braverman and Omri Weinstein
This was a communication complexity talk about trying to maintain an estimate of the information complexity of a protocol \Pi for computing a function f(X,Y) interactively, where Alice has X, Bob has Y, and (X,Y) \sim \mu:
\mathsf{IC}_{\mu}(\Pi) = I( \Pi ; X | Y ) + I( \Pi ; Y | X)
Previous work has shown that the (normalized) communication complexity of computing f on n-tuples of variables \epsilon-accurately approaches the minimum information complexity over all protocols for computing f once \epsilon-accurately. At least I think this is what was shown previously — it’s not quite my area. The result in this paper is a strong converse for this — the goal is to maintain an online estimate of \mathsf{IC} during the protocol. The mechanism seeems a bit like communication with feedback a la Horstein, but I got a bit confused as to what was going on — basically it seems that Alice and Bob need to be able to agree on the estimate during the protocol and use a few extra bits added to the communication to maintain this estimate. If I had an extra n hours in the week I would read up more about this. Maybe on my next plane ride…

SPCOM 2014: some more talks (and a plenary)

I did catch Greg Wornell’s plenary at SPCOM, which was called When Bits Absolutely, Positively, Have to be There as Soon as Possible, a riff on this FedEx commercial, which is older than I am. The talk was on link-aware PHY-layer design– basically looking at how ARQ enables incremental redundancy, and how to do a sort of layered superposition + incremental redundancy scheme in the sequential setting as well as a “multi-path” setting where blocks can arrive out of order. This was really digging into the signal issues in a way that a lot of non-communication engineering information theorists may get squeamish about. The nice thing is that I think the engineering problem is approachable without knowing a lot of heavy-duty math, but still requires some careful analysis.

Communication and Compression Via Sparse Linear Regression
Ramji Venkataramanan
This was on building codewords and codebooks out of a lower-complexity code dictionary A \in \mathbb{R}^{n \times ML} where each codeword is a superposition of L columns, one each from groups of size M. Thus encoding is A \beta where \beta is a sparse vector. I saw a talk by Barron and Joseph from a previous ISIT about this, but the framework extends to rate distortion (achieving the rate distortion function), and channel coding. The main point is to lower the complexity of the code at the expense of the gap to optimal rate — encoding and decoding are polynomial time but the rate gap for rate-distortion goes to zero as 1/\log n. Ramji gave a really nice and clear talk on this — I hope he puts the slides up!

An Optimal Varentropy Bound for Log-Concave Distributions
Mokshay Madiman; Liyao Wang
Mokshay’s talk was also really clear and excellent. For a distribution f(X) on \mathbb{R}^n, we can define \tilde{h}(X) = - \log f(X). The entropy is the expectation of this random variable, and the varentropy is the variance. Their main result is a upper bound on the varentropu of log concave distributions f(X). To wit, \mathrm{Var}(\tilde{h}) \le n. This bound doesn’t depend on the distribution and is sharp if f is a product of exponentials. They then use this to prove a universal bound on the deviation of \tilde{h} from its expectation, which gives a AEP that doesn’t really assume anything about the joint distribution of the variables except for log-concavity. There was more in the talk, but I eagerly await the paper.

Event-triggered Sampling and Reconstruction of Sparse Real-valued Trigonometric Polynomials
Neeraj Sharma; Thippur V. Sreenivas
This was on non-uniform sampling where the sampler tries to detect level crossings of the analog signal and samples at that point — the rate may not be uniform enough to use existing nonuniform sampling techniques. They come up with a method for reconstructing signals which are real-valued trigonometric polynomials with a few nonzero coefficients (e.g. sparse) and it seems to work pretty decently in experiments.

Removing Sampling Bias in Networked Stochastic Approximation
Vivek Borkar; Raaz Dwivedi
In networked stochastic approximation, the intermittent communication between nodes may mean that the system tracks a different ODE than the one we want. By modifying the method to account for “local clocks” on each edge, we can correct for this, but we end up with new conditions on the step size to make things work. I am pretty excited about this paper, but as usual, my notes were not quite up to getting the juicy bits. That’s what paper reading is for.

On Asymmetric Insertion and Deletion Errors
Ankur A. Kulkarni
The insertion/deletion channel model is notoriously hard. Ankur proposed a new model where 0‘s are “indestructible” — they cannot be inserted or deleted. This asymmetric model leads to new asymptotic bounds on the capacity. I don’t really work on this channel model so I can’t get the finer points of the results, but once nice takeaway was that asymptotically, each indestructible 0 in the codeword lets us correct around 1/2 a deletion more.

ISIT 2014: a few more talks

Identification via the Broadcast Channel
Annina Bracher (ETH Zurich, Switzerland); Amos Lapidoth (ETHZ, Switzerland)
The title pretty much describes it — there are two receivers which are both looking out for a particular message. This is the identification problem, in which the receiver only cares about a particular message (but we don’t know which one) and we have to design a code such that they can detect the message. The number of messages is 2^{2^{nC}} where C is the Shannon capacity of the DMC. In the broadcast setting we run into the problem that the errors for the two receivers are entangled. However, their message sets are disjoint. The way out is to look at the average error for each (averaged over the other user’s message). The main result is that the rates only depend on the conditional marginals, and they have a strong converse.

Efficient compression of monotone and m-modal distributions
Jayadev Acharya (University of California, San Diego, USA); Ashkan Jafarpour (University of California, San Diego, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
A monotone distribution is a distribution on \mathbb{N} such that the probabilities are non-increasing. The redundancy for this class is infinite, alas, so they restrict the support to size k (where k can be large). They propose a two-step compression scheme in which the first step is to approximate the true distribution with a piecewise constant step distribution, and then use a compression scheme for step distributions.

Writing on a Dirty Paper in the Presence of Jamming
Amitalok J Budkuley (Indian Institute of Technology, Bombay, India); Bikash K Dey (Indian Institute of Technology Bombay, India); Vinod M Prabhakaran (Tata Institute of Fundamental Research, India)
Ahh, jamming. A topic near and dear to my heart. This paper takes a game-theoretic approach to jamming in a DPC setup: “the capacity of the channel in the presence of the jammer is the unique Nash equilibrium utility of the zero sum communication game between the user and the jammer.” This is a mutual information game, and they show that i.i.d. Gaussian jamming and dirty paper coding are a Nash equilibrium. I looked at an AVC version of this problem in my thesis, and the structure is quite a bit different, so this was an interesting different take on the same problem — how can we use the state information to render adversarial interference as harmless as noise?

Stable Grassmann Manifold Embedding via Gaussian Random Matrices
Hailong Shi (Tsinghua University & Department of Electronic Engineering, P.R. China); Hao Zhang (TsinghuaUniversity, P.R. China); Gang Li (Tsinghua University, P.R. China); Xiqin Wang (Tsinghua University, P.R. China)
This was in the session I was chairing. The idea is that you are given a subspace (e.g., a point on the Grassman manifold) and you want to see what happens when you randomly project this into a lower-dimensional subspace using an i.i.d. Gaussian matrix a la Johnson-Lindenstrauss. The JL Lemma says that projections are length-preserving. Are they also volume-preserving? It turns out that they are (no surprise). The main tools are measure concentration results together with a union bound over a covering set.

Is “Shannon capacity of noisy computing” zero?
Pulkit Grover (Carnegie Mellon University, USA)
Yes. I think. Maybe? Pulkit set up a physical model for computation and used a cut-set argument to show that the total energy expenditure is high. I started looking at the paper in the proceedings and realized that it’s significantly different than the talk though, so I’m not sure I really understood the argument. I should read the paper more carefully. You should too, probably.

ISIT 2014: how many samples do we need?

Due to jetlag, my CAREER proposal deadline, and perhaps a bit of general laziness, I didn’t take as many notes at ISIT as I would have, so my posting will be somewhat light (in addition to being almost a month delayed). If someone else took notes on some talks and wants to guest-post on it, let me know!

Strong Large Deviations for Composite Hypothesis Testing
Yen-Wei Huang (Microsoft Corporation, USA); Pierre Moulin (University of Illinois at Urbana-Champaign, USA)
This talk was actually given by Vincent Tan since neither of the authors could make it (this seems to be a theme of talks I’ve attended this summer. The paper was about testing a simple hypothesis H_1 versus a composite hypothesis H_0 where under H_0 the observations are i.i.d. with respect to one of possibly k different distributions. There are therefore k different errors and the goal is to characterize these errors when we ask for the probability of true detection to be greater than 1 - \epsilon. This is a sort of generalized Neyman-Pearson setup. They look at the vector of log-likelihood ratios and show that a threshold test is nearly optimal. At the time, I understood the idea of the proof, but I think it’s one of things where you need to really read the paper.

Randomized Sketches of Convex Programs with Sharp Guarantees
Mert Pilanci (University of California, Berkeley, USA); Martin J. Wainwright (University of California, Berkeley, USA)
This talk was about using random projections to lower the complexity of solving a convex program. Suppose we want to minimize \| Ax - y \|^2 over x given y. A sketch would be to solve \| SAx - Sy \|^2 where S is a random projection. One question is how to choose A. They show that choosing S to be a randomized Hadamard matrix (the paper studies Gaussian matrices), then the objective value of the sketched program is at most (1 + \epsilon)^2 times the value of the original program as long as the the number of rows of S is larger than O( \epsilon^{-2} \mathbb{W}^2(A \mathcal{K})), where \mathbb{W}(A \mathcal{K}) is the Gaussian width of the tangent cone of the contraint set at the optimum value. For more details look at their preprint on ArXiV.

On Efficiency and Low Sample Complexity in Phase Retrieval
Youssef Mroueh (MIT-IIT, USA); Lorenzo Rosasco (DIBRIS, Unige and LCSL – MIT, IIT, USA)
This was another talk not given by the authors. The problem is recovery of a complex vector x_0 \in \mathbb{C}^n from phaseless measurements of the form b_i = |\langle a_i, x_0 \rangle|^2 where a_i are complex spherically symmetric Gaussian vectors. Recovery from such measurements is nonconvex and tricky, but an alternating minimizing algorithm can reach a local optimum, and if you start it in a “good” initial position, it will find a global optimum. The contribution of this paper is provide such a smart initialization. The idea is to “pair” the measurements to create new measurements y_i = \mathrm{sign}( b_i^{(1)} - b_i^{(2)} ). This leads to a new problem (with half as many measurements) which is still hard, so they find a convex relaxation of that. I had thought briefly about such sensing setups a long time ago (and by thought, I mean puzzled over it at a coffeshop once), so it was interesting to see what was known about the problem.

Sorting with adversarial comparators and application to density estimation
Jayadev Acharya (University of California, San Diego, USA); Ashkan Jafarpour (University of California, San Diego, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
Ashkan gave this talk on a problem where you have m samples from an unknown distribution p and a set of distributions \{q_1, q_2, \ldots, q_n\} to compare against. You want to find the distribution that is closest in \ell_1. One way to do this is via Scheffe tournament tht compares all pairs of distributions — this runs in time n^2 time. They show a method that runs in O(n) time by studying the structure of the comparators used in the sorting method. The motivation is that running comparisons can be expensive (especially if they involve human decisions) so we want to minimize the number of comparisons. The paper is significantly different than the talk, but I think it would definitely be interesting to those interested in discrete algorithms. The density estimation problem is really just a motivator — the sorting problem is far more general.

ICML 2014: a few more papers

After a long stint of proposal writing, I figured I should catch up on some old languishing posts. So here’s a few quick notes on the remainder of ICML 2014.

  • Fast Stochastic Alternating Direction Method of Multipliers (Wenliang Zhong; James Kwok): Most of the talks in the Optimization II session were on ADMM or stochastic optimization, or both. This was int he last category. ADMM can have rather high-complexity update rules, especially on large, complex problems, so the goal is to lower the complexity of the update step by making it stochastic. The hard part seems to be controlling the step size.
  • An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar): The full version of this paper is on ArXiV. The authors look at a multi-core lock-free stochastic coordinate descent method and characterize how many cores you need to get linear speedups — this depends on the convexity properties of the objective function.
  • Communication-Efficient Distributed Optimization using an Approximate Newton-type Method (Ohad Shamir; Nati Srebro; Tong Zhang): This paper looked 1-shot “average at the end” schemes where you divide the data onto multiple machines, have them each train a linear predictor (for example) using stochastic optimization and then average the results. This is just averaging i.i.d. copies of some complicated random variable (the output of an optimization) so you would expect some variance reduction. This method has been studied by a few people int the last few years. While you do get variance reduction, the bias can still be bad. On the other extreme, communicating at every iteration essentially transmits the entire data set (or worse) over the network. They propose a new method for limiting communication by computing an approximate Newton step without approximating the full Hessian. It works pretty well.
  • Lower Bounds for the Gibbs Sampler over Mixtures of Gaussians (Christopher Tosh; Sanjoy Dasgupta): This was a great talk about how MCMC can be really slow to converge. The model is a mixture of Gaussians with random weights (Dirichlet) and means (Gaussian I think). Since the posterior on the parameters is hard to compute, you might want to do Gibbs sampling. They use conductance methods to get a lower bound on the mixing time of the chain. The tricky part is that the cluster labels are permutation invariant — I don’t care if you label clusters (1,2) versus (2,1), so they need to construct some equivalence classes. They also have further results on what happens when the number of clusters is misspecified. I really liked this talk because MCMC always seems like black magic to me (and I even used it in a paper!)
  • (Near) Dimension Independent Risk Bounds for Differentially Private Learning (Prateek Jain; Abhradeep Guha Thakurta): Abhradeep presented a really nice paper with a tighter analysis of output and objective perturbation methods for differentially private ERM, along with a new algorithm for risk minimization on the simplex. Abhradeep really only talked about the first part. If you focus on scalar regret, they show that essentially the error comes from taking the inner product of a noise vector with a data vector. If the noise is Gaussian then the noise level is dimension-independent for bounded data. This shows that taking (\epsilon,\delta)-differential privacy yield better sample complexity results than (\epsilon,)-differential privacy. This feels similar in flavor to a recent preprint on ArXiV by Beimel, Nissim, and Stemmer.
  • Near-Optimally Teaching the Crowd to Classify (Adish Singla; Ilija Bogunovic; Gabor Bartok; Amin Karbasi; Andreas Krause): This was one of those talks where I would have to go back to look at the paper a bit more. The idea is that you want to train annotators to do better in a crowd system like Mechanical Turk — which examples should you give them to improve their performance? They model the learners as doing some multiplicative weights update. Under that model, the teacher has to optimize to pick a batch of examples to give to the learner. This is hard, so they use a submodular surrogate function and optimize over that.
  • Discrete Chebyshev Classifiers (Elad Eban; Elad Mezuman; Amir Globerson): This was an award-winner. The setup is that you have categorical (not numerical) features on n variables and you want to do some classification. They consider taking pairwise inputs and compute for each tuple (x_i, x_j, y) a marginal \mu_{ij}(x_i, x_j, y). If you want to create a rule f: \mathcal{X} \to \mathcal{Y} for classification, you might want to pick one that has best worst-case performance. One approach is to take the one which has best worst-case performance over all joint distributions on all variables that agree with the empirical marginals. This optimization looks hard because of the exponential number of variables, but they in fact show via convex duality and LP relaxations that it can be solved efficiently. To which I say: wow! More details are in the paper, but the proofs seem to be waiting for a journal version.