# Data: what is it good for? (Absolutely Something): the first few weeks

So Waheed Bajwa and I have been teaching this Byrne Seminar on “data science.” At Allerton some people asked me how it was going and what we were covering in the class. These seminars are meant to be more discussion-based. This is a bit tough for us in particular:

• engineering classes are generally NOT discussion-based, neither in the US nor in Pakistan
• it’s been more than a decade since we were undergraduates, let alone 18
• the students in our class are fresh out of high school and also haven’t had discussion-based classes

My one experience in leading discussion was covering for a theater class approximately 10 years ago, but that was junior-level elective as I recall, and the dynamics were quite a bit different. So getting a discussion going and getting all of the students to participate is, on top of being tough in general, particularly challenging for us. What has helped is that a number of the students in the class are pretty engaged with the ideas and material, and we do in the end get to collectively think about the technologies around us and the role that data plays a bit differently.

What I wanted to talk about in this post was what we’ve covered in the first few weeks. If we offer this class again it would be good to revisit some of the decisions we’ve made along the way, as this is as much a learning process for us as it is for them. A Byrne Seminar meets for 10 times during the semester, so that it will end well before finals. We had some overflow from one topic to the next, but roughly speaking the class went in the following order:

• Introduction: what is data?
• Potentials and perils of data science
• The importance of modeling
• Statistical considerations
• Machine learning and algorithms
• Data and society: ethics and privacy
• Data visualizaion
• Project Presentations

I’ll talk a bit more on the blog about this class, what we covered, what readings/videos we ended up choosing, and how it went. I think it would be fun to offer this course again, assuming our evaluations pass muster. But in the meantime, the class is still on, so it’s a bit hard to pass retrospective judgement.

# Allerton 2014: privately charging your information odometer while realizing your group has pretenders

Here are a few papers that I saw at Allerton — more to come later.

Group Testing with Unreliable Elements
Arya Mazumdar, Soheil Mohajer
This was a generalization of the group testing problem: items are either positive or null, and can be tested in groups such that if any element of the group is positive, the whole group will test positive. The item states can be thought of as a binary column vector $\mathbf{x}$ and each test is the row or a matrix $A$: the $(i,j)$-the entry of $A$ is 1 if the $i$-th item is part of the $j$-th group. The multiplication $A \mathbf{x}$ is taken using Boolean OR. The twist in this paper is that they consider a situation where $\tau$ elements can “pretend” to be positive in each test, with a possibly different group in each test. This is different than “noisy group testing” which was considered previously, which is more like $A \mathbf{x} + \mathrm{noise}$. They show achievable rates for detecting the positives using random coding methods (i.e. a random $A$ matrix). There was some stuff at the end about the non-i.i.d. case but my notes were sketchy at that point.

The Minimal Realization Problems for Hidden Markov Models
Qingqing Huang, Munther Dahleh, Rong Ge, Sham Kakade
The realization problem for an HMM is this: given the exact joint probability distribution on length $N$ strings $(Y_1, Y_2, \ldots, Y_N)$ from an HMM, can we create a multilinear system whose outputs have the same joint distribution? A multilinear system looks like this:
$w_{t+1} = A^{(\ell_t)} w_t$ for $w_t \in \mathbb{R}^k$ with $w_0 = v$
$z_{t} = u^t w_t$
We think of $A^{(\ell_t)} \in \mathbb{R}^{k \times k}$ as the transition for state $\ell_t$ (e.g. a stochastic matrix) and we want the output to be $z_t$. This is sort of at the nexus of control/Markov chains and HMMs and uses some of the tensor ideas that are getting hot in machine learning. As the abstract puts it, the results are that they can efficiently construct realizations if $N > max(4 log_d(k) + 1; 3)$, where $d$ is the size of the output alphabet size, and $k$ is the minimal order of the realization.”

Differentially Private Distributed Protocol for Electric Vehicle Charging
Shuo Han, Ufuk Topcu, George Pappas
This paper was about a central aggregator trying to get multiple electric vehicle users to report their charging needs. The central allocator has a utility maximization problem over the rates of charging:
$\min_{r_i} U(\sum r_i)$
$\mathrm{s.t.}\ \le r_i \le \bar{r}_i$
$\qquad \mathbf{1}^{\top} r_i = z_i$
They focus on the case where the charge demands $z_i$ are private but the charging rates $r_i$ can be public. This is in contrast to the mechanism design literature popularized by Aaron Roth and collaborators. They do an analysis of a projected gradient descent procedure with Laplace noise added for differential privacy. It’s a stochastic gradient method but seems to converge in just a few time steps — clearly the number of iterations may impact the privacy parameter, but for the examples they showed it was only a handful of time steps.

An Interactive Information Odometer and Applications
Mark Braverman and Omri Weinstein
This was a communication complexity talk about trying to maintain an estimate of the information complexity of a protocol $\Pi$ for computing a function $f(X,Y)$ interactively, where Alice has $X$, Bob has $Y$, and $(X,Y) \sim \mu$:
$\mathsf{IC}_{\mu}(\Pi) = I( \Pi ; X | Y ) + I( \Pi ; Y | X)$
Previous work has shown that the (normalized) communication complexity of computing $f$ on $n$-tuples of variables $\epsilon$-accurately approaches the minimum information complexity over all protocols for computing $f$ once $\epsilon$-accurately. At least I think this is what was shown previously — it’s not quite my area. The result in this paper is a strong converse for this — the goal is to maintain an online estimate of $\mathsf{IC}$ during the protocol. The mechanism seeems a bit like communication with feedback a la Horstein, but I got a bit confused as to what was going on — basically it seems that Alice and Bob need to be able to agree on the estimate during the protocol and use a few extra bits added to the communication to maintain this estimate. If I had an extra $n$ hours in the week I would read up more about this. Maybe on my next plane ride…

# SPCOM 2014: some talks

Relevance Singular Vector Machine for Low-­rank Matrix Sensing
Martin Sundin; Saikat Chatterjee; Magnus Jansson; Cristian Rojas
This talk was on designing Bayesian priors for sparse-PCA problems — the key is to find a prior which induces a low-rank structure on the matrix. The model was something like $y = A \mathrm{vec}(X) + n$ where $X$ is a low-rank matrix and $n$ is noise. The previous state of the art is by Babacan et al., a paper which I obviously haven’t read, but the method they propose here (which involved some heavy algebra/matrix factorizations) appears to be competitive in several regimes. Probably more of interest to those working on Bayesian methods…

Non-Convex Sparse Estimation for Signal Processing
David Wipf
More Bayesian methods! Although David (who I met at ICML) was not trying to say that the priors are particularly “correct,” but rather that the penalty functions that they induce on the problems he is studying actually make sense. More of an algorithmist’s approach, you might say. He set up the problem a bit more generally, to minimize problems of the form
$\min_{X_i} \sum_{i} \alpha_i \mathrm{rank}[X_i] \ \ \ \ \ \ \ Y = \sum_{i} A_i(X_i)$
where $A_i$ are some operators. He made the case that convex relaxations of many of these problems, while analytically beautiful, have restrictions which are not satisfied in practice, and indeed they often have poor performance. His approach is via Empirical Bayes, but this leads to non-convex problems. What he can show is that the algorithm he proposes is competitive with any method that tries to separate the error from the “low-rank” constraint, and that the new optimization is “smoother.” I’m sure more details are in his various papers, for those who are interested.

PCA-HDR: A Robust PCA Based Solution to HDR Imaging
My apologies for taking fewer notes on this one, but I don’t know much about HDR imaging, so this was mostly me learning about HDR image processing. There are several different ways of doing HDR, from multiple exposures to flash/no-flash, and so on. The idea is that artifacts introduced by the camera can be modeled using the robust PCA framework and that denoting in HDR imaging may be better using robust PCA. I think that looking at some of the approaches David mentioned may be good in this domain, since it seems unlikely to me that these images will satisfy the conditions necessary for convex relaxations to work…

On Communication Requirements for Secure Computation
Vinod M Prabhakaran
Vinod showed some information theoretic approaches to understanding how much communication is needed for secure computation protocols like remote oblivious transfer: Xavier has $\{X_0, X_1\}$, Yvonne has $Y \in \{0,1\}$ and Zelda wants $Z = X_Y$, but nobody should be able to infer each other’s values. Feige, Killian, and Naor have a protocol for this, which Vinod and Co. can show is communication-optimal. There were several ingredients here, including cut-set bounds, distribution switching, data processing inequalities, and special bounds for 3-party protocols. More details in his CRYPTO paper (and others).

Artificial Noise Revisited: When Eve Has More Antennas Than Alice
Shuiyin Liu; Yi Hong; Emanuele Viterbo
In a MIMO wiretap setting, if the receiver has more antennas than the transmitter, then the transmitter can send noise in the nullspace of the channel matrix of the direct channel — as long as the eavesdropper has fewer antennas than the transmitter then secure transmission is possible. In this paper they show that positive secrecy capacity is possible even when the eavesdropper has more antennas, but as the number of eavesdropper antennas grows, the achievable rate goes to $0$. Perhaps a little bit of a surprise here!

# SPCOM 2014: tutorials

I just attended SPCOM 2014 at the Indian Institute of Science in Bangalore — many thanks to the organizers for the invitation! SPCOM 2014 happens every two years and is a mix of invited and submitted papers (much like Allerton). This year they mixed the invited talks with the regular talks which I thought was a great idea — since invited papers were not for specific sessions, it makes a lot more sense to do it that way, plus it avoids a sort of “two-tier” system.

I arrived early enough to catch the tutorials on the first day. There was a 3 hour session in the morning and another on the in afternoon. For the morning I decided to expand my horizons by attending Manoj Gopalkrishnan‘s tutorial on the physics of computation. Manoj focused on the question of how much energy it takes to erase or copy a bit of information. He started with some historical context via von Neumann, Szilard, and Landauer to build a correspondence between familiar information theoretic concepts and their physical counterparts. So in this correspondence, relative entropy is the same as free energy. He then turned to look at what one might call “finite time” thermodynamics. Suppose that you have to apply a control that operates in finite time in order to change a bit. One way to look at this is through controlling the transition probabilities in a two-state Markov chain representing the value of the bit you want to fix. You want to drive the resting state (with stationary distribution $(1/2,1/2)$ to something like $(\epsilon, 1 - \epsilon)$ within time $T$. At this level I more or less understood what was going on, but since my physics background is pretty poor, I think I missed out on how the physical intuition/constraints impact what control strategies you can choose.

Prasad Santhanam gave the other tutorial, which was a bit more solid ground for me. This was not quite a tutorial on large-alphabet probability estimation, but more directly on universal compression and redundancy calculations. The basic setup is that you have a family of distributions $\mathcal{P}$ and you don’t know which distribution $p \in \mathcal{P}$ will generate your data. Based on the data sample you want to do something: estimate some property of the distribution, compress the sample to a size close to its entropy, etc. A class can be weakly or strongly compressible, or insurable (which means being able to estimate quantiles), and so on. These problems turn out to be a bit different from each other depending on some topological features of the class. One interesting thing to consider for the machine learners out there this stopping time that you need in some analyses. As you are going along, observing the data and doing your task (estimation, compression, etc) can you tell from the data that you are doing well? This has major implications for whether or not an online algorithm can even work the way we want it to, and is something Prasad calls “data-driven compressible.”

I’ll try to write another post or two about the talks I saw as well!

# ISIT 2014: a few more talks

Annina Bracher (ETH Zurich, Switzerland); Amos Lapidoth (ETHZ, Switzerland)
The title pretty much describes it — there are two receivers which are both looking out for a particular message. This is the identification problem, in which the receiver only cares about a particular message (but we don’t know which one) and we have to design a code such that they can detect the message. The number of messages is $2^{2^{nC}}$ where $C$ is the Shannon capacity of the DMC. In the broadcast setting we run into the problem that the errors for the two receivers are entangled. However, their message sets are disjoint. The way out is to look at the average error for each (averaged over the other user’s message). The main result is that the rates only depend on the conditional marginals, and they have a strong converse.

Efficient compression of monotone and m-modal distributions
Jayadev Acharya (University of California, San Diego, USA); Ashkan Jafarpour (University of California, San Diego, USA); Alon Orlitsky (University of California, San Diego, USA); Ananda Theertha Suresh (University of California, San Diego, USA)
A monotone distribution is a distribution on $\mathbb{N}$ such that the probabilities are non-increasing. The redundancy for this class is infinite, alas, so they restrict the support to size $k$ (where $k$ can be large). They propose a two-step compression scheme in which the first step is to approximate the true distribution with a piecewise constant step distribution, and then use a compression scheme for step distributions.

Writing on a Dirty Paper in the Presence of Jamming
Amitalok J Budkuley (Indian Institute of Technology, Bombay, India); Bikash K Dey (Indian Institute of Technology Bombay, India); Vinod M Prabhakaran (Tata Institute of Fundamental Research, India)
Ahh, jamming. A topic near and dear to my heart. This paper takes a game-theoretic approach to jamming in a DPC setup: “the capacity of the channel in the presence of the jammer is the unique Nash equilibrium utility of the zero sum communication game between the user and the jammer.” This is a mutual information game, and they show that i.i.d. Gaussian jamming and dirty paper coding are a Nash equilibrium. I looked at an AVC version of this problem in my thesis, and the structure is quite a bit different, so this was an interesting different take on the same problem — how can we use the state information to render adversarial interference as harmless as noise?

Stable Grassmann Manifold Embedding via Gaussian Random Matrices
Hailong Shi (Tsinghua University & Department of Electronic Engineering, P.R. China); Hao Zhang (TsinghuaUniversity, P.R. China); Gang Li (Tsinghua University, P.R. China); Xiqin Wang (Tsinghua University, P.R. China)
This was in the session I was chairing. The idea is that you are given a subspace (e.g., a point on the Grassman manifold) and you want to see what happens when you randomly project this into a lower-dimensional subspace using an i.i.d. Gaussian matrix a la Johnson-Lindenstrauss. The JL Lemma says that projections are length-preserving. Are they also volume-preserving? It turns out that they are (no surprise). The main tools are measure concentration results together with a union bound over a covering set.

Is “Shannon capacity of noisy computing” zero?
Pulkit Grover (Carnegie Mellon University, USA)
Yes. I think. Maybe? Pulkit set up a physical model for computation and used a cut-set argument to show that the total energy expenditure is high. I started looking at the paper in the proceedings and realized that it’s significantly different than the talk though, so I’m not sure I really understood the argument. I should read the paper more carefully. You should too, probably.

# ICML 2014: a few more papers

After a long stint of proposal writing, I figured I should catch up on some old languishing posts. So here’s a few quick notes on the remainder of ICML 2014.

• Fast Stochastic Alternating Direction Method of Multipliers (Wenliang Zhong; James Kwok): Most of the talks in the Optimization II session were on ADMM or stochastic optimization, or both. This was int he last category. ADMM can have rather high-complexity update rules, especially on large, complex problems, so the goal is to lower the complexity of the update step by making it stochastic. The hard part seems to be controlling the step size.
• An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar): The full version of this paper is on ArXiV. The authors look at a multi-core lock-free stochastic coordinate descent method and characterize how many cores you need to get linear speedups — this depends on the convexity properties of the objective function.
• Communication-Efficient Distributed Optimization using an Approximate Newton-type Method (Ohad Shamir; Nati Srebro; Tong Zhang): This paper looked 1-shot “average at the end” schemes where you divide the data onto multiple machines, have them each train a linear predictor (for example) using stochastic optimization and then average the results. This is just averaging i.i.d. copies of some complicated random variable (the output of an optimization) so you would expect some variance reduction. This method has been studied by a few people int the last few years. While you do get variance reduction, the bias can still be bad. On the other extreme, communicating at every iteration essentially transmits the entire data set (or worse) over the network. They propose a new method for limiting communication by computing an approximate Newton step without approximating the full Hessian. It works pretty well.
• Lower Bounds for the Gibbs Sampler over Mixtures of Gaussians (Christopher Tosh; Sanjoy Dasgupta): This was a great talk about how MCMC can be really slow to converge. The model is a mixture of Gaussians with random weights (Dirichlet) and means (Gaussian I think). Since the posterior on the parameters is hard to compute, you might want to do Gibbs sampling. They use conductance methods to get a lower bound on the mixing time of the chain. The tricky part is that the cluster labels are permutation invariant — I don’t care if you label clusters (1,2) versus (2,1), so they need to construct some equivalence classes. They also have further results on what happens when the number of clusters is misspecified. I really liked this talk because MCMC always seems like black magic to me (and I even used it in a paper!)
• (Near) Dimension Independent Risk Bounds for Differentially Private Learning (Prateek Jain; Abhradeep Guha Thakurta): Abhradeep presented a really nice paper with a tighter analysis of output and objective perturbation methods for differentially private ERM, along with a new algorithm for risk minimization on the simplex. Abhradeep really only talked about the first part. If you focus on scalar regret, they show that essentially the error comes from taking the inner product of a noise vector with a data vector. If the noise is Gaussian then the noise level is dimension-independent for bounded data. This shows that taking $(\epsilon,\delta)$-differential privacy yield better sample complexity results than $(\epsilon,)$-differential privacy. This feels similar in flavor to a recent preprint on ArXiV by Beimel, Nissim, and Stemmer.
• Near-Optimally Teaching the Crowd to Classify (Adish Singla; Ilija Bogunovic; Gabor Bartok; Amin Karbasi; Andreas Krause): This was one of those talks where I would have to go back to look at the paper a bit more. The idea is that you want to train annotators to do better in a crowd system like Mechanical Turk — which examples should you give them to improve their performance? They model the learners as doing some multiplicative weights update. Under that model, the teacher has to optimize to pick a batch of examples to give to the learner. This is hard, so they use a submodular surrogate function and optimize over that.
• Discrete Chebyshev Classifiers (Elad Eban; Elad Mezuman; Amir Globerson): This was an award-winner. The setup is that you have categorical (not numerical) features on $n$ variables and you want to do some classification. They consider taking pairwise inputs and compute for each tuple $(x_i, x_j, y)$ a marginal $\mu_{ij}(x_i, x_j, y)$. If you want to create a rule $f: \mathcal{X} \to \mathcal{Y}$ for classification, you might want to pick one that has best worst-case performance. One approach is to take the one which has best worst-case performance over all joint distributions on all variables that agree with the empirical marginals. This optimization looks hard because of the exponential number of variables, but they in fact show via convex duality and LP relaxations that it can be solved efficiently. To which I say: wow! More details are in the paper, but the proofs seem to be waiting for a journal version.

# ICML 2014: Some talks and posters

I was a somewhat inconsistent note-taker here. Because a lot of the talks I attended were sufficiently out-of-area for me that I didn’t get the context for the work, I often found myself jotting a few “look at this later” pointers to myself rather than actual ideas from the talk.

First, the plenaries: Eric Horvitz, Michael Kearns, and Michael Jordan. Horvitz talked about how we’ve made a lot of progress in machine learning but there’s more work to be done in bringing humans back into the loop. Examples include developing semantics for what features mean, how to visualize the results, adding humans into the loop (e.g. active learning or interactive settings), crowdsourcing, and building tools that are sensitive to human cognitive limitations, like detecting and informing people of “surprising events,” which involves knowing what surprising means. He also announced a new data set, COCO for “common objects in context” (not Cocoa Puffs) which has around 300k-400k images and lots of annotations. The goal was to build al library of objects that a 4-year-old can recognize. Can a computer?

I honestly was a little too zonked/jetlagged to understand Michael Kearns’ talk, which was on challenges in algorithmic trading. He was focused on problems that brokers face, rather than the folks who are holding the risk. Michael Jordan gave a variant on a talk I’ve seen him give in the last few plenary/big talks I’ve seen: computation, statistics, and big data. The three examples he talked about were local differential privacy, bounds for distributed estimation, and the bag of little bootstraps.

As far as the research talks go, here are a few from the first day:

• Robust Principal Component Analysis with Complex Noise(Qian Zhao; Deyu Meng; Zongben Xu; Wangmeng Zuo; Lei Zhang): This paper interpreted the Robust PCA problem (given $Y = L = E$ where $L$ is low-rank and $E$ is sparse, recover $L$) in terms of MAP inference. The solution generally looks like a nuclear-norm plus $L_1$ regularization, which they claim implies a kind of Laplace-like model for the noise. They build a generative model and then change the distributions around to get different noise models.
• Discriminative Features via Generalized Eigenvectors (Nikos Karampatziakis; Paul Mineiro): This was on how to learn features that are discriminative in a multiclass setting while still being somewhat efficient. The main idea was to look at correlations in the existing features via the tensor $x \otimes x \otimes y$ where $x$ are the features and $y$ are the labels, and to then find generalized eigenvalues and eigenvectors by looking for vectors $v$ that maximize (for a given $(i,j)$ the ratio $\frac{ \mathbb{E}[ (v^{\top} x)^2 | y = i] }{ \mathbb{E}[ (v^{\top} x)^2 | y = j] }$. This nonlinearity is important for reasons which I wasn’t entirely sure about.
• Randomized Nonlinear Component Analysis (David Lopez-Paz; Suvrit Sra; Alex Smola; Zoubin Ghahramani; Bernhard Schoelkopf): I really enjoyed this talk — basically the idea is kernel versions of PCA and CCA have annoyingly large running times. So what they do here is linearize the kernel using sampling and then do some linear component analysis on the resulting features. The key tool is to use Matrix Bernstein inequalities to bound the kernel approximations.
• Memory and Computation Efficient PCA via Very Sparse Random Projections (Farhad Pourkamali Anaraki; Shannon Hughes): This talk was on efficient approximations to PCA for large data sets, but not in a streaming setting. The idea was, as I recall, that you have big data sets and different sites. Each site takes a very sparse random projection of its data (e.g. via a random signed Bernoulli matrix) and then these get aggregated via an estimator. They show that the estimator is unbiased and the variance depends on the kurtosis of the distribution of elements in the projection matrix. One thing that was interesting to me is that the covariance estimate has bias term towards the canonical basis, which is one of those facts that makes sense after you hear it.
• Concept Drift Detection Through Resampling (Maayan Harel; Shie Mannor; Ran El-Yaniv; Koby Crammer): This talk was sort of about change-detection, but not really. The idea is that a learning algorithm sees examples sequentially and wants to tell if there is a significant change in the expected risk of the distribution. The method they propose is a sequential permutation test — the challenge is that a gradual change in risk might be hard to detect, and the number of possible hypotheses to consider grows rather rapidly. I got some more clarification from Harel’s explanation at the poster, but I think this is one where reading the paper will make it clearer.

Noted without notes, but I enjoyed the posters (sometimes I read them since the presenter was not around):

• An Asynchronous Parallel Stochastic Coordinate Descent Algorithm (Ji Liu; Steve Wright; Christopher Re; Victor Bittorf; Srikrishna Sridhar)
• Clustering in the Presence of Background Noise (Shai Ben-David; Nika Haghtalab)
• Demystifying Information-Theoretic Clustering (Greg Ver Steeg; Aram Galstyan; Fei Sha; Simon DeDeo)
• Consistency of Causal Inference under the Additive Noise Model (Samory Kpotufe; Eleni Sgouritsa; Dominik Janzing; Bernhard Schoelkopf)
• Concentration in unbounded metric spaces and algorithmic stability (Aryeh Kontorovich)
• Hard-Margin Active Linear Regression (Zohar Karnin; Elad Hazan)
• Heavy-tailed regression with a generalized median-of-means (Daniel Hsu; Sivan Sabato)
/