ISIT 2015 : statistics and learning

The advantage of flying to Hong Kong from the US is that the jet lag was such that I was actually more or less awake in the mornings. I didn’t take such great notes during the plenaries, but they were rather enjoyable, and I hope that the video will be uploaded to the ITSOC website soon.

There were several talks on entropy estimation in various settings that I did not take great notes on, to wit:

  • OPTIMAL ENTROPY ESTIMATION ON LARGE ALPHABETS VIA BEST POLYNOMIAL APPROXIMATION (Yihong Wu, Pengkun Yang, University Of Illinois, United States)
  • DOES DIRICHLET PRIOR SMOOTHING SOLVE THE SHANNON ENTROPY ESTIMATION PROBLEM? (Yanjun Han, Tsinghua University, China; Jiantao Jiao, Tsachy Weissman, Stanford University, United States)
  • ADAPTIVE ESTIMATION OF SHANNON ENTROPY (Yanjun Han, Tsinghua University, China; Jiantao Jiao, Tsachy Weissman, Stanford University, United States)

For those interested in this problem, I would highly recommend taking a look. In particular, it looks like we’re getting more efficient entropy estimators in difficult settings (online, large alphabet), which is pretty exciting.
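
For context, here is a minimal sketch of the naive baseline these papers improve on: the plug-in (MLE) estimator, with the classical Miller–Madow bias correction. This is just the textbook estimator; the sample size and alphabet size below are arbitrary choices of mine, picked to sit in the hard large-alphabet regime.

```python
import numpy as np

def plugin_entropy(samples, correction=True):
    """Shannon entropy estimate (in nats) from an array of discrete samples."""
    _, counts = np.unique(samples, return_counts=True)
    n = counts.sum()
    p = counts / n
    h = -np.sum(p * np.log(p))
    if correction:
        h += (len(counts) - 1) / (2 * n)   # Miller-Madow first-order bias correction
    return h

rng = np.random.default_rng(0)
samples = rng.integers(0, 1000, size=500)   # alphabet much larger than the sample size
print(plugin_entropy(samples, correction=False),
      plugin_entropy(samples),
      np.log(1000))   # both estimates fall well short of the true value log(1000)
```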

QUICKEST LINEAR SEARCH OVER CORRELATED SEQUENCES
Javad Heydari, Ali Tajer, Rensselaer Polytechnic Institute, United States
This talk was about hypothesis testing where the observer can control the samples being taken by traversing a graph. We have an n-node graph (think of a graphical model) representing the joint distribution on n variables. The data are generated i.i.d. across time according to either F_0 or F_1. At each time you get to observe the data from only one node of the graph. You can either observe the same node as before, explore by observing a different node, or stop and decide whether the data came from F_0 or F_1. By assigning costs to the different actions you can write down a dynamic programming solution for the search strategy, but it’s pretty heavy computationally. It turns out the optimal rule for switching has a two-threshold structure and can behave quite differently from the independent-observation case when the correlations are structured appropriately.
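
To fix intuition for the two-threshold structure, here is a minimal sketch of the classical single-stream case: a plain SPRT for a Gaussian mean shift with i.i.d. samples and no graph. The shift \theta and the thresholds are arbitrary choices of mine, and the paper’s rule additionally decides when to switch to a different node, which this sketch does not model.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5
upper, lower = np.log(99), np.log(1 / 99)   # thresholds targeting roughly 1% error rates

llr, n = 0.0, 0
while lower < llr < upper:
    y = rng.normal(loc=theta)               # data actually drawn from F_1 = N(theta, 1)
    llr += theta * y - theta**2 / 2         # log dF_1/dF_0 (y) for the Gaussian shift
    n += 1
print("decide F_1" if llr >= upper else "decide F_0", "after", n, "samples")
```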

MISMATCHED ESTIMATION IN LARGE LINEAR SYSTEMS
Yanting Ma, Dror Baron, North Carolina State University, United States; Ahmad Beirami, Duke University, United States
The mismatch studied in this paper is a mismatch in the prior distribution for a sparse observation problem y = Ax + \sigma_z z, where x \sim P (say a Bernoulli-Gaussian prior). The question is what happens when we do estimation assuming a different prior Q. The main result of the paper is an analysis of the excess MSE using a decoupling principle. Since I don’t really know anything about the replica method (except the name “replica method”), I had a little bit of a hard time following the talk as a non-expert, but thankfully there were a number of pictures and examples to help me follow along.
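
Since the analysis goes through a decoupled scalar channel, here is a toy scalar illustration of prior mismatch (my own example, not the paper’s setup or its replica-based formula): estimate x from y = x + \sigma n under a Bernoulli-Gaussian prior, using the posterior mean computed with the wrong sparsity level, and compare the resulting MSE to the matched case.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def bayes_estimate(y, eps, tau=1.0, sigma=0.5):
    """Posterior mean of x given y under the prior x ~ eps*N(0, tau^2) + (1-eps)*delta_0."""
    slab = eps * norm.pdf(y, scale=np.sqrt(tau**2 + sigma**2))
    spike = (1 - eps) * norm.pdf(y, scale=sigma)
    shrink = tau**2 / (tau**2 + sigma**2)
    return slab * shrink * y / (slab + spike)

def mse(eps_true, eps_assumed, tau=1.0, sigma=0.5, n=200_000):
    active = rng.random(n) < eps_true
    x = np.where(active, rng.normal(scale=tau, size=n), 0.0)
    y = x + rng.normal(scale=sigma, size=n)
    return np.mean((bayes_estimate(y, eps_assumed, tau, sigma) - x) ** 2)

print("matched   :", mse(0.1, 0.1))
print("mismatched:", mse(0.1, 0.4))   # assuming a much denser prior than the truth raises the MSE
```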

SEARCHING FOR MULTIPLE TARGETS WITH MEASUREMENT DEPENDENT NOISE
Yonatan Kaspi, University of California, San Diego, United States; Ofer Shayevitz, Tel-Aviv University, Israel; Tara Javidi, University of California, San Diego, United States
This was another search paper, but this time we have, say, K targets W_1, W_2, \ldots, W_K uniformly distributed in the unit interval, and what we can do is query at each time n a set S_n \subseteq [0,1] and get a response Y_n = X_n \oplus Z_n where X_n = \mathbf{1}( \exists W_k \in S_n ) and Z_n \sim \mathrm{Bern}( \mu(S_n) + b ) where \mu is the Lebesgue measure. So basically you can query a set and you get a noisy indicator of whether you hit any targets, where the noise depends on the size of the set you query. At some point \tau you stop and guess the target locations. You are (\epsilon,\delta)-successful if the probability that you fail to get within \delta of every target is less than \epsilon. The targeting rate is the limit of \log(1/\delta) / \mathbb{E}[\tau] as \epsilon,\delta \to 0 (I’m being fast and loose here). Clearly there are some connections to group testing and communication with feedback, etc. They show there is a significant gap between the adaptive and nonadaptive rates here, so you can find more targets if you can adapt your queries on the fly. However, since the rate is defined for a fixed number of targets, we could ask how the gap varies with K. They show it shrinks.
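
To make the measurement model concrete, here is a toy, nonadaptive scheme for a single target (K = 1); the cell count, repetition count, and noise level b are arbitrary choices of mine, and this is not the scheme from the paper. Because the noise grows with \mu(S), it pays to query small sets.

```python
import numpy as np

rng = np.random.default_rng(1)
b, m, reps = 0.05, 20, 25          # baseline noise, number of cells, queries per cell (assumed)
w = rng.random()                   # single target, uniform in [0, 1]

def query(lo, hi):
    x = lo <= w < hi                       # X_n = 1(target in S_n)
    z = rng.random() < (hi - lo) + b       # Z_n ~ Bern(mu(S_n) + b)
    return x ^ z                           # Y_n = X_n xor Z_n

edges = np.linspace(0, 1, m + 1)
hits = [sum(query(edges[i], edges[i + 1]) for _ in range(reps)) for i in range(m)]
best = int(np.argmax(hits))
print(f"target {w:.3f} located in cell [{edges[best]:.2f}, {edges[best+1]:.2f})")
```

An adaptive scheme can shrink the queried sets (and hence the resolution \delta) much faster per query, which is where the adaptivity gain quantified in the paper comes from.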

ON MODEL MISSPECIFICATION AND KL SEPARATION FOR GAUSSIAN GRAPHICAL MODELS
Varun Jog, University of California, Berkeley, United States; Po-Ling Loh, University of Pennsylvania, United States
The graphical model for jointly Gaussian variables has no edge between nodes i and j if the corresponding entry (\Sigma^{-1})_{ij} = 0 in the inverse covariance matrix. They show a relationship between the KL divergence of two distributions and their corresponding graphs. The divergence is lower bounded by a constant if they differ in a single edge — this indicates that estimating the edge structure is important when estimating the distribution.
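
For reference, the KL divergence between two zero-mean Gaussians has a closed form in terms of their precision (inverse covariance) matrices, which is the quantity being related to the graph structure. A quick sketch, using toy matrices of my own (this is the standard formula, not the paper’s bound):

```python
import numpy as np

def kl_gaussian_from_precisions(theta1, theta2):
    """D( N(0, theta1^{-1}) || N(0, theta2^{-1}) ) for positive definite precision matrices."""
    d = theta1.shape[0]
    sigma1 = np.linalg.inv(theta1)
    _, logdet1 = np.linalg.slogdet(theta1)
    _, logdet2 = np.linalg.slogdet(theta2)
    return 0.5 * (np.trace(theta2 @ sigma1) - d + (logdet1 - logdet2))

# two precision matrices (chain graphs) that differ in a single edge
theta1 = np.array([[2.0, 0.5, 0.0],
                   [0.5, 2.0, 0.5],
                   [0.0, 0.5, 2.0]])
theta2 = theta1.copy()
theta2[1, 2] = theta2[2, 1] = 0.0   # delete the edge between the last two nodes
print(kl_gaussian_from_precisions(theta1, theta2))
```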

CONVERSES FOR DISTRIBUTED ESTIMATION VIA STRONG DATA PROCESSING INEQUALITIES
Aolin Xu, Maxim Raginsky, University of Illinois at Urbana–Champaign, United States
Max gave a nice talk on the problem of minimizing an expected loss \mathbb{E}[ \ell(W, \hat{W}) ] of a d-dimensional parameter W which is observed noisily by separate encoders. Think of a CEO-style problem where there is a conditional distribution P_{X|W} such that the observation at each node is a d \times n matrix whose columns are i.i.d. and whose j-th row is i.i.d. according to P_{X|W_j}. Each sensor gets independent observations from the same model, can compress its observations to b bits, and sends them over independent channels to an estimator (so no MAC here). The main result is a lower bound on the expected loss as a function of the number of bits b and the mutual information between W and the final estimate \hat{W}. The key is to use the strong data processing inequality to handle the mutual information; the constants that make up the ratio between the mutual informations are important. I’m sure Max will blog more about the result, so I’ll leave a full explanation to him (see what I did there?).
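
For readers who haven’t seen a strong data processing inequality before, the rough statement (my paraphrase of the standard inequality, not the paper’s exact version) is: if W \to Y \to Z form a Markov chain, then

I(W; Z) \le \eta(P_{Z|Y}) \, I(W; Y),

where \eta(P_{Z|Y}) < 1 is a contraction coefficient that depends only on the channel from Y to Z. Chaining such inequalities from W through the sensors and the b-bit messages to \hat{W}, and keeping track of the constants \eta, is what gives the converse its bite.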

More on Shannon theory etc. later!

Recipe: Chaimen mutton stew

A danger of living near New York is the relative proximity of Kalustyan’s and its near-infinite array of spices. Last year on an impulse I bought a packet of dry chaimen spice mix (basically the dry ingredients from the paste recipe here) — it’s a mix of fenugreek, cumin, paprika, and garlic used in some Armenian dishes. Since I’m not really up for making bastirma or something fancy like that, I’ve been experimenting over the last year with ways of using it outside of making a dip. Here’s a recipe for a mutton curry (I used goat), but I’ve used an adapted procedure to make a bean and vegetable stew as well: squash, carrots, and white beans, for example.

a picture of the cooking, close to the end

Ingredients

  • 2 heaping tbsp chaimen mix
  • 1 6 oz can of tomato paste
  • 1 cup parsley, chopped
  • 3 cloves garlic, diced or crushed
  • 2-4 tbsp plain yogurt
  • 1 lb cubed goat (bone-in stew meat)
  • 1 medium onion, diced
  • olive oil
  • 1/3 cup diced prunes

Instructions
Mix chaimen, 2-3 tbsp or so of tomato paste, garlic, 1/2 cup parsley, and enough yogurt to make a thick paste. If necessary add a little olive oil to thin it out. Coat/rub it into the goat, cover, and let it sit for at least an hour on the counter or possibly overnight in the fridge.

Heat oil with onions in a heavy pot/dutch oven and cook on low until the onions turn more translucent. Add in 1 clove of garlic sometime in that time, being careful not to burn it. Add the goat and marinade and cook for a few minutes, turning over a few times. Add enough water to just cover the goat and bring it up to a boil, then cover, turn the heat to low, and cook slowly for at least an hour, preferably more. Uncover and cook for 15 minutes longer to reduce the liquid. By this point the meat should be falling off the bones pretty easily; if not, you can cook a bit longer. Add in the prunes and additional tomato paste to thicken (as needed) and cook for 15-20 minutes, still on low, until the prunes dissolve. Remove from heat and serve garnished with additional parsley.

Note: other adaptations could include adding potatoes or some other starch about 30 minutes before finishing, or other vegetables that you think might be good. If you make this with chicken, adding some vegetables would be great. For a vegetarian version you can shorten the cooking time a bit since you will likely add less liquid. White beans, kidney beans, and fava all work pretty well (or a mix of them); it takes on a bit of a cassoulet-like consistency in that case. There are lots of places to substitute here: raisins or apricots for the prunes, for example.

ISIT 2015 begins!

As usual, I will blog it. As usual, I will try to do it in a timely fashion. As usual, the probability that I succeed is low, as there is a strong converse and I am operating above capacity… Someone help me out with some finite blocklength analysis… 

 

AISTATS 2015: a few talks from one day

I attended AISTATS for about a day and change this year. Unfortunately, due to teaching I missed the poster I had there, but Shuang Song presented her work on learning from data sources of different quality, which is joint with Kamalika Chaudhuri and myself. This was my first AISTATS. It had a single track of oral presentations and then poster sessions for the remaining papers. The difficulty with a single track for me is that my interests are relatively focused, and the format of presenting specialist material to a general audience meant that I couldn’t get as much out of the talks as I would have wanted. Regardless, I did get exposed to a number of new problems. Maybe the ideas can percolate for a while and inform something in the future.

Computational Complexity of Linear Large Margin Classification With Ramp Loss
Søren Frejstrup Maibing, Christian Igel
The main result of this paper (I think) is that ERM under ramp loss is NP-hard. They gave the details of the reduction but since I’m not a complexity theorist I got a bit lost in the weeds here.

A la Carte — Learning Fast Kernels
Zichao Yang, Andrew Wilson, Alex Smola, Le Song
Ideas like “random kitchen sinks” and other kernel approximation methods require you to have a kernel you want to approximate, but in many problems you in fact need to learn the kernel from the data. If I give you a kernel function k(x,x') = k( |x - x'| ), then you can take the Fourier transform K(\omega) of k. This turns out to be a probability distribution, so you can sample random \{\omega_i\} i.i.d. and build a randomized Fourier approximation of k. If you don’t know the kernel function, or you have to learn it, then you could instead try to learn/estimate the transform directly. This paper was about trying to do that in a reasonably efficient way.
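
As background, here is a minimal random Fourier feature sketch for a fixed RBF kernel (the baseline the paper builds on, not the kernel learning itself): sample frequencies from the kernel’s spectral density and approximate k(x, x') by an inner product of cosine features. The feature count and lengthscale below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff(X, n_features=500, lengthscale=1.0):
    """Random Fourier features for the RBF kernel exp(-||x - x'||^2 / (2 * lengthscale^2))."""
    d = X.shape[1]
    omega = rng.normal(scale=1.0 / lengthscale, size=(d, n_features))  # spectral density N(0, l^{-2} I)
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ omega + b)

x = rng.normal(size=(1, 3))
y = rng.normal(size=(1, 3))
Z = rff(np.vstack([x, y]))
approx = Z[0] @ Z[1]
exact = np.exp(-np.sum((x - y) ** 2) / 2.0)   # lengthscale = 1
print(approx, exact)
```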

Learning Where to Sample in Structured Prediction
Tianlin Shi, Jacob Steinhardt, Percy Liang
This was about doing Gibbs sampling, not for MCMC sampling from the stationary distribution, but for “stochastic search” or optimization problems. The intuition was that some coordinates are “easier” than others, so we might want to focus resampling on the harder coordinates. But this might lead to inaccurate sampling. The aim here was to build a heterogeneous sampler that is cheap to compute and still does the right thing.
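
To make “heterogeneous” concrete, here is a toy random-scan Gibbs sampler on an Ising chain where coordinates are resampled with non-uniform probabilities q_i. The q_i are fixed by hand here, not learned as in the paper; any fixed positive q leaves the target distribution invariant, so the question is really one of efficiency.

```python
import numpy as np

rng = np.random.default_rng(0)
d, beta = 30, 0.5
x = rng.choice([-1, 1], size=d)

q = np.ones(d)
q[d // 2:] = 5.0          # pretend the second half of the coordinates is "harder"
q /= q.sum()

def gibbs_step(x):
    i = rng.choice(d, p=q)                           # non-uniform coordinate choice
    s = (x[i - 1] if i > 0 else 0) + (x[i + 1] if i < d - 1 else 0)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))   # P(x_i = +1 | neighbors) for the Ising chain
    x[i] = 1 if rng.random() < p_plus else -1
    return x

for _ in range(5000):
    x = gibbs_step(x)
print(x)
```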

Tradeoffs for Space, Time, Data and Risk in Unsupervised Learning
Mario Lucic, Mesrob Ohannessian, Amin Karbasi, Andreas Krause
This paper won the best student paper award. They looked at a k-means problem where they do “data summarization” to make the problem a bit more efficient — that is, by learning over an approximation/summary of the features, they can find different tradeoffs between the running time, risk, and sample size for learning problems. The idea is to use coresets — I’d recommend reading the paper to get a better sense of what is going on. It’s on my summer reading list.

Averaged Least-Mean-Squares: Bias-Variance Trade-offs and Optimal Sampling Distributions
Alexandre Defossez, Francis Bach
What if you want to do SGD but you don’t want to sample the points uniformly? You’ll get a bias-variance tradeoff. This is another one of those “you have to read the paper” presentations. A nice result if you know the background literature, but if you are not a stochastic gradient aficionado, you might be totally lost.
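
As a point of reference for what non-uniform sampling means here, a generic importance-sampled SGD sketch for least squares (sampling probabilities, step size, and problem sizes are arbitrary choices of mine, and this is not the averaged-LMS analysis from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star + 0.1 * rng.normal(size=n)

# sample rows with probability proportional to their squared norm (one common choice)
p = np.sum(X**2, axis=1)
p /= p.sum()

w = np.zeros(d)
step = 0.01
for t in range(20000):
    i = rng.choice(n, p=p)
    grad = (X[i] @ w - y[i]) * X[i] / (n * p[i])   # importance weight keeps the gradient unbiased
    w -= step * grad
print("distance to true parameter:", np.linalg.norm(w - w_star))
```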

Sparsistency of \ell_1-Regularized M-Estimators
Yen-Huan Li, Jonathan Scarlett, Pradeep Ravikumar, Volkan Cevher
In this paper they find a new condition, which they call local structured smoothness, which is sufficient for certain M-estimators to be “sparsistent,” that is, to recover the support pattern of a sparse parameter asymptotically as the number of data points goes to infinity. Examples include the LASSO, regression in generalized linear models, and graphical model selection.
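
As a quick illustration of what sparsistency means in the LASSO case, here is a toy support-recovery check (dimensions, noise level, and regularization picked arbitrarily by me; scikit-learn’s Lasso is used just for convenience):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, k = 200, 50, 5
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:k] = rng.choice([-2.0, -1.0, 1.0, 2.0], size=k)   # keep true coefficients away from 0
y = X @ w_star + 0.1 * rng.normal(size=n)

w_hat = Lasso(alpha=0.05).fit(X, y).coef_
print("true support     :", np.flatnonzero(w_star))
print("estimated support:", np.flatnonzero(np.abs(w_hat) > 1e-6))
```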

Some of the other talks which were interesting but for which my notes were insufficient:

  • Two-stage sampled learning theory on distributions (Zoltan Szabo, Arthur Gretton, Barnabas Poczos, Bharath Sriperumbudur)
  • Generalized Linear Models for Aggregated Data (Avradeep Bhowmik, Joydeep Ghosh, Oluwasanmi Koyejo)
  • Efficient Estimation of Mutual Information for Strongly Dependent Variables (Shuyang Gao, Greg Ver Steeg, Aram Galstyan)
  • Sparse Submodular Probabilistic PCA (Rajiv Khanna, Joydeep Ghosh, Russell Poldrack, Oluwasanmi Koyejo)

2015 North American School of Information Theory

The 2015 North American School of Information Theory (NASIT) will be held on August 10-13, 2015, at the University of California, San Diego in La Jolla. If you or your colleagues have students who might be interested in this event, we would be grateful if you could forward this email to them and encourage their participation. The application deadline is Sunday, June 7. As in past schools, we again have a great set of lecturers this year.

We are pleased to announce that Paul Siegel will be the Padovani Lecturer of the IEEE Information Theory Society and will give his lecture at the School. The Padovani Lecture is sponsored by a generous gift from Roberto Padovani.

For more information and to apply, please visit the School website.

Tracks: Frequenties and Bayesity

  1. Ashkaraballi (Nancy Ajram)
  2. Restless Leg (Har Mar Superstar)
  3. Let The Good Times Roll (JD McPherson)
  4. The Theme From Dangeresque II (Strong Bad)
  5. Wild Stallion Mountain (The Bombay Royale)
  6. Reel It In (Nikhil P. Yerawadekar & Low Mentality)
  7. Central Park Blues (Ultimate Painting)
  8. Piazza, New York Catcher (Belle and Sebastian)
  9. Golden Slippers (The Prince Myshkins)
  10. Nowadays (Del [the Funky Homosapien])
  11. Coney Island Baby (Tom Waits)
  12. J’veux D’la Musique (Tout Le Temps) (Les Nubians)
  13. Howl (JD Brooks & The Uptown Sound)
  14. I Believe in a Thing Called Love (The Darkness)
  15. Last Month Of The Year (Fairfield Four)
  16. A un niño llorando (Schola Antiqua of Chicago)
  17. Rock & Roll Is Cold (Matthew E. White)
  18. Try A Little Tenderness

Readings

I had a rough semester this Spring, but I did manage to read some books, mostly thanks to an over-aggressive travel schedule.

Dead Ringers: How Outsourcing is Changing How Indians Understand Themselves (Shehzad Nadeem). Published a few years ago, this book is a study of how two kinds of outsourcing — business process (BPO) and information processing outsourcing (IPO) — have changed attitudes of Indians towards work in a globalized economy. Nadeem first lays out the context for outsourcing and tries to dig behind the numbers to see where and to whom the benefits are going. The concept of time arbitrage was a new way of thinking about the 24-hour work cycle that outsourcing enables — this results in a slew of deleterious health effects for workers as well as knock-on effects for family structures and the social fabric. This sets the stage for a discussion of whether or not outsourcing has really brought a different “corporate culture” to India (a topic on which I have heard a lot from friends/relatives). The book brings a critical perspective that complicates the simplified “cyber-coolies” versus “global agents” discussion that we often hear.

Cowboy Feng’s Space Bar and Grille (Steven Brust). Mind-candy, a somewhat slight novel that was a birthday gift back in high school. Science fiction of a certain era, and with a certain lightness.

Hawk (Steven Brust). Part n in a series, also mind-candy at this point. If you haven’t read the whole series up to this point, there’s little use in starting here.

Saga Volumes I-IV (Brian K. Vaughan / Fiona Staples). This series was recommended by several people, and since I hadn’t read a graphic novel in a while I figured I’d pick it up. It’s definitely an interesting world: angels vs. demons in space, with androids who have TV heads thrown in for good measure. It has a sort of visual freedom that text-based fiction can’t really match. Why not have a king with a giant HDTV for a head? Makes total sense to me, if that’s the visual world you live in. Unfortunately, the series is at a cliffhanger, so I have to wait for more issues to come out.

This Earth of Mankind (Pramoedya Ananta Toer): A coming-of-age story set in 1898 Indonesia, which is a place and time about which I knew almost nothing. Toer orally dictated a quartet of novels while imprisoned in Indonesia, of which this is the first. The mélange of ideas around colonialism, independence, cultural stratification in Java, and the benefits and perils of “Western education” echo things I know from reading about India, but are very particular to Indonesia. In particular, the bupati system and relative decentralization of Dutch authority in Indonesia created complex social hierarchies that are hard to understand. The book follows Minke, the only Native (full Javanese) to attend his Dutch-medium school, and his relationship with Annelise, the Indo (half-Native, half Dutch) daughter of a Dutch businessman and his concubine Nyai Ontosoroh. Despite their education and accomplishments, Minke and Nyai Ontosoroh are quite powerless in the face of the racist hierarchies of Dutch law that do not allow Natives a voice. This novel sets the stage for the rest of the quartet, which I am quite looking forward to reading.

The Bone Clocks (David Mitchell): The latest novel from David Mitchell is not as chronologically sprawling as Cloud Atlas. I don’t want to give too much away, but there is an epic behind-the-scenes struggle going on, some sort of mystic cult stuff, and a whole lot of “coincidences” that Mitchell is so good at sprinkling throughout his books. There are also some nice references to his other books, including Black Swan Green and The Thousand Autumns of Jacob de Zoet. I liked the latter novel better than this one, despite its gruesomeness, because it felt a bit more grounded. I think fans of Mitchell’s work will like The Bone Clocks, but of his novels, I don’t think I would recommend starting with this one.