Having seen a talk recently by John Ioannidis on how medical research is (often) bunk, this fine corrective by Larry Wasserman was nice to read.

Computer science conferences are often not organized by the ACM; instead, there are different foundations for machine learning, vision, and so on that basically exist to organize the annual conference(s). At least, that is my understanding. A few conferences are run by the ACM, and there’s often debate about whether or not the ACM affiliation is worth it, given the overheads and so on. Boaz Barak had a post a little over a week ago making the case for sticking with the ACM. Given the hegemonic control of the IEEE over all things EE (more or less), this debate is new to me. As far as I can tell, ISIT exists to cover some of the cost of publishing the IT Transactions, and so it sort of has to be run by the IEEE.

As mentioned before, Tara Javidi has a nice post up on what it means for one random variable to be stochastically less variable than another.

Paul Mineiro has a bigger-picture view of NIPS — I saw there were lots of papers on “deep learning” but it’s not really my area so I missed many of those posters.

David Eppstein’s top 10 cs.DS papers from 2012.

Almost a month later, I’m finishing up blogging about NIPS. Merry Christmas and all that (is anyone reading this thing?), and here’s to a productive 2013, research-wise. It’s a bit harder to blog these things because unlike a talk, it’s hard to take notes during a poster presentation.

Overall, I found NIPS to be a bit overwhelming — the single-track format makes it feel somehow more crowded than ISIT, but also it was hard for me to figure out how to strike the right balance of going to talks/posters and spending time talking to people and getting to know what they are working on. Now that I am fairly separated from my collaborators, conferences should be a good time to sit down and work on some problems, but somehow things are always a bit more frantic than I want them to be.

Anyway, from the rest of the conference, here are a few talks/posters that I went to and remembered something about.

T. Dietterich
Challenges for Machine Learning in Computational Sustainability
This was a plenary talk on machine learning problems that arise in natural resources management. There was a lot in this talk, covering problems ranging from prediction (of bird migrations, etc.) to imputation of missing data and classification. These were real-world, hands-on problems, and one thing I got out of it is how much work you need to put into making algorithms work for the data you have, rather than pulling some off-the-shelf works-great-in-theory method. He gave a version of this talk at TTI, but I think the new version is better.

K. Muandet, K. Fukumizu, F. Dinuzzo, B. Schölkopf
Learning from Distributions via Support Measure Machines
This was on generalizing SVMs to take distributions as inputs instead of points — instead of getting individual points as training data, you get distributions (perhaps like clusters) and you have to do learning/classification on that kind of data. Part of the trick here is finding the right mathematical framework that remains computationally tractable.
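As a concrete (toy) illustration of the embedding idea, here is a sketch of my own, not the authors' code: represent each sample set by its kernel mean embedding, and measure similarity between two distributions via the inner product of those embeddings, estimated by averaging a pointwise kernel over all pairs of samples.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Gaussian RBF kernel between two points."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def mean_embedding_kernel(X, Y, gamma=1.0):
    """Inner product of the kernel mean embeddings of two sample sets:
    <mu_P, mu_Q> is estimated by (1/nm) * sum_ij k(x_i, y_j)."""
    return float(np.mean([rbf(x, y, gamma) for x in X for y in Y]))

rng = np.random.default_rng(0)
P = rng.normal(0.0, 1.0, size=(50, 2))  # samples from distribution P
Q = rng.normal(3.0, 1.0, size=(50, 2))  # samples from a shifted distribution Q

# P's embedding is more similar to itself than to the shifted Q
print(mean_embedding_kernel(P, P), mean_embedding_kernel(P, Q))
```

The point is just that a kernel on distributions can be computed from samples alone; an SVM built on such a kernel then classifies whole distributions.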

J. Duchi, M. Jordan, M. Wainwright
Privacy Aware Learning
Since I work on privacy, this was of course interesting to me — John told me a bit about the work at Allerton. The model of privacy is different from the “standard” differential privacy model — data is stochastic and the algorithm itself (the learner) is not trusted, so noise has to be added to individual data points. A bird’s-eye view of the idea is this: (1) stochastic gradient descent (SGD) is good for learning and is robust to noise (e.g. noisy gradients), (2) noise is good at protecting privacy, so (3) SGD can be used to guarantee privacy by using noisy gradients. Privacy is measured here in terms of the mutual information between a data point and a noisy gradient computed from that data point. The result is a slowdown in the convergence rate that is a function of the mutual information bound, and it appears in the same place in the upper and lower bounds.
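Here is a minimal sketch of step (3) on a toy least-squares problem; the data, noise scale, and step sizes are all my own choices, and it illustrates SGD's robustness to noisy gradients rather than their actual privacy analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = <w_true, x> + small observation noise
n, d = 500, 5
w_true = np.ones(d)
X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

def noisy_sgd(X, y, sigma, steps=2000, lr=0.05):
    """SGD on squared loss; Gaussian noise of scale sigma is added to each
    single-sample gradient before it is used (the privacy-protecting step)."""
    w = np.zeros(X.shape[1])
    for t in range(steps):
        i = rng.integers(len(y))
        grad = (X[i] @ w - y[i]) * X[i]          # gradient at one data point
        grad += sigma * rng.normal(size=len(w))  # perturb before releasing
        w -= (lr / np.sqrt(t + 1)) * grad        # decaying step size
    return w

w_hat = noisy_sgd(X, y, sigma=1.0)
print(np.linalg.norm(w_hat - w_true))  # typically still close to w_true
```

Even with per-gradient noise comparable to the signal, the decaying step sizes average it out; the cost shows up in how many steps you need, which is exactly where their mutual-information bound bites.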

J. Wiens, J. Guttag, E. Horvitz
Patient Risk Stratification for Hospital-Associated C. Diff as a Time-Series Classification Task
This was a cool paper on predicting which patients will be infected with C. diff (a common secondary infection people pick up while in the hospital). Since there is different data for each patient and lots of missing data, the classification problem is not easy — they try to assess a time-evolving risk of infection and then predict whether or not the patient will test positive for C. diff.

P. Loh, M. Wainwright
No voodoo here! Learning discrete graphical models via inverse covariance estimation
This paper won a best paper award. The idea is that for Gaussian graphical models the inverse covariance matrix is graph-compatible — zeros correspond to missing edges. However, this fails in general for discrete graphical models. So instead they build the covariance matrix for tuples of variables — $\{X_1, X_2, X_3, X_4, X_1 X_2, X_1 X_3, \ldots \}$ (really what they want is a triangulation of the graph) — and then show that indeed, the inverse of this augmented covariance matrix does respect the graph structure in a sense. More carefully, they have to augment the variables with the power set of the maximal cliques in a triangulation of the original graphical model. The title refers to the so-called “nonparanormal” methods, which are also used for non-Gaussian graphical models.
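A quick numerical check of the Gaussian fact they build on (a toy example of mine): in a three-variable Gaussian chain, the inverse-covariance entry for the missing edge is (near) zero.

```python
import numpy as np

# A Gaussian chain X1 - X2 - X3: X1 and X3 are conditionally independent
# given X2, so the (1,3) entry of the precision matrix is zero.
rng = np.random.default_rng(0)
n = 100_000
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
x3 = 0.6 * x2 + rng.normal(size=n)

cov = np.cov(np.vstack([x1, x2, x3]))  # rows are variables
prec = np.linalg.inv(cov)

# The missing edge (1,3) shows up as a near-zero entry, while the
# present edges (1,2) and (2,3) give entries bounded away from zero.
print(np.round(prec, 2))
```

On $10^5$ samples the $(1,3)$ entry comes out essentially zero while the chain edges stay clearly nonzero; it is exactly this correspondence that breaks for discrete models without their augmentation.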

Distributed Non-Stochastic Experts
This was a star-network with a centralized learner and a bunch of experts, except that the expert advice arrives at arbitrary times — there’s a tradeoff between how often the experts communicate with the learner and the achievable regret, and they try to quantify this tradeoff.

M. Streeter, B. McMahan
No-Regret Algorithms for Unconstrained Online Convex Optimization
There’s a problem with online convex optimization when the feasible set is unbounded: standard regret bounds need to know that the optimal $x^{\ast}$ lies in a bounded set in order to give a rate of convergence. They get around this by proposing an algorithm called “reward doubling,” which tries to maximize reward instead of minimizing regret.

Y. Chen, S. Sanghavi, H. Xu
Clustering Sparse Graphs
Suppose you have a graph and want to partition it into clusters with high intra-cluster edge density and low inter-cluster density. They come up with nuclear-norm plus $L_1$ objective function to find the clusters. It seems to work pretty well, and they can analyze it in the planted partition / stochastic blockmodel setting.
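Just to see the planted partition model in action, here is a toy sketch of mine; it uses a plain spectral split rather than their nuclear-norm plus $L_1$ program, so it only checks that the planted clusters are recoverable on this model, not their method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Planted partition: two clusters of 50 nodes, dense inside, sparse across
m, p_in, p_out = 50, 0.8, 0.1
labels = np.array([0] * m + [1] * m)
prob = np.where(labels[:, None] == labels[None, :], p_in, p_out)
A = (rng.random((2 * m, 2 * m)) < prob).astype(float)
A = np.triu(A, 1)
A = A + A.T  # symmetric adjacency matrix, no self-loops

# Spectral sanity check: after centering, the top eigenvector of the
# adjacency matrix splits the two planted blocks by sign.
vals, vecs = np.linalg.eigh(A - A.mean())
guess = (vecs[:, -1] > 0).astype(int)
accuracy = max(np.mean(guess == labels), np.mean(guess != labels))
print(accuracy)
```

At these densities the split is essentially exact; the interest of their convex formulation is that it comes with guarantees in sparser and messier regimes.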

P. Shenoy, A. Yu
Strategic Impatience in Go/NoGo versus Forced-Choice Decision-Making
This was a talk on cognitive science experimental design. They explain the difference between these two tasks in terms of a cost asymmetry, and use some decision analysis to explain a bias in the Go/NoGo task in terms of Bayes-risk minimization. The upshot is that the difference between these two tasks may reflect not a difference in cognitive processing but a difference in the cost structure used by the brain to make decisions. It’s kind of like changing the rules of the game, I suppose.

S. Kpotufe, A. Boularias
This was a super-cute paper, which basically says that if the regression function is very sensitive in some coordinates and not so much in others, you can use information about its gradient/derivative to rebalance things and come up with a much better estimator.
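Here is a toy version of the idea, with my own names and parameters; for simplicity I estimate the per-coordinate derivative magnitudes by finite differences of the true function, whereas the paper estimates them from the sample itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Regression function that is very sensitive to x0 and ignores x1..x4
def f(x):
    return np.sin(5 * x[..., 0])

n, d = 500, 5
X = rng.random((n, d))
y = f(X) + 0.01 * rng.normal(size=n)
Xtest = rng.random((200, d))

def knn_predict(Xtr, ytr, Xte, w, k=5):
    """k-NN regression with per-coordinate distance weights w."""
    preds = np.empty(len(Xte))
    for i, x in enumerate(Xte):
        dists = np.sum((w * (Xtr - x)) ** 2, axis=1)
        preds[i] = ytr[np.argsort(dists)[:k]].mean()
    return preds

# Gradient weights: average derivative magnitude per coordinate (here via
# finite differences of f; the paper estimates this from the data)
h = 0.01
grads = np.stack([(f(X + h * np.eye(d)[j]) - f(X - h * np.eye(d)[j])) / (2 * h)
                  for j in range(d)], axis=1)
w = np.mean(np.abs(grads), axis=0)

mse_plain = np.mean((knn_predict(X, y, Xtest, np.ones(d)) - f(Xtest)) ** 2)
mse_gw = np.mean((knn_predict(X, y, Xtest, w) - f(Xtest)) ** 2)
print(mse_plain, mse_gw)  # gradient weighting helps a lot here
```

The irrelevant coordinates get weight (near) zero, so the weighted metric behaves like a one-dimensional nearest-neighbor search along the sensitive coordinate.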

K. Jamieson, R. Nowak, B. Recht
Query Complexity of Derivative-Free Optimization
Sometimes taking derivatives is expensive or hard, but you can approximate them by taking two close points and computing an approximation. This requires the function evaluations to be good. Here they look at how to handle approximate gradients computed with noisy function evaluations and find the convergence rate for those procedures.
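A minimal sketch of the setup (my own toy example, not their algorithm): two-point finite differences built from noisy function evaluations, plugged into gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_noisy(x, noise=0.01):
    """Noisy evaluation of f(x) = ||x||^2; only function values are available."""
    return np.sum(x ** 2) + noise * rng.normal()

def fd_gradient(x, h):
    """Two-point finite-difference gradient estimate from noisy values:
    g_i ~ (f(x + h e_i) - f(x - h e_i)) / (2h)."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f_noisy(x + e) - f_noisy(x - e)) / (2 * h)
    return g

x = np.ones(3)
for t in range(200):
    x -= 0.1 * fd_gradient(x, h=0.1)
print(np.linalg.norm(x))  # driven toward the minimizer at the origin
```

The spacing $h$ is the tension they study: small $h$ makes the difference quotient accurate but amplifies the evaluation noise (which is divided by $2h$), and that tradeoff drives the achievable convergence rate.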

On Saturday I attended the Electronic Data Methods (EDM) Forum Symposium in Orlando. The focus of the workshop was how to build infrastructure for sharing clinical data to improve patient care. This comes in two flavors: quality improvement (QI), which refers to learning from clinical data much like a feedback loop, and patient-centered outcomes research (PCOR) or comparative effectiveness research (CER), which looks at how patient outcomes vary across different treatments. There’s a lot of hope that moving to electronic health records (EHRs) can facilitate these kinds of studies, but the upshot of the workshop was that there are a lot of practical impediments.

One big issue that came up was essentially how EHRs are used, and how the data in them is hard to get out in a consistent and quantifiable way. Physicians record results in idiosyncratic ways, and in order to get practicing physicians to buy in, the data format of EHRs is rather flexible, resulting in huge headaches for people trying to extract a data table out of a database of EHRs. Much of the data is in running text — NLP approaches are improving, but it’s far from automated.

Once the data is extracted, it turns out it’s quite noisy and poorly validated. Sometimes it’s a case of garbage-in: the data was not recorded properly in the first place. Other times, it’s due to miscalibration. There were a number of talks (which I missed) dedicated to this. Then there are questions of whether the data you have collected is representative. If you are trying to draw inferences across multiple sites, how do you appropriately account for confounding factors such as demographic differences? This is the kind of thing that can plague even a single-site observational study, but it becomes particularly acute for multi-site investigations.

Finally, even if each site can extract a more-or-less clean data set, you have the problem of sharing this data. This raises headaches from a policy perspective as well as a technological one. On the policy side, each site has its own IRB and its own review process, and many institutions are hesitant to cede authority to third-party or federated IRBs. For a small number of sites, a policy and technology framework can be worked out, but scaling these systems up and providing oversight is going to raise new challenges that we probably cannot anticipate. Even if two sites want to share data, they have to implement privacy protections, and depending on the kind of data being shared, technologies may not even exist to mask patient identities — biological samples are inherently problematic in this regard, but even sharing a data table is non-trivial. Apart from the privacy concerns, creating a common schema for the data to be shared sounds like an obvious thing to do, but if the two sites are using different EHR software… well, let’s say it’s not as easy as sharing PowerPoint from Mac to PC.

All in all, I came away feeling like the state of the art is both depressing and invigorating — there’s a lot to do, and I just hope that the short time frame that people go on about doesn’t result in half-baked partial solutions becoming the standard. There are a lot of questions from basic statistics through distributed system design here, so maybe after chewing on it a while I’ll get some new problem ideas.

I also attended HealthSec ’11 this week, and the program was a little different from what I had expected. There was a mix of technical talks and policy/framework proposals around a couple of themes:

• security in medical devices
• auditing in electronic medical records
• medical record dissemination and privacy

In particular, a key challenge coming up in healthcare is how patient information will be handled in the health insurance exchanges (HIEs) being created as part of the Affordable Care Act. The real question is the threat model for health information: hackers who want to steal health records wholesale, or the selling of data by third parties (e.g. insurance companies). Matthew Green from Dartmouth discussed implications of the PCAST report on Health Information Technology, which I will have to read.

The most interesting part of the workshop was the panel on de-identification and whether it is a relevant or useful framework moving forward. The panelists were Sean Nolan from Microsoft, Kelly Edwards from the University of Washington, Arvind Narayanan from Stanford, and Lee Tien from the EFF. Sean Nolan talked a bit about how HIPAA acts as an impediment to exploratory research, which I have worked on a little, but also raised the thorny ethical issue of public good versus privacy, which is key to understanding the debate over health records in clinical research. Edwards is a bioethicist and had some very important points to raise about how informed consent is an opportunity to educate patients about their (potential) role in medical research, and to make them feel like informed participants in the process. How we phrase the tradeoff Nolan mentioned is really a question of ethics: how do we communicate that tradeoff to patients? Narayanan (famous for his Netflix deanonymization) argued that the relationship between technology and policy has to be rethought, or turned into more of a dialogue rather than a blame-shifting or challenge-posing exercise. Lee Tien made a crucial point that if we do not understand how patient data moves about in our existing system, then we have no hope of reform or regulation, and no stakeholder in the system now has that “bird’s-eye view” of these data flows.

I hope that in the future I can contribute to this in some way, but in the meantime I’ve been left with a fair bit to chew on. Although the conference was perhaps a bit less technical than I would have liked, I think it was quite valuable as a starting point for future work.

I came down with the flu at the tail end of ITA, so I proceeded to fail at a bunch of things, like submitting reviews that were due, writing a submission for ISIT, and blogging the ITA Workshop in time. Cosma already blogged about his favorite talks, beating me to the punch.

I basically missed the first day because of technical glitches with our registration system, but once that was all resolved things went a bit more smoothly (at least I hope they did). The poster session seemed well-attended, and we shot videos of all the posters, which will be posted at some point. Ofer did a great job arranging the Graduation Day and poster events. The thing about these conferences is that you end up wanting to talk to people you haven’t seen in a while, and it’s good to hammer out research ideas during the daytime, so I only made it to the keynote and Xiao-Li Meng’s tutorial on MCMC. I felt like I followed the tutorial at the beginning, but even Prof. Meng’s engaging speaking style lost me when it came to modeling something about stars (?). But there will be videos posted of the tutorials soon enough as well. I’ll probably make a post about those. For those who were at the entertainment program, of course the video for that was top priority. For the small number of blog readers who wish to know what I was making:

• 2 oz. Maker’s Mark bourbon
• 1 oz. Carpano Antica formula vermouth
• 1 dash Angostura bitters
• 1 dash Regan’s No. 6 orange bitters

Shaken with ice, served up with a cherry. I opted for a bourbon Manhattan with a cherry rather than a rye Manhattan with an orange twist (or without garnish) because it was more convenient, and also more 1960s than craft cocktail.

But on to the talks! I did manage to drag my lazy butt to some of them.

Improved rate-equivocation regions for secure cooperative communication
Ninoslav Marina, Hideki Yagi, H. Vincent Poor
They looked at a model where you have a transmitter and also a “blind” helper who is trying to help communicate over a wiretap channel. They show a better achievable rate-equivocation region by introducing another auxiliary random variable (big surprise!), but this doesn’t affect the best secrecy rate. So if you are willing to tolerate less than full equivocation at the eavesdropper then you’ll get an improvement.

Shannon’s inequality
S. Verdú, Princeton
Sergio talked about an alternative to Fano’s inequality used by Shannon:
$P_e \ge \frac{1}{6} \frac{ H(X|Y) }{ \log M + \log \log M - \log H(X|Y) }$
It was a nice talk, and the kind of talk I think is great at ITA. It’s not a new result, but ITA is a place where you can give a talk that explains some cool connection or new idea you have.

On the zero-error capacity threshold for deletion channels
Ian A. Kash, Michael Mitzenmacher, Justin Thaler, Jon Ullman
A nice piece of work connecting zero-error capacity for deletion channels with longest common subsequences. The error model is adversarial. You can make a graph where each vertex is a length-$n$ binary string, and connect two vertices if the two strings have a longest common subsequence of length at least $(1 - p)n$. If two strings are connected then they can’t be in the same code, since an adversary could delete $p n$ bits from each and produce the common subsequence (note: not substring). So you can get a bound on the capacity by bounding the largest independent set in this graph. So then you can use… Turán’s theorem! Hooray! There are more results of course…
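For tiny $n$ you can build this confusability graph explicitly; the following sketch (mine, with made-up parameters) computes the average degree and the resulting Turán-type lower bound $N/(\bar{d}+1)$ on the largest independent set, i.e. on the size of a zero-error code.

```python
from itertools import product

def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b (DP)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a):
        for j, cb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if ca == cb else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

n, p = 6, 0.5  # the adversary may delete p*n bits
strings = [''.join(bits) for bits in product('01', repeat=n)]
threshold = (1 - p) * n

# Confusability graph: connect strings whose LCS is >= (1-p)n; a zero-error
# code must be an independent set in this graph.
degrees = [sum(1 for t in strings if t != s and lcs_len(s, t) >= threshold)
           for s in strings]
avg_deg = sum(degrees) / len(strings)

# Turán-type bound: every graph has an independent set of size
# >= N / (average degree + 1)
print(len(strings) / (avg_deg + 1))
```

For $n = 6$, $p = 1/2$ the graph is nearly complete, so the bound is close to trivial; the content of the paper is what happens to these quantities as $n$ grows.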

Data-driven decision making in healthcare systems
Mohsen Bayati, Stanford, Mark Braverman, U Toronto, Michael Gillam, Microsoft, Mark Smith, MedStar Health, and Eric Horvitz, Microsoft Research
This was a nice talk on doing feature selection via ideas from sparse signal processing/machine learning. The goal is to find a small set of features that predict whether a patient is at high or low risk of being readmitted soon after discharge from the hospital; the catch is that the number of candidate features is huge while the number of data points is small. They do an $L_1$-penalized logistic regression and then derive a threshold based on the cost of doing an intervention (e.g. house visits for high-risk patients).
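A toy version of the selection step, with my own synthetic data and a plain proximal-gradient (ISTA) solver rather than whatever the authors actually used:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "many features, few patients" data: only the first 3 of 50
# features actually drive the readmission outcome.
n, d = 100, 50
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -2.0, 1.5]
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def l1_logistic(X, y, lam, steps=3000, lr=0.1):
    """L1-penalized logistic regression via proximal gradient (ISTA):
    a gradient step on the logistic loss, then soft-thresholding."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)                        # loss gradient
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox step
    return w

w_hat = l1_logistic(X, y, lam=0.08)
selected = np.nonzero(w_hat)[0]
print(selected)  # a small set of selected features
```

The soft-thresholding step zeroes out most coordinates exactly, which is what makes the fitted model usable as a short checklist of risk features.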

Tracking climate models: advances in Climate Informatics
Claire Monteleoni, Columbia CCLS, Gavin A. Schmidt, NASA and Columbia, Shailesh Saroha, and Eva Asplund, Columbia Computer Science
This was an overview of Claire’s work on climate informatics. The basic problem is this: given several models (large-scale simulated systems based on PDEs etc. derived from physics) that predict future temperature, how should you combine them to produce more accurate predictions? She used some tools from her previous work on HMMs to get a system with better prediction accuracy.
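As a toy stand-in for the combination step, here is the basic exponentially weighted average forecaster (not Claire's HMM-based tracking algorithm) run on three synthetic "models" of my own invention:

```python
import numpy as np

rng = np.random.default_rng(0)

T, K = 500, 3
truth = np.sin(np.linspace(0, 10, T))  # the "true temperature" series
# Three imperfect models: biased, noisy, and a slightly lagged copy
models = np.stack([truth + 0.5,
                   truth + 0.3 * rng.normal(size=T),
                   np.roll(truth, 5)])

eta = 0.5                  # learning rate for the weight updates
w = np.ones(K) / K         # start with uniform weights over the models
losses = np.zeros(K)
pred_loss = 0.0
for t in range(T):
    pred = w @ models[:, t]                # weighted combination
    pred_loss += (pred - truth[t]) ** 2
    inst = (models[:, t] - truth[t]) ** 2  # each model's squared loss
    losses += inst
    w = w * np.exp(-eta * inst)            # exponential weight update
    w /= w.sum()

print(pred_loss, losses)  # combined loss vs. each model's total loss
```

The weights quickly concentrate on the best model, so the combined forecaster's cumulative loss stays close to (here, even below) the best single model's; the tracking machinery in the actual work additionally handles the best model changing over time.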

On a question of Blackwell concerning hidden Markov chains
Ramon van Handel
The problem is trying to estimate the entropy rate of a process that is a function of a Markov chain (and hence not a Markov chain itself). “Does the conditional distribution of an ergodic hidden Markov chain possess a unique invariant measure?” This was a great talk for the Blackwell session because it started from a question posed by Blackwell and then revisited a few of his other works. Pretty amazing. Oh, and the paper (or one of them).

I think more talks will have to wait for another time (if ever).

Some interesting stuff has passed my way while being in India (and one or two things from before). Might as well post them before I forget, no?

Slavoj Žižek may be a curmudgeonly Marxist, but the animation helps soften it, I think. I don’t think I fully agree with him, but there’s stuff in there to chew on.