Domingos on what you should know about machine learning

Dhruv Batra forwarded this Communications of the ACM article by Pedro Domingos, entitled “A Few Useful Things to Know about Machine Learning” [free version]. The main point from the abstract is:

However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.

The article focuses on the classification problem to illustrate these “key lessons.” It’s well worth reading, especially for people who don’t work on machine learning, because it explains a number of important issues.

  1. It illustrates the gap between what the theory/research works on and the nitty-gritty of applying these algorithms to real data.
  2. It gives people who want to implement an ML method important fundamental questions to ask before starting : how do I represent my data? How do I evaluate performance? How do I do things efficiently? These have to get squared away first.
  3. Domain knowledge and feature engineering are the keys to success.

Since I’m guessing there are 2 machine learners who read this blog, go read it (unless you are one of my friends who doesn’t care about all of these technical posts).

Tracks : language is overrated

  1. Cliché Intro — Prefuse 73
  2. Nanorobot Tune — Tomas Dvorak, Machinarium Soundtrack
  3. Endorphin — Burial
  4. Missionary Ridge — William Tyler
  5. Hey-Hee-Hi-Ho — Medeski, Martin & Wood
  6. Soutoukou — Mamadou Diabate
  7. Rustem — Taraf de Haidouks
  8. Snowden’s Jig — Carolina Chocolate Drops
  9. Hashmal — Masada
  10. Captain Hook — Mar Caribe
  11. Black Unstoppable — Nicole Mitchell
  12. Stop Time — Horace Silver
  13. Pickin’ Up The Cabbage — Cab Calloway
  14. Smedley’s Melody — Squarepusher
  15. Baraat To Nowhere — Red Baraat
  16. Lou courut — Véronique Gens w/Orchestre National de Lille-Région Nord
  17. Saudade Dada — Arrigo Barnabé
  18. Watermelon Man — Mongo Santamaria
  19. Greensleeves — Matthew Shipp
  20. Clapping Music — Steve Reich/The Sixteen
  21. music for morning people — Kid Koala

Allerton 2012 : Karl J. Åström’s Jubilee Lecture

It’s the fall again, and this year is the 50th anniversary of the Allerton Conference. Tonight was a special Golden Jubilee lecture by Karl Johan Åström from Lund University. He gave an engaging view of the pre-history, history, present, and future of control systems. Control is a “hidden technology” he said — it’s everywhere and is what makes all the technology that we use work, but remains largely unknown and unnoticed except during catastrophic failures. He exhorted the young’uns to do a better job at letting people know how important control systems are in everyday life.

The main message of Åström’s talk is that control theory and control practice need to get back together so that we can develop new control theories for emerging areas, including biology and physics. He called this the “holistic” view and pointed out that it really emerged out of the war effort during WWII, when control systems had to be developed for all sorts of military tasks. This got the mathematicians in the same room as the “real” engineers, and led to a lot of new theory. I had always known that the war effort was a big driver, but I hadn’t thought about how control really was the glue that tied things together.

Sita tries to send a message to Rama using a digital certificate

Via Erin (via Bruce Schneier’s blog), I found out about S. Parthasarathy’s proposal to replace Alice and Bob with Sita and Rama. I have been known to use Alice and Bob on occasion (unlike some people I find the anthropomorphizing to be good, on balance), but perhaps I should develop some cultural pride and make the switch to “a smarter alternative to these characters.” According to Parthasarathy, there is greater literary relevance to the scenario where Sita wants to send a message to Rama. The dramatis personae in this version are:

  • Sita : kidnapped maiden who wishes to send a message
  • Rama : brave prince who is to receive the message
  • Hanuman : the honest broker who relays the message
  • Ravana : the rogue-in-the-middle who acts as the adversary. To avoid confusing first letters, let’s rename him Badmash.

There are a number of other appealing allusions in this scenario.

I think it’s a fun exercise — can one come up with other settings? Perhaps based on Gilgamesh, or Star Wars. I’m sure at least one reader of this blog could come up with a Battlestar Galactica scenario. Adama to Baltar?

Also, I couldn’t help but point to this chestnut, the real story of Alice and Bob (h/t to my father).

2nd iDASH Workshop on Privacy

On Saturday I attended the 2nd iDASH workshop on privacy — I thought overall it went quite well, and it’s certainly true that over the last year the dialogue and understanding have improved between the theory/algorithms, data management/governance, and medical research communities. I developed note fatigue partway through the day, but I wanted to blog a little bit about some of the themes which came up during the workshop. Instead of making a monster post which covers everything, I will touch on a few things here. In particular, there were other talks not mentioned below about issues in data governance, cryptographic approaches, special issues in genomics, study design, and policy. I may touch on those in later posts.

Cynthia Dwork and Latanya Sweeney gave the keynotes, as they did last year, and they dovetailed quite nicely this year. Cynthia’s talk centered on how to think of privacy risk in terms of resource allocation — you have a certain amount of privacy and you have to apportion it over multiple queries. Latanya Sweeney’s talk came from the other direction: the current legal framework in the US is designed to make information flow, and so it is already a privacy-unfriendly policy regime. These raise some serious impediments to practically implementing privacy protections that we develop on the technological side.
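To make the resource-allocation picture concrete, here is a minimal sketch of a privacy budget that is spent across several counting queries. It is not taken from the talk: it assumes the standard Laplace mechanism and basic sequential composition (the epsilons of successive queries add up), and the names PrivacyBudget and laplace_count are made up for illustration.

```python
import random


class PrivacyBudget:
    """Hypothetical tracker for a total epsilon under basic sequential composition."""

    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, epsilon):
        # Each query consumes part of the budget; refuse once it runs out.
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-9:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        return epsilon


def laplace_count(data, predicate, epsilon, rng):
    """Noisy count via the Laplace mechanism (a count has sensitivity 1)."""
    true_count = sum(1 for x in data if predicate(x))
    # The difference of two independent Exp(rate=epsilon) draws is
    # Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [23, 35, 41, 52, 67, 29, 74]  # toy data, purely illustrative
    budget = PrivacyBudget(total_epsilon=1.0)

    # Apportion the total budget evenly over two queries.
    over_40 = laplace_count(ages, lambda a: a > 40, budget.spend(0.5), rng)
    under_30 = laplace_count(ages, lambda a: a < 30, budget.spend(0.5), rng)
    print(over_40, under_30, budget.remaining)
```

Splitting the budget evenly is only one choice; giving a query a smaller epsilon makes its answer noisier but leaves more budget for later queries.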

On the privacy models side, Ashwin Machanavajjhala and Chris Clifton talked about slightly different models of privacy that are based on differential privacy but have a less immediately statistical feel, based on work from PODS 2012 and KDD 2012. Kamalika Chaudhuri talked about our work on differentially private PCA, and Li Xiong talked about differential privacy on time series using adaptive sampling and prediction.

Guy Rothblum talked about something he called “concentrated differential privacy,” which essentially amounts to analyzing the measure concentration properties of the log-likelihood ratio that appears in the differential privacy definition : for any two databases D and D', we want to analyze the behavior of the random variable \log \frac{ \mathbb{P}( M(D) \in S ) }{ \mathbb{P}( M(D') \in S ) } for measurable sets S. Aaron Roth talked about taking advantage of more detailed metric structure in differentially private learning problems to get better accuracy for the same privacy level.
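For readers who haven’t seen it, this is where that log-likelihood ratio comes from. What follows is the standard textbook definition of \epsilon-differential privacy, not the definition from the talk, and it assumes that D and D' are neighboring databases (differing in one individual’s record):

```latex
% Standard \epsilon-differential privacy for a randomized mechanism M:
% for all neighboring databases D, D' and all measurable sets S,
\mathbb{P}( M(D) \in S ) \le e^{\epsilon} \, \mathbb{P}( M(D') \in S ).

% Since the roles of D and D' can be swapped, this is the same as
\left| \log \frac{ \mathbb{P}( M(D) \in S ) }{ \mathbb{P}( M(D') \in S ) } \right| \le \epsilon
% whenever both probabilities are positive.
```

The usual definition is a worst-case bound on this ratio. As I understood it, the concentration viewpoint instead asks how the ratio is distributed, so that large values are allowed as long as they are rare.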

William Thurston on proof and progress

William Thurston passed away a little over a month ago, and while I have never had the occasion to read any of his work, this article of his, entitled “On Proof and Progress in Mathematics,” has been reposted, and I think it’s worth a read for those who think about how mathematical knowledge progresses. For those who do theoretical engineering, I think Thurston offers an interesting outside perspective that is a refreshing antidote to the style of research that we do now. His first point is that we should ask the question:

How do mathematicians advance human understanding of mathematics?

I think we could also ask the question in our own fields, and we can do a similar breakdown to what he does in the article : how do we understand information theory, and how is that communicated to others? Lav Varshney had a nice paper (though I can’t seem to find it) about the role of block diagrams as a mode of communicating our models and results to each other — this is a visual way of understanding. By contrast, I find that machine learning papers rarely have block diagrams or schematics to illustrate the geometric intuition behind a proof. Instead, the visual illustrations are plots of experimental results.

Thurston goes through a number of questions that interrogate the motives, methods, and outcomes of mathematical research, but I think they’re relevant for everyone, even non-mathematical researchers. In the end, research is about communication, and understanding the what, how, and why of that is always a valuable exercise.