Better late than never, I suppose. A few weeks ago I escaped the cold of New Jersey to my old haunts of San Diego. Although La Jolla was always a bit fancy for my taste, it’s hard to beat a conference which boasts views like this:
A view from the sessions at ITA 2015
I’ll just recap a few of the talks that I remember from my notes — I didn’t really take notes during the plenaries so I don’t have much to say about them. Mostly this was due to laziness, but finding the time to blog has been challenging in this last year, so I think I have to pick my battles. Here’s a smattering consisting of
Emina Soljanin talked about designing codes that are good for fast access to the data in distributed storage. Initial work focused on how to repair codes under disk failures. She looked at how easy it is to retrieve the information afterwords to guarantee some QoS for the storage system. Adam Kalai talked about designing compression schemes that work for an “audience” of decoders. The decoders have different priors on the set of elements/messages so the idea is to design an encoder that works for this ensemble of decoders. I kind of missed the first part of the talk so I wasn’t quite sure how this relates to classical work in mismatched decoding as done in the information theory world. Gireeja Ranade gave a great talk about defining notions of capacity/rate need to control a system which as multiplicative uncertainty. That is, where has the uncertainty. She gave a couple of different notions of capacity, relating to the ratio — either the expected value of the square or the log, appropriately normalized. She used a “deterministic model” to give an explanation of how control in this setting is kind of like controlling the number of significant bits in the state: uncertainty increases this and you need a certain “amount” of control to cancel that growth.
(Learning and statistics)
I learned about active regression approaches from Sivan Sabato that provably work better than passive learning. The idea there is do to use a partition of the X space and then do piecewise constant approximations to a weight function that they use in a rejection sampler. The rejection sampler (which I thought of as sort of doing importance sampling to make sure they cover the space) helps limit the number of labels requested by the algorithm. Somehow I had never met Raj Rao Nadakuditi until now, and I wish I had gotten a chance to talk to him further. He gave a nice talk on robust PCA, and in particular how outliers “break” regular PCA. He proposed a combination of shrinkage and truncation to help make PCA a bit more stable/robust. Laura Balzano talked about “estimating subspace projections from incomplete data.” She proposed an iterative algorithm for doing estimation on the Grassmann manifold that can do subspace tracking. Constantine Caramanis talked about a convex formulation for mixed regression that gives a guaranteed solution, along with minimax sample complexity bounds showing that it is basically optimal. Yingbin Liang talked about testing approaches for understanding if there is an “anomalous structure” in a sequence of data. Basically for a sequence , the null hypothesis is that they are all i.i.d. and the (composite) alternative is that there an interval of indices which are instead. She proposed a RKHS-based discrepancy measure and a threshold test on this measure. Pradeep Ravikumar talked about a “simple” estimator that was a “fix” for ordinary least squares with some soft thresholding. He showed consistency for linear regression in several senses, competitive with LASSO in some settings. Pretty neat, all said, although he also claimed that least squares was “something you all know from high school” — I went to a pretty good high school, and I don’t think we did least squares! Sanmi Koyejo talked about a Bayesian devision theory approach to variable selection that involved minimizing some KL-divergence. Unfortunately, the resulting optimization ended up being NP-hard (for reasons I can’t remember) and so they use a greedy algorithm that seems to work pretty well.
Cynthia Dwork gave a tutorial on differential privacy with an emphasis on the recent work involving false discovery rate. In addition to her plenary there were several talks on differential privacy and other privacy measures. Kunal Talwar talked about their improved analysis of the SuLQ method for differentially private PCA. Unfortunately there were two privacy sessions in parallel so I hopped over to see John Duchi talk about definitions of privacy and how definitions based on testing are equivalent to differential privacy. The testing framework makes it easier to prove minimax bounds, though, so it may be a more useful view at times. Nadia Fawaz talked about privacy for time-series data such as smart meter data. She defined different types of attacks in this setting and showed that they correspond to mutual information or directed mutual information, as well as empirical results on a real data set. Raef Bassily studied a estimation problem in the streaming setting where you want to get a histogram of the most frequent items in the stream. They reduce the problem to one of finding a “unique heavy hitter” and develop a protocol that looks sort of like a code for the MAC: they encode bits into a real vector, had noise, and then add those up over the reals. It’s accepted to STOC 2015 and he said the preprint will be up soon.
Posting a hodgepodge of links after a rather wonderful time hiking and camping, solving puzzles, and the semester starting all together too soon for my taste.
[Trigger warning] More details on Walter Lewin’s actions.
The unbearable maleness of Wikipedia.
Hanna Wallach’s talk at the NIPS Workshop on fairness.
Reframing Science’s Diversity Challenge by trying to move beyond the pipeline metaphor.
An essay by Daniel Solove on privacy (I’d recommend reading his books too but this is shorter). He takes on the “nothing to hide” argument against privacy.
I don’t like IPAs that much, but this lawsuit about lettering seems like a big deal for the craft beer movement.
I’ve always been a little skeptical of Humans of New York, but never was sure why. I think this critique has something to it. Not sure I fully agree but it does capture some of my discomfort.
Judith Butler gave a nice interview where she talks a bit about why “All Lives Matter,” while true, is not an appropriate rhetorical strategy: “If we jump too quickly to the universal formulation, ‘all lives matter,’ then we miss the fact that black people have not yet been included in the idea of ‘all lives.’ That said, it is true that all lives matter (we can then debate about when life begins or ends). But to make that universal formulation concrete, to make that into a living formulation, one that truly extends to all people, we have to foreground those lives that are not mattering now, to mark that exclusion, and militate against it.”
A nice essay on morality and progress with respect to Silicon Valley. Techno-utopianism running amok leads to bad results: “Silicon Valley’s amorality problem arises from the implicit and explicit narrative of progress companies use for marketing and that people use to find meaning in their work. By accepting this narrative of progress uncritically, imagining that technological change equals historic human betterment, many in Silicon Valley excuse themselves from moral reflection.”
So Waheed Bajwa and I have been teaching this Byrne Seminar on “data science.” At Allerton some people asked me how it was going and what we were covering in the class. These seminars are meant to be more discussion-based. This is a bit tough for us in particular:
- engineering classes are generally NOT discussion-based, neither in the US nor in Pakistan
- it’s been more than a decade since we were undergraduates, let alone 18
- the students in our class are fresh out of high school and also haven’t had discussion-based classes
My one experience in leading discussion was covering for a theater class approximately 10 years ago, but that was junior-level elective as I recall, and the dynamics were quite a bit different. So getting a discussion going and getting all of the students to participate is, on top of being tough in general, particularly challenging for us. What has helped is that a number of the students in the class are pretty engaged with the ideas and material, and we do in the end get to collectively think about the technologies around us and the role that data plays a bit differently.
What I wanted to talk about in this post was what we’ve covered in the first few weeks. If we offer this class again it would be good to revisit some of the decisions we’ve made along the way, as this is as much a learning process for us as it is for them. A Byrne Seminar meets for 10 times during the semester, so that it will end well before finals. We had some overflow from one topic to the next, but roughly speaking the class went in the following order:
- Introduction: what is data?
- Potentials and perils of data science
- The importance of modeling
- Statistical considerations
- Machine learning and algorithms
- Data and society: ethics and privacy
- Data visualizaion
- Project Presentations
I’ll talk a bit more on the blog about this class, what we covered, what readings/videos we ended up choosing, and how it went. I think it would be fun to offer this course again, assuming our evaluations pass muster. But in the meantime, the class is still on, so it’s a bit hard to pass retrospective judgement.
IEEE Signal Processing Society
IEEE Journal of Selected Topics in Signal Processing
Special Issue on Signal and Information Processing for Privacy
Aims and Scope: There has been a remarkable increase in the usage of communications and information technology over the past decade. Currently, in the backend and in the cloud, reside electronic repositories that contain an enormous amount of information and data associated with the world around us. These repositories include databases for data-mining, census, social networking, medical records, etc. It is easy to forecast that our society will become increasingly reliant on applications built upon these data repositories. Unfortunately, the rate of technological advancement associated with building applications that produce and use such data has significantly outpaced the development of mechanisms that ensure the privacy of such data and the systems that process it. As a society we are currently witnessing many privacy-related concerns that have resulted from these technologies—there are now grave concerns about our communications being wiretapped, about our SSL/TLS connections being compromised, about our personal data being shared with entities we have no relationship with, etc. The problems of information exchange, interaction, and access lend themselves to fundamental information processing abstractions and theoretical analysis. The tools of rate-distortion theory, distributed compression algorithms, distributed storage codes, machine learning for feature identification and suppression, and compressive sensing and sampling theory are fundamental and can be applied to precisely formulate and quantify the tradeoff between utility and privacy in a variety of domains. Thus, while rate-distortion theory and information-theoretic privacy can provide fundamental bounds on privacy leakage of distributed data systems, the information and signal processing techniques of compressive sensing, machine learning, and graphical models are the key ingredients necessary to achieve these performance limits in a variety of applications involving streaming data, distributed data storage (cloud), and interactive data applications across a number of platforms. This special issue seeks to provide a venue for ongoing research in information and signal processing for applications where privacy concerns are paramount.
Topics of Interest include (but are not limited to):
- Signal processing for information-theoretic privacy
- Signal processing techniques for access control with privacy guarantees in distributed storage systems
- Distributed inference and estimation with privacy guarantees
- Location privacy and obfuscation of mobile device positioning
- Interplay of privacy and other information processing tasks
- Formalized models for adversaries and threats in applications where consumer and producer privacy is a major concern
- Techniques to achieve covert or stealthy communication in support of private communications
- Competitive privacy and game theoretic formulations of privacy and obfuscation
Manuscript submission due: October 1, 2014
First review completed: December 15, 2014
Revised manuscript due: February 1, 2015
Second review completed: March 15, 2015
Final manuscript due: May 1, 2015
Publication date: October 2015
Prospective authors should visit the JSTSP homepage for information on paper submission. Manuscripts should be submitted using Manuscript Central.
I am traveling all over India at the moment so I’m not really able to write contentful posts. Here are even more links instead, sigh. Maybe later I’ll talk about log-Sobolev inequalities so I can be cool like Max.
Speaking of Max, he posted this hilarious bad lip reading version of Game of Thrones. Probably NSFW. I don’t even like the series but it’s pretty funny.
For those who are fans of Rejected, Don Hertzfeldt’s new film is available on Vimeo.
Those who were at Berkeley may remember seeing Ed Reed perform at the Cheeseboard. His album (which I helped fund via indiegogo, was named a Downbeat Editors’ Pick. It’s a great album.
In light of the Snowden leaks, some doubt has been cast on NIST’s crypto standards.
I’m super late to this, but I endorse Andrew’s endorsement of Sergio‘s interview with Robert Fano in the IT Newsletter. Here’s just the article, if you want that.
David McAllester, my department chair at TTI, has a started a new blog.
I thought it was pretty well known that people are fairly unique by ZIP code, but Forbes has an article about it now (h/t Raj). Of course, stores can also ping a smartphone’s WiFi to get more accurate location information about your activity within the store — when you check out they can tag your the MAC address of your device to all the other information about you. Creeptastic!
Bradley Efron’s perspective on the impact of Bayes’ Theorem from Science (h/t Kevin).
Some discussion on what makes a popular philosophy book. I wonder what, if anything, transfers over to a popular mathematical book?
Some thoughts from Larry Laudan on the mathematization of the presumption of innocence.
I’m on the program committee for the Cyber-Security and Privacy symposium, so I figured I would post this here to make more work for myself.
GlobalSIP 2013 – Call for Papers
IEEE Global Conference on Signal and Information Processing
December 3-5, 2013 | Austin, Texas, U.S.A.
GlobalSIP: IEEE Global Conference on Signal and Information Processing is a new flagship IEEE Signal Processing Society conference. The focus of this conference is on signal and information processing and up-and-coming signal processing themes.
GlobalSIP is composed of symposia selected based on responses to the call-for-symposia proposals. GlobalSIP is composed of symposia on hot topics related to signal and information processing.
The selected symposia are:
Paper submission will be online only through the GlobalSIP 2013 website Papers should be in IEEE two-column format. The maximum length varies among the symposia; be sure to check each symposium’s information page for details. Authors of Signal Processing Letters papers will be given the opportunity to present their work at GlobalSIP 2013, subject to space availability and approval by the Technical Program Chairs of GlobalSIP 2013. The authors need to specify in which symposium they wish to present their paper. Please check conference webpage for details.
*New* Paper Submission Deadline – June 15, 2013
Review Results Announce – July 30, 2013
Camera-Ready Papers Due – September 7, 2013
*New* SPL request for presentation – September 7, 2013