IHP “Nexus” Workshop on Privacy and Security: Day 1

The view from my office at IHP

I am attending the Nexus of Information and Computation Theories workshop at the Institut Henri Poincaré in Paris this week. It’s the last week of a 10-week program that brought together researchers from information theory and CS theory in workshops around various themes such as distributed computation, inference, lower bounds, inequalities, and security/privacy. The main organizers were Bobak Nazer, Aslan Tchamkerten, Anup Rao, and Mark Braverman. The last two weeks are on Privacy and Security: I helped organize these two weeks with Prakash Narayan, Salil Vadhan, Aaron Roth, and Vinod Vaikuntanathan.

Due to teaching and ICASSP, I missed last week, but am here for this week, for which the sub-topics are secure multiparty computation and differential privacy. I’ll try to blog about the workshop since I failed to blog at all about ITA, CISS, or ICASSP. The structure of the workshop was to have 4 tutorials (two per week) and then a set of hopefully related talks. The first week had tutorials on pseudorandomness and information theoretic secrecy.

The second week of the workshop kicked off with a tutorial from Yuval Ishai and Manoj Prabhakaran on secure multiparty computation (MPC). Yuval gave an abbreviated version/update of his tutorial from the Simons Institute (pt1/pt2) that set up the basic framework and language around MPC: k parties with inputs x_1, x_2, \ldots, x_k want to exchange messages to implement a functionality (evaluate a function) f(x_1, x_2, \ldots, x_k) over secure point-to-point channels such that they successfully learn the output of the function but don’t learn anything additional about each other’s inputs. There is a landscape of definitions within this general framework: some parties could collude, behave dishonestly with respect to the protocol, and so on. The guarantees could be exact (in the real/ideal paradigm, in which you compare the real system with a simulated system), statistical (the distribution in the real system is close in total variation distance to an ideal evaluation), or computational (some notion of indistinguishability). The example became a bit clearer when he described a 2-party example with a “trusted dealer” who can give the parties some correlated random bits that they can use to randomly shift the truth table/evaluation of f(x_1, x_2) to guarantee correctness and security.
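The trusted-dealer trick can be made concrete. Below is a minimal semi-honest sketch of the “one-time truth table” idea in Python (the function names and structure are mine, not from the talk): the dealer hands each party a random mask for its input and an additive share of a correspondingly shifted truth table, so the online phase is just announcing masked inputs and doing one table lookup.

```python
import random

def dealer(f, n=2):
    """Trusted dealer: correlated randomness for a one-time truth table.
    f maps {0,...,n-1}^2 to {0,...,n-1}."""
    r = random.randrange(n)  # Alice's input mask
    s = random.randrange(n)  # Bob's input mask
    # Bob's table share is uniformly random; Alice's share completes
    # the cyclically shifted truth table.
    M_B = [[random.randrange(n) for _ in range(n)] for _ in range(n)]
    M_A = [[(f((u - r) % n, (v - s) % n) - M_B[u][v]) % n
            for v in range(n)] for u in range(n)]
    return (r, M_A), (s, M_B)

def evaluate(f, x1, x2, n=2):
    """Online phase: each party announces its masked input; the parties
    then combine their table shares at that position."""
    (r, M_A), (s, M_B) = dealer(f, n)
    u = (x1 + r) % n  # Alice's masked input, safe to announce
    v = (x2 + s) % n  # Bob's masked input, safe to announce
    return (M_A[u][v] + M_B[u][v]) % n

AND = lambda a, b: a & b
assert all(evaluate(AND, a, b) == (a & b) for a in (0, 1) for b in (0, 1))
```

Each announced value is masked by a uniform shift the other party never sees, and each table share is uniform on its own, so neither party learns more than the output.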

Manoj, on the other hand, talked about some notions of reductions between secure computations: given a protocol which evaluates f, can you simulate/compute g using calls to f? How many do you need? This gives a notion of the complexity rate of one function in terms of another. For example, can Alice and Bob simulate a BEC using calls to an oblivious transfer (OT) protocol? What about vice versa? What about using a BSC? These problems seem sort of like toy channel problems (from an information theory perspective) but seem like fundamental building blocks when thinking about secure computation. As I discussed with Hoeteck Wee today, in information theory we often gain some intuition from continuous alphabets or large/general alphabet settings, whereas cryptography arguments/bounds come from considering circuit complexity: these are ideas that we don’t think about too much in IT since we don’t usually care about computational complexity/implementation.

Huijia (Rachel) Lin gave an introduction to zero-knowledge proofs and proof systems: a verifier wants to know if a statement X is true and can ask queries to a prover P which has some evidence w that it wants to keep secret. For example, the statement might be “the number y is a perfect square” and the evidence might be an \alpha such that y = \alpha^2 \mod n. The prover doesn’t want to reveal w = \alpha, but instead should convince the verifier that such an \alpha exists. She gave a protocol for this before turning to a more complicated statement like proving that a graph has a Hamiltonian cycle. She then talked about using commitment schemes, at which point I sort of lost the thread of things since I’m not as familiar with these cryptography constructions. I probably should have asked more questions, so it was my loss.
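The perfect-square example corresponds to the classic zero-knowledge protocol for quadratic residuosity. Here is a toy implementation of one round (my own sketch, honest-prover only): the prover commits to a fresh random square, the verifier flips a challenge bit, and the response either opens the commitment or opens it multiplied by the secret root. A cheating prover passes a single round with probability at most 1/2, so rounds are repeated.

```python
import random
from math import gcd

def zk_square_round(n, y, alpha):
    """One round of the classic ZK protocol for 'y is a square mod n'.
    The prover knows alpha with alpha^2 = y (mod n); the verifier never sees it."""
    # Prover commits to a fresh random square t = r^2 mod n.
    while True:
        r = random.randrange(2, n)
        if gcd(r, n) == 1:
            break
    t = (r * r) % n
    b = random.randrange(2)             # verifier's challenge bit
    z = (r * pow(alpha, b, n)) % n      # response: r (b=0) or r*alpha (b=1)
    # Verifier's check: z^2 == t * y^b (mod n).
    return (z * z) % n == (t * pow(y, b, n)) % n

n, alpha = 35, 12
y = (alpha * alpha) % n                 # y = 4 is a square mod 35
assert all(zk_square_round(n, y, alpha) for _ in range(50))
```

The response z is either a uniform unit or a uniform unit times \alpha, so each round reveals nothing about \alpha itself.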

Daniel Wichs discussed two problems he called “multi-key” and “spooky” fully-homomorphic encryption (FHE). The idea in multi-key FHE is that you have N users who encrypt values \{ x_i : i \in [N] \} with their public key and upload them to a server. Someone with access to the server wants to be able to decode only a function f(x_1, x_2, \ldots, x_N) using the combined private keys of all the users. In “spooky” FHE, you have N decoders, each with one of the private keys, but they want to decode values \{y_i : i \in [N]\} which are functions of all of the encoded data. A simple example of this is when y_1 \oplus y_2 = x_1 \wedge x_2: that is, the XOR of the outputs is equal to the AND of the inputs. This generalizes to the XOR of multiple outputs being some function of the inputs, something he called additive function sharing. He then presented schemes for these two problems based on “learning with errors” (LWE), following the scheme of Gentry, Sahai, and Waters, which I would apparently have to read to really understand the scheme. It’s some sort of linear algebra thing over \mathbb{Z}_q. Perhaps there are some connections to linear block codes or network coding to be exploited here.
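To make the “spooky” condition concrete, here is a tiny sketch (mine, and emphatically not the LWE construction itself) of the target correlation the two decoders must end up with: each output bit looks uniform on its own, but the XOR of the outputs equals the AND of the inputs.

```python
import random

def spooky_target(x1, x2):
    """The *target* output correlation for two-party spooky FHE on bits:
    y1 and y2 are individually uniform, but y1 XOR y2 = x1 AND x2."""
    y1 = random.randrange(2)    # decoder 1's output: a uniform bit
    y2 = (x1 & x2) ^ y1         # decoder 2's output completes the share
    return y1, y2

for x1 in (0, 1):
    for x2 in (0, 1):
        y1, y2 = spooky_target(x1, x2)
        assert y1 ^ y2 == (x1 & x2)
```

The hard part, of course, is producing this correlation when each decoder only holds its own private key and ciphertexts of both inputs, which is what the LWE machinery is for.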

Readings

Thinking, Fast and Slow (Daniel Kahneman). This was recommended by Vivek Goyal, and is Kahneman’s popular nonfiction book about the psychology of decision making in humans (as opposed to rational-decision making models like those in economics). The System 1/System 2 model was new to me, even though the various biases and heuristics that he describes were things I had heard about in different contexts. While quite interesting and a book that anyone who works on decision making should read (I’m looking at you, statisticians, machine learners, and systems-EE folks), it’s a bit too long, I think. I found it hard to power through at the end, which is where he gets into prospect theory, a topic which my colleague Narayan Mandayam is trying to apply in wireless systems.

Men Explain Things To Me (Rebecca Solnit). A slim volume collecting several of Solnit’s essays on feminism and its discontents, from the last few years. I was familiar with some of the essays (including the first one) but was surprised by her ultimately hopeful tone (many of the essays come with introductions describing their context and how she feels about them now). Highly recommended, but I don’t think it will help with any Arguments On The Internet.

The Idea of India (Sunil Khilnani). This book is a bit older now but provides a lot of crucial context about the early Indian state, the relationship between urbanism and social change, and the nature of electoral politics in India. Reading this gave me a more nuanced view of the complexity of contemporary Indian politics, or at least a more nuanced view of how we got here (beyond the usual history of communalism). The origins of the cronyism of Congress and the causes and effects of the Emergency were also a new perspective for me.

The Sympathizer (Viet Thanh Nguyen). This is about a Vietnamese (well, half-Vietnamese, as people keep pointing out) undercover agent who leaves during the evacuation of Saigon and embeds himself in the refugee community, sending coded messages about counter-revolutionary plans. Our unnamed narrator has an epic adventure, darkly comic and tragic, initially told as a confessional in some sort of prison interrogation. He was educated in the US before going back to Vietnam — this puts him between two worlds, and the novel is fundamentally about this tension. Throughout, people are archetypes: The General, The Auteur, the crapulent major. Although long, the novel is rewarding: the last quarter really put me through the wringer, emotionally.

Station Eleven (Emily St. John Mandel). A novel about a post-apocalyptic future (split between pre-, slightly post-, and long post-apocalypse) in which much of the world has been decimated by a mysterious infection. The novel revolves around a series of connected characters: an actor who dies on stage in a production of King Lear, his ex-wife, who wrote a series of comics about a remote station, a child actor from the same production who survives to become part of a traveling theater company in the post-apocalyptic wasteland that was once Michigan, an audience member who was once a paparazzo following the actor. The whole novel has a haunting air to it, a bit of a dreamy sensibility that makes it easy to read (too) quickly. The connections between the characters were not surprising when they were revealed, but they didn’t need to be — the book doesn’t rely on that kind of gimmickry. Read it while traveling: you won’t look at airports the same way again.

UCSD Data Science Postdocs

A bit of a delayed posting due to pre-spring break crunch time, but my inimitable collaborator and ex-colleague Kamalika Chaudhuri passed along the following announcement.

I write with the exciting news that UCSD has up to four postdoctoral fellowship openings in data science and machine learning.

The fellowships will prepare outstanding researchers for academic careers. The fellows will be affiliated with the CSE or ECE Departments, will enjoy broad freedom to work with any of our faculty, will be allocated a research budget, and will teach one class per year.

If you know anyone who might be interested, please encourage them to apply!

The program is co-sponsored by UCSD’s CSE and ECE departments, the Interdisciplinary Qualcomm Institute, and the Information Theory and Applications Center.

More information is available at the UCSD Data Science site. Review begins March 21, so get your applications in!

Readings

The Buddha in the Attic (Julie Otsuka). This is a beautifully written short book, told as the collective experiences of Japanese picture brides from the 19th century to the present. It’s one of those books you have to read all at once or in a short period of time in concentrated bursts. Each chapter is a different era and a different set of experiences. It might make you want to read more.

Dancing Lessons for the Advanced in Age (Bohumil Hrabal). Another stream-of-consciousness work: Hrabal’s novella is from the perspective of an old man, a “palaverer” in the language of the introductory essay, addressing a group of young “beauties.” The narrator is unreliable: he tells stories that are shocking and backtracks to make himself seem better; he digresses into rants and waxes poetic about the “beauties” he has seen in his life. The introductory essay does a good job of situating the text and explaining Hrabal, about whom I knew nothing.

Ancillary Justice (Ann Leckie). This is the first in a trilogy which I swear I will take my time reading so I can enjoy them more. It’s impossible not to compare the world-building here to a point of reference like Banks’s Culture novels, but there’s quite a bit that’s different here. The main conceit is that ship-level AIs can spin off “ancillaries” into other bodies to act as surrogates. Leckie’s insistence on the default “she” for a genderless society angers whiny MRA-type sad puppies, but I didn’t find that it played a central role in the story, although it made me aware of my default assumptions and “desire to know” characters’ genders. I’m looking forward to the next two!

Between The World And Me (Ta-Nehisi Coates). Written as a letter to his son, Coates tries to work out the meaning of his college friend Prince Jones’s murder at the hands of police. The book stakes out a strong critical position on America as an enterprise, but, as Michelle Alexander’s review puts it, feels unfinished. It is, however, necessary reading.

Cannery Row (John Steinbeck). A classic that I never read until now! Steinbeck’s prose feels dated and his way of describing people like Lee Chong made my skin crawl at times, but that man could write a sentence. The book surprised me.

LabTV Profiles Are Up!

And now, a little pre-ITA self-promotion. As I wrote earlier, LabTV interviewed me and a subset of the students in the lab last semester (it was opt-in). This opportunity came out of my small part in a large-scale collaboration organized by the Mind Research Network (PI: Vince Calhoun) on trying to implement distributed and differentially private algorithms in a system to enable collaborative neuroscience research. Our lab profiles are now up! They interviewed me, graduate students Hafiz Imtiaz, Sijie Xiong, and Liyang Xie, and an undergraduate student, Kevin Sun. In watching them, I found that I learned a few new things about my students…

Signal boost: IBM Social Good Fellowship for data science

This announcement came via Kush Varshney. IBM is launching a new fellowship program. This came out of his work on DataKind and Saška Mojsilović’s work on Ebola. It’s open to students and postdocs!

I am pleased to let you know that Saška Mojsilović and I are launching a new fellowship program at IBM Research related to data science for social good. We are offering both 3-month summer fellowships for PhD students and full-year fellowships for postdocs. The fellowship webpage and link to apply may be found here.

Fellows will come work with research staff members at our Yorktown Heights laboratory to complete projects in partnership with NGOs, social enterprises, government agencies, or other mission-driven organizations that have large social impact. We are currently in the process of scoping projects across various areas, such as health, sustainability, poverty, hunger, equality, and disaster management. The program is intended to allow students to develop their technical skills and produce publishable work while making a positive impact on the world.

I request that you spread the word to students in your respective departments and the broader community.

Call for Papers: T-SIPN Special Issue on Inference and Learning Over Networks

IEEE Signal Processing Society
IEEE Transactions on Signal and Information Processing over Networks
Special Issue on Inference and Learning Over Networks

Networks are everywhere. They surround us at different levels and scales, whether we are dealing with communications networks, power grids, biological colonies, social networks, sensor networks, or distributed Big Data depositories. Therefore, it is not hard to appreciate the ongoing and steady progression of network science, a prolific research field spreading across many theoretical as well as applicative domains. Regardless of the particular context, the very essence of a network resides in the interaction among its individual constituents, and Nature itself offers beautiful paradigms thereof. Many biological networks and animal groups owe their sophistication to fairly structured patterns of cooperation, which are vital to their successful operation. While each individual agent is not capable of sophisticated behavior on its own, the combined interplay among simpler units and the distributed processing of dispersed pieces of information, enable the agents to solve complex tasks and enhance dramatically their performance. Self-organization, cooperation and adaptation emerge as the essential, combined attributes of a network tasked with distributed information processing, optimization, and inference. Such a network is conveniently described as an ensemble of spatially dispersed (possibly moving) agents, linked together through a (possibly time-varying) connection topology. The agents are allowed to interact locally and to perform in-network processing, in order to accomplish the assigned inferential task. Correspondingly, several problems such as, e.g., network intrusion, community detection, and disease outbreak inference, can be conveniently described by signals on graphs, where the graph typically accounts for the topology of the underlying space and we obtain multivariate observations associated with nodes/edges of the graph.
The goal in these problems is to identify/infer/learn patterns of interest, including anomalies, outliers, and existence of latent communities. Unveiling the fundamental principles that govern distributed inference and learning over networks has been the common scope across a variety of disciplines, such as signal processing, machine learning, optimization, control, statistics, physics, economics, biology, computer, and social sciences. In the realm of signal processing, many new challenges have emerged, which stimulate research efforts toward delivering the theories and algorithms necessary to (a) designing networks with sophisticated inferential and learning abilities; (b) promoting truly distributed implementations, endowed with real-time adaptation abilities, needed to face the dynamical scenarios wherein real-world networks operate; and (c) discovering and disclosing significant relationships possibly hidden in the data collected from across networked systems and entities. This call for papers therefore encourages submissions from a broad range of experts that study such fundamental questions, including but not limited to:

  • Adaptation and learning over networks.
  • Consensus strategies; diffusion strategies.
  • Distributed detection, estimation and filtering over networks.
  • Distributed dictionary learning.
  • Distributed game-theoretic learning.
  • Distributed machine learning; online learning.
  • Distributed optimization; stochastic approximation.
  • Distributed proximal techniques, sub-gradient techniques.
  • Learning over graphs; network tomography.
  • Multi-agent coordination and processing over networks.
  • Signal processing for biological, economic, and social networks.
  • Signal processing over graphs.

Prospective authors should visit http://www.signalprocessingsociety.org/publications/periodicals/tsipn/ for information on paper submission. Manuscripts should be submitted via Manuscript Central at http://mc.manuscriptcentral.com/tsipn-ieee.

Important Dates:

  • Manuscript submission: February 1, 2016
  • First review completed: April 1, 2016
  • Revised manuscript due: May 15, 2016
  • Second review completed: July 15, 2016
  • Final manuscript due: September 15, 2016
  • Publication: December 1, 2016

Guest Editors:


Randomized response, differential privacy, and the elusive biased coin

In giving talks to broader audiences about differential privacy, I’ve learned quickly (thanks to watching talks by other experts) that discussing randomized response first is an easy way to explain the kind of “plausible deniability” guarantee that differentially private algorithms give to individuals. In randomized response, the setup is that of local privacy: the simplest model is that a population of n individuals with data x_1, x_2, \ldots, x_n \in \{0,1\} representing some sensitive quantity are to be surveyed by an untrusted statistician. Concretely, suppose that the individual bits represent whether the person is a drug user or not. The statistician/surveyor wants to know the fraction p = \frac{1}{n} \sum x_i of users in the population. However, individuals don’t trust the surveyor. What to do?

The surveyor can give the individuals a biased coin that comes up heads with probability q < 1/2. The individual flips the coin in private. If it comes up heads, they lie and report y_i = 1 - x_i. If it comes up tails, they tell the truth y_i = x_i. The surveyor doesn’t see the outcome of the coin, but can compute the average of the \{y_i\}. What is the expected value of this average?

\mathbb{E}\left[ \frac{1}{n} \sum_{i=1}^{n} y_i \right] = \frac{1}{n} \sum_{i=1}^{n} (q (1 - x_i) + (1 -q) x_i) = q + (1 - 2q) p.

So we can invert this to solve for p: if we have a reported average \bar{y} = \frac{1}{n} \sum y_i then estimate p by

\hat{p} = \frac{\bar{y} - q}{ 1 - 2 q }.
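The estimator above is easy to sanity-check with a quick simulation (variable names are my own):

```python
import random

def randomized_response(x, q):
    """Flip a coin with heads probability q; lie on heads, tell the truth on tails."""
    return 1 - x if random.random() < q else x

def estimate_p(responses, q):
    """Invert the bias: p_hat = (y_bar - q) / (1 - 2q)."""
    y_bar = sum(responses) / len(responses)
    return (y_bar - q) / (1 - 2 * q)

random.seed(0)
n, true_p, q = 100_000, 0.3, 0.25
data = [1 if random.random() < true_p else 0 for _ in range(n)]
p_hat = estimate_p([randomized_response(x, q) for x in data], q)
assert abs(p_hat - true_p) < 0.02  # unbiased, up to sampling noise
```

Note that the 1/(1 - 2q) factor inflates the sampling noise, so more privacy (q closer to 1/2) means a noisier estimate for the same n.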

What does this have to do with differential privacy? Each individual got to potentially lie about their drug habits. So if we look at the hypothesis test for a surveyor trying to figure out if someone is a user from their response, we get the likelihood ratio

\frac{ \mathbb{P}( y_i = 1 | x_i = 1 ) }{ \mathbb{P}( y_i = 1 | x_i = 0 ) } = \frac{1 - q}{q}

If we set \epsilon = \log \frac{1 - q}{q}, we can see that the protocol guarantees differential privacy. This gives a possibly friendlier interpretation of \epsilon in terms of the “lying probability” q. We can plot this:

Epsilon versus lying probability
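The conversion plotted above is simple enough to write down both ways (a small sketch of my own):

```python
from math import log, exp

def epsilon_from_q(q):
    """Privacy level of randomized response with lying probability q in (0, 1/2)."""
    return log((1 - q) / q)

def q_from_epsilon(eps):
    """The lying probability needed to guarantee a given epsilon."""
    return 1 / (1 + exp(eps))

# Even a modest epsilon = 1 requires lying about 27% of the time,
# and epsilon = 0.1 pushes the coin toward being nearly fair.
assert 0.26 < q_from_epsilon(1.0) < 0.27
assert 0.47 < q_from_epsilon(0.1) < 0.48
```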

This is a bit pessimistic — it says that to guarantee reasonable “lying probability” we need \epsilon \ll 1, but in practice this turns out to be quite difficult. Why so pessimistic? The differential privacy threat model is pretty pessimistic — it’s your plausible deniability given that everyone else in the data set has revealed their data to the surveyor “in the clear.” This is the fundamental tension in thinking about the practical implications of differential privacy — we don’t want to make conditional guarantees (“as long as everyone else is secret too”) but the price of an unconditional guarantee can be high in the worst case.

So how does randomized response work in practice? It seems we would need a biased coin. Maybe one can custom order them from Alibaba? Turns out, the answer is not really. Gelman and Nolan have an article about getting students to try and evaluate the bias of a coin — the physics of flipping would seem to dictate that coins are basically fair. You can load dice, but not coins. I recommend reading through the article — it sounds like a fun activity, even for graduate students. Maybe I’ll try it in my Detection and Estimation course next semester.

Despite the widespread prevalence of “flipping a biased coin” as a construction in probability, randomized algorithms, and information theory, a surprisingly large number of people I have met are completely unaware of the unicorn-like nature of biased coins in the real world. I guess we really are in an ivory tower, eh?

Rutgers ECE GAANN Fellowships for Graduate Students

In case there are any potential grad school applicants to Rutgers who read this blog, we recently were awarded a GAANN award to help fund some graduate fellowships for US citizens or permanent residents interested in bioelectrical engineering (somewhat broadly construed). Application review will start soon, so if you’re interested in this opportunity, read on.

The Rutgers ECE Department is proud to announce the Graduate Assistance in Areas of National Need (GAANN) Fellowship. The GAANN Fellowship program provides need-based financial support to Ph.D. students pursuing a degree in areas related to bioelectrical engineering at the Department of Electrical and Computer Engineering, Rutgers University. Each GAANN Fellow receives a stipend to cover the Fellow’s financial need. A typical stipend is $34,000 per year for up to 5 years, subject to satisfactory performance. ECE is pleased to announce 5 GAANN Fellowships. Minority students, women and other underrepresented groups are particularly encouraged to apply.

Applicants must:

  • Be U.S. citizens or permanent residents
  • Have a GPA of 3.5/4.0 or higher
  • Plan to pursue a Ph.D. degree in Electrical and Computer Engineering at Rutgers University
  • Have financial need
  • Demonstrate excellent academic performance
  • Submit an application and supporting documents

Deadline: To apply, please email the application and supporting documents to Arletta Hoscilowicz AS SOON AS POSSIBLE.

Effective early anti-plagiarism interventions for (mostly international) Masters students

My department at Rutgers, like many engineering departments across the country, has a somewhat sizable Master’s program, mostly because it “makes money” for the department [1]. The vast majority of the students in the program are international students, many of whom have English as a second or third language, and whose undergraduate instruction was not necessarily in English. As a consequence, they face considerable challenges in writing in general, and academic writing in particular. Faced with the prospect of writing an introduction to a project report and wanting to sound impressive or sophisticated, many seem tempted into copying sentences or even paragraphs from references without citation. This is, of course, plagiarism, and what distresses me and many colleagues is that the students often don’t understand what they did wrong or how to write appropriately in an academic setting. Is this because most non-American universities don’t teach about referencing, citation, and plagiarism? I hesitate to lay the blame elsewhere — it’s hard (initially) to write formally in a foreign language. However, the students I have met say things like “oh, I thought you didn’t need to reference tutorials,” so there is definitely an element of ill-preparedness. Adding to this of course is that students are stressed, find it expedient, and hope that nobody will notice.

Most undergrad programs in the US have some sort of composition requirement, and at least at my high school, we learned basic MLA citation rules as part of senior-year English. However, without assuming this background/pre-req, what can we do? My colleague Waheed Bajwa was asking if there are additional resources out there to help students learn about plagiarism before they turn in their assignments. Of course we put links to resources in syllabi, but as we all know, students tend to not read the syllabus, especially what seem like administrative and legalistic things. Academic misconduct is serious and can result in expulsion, but unless you’re a vindictive type, the goal shouldn’t be to have a “one strike and you’re out” policy. I’ve heard someone else suggest that students sign a contract at the beginning of the semester so they are forced to read it. Then, if they are given an automatic F for the class you can point to the policy. That also seems like dodging the underlying issue, pedagogically speaking.

Another strategy I have tried is to have students turn in a draft of a final project, which I then run through TurnItIn [2] or I manually search for copied sentences. I then issue a stern/threatening warning with links to information about plagiarism. Waheed does the same thing, but this is pretty time-intensive and also means that some students get the attention and some don’t. Students who are here for a Masters lack some incentives to do the right thing the first time — if this is the last semester of their program and suddenly this whole plagiarism thing rears its head in their last class, they may be tempted to just fix the issues raised in the draft and move on without really internalizing the ethics. I’m not saying students are unethical. However, part of engineering/academics, especially at the graduate level, is teaching the ethics around citation and attribution. I pointed out to one student that copying from sources without attribution is stealing and that kind of behavior could get them fired at a company, especially if they violate a law. They seemed surprised by this metaphor. That’s just an anecdote, but I find it telling.

The major issues I see are that:

  • Undergrad-focused models for plagiarism education do not seem to address the issue of ESL-writers or the particulars of scientific/engineering writing.
  • Educating short-term graduate students (M.S.) about plagiarism in classes alone results in uneven learning and outcomes.

What we (and I think most programs) really need is an earlier and better educational intervention that helps address the particulars of these programs. I was Googling around for possible solutions and came across a paper by Gunnarsson, Kulesza, and Pettersson on “Teaching International Students How to Avoid Plagiarism: Librarians and Faculty in Collaboration”:

This paper presents how a plagiarism component has been integrated in a Research Methodology course for Engineering Master students at Blekinge Institute of Technology, Sweden. The plagiarism issue was approached from an educational perspective, rather than a punitive. The course director and librarians developed this part of the course in close collaboration. One part of the course is dedicated to how to cite, paraphrase and reference, while another part stresses the legal and ethical aspects of research. Currently, the majority of the students are international, which means there are intercultural and language aspects to consider. In order to evaluate our approach to teaching about plagiarism, we conducted a survey. The results of the survey indicate a need for education on how to cite and reference properly in order to avoid plagiarism, a result which is also supported by students’ assignment results. Some suggestions are given for future development of the course.

This seems to be exactly the kind of thing we need. The premises of the paper are exactly as we experience in the US: reasons for plagiarism are complex, and most students plagiarize “unintentionally” in the sense that the balance between ethics and expediency is fraught. One issue the authors raise is that “views of the concept of plagiarism… may vary greatly among students from one country” so we must be “cautious about making assumptions based on students’ cultural background.” When I’ve talked to professional colleagues (in my field and in other technical fields) I often hear statements like “students from country X don’t understand plagiarism” — we have to be careful about generalizations!

The key aspect of the above intervention is partnering with librarians, who are the experts in teaching these concepts, as part of a research methods course. Many humanities programs offer field-specific research methods courses. These provide important training for academic work. We can do the same in engineering, but it would require more effort and resources. For those readers interested in the ESL issues, there are a lot of studies in the references that describe the multifaceted aspects of plagiarism, especially among international students. A major component of the authors’ proposed intervention is the Refero tutorial, which is a web course for students to take as part of the course. We can’t delegate plagiarism education entirely to a web tutorial, but we have to start somewhere. Another resource I found was this large collection of tutorials collected by Macie Hall from Johns Hopkins, but these are focused more at US undergraduates.

Does your institution have a good anti-plagiarism orientation unit? Does it work? When and how do you provide this orientation?

[1] There is much ink to be spilled debating this claim.
[2] I have many mixed feelings about the ethics of TurnItIn, especially after discussions with others.