# IHP “Nexus” Workshop on Privacy and Security: Day 2

Verrrrrry belated blogging on the rest of the workshop, more than a month later. Day 2 had 5 talks instead of the tutorial plus talks, and the topics were a bit more varied (this was partly because of scheduling issues that prevented us from being strictly thematic).

Amos Beimel started out with a talk on secret sharing, which had a very nice tutorial/introduction to the problem, including the connection between Reed-Solomon codes and Shamir’s t-out-of-n scheme. For professional (and perhaps personal) reasons I found myself wondering how much more the connection between secret sharing and coding theory was — after all, this was a workshop about communication between information theory and theoretical CS. Not being a coding theory expert myself, I could only speculate. What I didn’t know about was the more general secret sharing structures and the results of Ito-Saito-Nishizeki scheme (published in Globecom!). Amos also talked about monotone span programs, which were new to me, and how to prove lower bounds. He concluded with more recent work on the related distribution design problem: how can we construct a distribution on n variables given constraints that specify subsets which should have identical marginals and subsets which should have disjoint support? The results appeared in ICTS.

Ye Wang talked about his work on common information and how it appears in privacy and security problems from an information theoretic perspective. In particular he talked about secure sampling, multiparty computation, and data release problems. The MPC and sampling results were pretty technical in terms of notions of completeness of primitives (conditional distributions) and triviality (a way of categorizing sources). For the data release problem he focused on problems where a sanitizer has access to a pair $(X,Y)$ where $X$ is private and $Y$ is “useful” — the goal is to produce a version of the data which reveals less about $X$ (privacy) and more about $Y$ (utility). Since they are correlated, there is a tension. The question he addressed is when having access to Y alone as as good as both X and Y.

Manoj, after giving his part of the tutorial (and covering for Vinod), gave his own talk on what he called “cryptographic complexity,” which is an analogy to computational complexity, but for multiparty functions. This was also a talk about definitions and reductions: if you can build a protocol for securely computing $f(\cdot)$ using a protocol for $g(\cdot)$, then $f(\cdot)$ reduces to $g(\cdot)$. A complete function is one for which everything reduces to it, and a trivial function reduces to everything. So with the concepts you can start to classify and partition out functions like characterizing all complete functions for 2 parties, or finding trivial functions under different security notions. He presented some weird facts, like an $n$ bit XOR doesn’t reduce to an $(n-1)$ bit XOR. It was a pretty interesting talk, and I learned quite a bit!

Elette Boyle gave a great talk on Oblivious RAM, a topic about which I was completely oblivious myself. The basic idea in oblivious RAM is (as I understood it) that an adversary can observe the accesses to a RAM and therefore infer what program is being executed (and the input). To obfuscate that, you introduce a bunch of spurious accesses. So if you have a program $\latex \Pi$ whose access pattern is fixed prior to execution, you can randomize the accesses and gain some security. The overhead is the ratio of the total accesses to the required accesses. After this introduction to the problem, she talked about lower bounds on the overhead (e.g. you need this much overhead) for a case where you have parallel processing. I admit that I didn’t quite understand the arguments, but the problem was pretty interesting.

Hoeteck Wee gave the last (but quite energetic) talk of the afternoon, on what he called “functional encryption.” The ideas is that Alice has $(x,M)$ and Bob has $y$. They both send messages to a third party, Charlie. There is a 0-1 function (predicate) $P(x,y)$ such that if $P(x,y) = 1$ then Charlie can decode the message $M$. Otherwise, they cannot. An example would be the predicate $P(x,y) = \mathbf{1}(x = y)$. In this case, Alice can send $h(x) \oplus M$ and Bob can send $h(y)$ for some 2-wise independent hash function, and then Charlie can recover $M$ if the hashes match. I think there is a question in this scheme about whether Charlie needs to know that they got the right message, but I guess I can read the paper for that. The kinds of questions they want to ask are what kinds of predicates have nice encoding schemes? What is the size of message that Alice and Bob have to send? He made a connection/reduction to a communication complexity problem to get a bound on the message sizes in terms of the communication complexity of computing the predicate $P$. It really was a very nice talk and pretty understandable even with my own limited background.

# Call for Papers: T-SIPN Special Issue on Distributed Information Processing in Social Networks

IEEE Signal Processing Society
IEEE Transactions on Signal and Information Processing over Networks
Special Issue on Distributed Information Processing in Social Networks

Over the past few decades, online social networks such as Facebook and Twitter have significantly changed the way people communicate and share information with each other. The opinion and behavior of each individual are heavily influenced through interacting with others. These local interactions lead to many interesting collective phenomena such as herding, consensus, and rumor spreading. At the same time, there is always the danger of mob mentality of following crowds, celebrities, or gurus who might provide misleading or even malicious information. Many efforts have been devoted to investigating the collective behavior in the context of various network topologies and the robustness of social networks in the presence of malicious threats. On the other hand, activities in social networks (clicks, searches, transactions, posts, and tweets) generate a massive amount of decentralized data, which is not only big in size but also complex in terms of its structure. Processing these data requires significant advances in accurate mathematical modeling and computationally efficient algorithm design. Many modern technological systems such as wireless sensor and robot networks are virtually the same as social networks in the sense that the nodes in both networks carry disparate information and communicate with constraints. Thus, investigating social networks will bring insightful principles on the system and algorithmic designs of many engineering networks. An example of such is the implementation of consensus algorithms for coordination and control in robot networks. Additionally, more and more research projects nowadays are data-driven. Social networks are natural sources of massive and diverse big data, which present unique opportunities and challenges to further develop theoretical data processing toolsets and investigate novel applications. This special issue aims to focus on addressing distributed information (signal, data, etc.) processing problems in social networks and also invites submissions from all other related disciplines to present comprehensive and diverse perspectives. Topics of interest include, but are not limited to:

• Dynamic social networks: time varying network topology, edge weights, etc.
• Social learning, distributed decision-making, estimation, and filtering
• Consensus and coordination in multi-agent networks
• Modeling and inference for information diffusion and rumor spreading
• Multi-layered social networks where social interactions take place at different scales or modalities
• Resource allocation, optimization, and control in multi-agent networks
• Modeling and strategic considerations for malicious behavior in networks
• Social media computing and networking
• Data mining, machine learning, and statistical inference frameworks and algorithms for handling big data from social networks
• Data-driven applications: attribution models for marketing and advertising, trend prediction, recommendation systems, crowdsourcing, etc.
• Other topics associated with social networks: graphical modeling, trust, privacy, engineering applications, etc.

Important Dates:

Manuscript submission due: September 15, 2016
First review completed: November 1, 2016
Revised manuscript due: December 15, 2016
Second review completed: February 1, 2017
Final manuscript due: March 15, 2017
Publication: June 1, 2017

Guest Editors:

Zhenliang Zhang, Qualcomm Corporate R&D (zhenlian@qti.qualcomm.com)
Wee Peng Tay, Nanyang Technological University (wptay@ntu.edu.sg)
Moez Draief, Imperial College London (m.draief@imperial.ac.uk)
Xiaodong Wang, Columbia University (xw2008@columbia.edu)
Edwin K. P. Chong, Colorado State University (edwin.chong@colostate.edu)
Alfred O. Hero III, University of Michigan (hero@eecs.umich.edu)

# Mathematical Tools of Information-Theoretic Security Workshop: Days 2-3

I took sketchier notes as the workshop progressed, partly due to the ICASSP deadline, but also because jet lag started to hit me. The second day was a half day, which started with Zhenjie Zhang giving a tutorial on differential privacy from a databases/data mining perspective and my talk on more machine learning aspects. In between us was a talk by Ben Smyth on building automatic verification for security protocols. Basically you write the protocol as a program and then the ProVerif verifier will go and try to break your protocol. As an example, it can automatically find/generate a man-in-the-middle attack if one exists. I thought it was pretty neat, especially after having recently talked to someone about automatic proof systems. It’s based on something called the applied pi calculus, which I did not understand at all, but hey, I learned something new, which was great. The last two talks of the day were by Lalitha Sankar and Mari Kobayashi. Lalitha talked about mutual information based measures of privacy leakage in an interactive communication setting that is the information-theoretic analogue of communication complexity models in CS. Mari talked about the broadcast channel with state feedback. This is trying to find secure analogues of these opportunistic multicast settings where you need to also generate a secret key.

The last day was on quantum! I learned a lot and took few notes, unfortunately. Andreas Winter gave a tutorial on quantum (the slides for most talks are online and his are as well) and Ciara Morgan discussed the challenges in proving a strong converse for the the capacity of quantum channels. Damian Markham talked about secret sharing in quantum systems. Masahito Hayashi gave a very densely-packed talk surveying a large number of results based on secure randomness extraction and hash functions using Rényi information measures. I think privacy amplification is really interesting but I think I need a tutorial on it before I can really get the research results. The last non-overview talk I have notes on was by David Elkouss (apologies to the remaining speakers): this was a really interesting presentation on how to decide which of two channels is better from a quantum communication sense. The slides are a little engimatic, but the papers are online.

Shlomo Shamai made it to the last day of the workshop (the intersection with High Holidays was unfortunate) — he talked about the layered secrecy view of the broadcast channel: rather than thinking only of the secret message as carrying information, one can think of certain layers (c.f. superposition coding) as being secured based on the channel to the non-legitimate receiver. For example, in a degraded broadcast channel, the strong receiver’s message can sometimes be thought of as secret from the weak receiver. This leads to a raft of models and setups based on who wants to keep what secret from whom, shedding some light on standard superposition, rate splitting, binning, and embedding constructions. The talk was largely based on a paper in the current issues of the Proceedings of the IEEE.

All in all, this was a really great workshop, and the organizers were very generous in the organization.

# Postdoc in privacy and security at Imperial College London

Denis Gündüz is looking for a postdoctoral researcher in the areas of privacy and security in cyber-physical systems, particularly for smart metering applications in smart grids. The position is in the Intelligent Systems and Networks Group within the Electrical and Electronic Engineering Department of Imperial College London.

Previous research experience and a strong track record in information theory, signal processing, and/or optimisation theory is required. This position will be supported through an international project, and will provide an excellent opportunity to work within an interdisciplinary team spanning top European institutions: Imperial College London, KTH, ETHZ and INRIA.
The position is available immediately for one year, with a potential to be extended another year depending on candidate’s performance.

Contact Dr. Gündüz directly if interested.

# Re-identification from microbiomes

A (now not-so-recent) paper by Homer et al. made a splash by showing that one could take a DNA sample from a person and detect whether they were part of the Human Genome Project (HGP) based on looking at the SNP variations from that individual together with the reported allele variations in the HGP data. More recently, a paper in PNAS by Franzosa et al. showed reidentification of individuals in the Human Microbiome Project.

Color me unsurprised. Given the richness of the data, from a purely informational point of view it seems pretty clear that people should be identifiable. As with many machine learning problems, however, the secret is in the feature encoding. Many approaches to comparing metagenomes, especially for bacterial ecologies, try to assess the variability in the population of bacteria, perhaps through mapping the to known strains. As mentioned in the Methods section, “reads were additionally mapped to a database of 649 microbial reference genomes using the Burrows-Wheeler aligner.” However, in addition to these mapping statistics, they used a few other more complicated features to help gain some additional robustness in their identification procedure.

Somehow being able to be identified by your microbiome seems less scary than being able to be identified by your genome, perhaps because we have a sense that genes are more “determining” than microbiomes. After all, you could get a fecal transplant and change your gut flora significantly. Is it the same as burning off your fingerprints? Probably not. But perhaps in the future, perpetrators of certain campus shenanigans may be easier to catch.

# Signal boost: Postdoc in Privacy at Penn State

Sofya Raskhodnikova and Adam Smith are looking to fill a postdoc position at Penn State for a multi-year project on privacy, streaming and learning.

Qualifications: Ph.D., with expertise in the theoretical foundations of at least one of the research areas (algorithms, machine learning and statistics, data privacy). Willingness to work on a cross-disciplinary project.

Duration and compensation: At least one year, renewable. Start date is
negotiable, though we slightly prefer candidates starting fall 2015. Salary is competitive.

Applicants should email a CV, short research statement and list of references directly to the project leaders ({asmith,sofya}@cse.psu.edu) with “postdoc” in the subject line.

Location: The university is located in the beautiful college town of
State College in the center of Pennsylvania. The State College area has 130,000 inhabitants and offers a wide variety of cultural and outdoor recreational activities. The university offers outstanding events from collegiate sporting events to fine arts productions. Many major population centers on the east coast (New York, Philadelphia, Pittsburgh, Washington D.C., Baltimore) are only a few hours’ drive away and convenient air services to several major hubs are operated by three major airlines out of State College.

Penn State is an equal opportunity employer. We encourage applications from underrepresented minorities.

# ITA 2015: quick takes

Better late than never, I suppose. A few weeks ago I escaped the cold of New Jersey to my old haunts of San Diego. Although La Jolla was always a bit fancy for my taste, it’s hard to beat a conference which boasts views like this:

A view from the sessions at ITA 2015

I’ll just recap a few of the talks that I remember from my notes — I didn’t really take notes during the plenaries so I don’t have much to say about them. Mostly this was due to laziness, but finding the time to blog has been challenging in this last year, so I think I have to pick my battles. Here’s a smattering consisting of

$\{ \mathrm{talks\ attended} \} \cap \{ \mathrm{talks\ with\ understandable\ notes} \}$

(Information theory)
Emina Soljanin talked about designing codes that are good for fast access to the data in distributed storage. Initial work focused on how to repair codes under disk failures. She looked at how easy it is to retrieve the information afterwords to guarantee some QoS for the storage system. Adam Kalai talked about designing compression schemes that work for an “audience” of decoders. The decoders have different priors on the set of elements/messages so the idea is to design an encoder that works for this ensemble of decoders. I kind of missed the first part of the talk so I wasn’t quite sure how this relates to classical work in mismatched decoding as done in the information theory world. Gireeja Ranade gave a great talk about defining notions of capacity/rate need to control a system which as multiplicative uncertainty. That is, $x[n+1] = x[n] + B[n] u[n]$ where $B[n]$ has the uncertainty. She gave a couple of different notions of capacity, relating to the ratio $| x[n]/x[0] |$ — either the expected value of the square or the log, appropriately normalized. She used a “deterministic model” to give an explanation of how control in this setting is kind of like controlling the number of significant bits in the state: uncertainty increases this and you need a certain “amount” of control to cancel that growth.

(Learning and statistics)
I learned about active regression approaches from Sivan Sabato that provably work better than passive learning. The idea there is do to use a partition of the X space and then do piecewise constant approximations to a weight function that they use in a rejection sampler. The rejection sampler (which I thought of as sort of doing importance sampling to make sure they cover the space) helps limit the number of labels requested by the algorithm. Somehow I had never met Raj Rao Nadakuditi until now, and I wish I had gotten a chance to talk to him further. He gave a nice talk on robust PCA, and in particular how outliers “break” regular PCA. He proposed a combination of shrinkage and truncation to help make PCA a bit more stable/robust. Laura Balzano talked about “estimating subspace projections from incomplete data.” She proposed an iterative algorithm for doing estimation on the Grassmann manifold that can do subspace tracking. Constantine Caramanis talked about a convex formulation for mixed regression that gives a guaranteed solution, along with minimax sample complexity bounds showing that it is basically optimal. Yingbin Liang talked about testing approaches for understanding if there is an “anomalous structure” in a sequence of data. Basically for a sequence $Y_1, Y_2, \ldots, Y_n$, the null hypothesis is that they are all i.i.d. $\sim p$ and the (composite) alternative is that there an interval of indices which are $\sim q$ instead. She proposed a RKHS-based discrepancy measure and a threshold test on this measure. Pradeep Ravikumar talked about a “simple” estimator that was a “fix” for ordinary least squares with some soft thresholding. He showed consistency for linear regression in several senses, competitive with LASSO in some settings. Pretty neat, all said, although he also claimed that least squares was “something you all know from high school” — I went to a pretty good high school, and I don’t think we did least squares! Sanmi Koyejo talked about a Bayesian devision theory approach to variable selection that involved minimizing some KL-divergence. Unfortunately, the resulting optimization ended up being NP-hard (for reasons I can’t remember) and so they use a greedy algorithm that seems to work pretty well.

(Privacy)
Cynthia Dwork gave a tutorial on differential privacy with an emphasis on the recent work involving false discovery rate. In addition to her plenary there were several talks on differential privacy and other privacy measures. Kunal Talwar talked about their improved analysis of the SuLQ method for differentially private PCA. Unfortunately there were two privacy sessions in parallel so I hopped over to see John Duchi talk about definitions of privacy and how definitions based on testing are equivalent to differential privacy. The testing framework makes it easier to prove minimax bounds, though, so it may be a more useful view at times. Nadia Fawaz talked about privacy for time-series data such as smart meter data. She defined different types of attacks in this setting and showed that they correspond to mutual information or directed mutual information, as well as empirical results on a real data set. Raef Bassily studied a estimation problem in the streaming setting where you want to get a histogram of the most frequent items in the stream. They reduce the problem to one of finding a “unique heavy hitter” and develop a protocol that looks sort of like a code for the MAC: they encode bits into a real vector, had noise, and then add those up over the reals. It’s accepted to STOC 2015 and he said the preprint will be up soon.