Secrecy and Security in Distributed Storage

Secure Distributive Storage of Decentralized Source Data: Can Interaction Help?

Salim El Rouayheb, Vinod Prabhakaran and Kannan Ramchandran

Salim presented this work on creating a distributed storage system that also has secrecy. As far as I understood, the model is that there are k distributed sources and they want to create a good erasure code in a network: a data collector who accesses any k out of n storage nodes should be able to recover the data. In addition, one storage node is an eavesdropper who gets access to everything communicated to her. The question is how to create a distributed code that also has secrecy (i.e., the eavesdropper learns nothing beyond what is contained in the nodes she controls) while minimizing the communication in the network.

In the model each storage node has a private key, and these keys can be used to provide the desired secrecy. The nodes could use interactive protocols to exchange keys and so on, but the main result of the paper is that interaction does not reduce the required communication compared to a one-way dissemination scheme.

Security in Distributed Storage Systems by Communicating a Logarithmic Number of Bits

by Theodoros K. Dikaliotis, myself and Tracey Ho,

Ted presented our work on security (error correction) for distributed storage systems. We are looking into the problem of determining whether a storage node in an (n,k) system is storing incorrect equations. These could originate from bit-flipping errors in the storage medium or (more likely) from a malicious node injecting errors during a previous repair in the storage system.

The model is that we assume a trusted verifier that communicates with the nodes and uses the built-in redundancy of the code to find where the errors are located. The first observation is that in most distributed storage systems the same code is used multiple times (across stripes) to reduce the overhead of storing the coefficients of the code. Instead of sending all the packets, the storage nodes can all project onto the same vector r and send only the projections. The error detection works because the projected data follows the same (n,k) code, which can tolerate adversarial errors up to half the minimum distance (under optimal decoding). The problem is that the vector r is very long, of the same order as the stored data. One way around this is to create a pseudorandom r' from a small seed s and project onto that. All the storage nodes then need to share only the small seed s, and if we use pseudorandom generators that fool linear functions we can bound the probability of failure against a computationally unbounded adversary who places the errors. This relies on the pseudorandom generators of Naor & Naor (but for large finite fields).
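To make the projection trick concrete, here is a toy sketch (my own illustration with made-up parameters, not the construction from the paper — in particular, r below is truly random, whereas the paper's point is that a short Naor-Naor-style seed can replace it):

```python
import random
from itertools import combinations

# Toy sketch of checking storage nodes via projections (not the paper's exact
# scheme). An (n, k) Reed-Solomon-style code over a prime field is used
# identically across m stripes; node i stores the evaluations at point i+1.
# Each node reports the projection of its stored column onto a shared random
# vector r; by linearity the n reports must again form a codeword of the same
# code, so a node holding corrupted data sticks out.

P = 2**13 - 1              # prime field GF(P)
n, k, m = 7, 3, 50         # 7 nodes, any 3 recover the data, 50 stripes

def ev(poly, x):
    """Evaluate a polynomial (list of coefficients) at x over GF(P)."""
    return sum(c * pow(x, i, P) for i, c in enumerate(poly)) % P

data = [[random.randrange(P) for _ in range(k)] for _ in range(m)]
storage = [[ev(data[t], i + 1) for t in range(m)] for i in range(n)]
storage[4][7] = (storage[4][7] + 1) % P        # corrupt one symbol at node 4

r = [random.randrange(P) for _ in range(m)]    # shared (pseudo)random vector
proj = [sum(a * b for a, b in zip(r, col)) % P for col in storage]

def interp(pts):
    """Return the evaluator of the unique low-degree poly through pts (GF(P))."""
    def f(x):
        total = 0
        for xj, yj in pts:
            num = den = 1
            for xi, _ in pts:
                if xi != xj:
                    num = num * ((x - xi) % P) % P
                    den = den * ((xj - xi) % P) % P
            total = (total + yj * num * pow(den, P - 2, P)) % P
        return total
    return f

# Brute-force minimum-distance decoding of the length-n projection vector:
# the degree-(k-1) polynomial agreeing with the most positions is the truth.
pts = [(i + 1, proj[i]) for i in range(n)]
agreeing = []
for subset in combinations(pts, k):
    f = interp(list(subset))
    agree = [i for i in range(n) if f(i + 1) == proj[i]]
    if len(agree) > len(agreeing):
        agreeing = agree
print("suspected nodes:", [i for i in range(n) if i not in agreeing])  # [4]
```

The check fails only if the random projection happens to annihilate the error (probability 1/P for a single corrupted symbol), which is the same kind of failure probability the seeded version has to pay an extra \epsilon for.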

This trick applies more generally: assume you can prove that some bad event, involving a linear function of an n-length i.i.d. random vector r, happens with very small probability p_{bad}.

Then, instead of using the random vector, you can use a pseudorandom vector r' that is generated from only \log n random bits and still bound the probability of the bad event by p_{bad} + \epsilon, where \epsilon is the total variation bound obtained from the pseudorandom generator.
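In symbols (my paraphrase of the argument): if r' is produced from a seed of O(\log n) random bits by a generator that \epsilon-fools linear tests over the field, and \mathcal{B} is any bad event determined by linear functions of its input, then

\Pr[\mathcal{B}(r')] \le \Pr[\mathcal{B}(r)] + \epsilon \le p_{bad} + \epsilon.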

On Secure Distributed Data Storage Under Repair Dynamics

by Sameer Pawar, Salim El Rouayheb, Kannan Ramchandran

Sameer presented his work on the repair problem under secrecy constraints. The problem is to construct a distributed storage system that stays secure while repairs are taking place. The model is that an eavesdropper can observe the data arriving at \ell nodes in an (n,k) distributed storage system. The static version of this problem (no repairs) is equivalent to the erasure-erasure wiretap-II channel studied in this paper by Arunkumar and McLaughlin, while this paper generalizes the problem to the natural repair dynamics. As far as I understood, there were two results. One was a general bound on the secrecy capacity for any distributed storage system. There is also a surprising separation result:

For distributed storage operating at the minimum bandwidth point (MBR), secrecy and repair can be separated without loss of optimality. The authors show that the combination of an MDS code used for secrecy (by ‘mixing’ the data with random noise keys) and the exact minimum bandwidth regenerating codes recently invented in this paper by Rashmi et al. for repair achieves the secrecy capacity.
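Here is a tiny sanity check of the ‘mixing’ idea (my own toy example, not the construction in the paper): with an (n,k) = (3,2) Reed-Solomon code over GF(5), one uniform key symbol R mixed with one secret symbol S makes the content of any single node exactly uniform, whatever the secret is.

```python
from collections import Counter

# Toy check of secrecy-by-mixing (my illustration, not the paper's proof):
# encode the pair (key R, secret S) as the polynomial f(x) = R + S*x over
# GF(5), node i storing f(i+1) -- an (n, k) = (3, 2) Reed-Solomon code.
# For every fixed secret S, the symbol at any single node is exactly uniform
# as R varies, so an eavesdropper observing ell = 1 node learns nothing.

P, n = 5, 3
for S in range(P):                    # every possible secret
    for node in range(n):             # every node the eavesdropper might see
        view = Counter((R + S * (node + 1)) % P for R in range(P))
        assert all(count == 1 for count in view.values())   # uniform in R
print("every single-node view is uniform for every secret: ell = 1 secrecy")
```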

Interestingly, the same separation is no longer optimal for other points of the storage-repair tradeoff and as far as I understood the secrecy capacity remains open.

Codes for Distributed Storage: The repair problem

The first two talks in this session

Explicit and Optimal Exact-Regenerating Codes for the Minimum-Bandwidth Point in Distributed Storage by K. V. Rashmi , Nihar B. Shah , P. Vijay Kumar , Kannan Ramchandran

and

A Flexible Class of Regenerating Codes for Distributed Storage by Nihar B. Shah, K. V. Rashmi, P. Vijay Kumar

were given by Vijay Kumar. To summarize the results in these two papers (and also

Exact-Repair MDS Codes for Distributed Storage Using Interference Alignment

by Changho Suh and Kannan Ramchandran

that appeared in another session), it is useful to briefly explain the current landscape of the repair problem:

(For more information see the Distributed storage wiki and this survey paper, which, however, do not yet contain these new results.)

Assume a file to be stored, of size B (sometimes denoted \mathcal{M} in the literature). One can use an (n,k) MDS code for storage with maximum reliability: just cut the file into k packets of size B/k, encode them into n pieces (of the same size \alpha = B/k) and store each piece on a storage node. The reliability achieved is that any k storage nodes contain enough information to recover the file, which is the best possible; so we can tolerate any n-k failures without losing data. (A toy sketch of this layout appears right after the list below.) But most of the time we only have one failure. We then want to repair this failure by creating a new encoded packet while communicating as little information as possible from d surviving storage nodes. There are two cases:

1. If we ask that the new packet still form a good code (i.e. any k out of the new n nodes suffice to recover), the problem is called functional repair and is completely solved: it is equivalent to multicasting. The minimum repair bandwidth can be determined by a cut-set bound (written out below) and we will refer to the achievable region as the cut-set region.

2. If we ask that the new packet be exactly identical to the lost one, the problem is called exact repair and is very difficult. It is equivalent to a multicasting network coding problem where intermediate nodes have requests for subsets of the data, so prior tools from network coding do not suffice. The repair problem seems close to locally decodable codes, but this connection has not been investigated yet.
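Here is the toy sketch of the basic (n,k) layout promised above (all parameters and helper names are mine, for illustration): a (5,3) code over a prime field, where any 3 surviving nodes rebuild the file.

```python
import random

# Toy sketch of the (n, k) MDS layout (my illustration): the file is k
# symbols; f is the unique degree-(k-1) polynomial over GF(P) that equals the
# file at points 1..k, and node i stores f(i+1). The first k nodes hold the
# file verbatim, and ANY k survivors can rebuild it, so the system tolerates
# any n - k simultaneous failures.

P = 2**13 - 1
n, k = 5, 3
file = [random.randrange(P) for _ in range(k)]   # the file: k pieces

def interp_eval(pts, x):
    """Evaluate at x the unique low-degree poly through pts, over GF(P)."""
    total = 0
    for xj, yj in pts:
        num = den = 1
        for xi, _ in pts:
            if xi != xj:
                num = num * ((x - xi) % P) % P
                den = den * ((xj - xi) % P) % P
        total = (total + yj * num * pow(den, P - 2, P)) % P
    return total

filepts = [(j + 1, file[j]) for j in range(k)]
nodes = [interp_eval(filepts, i + 1) for i in range(n)]   # encode onto n nodes

# Kill any n - k = 2 nodes; the file is rebuilt from the k survivors.
for dead in [{0, 3}, {1, 4}, {3, 4}]:
    alive = [(i + 1, nodes[i]) for i in range(n) if i not in dead][:k]
    assert [interp_eval(alive, j + 1) for j in range(k)] == file
print("recovered the file from every tested set of k survivors")
```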

The amount of information communicated from the d surviving nodes is called the repair bandwidth \gamma = d\beta. One can reduce this repair bandwidth if the storage per node \alpha is allowed to increase, so there is a tradeoff involved.
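For reference, the cut-set bound itself (in the standard form from the survey linked above) says that a file of size B is recoverable only if

B \le \sum_{i=0}^{k-1} \min\{\alpha, (d-i)\beta\},

and the optimal tradeoff between the storage \alpha and the repair bandwidth \gamma = d\beta is obtained by making this bound hold with equality.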

The two most important points of this tradeoff are the minimum storage point (MSR), where \alpha = B/k corresponds to MDS codes, and the minimum bandwidth point (MBR), where \alpha is larger but the repair bandwidth is minimized.
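Written out explicitly (these are the standard expressions from the regenerating-codes literature), the two extreme points are

(\alpha_{MSR}, \gamma_{MSR}) = \left(\frac{B}{k}, \frac{Bd}{k(d-k+1)}\right) \quad \text{and} \quad (\alpha_{MBR}, \gamma_{MBR}) = \left(\frac{2Bd}{k(2d-k+1)}, \frac{2Bd}{k(2d-k+1)}\right).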

Clearly exact repair is strictly harder than functional repair, and the cut-set region is still a valid outer bound. A substantial amount of recent work has focused on understanding the communication required for exact repair. All the results here establish that the cut-set region is achievable for exact repair for the parameters discussed.

Vijay presented a storyline that led up to the settings that were known before.

This paper had completely solved the problem for the minimum bandwidth (MBR) point, for any (n,k), but only for d = n-1.

The first paper (Explicit and Optimal Exact-Regenerating Codes for the Minimum-Bandwidth Point in Distributed Storage) in this session established that the MBR cut-set bound point can be achieved with exact repair for any number of helper nodes d; since repair needs only d survivors, a smaller d means repair remains possible even when up to n-d nodes have failed. The second result in this paper is that all the other interior points of the tradeoff curve (except the points of minimum bandwidth and minimum storage) cannot be achieved with linear codes for exact repair. (Of course, this does not preclude the possibility that they can be approached arbitrarily closely by making the code packets smaller and smaller.)

In the second paper, A Flexible Class of Regenerating Codes for Distributed Storage, the authors generalize the requirements of the distributed storage system: the initial requirement was that any data collector who connects to any k storage nodes (and receives all their stored data) obtains enough information to recover the file. Here the authors want to enable data collectors who have limited communication with each storage node to reconstruct the file; the only requirement is that the total amount of information communicated is sufficient. This is, of course, a harder requirement on the storage code, and the authors characterize the repair bandwidth for functional repair under this new requirement of flexible recovery. The technical part involves graph-theoretically bounding the flow for these flexible information flow graphs. The question of exact repair of these flexible regenerating codes naturally arises, and it seems to me that previous exact regenerating code constructions do not immediately generalize.

Finally, in another session was the closely related paper

Exact-Repair MDS Codes for Distributed Storage Using Interference Alignment

by Changho Suh and Kannan Ramchandran

Changho presented two new results: both are for the minimum storage point (MSR) and hence correspond to results for MDS codes.

One is that in the low-rate regime k/n \leq 1/2, if the number of nodes helping with the repair is sufficiently large, d \geq 2k-1, the MSR cut-set point can be achieved for exact repair. For this regime the paper presents explicit code constructions that use small finite fields and build on this paper, which established similar results but only for repairing systematic nodes (and not nodes storing parities).

The second result of this paper closes the open problem of MSR exact repair in the high-rate regime k/n \geq 1/2. The paper applies the symbol extension technique, introduced in the Cadambe-Jafar paper Interference Alignment and the Degrees of Freedom for the K User Interference Channel, to construct MDS codes that asymptotically approach the cut-set bound.

This paper appeared simultaneously with

Distributed Data Storage with Minimum Storage Regenerating Codes – Exact and Functional Repair are Asymptotically Equally Efficient, by Viveck R. Cadambe, Syed A. Jafar, Hamed Maleki,

which independently obtains very similar results using the same technique. Note that the symbol extension approach requires vector linear network coding (sub-packetization of each stored packet), and approaching the cut-set bound seems to require an enormous amount of sub-packetization (and field size). I think it is correct to interpret these results as existence proofs of high-rate regenerating codes approaching the cut-set bound for exact repair, rather than as explicit constructions.

I think that the applicability of interference alignment to problems that seem to have nothing to do with wireless channels is quite remarkable. Another related case is the paper

Network Coding for Multiple Unicasts: An Interference Alignment Approach

by A. Das, S. Vishwanath, S. Jafar, A. Markopoulou.

I will blog my notes on this paper (hopefully) soon. Are the slides available anywhere? I took pictures of slides from other talks but my camera ran out of battery.

ISIT 2010: Bethe Entropy and the Edge Zeta Function

Connecting the Bethe Entropy and the Edge Zeta Function of a Cycle Code
by Pascal Vontobel

Pascal talked about a connection between the Bethe entropy and the edge zeta function of a cycle code, building on previous work by Koetter, Li, Vontobel, and Walker. Starting with the first quantity: as is well known, Yedidia, Freeman, and Weiss established that sum-product (if it converges) can only converge to stationary points of the Bethe free energy. The Bethe entropy is the non-linear term added to the ‘average energy’ (whose linear part is what the LP decoder optimizes) and is therefore a fundamental quantity for understanding sum-product and its connections to LP relaxations. A cycle code is a linear code where all the variables have degree two (and can therefore be represented as edges).

Now what is the second quantity in this connection, the edge zeta function of a cycle code? As far as I understood (and this is probably inaccurate), for each edge (= variable of the code) one defines an indeterminate quantity U_{e}.

For a primitive cycle c_1 of this graph (no backtracking allowed), consisting of edges e_1, e_2, \cdots, define the monomial \gamma(c_1) = U_{e_1}U_{e_2}\cdots, the product over all the edges of the cycle.

The edge zeta function is the product, over all the primitive cycles, of these monomials inverted: \prod_{c_i} \frac{1}{1-\gamma(c_i)}.

The edge zeta function can therefore be evaluated at every point of the fundamental polytope by setting the U_{e} appropriately. Previous work by Koetter, Li, Vontobel, and Walker established that each monomial in the Taylor expansion of this function corresponds to an (unscaled) pseudo-codeword of the cycle code. In particular, the powers appearing for each edge (= variable of the code) correspond to re-scaled pseudo-codeword coordinates. As far as I understood, this paper shows that the coefficients of the monomials associated with a given pseudo-codeword have a meaning: their asymptotic growth rate equals the directional derivative of the (induced) Bethe entropy in the direction of that pseudo-codeword.
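A rough example (my own, and it glosses over the counting of traversal orientations): for the cycle code whose graph is a single cycle on three edges, the only primitive cycle contributes

\frac{1}{1-U_{e_1}U_{e_2}U_{e_3}} = \sum_{m \ge 0} (U_{e_1}U_{e_2}U_{e_3})^m,

so the exponent vectors of the monomials are (m,m,m): exactly the unscaled multiples of (1,1,1), which are the pseudo-codewords (here, actual codewords) of this toy code.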

I was wondering about the connections to the self-avoiding walks of Dror Weitz [Counting independent sets up to the tree threshold] (also Local approximate inference algorithms) and the connections of the random walks to the weights in the Arora, Daskalakis, and Steurer paper Message Passing Algorithms and Improved LP Decoding.

ISIT 2010 : gossip and consensus

THE MISSING PIECE SYNDROME IN PEER-TO-PEER COMMUNICATION (Bruce Hajek, Ji Zhu; University of Illinois at Urbana Champaign)
This paper proposes a model for peer-to-peer content distribution in a BitTorrent-like setup where there is a seed node and everybody wants to get the K pieces of a file held by the seed. Users arrive according to a Poisson process and peers randomly collect and transfer (instantaneously) one piece. The paper provides a queueing-based stability analysis for this system. It’s a cool model, and the talk had some rather amusing moments for those who were there…

WEIGHTED GOSSIP: DISTRIBUTED AVERAGING USING NON-DOUBLY STOCHASTIC MATRICES (Florence Bénézit; Ecole Normale Supérieure-INRIA, Vincent Blondel; UC Louvain, Patrick Thiran; Ecole polytechnique fédérale de Lausanne, John Tsitsiklis; Massachusetts Institute of Technology, Martin Vetterli; Ecole polytechnique fédérale de Lausanne)
Florence presented convergence results for an algorithm based on one-way path averaging. Inspired by the Push-Sum protocol of Kempe et al., she described a one-way method in which a node “gives away” a fraction of its estimate and pushes it along a random direction in the network. The receiving node keeps some of the value and passes the rest along — it’s kind of like passing a plate of food around a table while keeping a little (or a lot) for yourself. It’s a cool algorithm, and it works really well in experiments. However, the rate of convergence is still an open question — it seems related to the convergence of non-homogeneous Markov chains.
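Here is a minimal simulation of the push-sum-style bookkeeping that makes such one-way schemes work (my toy version with random pairwise pushes, not Florence's path-averaging algorithm): each node keeps a (sum, weight) pair and estimates by the ratio; the updates are nowhere near doubly stochastic, yet all ratios converge to the true average.

```python
import random

# Toy push-sum-style gossip (my sketch, not the paper's one-way path
# averaging). Node i keeps a pair (s_i, w_i), initialized to (x_i, 1). In
# each step a random node keeps half of its pair and pushes the other half
# to another random node. Total mass is conserved but the update matrices
# are not doubly stochastic; still, every ratio s_i / w_i converges to the
# true average of the x_i.

random.seed(1)
n = 20
x = [random.uniform(0, 100) for _ in range(n)]
true_avg = sum(x) / n

s = x[:]                 # running sums
w = [1.0] * n            # running weights

for _ in range(20000):
    i, j = random.randrange(n), random.randrange(n)
    if i == j:
        continue
    s[i] *= 0.5; w[i] *= 0.5      # node i keeps half ...
    s[j] += s[i]; w[j] += w[i]    # ... and pushes the other half to node j

print("true average:", true_avg)
print("max error of s_i/w_i:", max(abs(si / wi - true_avg) for si, wi in zip(s, w)))
```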

TIGHT BOUNDS FOR ALGEBRAIC GOSSIP ON GRAPHS (Michael Borokhovich, Chen Avin, Zvi Lotker; Ben Gurion University of the Negev)
This paper was more discrete in nature. There are n nodes in a network and each has a value in a finite field. They pass linear combinations of their symbols around, and the goal is for every node to learn all the information, or equivalently to gather a full-rank set of equations. Nodes can communicate according to a graph structure — the authors presented upper and lower bounds of order n d_{\max}, where d_{\max} is the maximum degree in the graph. They also showed that the barbell graph is very, very slow.
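A small simulation of the dissemination process (my sketch: GF(2) instead of a large field, and a ring instead of a general graph, so take the round counts with a grain of salt):

```python
import random

# Toy algebraic gossip (my sketch; GF(2) for simplicity, on a ring). Node i
# starts with the i-th unit vector; in each round every node sends a uniform
# random GF(2) combination of its vectors to a random neighbor. We count
# rounds until every node's collected subspace has full rank n. Vectors are
# bitmasks; 'insert' is one step of Gaussian elimination.

random.seed(2)
n = 16

def insert(basis, v):
    """Reduce v against basis (dict: pivot bit -> vector); keep if independent."""
    for pivot in sorted(basis, reverse=True):
        if v >> pivot & 1:
            v ^= basis[pivot]
    if v:
        basis[v.bit_length() - 1] = v

nodes = [{i: 1 << i} for i in range(n)]          # node i knows only e_i

rounds = 0
while any(len(basis) < n for basis in nodes):
    rounds += 1
    for i in range(n):
        neighbor = (i + random.choice([-1, 1])) % n       # ring topology
        combo = 0
        for vec in nodes[i].values():                     # random combination
            if random.random() < 0.5:
                combo ^= vec
        insert(nodes[neighbor], combo)

print("rounds until all nodes reach full rank:", rounds)
```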

DISTRIBUTED CONSENSUS WITH FINITE MESSAGING (Debashis Dash, Ashutosh Sabharwal; Rice University)
This was on distributed vertex coloring in which each node gets to know something about the colors in its local neighborhood. Getting a proper coloring this way is hard (which they prove), but the authors allow themselves a little slack: they want to minimize the number of defects (nodes with an adjacent node of the same color) rather than force it to zero. A number of algorithms were presented, many of them based on an initial random assignment followed by a refinement step using the local information.
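A generic sketch of the random-assignment-plus-refinement pattern (entirely my own toy version — the algorithms in the paper work under finite-messaging constraints and are more subtle):

```python
import random

# Toy 'random assignment + one refinement sweep' coloring (my illustration,
# not the paper's algorithms). Nodes of a sparse random graph pick colors
# uniformly at random; then each defective node re-picks a color that avoids
# its neighbors, when one exists. We count defective edges before and after.

random.seed(3)
n, ncolors = 100, 4
edges = [(i, j) for i in range(n) for j in range(i + 1, n)
         if random.random() < 3.0 / n]           # sparse random graph
nbrs = {i: [] for i in range(n)}
for i, j in edges:
    nbrs[i].append(j)
    nbrs[j].append(i)

color = [random.randrange(ncolors) for _ in range(n)]

def defects():
    return sum(color[i] == color[j] for i, j in edges)

print("defective edges after random assignment:", defects())
for i in range(n):                               # one local refinement sweep
    used = {color[j] for j in nbrs[i]}
    free = [c for c in range(ncolors) if c not in used]
    if color[i] in used and free:
        color[i] = random.choice(free)
print("defective edges after one refinement sweep:", defects())
```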

A NEAR-OPTIMAL ALGORITHM FOR NETWORK-CONSTRAINED AVERAGING WITH NOISY LINKS (Nima Noorshams, Martin J. Wainwright; University of California, Berkeley)
This paper was essentially about packing routes in a “gossip along the way” paradigm — if a node wakes up and starts a path (say horizontally), it can also send a message vertically to trigger path-averaging along parallel paths. This gives a two-phase algorithm and the number of rounds ends up looking like the diameter of the graph. However, the number of one-hop messages scales in the same way. Thus the gain is through parallelization.

ISIT 2010 : some talks on statistics and probability

For many of the talks I attended I didn’t take notes — partly this is because I didn’t feel expert enough to note things down correctly, and partly because I am

RÉNYI DIVERGENCE AND MAJORIZATION (Tim van Erven; Centrum Wiskunde & Informatica, Peter Harremoës; Copenhagen Business College)
Peter gave a talk on some properties of the Rényi divergence

D_{\alpha}(P \| Q) = \frac{1}{\alpha - 1} \log \int p^{\alpha} q^{1 - \alpha} d \mu

where \mu is a dominating measure, and a related geometric quantity, the Lorenz diagram. This is the set of all points in \mathbb{R}^2 of the form (\int f dP, \int f dQ) for some function f taking values in [0,1]. It turns out that subset relationships of Lorenz diagrams are the same as majorization relationships between the measures. This is related to something called Markov ordering from a recent paper by Gorban, Gorban, and Judge. This was one of those nice talks which pointed me towards some results and tools that are new to me but that I could not fully understand during the talk.

MINIMAX LOWER BOUNDS VIA F-DIVERGENCES (Adityanand Guntuboyina; Yale University)
This talk was on minimax lower bounds for parameter estimation which can be calculated in terms of generalized f-divergences, which are of the form

D_f(P \| Q) = \int f \left( \frac{dP}{dQ} \right) dQ

Suppose T is an estimator for some parameter \theta. We want universal (over T) lower bounds on \mathbb{P}(T(X^n) \ne \theta). This gives a generalization of Fano’s inequality:

\mathbb{P}(T(X^n) \ne \theta) \ge \inf_Q \sum_{\theta} D_f(P_{\theta} \| Q)

This has a very geometric feel to it but I had a bit of a hard time following all of the material since I didn’t know the related literature very well (much of it was given in relation to Yang and Barron’s 1999 paper).

MUTUAL INFORMATION SADDLE POINTS IN CHANNELS OF EXPONENTIAL FAMILY TYPE (Todd Coleman; University of Illinois, Maxim Raginsky; Duke University)
In some cases where we have a class of channels and a class of channel inputs, there is a saddle point. The main example is that Gaussian noise is the worst noise for a given power constraint and Gaussian inputs are the best inputs. Another example is the “bits through queues” setting. This paper gave a more general class of channels of “exponential family type” and a more general condition under which you get a saddle point for the mutual information. The channel models are related to the Gibbs variational principle, and the arguments had a kind of “free energy” interpretation. Ah, statistical physics rears its head again.
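In symbols, the familiar Gaussian saddle point (my summary of the classical fact): for inputs with power at most P and noise with variance \sigma^2, with X_G \sim N(0,P) and N_G \sim N(0,\sigma^2) independent,

I(X; X+N_G) \le I(X_G; X_G+N_G) \le I(X_G; X_G+N),

i.e., the Gaussian input is the best input against Gaussian noise, and Gaussian noise is the worst noise against a Gaussian input.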

INFORMATION-THEORETIC BOUNDS ON MODEL SELECTION FOR GAUSSIAN MARKOV RANDOM FIELDS (Wei Wang, Martin J. Wainwright, Kannan Ramchandran; University of California, Berkeley)
This was on two problems: inferring the edges of a graphical model from i.i.d. samples, and inverse covariance estimation. They are similar, but not quite the same. The goal was to prove necessary conditions on the number of samples; these match the corresponding achievable rates of polynomial-time algorithms. The main result is that you need n = \Omega(d^2 \log p) samples, where p is the number of variables and d is the maximum degree in the graphical model. The proof models graph selection as channel coding: the channel outputs samples from a graphical model chosen by the encoder, and if the decoder can identify the model then the problem is solvable. This is analogous to the talk Sriram Vishwanath gave today on matrix completion.

ISIT 2010 : Michael Jordan

Perhaps malapropos for the NBA Finals, Prof. Michael Jordan gave the first plenary talk at ISIT. It was a great overview of nonparametric Bayesian modeling. In particular, he covered his favorite Chinese restaurant process (closely related to the Pitman-Yor and stick-breaking constructions), hierarchical Dirichlet priors, and all the other jargon-laden elements of modeling. At the end he covered some of the rather stunning successes of this approach in applications with lots of data to learn from. What was missing for me was a sense of how these approaches work in the data-poor regime, so I asked a question (foolishly) about sample complexity. Alas, since that is a “frequentist” question and Jordan is a “Bayesian,” I didn’t quite get the answer to the question I was trying to ask, but that’s what happens when you don’t phrase things properly.

One nice thing that I learned was the connection to Kingman’s earlier work on characterizing random measures via non-homogeneous Poisson processes. Kingman has been popping up all over the place in my reading, from urn processes to exchangeable partition processes (also known as paintbox processes). When I get back to SD, it will be back to the classics for me!