My ex-groupmate and fellow Uni High graduate Galen Reeves told me about a paper a few weeks ago when I visited him at Stanford:
Successive Normalization of Rectangular Arrays
Richard A. Olshen and Bala Rajaratnam
The Annals of Statistics 38(3), pp. 1638-1664, 2010
Apparently, however, the arguments in the paper are not quite correct [1], and they recently uploaded a correction to the arXiv.
This paper looks at the effect of a very common preprocessing step used to transform an n × k data matrix X into a form acceptable for statistical or machine learning algorithms that assume things like zero-mean or bounded vectors. Here n may represent the number of individuals, and k the number of features, for example. Or the data may come from a DNA microarray (their motivating example). This preprocessing is often done without much theoretical justification; it is the mathematical equivalent of tossing spilled salt over your shoulder. The paper studies the limiting process of standardizing rows and then columns and then rows and then columns again and again. They further need that n, k ≥ 3. “Readers will see that the process and perhaps especially the mathematics that underlies it are not as simple as we had hoped they would be.”
So what exactly is the preprocessing? I am going to describe things in pseudocode (too lazy to do real code, sorry). Given a data matrix X[i,j], they look at
for i = 1:n { X[i,1:k] = X[i,1:k] - mean(X[i,1:k]) }
for j = 1:k { X[1:n,j] = X[1:n,j] - mean(X[1:n,j]) }
They call the first a “row mean polish” and the second a “column mean polish.” They show this converges in one step: subtracting the column means does not disturb the zero row means, so after a single row pass and column pass every row and every column already has mean zero.
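Just to convince myself, here is a quick NumPy check (mine, not from the paper) that one pass really does the job:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 4))          # toy n x k data matrix

X = X - X.mean(axis=1, keepdims=True)    # row mean polish
X = X - X.mean(axis=0, keepdims=True)    # column mean polish

# both row means and column means are now zero (up to floating point)
print(np.abs(X.mean(axis=1)).max(), np.abs(X.mean(axis=0)).max())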
But what about standardizing? The more complicated polishing procedure looks like this:
for i = 1:n {
  mu = mean(X[i,1:k])
  sigma = sqrt( sum( (X[i,1:k] - mu)^2 ) / (k-1) )
  X[i,1:k] = (X[i,1:k] - mu) / sigma
}
for j = 1:k {
  mu = mean(X[1:n,j])
  sigma = sqrt( sum( (X[1:n,j] - mu)^2 ) / (n-1) )
  X[1:n,j] = (X[1:n,j] - mu) / sigma
}
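In case the pseudocode is too terse, here is my reading of it as a NumPy sketch (the helper names and the ddof=1 sample-standard-deviation convention are my choices, not the paper’s):

import numpy as np

def standardize_rows(X):
    # center each row and divide by its sample standard deviation
    mu = X.mean(axis=1, keepdims=True)
    sigma = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mu) / sigma

def standardize_cols(X):
    # same thing, but down each column
    mu = X.mean(axis=0, keepdims=True)
    sigma = X.std(axis=0, ddof=1, keepdims=True)
    return (X - mu) / sigma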
This standardizes rows first, and then columns (or “lather” and “rinse,” since we are going to “repeat”). They call this Efron’s algorithm because he told them about it. So what happens if we repeat these two steps over and over again on a matrix with i.i.d. entries from some continuous distribution?
Theorem 4.1: Efron’s algorithm converges almost surely; more precisely, it converges for every X in a Borel set of entries whose complement has Lebesgue measure 0.
So what does it look like in practice? How fast is this convergence? Empirically, it looks exponential (see the little experiment below), and they have some theoretical guarantees on the rate in the paper, somewhat hidden in the discussion. The proofs are not terribly ornate, but they are tricky, and I don’t quite get all the details myself; still, I figured readers of this blog would certainly be interested in this cute result.
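Here is the little experiment, a self-contained sketch of my own (not the paper’s code); on i.i.d. Gaussian input the successive changes shrink at what looks like a geometric rate:

import numpy as np

def standardize(X, axis):
    # subtract the mean and divide by the sample standard deviation along `axis`
    mu = X.mean(axis=axis, keepdims=True)
    sigma = X.std(axis=axis, ddof=1, keepdims=True)
    return (X - mu) / sigma

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 10))   # i.i.d. entries, n = 20 rows, k = 10 columns

for t in range(12):
    X_next = standardize(standardize(X, axis=1), axis=0)  # rows, then columns
    print(t, np.linalg.norm(X_next - X))                  # change per full pass
    X = X_next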
[1] A fun quote from the correction: “Theorem 4.1 of [2] is false. A claimed backwards martingale is NOT. Fortunately, all that seems damaged by the mistake is pride. Much is true.” I really liked this.