I came across some references to “Bernstein’s trick” in some papers I was reading, but had to do a little digging to find out what it really meant. Suppose you have independent and identically distributed random variables Xi for i = 1, 2, … n, taking values in [-1, 1], and with E[Xi] = 0. Our goal is to bound
Pr[ (1/n) ∑ Xi > a ] .
The Bernstein trick is to multiply both sides by a dummy variable t and exponentiate both sides:
Pr[ (1/n) ∑ Xi > a ] = Pr[ exp(t ∑ Xi) > exp(a n t) ] ,
Now we use the Markov inequality on the random variable exp(t ∑ X_i):
Pr[ exp(t ∑ Xi) > exp(a n t) ] ≤ exp(- a n t) E[ exp(t ∑ Xi) ],
Now we can leverage the fact that the X_i‘s are independent:
exp(- a n t) E[ exp(t ∑ Xi) ] = exp(- a n t) E[ exp(tX) ]n,
where X has the same distribution as Xi. Then we can expand exp(tX) as a Taylor series:
exp(- a n t) E[ exp(tX) ]n = exp(- a n t) E[ 1 + t X + (1/2!) (t X)2 + … ] .
We now leverage the fact that X is in the interval [-1,1] and that E[X] = 0 to bound the sum of the higher-order terms in the Taylor series:
exp(- a n t) E[ 1 + t X + (1/2!) (t X)2 + … ] ≤ exp(- a n t) exp(b k t2) .
For some constant b, which you can pick based on how much you know about the distribution of X.
Now the more probability-savvy may say “wait a minute, E[ exp(tX) ] is just the moment generating function mX(t), so isn’t this just a fancy Chernoff bound? The answer is basically yes, but the term “Chernoff bound” is used for a lot of related bounds. The one I’m most familiar with requires you to know the moment generating function E[ exp(tX) ]:
P( exp(t X) > exp(a t) ) ≤ exp(-a t) mX(t) .
Since this holds for all t, you differentiate with respect to t and find the minimum to get the tightest bound.
The main difference in these two Chernoff-style bounds is the lack of knowledge needed to apply the former to sums of iid random variables. Of course, you can get all of these bounds from large deviations theory or concentration inequalities, but sometimes a rough trick is all you need to get a sufficiently fast exponential decay so that you can move on to the next step.
Note : updated with nicer characters, thanks to Erin.
 Ahlswede, R. “Elimination of Correlation in Random Codes for Arbitrarily Varying Channels,” Zeitschr. Wahr. und Verw. Geb. V.44, 1978, pp. 159–175.
 Wigderson, A. and Xiao, D. “A Randomness-Efficient Sampler for Matrix-valued Functions and Applications.” FOCS 2005.
 Gallager, R. Discrete Stochastic Processes, Boston : Kluwer, 1996.
7 thoughts on “Chernoff bounds and Bernstein’s trick”
I read blogs to avoid the fact that I’m supposed to be working on my problem set. And I just had to do a problem on Chernoff bounds (and more specifically, the Hoeffding inequality) for random algorithms. And so now I actually took the time to understand all that, and I do not feel like I’ve taken a break. I hate you. :-p
Useful things for you to know (remove spaces to get the characters):
Σ (& Sigma 😉
∑ (& sum 😉
≥ (& ge 😉
≤ (& le 😉
Moveable type doesn’t seem to support the and tags, which would give you exponents and labels.
Ahh… that would be the <sup> and <sub> tags.
Okay, since you’re clearly taking my suggestions and using them, I just checked over on my blog… <sup> and <sub> tags work, just not in comments. So X<sub>i</sub> would look nicer than X_i and foo<sup>n</sup> would look nicer than foo^n.
Wanna see what I’m doing today?
It’s an assignment for my computational complexity class… I’ve gotten really good at writing math using ASCII special characters and special HTML tags.
total bullshit. you liar. everyone knows that Chernoff was discredited in the autoerotic-asphyxiation scandal of ’42.
My head hurts after reading that.
Reblogged this on Epanechnikov's Blog and commented:
A useful trick…