Chernoff bounds and Bernstein’s trick

I came across some references to “Bernstein’s trick” in some papers I was reading, but had to do a little digging to find out what it really meant. Suppose you have independent and identically distributed random variables Xi for i = 1, 2, … n, taking values in [-1, 1], and with E[Xi] = 0. Our goal is to bound

Pr[ (1/n) ∑ Xi > a ] .

The Bernstein trick is to multiply both sides by a dummy variable t and exponentiate both sides:

Pr[ (1/n) ∑ Xi > a ] = Pr[ exp(t ∑ Xi) > exp(a n t) ] ,

Now we use the Markov inequality on the random variable exp(t ∑ X_i):

Pr[ exp(t ∑ Xi) > exp(a n t) ] ≤ exp(- a n t) E[ exp(t ∑ Xi) ],

Now we can leverage the fact that the X_i‘s are independent:

exp(- a n t) E[ exp(t ∑ Xi) ] = exp(- a n t) E[ exp(tX) ]n,

where X has the same distribution as Xi. Then we can expand exp(tX) as a Taylor series:

exp(- a n t) E[ exp(tX) ]n = exp(- a n t) E[ 1 + t X + (1/2!) (t X)2 + … ] .

We now leverage the fact that X is in the interval [-1,1] and that E[X] = 0 to bound the sum of the higher-order terms in the Taylor series:

exp(- a n t) E[ 1 + t X + (1/2!) (t X)2 + … ] ≤ exp(- a n t) exp(b k t2) .

For some constant b, which you can pick based on how much you know about the distribution of X.

Now the more probability-savvy may say “wait a minute, E[ exp(tX) ] is just the moment generating function mX(t), so isn’t this just a fancy Chernoff bound? The answer is basically yes, but the term “Chernoff bound” is used for a lot of related bounds. The one I’m most familiar with requires you to know the moment generating function E[ exp(tX) ]:

P( exp(t X) > exp(a t) ) ≤ exp(-a t) mX(t) .

Since this holds for all t, you differentiate with respect to t and find the minimum to get the tightest bound.

The main difference in these two Chernoff-style bounds is the lack of knowledge needed to apply the former to sums of iid random variables. Of course, you can get all of these bounds from large deviations theory or concentration inequalities, but sometimes a rough trick is all you need to get a sufficiently fast exponential decay so that you can move on to the next step.

Note : updated with nicer characters, thanks to Erin.

[1] Ahlswede, R. “Elimination of Correlation in Random Codes for Arbitrarily Varying Channels,” Zeitschr. Wahr. und Verw. Geb. V.44, 1978, pp. 159–175.
[2] Wigderson, A. and Xiao, D. “A Randomness-Efficient Sampler for Matrix-valued Functions and Applications.” FOCS 2005.
[3] Gallager, R. Discrete Stochastic Processes, Boston : Kluwer, 1996.