I was looking up axiomatic characterizations of entropy today and figured I’d share. There are two different axiomatizations I found while rooting around my books. I’m sure there are more, but these have nice plausible arguments. The entropy of a random variable measures the uncertainty inherent to that random variable. Earlier I argued that one should think of this uncertainty in terms of praxis — how many bits it takes to resolve something, or how many bits it can resolve. Here the question is more fundamental: what do we mean by the uncertainty?

Firstly, the uncertainty about a random variable should not depend on the specific values that it takes, but only on the frequency or probability with which it takes on those values. For example, if I flip a fair coin it comes up heads or tails with even probability. If I draw a ball from an urn containing an equal number of red and blue balls, the chance of drawing a red or a blue ball is the same. The uncertainty in both of these experiments is the same, and depends only on the probabilities involved. Suppose *{p_1, p_2, … p_n}* are the probabilities of the *n* possible values for a random variable X.

Claude Shannon, in his 1948 paper, has three conditions he claims should be intuitively true about any function *H(p_1, p_2, … p_n)* that we want to call the uncertainty. Firstly, he says that *H* should be continuous. This makes sense — if I change the probabilities by a little, so the coin is now 49% likely to come up tails, we don’t expect the uncertainty to change by much. His second claim is that the uncertainty of an experiment with *n* equally likely outcomes is less than the uncertainty of an experiment with *n+1* equally likely outcomes. This is again pretty straightforward.
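Both of these axioms are easy to sanity-check numerically once you have the entropy formula in hand (the formula comes later in the post, so treat this as a quick peek ahead; the `entropy` helper here is mine, just a sketch):

```python
import math

def entropy(probs):
    # H in bits; terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Axiom 2: H over n equally likely outcomes is strictly increasing in n.
values = [entropy([1 / n] * n) for n in range(1, 6)]
assert all(values[i] < values[i + 1] for i in range(len(values) - 1))

# Axiom 1 (continuity, informally): nudging the coin to 49/51
# barely changes the uncertainty.
assert abs(entropy([0.49, 0.51]) - entropy([0.5, 0.5])) < 0.001
```

For equally likely outcomes the formula collapses to *H = log(n)*, which is why the monotonicity check works out.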

Shannon’s third axiom is a little tougher to visualize. Consider the following scenario. You have *n* equally likely outcomes, but *n* is very large. To make your life easier, you divide them into *k* groups of size *b_1, b_2, … b_k*. You pick a group *j* with probability *b_j / n*, that is, weighted according to size. Then you pick an outcome uniformly at random from the outcomes in group *j*. This process is equivalent to just picking one of the *n* choices at random, and so should generate the same uncertainty. In other words, we have

H(1/n, 1/n, … 1/n) = H(b_1/n, … b_k/n) + (b_1/n) H(1/b_1, … 1/b_1) + (b_2/n) H(1/b_2, … 1/b_2) + … + (b_k/n) H(1/b_k, … 1/b_k)
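It’s worth checking this grouping identity with actual numbers (a sketch; the `entropy` helper and the particular group sizes are mine, not from the post):

```python
import math

def entropy(probs):
    # H in bits; terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = 12
groups = [3, 4, 5]  # b_1, b_2, b_3; the sizes must sum to n

# Left side: pick one of n outcomes uniformly.
lhs = entropy([1 / n] * n)

# Right side: pick a group weighted by size, then pick uniformly within it.
rhs = entropy([b / n for b in groups]) + sum(
    (b / n) * entropy([1 / b] * b) for b in groups
)

assert math.isclose(lhs, rhs)
```

Both sides come out to log2(12) ≈ 3.585 bits, as the axiom demands.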

Khinchin has a different set of axioms which I find a little easier to think about. First off, he says that if the number of outcomes is fixed, then you have the most uncertainty (*H* is maximum) when all the outcomes are equally likely. His second assumption is that if two random variables are independent, then the uncertainty of them together is the sum of the uncertainty in them separately. Lastly, if you add some impossible outcomes (outcomes with probability zero) the uncertainty is unchanged.
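Khinchin’s three axioms are also easy to check numerically against the entropy formula (again a sketch; the distributions below are ones I made up for illustration):

```python
import math
from itertools import product

def entropy(probs):
    # H in bits; terms with p = 0 contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

px = [0.5, 0.3, 0.2]
py = [0.6, 0.4]

# Axiom 1: for fixed n, uncertainty is largest at the uniform distribution.
assert entropy(px) <= entropy([1 / 3] * 3)

# Axiom 2: for independent variables the joint probabilities are products,
# and the uncertainties add.
joint = [p * q for p, q in product(px, py)]
assert math.isclose(entropy(joint), entropy(px) + entropy(py))

# Axiom 3: tacking on an impossible outcome changes nothing.
assert math.isclose(entropy(px), entropy(px + [0.0]))
```

The additivity check is where the logarithm earns its keep: probabilities of independent events multiply, and the log turns that product into a sum.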

In both cases, these assumptions uniquely define a measure of information or uncertainty, which we call the entropy because of those crazy physicists. The formula is:

H(p_1, p_2, … p_n) = – p_1 log(p_1) – p_2 log(p_2) – … – p_n log(p_n)

Actually, we can take any constant multiple of the right hand side, but there’s not too much point to that, except to change the base of the logarithm. It’s one of the neat features of mathematics that we can get a clean formula based entirely on verbal arguments of what we think a good measure of “uncertainty” is.
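The constant-multiple remark is concrete: since log_b(x) = ln(x)/ln(b), switching the base of the logarithm just rescales *H* by a fixed factor. A quick sketch (the `entropy` helper with a `base` parameter is my own, hypothetical interface):

```python
import math

def entropy(probs, base=2):
    # H = -sum p_i log(p_i), with the log taken in the given base.
    return -sum(p * math.log(p, base) for p in probs if p > 0)

p = [0.5, 0.25, 0.25]

# Base 2 gives bits; base e gives nats. They differ by the constant ln(2).
assert math.isclose(entropy(p, base=math.e), entropy(p, base=2) * math.log(2))
```

Base 2 is the usual choice in information theory, since it makes the entropy of a fair coin flip exactly 1 bit.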

… but isn’t that formula just a little bit uncertain? 😉

Shannon’s third axiom reminds me a lot of the syphilis tests you were telling me about. And somewhere there’s a joke about the fact that you used the variable b_j and how that might increase the probability of finding syphilis, thus throwing off your uncertainty measurements.

Damn it, I’m attempting (poorly, I realize) to make a joke about math and STDs. This is a bad sign.

thanks for stating the obvious, anand.