I cleared up in my head a long-seated confusion regarding some of the explanations used in information theory. Or rather, I had unconfused myself a while ago, but now I have a way to explain it. The discussion below is not too technical.
In classical Information Theory, we are concerned with how much information is contained in the data that we observe, and how much information is inherently contained in objects. This is by necessity a vague statement; the clarification of this idea is the whole point of the field. Suppose I have a random quantity (variable) X. Then we can compute a number H(X) which represents the minimal number of bits that it takes, on average, to describe X completely. This is a bit surprising in and of itself, but given the world that we live in nowadays, it’s not too much to swallow.
The real confusion steps in when people try to explain H(X). It’s called the entropy of the random variable X, which is in and of itself a minor problem. People have all sorts of associations with entropy — it always increases, you can’t fight it, the heat death of the universe, and so on. Not too big a deal, this is just another meaning of the word entropy. But then H(X) gets explained intuitively as the amount of information contained in X, or the amount of uncertainty in X. I propose that we think of H(X) as the amount of uncertainty that can be resolved by X. That is, knowing X lets me figure out H(X) bits of information.
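To make the number H(X) a little more concrete, here is a minimal Python sketch that computes it for a few coin flips. The function name `entropy` and the example distributions are my own illustrative choices, and the sketch assumes X takes finitely many values with known probabilities.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities.

    Hypothetical helper for illustration; outcomes with probability 0 are skipped.
    """
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# A fair coin: knowing the outcome resolves one full bit of uncertainty.
fair_coin = entropy([0.5, 0.5])

# A coin that lands heads 90% of the time: knowing the outcome resolves
# less than half a bit, since we could already guess it fairly well.
biased_coin = entropy([0.9, 0.1])

# A two-headed coin: knowing the outcome resolves nothing at all.
certain = entropy([1.0])

print(fair_coin, biased_coin, certain)
```

Read this way, the fair coin is the most "useful" one to be told about, which matches the idea that H(X) measures what knowing X can resolve.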
Consider the case when we have X and we want to communicate it to someone else. We send it to them, but they get Y, which is a corrupted version of X, perhaps due to noise, malicious jamming, or something else. How much can the receiver learn of what we tried to send them? This quantity is called the mutual information I(X; Y) between X and Y. We can think of it using the first intuition as the amount of information contained in Y about X. Using the second intuition, it’s harder: we are trying to figure out what X is, so one stab might be to say that it’s the reduction in uncertainty in X having seen Y. This is close to the third intuitive definition, which would say that it’s the amount of uncertainty that can be resolved by X and by Y.
People like diagrams, so here’s a diagram which illustrates a problem with calling H(X) the information/uncertainty in X:
The left circle is H(X), the right circle is H(Y), and their intersection is I(X;Y). If we think of H(X) as the information contained in X, then “mutual information” means “information known by both X and Y,” hence the intersection. If we think of H(X) as uncertainty, then the diagram is confusing — the mutual information is the uncertainty shared by X and Y? What does that have to do with communication? The third interpretation in this case is more akin to the first: the mutual information is the uncertainty that can be resolved both by X and by Y.
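The picture of two overlapping circles corresponds to the standard identity I(X;Y) = H(X) + H(Y) - H(X,Y), where H(X,Y) is the entropy of the pair. As a rough sketch of the communication example, the Python below computes this for a fair bit X sent through a channel that flips it with some probability. The names `mutual_information` and `noisy_channel`, and the choice of a bit-flipping channel, are assumptions made for illustration only.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a list of probabilities (zeros are skipped)."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) in bits, from a table with joint[x][y] = P(X = x, Y = y).

    Uses the overlap identity I(X;Y) = H(X) + H(Y) - H(X,Y).
    """
    p_x = [sum(row) for row in joint]          # marginal distribution of X
    p_y = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    p_xy = [p for row in joint for p in row]   # the joint, flattened
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

def noisy_channel(flip):
    """Joint distribution of a fair bit X and a copy Y flipped with probability `flip`."""
    return [[0.5 * (1 - flip), 0.5 * flip],
            [0.5 * flip, 0.5 * (1 - flip)]]

clean = mutual_information(noisy_channel(0.0))    # Y is an exact copy of X
noisy = mutual_information(noisy_channel(0.1))    # Y is wrong one time in ten
useless = mutual_information(noisy_channel(0.5))  # Y is an independent coin flip

print(clean, noisy, useless)
```

With no noise the two circles coincide and the overlap is the full bit; with a 10% flip rate the overlap shrinks to roughly half a bit; and when Y is pure noise the circles do not overlap at all, so nothing can be resolved by both X and Y.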
Maybe this is all in my head, but it seems to me that calling entropy information or uncertainty is misleading because it is descriptive. In applied mathematics we look for the applications, and so a prescriptive intuition is more satisfying for me, and perhaps also for other students like me. Referring to the entropy of X as the uncertainty that can be resolved by X may be turning some of the uses of the quantity on their head (e.g. for source coding), but it highlights the intuitive properties of higher-order entropy functions like mutual information.
And that is my two cents.