I realized the other day that for two completely unrelated papers I’ve had to prove statements about how many unique elements appear in a sample drawn from a distribution on a discrete set. The thing is, we only have a handle on the entropy, not on the distribution itself.
Let $P$ be a distribution with entropy $H(P)$, and consider a sample of size $n$ drawn i.i.d. from $P$. Let $K$ be the number of distinct symbols in the sample. Then $\mathbb{E}[K]$ can be bounded in terms of $H(P)$ and $n$ alone.
Let $p_1, p_2, \ldots$ be the probabilities of the different elements. There could be countably many, but we can write the expectation of the number of unique elements as
\[
\sum_{i} \left( 1 - (1 - p_i)^n \right).
\]
By doing a bit of series manipulation you can get the bound above. This came up (perhaps unsurprisingly) in some work on large-alphabet probability estimation.
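To sanity-check the expectation formula numerically, here is a small sketch (function names are mine, not from the papers) that computes $\sum_i (1 - (1-p_i)^n)$ exactly and compares it against a Monte Carlo estimate:

```python
import numpy as np

def expected_distinct(p, n):
    """Exact expectation of the number of distinct symbols in an
    i.i.d. sample of size n from p: sum_i (1 - (1 - p_i)^n)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(1.0 - (1.0 - p) ** n))

def monte_carlo_distinct(p, n, trials=5000, seed=0):
    """Monte Carlo estimate of the same quantity, for comparison."""
    rng = np.random.default_rng(seed)
    p = np.asarray(p, dtype=float)
    counts = [len(np.unique(rng.choice(len(p), size=n, p=p)))
              for _ in range(trials)]
    return float(np.mean(counts))

p = [0.5, 0.25, 0.125, 0.0625, 0.0625]  # a geometric-ish toy distribution
print(expected_distinct(p, 10))
print(monte_carlo_distinct(p, 10))
```

The exact and simulated values agree closely, which is a quick way to convince yourself of the formula before doing any series manipulation on it.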
Let $P$ be a distribution with entropy $H(P)$ and support size $m$, and consider a sample of size $n$ drawn i.i.d. from $P$. Let $K$ be the number of distinct symbols in the sample. Then $\mathbb{E}[K]$ can be bounded in terms of $H(P)$, $m$, and $n$.
For this result we need a bound on the support of $P$ as well, which makes things a little easier. This result came up when we were proving bounds on adversarial strategies for eavesdropping. We needed to show that if the sample represents some sort of list decoding based on what the adversary has seen over the channel so far, then the list has some structure/distance properties. This lemma came in handy.
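To get a feel for how entropy interacts with the number of distinct symbols when the support size is fixed, one can compare two distributions on the same $m$ symbols. This is only an illustration of the quantities involved, not the lemma's bound itself (the example distributions and helper names are mine):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits, skipping zero-probability symbols."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def expected_distinct(p, n):
    """E[number of distinct symbols] = sum_i (1 - (1 - p_i)^n)."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(1.0 - (1.0 - p) ** n))

m, n = 16, 8
uniform = np.full(m, 1.0 / m)                            # max entropy on m symbols
skewed = np.array([0.85] + [0.15 / (m - 1)] * (m - 1))   # low entropy, same support

for name, p in [("uniform", uniform), ("skewed", skewed)]:
    print(f"{name}: H = {entropy_bits(p):.3f} bits, "
          f"E[distinct] = {expected_distinct(p, n):.3f}")
```

With the support size pinned at $m = 16$, the high-entropy uniform distribution spreads the sample over far more distinct symbols than the skewed one, which is the kind of relationship the lemma quantifies.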
I’m sure there’s a better way to clean up these kinds of results to get something more “off the shelf” for general distributions. In both cases we couldn’t easily find these results anywhere else, so we ended up proving them as one-off lemmas.