I realized the other day that for two completely unrelated papers I’ve had to prove statements about how many unique elements appear in a sample from a distribution on a discrete set. The thing is, in both cases we only have a handle on the entropy of the distribution, not on the distribution itself.
Let $P$ be a distribution on a discrete set with entropy $H$, and consider a sample of size $n$ drawn i.i.d. from $P$. Let $U$ be the number of distinct symbols in the sample. Then

![]()
Let $p_1, p_2, \ldots$ be the probabilities of the different elements. There could be countably many, but we can write the expectation of the number of unique elements as

$$\mathbb{E}[U] = \sum_i \left( 1 - (1 - p_i)^n \right).$$

By doing a bit of series manipulation you can get the bound above. This came up (perhaps unsurprisingly) in some work on large-alphabet probability estimation.
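For concreteness, the identity above is just the usual indicator-variable computation. Here is a minimal sketch in the notation above, with the standard relaxation $(1-p)^n \le e^{-np}$ tacked on as one possible first step of that series manipulation (I’m not claiming this is the exact route to the bound in the lemma):

$$\begin{aligned} U &= \sum_i \mathbf{1}\{\text{symbol } i \text{ appears among the } n \text{ draws}\}, \\ \mathbb{E}[U] &= \sum_i \Pr[\text{symbol } i \text{ appears}] = \sum_i \left( 1 - (1 - p_i)^n \right) \ge \sum_i \left( 1 - e^{-n p_i} \right). \end{aligned}$$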
Let $P$ be a distribution with entropy $H$ and support size $k$, and consider a sample of size $n$ drawn i.i.d. from $P$. Let $U$ be the number of distinct symbols in the sample. Then

![]()
For this result we need a bound on the support of $P$ as well, which makes things a little easier. This result came up when we were proving bounds on adversarial strategies for eavesdropping. We needed to show that if the adversary performs some sort of list decoding based on what it has seen over the channel so far, then the list has some structure/distance properties. This lemma came in handy.
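The constants in the bounds aside, the expectation identity itself is easy to sanity-check numerically. Here is a minimal sketch, assuming a made-up geometric-ish distribution on $k = 50$ symbols and sample size $n = 30$ (none of these numbers come from either paper), comparing a Monte Carlo estimate of $\mathbb{E}[U]$ against the closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up distribution on k symbols (geometric-ish, then normalized);
# any discrete distribution would do here.
k, n = 50, 30
p = 0.5 ** np.arange(1, k + 1)
p /= p.sum()

# Closed form: E[U] = sum_i (1 - (1 - p_i)^n)
exact = np.sum(1.0 - (1.0 - p) ** n)

# Monte Carlo estimate of E[U] over many samples of size n
trials = 20000
counts = [len(np.unique(rng.choice(k, size=n, p=p))) for _ in range(trials)]
mc = np.mean(counts)

entropy = -np.sum(p * np.log2(p))  # entropy of p, in bits
print(f"entropy = {entropy:.3f} bits, support size = {k}")
print(f"E[U] closed form = {exact:.3f}, Monte Carlo = {mc:.3f}")
```

The two numbers should agree to within Monte Carlo error, and tweaking $k$ and the shape of $p$ gives a quick feel for how entropy and support size move $\mathbb{E}[U]$.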
I’m sure there’s a better way to clean up these kinds of results to get something more “off the shelf” for general distributions. In both cases we couldn’t easily find these results anywhere else, so we ended up proving them as one-off lemmas.