# Collisions from entropy bounded random variables

I realized the other day that for two completely unrelated papers I’ve had to prove statements about how many unique elements appear in a sample from distribution $P$ on a discrete set. The thing is we only have a handle on the entropy $H(P)$ but not on the distribution itself.

Let $P$ have entropy $H(P)$ and consider a sample of size $m$ drawn i.i.d. from $P$. Let $M$ be the number of distinct symbols in the sample. Then $\mathbb{E}[M] \le \frac{m H(P)}{\log m} + 1$

Let $\{p_i\}$ be the probabilities of the different elements. There could be countably many, but we can write the expectation of the number of unique elements as $\mathbb{E}[M] = \sum_{i=1}^{\infty} (1 - (1 - p_i)^m)$
By doing a bit of series manipulation you can get the bound above. This came up (perhaps unsurprisingly) in some work on large-alphabet probability estimation.

Let $P$ have entropy $H(P)$ and support size $K$ and consider a sample of size $n$ drawn i.i.d. from $P$. Let $M$ be the number of distinct symbols in the sample. Then $\mathbb{P}(M = n) \ge \left( \frac{H(P) - 1 - \log n}{\log K} \right)^{n-1}$

For this result we need a bound on the support of $P$ as well, which makes things a little easier. This result came up when we were proving bound on adversarial strategies when eavesdropping. We needed to show that if $P$ represents some sort of list-decoding based on what the adversary has seen over the channel so far, then the list has some structure/distance properties. This lemma came in handy.

I’m sure there’s a better way to clean up these kinds of results to get something more “off the shelf” for general distributions. In both cases we didn’t find these result easily somewhere else so we ended up proving them as one-off lemmas.

/