Suppose that we toss balls into \(b\) bins until some bin contains two balls. Each toss is independent, and each ball is equally likely to end up in any bin. What is the expected number of ball tosses?
This is the balls-and-bins version of the birthday paradox. Think of each ball toss as a person entering the room, and each bin as a possible birthday. We stop as soon as two balls land in the same bin (two people share a birthday).
Let \(T\) be the random variable representing the number of tosses until we get a collision (two balls in the same bin). We want to find \(\mathrm{E}[T]\).
We can partition the tosses into stages, similar to the birthday paradox analysis. Let \(T_i\) be the number of tosses between having \(i-1\) occupied bins and \(i\) occupied bins.
The first ball always goes into an empty bin, so \(T_1 = 1\). For the second ball, it lands in a new bin with probability \((b-1)/b\), and lands in the occupied bin with probability \(1/b\).
More generally, when we have \(i-1\) occupied bins, the probability that the next ball lands in a new bin is \((b-i+1)/b\), and the probability it creates a collision is \((i-1)/b\).
The number of tosses \(T_i\) until we either get a collision or occupy the \(i\)-th bin follows a geometric distribution. We have:
\[\Pr\{\text{no collision on a single toss} \mid i-1 \text{ bins occupied}\} = \frac{b-i+1}{b}\]The expected number of tosses to get a collision starting from \(i-1\) occupied bins is:
\[\mathrm{E}[T_i \mid \text{no collision yet}] = \frac{b}{b-i+1}\]However, we need to be careful. Let \(N\) be the number of distinct bins occupied before the first collision. Then:
\[T = T_1 + T_2 + \cdots + T_{N+1}\]where \(T_{N+1}\) is the final toss that creates the collision.
Let’s use a different approach. Let \(p_k\) be the probability that the first collision occurs on the \(k\)-th toss. For a collision on toss \(k\), we need:
- The first \(k-1\) balls all land in different bins
- The \(k\)-th ball lands in one of the already-occupied bins
The expected value is:
\[\begin{align*} \mathrm{E}[T] &= \sum_{k=2}^{b+1} k \cdot p_k \end{align*}\]This is complex to compute directly. Instead, let’s use the decomposition by stages.
The expected number of tosses is:
\[\begin{align*} \mathrm{E}[T] &= \sum_{i=1}^{b} \Pr\{T > i\} \\ &= 1 + \sum_{i=1}^{b-1} \Pr\{\text{first } i \text{ balls in different bins}\} \\ &= 1 + \sum_{i=1}^{b-1} \frac{b \cdot (b-1) \cdot (b-2) \cdots (b-i+1)}{b^i} \\ &= 1 + \sum_{i=1}^{b-1} \frac{b!}{(b-i)! \cdot b^i} \end{align*}\]For the case where \(b\) is large, we can use an approximation. The probability that we don’t have a collision after \(k\) tosses is approximately:
\[\Pr\{T > k\} \approx e^{-k(k-1)/(2b)}\]The median number of tosses (where \(\Pr\{T \leq k\} = 1/2\)) occurs when:
\[\begin{align*} e^{-k(k-1)/(2b)} &= 1/2 \\ k(k-1) &\approx 2b \ln 2 \\ k &\approx \sqrt{2b \ln 2} \\ k &\approx 1.177\sqrt{b} \end{align*}\]The expected value is similar:
\[\mathrm{E}[T] \approx \sqrt{\frac{\pi b}{2}} \approx 1.253\sqrt{b}\]Therefore, the expected number of ball tosses is \(\Theta(\sqrt{b})\).