probability - Expiring coupon collector's problem

The well-studied coupon collector's problem asks, given $N$ different coupons from which coupons are being drawn with equal probability and with replacement:

How many coupons do you expect to need to draw before having drawn each coupon at least once?

For large $N$, the expected number of draws is known to be $\Theta(N\log N)$. What is known about the following variation?

One coupon is drawn each day, and each coupon expires $M \ge N$ days after being drawn.
How many coupons do you expect to need to draw before having at least one unexpired copy of each coupon?

Presumably if $M$ is large enough (say, $M\in\Omega(N\log N)$), it will have no effect on the problem, and the expected number of draws will still be $\Theta(N\log N)$. On the other hand, if $M=N$, a perfect run of $N$ distinct coupons is required. For any $N$ consecutive days, the probability of this happening is
$$
\frac{N!}{N^N}\sim \frac{\sqrt{2\pi N}}{e^N},$$
and so the expected number of draws must grow exponentially with $N$. This is a large change in the expected time (from almost linear to exponential) corresponding to a small change in the expiration behavior (from almost linear to linear). Can anything precise be said about the intermediate cases?

Answer

This answer isn't rigorous in justifying approximations, but the result is confirmed numerically.

I'll call the $N$ different coupons colours to distinguish them more clearly from the coupons drawn.

Let $M=\alpha N\log N$, and consider the limit $N\to\infty$ for fixed $\alpha$. First, let's calculate the variance of the number of coupons drawn in the unmodified coupon collector's problem. As the expectation is obtained as the sum of the expectations of $N$ independent values, the variance is the sum of the variances of these values. The number of draws to get a new colour when $k$ colours are still missing is geometrically distributed with $p=k/N$ and thus expectation $1/p=N/k$ and variance $(1-p)/p^2=(N^2-kN)/k^2$. The sum of the expectations is the well-known result

$$
\sum_{k=1}^N\frac Nk=NH_N\sim N\log N\;,
$$

where $H_N$ is the $N$-th harmonic number. The sum of the variances is

$$
\sum_{k=1}^N\frac{N^2-kN}{k^2}\sim\frac{\pi^2}6N^2-N\log N\sim\frac{\pi^2}6N^2\;.

$$

Thus the standard deviation is asymptotically a fixed fraction $\pi/\sqrt6$ of $N$, and by Chebyshev's inequality for fixed $\alpha\gt1$ the process asymptotically almost surely ends before expiration sets in, so the expected number of coupons in this case is just the unmodified expected number $NH_N$.

On the other hand, for the same reason, for fixed $\alpha\lt1$ the process asymptotically almost surely doesn't end before expiration sets in, so the expected number of coupons in this case is $M$ plus the expected number of coupons drawn after the onset of expiration.

To estimate the latter, let's first estimate the probability that all $N$ colours are represented in $M$ uniformly independently drawn coupons. According to Byron's answer to this question, this is

$$
\sum_{k=0}^N (-1)^k {N\choose k}\left(1-{k\over N}\right)^M=\sum_{k=0}^N (-1)^k {N\choose k}\left(1-{k\over N}\right)^{\alpha N\log N}\;.

$$

We can approximate this by

$$
\sum_{k=0}^N (-1)^k {N\choose k}\mathrm e^{-k\alpha\log N}=\sum_{k=0}^N (-1)^k {N\choose k}\left(N^{-\alpha}\right)^k=\left(1-N^{-\alpha}\right)^N\sim\exp\left(-N^{1-\alpha}\right)
$$

for $N\to\infty$ if the terms of the series become negligible before the approximation breaks down. To check this, consider the logarithm of the absolute value of the (approximated) terms,

$$
\log\left(\binom Nk\mathrm e^{-k\alpha\log N}\right)\approx N\log N-k\log k-(N-k)\log(N-k)-k\alpha\log N\;,
$$

and set the derivative with respect to $k$ to zero:

$$
-\log k+\log(N-k)-\alpha\log N=0
$$

to find the maximum at $k=N/(1+N^\alpha)$. Thus for $N\to\infty$ the maximum shifts towards vanishing fractions of $N$, and the approximation is asymptotically valid.

Now a first estimate of the expected number of coupons drawn after the onset of expiration would be $\exp\left(N^{1-\alpha}\right)$, the result if at every draw the $M$ unexpired coupons were independent of the ones at previous draws. This already exhibits the desired feature of interpolating between exponential behaviour for $\alpha\to0$ and $N\log N$ behaviour for $\alpha\to1$. (Remember that $M=\alpha N\log N$ gets added to this to obtain the total expected number of coupons.)

To improve the estimate, we need to condition on the previous batches not containing all colours. Since asymptotically a batch almost surely doesn't contain all colours, the denominator in the definition of the conditional probability tends to $1$, and the probability for the current batch to contain all colours conditioned on the previous batches not containing all colours is asymptotically equal to the probability that the current batch contains all colours and the previous batches didn't.

The most important part of the condition, which is independent of the colours of recently expired coupons, is simply that the $M-1$ unexpired coupons we already had last time don't contain all $N$ colours. The probability that $M$ coupons contain all $N$ colours but the first $M-1$ of them don't is

$$
\begin{align}

&\sum_{k=0}^N (-1)^k {N\choose k}\left(1-{k\over N}\right)^M-\sum_{k=0}^N (-1)^k {N\choose k}\left(1-{k\over N}\right)^{M-1}
\\
\sim&\sum_{k=0}^N (-1)^k {N\choose k}\left(1-{k\over N}\right)^M-\sum_{k=0}^N (-1)^k {N\choose k}\left(1-{k\over N}\right)^M\left(1+{k\over N}\right)
\\
=&\sum_{k=0}^N (-1)^k {N\choose k}\left(1-{k\over N}\right)^M\left(-\frac kN\right)
\\
\sim&\sum_{k=0}^N (-1)^k {N\choose k}\left(N^{-\alpha}\right)^k\left(-\frac kN\right)
\\
=&N^{-\alpha}\left(1-N^{-\alpha}\right)^{N-1}
\\

\sim&N^{-\alpha}\exp\left(-N^{1-\alpha}\right)\;.
\end{align}
$$

Thus we obtain an improved estimate for the expected number of draws after the onset of expiration, $N^{\alpha}\exp\left(N^{1-\alpha}\right)$. In fact this will turn out to be asymptotically correct, but we need to check the effect of the conditions implied by the colours of the recently expired coupons.

To do so, imagine the drawing process reversed in time, with recently drawn coupons being removed and recently expired coupons being added. We can interpret the above calculation to show that, conditional on all $M$ coupons containing all $N$ colours, removing one coupon has a probability of $1-N^{-\alpha}$ of removing a unique colour, whereas with probability $N^{-\alpha}$ all colours remain represented. This result remains valid if we remove further coupons; the changes in $M$ and $N$ by $O(1)$ only change the result by a factor $1+O(N^{-1})$. Thus, asymptotically, conditional on all $M$ coupons containing all $N$ colours, each removed recently drawn coupon independently has a probability of $1-N^{-\alpha}$ of reducing the number of colours represented by one.

On the other hand, the recently expired coupons are not affected by the condition that our current set of coupons contains all colours, so the probability of regaining a particular missing colour by adding a recently expired coupon back in is just $1-N^{-1}$.

With this model, we can obtain a systematic expansion of the steady-state probability of completing the colours on a given draw, by considering increasing numbers of missing colours. I'll show the calculation for one additional missing colour, which is straightforward and demonstrates that the corrections don't affect the asymptotic behaviour.

We know that one colour immediately goes missing when we remove the coupon just drawn. Let $j+1$ be the number of recently drawn coupons we need to remove beyond that to lose another colour, and let $l+1$ be the number of expired coupons we have to recoup to replace the colour of the coupon just drawn. Then this history is excluded if $l\le j$, since in that case the colour just drawn gets replaced before another one goes missing, implying a full set of $N$ colours in the past. Thus we want the fraction of histories for which $l\gt j$. This is

$$
\begin{align}
&\sum_{j=0}^\infty N^{-\alpha}\left(1-N^{-\alpha}\right)^j\sum_{l=j+1}^\infty N^{-1}\left(1-N^{-1}\right)^l
\\
=&\sum_{j=0}^\infty N^{-\alpha}\left(1-N^{-\alpha}\right)^j\left(1-N^{-1}\right)^{j+1}
\\

\sim&\frac{N^{-\alpha}}{N^{-\alpha}+N^{-1}}
\\
=&
\frac1{1+N^{\alpha-1}}\;.
\end{align}
$$

Multiplying this by the probability $N^{-\alpha}\exp\left(-N^{1-\alpha}\right)$ and taking the reciprocal yields an improved estimate for the expected number of coupons drawn after the onset of expiration, $N^\alpha\exp\left(N^{1-\alpha}\right)\left(1+N^{\alpha-1}\right)$. Note that the correction doesn't affect the asymptotic behaviour, since $1+N^{\alpha-1}\sim1$.

I also carried out the calculations for two and three additional missing colours, which are a bit more involved. I won't bore you with the details; the result is that the expected number of coupons is multiplied by rational functions of $N^{\alpha-1}$ that go to $1$ for $N^{\alpha-1}\to0$. The expansion only seems to converge for rather small values of $N^{\alpha-1}$, but that doesn't matter asymptotically.

Thus, the analysis suggests that the expected number of coupons drawn after the onset of expiration is asymptotic to $N^{\alpha}\exp\left(N^{1-\alpha}\right)$. This is difficult to test numerically for most $\alpha$, since for $\alpha$ close to $1$ the expansion in $N^{1-\alpha}$ converges very slowly and for $\alpha$ close to $0$ the expected number of draws is prohibitively large. A reasonable compromise is $\alpha=0.8$, for which the following table shows the average number of coupons drawn after the onset of expiration in $5000$ runs for $N=10\cdot2^n$ with $n=0,\dotsc,12$ and $M$ the closest integer to $0.8N\log N$. Also shown is the ratio to the asymptotic result $N^{\alpha}\exp\left(N^{1-\alpha}\right)$ and to the result of the first-order correction, $N^{\alpha}\exp\left(N^{1-\alpha}\right)\left(1+N^{\alpha-1}\right)$. Both ratios appear to be approaching $1$, the corrected one more quickly.

$$
\begin{array}{r|r|r|r|r|r|r}
N&M&\langle\text{#draws}\rangle&N^{0.8}\exp(N^{0.2})&\cdot\,(1+N^{-0.2})&\text{ratio}&\text{corrected}\\\hline
10 & 18 & 28 & 31 & 50 & 0.9115 & 0.5589\\
20 & 48 & 62 & 68 & 105 & 0.9196 & 0.5936\\
40 & 118 & 158 & 155 & 229 & 1.0226 & 0.6918\\
80 & 280 & 428 & 368 & 521 & 1.1638 & 0.8217\\

160 & 650 & 1097 & 916 & 1247 & 1.1976 & 0.8790\\
320 & 1477 & 3019 & 2403 & 3161 & 1.2563 & 0.9550\\
640 & 3308 & 8994 & 6703 & 8544 & 1.3418 & 1.0527\\
1280 & 7326 & 25913 & 20055 & 24850 & 1.2921 & 1.0428\\
2560 & 16072 & 85089 & 65037 & 78573 & 1.3083 & 1.0829\\
5120 & 34984 & 294659 & 231341 & 273258 & 1.2737 & 1.0783\\
10240 & 75645 & 1122292 & 915127 & 1059479 & 1.2264 & 1.0593\\
20480 & 162647 & 4998493 & 4089855 & 4651474 & 1.2222 & 1.0746\\
40960 & 348008 & 24025351 & 21028673 & 23542526 & 1.1425 & 1.0205\\
\end{array}

$$

Here's the code I used to produce the table.

Blog

Sunday, September 2, 2018

probability - Expiring coupon collector's problem

No comments:

Post a Comment

analysis - Injection, making bijection