Abstract

Alec Campbell of Bellevue College writes: I've read about the birthday problem, and how you only need 23 randomly chosen people for there to be a 50% chance that two people share a birthday. But how many people would you need for there to be a 50% chance that every possible birthday is represented by at least one person?

Mario Cortina Borja replies: More than 365 people, clearly. But how many more? According to my estimates, you would need to gather together 2285 people for there to be a greater than 50% chance that all birthdays (excluding the leap year day of 29 February) are taken by at least one person, and more than 2980 for there to be a greater than 90% chance.

In his email to Significance, Alec says he became interested in this question when he noticed that “among our 3000 or so graduates last year only one birthday was not taken”. Based on my estimates, a group of that size had less than a 10% chance of leaving any birthday unclaimed.

I used simulation in the statistical software R to estimate these numbers. In general terms, I was looking to work out the probability of observing all possible birth dates among a sample of people, using the simplifying assumption that births within the population are uniformly distributed over all possible days.

Mathematically, we express this as estimating p(n, M), which is the probability of observing all the elements of the set D_n = {1, 2, …, n} in a sample of M subjects, assuming a uniform distribution over D_n. For birthdays, excluding 29 February, we have n = 365. To estimate p(n, M), I simulated B samples of size M using the R function p_hat (see box). I quickly found that M ≈ 3000 was an approximate solution, so I simulated B = 10 000 samples each for sizes 1200 ≤ M ≤ 5000 in increments of 10.
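For readers who want to experiment, the simulation idea described above can be sketched in a few lines. This is a minimal Python illustration, not the R function p_hat used for the article: the name estimate_p_all_days, the seed argument and the small value of B in the usage line are assumptions made for this example.

```python
import random


def estimate_p_all_days(n: int, m: int, b: int, seed: int = 0) -> float:
    """Monte Carlo estimate of p(n, M): the share of B simulated samples of
    size M in which every one of the n equally likely days appears.

    Hypothetical helper for illustration; not the article's R code.
    """
    rng = random.Random(seed)  # seeded so that repeated runs agree
    hits = 0
    for _ in range(b):
        # Draw M birthdays uniformly and keep only the distinct days.
        seen = {rng.randrange(n) for _ in range(m)}
        if len(seen) == n:
            hits += 1
    return hits / b


# Usage: a rough estimate near the reported median, with a small B for speed.
print(estimate_p_all_days(n=365, m=2285, b=200))
```

With B this small the estimate is noisy; the article's B = 10 000 per sample size gives much tighter values, at a correspondingly higher running time.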

The row marked uniform365d in Table 1 shows values for selected quantiles resulting from these simulations; the values were obtained as predictions from a smoothing spline model. I do not include the confidence intervals for these estimates, but they are quite tight. The median (0.5) is 2285 people, and the 0.9 quantile is 2980.

TABLE 1

Estimated quantiles for the modified birthday problem, using one uniform and two empirical distributions based on live births from England and Wales, 1979–2014

                 Probabilities
Distributions    0.01    0.10    0.50    0.90    0.99
uniform365d      1610    1858    2285    2980    3794
empirical366d    1657    1916    2435    3642    6758
empirical365d    1603    1862    2296    3002    3849
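Under the uniform assumption, the simulated quantiles can also be cross-checked without simulation. A standard inclusion-exclusion argument gives p(n, M) = Σ_{k=0}^{n} (−1)^k C(n, k) (1 − k/n)^M, and the sketch below evaluates it in exact integer arithmetic to avoid cancellation errors. This is an added illustration rather than the method used for Table 1; the function names and the search bound hi are assumptions made for this example.

```python
from fractions import Fraction
from math import comb


def p_all_days(n: int, m: int) -> Fraction:
    """Exact probability that m people cover all n equally likely days,
    by inclusion-exclusion over the set of days that are missed."""
    numerator = sum((-1) ** k * comb(n, k) * (n - k) ** m for k in range(n + 1))
    return Fraction(numerator, n ** m)


def smallest_m(n: int, target: Fraction, hi: int = 20000) -> int:
    """Smallest m with p_all_days(n, m) >= target, found by binary search.

    Relies on p(n, m) being non-decreasing in m and on hi (an assumed
    upper bound) being large enough to reach the target.
    """
    lo = n  # fewer than n people can never cover n days
    while lo < hi:
        mid = (lo + hi) // 2
        if p_all_days(n, mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return lo


# Usage: exact counterparts of the 0.5 and 0.9 entries in the uniform365d row.
print(smallest_m(365, Fraction(1, 2)), smallest_m(365, Fraction(9, 10)))
```

The exact values should land close to the simulated ones; small differences are to be expected, since the table entries come from a spline smoothed over Monte Carlo estimates.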

What would happen if we relaxed the assumption of uniformity of birthdays and a 365-day year? Using data provided by the Office for National Statistics, I considered the birthdays of the 23 872 409 live births registered in England and Wales between 1979 and 2014. This adjusts for (i) leap year births on 29 February, which constitute just 0.068% of all births; (ii) the excess of births in the last week of September, corresponding to conceptions in the Christmas holidays, and the deficit of births in the Christmas holidays, reflecting health services management policies; and (iii) the marked dependence on day of the week of birth, which is integrated out by accumulating the live birth frequencies by day of the year.

Clearly birth dates now vary in frequency, but how does this affect the distribution quantiles? The row in Table 1 marked empirical366d is based on the frequencies of live births including 29 February. The median of 2435 is 6.6% higher than that based on the uniform distribution, while the 0.9 quantile is 22% higher. To clarify this “leap day” effect, I omitted births on 29 February and re-estimated the empirical quantiles. Results in the row marked empirical365d show that the median and 0.9 quantile are now 2296 and 3002, each less than 1% greater than the corresponding uniform distribution quantiles.
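To see why one rare day matters so much, the simulation can be rerun with unequal day frequencies. The Python sketch below does not use the Office for National Statistics data: as a stated assumption, it gives 365 days equal weight and 29 February one quarter of that weight, which puts roughly 0.068% of births on the leap day but ignores the seasonal effects. The function name and the weights are hypothetical and are for illustration only.

```python
import random
from itertools import accumulate
from typing import Sequence


def estimate_p_weighted(weights: Sequence[float], m: int, b: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the probability that m people cover every day,
    when day i is drawn with probability proportional to weights[i]."""
    rng = random.Random(seed)
    days = range(len(weights))
    cum = list(accumulate(weights))  # cumulative weights, computed once
    hits = 0
    for _ in range(b):
        seen = set(rng.choices(days, cum_weights=cum, k=m))
        if len(seen) == len(weights):
            hits += 1
    return hits / b


# Assumed weights, NOT the ONS frequencies: 365 ordinary days plus a
# leap day that is one quarter as likely as any other day.
leap_weights = [1.0] * 365 + [0.25]

# Usage: coverage probability near the empirical366d median, small B for speed.
print(estimate_p_weighted(leap_weights, m=2435, b=200))
```

Replacing leap_weights with observed daily birth counts would bring the sketch closer to the empirical rows of Table 1, since random.choices only needs weights that are proportional to the day frequencies.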

Further reading

An alternative solution to this month's question is published at bit.ly/2fnWIDy

Author notes

Next month, we ask: What are the odds of a person becoming a statistician? As suggested by @BobOHara, via Twitter (bit.ly/2f8Mas7)

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)