How many duplicate UUIDs will you get?


A Universally Unique IDentifier (UUID) is a most probably unique $32$ digit hexadecimal string. That makes $16^{32} = 340,282,366,920,938,463,463,374,607,431,768,211,456$ possible values for these UUID codes.

There are a number of versions of UUID, so identifying which version you are using takes up a few bits of information in the id itself. $32$ hexadecimal digits means $\log_2(16) \cdot 32 = 4 \cdot 32 = 128$ bits of information to play with. I only care about version 4, which takes $4$ bits to encode, and variant 1, which takes $2$ bits. So, overall we have $122$ bits left over to generate. This means there are

\[2^{122} = 5,316,911,983,139,663,491,615,228,241,121,378,304\]

possible version 4, variant 1 UUID strings. What are the odds that uniformly sampling these ids will accidentally result in duplicate values? This is an instance of the birthday problem: when we have $m$ objects and we uniformly sample from them with replacement $k$ times, what is the probability that we draw any object at least twice? The answer can be found on Wikipedia, but my main focus here is on having fun with alternative ways to solve the problem and some funky units.
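
As a quick check on the bit accounting, Python’s built-in uuid module produces exactly these version 4, variant 1 ids (a little aside that isn’t in the notebook):

import uuid

u = uuid.uuid4()         # a random version 4, variant 1 UUID
print(u.version)         # 4
print(u.variant)         # 'specified in RFC 4122', i.e. variant 1
print(2**(128 - 4 - 2))  # 5316911983139663491615228241121378304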

The fairly unintuitive result of the birthday problem is that the probability of having at least two people who share the same birthday is surprisingly high even with a relatively small number of people in a room. Let $S$ denote this event and $U$ be its complement, the event where everyone has unique birthdays. Suppose there are $k$ people in the room. Then,

\[\begin{split} P(S) & = 1- P(U) = 1- \frac{365}{365}\frac{364}{365} \dots \frac{365 - k + 1}{365} \\ & = 1- \frac{365!}{365^k (365 - k)!} \end{split}\]

The counter-intuitive answer is that it only takes $k = 23$ people in the room for this probability to be over $1/2$. With a $100$ person class the probability of a shared birthday is $0.9999996927$, which might as well be guaranteed. Replace $365$ with $5,316,911,983,139,663,491,615,228,241,121,378,304$ (which I’ll call $m$ for short) and we can answer our problem. Let $U$ denote the event that each of the $k$ samples is unique. Then, the probability that this occurs is exactly

\[P(U) = \frac{m!}{m^k (m - k)!}\]
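
Before plugging in the huge $m$, those classic birthday numbers are easy to verify with a few lines of Python (a quick check that isn’t in the notebook):

import math

def p_shared(k, m=365):
    # Probability that at least two of k uniform draws from m values collide
    p_unique = math.prod((m - i) / m for i in range(k))
    return 1 - p_unique

print(p_shared(23))   # ~0.5073, just over a half
print(p_shared(100))  # ~0.9999996927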

If we have $100$ people in a room and each of them samples a UUID, the probability that each id is unique is roughly $0.99999999999999999999999999999999906900847$, which truly may as well be $1$. So, we’ll need some bigger numbers. Matt Parker, of Stand-up Maths YouTube fame, proposed the Ten Billion Human Second Century (TBHSC) as a unit to measure the probability of rare things occurring. If $10$ billion people, very roughly the world’s current population, did something exactly once a second for a century, they would have done this thing about $1\ \text{TBHSC} \approx 3 \times 10^{19}$ times. Trying to find $P(U)$ led me to a fun (although probably entirely useless) way of approximating products with very large terms. First, let’s see what the real answer is. Here’s a link to a Jupyter notebook I wrote for this. I’ll also post little dropdown code blocks as we go.
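
Just to check the unit arithmetic: ten billion people drawing once per second for a hundred years is

\[10^{10} \times 100 \times 365.25 \times 24 \times 3600 \approx 3.16 \times 10^{19}\]

draws, which is where the (rounded) constant $h$ in the code below comes from.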


from mpmath import mp, mpf, fac, gamma, floor # mpmath will help us with big numbers

# First, we'll initialize our two big constants
mp.dps = 100 # 100 decimal places of accuracy
m = mpf('5316911983139663491615228241121378304') # number of unique UUIDs - order 10^36
h = mpf('30000000000000000000')             # a Ten Billion Human Second Century, ~3e19 draws

def birthday_solution(m, k):
    return fac(m)/ fac(m-k) / m**k

prob_of_a_unique = birthday_solution(m,h)
print(f'The probability each draw is unique is: {prob_of_a_unique}')
print(f'So the probability that we draw a duplicate is: {1 - prob_of_a_unique}')   

\[\frac{m!}{m^{\text{TBHSC}} (m - \text{TBHSC})!} \approx 1.750769 \times 10^{-37}\]

which was really counter-intuitive to me. I really thought that the TBHSC samples stood a good shot at being unique, but I would not personally call something at $10^{-37}$ a good shot. OK, let’s see how I approximated this.


a_big_value = 1000000
prod = 1 / m**a_big_value
for j in range(a_big_value):
    prod *= (m-j) # In reality we would use some log-sum-exp trick here

# The above took about 5 seconds to run on my machine. Since the cost is linear,
# the full calculation would take about
years = h / a_big_value * 5 / 3600 / 24 / 365 
print(f'The full calculation would take about {years} years to run')

Before approximating anything, I tried to run a for loop on my laptop. After waiting about a minute, I stopped and tried running a much, much smaller for loop to guess how long it would take the code to finish. It took my computer about $5$ seconds to get through $1,000,000$ iterations, which means that I would have to wait about $\frac{h}{1000000} \cdot \frac{5}{3600 \cdot 24 \cdot 365} \approx 4,756,468$ years for it to finish running. Not wanting to wait that long, I tried to think of another way. (I tried thinking of anything except for the right way to do it.)
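
For reference, the standard back-of-the-envelope route (not in the notebook) uses $1 - x \approx e^{-x}$ to get $P(U) \approx e^{-k(k-1)/(2m)}$, which evaluates instantly:

from mpmath import mp, mpf, exp

mp.dps = 50
m = mpf('5316911983139663491615228241121378304')
h = mpf('30000000000000000000')

# P(U) = prod_j (1 - j/m) ~ exp(-sum_j j/m) = exp(-k(k-1)/(2m))
print(exp(-h * (h - 1) / (2 * m)))  # ~1.75e-37, agreeing with the exact answer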

Since the terms are monotonically decreasing, we are able to bound the product from above by taking the largest term and pretending every other term is equal to it. Similarly, we can bound it from below by using the smallest term. For the case where $m=10$ and $k=4$ this bound is expressed

\[\left( \frac{10 - 3}{10}\right)^4 \leq P(\text{unique}) \leq 1^4\]

which is a pretty awful bound. Numerically, it is about

\[0.2401 \leq 0.504 \leq 1\]

But we can do better. The above approach lumps all the terms into $1$ group, represented by either the smallest or the biggest term. We could just as easily divide them into $d$ groups. Before going that general, let’s say $d=2$. There will be a little thinking to do when $d$ doesn’t divide $k$, but we’ll just be left with an extra remainder group. Splitting into $2$ groups, using otherwise the same values as above, makes

\[\left( \frac{10 - 3}{10} \right)^2 \left( \frac{10 - 1}{10} \right)^2 \leq P(\text{unique}) \leq \left( \frac{10-2}{10} \right)^2 \cdot 1^2\]

which is approximately

\[0.3969 \leq 0.504 \leq 0.64\]

still not great, but, y’know, better. And I’m just going to claim it without proof, but we are able to extend this approach to arbitrary $d$, all the way up to $d=k$, at which point both bounds become the exact product. So, let’s go ahead and extend it. We’ll have to deal with the remainder terms and the edge cases when $d$ does not divide $k$, but with enough trial and error and/or thought this problem is not too bad. The generalization gives, for $d \leq k$, the bounds

\[\left( \frac{m - k + 1}{m} \right)^{k \bmod d} \prod_{j=1}^d \left( \frac{m - j\cdot \lfloor \frac{k}{d} \rfloor + 1}{m} \right)^{\lfloor \frac{k}{d} \rfloor} \leq P(\text{unique}) \leq \prod_{j=0}^{d-1} \left( \frac{m - (k \bmod d) - j \cdot \lfloor \frac{k}{d} \rfloor}{m} \right)^{\lfloor \frac{k}{d} \rfloor}\]

A drawing follows right after this to help explain the bounds. It takes kind of a Riemann sum-like approach.

Illustration of the lower bound

Illustration of the upper bound
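
To convince yourself the general formula is right, it helps to check it against the little $m=10$, $k=4$, $d=2$ example, where the exact answer is $0.504$ (a standalone check; the real bound functions come in the next block):

m_toy, k_toy, d_toy = 10, 4, 2
flr, r = k_toy // d_toy, k_toy % d_toy

# Lower bound: represent each group by its smallest term
lower = ((m_toy - k_toy + 1) / m_toy)**r  # remainder group, empty here since r = 0
for j in range(1, d_toy + 1):
    lower *= ((m_toy - j*flr + 1) / m_toy)**flr

# Upper bound: represent each group by its largest term
upper = 1.0
for j in range(d_toy):
    upper *= ((m_toy - r - j*flr) / m_toy)**flr

print(lower, upper)  # 0.3969 and 0.64, matching the d=2 bounds above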

Using $d=10000$ groups, we’re able to get a pretty good estimate, yielding

\[1.7360 \times 10^{-37} \leq \frac{m!}{m^{\text{TBHSC}} (m - \text{TBHSC})!} \leq 1.7656 \times 10^{-37}\]

def lower_bound(m,k,d):
    flr = floor(k / d)
    bound = ((m-k+1) / m)**(k % d)
    for j in range(1,d+1):
        bound *= ((m-j*flr + 1)/ m) **flr

    return bound

def upper_bound(m,k,d):
    bound = 1
    r = k % d # Number of stragglers in the remainder group
    flr = floor(k / d)
    for j in range(0,d):
        bound *= ((m - r - j*flr) / m) ** flr

    return bound

# Using these we can bound the probability that a Ten Billion Human Second Century of UUIDs is entirely duplicate-free
print(f'Lower bound on the probability a TBHSC of UUIDs is all unique: {lower_bound(m,h,10000)}') 
print(f'Upper bound on the probability a TBHSC of UUIDs is all unique: {upper_bound(m,h,10000)}') 
print(f'The real probability: {birthday_solution(m,h)}')  

giving us the order of magnitude and the first two significant digits. I was honestly pretty satisfied with this. Now let’s wrap things up by having some fun making a new unit of measurement.

Let’s do one more thing and compute the Ten Billion Human Second Bad Stuff 50% Threshold, or the TBHSBS-50, for UUIDs. This is a new unit that I am making up. It measures the amount of time it would take to have a $50\%$ chance of drawing duplicate UUIDs (the bad stuff) if ten billion people each drew a single UUID every second. We already know the probability that every draw is unique, so we just have to solve for the $k$ at which this probability equals $0.5$. That is, we need to solve

\[\frac{m!}{m^k (m - k)!} = 0.5\]

for $k$ through a root-finding method to get a really good answer. But just by guess-and-check, the number should be around $k = 2,700,000,000,000,000,000$. OK, performing a bisection search we find that

\[\frac{m!}{m^{2714922669395445310} \, (m - 2714922669395445310)!} = 0.50000000000000000043\]

def bisection_search(func, a, b, epsilon=1e-10, max_iterations=100):
    """
    This code was written by ChatGPT. Performs a bisection search to find the root of a continuous function within a given interval.
    """

    if func(a) * func(b) > 0:
        raise ValueError("The function must have opposite signs at the endpoints of the interval.")

    iteration = 0
    while iteration < max_iterations:        
        c = (a + b) / 2
        if abs(func(c)) < epsilon:
            print('Tolerance achieved')
            return c
        elif func(a) * func(c) < 0:
            b = c
        else:
            a = c
        iteration += 1

    return (a + b) / 2

f = lambda x: gamma(m+1) / gamma(m-x+1) / (m**x) - 0.5 # m! = gamma(m+1) and (m-x)! = gamma(m-x+1)
# By guess and check I found some decent starting values
a = mpf('2700000000000000000')
b = mpf('2800000000000000000')
x = bisection_search(f, a = a, b= b, epsilon=1e-40)
print(f'We require {round(x)} draws to have a {1 - birthday_solution(m,x)} chance that our set of draws contains a duplicate')

ten_billion = 10000000000
years = round(x) / 3600 / 24 / 365 / ten_billion
print(f'This translates to {years} years!')

So generating $2,714,922,669,395,445,310$ UUIDs gives about a $50\%$ chance of having at least one duplicate. This means that the TBHSBS-50 of UUIDs is about $8.6$ years.
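
As a closing sanity check, the standard approximation $P(U) \approx e^{-k^2/(2m)}$ puts the $50\%$ threshold at roughly $\sqrt{2m \ln 2}$, which agrees nicely with the bisection result (a quick extra snippet, not in the notebook):

from mpmath import mp, mpf, sqrt, log

mp.dps = 50
m = mpf('5316911983139663491615228241121378304')

k50 = sqrt(2 * m * log(2))                    # ~2.715e18 draws
years = k50 / mpf(10)**10 / 3600 / 24 / 365   # at 1e10 draws per second
print(k50, years)                             # ~8.6 years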