That was what I tweeted the other day. Of course, it’s an homage to this famous xkcd cartoon. The tweet was a link to this Stack Overflow question and my answer. This article provides a bit more background and describes a little adventure I had computing the probability I gave in the answer.
The background is that the original poster (OP) of the question was generating a million data objects and was storing them in a TreeSet. But the TreeSet ended up with only 975,000 elements in it. The obvious reason that there would be fewer elements in a set than were added is that some of the data objects are duplicates. Somebody asked about this, and the OP said the chance of generating duplicate data objects was “minuscule.” To somebody like me, this is like waving a red flag in front of a bull, so I had to go investigate. (You know, Duty Calls.)
I estimated that there was a possible space of 18,000,000 data objects, and the OP was generating 1,000,000 of them at random. What's the probability that there's at least one pair of duplicates among the generated objects? This is a variation of the Birthday Problem, also known as the Birthday Paradox. It's not really a paradox. It is, however, quite counterintuitive how quickly the probability of a duplicate approaches certainty as the number of trials increases.
Briefly, the birthday problem is, given a certain number of people, what’s the probability that two will have the same birthday? The probability reaches 50% at only 23 people, and at 70 people it has risen above 99.9%. Most people find this pretty surprising. Certainly the OP did; given 1,000,000 generated objects out of a space of 18,000,000, the probability is not minuscule at all, but is in fact astonishingly close to 100%.
It's actually a bit easier to talk about the probability of there not being a duplicate, that is, the probability of the choices all being unique, so I'll talk about that. (Of course, the probability of the choices being unique is simply 1.0 minus the probability of a collision.) The Wikipedia article gives a couple of formulas for computing this probability. One is an approximation:

\bar{p}(n; d) \approx \left( \frac{d-1}{d} \right)^{n(n-1)/2}
The second is an exact product involving a large number of factors:

\bar{p}(n; d) = \prod_{k=1}^{n-1} \left( 1 - \frac{k}{d} \right)
In both formulas, d is the number of possible values in the domain, and n is the number of elements chosen. Naturally, the closed form approximation involves many fewer computations, so let’s start with that one.
What should we use to do the computation? Well, I'm an old Unix hack, so I immediately reached for the venerable bc program. First let's try some of the cases from the original birthday problem to see if we have this right (the program's output appears on the line after each input expression):
$ bc -l
(364/365)^(23*22/2)
.49952284596341798480
(364/365)^(70*69/2)
.00132609259546606814
These are only approximations, but they seem about right. Let’s try the exact computations:
p=1.0
for (k=1; k<23; k++) p *= (1 - k/365)
p
.49270276567601459277
p=1.0
for (k=1; k<70; k++) p *= (1 - k/365)
p
.00084042403484290862
The result for 23 people matches the figure given in the Wikipedia article (at least, to six decimal places) so it seems accurate. Great! Now let’s try the real problem.
d=18000000
n=1000000
((d-1)/d)^(n*(n-1)/2)
Runtime error (func=(main), adr=19): exponent too large in raise
Hm, that didn't work. If the power operator isn't up to the job, let's try the old trick of taking the logarithm, multiplying, and then exponentiating (with bc -l, l() is the natural logarithm and e() is the exponential function):
e(l((d-1)/d)*n*(n-1)/2)
I let this run at 100% CPU for five minutes and I didn’t get any output. I don’t know whether it was an infinite loop or what, but it certainly didn’t seem promising. All right then, let’s just try the exact computation:
p=1.0
for (k=1; k<n; k++) p *= (d-k)/d
p
0
Zero. Crap, underflow. The probabilities get pretty small, so I guess I shouldn’t be surprised. Let’s try Java instead.
static void doubleApprox() {
    double d = 18_000_000.0;
    double n = 1_000_000.0;
    System.out.println(Math.pow((d-1.0)/d, n*(n-1.0)/2.0));
}

0.0
Underflow again. At least it ran quickly instead of looping infinitely. Let’s try the exact computation:
static void doubleProduct() {
    int d = 18_000_000;
    int n = 1_000_000;
    double p = 1.0;

    for (int k = 1; k < n; k++) {
        p *= (double)(d - k) / (double)d;
    }
    System.out.println(p);
}

4.4E-323
Aha! Now we’re getting somewhere. I put this into the initial version of my answer and declared it done.
∞
But there were a couple of suspicious things nagging me about this result. First, the exponent of -323 seemed awfully familiar. Second, there are only two digits of precision, whereas a floating-point double usually gives 15 to 17 significant digits. It turns out that this result is very close to Double.MIN_VALUE, which is about 4.9E-324. When values are this small, they are denormalized (subnormal): as they get smaller, they have fewer and fewer bits, and therefore digits, of precision. With such a huge loss of precision, continued multiplication by a fraction such as (d - k) / d becomes highly inaccurate.
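To see just how little precision is left down there, here's a quick sketch of my own (not part of the original investigation) that pokes at the subnormal range:

// A quick demo (not from the original post) of how little precision
// survives in the subnormal range near Double.MIN_VALUE.
static void subnormalDemo() {
    System.out.println(Double.MIN_VALUE);     // 4.9E-324, the smallest subnormal
    System.out.println(Double.MIN_NORMAL);    // 2.2250738585072014E-308
    double p = 4.4E-323;                      // nine ulps above zero
    System.out.println(Math.nextUp(p));       // 4.9E-323: the adjacent double is
                                              // about 11% larger, so barely one
                                              // digit of precision remains
}

Down there, the gap between adjacent representable values is around ten percent of the value itself, so each of the million multiplications can discard most of whatever accuracy remains.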
It turns out that this result of 4.4E-323 is incredibly inaccurate. (In fact, as we'll see later, it's off by nearly twelve thousand orders of magnitude.) In order to combat the underflow problem, I put a little hack into the loop to scale up the partial product by 10 until it was above 1.0. That keeps the values well within range, so we avoid the precision loss. Of course, I kept track of the number of times I scaled by 10. The count is negative because scaling up by 10 corresponds to a negative power of ten in the final result. (I have no idea whether this is acceptable numeric computation practice, but it seemed to work out in the end.) Here's the code to do that, and the result.
static void doubleScaled() {
    int d = 18_000_000;
    int n = 1_000_000;
    int scale = 0;
    double p = 1.0;

    for (int k = 1; k < n; k++) {
        p *= (double)(d - k) / (double)d;
        while (p < 1.0) {
            p *= 10.0;
            scale--;
        }
    }
    System.out.printf("%11.9fE%d%n", p, scale);
}

2.843374644E-12294
Ten to the minus twelve thousandth? Nah, that can’t be right. Can it?
I wasn’t sure how to verify this, so I talked to my friend and colleague Joe Darcy (blog, twitter). He suggested I use the floating-point mode of Java’s BigDecimal. Floating point? I thought BigDecimal only supported fixed-point arithmetic. In fact, there are variations of the BigDecimal operations that take a MathContext object, and if you set it up properly, it will perform floating point decimal arithmetic. Cool! Joe also mentioned that when used in this mode, BigDecimal stores its exponent as an int, so this should help avoid underflow.
Let’s try out the approximation first:
// requires: import java.math.BigDecimal;
//           import java.math.MathContext;
//           import java.math.RoundingMode;
static void bdApprox() {
    int id = 18_000_000;
    int n = 1_000_000;
    MathContext mc = new MathContext(10, RoundingMode.HALF_EVEN);
    BigDecimal d = new BigDecimal(id, mc);
    BigDecimal base = d.subtract(BigDecimal.ONE, mc).divide(d, mc);
    BigDecimal result = base.pow(n * (n - 1) / 2, mc);
    System.out.println(result);
}

622319181.9
WAT. This is totally wrong. Has Joe led me astray? Well, no. It turns out that BigDecimal.pow() takes its exponent as an int, and the expression n * (n - 1) / 2 overflows int arithmetic: with n = 1,000,000 the intended exponent is 499,999,500,000, far beyond Integer.MAX_VALUE, so pow() was handed a garbage exponent. Whoops.
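Here's a two-line demonstration (my addition, not from the original exchange) of what the exponent expression actually produces:

// My addition: showing how the exponent computation goes wrong.
static void overflowDemo() {
    int n = 1_000_000;
    System.out.println(n * (n - 1) / 2);        // -364189984: n*(n-1) wraps
                                                // around in 32-bit int arithmetic
    System.out.println((long) n * (n - 1) / 2); // 499999500000: the intended exponent
}

Raising a base slightly less than 1 to a large negative exponent is exactly how we ended up with that huge bogus value. And even computed correctly, 499,999,500,000 doesn't fit in an int, so there's no way to hand BigDecimal.pow() the exponent we need. All right then, let's just go straight to the exact product computation: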
static void bdExact() {
    int id = 18_000_000;
    int n = 1_000_000;
    MathContext mc = new MathContext(20, RoundingMode.HALF_EVEN);
    BigDecimal prob = new BigDecimal(1, mc);
    BigDecimal d = new BigDecimal(id, mc);

    for (int k = 1; k < n; k++) {
        BigDecimal num = new BigDecimal(id - k, mc);
        prob = prob.multiply(num, mc)
                   .divide(d, mc);
    }
    System.out.println(prob);
}

2.8433746444606670057E-12294
Whoa. Look at that: the same answer, to as many significant digits as I printed out from the scaled double precision computation.
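As one more sanity check (my own addition, not part of the original exchange), we can avoid underflow entirely by summing logarithms instead of multiplying probabilities:

// Cross-check (not from the original post): accumulate log10 of each factor;
// the running sum stays around -12000, nowhere near overflow or underflow.
static void logSum() {
    int d = 18_000_000;
    int n = 1_000_000;
    double log10p = 0.0;

    for (int k = 1; k < n; k++) {
        log10p += Math.log10((double)(d - k) / (double)d);
    }
    // Should print approximately -12293.546, i.e.
    // p ≈ 10^0.454 × 10^-12294 ≈ 2.84E-12294.
    System.out.println(log10p);
}

The exponent and leading digits line up with both the scaled double loop and the BigDecimal computation.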
That’s a pretty amazing number. The probability of choosing 1,000,000 unique, random values from a space of 18,000,000 is ten to the minus fricken’ twelve thousand. That’s what I call minuscule. And it totally explains why the Stack Overflow poster was getting duplicates.
Math. It works, bitches.
And BigDecimal too.