Archive for the ‘Software’ Category

That was what I tweeted the other day. Of course, it’s an homage to this famous xkcd cartoon. The tweet was a link to this Stack Overflow question and my answer. This article provides a bit more background and describes a little adventure I had computing the probability I gave in the answer.

The background is that the original poster (OP) of the question was generating a million data objects and was storing them in a TreeSet. But the TreeSet ended up with only 975,000 elements in it. The obvious reason that there would be fewer elements in a set than were added is that some of the data objects are duplicates. Somebody asked about this, and the OP said the chance of generating duplicate data objects was “minuscule.” To somebody like me, this is like waving a red flag in front of a bull, so I had to go investigate. (You know, Duty Calls.)
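
To see the effect in miniature, here's a tiny sketch (my own illustration, not the OP's actual code; the data objects are reduced to plain Integers) of how a TreeSet silently collapses duplicates:

static void treeSetDemo() {
    java.util.Random random = new java.util.Random();
    java.util.TreeSet<Integer> set = new java.util.TreeSet<>();
    for (int i = 0; i < 1_000_000; i++) {
        // one of 18,000,000 possible values, roughly the space estimated below
        set.add(random.nextInt(18_000_000));
    }
    // prints noticeably less than 1,000,000: add() silently drops values already present
    System.out.println(set.size());
}

Nothing fails and nothing is reported; the set just ends up smaller than the number of add() calls.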

I estimated that there was a possible space of 18,000,000 data objects, and the OP was generating 1,000,000 of them at random. What's the probability of there being at least one pair of duplicates among the generated objects? This is a variation of the Birthday Problem, also known as the Birthday Paradox. It's not really a paradox. It is, however, quite counterintuitive how quickly the probability approaches certainty that there will be a duplicate as the number of trials increases.

Briefly, the birthday problem is, given a certain number of people, what’s the probability that two will have the same birthday? The probability reaches 50% at only 23 people, and at 70 people it has risen above 99.9%. Most people find this pretty surprising. Certainly the OP did; given 1,000,000 generated objects out of a space of 18,000,000, the probability is not minuscule at all, but is in fact astonishingly close to 100%.

It’s actually a bit easier to talk about the probability of there not being a duplicate, that is, the probability of the choices all being unique, so I’ll talk about that. (Of course, the probability of the choices being unique is simply 1.0 minus the probability of a collision.) The Wikipedia article gives a couple formulas for computing this probability. One is an approximation:

\displaystyle \left(\frac{d - 1}{d}\right)^{n(n-1)/2}

The second is a product involving a large number of factors:

\displaystyle \prod\limits_{k=1}^{n-1}(1 - \textstyle\frac{k}{d})

In both formulas, d is the number of possible values in the domain, and n is the number of elements chosen. Naturally, the closed form approximation involves many fewer computations, so let’s start with that one.

What should we use to do the computation? Well I’m an old Unix hack, so I immediately reached for the venerable bc program. First let’s try some of the cases from the original birthday problem to see if we have this right (bold italic text is the program’s output):

$ bc -l
(364/365)^(23*22/2)
.49952284596341798480
(364/365)^(70*69/2)
.00132609259546606814

These are only approximations, but they seem about right. Let’s try the exact computations:

p=1.0
for (k=1; k<23; k++) p *= (1 - k/365)
p
.49270276567601459277
p=1.0
for (k=1; k<70; k++) p *= (1 - k/365)
p
.00084042403484290862

The result for 23 people matches the figure given in the Wikipedia article (at least, to six decimal places) so it seems accurate. Great! Now let’s try the real problem.

d=18000000
n=1000000
((d-1)/d)^(n*(n-1)/2)
Runtime error (func=(main), adr=19): exponent too large in raise

Hm, that didn’t work. If the power operator isn’t working, let’s try the old trick of taking the logarithm, multiplying, and then exponentiating:

e(l((d-1)/d)*n*(n-1)/2)

I let this run at 100% CPU for five minutes and I didn’t get any output. I don’t know whether it was an infinite loop or what, but it certainly didn’t seem promising. All right then, let’s just try the exact computation:

p=1.0
for (k=1; k<n; k++) p *= (d-k)/d
p
0

Zero. Crap, underflow. The probabilities get pretty small, so I guess I shouldn’t be surprised. Let’s try Java instead.

static void doubleApprox() {
    double d = 18_000_000.0;
    double n =  1_000_000.0;
    System.out.println(Math.pow((d-1.0)/d, n*(n-1.0)/2.0));
}
0.0

Underflow again. At least it ran quickly instead of looping infinitely. Let’s try the exact computation:

static void doubleProduct() {
    int d = 18_000_000;
    int n =  1_000_000;
    double p = 1.0;
    for (int k = 1; k < n; k++) {
        p *= (double)(d - k) / (double)d;
    }
    System.out.println(p);
}
4.4E-323

Aha! Now we’re getting somewhere. I put this into the initial version of my answer and declared it done.

But there were a couple of suspicious things nagging me about this result. First, the exponent of -323 seemed awfully familiar. Second, there are only two digits of precision. Usually a floating-point double gives about 17 significant digits. It turns out that this result is very close to Double.MIN_VALUE, which is about 4.9E-324. When the numbers are this small, they are denormalized. As they get smaller, they have fewer and fewer digits of precision. With such a huge loss of precision, continued multiplication by a fraction such as (d - k) / d becomes highly inaccurate.
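
To get a feel for just how little precision is left at that end of the double range, here's a quick throwaway sketch (an illustration, not part of the original answer):

static void subnormalDemo() {
    double tiny = 4.4E-323;                       // a denormalized (subnormal) value
    System.out.println(Double.MIN_VALUE);         // 4.9E-324, the smallest positive double
    System.out.println(tiny / Double.MIN_VALUE);  // 9.0: tiny is only the ninth-smallest positive double
    // multiplying by a fraction very close to 1 doesn't change it at all,
    // so a running product down here gets stuck instead of continuing to shrink
    System.out.println(tiny * 0.9999999);         // still prints 4.4E-323
}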

It turns out that this result of 4.4E-323 is incredibly inaccurate. (In fact, as we'll see later, it's off by nearly twelve thousand orders of magnitude.) In order to combat the underflow problem, I put a little hack into the loop to scale up the partial product by 10 until it was at least 1.0. That should keep the values well within range, so we avoid precision loss. Of course, I kept track of the number of times I scaled by 10. The scale count is negative because each scaling up by 10 corresponds to a factor of 10^-1 in the final result. (I have no idea whether this is acceptable numeric computation practice, but it seemed to work out in the end.) Here's the code to do that, and the result.

static void doubleScaled() {
    int d = 18_000_000;
    int n =  1_000_000;
    int scale = 0;
    double p = 1.0;
    for (int k = 1; k < n; k++) {
        p *= (double)(d - k) / (double)d;
        while (p < 1.0) {
            p *= 10.0;
            scale--;
        }
    }
    System.out.printf("%11.9fE%d%n", p, scale);
}
2.843374644E-12294

Ten to the minus twelve thousandth? Nah, that can’t be right. Can it?

I wasn’t sure how to verify this, so I talked to my friend and colleague Joe Darcy (blog, twitter). He suggested I use the floating-point mode of Java’s BigDecimal. Floating point? I thought BigDecimal only supported fixed-point arithmetic. In fact, there are variations of the BigDecimal operations that take a MathContext object, and if you set it up properly, it will perform floating point decimal arithmetic. Cool! Joe also mentioned that when used in this mode, BigDecimal stores its exponent as an int, so this should help avoid underflow.

Let’s try out the approximation first:

static void bdApprox() {
    int id = 18_000_000;
    int n  =  1_000_000;
    MathContext mc = new MathContext(10, RoundingMode.HALF_EVEN);
    BigDecimal d = new BigDecimal(id, mc);
    BigDecimal base = d.subtract(BigDecimal.ONE, mc).divide(d, mc);
    BigDecimal result = base.pow(n * (n - 1) / 2, mc);
    System.out.println(result);
}
622319181.9

WAT. This is totally wrong. Has Joe led me astray? Well, no. It turns out that BigDecimal.pow() takes an int argument as the exponent, and with n = 1,000,000 the expression n * (n - 1) / 2 clearly overflows an int. Whoops.
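
It's easy to see the overflow directly (a quick check, not from the original answer):

static void exponentOverflow() {
    int n = 1_000_000;
    System.out.println(n * (n - 1) / 2);         // -364189984: wrapped around to a negative int
    System.out.println((long) n * (n - 1) / 2);  // 499999500000: the exponent we actually wanted
}

And raising a base slightly less than one to a huge negative power gives a huge positive number, which is why the bogus result came out in the hundreds of millions instead of being vanishingly small. All right then, let's just go straight to the exact product computation: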

static void bdExact() {
    int id = 18_000_000;
    int n  =  1_000_000;
    MathContext mc = new MathContext(20, RoundingMode.HALF_EVEN);
    BigDecimal prob = new BigDecimal(1, mc);
    BigDecimal d = new BigDecimal(id, mc);

    for (int k = 1; k < n; k++) {
        BigDecimal num = new BigDecimal(id - k, mc);
        prob = prob.multiply(num, mc)
                   .divide(d, mc);
    }

    System.out.println(prob);
}
2.8433746444606670057E-12294

Whoa. Look at that: the same answer, to as many significant digits as I printed out from the scaled double precision computation.
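
As yet another sanity check (an aside, not part of the original answer), you can sidestep underflow entirely by summing logarithms in plain double arithmetic, since each individual factor is comfortably within double range:

static void logSum() {
    int d = 18_000_000;
    int n =  1_000_000;
    double log10p = 0.0;
    for (int k = 1; k < n; k++) {
        log10p += Math.log10((double)(d - k) / d);
    }
    // log10 of the probability: about -12293.5, consistent with 2.84E-12294
    System.out.println(log10p);
}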

That’s a pretty amazing number. The probability of choosing 1,000,000 unique, random values from a space of 18,000,000 is ten to the minus fricken’ twelve thousand. That’s what I call minuscule. And it totally explains why the Stack Overflow poster was getting duplicates.

Math. It works, bitches.

And BigDecimal too.


Earlier this month I had the pleasure of attending Donald Knuth’s 20th annual Christmas Tree Lecture. (announcement, video)

For me it was quite a bit of nostalgia. When I was a Computer Science student at Stanford, uh, a number of decades ago, of course I had to take a class from Knuth, who was (and still is) a Very Famous Computer Science Professor. The class was nominally a computer science class, but it was mostly math. In fact, it was all math. I’m good at math, having completed several semesters of college-level math even after placing out of introductory calculus classes. However, I’m not really, really good at math, which is what you have to be to keep up with Knuth. As a result I spent most of that class wallowing in confusion.

Knuth’s Christmas Tree lectures aren’t about Christmas trees, of course, but they are ostensibly about the Computer Science kind of trees, and they occur around Christmastime. Hence, Christmas Tree. But they aren’t so much about Computer Science as they are about math. So it was quite a familiar feeling for me to sit in a Stanford lecture hall, for the first time in decades, listening to Knuth give a lecture on math, with me wallowing in confusion most of the time.

A few years after I left Stanford, Knuth (along with Ronald Graham and Oren Patashnik) published Concrete Mathematics:

[Book cover: Concrete Mathematics]

(Amazon link)

For some reason I was compelled to buy it. I probably bought it because it must be a Very Important Computer Science Book because it has the name of a Very Famous Computer Science Professor on it. So I bought it and flipped through it, but it was mostly about math, and I’m good at math but not that good, so I eventually shelved it and didn’t look at it for a long time.

A number of years later (perhaps April, 2008) I was quite into playing Grand Theft Auto: Vice City. Yes, this is that notorious game where you’re the criminal and you do a variety of nasty things. One of the different things you can do is to steal a police car and run “Vigilante Missions” where you chase down and kill criminals. You start off at the first level where the goal is to kill one criminal, and you get a reward of $50. At the second level, you have to kill two criminals, and the reward is $200. At level three, there are three criminals, and the reward is $450. At level four the reward is $800, and at level five the reward is $1,250. The rewards keep rising at each level, even though the number of criminals maxes out after a while.

The reward given at each level isn’t necessarily obvious. After playing for a while, I paused the game to see if there was any information about this on the internet. Various gaming guide and FAQ authors have gone to the trouble of documenting the reward amounts. For example, the Grand Theft Auto: Vice City Vigilante Reward FAQ lists the reward for each level, as well as the cumulative reward, for every level up to level 1,000! Furthermore, it gives a formula of sorts for computing the reward amount for each level:

You take the level you are at, subtract one, multiply by 100, add 50, and add that to the previous level's reward to get the reward for the level you want.

(edited for clarity)

Stated mathematically, this is a recurrence relation where the reward for level n is given as follows:

R_n = 100(n-1) + 50 + R_{n-1}

This can be simplified a bit:

R_n = 100n - 50 + R_{n-1}
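
(As a quick check, taking R_0 = 0 gives R_1 = 50, R_2 = 200, and R_3 = 450, matching the rewards listed above.)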

It’s pretty easy to see that each level k adds a reward of 100k – 50 dollars, so this can be expressed as a summation:

R_n = \sum\limits_{k=1}^{n}(100k - 50)

This is pretty easy to simplify to closed form:

\begin{aligned} \\ R_n &= \sum\limits_{k=1}^{n}(100k - 50) \\ &= 100 \sum\limits_{k=1}^{n} k - \sum\limits_{k=1}^{n} 50 \\ \end{aligned}

That first summation term is simply the sum of integers from 1 to n so it can be replaced with the well-known formula for that sum:

\begin{aligned} \\ R_n &= 100\frac{n(n+1)}{2} - 50n \\ &= 50n^2 \end{aligned}
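
(Sanity check: level 5 gives 50 × 25 = $1,250, which matches the in-game reward listed above.)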

OK, pretty simple. But this is just the reward for a given level n. What about the cumulative reward for having completed all levels 1 through n? Now this is a bit more interesting. Let’s define the cumulative reward for level n as follows:

\begin{aligned} \\ S_n &= \sum\limits_{k=1}^{n} 50k^2 \\ \end{aligned}

I don’t know how the guy made the table for levels up to 1,000. He might have used a spreadsheet, or he might have even played out all 1,000 levels and painstakingly kept track of the reward and cumulative reward at each level. (That would be entirely believable; gamers are nothing if not compulsive.) Looking at the “formula” he gave for computing the reward at each level, I’m quite certain he didn’t compute a closed form for the cumulative reward at each level. Clearly, it was important for me to figure that out.

At this point I had forgotten the game and was busy filling up sheets of paper with equations.

Deriving a closed form for this should be simple, right? Just perturb the indexes:

\begin{aligned} \\ S_n &= \sum\limits_{k=1}^{n} 50k^2 \\ S_{n+1} &= \sum\limits_{k=1}^{n+1} 50k^2 \\ S_n + 50(n+1)^2 &= 50 + \sum\limits_{k=2}^{n+1}50k^2 \\ &= 50 + \sum\limits_{k=1}^{n}50(k+1)^2 \\ &= 50 + \sum\limits_{k=1}^{n}50(k^2 + 2k + 1) \\ &= 50 + \sum\limits_{k=1}^{n}50k^2 + \sum\limits_{k=1}^{n} 100k + \sum\limits_{k=1}^{n} 50 \\ &= 50 + S_n + 100 \sum\limits_{k=1}^{n} k + 50n \\ \end{aligned}

Oh crap. The S_n terms just cancel out. That’s what we’re trying to solve for. I tried several different techniques but nothing I tried worked. I was stumped. Hunting around on the web didn’t give me any helpful ideas, either.

At this point I couldn’t finish the game, because I just had to know how to compute the cumulative reward for completing n levels of the mission. I mean, this was important, right?

Swimming from the depths of my memory came a realization that somewhere I might have a book that would have some helpful information about solving sums and recurrences like this. I quickly found and pulled Concrete Mathematics from the shelf, blew the dust off, and curled up on the couch and started reading. Sure enough, right there at the start of section 2.5, General Methods, there's a discussion of how to use several techniques to find the sum of the first n squares. (Page 41, equation 2.37) Great! How did Knuth et al. solve it?

Method 0 is simply to look it up. There’s a well-known formula for this available in many reference books such as the CRC Handbook. But that’s cheating. I needed to know how to derive it. I kept reading.

Method 1 is to guess the answer and prove it by induction. I’m really bad at guessing stuff like this. I kept reading.

Method 2 is to perturb the sum. Aha! This is what I tried to do. Where did I go wrong? Turns out they did the same thing I did and ran into exactly the same problem — the desired sum cancels out. But they had one additional bit of insight: in attempting to find the sum of n squares, they ended up deriving the sum of the first n integers. Huh? Let’s pick up where we left off:

\begin{aligned} \\ S_n + 50(n+1)^2 &= 50 + S_n + 100 \sum\limits_{k=1}^{n} k + 50n \\ 50n^2 + 50n &= 100 \sum\limits_{k=1}^{n} k \\ \frac{n(n+1)}{2} &= \sum\limits_{k=1}^{n} k \\ \end{aligned}

Knuth et al. continue by conjecturing that, if we tried to perturb the sum of cubes, a closed form for the sum of squares might just fall out. Let's try that!

\begin{aligned} \\ T_n &= \sum\limits_{k=1}^{n} k^3 \\ T_{n+1} &= \sum\limits_{k=1}^{n+1} k^3 \\ T_n + (n+1)^3 &= 1 + \sum\limits_{k=2}^{n+1} k^3 \\ &= 1 + \sum\limits_{k=1}^{n} (k+1)^3 \\ &= 1 + \sum\limits_{k=1}^{n} (k^3 + 3k^2 + 3k + 1) \\ &= 1 + \sum\limits_{k=1}^{n} k^3 + 3 \sum\limits_{k=1}^{n} k^2 + 3 \sum\limits_{k=1}^{n} k \ + \sum\limits_{k=1}^{n} 1 \\ &= 1 + T_n + 3 \sum\limits_{k=1}^{n} k^2 + \frac{3}{2}n(n+1) + n \end{aligned}

OK, there’s our sum of squares term. Canceling T_n and simplifying,

\begin{aligned} \\ n^3 + 3n^2 + 3n &= 3 \sum\limits_{k=1}^{n} k^2 + \frac{3}{2}n^2 + \frac{3}{2}n + n \\ n^3 + \frac{3}{2}n^2 + \frac{1}{2}n &= 3 \sum\limits_{k=1}^{n} k^2 \\ \frac{1}{3}n^3 + \frac{1}{2}n^2 + \frac{1}{6}n &= \sum\limits_{k=1}^{n} k^2 \\ \frac{2n^3 + 3n^2 + n}{6} &= \sum\limits_{k=1}^{n} k^2 \\ \frac{n(n+1)(2n+1)}{6} &= \sum\limits_{k=1}^{n} k^2 \\ \end{aligned}

And there we have it. This matches the formula given in the CRC Handbook. The solution to my particular problem, the cumulative reward for Vigilante Mission levels, is simply 50 times this quantity.
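
Since this whole thing started with a game and a program, here's a small Java sketch (just for fun, not something I needed at the time) that replays the FAQ's recurrence and checks it against the two closed forms: 50n^2 for the per-level reward and 50 n(n+1)(2n+1)/6 for the cumulative total:

static void vigilanteRewards() {
    long reward = 0;       // R_0
    long cumulative = 0;   // S_0
    for (long n = 1; n <= 1000; n++) {
        reward += 100 * n - 50;   // the recurrence R_n = 100n - 50 + R_{n-1}
        cumulative += reward;
        long closedReward = 50 * n * n;                              // R_n = 50n^2
        long closedCumulative = 50 * n * (n + 1) * (2 * n + 1) / 6;  // S_n = 50 n(n+1)(2n+1)/6
        if (reward != closedReward || cumulative != closedCumulative) {
            throw new AssertionError("mismatch at level " + n);
        }
    }
    System.out.println("closed forms agree through level 1000");
}

It runs through level 1,000, the range covered by the FAQ author's table, without finding a mismatch.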

What does this have to do with Christmas trees? Not much, but it does have a lot to do with Donald Knuth. It was good to see his lecture, even if I didn't understand a lot of it. But I did apparently pick up something from his class, and I was able to figure out how to solve a problem (even if it was a silly one) with help from one of his books. And the equations in this article use the WordPress LaTeX plugin, which of course traces back to Knuth's TeX. So here's to you, Don, have a happy new year!


Last night, I attended a “class reunion” of sorts, a reunion of the first Stanford CS108A/B class, Fundamentals of Computer Science, that Brian Reid taught in 1982-83. I was one of the teaching assistants. It was intended to be the beginning of an undergraduate Computer Science curriculum at Stanford. There still is a CS108 at Stanford, but it’s now called Object Oriented System Design. (Hmmm.)

Obviously a lot has changed in 30 years: personal computing, the internet, immense increases in computing power and storage, and so forth. But one thing that struck me is how much programming languages have changed. I learned to program in BASIC in the 1970s. It was Wang 2200 BASIC, not one of the microcomputer-based BASICs of the time. Probably it was gratuitously different from them but fundamentally the same. From what I recall it had the following characteristics:

  • The only types were numbers and strings and arrays of them.
  • Variables were all globals.
  • Variable names were A, A0 through A9, …, Z9 and corresponding string variables A$, A0$ through A9$, … Z9$. At least one could have both a numeric variable Q7 and a string variable Q7$. (I seem to recall some BASICs prohibiting that.) But I don’t think you could have both scalar and array variables with the same name.
  • The control structures were IF, GOTO, GOSUB, and ON ERROR. You couldn’t put a statement in the IF statement, just a line number. There was no ELSE, so you had to GOTO around the then-block.
  • Oh yeah, line numbers.
  • GOSUB took line numbers. Wang BASIC had a variant of GOSUB that would pass parameters, but there were no return values from subroutines.

Despite all these limitations, we all had a lot of fun programming BASIC, didn’t we?

When I arrived at Stanford in the early 1980s, the programming language was Pascal. After BASIC, Pascal was a breath of fresh air. It had:

  • The ability to give names for things, and for names to reside within scopes.
  • The ability to put a group of statements into a begin/end block and use them in an if-then-else statement.
  • Structured programming constructs like while, repeat, and case.
  • It did have goto, but it was rarely used, and in fact I didn’t actually know how it worked. (The obvious cases worked as expected, of course, but what happens if you goto a label located in a nested scope? Or goto a label that’s in an outer scope but that’s not on your call stack?)
  • Data structures!
  • Local variables.
  • User-defined types.
  • A real boolean type.
  • Dynamic memory allocation and pointers.

Compared to BASIC, Pascal was a huge step forward, freed from restrictive variable names and line numbers. Dynamic allocation of instances of user-defined types enabled one to create data structures like linked lists and trees, which were almost impossible in BASIC.

All the CS108 programming assignments were in Pascal. We were writing really big programs. They must have been hundreds or even thousands of lines long. 🙂 And as one of the TAs, I had to read a bunch of them.

People have dumped on Pascal a lot, unfairly so in my opinion. It had its share of problems, but it had huge advantages over BASIC, and I was also able to write some useful programs in Turbo Pascal on the PC in the mid 1980s.

Some of its problems did prove quite irritating, though, and possibly fatal. Among them are:

  • A string is simply an array of char, and the length of an array is part of its type. This made string handling incredibly cumbersome. BASIC’s substring and appending operations were fluid by comparison. I always wrestled with Pascal strings and never understood why they were so difficult until, many years later, I read a commentary by one of the C guys (Kernighan or Ritchie) that pointed out Pascal’s mistake was including the length of an array in its type. (I can’t find a reference to this though.) (See update below.)
  • Inability to terminate a loop in the middle. The typical practice was to use boolean flags to handle loop termination (with an embedded if-test) but this was cumbersome and error-prone. Of course, you could goto out of a loop, but that was frowned upon.
  • No short-circuit expression evaluation. This made loop termination even harder.
  • Weird lookahead-based I/O.
  • Lexical nesting.

UPDATE: In the comments, Joe Darcy pointed me to Kernighan’s article Why Pascal is Not My Favorite Language. It seems to be available in a bunch of places around the web. One particularly convenient place is here. Kernighan explains quite well not only the issue of array length, but also the problems with loop termination and lack of short-circuit evaluation. Money quote: “Early exits are a pain, almost always requiring the invention of a boolean variable and a certain amount of cunning.”

That point on lexical nesting deserves some further discussion. Lexical nesting is of course very useful: it allows certain details of internal data structures and code to be hidden from the outside. The problem with Pascal is that lexical nesting is the only means for creating large-scale program structures. (Of course some Pascal systems had separate compilation and libraries, but those were extensions.) Some of the toughest problems I helped students with involved nesting. In one particular case I was trying to diagnose a compiler error where a variable wasn't declared. Those were usually pretty easy: just look up towards the top of the program to find where the missing declaration needed to go. In this particular case there was a declaration that seemed to be in the right place. I was mystified.

After a lot of analysis, we discovered that the nesting was off. The begins and ends were balanced, but they were shifted in a way that caused an entire section of the program to be nested a level deeper than it seemed it was. Thus the declaration was in a nested scope and wasn’t visible where it was needed. The problem was that this was probably a 20- or 30-page program. The declaration was at one point, the compiler error (undeclared variable) was at a different point very far away, and the actual error (misplaced begin) was in still another location, also far away from the other two. Thus, to diagnose the problem, one had to read and understand the structure of the entire program.

Not long after I TA’d CS108, I started working for Brian in the Stanford Computer Systems Lab. This was a Unix/C shop, so I quickly switched from Pascal to C. Reminiscing last night, Brian said Pascal was like a nanny who always was saying “No, you can’t do that!” Programming in C was like Pascal with the training wheels taken off. (I’m sure I stole that line from somewhere.) Sure, there were weirdnesses (the * pointer dereference is a prefix operator? What’s with this weird array/pointer duality?) But most of the hassles of everyday programming in Pascal were gone:

  • The break statement.
  • Short-circuit expression evaluation.
  • Reasonable library-based I/O.
  • Reasonable library-based memory allocation.
  • Files-as-modules program structure.
  • Unchecked array length (length not part of array type).

This last is of course a blessing and a curse. As in Pascal, a C string is (mostly) just an array of char, but of variable length. This made string processing much easier. Actual string length was determined by a trailing zero byte. Of course, since the language didn’t keep track of the actual array capacity, programs would have to keep track of it themselves. Or not. But the programs worked anyway. Usually. This gave rise to a bunch of sloppy programming practices, leading to memory smashes, buffer overruns, security breaches, and so forth.

Nevertheless, C has proven to be incredibly useful and is still one of the most popular programming languages today. It’s important to understand, though, that the C we used in the 1980s (“K&R C”) isn’t the same C we program in today. I’ll have more to say about that in a separate article.
