The distinct() stream operation compares the stream’s elements using Object.equals(). That is, for any set of stream elements that are all equals() to each other, the distinct() operation will let just one of them through. However, sometimes you want the notion of “distinct” to be based on some property or other value derived from the stream element, but not the value itself. You could use map() to map the stream element into some derived value and use distinct() on those, but the result would be a stream of those derived values, not the original stream element.

It would be nice if there were some construct like

distinct(Function<T,U> keyExtractor)

that would call keyExtractor to derive the values that are compared for uniqueness, but there isn’t. However, it’s not too difficult to write your own.

The first insight is that you can think of the distinct() operation as a ​stateful filter​. It’s like a filter() operation, which takes a predicate that determines whether to let the element though. It’s ​stateful​ because whether it lets an element through is determined by what elements it has seen previously.

This state needs to be maintained somewhere. Internally, the distinct() operation keeps a Set that contains elements that have been seen previously, but it’s buried inside the operation and we can’t get to it from application code. But we could write something similar ourselves. The usual way to maintain state in Java is to create a class that has fields in which the state is maintained. We need a predicate, and that predicate could be a method on that class. This will work, but it’s rather cumbersome.

The second insight is that lambdas can capture local variables from their enclosing lexical environment. These local variables cannot be mutated, but if they are references to mutable objects, ​those objects​ can be mutated. Thus we can write a higher-order function whose local variables contain references to the state objects, and we can have our higher-order function return a lambda that captures those locals and does its processing based on the captured, mutable state.

This function will want to take a keyExtractor function that’s used to derive a value from each stream element. Conceptually we’ll want a Set to keep track of values we’ve seen already. However, in case our stream is run in parallel, we’ll want some thread-safe data structure. A ConcurrentHashMap is a simple way to do this, with each existing key representing membership in the set, and the value being a dummy object such as the empty string. (That’s how many Set implementations in the JDK work already.) Ideally we’d want to use an existing object as the dummy value and not create one each time. The empty string literal is used many times in the core JDK classes, so it’s certainly already in the constant pool.

Here’s what the code looks like:

public static <T> Predicate<T> distinctByKey(
        Function<? super T,Object> keyExtractor) {
    Map<Object,String> seen = new ConcurrentHashMap<>();
    return t -> seen.put(keyExtractor.apply(t), "") == null;

This is a bit subtle. This is intended to be used within a filter() operation, so we’re returning a lambda that’s a predicate that computes a boolean based on whether it’s seen the value before. This value is derived from the stream element by calling the key extractor function. The put() method returns the previous value in the map, or null if there was no value. That’s the case we’re interested in, so if it returns null, we want the predicate to return true just for this first time. Subsequent times it will return non-null, so we return false those times, so the filter operation won’t pass through the element in those cases. I had used putIfAbsent() at first, since it has first-time-only semantics, but it turns out to be unnecessary, and using put() makes the code a bit shorter.

Here’s how it’s used. Suppose we have a Book class that has fields for title and author, and the obvious constructor and getters, and we have a list of books that we want to process:

List<Book> list = Arrays.asList(
    new Book("This Side of Paradise", "F. Scott Fitzgerald"),
    new Book("The Beautiful and Damned", "F. Scott Fitzgerald"),
    new Book("The Great Gatsby", "F. Scott Fitzgerald"),
    new Book("Tender is the Night", "F. Scott Fitzgerald"),
    new Book("The Sound and the Fury", "William Faulkner"),
    new Book("Absalom, Absalom!", "William Faulkner"),
    new Book("Intruder in the Dust", "William Faulkner"),
    new Book("The Sun Also Rises", "Ernest Hemingway"),
    new Book("A Farewell to Arms", "Ernest Hemingway"),
    new Book("The Old Man and the Sea", "Ernest Hemingway"),
    new Book("For Whom the Bell Tolls", "Ernest Hemingway"),
    new Book("A Moveable Feast", "Ernest Hemingway")

If we wanted one book from each author, we could do this:


The output from running this is:

Book[This Side of Paradise,F. Scott Fitzgerald]
Book[The Sound and the Fury,William Faulkner]
Book[The Sun Also Rises,Ernest Hemingway]

Since this is a sequential stream, the first book by each author is the one that ends up in the output. If we were to run this stream in parallel, it would still work “correctly” in that one book from each author would be output. However, which book is output would differ from run to run.

It takes a bit of tinkering to use higher-order functions to write stateful stream operations. You can see the evolution of my thinking by looking at my answer to a Stackoverflow question on this topic. I started out writing a class, but after chipping away at it a bit I realized a class was no longer necessary and a higher-order function could be used instead. This is a powerful technique that can be used to write all kinds of stateful stream operations. You just have to make sure to be careful they’re thread-safe.

Earlier this month I had the pleasure of attending Donald Knuth’s 20th annual Christmas Tree Lecture. (announcement, video)

For me it was quite a bit of nostalgia. When I was a Computer Science student at Stanford, uh, a number of decades ago, of course I had to take a class from Knuth, who was (and still is) a Very Famous Computer Science Professor. The class was nominally a computer science class, but it was mostly math. In fact, it was all math. I’m good at math, having completed several semesters of college-level math even after placing out of introductory calculus classes. However, I’m not really, really good at math, which is what you have to be to keep up with Knuth. As a result I spent most of that class wallowing in confusion.

Knuth’s Christmas Tree lectures aren’t about Christmas trees, of course, but they are ostensibly about the Computer Science kind of trees, and they occur around Christmastime. Hence, Christmas Tree. But they aren’t so much about Computer Science as they are about math. So it was quite a familiar feeling for me to sit in a Stanford lecture hall, for the first time in decades, listening to Knuth give a lecture on math, with me wallowing in confusion most of the time.

A few years after I left Stanford, Knuth (along with Ronald Graham and Oren Patashnik) published Concrete Mathematics:


(Amazon link)

For some reason I was compelled to buy it. I probably bought it because it must be a Very Important Computer Science Book because it has the name of a Very Famous Computer Science Professor on it. So I bought it and flipped through it, but it was mostly about math, and I’m good at math but not that good, so I eventually shelved it and didn’t look at it for a long time.

A number of years later (perhaps April, 2008) I was quite into playing Grand Theft Auto: Vice City. Yes, this is that notorious game where you’re the criminal and you do a variety of nasty things. One of the different things you can do is to steal a police car and run “Vigilante Missions” where you chase down and kill criminals. You start off at the first level where the goal is to kill one criminal, and you get a reward of $50. At the second level, you have to kill two criminals, and the reward is $200. At level three, there are three criminals, and the reward is $450. At level four the reward is $800, and at level five the reward is $1,250. The rewards keep rising at each level, even though the number of criminals maxes out after a while.

The reward given at each level isn’t necessarily obvious. After playing for a while, I paused the game to see if there was any information about this on the internet. Various gaming guide and FAQ authors have gone to the trouble of documenting the reward amounts. For example, the Grand Theft Auto: Vice City Vigilante Reward FAQ lists the reward for each level, as well as the cumulative reward, for every level up to level 1,000! Furthermore, it gives a formula of sorts for computing the reward amount for each level:

You take the level you are at, subtract one, multiply by 100, add 50 and add that to the previous level’s reward to compute the level you want.

(edited for clarity)

Stated mathematically, this is a recurrence relation where the reward for level n is given as follows:

R_n = 100(n-1) + 50 + R_{n-1}

This can be simplified a bit:

R_n = 100n - 50 + R_{n-1}

It’s pretty easy to see that each level k adds a reward of 100k – 50 dollars, so this can be expressed as a summation:

R_n = \sum\limits_{k=1}^{n}(100k - 50)

This is pretty easy to simplify to closed form:

\begin{aligned} \\  R_n &= \sum\limits_{k=1}^{n}(100k - 50) \\  &= 100 \sum\limits_{k=1}^{n} k - \sum\limits_{k=1}^{n} 50 \\  \end{aligned}

That first summation term is simply the sum of integers from 1 to n so it can be replaced with the well-known formula for that sum:

\begin{aligned} \\  R_n &= 100\frac{n(n+1)}{2} - 50n \\  &= 50n^2  \end{aligned}

OK, pretty simple. But this is just the reward for a given level n. What about the cumulative reward for having completed all levels 1 through n? Now this is a bit more interesting. Let’s define the cumulative reward for level n as follows:

\begin{aligned} \\  S_n &= \sum\limits_{k=1}^{n} 50k^2 \\  \end{aligned}

I don’t know how the guy made the table for levels up to 1,000. He might have used a spreadsheet, or he might have even played out all 1,000 levels and painstakingly kept track of the reward and cumulative reward at each level. (That would be entirely believable; gamers are nothing if not compulsive.) Looking at the “formula” he gave for computing the reward at each level, I’m quite certain he didn’t compute a closed form for the cumulative reward at each level. Clearly, it was important for me to figure that out.

At this point I had forgotten the game and was busy filling up sheets of paper with equations.

Deriving a closed form for this should be simple, right? Just perturb the indexes:

\begin{aligned} \\  S_n &= \sum\limits_{k=1}^{n} 50k^2 \\  S_{n+1} &= \sum\limits_{k=1}^{n+1} 50k^2 \\  S_n + 50(n+1)^2 &= 50 + \sum\limits_{k=2}^{n+1}50k^2 \\  &= 50 + \sum\limits_{k=1}^{n}50(k+1)^2 \\  &= 50 + \sum\limits_{k=1}^{n}50(k^2 + 2k + 1) \\  &= 50 + \sum\limits_{k=1}^{n}50k^2 + \sum\limits_{k=1}^{n} 100k + \sum\limits_{k=1}^{n} 50 \\  &= 50 + S_n + 100 \sum\limits_{k=1}^{n} k + 50n \\  \end{aligned}

Oh crap. The S_n terms just cancel out. That’s what we’re trying to solve for. I tried several different techniques but nothing I tried worked. I was stumped. Hunting around on the web didn’t give me any helpful ideas, either.

At this point I couldn’t finish the game, because I just had to know how to compute the cumulative reward for completing n levels of the mission. I mean, this was important, right?

Swimming from the depths of my memory came a realization that somewhere I might have a book that would have some helpful information about solving sums and recurrences like this. I quickly found and pulled Concrete Mathematics from the shelf, blew the dust off, and curled up on the couch and started reading. Sure enough, right there at the start of section 2.5, General Methods, there’s a discussion of how to use several techniques to find the sum of the first n squares. (Page 41, equation 2.37) Great! How did Knuth et. al. solve it?

Method 0 is simply to look it up. There’s a well-known formula for this available in many reference books such as the CRC Handbook. But that’s cheating. I needed to know how to derive it. I kept reading.

Method 1 is to guess the answer and prove it by induction. I’m really bad at guessing stuff like this. I kept reading.

Method 2 is to perturb the sum. Aha! This is what I tried to do. Where did I go wrong? Turns out they did the same thing I did and ran into exactly the same problem — the desired sum cancels out. But they had one additional bit of insight: in attempting to find the sum of n squares, they ended up deriving the sum of the first n integers. Huh? Let’s pick up where we left off:

\begin{aligned} \\  S_n + 50(n+1)^2 &= 50 + S_n + 100 \sum\limits_{k=1}^{n} k + 50n \\  50n^2 + 50n &= 100 \sum\limits_{k=1}^{n} k \\  \frac{n(n+1)}{2} &= \sum\limits_{k=1}^{n} k \\  \end{aligned}

Knuth et. al. continue by conjecturing that, if we tried to perturb the sum of cubes, a closed form for the sum of squares might just fall out. Let’s try that!

\begin{aligned} \\  T_n &= \sum\limits_{k=1}^{n} k^3 \\  T_{n+1} &= \sum\limits_{k=1}^{n+1} k^3 \\  T_n + (n+1)^3 &= 1 + \sum\limits_{k=2}^{n+1} k^3 \\  &= 1 + \sum\limits_{k=1}^{n} (k+1)^3 \\  &= 1 + \sum\limits_{k=1}^{n} (k^3 + 3k^2 + 3k + 1) \\  &= 1 + \sum\limits_{k=1}^{n} k^3 + 3 \sum\limits_{k=1}^{n} k^2 + 3 \sum\limits_{k=1}^{n} k \  + \sum\limits_{k=1}^{n} 1 \\  &= 1 + T_n + 3 \sum\limits_{k=1}^{n} k^2 + \frac{3}{2}n(n+1) + n  \end{aligned}

OK, there’s our sum of squares term. Canceling T_n and simplifying,

\begin{aligned} \\  n^3 + 3n^2 + 3n &= 3 \sum\limits_{k=1}^{n} k^2 + \frac{3}{2}n^2 + \frac{3}{2}n + n \\  n^3 + \frac{3}{2}n^2 + \frac{1}{2}n &= 3 \sum\limits_{k=1}^{n} k^2 \\  \frac{1}{3}n^3 + \frac{1}{2}n^2 + \frac{1}{6}n &= \sum\limits_{k=1}^{n} k^2 \\  \frac{2n^3 + 3n^2 + n}{6} &= \sum\limits_{k=1}^{n} k^2 \\  \frac{n(n+1)(2n+1)}{6} &= \sum\limits_{k=1}^{n} k^2 \\  \end{aligned}

And there we have it. This matches the formula given in the CRC Handbook. The solution to my particular problem, the cumulative reward for Vigilante Mission levels, is simply 50 times this quantity.

What does this have to do with Christmas trees? Not much, but it does have a lot to do with Donald Knuth. It was good to see his lecture, even if I didn’t understand a lot of it. But I did apparently pick up something from his class, and I was able to figure out how to solve a problem (even if it was a silly one) with help from one of his books. And the equations this article use the WordPress plugin for LaTeX, which of course originally came from Knuth. So here’s to you, Don, have a happy new year!

JavaOne 2014 Schedule

I have a jam-packed schedule for JavaOne 2014. My sessions are as follows:

TUT3371 Jump-Starting Lambda (Tue 30 Sep 0830 Hilton Yosemite B/C)

This is my gentle introduction to lambda tutorial. Download presentation (PDF).

CON3374 Lambda Q&A Panel (Tue 30 Sep 1230 Hilton Yosemite B/C)

This panel session will explore the impact of Java 8 Lambdas on the Java ecosystem.

BOF6244 You’ve Got Your Streams On My Collections! (Tue 30 Sep 1900 Hilton Yosemite A)

Community discussion of collections and the new streams APIs.

IGN12431 Ignite Session (Tue 30 Sep 1900 Hilton Imperial A)

This is at the same time as the BOF, so my session will be later on, perhaps 2000 or so. Make sure to come; I have a little surprise planned!

CON3372 Parallel Streams Workshop (Wed 1 Oct 1000 Hilton Yosemite A)

Writing parallel streams code can be easy and effective, but you have to avoid some pitfalls. Download Presentation (PDF)

CON6377 Debt and Deprecation (Wed 1 Oct 1500 Hilton Yosemite A)

Given by my alter ego, Dr. Deprecator, this talk explores the principles and prescriptions of deprecation. Download Presentation (PDF)

HOL3373 Lambda Programming Laboratory (Thu 2 Oct 1200-1400 Hilton Franciscan A/B)

This is your chance to try out some lambda expressions and stream APIs introduced in Java 8, in order to solve a couple dozen challenging exercises. View Introductory Slide Presentation. Download Exercises (NetBeans Project).

See you in San Francisco!

Here’s the slide presentation I gave at the Silicon Valley JavaFX Users Group this evening, June 4th 2014:

Java 8 Lambda and Streams Overview Slides (PDF)

This is an approximately one-hour talk that covers lambdas, default methods, method references, and the streams API. It necessarily leaves out a lot of details, but it serves as a reasonably brief overview to the new features we’ve added in Java 8. (This is the same presentation I gave at the Japan JUG  a couple weeks ago.)

The Meetup page for the SVJUGFX event is here:

Meetup Page

The video replay isn’t available as of this writing, but I imagine that a link will be placed there once the replay is available.

Finally, the source code for the “Bug Dates” charting application I showed in the talk is here:

Bug Dates demo (github)

We discovered another bug in a shell script test the other day. It’s incredibly simple and stupid but it’s been covering up errors for years. As usual, there’s a story to be told.

About a month ago, my colleague Tristan Yan fixed some tests in RMI to remove the use of fixed ports. The bug for this is JDK-7190106 and this is the changeset. Using a fixed port causes intermittent failures when there happens to be something else already running that’s using the same port. Over the past year or so we’ve been changing the RMI tests to use system-assigned ports in order to avoid such collisions. Tristan’s fix was another in the step of ridding the RMI test suite of its use of fixed ports. Using system-assigned ports requires using a Java-based test library, and these tests were shell scripts, so part of Tristan’s fix was also converting them to Java.

The converted tests worked mostly fine, except that over the following few weeks an occasional test failure would occur. The serialization benchmark, newly rewritten into Java, would occasionally fail with a StackOverflowError. Clearly, something must have changed, since we had never seen this error occur with the shell script version of the serialization benchmark. Could the conversion from shell script to Java have changed something that caused excessive stack usage?

It turns out that the serialization benchmark does use a larger-than-ordinary amount of stack space. One of the tests loads a class that is fifty levels deep in the class inheritance hierarchy. This wasn’t a case of infinite recursion. Class loading uses a lot of stack space, and this test sometimes caused a StackOverflowError. Why didn’t the stack overflow occur with the shell script test?

The answer is … it did! It turns out that the shell script version of the serialization test was throwing StackOverflowError occasionally all along, but never reported failure. It’s pretty easy to force the test to overflow the stack by specifying a very small stack size (e.g., -Xss200k). Even when it threw a StackOverflowError, the test would still indicate that it passed. Why did this happen?

After the test preamble, last line of the shell script invoked the JVM to run the serial benchmark like this:

$TESTJAVA/bin/java \
    -cp $TESTCLASSES \
    bench.serial.Main \
    -c $TESTSRC/bench/serial/jtreg-config &

Do you see the bug?

The pass/fail status of a shell script test is the exit status of the script. By usual UNIX conventions, a zero exit status means pass and a nonzero exit status means failure. In turn, the exit status of the script is the exit status of the last command the script executes. The problem here is that the last command is executed in the background, and the “exit status” of a command run in the background is always zero. This is true regardless of whether the background command is still running or whether it has exited with a nonzero status. Thus, no matter what happens in the actual test, this shell script will always indicate that it passed!

It’s a simple matter to experiment with the -Xss parameter in both the shell script and Java versions of the test to verify they both use comparable amounts of stack space. And given that the test workload sometimes overflowed the stack, the fix is to specify a sufficiently large stack size to ensure that this doesn’t happen. See JDK-8030284 and the changeset that fixes it.

How did this happen in the first place? I’m not entirely sure, but this serialization benchmark test was probably derived from another shell script test right nearby, an RMI benchmark. Tristan also rewrote the RMI benchmark into a Java test, but it’s a bit more complicated. The RMI benchmark needs to run a server and a client in separate JVMs. Simplified, the RMI benchmark shell script looked something like this:

echo "Starting RMI benchmark server "
java bench.rmi.Main -server &

# wait for the server to start
sleep 10 

echo "Starting RMI benchmark client "
java bench.rmi.Main -client

When the serialization benchmark script was derived from the RMI benchmark script, the original author simply deleted the invocation of the RMI client side and modified the server-side invocation command line to run the serialization benchmark instead of the RMI benchmark, and left it running in the background.

(This test also exhibits another pathology, that of sleeping for a fixed amount of time in order to wait for a server to start. If the server is slow to start, this can result in an intermittent failure. If the server starts quickly, the test must still wait the full ten seconds. The Java rewrite fixes this as well, by starting the server in the first JVM, and forking the client JVM only after the server has been initialized.)

This is clearly another example of the fragility of shell script tests. A single character editing error in the script destroyed this test’s results!

Further Reading

  1. Testing Pitfalls (pdf). I gave this presentation at the TestFest around the time of Devoxx UK in March, 2013. This presentation has a lot of material on the problems that can arise with shell script tests in OpenJDK that use the jtreg test harness.
  2. Jon Gibbons (maintainer of jtreg) has an article entitled Shelling Tests that also explains some of the issues with shell script tests. It also describes the progress being made converting shell tests to Java in the langtools repository of OpenJDK.

JavaOne 2013 Has Begun!

This year’s JavaOne has begun! The keynotes were today, in Moscone for the first time since Oracle acquired Sun. It was a bit strange, having a little bit of JavaOne in the midst of Oracle OpenWorld. Red was everywhere. The rest of the week, JavaOne is in The Zone in the hotels by Union Square.

As usual, I’m involved in several activities at JavaOne. Oddly enough I’m not on any normal technical sessions. But I have one of almost everything else: a tutorial, a BOF, and a Hands-on lab.

Jump-starting Lambda Programming [TUT3877] – 10:00am-12:00pm Monday.

A gentle introduction to lambdas. This is early in the schedule, so you should start here, and then progress to some of the other, more advanced lambda sessions later in the conference.

[update] Slides available here: TUT3877_Marks-JumpStartingLambda-v6.pdf

Ten Things You Should Know When Writing Good Unit Test Cases in Java [BOF4255] – 6:30pm-7:15pm Monday.

Paul Thwaite (IBM) submitted this and invited me to contribute. I think we have some good ideas to share. Ideally a BOF should be a conversation among the audience members and the speakers. This might be difficult, as it looks like over 250 people have signed up so far! It’s great that there’s so much interest in testing.

[update] Paul has posted his slides.

Lambda Programming Laboratory [HOL3970] – 12:30pm-2:30pm Wednesday.

Try your hand at solving a dozen lambda-based exercises. They start off simple but they can get quite challenging. You’ll also have a chance to play with a JavaFX application that illustrates how some Streams library features work.

[update] I’ve uploaded the lab exercises in the form of a NetBeans project (zip format). Use the JDK 8 Developer Preview build or newer and use NetBeans 7.4 RC1 or newer.

Java DEMOgrounds in the Exhibition Hall – 9:30am-5:00pm Monday through Wednesday.

The Java SE booth in the DEMOgrounds has a small lambda demo running in NetBeans. I wrote it (plug, plug). I plan to be here from 2:00pm-3:00pm on Monday (the dedicated exhibition hours, when no sessions are running) so drop by to chat, ask questions, or to play around with the demo code.

Enjoy the show!

No, not that fixed point.

In the current sex-scandal-of-the-week, New York mayoral candidate Anthony Weiner has basically admitted to sending lewd messages under the pseudonym “Carlos Danger.” Where the heck did that name come from?

Clearly, there is a function that maps from one’s ordinary name to one’s “Carlos Danger” name. Slate has helpfully provided an implementation of the Carlos Danger name generator function. Using this tool, for example, one can determine that the Carlos Danger name for me (Stuart Marks) is Ricardo Distress. Hm, not too interesting. Of course, the Carlos Danger name for Anthony Weiner is Carlos Danger.

Now, what is the Carlos Danger name for Carlos Danger? It must be Carlos Danger, right? Apparently not, as the generator reveals that it is Felipe Menace.

Inspecting the source code of the web page reveals that the generator function basically hashes the input names a couple times and uses those values to index into predefined tables of Carlos-Danger-style first and last names. So, unlike Anthony Weiner, which is special-cased in the code, there’s nothing special about Carlos Danger. It’ll just map into some apparently-random pair of entries from the tables.

If the Carlos Danger name for Carlos Danger isn’t Carlos Danger, is there some other name whose Carlos Danger name is itself? Since there is a fairly small, fixed set of names, this is pretty easy to find out by searching the entire name space, as it were. A quick transliteration of the function into Java later (including a small wrestling match with character encodings), I have the answer:

  • The Carlos Danger name for Mariano Dynamite is Mariano Dynamite.
  • The Carlos Danger name for Miguel Ãngel Distress is Miguel Ãngel Distress.

You heard it here first, folks.

Finally, if you ever run into Ricardo Distress, tell him I said hi.


Get every new post delivered to your Inbox.