Archive for the ‘Java’ Category

A new default method CharSequence.isEmpty() was added in the just-released JDK 15. This broke the Eclipse Collections project. Fortunately, the EC developers were testing the JDK 15 early access builds. They noticed the incompatibility, and they were able to ship a patch release (Eclipse Collections 10.4.0) before JDK 15 shipped. They also reported this to the OpenJDK Quality Outreach program. As a result, we were able to document this change in a release note for JDK 15.

Kudos to Nikhil Nanivadekar and Don Raab and the Eclipse Collections team for getting on top of this issue!

What’s the story here? Aren’t new JDK releases supposed to be compatible? In general, yes, we try really hard to keep everything compatible. But sometimes incompatibilities are unavoidable, and sometimes we just miss stuff. To understand what happened, we need to discuss two distinct concepts: source incompatibility and binary incompatibility.

A source incompatible change is one where a source file compiles just fine on an earlier JDK release but fails to compile on a more recent JDK release. A binary incompatible change is one where a compiled class file runs fine on an earlier JDK release but fails at runtime on a more recent JDK release.

In development of the JDK, we put in quite a bit of effort to avoid binary incompatible changes, since it’s unreasonable to force people to recompile everything, and potentially maintain different artifacts, for different JDK releases. Ideally, we’d like to enable people to provide a single binary artifact (e.g., a jar file) that runs on all of the JDK releases that their project supports.

We are somewhat more tolerant of source incompatible changes. If you’re recompiling something, then presumably you have access to the source code in order to make a few minor adjustments. We’re willing to make minor source incompatible changes to the JDK if the change provides enough value to justify the incompatibility.

It turns out that adding a default method to an interface is potentially both a source and binary incompatible change. I was a bit surprised by this. What’s going on?

Let’s first set aside default methods on interfaces and look just at adding methods to classes. Making changes to a class potentially affects subclasses. In most cases, adding a method to a class is a binary compatible change, even if the subclass has methods that are apparently in conflict with the new method in the superclass. For example, consider this class compiled on JDK 8:

class MyInputStream extends InputStream {
    public String readAllBytes() { ... }
    ...
}

This works fine. However, a method was added to InputStream on JDK 9:

public byte[] readAllBytes()

Now there is a conflict between InputStream and MyInputStream, since they have methods with the same name, the same parameters (none), but different return types. Despite this conflict, this is a binary compatible change. Any already-compiled classes that invoke the readAllBytes() method on an instance of MyInputStream will do so using this bytecode:

invokevirtual #6 // Method MyInputStream.readAllBytes:()Ljava/lang/String;

(I determined this by compiling a program that uses MyInputStream on JDK 8, and then running the javap -c command on the resulting class file.) Roughly, this says “invoke the method named «readAllBytes» that takes no arguments and returns a String.” That method exists on MyInputStream and not on InputStream, so the method invocation works even on JDK 9.

However, this is a source incompatible change. When I try to recompile MyInputStream.java on JDK 9, the result is this:

MyInputStream.java:13: error: readAllBytes() in MyInputStream cannot override readAllBytes() in InputStream
public String readAllBytes() {
^
return type String is not compatible with byte[]

The compatibility analysis of adding methods to classes is fairly straightforward. There is only one path from the current class up the superclass chain to the root class, java.lang.Object. Any conflicts among methods can only occur on this path.

Analysis of adding default methods to interfaces is more complicated, because a class or interface can inherit from multiple interfaces. This means that, looking upward from the current class, instead of there being a linear chain of superclasses up to Object, there is a branching tree (actually a DAG) of interface inheritance. This gives rise to several inheritance possibilities that cannot occur with class-only inheritance.

Also, since default methods are a relatively recent feature, the Java community has comparatively little experience evolving APIs that use them. Default methods were added in Java 8, which was released in 2014, so we have “only” six years of experience with them.

It was possible to have conflicts among interfaces even before Java 8, for example, if two unrelated interfaces declared the same method but with different return types. Prior to Java 8, though, interfaces were essentially impossible to evolve, so such conflicts hardly ever arose from interface evolution. Finally, in the pre-Java 8 world, interface methods were all abstract. If a class inherited the “same” method (same name, parameters, and return type) from different interfaces, that was OK, as both could be satisfied by a single implementation provided by the class or one of its superclasses.
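As a minimal sketch of the pre-Java 8 situation (the interface and class names here are made up for illustration, not real JDK types):

```java
// Two unrelated interfaces that happen to declare the "same" abstract method.
interface Named { String name(); }
interface Labeled { String name(); }

// A single implementation in the class satisfies both abstract declarations.
class Tag implements Named, Labeled {
    public String name() { return "tag"; }
}

public class PreJava8Demo {
    public static void main(String[] args) {
        Named n = new Tag();
        Labeled l = new Tag();
        // Both interface views dispatch to the one implementation in Tag.
        System.out.println(n.name() + " " + l.name());
    }
}
```

Because both methods are abstract, there is nothing to conflict: the class simply provides the one body that both interfaces require.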

With the addition of default methods in Java 8, a new problem arose: what if a default method were added to an interface somewhere, such that conflicts between method implementations might arise somewhere in the superclass and superinterface graph? More specifically, what if the superinterface graph contains two default implementations for the same method? The full rules are described in the Java Language Specification, sections 8.4.8 and 8.4.8.4, and there are lots of edge cases, but briefly, the rules are as follows:

  • Methods inherited from the class hierarchy take precedence over default methods inherited from interfaces.
  • Default methods in interfaces are allowed to override each other; the most specific override takes precedence.
  • If multiple default methods are inherited from unrelated interfaces (that is, one doesn’t override the others), that’s a compile-time error.

Here are some examples of these rules in action:

class S {
    public void foo() { ... }
}

interface I {
    default void foo() { ... }
}

interface J extends I {
    default void foo() { ... }
}

interface K {
    default void foo() { ... }
}

Given this class and these interfaces, how do the inheritance rules work?

class C extends S implements I { }
// ok: class wins, S::foo inherited

class D implements I, J { }
// ok: overriding default method wins, J::foo inherited

class E implements I, K { }
// ERROR: types I and K are incompatible;
//   class E inherits unrelated defaults for foo() from types I and K

So now we have to think harder about the compatibility impact of adding a default method. If a class already has the method, we’re OK. If there’s another interface that has a default method that overrides or is overridden by the default method we’re adding, that’s OK too. A problem can only occur if there is another default method somewhere in the interface graph inherited by some class.

That’s what’s going on with source compatibility. If you run through the examples above, you can see the kind of compilation error that might arise. What about binary compatibility? It turns out that the rules for binary compatibility with default methods are actually quite similar to those for source compatibility.

Here’s what the Java Virtual Machine Specification says about how invokevirtual finds the method to call. It first talks about method selection:

A method is selected with respect to [the class] and the resolved method (§5.4.6).

Section 5.4.6 says:

The maximally-specific superinterface methods of [the receiver class] are determined (§5.4.3.3). If exactly one matches [the method]’s name and descriptor and is not abstract, then it is the selected method.

OK, what if there isn’t exactly one match? In particular, what if there are multiple matches? Back in the specification of invokevirtual, it says:

If no method is selected, and there are multiple maximally-specific superinterface methods of [the class] that match the resolved method’s name and descriptor and are not abstract, invokevirtual throws an IncompatibleClassChangeError.

Thus, the JVM has to do quite a bit of analysis at runtime. When a method is invoked on some class, it must not only search up the class hierarchy for that method; it must also search the graph of interface inheritance to see whether a default method might have been inherited, and whether there is exactly one such method. Thus, adding a default method to an interface can easily cause problems for existing, compiled classes — a binary incompatibility.

We always examine the JDK for incompatibilities and avoid them if possible. In addition, we look at popular non-JDK libraries to see if problems might occur with them. This kind of incompatibility can occur only if a non-JDK library has a signature-compatible default method in an interface that is unrelated to the JDK interface being modified. It also requires that there be some class that inherits both that interface and the JDK interface. That seems pretty rare, but it can happen.

In fact, this is exactly the case that came up in Eclipse Collections! The Eclipse Collections library has an interface PrimitiveIterable that declares a default method isEmpty(), and it also has a class CharAdapter that implements PrimitiveIterable and CharSequence:

interface PrimitiveIterable {
    default boolean isEmpty() { ... }
}

class CharAdapter implements PrimitiveIterable, CharSequence {
    ...
}

This works perfectly fine in JDK 14 and earlier releases. Consider some code that calls CharAdapter.isEmpty(). The bytecode generated would be as follows:

invokevirtual #13 // Method org/eclipse/collections/impl/string/immutable/CharAdapter.isEmpty:()Z

This works on JDK 14, because invokevirtual searches all the superclasses and superinterfaces of CharAdapter, and it finds exactly one default method: the one in PrimitiveIterable.

On JDK 15, the situation is different. A new default method isEmpty() was added to CharSequence. Thus, when the same invokevirtual bytecode is executed, it searches the superclasses and superinterfaces of CharAdapter, but this time it finds two matching default methods: the one in PrimitiveIterable and the one in CharSequence. That’s an error according to the JVM Specification, and that’s exactly what happens:

java.lang.IncompatibleClassChangeError: Conflicting default methods: org/eclipse/collections/api/PrimitiveIterable.isEmpty java/lang/CharSequence.isEmpty

What’s to be done about this? Fortunately, the fix is pretty simple: just add an implementation of isEmpty() to the CharAdapter class. (A couple of other classes, CodePointAdapter and CodePointList, are in a similar situation and were also fixed.) In this case the implementations of isEmpty() are so simple that the expression this.length() == 0 was just inlined. If for some reason it were necessary to have CharAdapter inherit the implementation from PrimitiveIterable, then the implementation in CharAdapter could have been written like this:

@Override
public boolean isEmpty()
{
    return PrimitiveIterable.super.isEmpty();
}

As mentioned above, this fix shipped in Eclipse Collections 10.4.0, which was delivered in time for JDK 15. Again, thanks to the EC team for their quick work on this.

OK, that’s how the JVM behaves. Why does the JVM behave this way? That is, why does it throw an exception (really, an Error) if it detects multiple default methods among the superinterfaces? Couldn’t it, for example, remember what method was called on JDK 14 (the one on PrimitiveIterable), and then continue to call that method even on JDK 15?

The explanation requires understanding of some background about virtual methods. Consider a simple class hierarchy in a library:

class A {
}

class B extends A {
    void m() { }
}

class C extends B {
}

Suppose further that an application has this code:

void exampleCode(B b) {
    b.m();
}

What method is called? Clearly, this will invoke B::m. Now suppose that the library is modified as follows:

class A {
    void m() { } // method "promoted" from B
}

class B extends A {
}

class C extends B {
    void m() { } // a new overriding method
}

and the application is run again. Even though the code is invoking method m on B, we don’t know which method will actually be invoked. If the variable b is an instance of B, then A::m will be invoked. But if variable b is an instance of C, then C::m will be invoked.

The method that actually gets invoked depends on the class of the receiver object and the class hierarchy that has been loaded into this JVM. There is nothing written down anywhere that says that the application used to call B::m. In fact, it would be a mistake for something to be written down that causes B::m to continue to be invoked. When an overriding method is added to class C, calls that used to end up at B::m should now be calling C::m. That’s what we want virtual method calls to do.

It’s similar with superinterfaces (though more complicated of course). The JVM needs to do a search at runtime to determine what method to call. If it finds two default methods, such as PrimitiveIterable::isEmpty and CharSequence::isEmpty, there is no information to tell the JVM that the code used to call PrimitiveIterable::isEmpty and that the CharSequence::isEmpty method was added in the most recent release. All the JVM knows is that it’s been asked to invoke a method, it found two, and it has no further information about which to call. Therefore, the only thing it can do is throw an error.

Finally, could this problem have been avoided in the first place? The JDK team had done some analysis to determine whether adding CharSequence.isEmpty() would cause any incompatibilities. The analysis probably looked for no-arg methods with the same name but with a different return type. It might have looked for a method named isEmpty() with a non-public access level, another cause of incompatibilities. But these are both source incompatibilities. Or maybe the analysis missed Eclipse Collections entirely.

One thing that future analyses ought to look for is unrelated interfaces that declare a matching default method, since these run the risk of binary incompatibility. Such an analysis would have turned up PrimitiveIterable. By itself this isn’t a problem, but it would cause a problem for any class that implements both interfaces. It turns out that CharAdapter (and related classes) do implement both, so that’s clearly a binary incompatibility.

Even if CharAdapter and friends didn’t exist (and even now after they’ve been fixed), there is still a possibility that further incompatibilities exist. Consider some application class that happens to implement both PrimitiveIterable and CharSequence. That class might work perfectly fine with Eclipse Collections 10.3.0 and JDK 14. But it will fail with JDK 15. The problem will persist even if the application upgrades to Eclipse Collections 10.4.0, since the incompatibility is in the application class, not in CharAdapter and friends. So, that application will have to be fixed, too.
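The shape of the conflict, and of the fix, can be sketched with hypothetical stand-in interfaces (these are not the real Eclipse Collections or JDK types, just two unrelated interfaces that both supply a default isEmpty()):

```java
// Stands in for PrimitiveIterable, which already had a default isEmpty().
interface IterableLike { default boolean isEmpty() { return false; } }

// Stands in for CharSequence after a default isEmpty() was added to it.
interface SequenceLike { default boolean isEmpty() { return true; } }

// Without this override, compilation fails with "inherits unrelated
// defaults for isEmpty() from types IterableLike and SequenceLike".
class AppAdapter implements IterableLike, SequenceLike {
    @Override
    public boolean isEmpty() {
        // Explicitly choose one superinterface's implementation.
        return IterableLike.super.isEmpty();
    }
}

public class ConflictDemo {
    public static void main(String[] args) {
        System.out.println(new AppAdapter().isEmpty());
    }
}
```

Declaring the override in the class resolves both the compile-time error and the runtime IncompatibleClassChangeError, since the class method takes precedence over any inherited defaults.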

Now that we’ve described the problem and the possibility of incompatibilities, does it mean that it was a mistake to have added CharSequence.isEmpty()? Not necessarily. Even if we had noticed the incompatibility in Eclipse Collections prior to the addition of the isEmpty() default method, we might have gone ahead with it anyway. The criterion isn’t to avoid incompatibility at all costs. Instead, it’s whether the value of adding the new default method outweighs the cost and risk of incompatibility. That said, it would have been better to have noticed the incompatibility earlier and discussed it before proceeding, instead of putting an external project like Eclipse Collections into the position of having to fix something in response to a change in the JDK.

In summary, adding a default method to an interface can result in source and binary incompatibilities. The possibility of the source incompatibility is perhaps obvious, but the binary incompatibility is quite subtle. Both of these have been possibilities since Java 8 was delivered in 2014. But to my knowledge this is the first time that the addition of a default method has resulted in a binary incompatibility with a real project (as opposed to a theoretical exercise or a toy program). It behooves us to do a more rigorous search for potentially conflicting methods the next time we decide to add a default method to an interface in the JDK.


Bill Shannon (1955-2020)

I was saddened to hear news of Bill Shannon’s recent passing. He joined Sun very early, as employee number 11, soon after Sun’s founding in 1982. As far as I know, he was the earliest Sun employee remaining at Oracle. He was an engineering leader already by the time I joined Sun in 1986. I had the privilege of working with him — and sometimes against him — on several occasions.

Back in the day, people at Sun would refer to each other by their Unix logins. (I was “smarks”, and to some extent, I still am.) To this day I think of Bill simply as “shannon”. The other day I tweeted a few memorable quotes from shannon. Each of them is backed by a funny story, which I’ll relate here. If you ever heard Bill speak, please imagine these spoken in his imperious baritone.

Sometime in the 1990s, Sun’s internal network was organized into domains that corresponded to the overall functional area in which one worked. The engineering groups were under “eng.sun.com”, the corporate management was under “corp.sun.com”, and so forth. We had email addresses that were tied to the domain name, so I was smarks@eng.sun.com. At some point it was decided that everything would be reorganized into geographic domains. I worked in the San Francisco Bay Area region, so the old domains would be replaced with the sfbay.sun.com domain. An announcement went out that described this change, and it said something like,

Please inform all of your contacts that your new email address will be login@sfbay.sun.com instead of the old login@eng.sun.com. The eng.sun.com email addresses will stop working in 90 days.

I thought, this is ridiculous. I’ve handed out countless business cards that have my eng.sun.com email address on them, and I can’t track down everybody I’ve ever given a business card to. I’ve written that email address on papers that have been published in conference proceedings, and those can’t be changed. I can’t be the only one with this problem, either. But, I thought naïvely, it should be pretty simple to set up an MX record (a DNS mail exchanger record) to handle email sent to addresses in the old eng.sun.com domain. I filed a ticket to request that, but it was summarily closed by the network administrators with some explanation like, “Mail forwarding is not possible.” Oh well, I guess I don’t know anything about running a corporate network with thousands of nodes, and I let it drop.

A couple days later, shannon sent mail to all of engineering, describing exactly the same problem I was concerned about. I replied to him, saying that I had requested an MX record be published, but the ticket had been closed. He said, “Yes, that’s what should be done. I’ll talk to the network administrators about it.”

A couple days later, he followed up with this:

    You're right, these people are idiots.

A project that shannon and I worked on together was a large joint development project with another company (which I won’t name, but whose initials are H.P.). Well, OtherCompany had a penchant for coming up with incredibly complex, fragile designs that tried to solve problems that didn’t really need solving.

In desktop systems, it’s pretty common to have a portion of a window that lets users edit text. This is usually implemented by a “text editor” widget provided by the window toolkit library, but created and managed by the application. Apparently this was unsatisfactory for OtherCompany, so they wanted to have a single, “daemon” process that managed all of the text editor widgets for every application on the desktop. At Sun we all thought this was a terrible idea, but OtherCompany wouldn’t let go of it.

At one point there was a conference call where shannon and others at Sun reviewed this design with OtherCompany. It went something like this.


shannon: Now let me get this straight. Instead of each
application owning its own text widgets, all the text
editing functions are centralized into a single process?

OtherCompany: Yes.

shannon: And instead of each application process handling
keyboard events for its text widgets, those events will be
handled by this centralized daemon process?

OtherCompany: Yes.

shannon: So all the text data that the user has entered will
be in this daemon process, not in the application?

OtherCompany: Yes.

shannon: And if this other process crashes, what happens
to that data?

OtherCompany: (discussion) All the text data is lost.

shannon: And if this daemon process hangs, then what will
happen to the applications on the desktop?

OtherCompany: (discussion) They will all hang.

shannon: ...

OtherCompany: ...

shannon: Do you see anything wrong with this architecture?

Bill made a big impression on me early on, well before I actually met him. I joined Sun in 1986, as an impressionable young engineer fresh out of school. Fairly early on I heard about some guy “shannon” who was a bigwig in the Systems group. I was in a separate group, the Windows group, so we didn’t interact.

Some time soon after I joined, shannon sent an email to all of engineering, with a policy statement. (This was before I started to save email compulsively, otherwise I would have dug up the original.) As I recall, it went something like this:


This is a statement on the Systems Group's policy for code
that is checked into SunOS. The policy is:

    * All code must conform to the Sun C Style Guide

Non-conforming code that is posted for review will be
rejected until it does conform.

Non-conforming code that is checked into the source base
will be backed out and will not be permitted to be checked
in until it does conform.

If you do not understand this policy, I will come to your
office and explain it until you do.

This only applied to SunOS code, not Windows code, so it didn’t affect my day-to-day work. But as a young engineer I found it to be hair-raising! The lesson I took from this was, you do not want to cross shannon.

It’s a lesson that served me well over the years. 🙂

Like Bill, I stayed on at Sun all the way up until the 2010 acquisition by Oracle, and we stayed at Oracle until the present day. We didn’t work together too closely in recent years, though we both worked on Java – he worked on Java EE, and I worked on Java ME and Java SE. We were even in the same building on Sun’s (later Oracle’s) Santa Clara campus for several years. It’s amazing that he was around nearby for literally my entire career. It’s a huge loss that he’s gone. Bye shannon, we’ll miss you.

Here are some links to other pages about Bill.


The other day on Twitter I said, “Scanner is a weird beast. I wouldn’t necessarily use it as a good example for anything.” The context was a discussion about classes that are both an Iterator and are AutoCloseable. As it happens, Scanner is such an example. It’s an Iterator, because it allows iteration over a sequence of tokens, and it’s also AutoCloseable, because it might have an external resource (like a file) contained within it. I wouldn’t hold it up as an example of good object design, though. This article explains why.
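Since Scanner is both an Iterator&lt;String&gt; and AutoCloseable, the two roles compose naturally in a try-with-resources statement. A minimal sketch (the method name is mine; a String source needs no closing, but the same shape applies to a File or InputStream source, where close() releases the resource):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class CloseDemo {
    // Iterate tokens via the Iterator methods, inside try-with-resources
    // so any underlying resource would be released when scanning is done.
    static List<String> tokensOf(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            while (sc.hasNext()) {        // Iterator's hasNext()
                result.add(sc.next());    // Iterator's next()
            }
        }                                 // AutoCloseable's close()
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokensOf("one two three"));
    }
}
```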

Scanner has a pretty complicated API, but once you figure out how to use it, it’s incredibly useful. Its main issue is that it’s trying to do too many things at once. The good news is that you can use parts of the API for stylized uses and mostly ignore other parts of the API.

At its core, Scanner is about regex pattern matching. Unlike the Pattern and Matcher classes, which can only match against a fixed input such as a String, Scanner allows you to match over arbitrary input that might not even exist in memory. There are several Scanner constructors that allow input to be read from various sources such as files, InputStreams, or channels. Scanner handles buffering, reading additional input as necessary and discarding any input that was skipped over during matching. This is really cool. It means you can do matching over arbitrarily sized input data using just a few KB of memory.

(Naturally this depends on the patterns used for matching as well as the well-formedness of input. For example, you can attempt to read a file line by line, and this will work for an arbitrarily sized file if it’s broken up into reasonably sized lines. If the file doesn’t have any line separators, Scanner will bring the whole file into memory, as the file conceptually contains one long line.)
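Line-by-line reading looks like this (a sketch using a String source so it's self-contained; with a File or Reader source, Scanner's buffering means only a window of the input is in memory at once):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LineDemo {
    // Pull lines one at a time using hasNextLine/nextLine.
    static List<String> readLines(String input) {
        List<String> lines = new ArrayList<>();
        Scanner sc = new Scanner(input);
        while (sc.hasNextLine()) {
            lines.add(sc.nextLine());
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(readLines("first line\nsecond line\nthird line\n"));
    }
}
```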

Scanner has two fundamental modes of matching. The first mode is to break the input into tokens that are separated by delimiters. The delimiters are defined by the regex pattern you provide. (This is rather like the String.split method.) The second mode is to find chunks of text that result from matching the regex pattern you provide. In other words, the token mode provides the text between matches, and the find mode provides the text of the matches themselves. What’s odd about the Scanner API is that there are groups of methods that apply in one mode but not the other.

The methods that apply to the tokens mode are:

  • delimiter
  • locale
  • hasNext* (excluding hasNextLine)
  • next* (excluding nextLine)
  • radix
  • tokens
  • useDelimiter
  • useLocale
  • useRadix

The methods that apply to the find mode are:

  • findAll
  • findInLine
  • findWithinHorizon
  • hasNextLine
  • nextLine
  • skip

(Additional Scanner methods apply to both modes.)

Here’s an example of using Scanner for matching tokens:

    String story = """
        "When I use a word," Humpty Dumpty said,
        in rather a scornful tone, "it means just what I
        choose it to mean - neither more nor less."
        "The question is," said Alice, "whether you
        can make words mean so many different things."
        "The question is," said Humpty Dumpty,
        "which is to be master - that's all."
        """;

    List<String> words = new Scanner(story)
        .useDelimiter("[- \\.\n\",]+")
        .tokens()
        .collect(toList());

(Note, this example uses the new Text Blocks feature, which was previewed in JDK 13 and 14 and which is scheduled to be final in JDK 15.)

Here, we set the delimiter pattern to match whitespace and various punctuation marks, so the tokens consist of text between the delimiters. The results are:

    [When, I, use, a, word, Humpty, Dumpty, said, in, rather, a, scornful,
    tone, it, means, just, what, I, choose, it, to, mean, neither, more,
    nor, less, The, question, is, said, Alice, whether, you, can, make,
    words, mean, so, many, different, things, The, question, is, said,
    Humpty, Dumpty, which, is, to, be, master, that's, all]

In this example I used the tokens() method to provide a stream of tokens. Scanner implements Iterator&lt;String&gt;, which allows you to iterate over the tokens that were found, using the typical hasNext/next methods. Unfortunately, Scanner does not implement Iterable, which would allow you to use it within a for-loop.
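One workaround, sketched below with names of my own choosing: since Iterable is a functional interface whose only abstract method returns an Iterator, a lambda that hands back the Scanner itself gives you a one-shot Iterable suitable for a single for-each loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ForLoopDemo {
    static List<String> collectTokens(String input) {
        Scanner sc = new Scanner(input);
        // One-shot adapter: Iterable's iterator() just returns the Scanner.
        // Valid for a single traversal only, since the Scanner is consumed.
        Iterable<String> once = () -> sc;
        List<String> tokens = new ArrayList<>();
        for (String token : once) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(collectTokens("alpha beta gamma"));
    }
}
```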

Scanner also provides pairs of hasNext/next methods for converting tokens to data. For example, it provides hasNextInt and nextInt methods that search for the next token and convert it to an int (if available). Corresponding pairs of methods are also available for BigInteger, boolean, byte, double, float, long, and short. These pairs of methods are “iterator-like” in that the hasNextX/nextX method pairs are just like the hasNext/next method pair of an Iterator, with the addition of data conversion. But there’s no way to wrap them in an Iterator, like Iterator<BigInteger> or Iterator<Double>, without writing your own adapter code. This is unfortunate, since Scanner is an Iterator<String> but its Iterator is only over tokens, not the value-added iterator-like constructs that include data conversions.
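Writing that adapter code yourself is straightforward, if tedious. Here's a sketch for the int case (the helper method is mine, not part of the Scanner API):

```java
import java.util.Iterator;
import java.util.Scanner;

public class IntIteratorDemo {
    // Pair hasNextInt/nextInt into a real Iterator<Integer>.
    static Iterator<Integer> ints(Scanner sc) {
        return new Iterator<Integer>() {
            public boolean hasNext() { return sc.hasNextInt(); }
            public Integer next() { return sc.nextInt(); }
        };
    }

    public static void main(String[] args) {
        Iterator<Integer> it = ints(new Scanner("10 20 30"));
        int sum = 0;
        while (it.hasNext()) {
            sum += it.next();
        }
        System.out.println(sum); // prints 60
    }
}
```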

The other main mode of Scanner is the find mode, which provides a succession of matches from a pattern you provide. Here’s an example of that:

    List<String> words = new Scanner(story)
        .findAll("[A-Za-z']+")
        .map(MatchResult::group)
        .collect(toList());

Here, instead of matching delimiters between tokens, I’ve provided a pattern that matches the results I want to get. Note that findAll() returns a Stream&lt;MatchResult&gt;, which must be converted to strings; that’s what the MatchResult::group method reference does. The resulting list is the exact same list of words as the previous example. Personally, I find this mode more useful than the tokens mode. You’re providing the pattern for the text you’re interested in, as opposed to a pattern for the delimiters between the text you’re interested in. Also, you get back MatchResult objects, which are useful for extracting substrings of what you matched. This isn’t available in tokens mode.
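To illustrate extracting substrings via MatchResult, here's a sketch (input format and method name invented for the example) that uses capture groups to pull just the values out of key=value pairs:

```java
import java.util.List;
import java.util.Scanner;
import static java.util.stream.Collectors.toList;

public class FindAllDemo {
    // Group 1 matches the key, group 2 the digits; keep only group 2.
    static List<String> values(String input) {
        return new Scanner(input)
            .findAll("(\\w+)=(\\d+)")
            .map(m -> m.group(2))
            .collect(toList());
    }

    public static void main(String[] args) {
        System.out.println(values("a=1 b=22 c=333"));
    }
}
```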

I started off this article saying that Scanner is weird but useful. It’s weird because it has these two distinct modes. It has groups of methods that apply to one mode but not the other. If you look at the API carefully (or at the implementation) you’ll see that there is also a bunch of internal state that applies to one mode but not the other. It seems like Scanner should have been split into two classes. Another weird thing about Scanner is that it’s an Iterator&lt;String&gt;, which elevates one part of one of the modes to the top level of the API and relegates the other parts to second-class status.

That said, Scanner provides some very useful services. It does I/O and buffering for you, and if regex matching needs more input, it handles that automatically. I’m also partial to the streams-returning methods like findAll() and tokens() — I have to admit, I added them — but they make bulk processing of arbitrary input quite easy. I hope you find these aspects of Scanner useful as well.


Oracle Code One 2019

Here’s a quick summary of Oracle Code One 2019, which was last week.

It essentially started the previous week at the “Chinascaria”, Steve Chin‘s Community BBQ for JUG leaders and friends. Although Steve is now at JFrog, he’s continuing the BBQ tradition. Of course Bruno Souza, Edson Yanaga, and some other cohorts from Brazil were manning the BBQ, and there was plenty of meat to be had. I didn’t get many photos, but Ruslan from JUG.RU was there and he insisted that we take a selfie:

Hi Ruslan! Oh, here’s a tweet with the chefs from the BBQ:

Java Keynote

The conference kicked off with the Java keynote, The Future of Java is Now, led by Georges Saab. The pace was pretty brisk, with several walk-on guests. We heard Jessica Pointing talk about quantum computing, and Aimee Lucido speak about her new book, Emmy in the Key of Code. This sounds really cool, a book written in Java-code-like verse. This should be interesting to my ten-year-old daughter, since she’s reading the Girls Who Code series right now. I have to say this is the first time I’ve shown a segment of a conference keynote to my family!

Naturally a good section of the keynote covered technical issues. Mikael Vidstedt and Brian Goetz ably covered the evolution of the JVM and the Java programming language. Notably, Mark Reinhold did not appear; he’s taking a break from conferences to refocus on hard technical problems.

My Sessions

This year, I had two technical sessions and a lab. This was a pretty good workload, compared with previous years where I had half a dozen sessions. I felt like I made a good contribution to the audience, but it left time for me to have conversations with colleagues (the “hallway track”) and to attend other sessions I was interested in.

My sessions were:

Collections Corner Cases (slides, video)

This session covered Map’s view collections (keySet, values, entrySet) and topics regarding comparators being “inconsistent with equals.”

Local Variable Type Inference: Friend or Foe? (slides, video)

(with Simon Ritter)

When Simon and I did an earlier version of this talk at another conference, we called it “Threat or Menace.” This probably doesn’t translate too well; to me, it has a 1950s red scare connotation, which is distinctly American. I think that’s why Simon changed it to Friend or Foe. It turns out that Venkat Subramaniam also had a talk on the same subject, entitled “Type Inference: Friend or Foe”!

Lambda, Streams, and Collectors Programming Laboratory (lab repository)

(with Maurice Naftalin and José Paumard)

This lab continues to evolve; there are now over 100 exercises. Thanks to Maurice and José for continuing to maintain and develop the lab materials. I recalled that we first did a Lambda Lab at Devoxx UK in 2013, which was before Java 8 was released. Maurice and Richard Warburton and I got together an hour beforehand and came up with about half a dozen exercises. It was a bit ad hoc, but we managed to keep a dozen or so people busy for an hour and a half.

More recently we (mostly José) have added and reorganized the exercises, converted the project to maven, and converted the test assertions to AssertJ. I’ve finally come around to the idea that maven is the way to go. However, the lab attendees still had their fair share of configuration problems. I think the main problem is the mismatch between maven and the IDE. It’s possible to build the project on the command line using maven, but hitting the “Test” button in the IDE does some magic that doesn’t necessarily invoke maven, so it might or might not work.

Meet the Experts

One thing that was new this year was the “Meet the Experts” sessions. In the past we’d be asked to sign up for “booth duty” which consisted of standing around for a couple hours waiting for people to ask questions. This was mostly a waste of time, since we didn’t have flashy demos. Instead, we scheduled informal, half-hour time slots at a station in the Groundbreakers Hub, and these were put onto the conference program. The result was that people showed up! I signed up for two of these. I didn’t have a formal presentation; I just answered people’s questions. This seemed considerably more useful than past “booth duty.” People had good questions, and I had some good conversations.

Everything You Ever Wanted To Know About Java And Didn’t Know Whom To Ask (video)

I hadn’t signed up for this session, but the day before the session, Bruno Souza corralled me (and several others) into participating in this. Essentially it’s an impromptu “ask me anything” panel. He convinced about 15 people to be on the panel. This included various JUG leaders, conference speakers, and experts in various areas. During the first part of the session, Bruno gathered questions from the audience and a colleague typed them into a document that was projected on the screen. Then he called the panelists up on stage. The rest of the session was the panel picking questions and answering them. I thought this turned out quite well. People got their questions answered, we covered quite a variety of topics, and it provoked some interesting discussions.

Other Sessions of Interest

I attended a few other sessions that were quite useful. I also watched on video some of the sessions that I had missed. Here they are, in no particular order:

Robert Seacord, Serialization Vulnerabilities (video)

Mike Duigou, Exceptions 2020 (slide download available)

Sergey Kuksenko, Does Java Need Value Types? Performance Perspective (video)

Brian Goetz, Java Language Futures, 2019 Edition (video)

Venkat Subramaniam, Type Inference: Friend or Foe? (video)

Robert Scholte, Broken Build Tools and Bad Behaviors (slide download available)

Nikhil Nanivadekar, Do It Yourself: Collections

Here’s the playlist of Code One sessions that were recorded.

Unfortunately, not all of the sessions were recorded. Some of the speakers’ slide decks are available for download via the conference catalog.

 

Read Full Post »

It was recently announced that Jakarta EE will not be allowed to evolve APIs in the javax.* namespace. (See Mike Milinkovich’s announcement and his followup Twitter thread.) Shortly thereafter, David Blevins posted a proposal and call for discussion about how Jakarta EE should transition its APIs into the new jakarta.* namespace. There seem to be two general approaches to the transition: a “big bang” (do it all at once) approach and an incremental approach. I don’t have much to add to the discussion about how this transition should take place, except to say that I’m pleasantly surprised at the amount of energy and focus that has emerged in the Jakarta EE community around this effort.

I’m a Java SE guy, so the details of Java EE and Jakarta EE specifications are pretty much outside my bailiwick. However, as Dr Deprecator, I should point out that there is one area of overlap: the dependence of Java EE / Jakarta EE APIs on deprecated Java SE APIs. One example in particular that I’m aware of was brought to my attention by my colleague Sean Mullan, who is tech lead of the Java SE Security Libraries group.

The Java SE API in question is java.security.Identity, which was deprecated in JDK 1.2 (released 1998) and deprecated for removal in Java 9. This API has been deprecated for a very long time, and we’d like to remove it from Java SE. For most purposes, it can be replaced by java.security.Principal, which was added in JDK 1.1 (released 1997).

The EJB specification uses the Identity type in a couple methods of the EJBContext class. If we were to remove Identity from some release of Java SE, it would mean that EJB — and any Java EE, Jakarta EE, or any other framework that includes EJB — would no longer be compatible with that release of Java SE. We’ve thus held off removing this type for the time being, in order to avoid pulling the rug out from underneath the EE specs.

Identity is used only in two methods of the EJBContext class. It appears that these methods were deprecated in EJB 1.2, and replacements that use Principal were introduced at that time. Since J2EE 1.2 was introduced in 1999, things have been this way for about 20 years. I think it’s time to do some cleanup! (See EJB-spec issue #130.)

For better or for worse, these methods still appear in Java EE 8. As I understand things, the next specification release will be Jakarta EE 9, which will be the earliest opportunity to change the EE specification to remove the dependency on the deprecated SE APIs.

The usual argument against removing stuff is that it’s both source and binary incompatible. If something falls over because of a missing API, it’s pretty hard to work around. This is the reason that deprecated stuff has stayed around for so many years. On the other hand, if these deprecated APIs aren’t removed now, when will they be removed?

I’d argue that the upcoming package renaming (whether incremental or big bang) is an opportunity to remove obsolete APIs, because such renaming is inherently both source and binary incompatible. People will have to run migration tools and change their code when they transition it from Java EE 8 to Jakarta EE 9. There can be no expectation that old jar files will run unchanged in the new Jakarta world. Thus, the package renaming is an opportunity to shed these obsolete APIs.

I’m not aware of any EE APIs other than EJBContext that depend on Java SE APIs that are deprecated for removal. I did a quick check of GlassFish 5 using the jdeprscan tool, and this one was the only API-to-API dependency that I found. However, I’m not an expert in EE and GlassFish, so I’m not sure I checked the right set of jars. (I did find a bunch of other stuff, though. Contact me if you’re interested in details.)

I had a brief Twitter exchange with David Blevins on this topic the other day. He pointed me at the parts of the TomEE implementation that implement EJBContext, and it turns out that the two methods in question simply throw UnsupportedOperationException. This is good news, in that it means TomEE applications aren’t using these methods, which means that those applications won’t break if these methods are removed.

However, that doesn’t mean these methods can simply be removed from EE implementations! The TCKs have what is called a “signature test,” which scans the libraries for the public classes, fields, and methods, to make sure that all the APIs required by the specifications are present and that there are no extra APIs. I’m fairly sure that the EE TCK signature test contains entries for those methods. Thus, what needs to happen is that the Jakarta EE specification needs to remove these methods, the EE TCK needs to be updated to match, and then implementations can remove — in fact, will be required to remove — these methods when they’re brought into conformance with the new specification.

Note that all of this is separate from the question of what to do with other deprecated Jakarta EE APIs that don’t depend on deprecated Java SE APIs. Deprecated Jakarta EE APIs might have been deprecated for their own reasons, not because of their dependency on SE APIs. These should be considered on their own merits and an appropriate removal plan developed. Naturally, as Dr Deprecator, I like removing old, obsolete APIs. But the deprecation and potential removal plan for deprecated Jakarta EE APIs needs to be developed with the particular evolution path of those APIs in mind.

Read Full Post »

Processing Large Files in Java

Last week, Paige Niedringhaus posted an article Using Java to Read Really, Really Large Files. While one can quibble about whether the file to be processed is indeed “really, really large,” it’s large enough to expose some interesting concerns and to present some interesting opportunities for optimizations and use of newer APIs. There was some discussion on Reddit /r/java and /r/programming and a PR with an alternative implementation. Earlier today, Morgen Peschke posted an analysis and comparison to a Scala version of the program. (This is posted as a comment at the bottom of the original article.) This article is my contribution to the discussion.

When I ran Niedringhaus’ program on my machine using JDK 8, I ran into the same memory issues as did Peschke; the program consumed so much memory that it spent all its time in garbage collection. Increasing the heap size worked around this problem. Interestingly, using JDK 11, I was able to run Niedringhaus’ version successfully without increasing the heap size. (I suspect the reason is that JDK 11 uses G1GC as the default collector, and its different collection scheme avoids the pathological behavior of the Parallel GC, which is the default collector in JDK 8.)

The approach I’ll take is to retain the large lists accumulated by the original program. My presumption is that the lists are loaded into memory in order to do further analysis that isn’t part of the original program. Instead of reducing memory consumption, I focus on changing aspects of the computation to improve runtime performance. After establishing the program’s baseline performance, I proceed to show several variations on the code that successively improve its performance, along with some discussion describing the reasons for the improvement. I present a diff for each variation. Each variation, along with my final version, is also available in a gist.

I downloaded indiv18.zip on 2019-01-04 and extracted the itcont.txt file. It’s about 3.3GB in size and has 18,245,416 lines. I started with the 2019-01-05 version of Niedringhaus’ test program:

ReadFileJavaApplicationBufferedReader.java

For comparing execution times, I’m using the last time reported by the program, the “Most common name time.” The benchmark times all showed a fair amount of variability. In most cases I reran the program a few times and chose the fastest time. This isn’t very rigorous, but it should at least give an idea of the relative speeds of execution. I’ve rounded to whole seconds, because the high variability makes milliseconds superfluous.

Niedringhaus’ article reported a runtime of about 150 seconds for this version of the program. I ran the program on my laptop (MacBook Pro, mid 2014, 3GHz Intel Core i7, 2 cores, 16GB) and the execution time was about 108 seconds. I’ll use that figure as the baseline against which subsequent optimizations are compared.

Variation 1

--- ReadFileJavaApplicationBufferedReader0.java
+++ ReadFileJavaApplicationBufferedReader1.java
@@ -57,8 +57,8 @@
                // System.out.println(readLine);
 
                // get all the names
-               String array1[] = readLine.split("\\s*\\|\\s*");
-               String name = array1[7];
+               String array1[] = readLine.split("\\|", 9);
+               String name = array1[7].strip();
                names.add(name);
                if(indexes.contains(lines)){
                        System.out.println("Name: " + names.get(lines - 1) + " at index: " + (lines - 1));
@@ -80,7 +80,7 @@
                        }
                }
 
-               String rawDate = array1[4];
+               String rawDate = array1[4].strip();
                String month = rawDate.substring(4,6);
                String year = rawDate.substring(0,4);
                String formattedDate = month + "-" + year;

Applying this patch reduced the execution time from 108 seconds to about 44 seconds.

This change is actually two optimizations. String splitting is quite expensive, and it’s done once for each of the 18 million lines in the file. It’s thus quite beneficial to remove work from the program’s main loop. The String.split() call uses a regex that splits the line into fields, where the separator is a vertical bar including any adjacent whitespace. The regex pattern is compiled each time through the loop. It would save some time to compile the regex once before the loop and to reuse it. But it turns out that using a regex here is unnecessary. We can instead split on a vertical bar alone. The split() method has a fast path for single-character split patterns which avoids regexes entirely. (Since the vertical bar is a regex metacharacter, it still counts as a single character even with the backslash escapes.) Thus we don’t need to worry about pre-compiling the split pattern.

Changing the split pattern can leave unwanted whitespace in some of the fields we’re interested in. To deal with this, we call the String.strip() method to remove it from those fields. The strip() method is new in Java 11. It removes whitespace from both ends of a string, where whitespace is defined using Unicode semantics. This differs from the older String.trim() method, which uses an anachronistic definition of whitespace based on ASCII control characters.
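A minimal sketch of the difference (the padded name string is made up): U+2000 (EN QUAD) is Unicode whitespace but lies above U+0020, so trim() leaves it alone while strip() removes it. This requires Java 11 for strip().

```java
public class StripVsTrim {
    public static void main(String[] args) {
        String s = "\u2000SMITH, JOHN\u2000"; // padded with U+2000 (EN QUAD)
        // trim() only removes chars <= U+0020, so the padding survives
        System.out.println(s.trim().equals("SMITH, JOHN"));  // false
        // strip() uses Unicode whitespace semantics, so it's removed
        System.out.println(s.strip().equals("SMITH, JOHN")); // true
    }
}
```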

The second optimization applies a limit to the number of splits performed. Each line of the file has 21 fields. Without the limit parameter, the split() method will split the entire line into 21 fields and create string objects for them. However, the program is only interested in data from the 5th and 8th fields (array indexes 4 and 7). It’s a lot of extra work to split the remaining fields and then just to throw them away. Supplying a limit argument of 9 will stop splitting after the eighth field, leaving the remainder of the line unsplit in the last array element (at index 8). This reduces the amount of splitting work considerably.
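Putting the two optimizations together, here’s a small sketch using a made-up pipe-delimited record (the real file has 21 fields; this one is shortened, but the name and date land at the same indexes, 7 and 4):

```java
public class SplitLimit {
    public static void main(String[] args) {
        // a made-up record with pipe-separated fields
        String line = "f0|f1|f2|f3|20180131|f5|f6| SMITH, JOHN A|f8|f9|f10";
        // single-character pattern (even escaped) takes split()'s
        // non-regex fast path; limit 9 stops splitting after the
        // eighth field, leaving the tail unsplit in the last element
        String[] fields = line.split("\\|", 9);
        System.out.println(fields.length);     // 9
        System.out.println(fields[8]);         // f8|f9|f10
        System.out.println(fields[7].strip()); // SMITH, JOHN A
        System.out.println(fields[4]);         // 20180131
    }
}
```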

Variation 2

--- ReadFileJavaApplicationBufferedReader1.java
+++ ReadFileJavaApplicationBufferedReader2.java
@@ -29,17 +29,12 @@
 
        // get total line count
        Instant lineCountStart = Instant.now();
-       int lines = 0;
 
        Instant namesStart = Instant.now();
        ArrayList<String> names = new ArrayList<>();
 
        // get the 432nd and 43243 names
-       ArrayList<Integer> indexes = new ArrayList<>();
-
-       indexes.add(1);
-       indexes.add(433);
-       indexes.add(43244);
+       int[] indexes = { 0, 432, 43243 };
 
        // count the number of donations by month
        Instant donationsStart = Instant.now();
@@ -53,16 +48,12 @@
          System.out.println("Reading file using " + Caller.getName());
 
        while ((readLine = b.readLine()) != null) {
-               lines++;
                // System.out.println(readLine);
 
                // get all the names
                String array1[] = readLine.split("\\|", 9);
                String name = array1[7].strip();
                names.add(name);
-               if(indexes.contains(lines)){
-                       System.out.println("Name: " + names.get(lines - 1) + " at index: " + (lines - 1));
-               }
 
                if(name.contains(", ")) {
 
@@ -88,11 +79,15 @@
 
        }
 
+         for (int i : indexes) {
+             System.out.println("Name: " + names.get(i) + " at index: " + (i));
+         }
+
        Instant namesEnd = Instant.now();
        long timeElapsedNames = Duration.between(namesStart, namesEnd).toMillis();
        System.out.println("Name time: " + timeElapsedNames + "ms");
 
-       System.out.println("Total file line count: " + lines);
+       System.out.println("Total file line count: " + names.size());
        Instant lineCountEnd = Instant.now();
        long timeElapsedLineCount = Duration.between(lineCountStart, lineCountEnd).toMillis();
        System.out.println("Line count time: " + timeElapsedLineCount + "ms");

This patch reduces the execution time from 44 seconds to about 40 seconds.

This is perhaps a bit of a cheat, but it’s another example of removing work from the inner loop. The original code maintained a list of indexes (line numbers) for which names are to be printed out. During the loop, a counter would keep track of the current line, and the current line would be queried against the list of indexes to determine if the name is to be printed out. The list is short, with only 3 items, so searching it is pretty quick. There are 18,245,416 lines in the file and only 3 indexes in the list, so searching the list for the current line number will fail 18,245,413 times. Since we’re storing all the names in a list, we can just print out the names we’re interested in after we’ve loaded them all. This avoids having to check the list within the inner loop.

The patch also stores the indexes in an array since the syntax for initializing an array is a bit more concise. It also avoids boxing overhead. Boxing of three elements isn’t a significant overhead, so it’s unlikely this makes any measurable difference in the performance. In general, I prefer to avoid boxing unless it’s necessary.

Variation 3

--- ReadFileJavaApplicationBufferedReader2.java
+++ ReadFileJavaApplicationBufferedReader3.java
@@ -44,6 +45,7 @@
        Instant commonNameStart = Instant.now();
        ArrayList<String> firstNames = new ArrayList<>();
 
+       var namePat = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
 
        System.out.println("Reading file using " + Caller.getName());
 
@@ -55,20 +57,13 @@
                String name = array1[7].strip();
                names.add(name);
 
-               if(name.contains(", ")) {
-
-                       String array2[] = (name.split(", "));
-                       String firstHalfOfName = array2[1].trim();
-
-                       if (!firstHalfOfName.isEmpty()) {
-                               if (firstHalfOfName.contains(" ")) {
-                                       String array3[] = firstHalfOfName.split(" ");
-                                       String firstName = array3[0].trim();
-                                       firstNames.add(firstName);
-                               } else {
-                                       firstNames.add(firstHalfOfName);
-                               }
+               var matcher = namePat.matcher(name);
+               if (matcher.find()) {
+                   String s = matcher.group(2);
+                   if (s == null) {
+                       s = matcher.group(3);
                    }
+                   firstNames.add(s);
                }
 
                String rawDate = array1[4].strip();

This patch reduces the execution time from 40 to about 38 seconds.

Whereas in variation 1 we saw that reducing a regex to a single character split pattern helped provide a large speedup, in this case we’re replacing some fairly involved string splitting logic with a regex. Note that this code compiles the regex outside the loop and uses it repeatedly within the loop. In this patch I’m attempting to provide similar semantics to the splitting logic, but I’m sure there are cases where it doesn’t produce the same result. (For the input data in this file, the regex produces the same result as the splitting logic.) Unfortunately the complexity is moved out of the logic and into the regex. I’m not going to explain the regex in great detail, since it’s actually fairly ad hoc itself. One problem is that extracting a “first name” from a name field relies on European name conventions, and those conventions don’t apply to all names in this file. A second problem is that the data itself isn’t well-formed. For example, one name in the file is “FOWLER II, COL. RICHARD”. Both the splitting logic and the regex extract the first name as “COL.” which is clearly a title, not a name. It’s unclear what can be done in this case. Nevertheless, the vast majority of records in the file are well-formed, and applying European name conventions works for them. For a name record such as “SMITH, JOHN A” both the splitting logic and the regex extract “JOHN” as the first name, which is the intended behavior.

Variation 4

--- ReadFileJavaApplicationBufferedReader3.java
+++ ReadFileJavaApplicationBufferedReader4.java
@@ -45,7 +45,7 @@
        Instant commonNameStart = Instant.now();
        ArrayList<String> firstNames = new ArrayList<>();
 
-       var namePat = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
+       var namePat = Pattern.compile(", \\s*([^, ]+)");
 
          System.out.println("Reading file using " + Caller.getName());
 
@@ -59,11 +59,7 @@
 
                  var matcher = namePat.matcher(name);
                  if (matcher.find()) {
-                     String s = matcher.group(2);
-                     if (s == null) {
-                         s = matcher.group(3);
-                     }
-                     firstNames.add(s);
+                     firstNames.add(matcher.group(1));
                  }
 
                String rawDate = array1[4].strip();

This patch reduces the runtime from 38 seconds to about 35 seconds.

For reasons discussed previously, it’s difficult in general to extract the correct “first name” from a name field. Since most of the data in this file is well-formed, I took the liberty of making some simplifying assumptions. Instead of trying to replicate the original splitting logic, here I’m using a simplified regex that extracts the first non-comma, non-space sequence of characters that follows a comma-space separator. In most cases this will extract the same first name from the name field, but there are some edge cases where it returns a different result. Assuming this is acceptable, it allows a simplification of the regex and also of the logic to extract the desired substring from the match. The result is another small speedup.
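The same kind of harness with the simplified Variation 4 regex (the suffixed name is made up); note how both the pattern and the extraction logic get shorter:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleFirstName {
    // the simplified regex from Variation 4: the first run of non-comma,
    // non-space characters after a comma-space separator
    static final Pattern NAME_PAT = Pattern.compile(", \\s*([^, ]+)");

    static String firstName(String name) {
        Matcher m = NAME_PAT.matcher(name);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstName("SMITH, JOHN A"));  // JOHN
        System.out.println(firstName("DOE, JANE, JR.")); // JANE
    }
}
```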

Variation 5

--- ReadFileJavaApplicationBufferedReader4.java
+++ ReadFileJavaApplicationBufferedReader5.java
@@ -46,6 +46,8 @@
        ArrayList<String> firstNames = new ArrayList<>();
 
        var namePat = Pattern.compile(", \\s*([^, ]+)");
+       char[] chars = new char[6];
+       StringBuilder sb = new StringBuilder(7);
 
        System.out.println("Reading file using " + Caller.getName());
 
@@ -63,11 +65,12 @@
                  }
 
                String rawDate = array1[4].strip();
-               String month = rawDate.substring(4,6);
-               String year = rawDate.substring(0,4);
-               String formattedDate = month + "-" + year;
-               dates.add(formattedDate);
-
+               rawDate.getChars(0, 6, chars, 0);
+               sb.setLength(0);
+               sb.append(chars, 0, 4)
+                 .append('-')
+                 .append(chars, 4, 2);
+               dates.add(sb.toString());
        }
 
          for (int i : indexes) {

This patch reduces the runtime from 35 seconds to about 33 seconds.

This change is primarily to reduce the amount of memory allocation within the inner loop. The previous code extracts two substrings from the raw date, creating two objects. It then appends the strings with a “-” separator, which requires creation of a temporary StringBuilder object. (This is likely still true even with JEP 280 – Indify String Concatenation in place.) Finally, the StringBuilder is converted to a String, allocating a fourth object. This last object is stored in a collection, but the first three objects are garbage.

To reduce object allocation, the patch code creates a char array and a StringBuilder outside the loop and reuses them. The character data is extracted into the char array, pieces of which are appended to the StringBuilder along with the “-” separator. The StringBuilder’s contents are then converted to a String, which is then stored into the collection. This String object is the only allocation that occurs in this step, so the patch code avoids creating any garbage.

I’m of two minds about this optimization. It does provide a few percentage points of optimization. On the other hand, it’s decidedly non-idiomatic Java: it’s rare to reuse objects this way. However, this code doesn’t introduce much additional complexity, and it does provide a measurable speedup, so I decided to keep it in. It does illustrate some techniques for dealing with character data that can reduce memory allocation, which can become expensive if done within an inner loop.
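A standalone sketch of the buffer-reuse technique, using made-up date records:

```java
public class DateFormatReuse {
    public static void main(String[] args) {
        // buffers allocated once, before the loop, and reused each iteration
        char[] chars = new char[6];
        StringBuilder sb = new StringBuilder(7);

        String[] rawDates = { "20180131", "20171207" }; // made-up records
        for (String rawDate : rawDates) {
            rawDate.getChars(0, 6, chars, 0); // copy "yyyyMM" without substrings
            sb.setLength(0);                  // reset instead of reallocating
            sb.append(chars, 0, 4).append('-').append(chars, 4, 2);
            // toString() is the only allocation per iteration
            System.out.println(sb.toString()); // 2018-01, then 2017-12
        }
    }
}
```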

Variation 6

--- ReadFileJavaApplicationBufferedReader5.java
+++ ReadFileJavaApplicationBufferedReader6.java
@@ -115,16 +115,9 @@
                }
        }
 
-       LinkedList<Entry<String, Integer>> list = new LinkedList<>(map.entrySet());
+       Entry<String, Integer> common = Collections.max(map.entrySet(), Entry.comparingByValue());
 
-       Collections.sort(list, new Comparator<Map.Entry<String, Integer> >() {
-               public int compare(Map.Entry<String, Integer> o1,
-                                  Map.Entry<String, Integer> o2)
-               {
-                       return (o2.getValue()).compareTo(o1.getValue());
-               }
-       });
-       System.out.println("The most common first name is: " + list.get(0).getKey() + " and it occurs: " + list.get(0).getValue() + " times.");
+       System.out.println("The most common first name is: " + common.getKey() + " and it occurs: " + common.getValue() + " times.");
        Instant commonNameEnd = Instant.now();
        long timeElapsedCommonName = Duration.between(commonNameStart, commonNameEnd).toMillis();
        System.out.println("Most common name time: " + timeElapsedCommonName + "ms");

This patch reduces the runtime from 33 seconds to about 32 seconds.

The task here is to find the most frequently occurring first name. Instead of sorting a list of map entries, we can simply use Collections.max() to find the maximum entry according to some criterion. Also, instead of having to write out a comparator that compares the values of two map entries, we can use the Entry.comparingByValue() method to obtain such a comparator. This doesn’t result in much of a speedup. The reason is that, despite there being 18 million names in the file, there are only about 65,000 unique first names in the file, and thus only that many entries in the map. Computing the maximum entry saves a little bit of time compared to doing a full sort, but not that much.
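A small sketch of the max-by-value technique on an invented frequency map:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Map.Entry;

public class MostCommon {
    public static void main(String[] args) {
        // made-up name counts standing in for the real frequency table
        Map<String, Long> nameCounts =
                Map.of("JOHN", 120L, "MARY", 340L, "ROBERT", 95L);
        // a single pass finds the maximum entry; no sort required
        Entry<String, Long> common =
                Collections.max(nameCounts.entrySet(), Entry.comparingByValue());
        System.out.println(common.getKey() + ": " + common.getValue()); // MARY: 340
    }
}
```

Collections.max() runs in linear time over the entry set, versus the O(n log n) of the original sort, though as noted above the map is small enough that the difference is modest.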

Variation 7

This isn’t a patch, but instead I did a general cleanup and refactoring pass. I’ll describe the changes here. The revised source file is in this gist:

ReadFileJavaApplicationBufferedReader7.java

The changes didn’t significantly affect the runtime, which remained at about 32 seconds.

There are a couple places in the original code where a frequency table is generated. The general algorithm is to create a map of items to counts (typically Integer) to hold the results. Then, for each item, if there’s no entry for it in the map, insert it with the value 1, otherwise add 1 to the value that’s already there. Several commenters have suggested using Map.merge() to make the put-or-update logic within the loop more concise. This will indeed work, but there’s a better way to do this using streams. For example, the list firstNames holds all the first names extracted from the file. To generate a frequency table of these names, one can use this code:

Map<String, Long> nameMap = firstNames.stream()
                                      .collect(groupingBy(name -> name, counting()));

(This assumes a static import of java.util.stream.Collectors.* or individual names.) See the JDK Collectors documentation for more information. Note that the count value is a Long, not an Integer. Note also that we must use boxed values instead of primitives here, because we’re storing the values into collections.

I also use this same technique to generate the frequency table for dates:

Map<String, Long> dateMap = dates.stream()
                                 .collect(groupingBy(date -> date, counting()));

The typical technique to loop over a map involves looping the map’s entry set, and extracting the key and value from the entry using the getKey() and getValue() methods. Often, a more convenient way to loop over the entries of a Map is to use the Map.forEach() method. I used this to print out the map entries from the date map:

dateMap.forEach((date, count) ->
    System.out.println("Donations per month and year: " + date + " and donation count: " + count));

What makes this quite convenient is that the key and value are provided as individual arguments to the lambda expression, avoiding the need to call methods to extract them from an Entry.

Instead of creating a File object, opening a FileReader on it, and then wrapping it in a BufferedReader, I used the NIO newBufferedReader() method:

BufferedReader b = Files.newBufferedReader(Path.of(FILENAME));

It’s a bit more convenient than the wrapping approach.

Other changes I made include the following:

  • Unified the start time into a single Instant variable, and refactored the elapsed time reporting into a separate between() method.
  • Removed the outermost try statement whose catch block does nothing other than printing a stack trace. I see this a lot; I suspect it exists in some code template somewhere. It’s completely superfluous, because simply letting the exception propagate will cause the default exception handler to print the stack trace anyway. The only thing you might need to do is to add a throws IOException to the main() method, which is what I did in this case.
  • Used interface types instead of implementation types. I used List and Map in variable declarations instead of ArrayList and HashMap. This is an example of programming to an interface, not an implementation. This is not of great consequence in a small program, but it’s a good habit to get into, especially when defining fields and methods. I could also have used var in more places, but I wanted to be explicit when I changed type arguments, e.g., from Integer to Long.
  • Reindented the code. The JDK style is to use spaces for indentation, in multiples of 4. This avoids lines that are indented halfway off the right edge of the screen, but mainly I’m more comfortable with it.
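The first item above can be sketched roughly as follows. The between() helper shown here is my guess at the shape of the refactoring, not the program’s actual code.

```java
import java.time.Duration;
import java.time.Instant;

public class Timing {
    // Elapsed time between two instants, in seconds (hypothetical helper).
    static double between(Instant start, Instant end) {
        return Duration.between(start, end).toMillis() / 1000.0;
    }

    public static void main(String[] args) {
        Instant start = Instant.now();
        // ... the work being timed would go here ...
        Instant end = start.plusMillis(1500);  // simulate 1.5 seconds of work
        System.out.println("elapsed: " + between(start, end) + " sec");
    }
}
```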

Performance Recap


Version             Time (sec)      Description
-------             ----------      -----------
Original               108          baseline
Variation 1             44          optimize line splitting
Variation 2             40          rearrange printing lines by index
Variation 3             38          use regex for extracting first name
Variation 4             35          simplified first name regex
Variation 5             33          reuse StringBuilder/char[] for date extraction
Variation 6             32          use max() instead of sort()
Variation 7             32          cleanup

Summary & Comment

The first several optimizations involved removing work from the inner loop of the program. This is fairly obvious. Since the loop is executed a lot (18 million times), even a small reduction in the amount of work can affect the program’s runtime significantly.

What’s less obvious is the effect of reducing the amount of garbage generated within a loop. When more garbage is generated, it fills up the heap more quickly, causing GC to run more frequently. The more GC runs, the less time the program can spend getting work done. Thus, reducing the amount of garbage generated can also speed up a program.

I didn’t do any profiling of this program. Normally when you want to optimize a program, profiling is one of the first things you should do. This program is small enough, and I think I have a good enough eye for spotting potential improvements, that I was able to find some significant speedups. However, if somebody were to profile my revised version, they might be able to find more things to optimize.

Typically it’s a bad idea to do ad hoc benchmarking by finding the difference between before-and-after times. This is often the case with microbenchmarking. In such cases it’s preferable to use a benchmarking framework such as JMH. I didn’t think it was strictly necessary to use a framework to benchmark this program, though, since it runs long enough to avoid the usual benchmarking pitfalls. However, the differences in the runtimes between the later optimizations are getting smaller and smaller, and it’s possible that I was misled by my informal timing techniques.

Several commenters have suggested using the Files.lines() method to get a stream of lines, and then running this stream in parallel. I’ve made a few attempts to do this, but I haven’t shown any of them here. One issue is with program organization. As it stands, this program’s main loop extracts data into three lists. Doing this with streams involves either operations with side effects (which are not recommended for parallel streams) or creating an aggregate object that can be used to accumulate the results. These are certainly reasonable approaches, but I wasn’t able to get any speedup from using parallel streams — at least on my 2-core system. The additional overhead of aggregation seemed to more than offset the benefit gained from running on two cores. It’s quite possible that with more work, or on a system with more cores, the program could realize a benefit from running in parallel.
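As a rough illustration only (this is not one of my actual attempts), the aggregate approach might look something like the following, using Files.lines() with a concurrent collector. The key extraction here is a stand-in — it just grabs the first comma-separated field — where the real program’s parsing is more involved.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelCount {
    // Count occurrences of the first comma-separated field, in parallel.
    // groupingByConcurrent avoids side effects by accumulating into a
    // concurrent map, which is safe for parallel streams.
    static Map<String, Long> countByFirstField(Path file) {
        try (var lines = Files.lines(file)) {
            return lines.parallel()
                        .map(line -> line.substring(0, line.indexOf(',')))
                        .collect(Collectors.groupingByConcurrent(
                            k -> k, Collectors.counting()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Self-contained check with made-up data.
    static long demo() {
        try {
            Path tmp = Files.createTempFile("donations", ".csv");
            Files.write(tmp, java.util.List.of("JOHN,100", "MARY,50", "JOHN,25"));
            Map<String, Long> counts = countByFirstField(tmp);
            Files.delete(tmp);
            return counts.get("JOHN");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```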

I believe the changes I’ve shown improve the quality of the code as well as its performance. But it’s possible to optimize this program even further. I have some additional changes that get the runtime consistently down to about 26 seconds. These changes involve replacing some library calls with hand-written, special-purpose Java code. I don’t usually recommend making such changes, as they result in programs that are more complicated, less maintainable, and more error-prone. That’s why I’m not showing them. The last variation represents, I think, the “sweet spot” with the best tradeoff between code quality and performance. It is often possible, though, to make programs go faster at the expense of making them more complicated.

With this article, I hope I’ve been able to illustrate several programming techniques and APIs that anybody can use to speed up their code and improve its quality, and to help people improve their Java development skills.

Read Full Post »


Devoxx US 2017 took place March 21-23 of this year, and I’m only now getting around to posting an article about it.

The conference was in the San Jose McEnery Convention Center, which is quite a convenient venue for me. It’s only a little bit farther from home than my office. The session rooms and exhibition space were pretty nice too.

Unfortunately, the attendance seemed fairly light, which might have had something to do with the postponement of the next Devoxx US until 2019, skipping 2018.

An uncrowded conference meant there was more time for conversations with other speakers and other conference attendees. This was really great. I remember one conversation in particular with Trisha Gee where we had time to talk about nulls and Optional in detail. Some of the ideas from this conversation wound up in an article Code Smells: Null that she wrote recently.

As is typical, I had several sessions at the conference.

Ten Simple Rules for Writing Great Test Cases
– conference session with Steve Poole | slides | video

This is a somewhat refreshed and updated version of the BOF Ten Things You Should Know When Writing Good Unit Test Cases in Java that Paul Thwaite (Steve’s colleague at IBM) and I had at JavaOne 2013. We didn’t actually update it all that much; I think most of the advice here is quite broadly applicable and doesn’t go obsolete. Actually, we did update it – “now with added cloud.”

Streams in JDK 8: The Good, The Bad, and the Ugly
– BOF with Simon Ritter | slides

This was a reprise of the BOF that Simon gave at Devoxx BE 2016 where he pulled me up front and asked me to provide some extemporaneous commentary. This worked so well that we decided to have me as an official co-speaker for the BOF this time.

Collections Refueled – conference session | slides | video

This is my talk about the new stuff in the Collections Framework in Java 8 and 9. Unfortunately, I didn’t prepare for this very well, and I had 60 minutes of material but only 45 minutes to present it. I ended up having to skip a bunch of the Java 9 material towards the end. (My JavaOne 2016 version of this talk is probably better.)

Optional: The Mother of all Bikesheds – conference session | slides | video

I’m happy to say that this was the second-highest rated talk at Devoxx US, according to the ratings shown by the Java Posse during the closing keynote:

JavaPosse-TopRatedTalks

Hm, these are Devoxx alternative facts, so maybe they’re alternative ratings as well.

There is a YouTube playlist of all Devoxx US 2017 sessions, so if you missed anything you can always go back and replay it.

Read Full Post »

This evening, I presented Collections Refueled at the Silicon Valley JUG. Thanks to the JUG for having me, and to the attendees for all the interesting questions!

Here are the slides for my presentation: CollectionsRefueled.pdf


Read Full Post »

The first segment of Episode 23 of the Java Off-Heap podcast covered the deprecation of Object.finalize in Java 9 and deprecation and finalization in general. Deprecation is a subject near and dear to my heart. The hosts even mentioned me by name. Thanks for the shout-out, guys!

I wanted to clarify a few points and to answer some of the questions that weren’t resolved in that segment of the show.

Java Finalizers vs. C++ Destructors

The role of Java’s finalizers differs from that of C++ destructors. In C++ (prior to the introduction of mechanisms like shared_ptr), anytime you created something with new in a constructor, you were required to call delete on it in the destructor. People mistakenly carried this thinking over to Java, and they thought that it was necessary to write finalize methods to null out references to other objects. (This was never necessary, and fortunately the practice seems to have died out long ago.) In Java, the garbage collector cleans up anything that resides on the heap, so it’s rarely necessary to write a finalizer.

Finalizers are useful if an object creates resources that aren’t managed by the garbage collector. Examples of this are things like file descriptors or natively allocated (“off-heap”) memory. The garbage collector doesn’t clean these up, so something else has to. In the early days of Java, finalization was the only mechanism available for cleaning up non-heap resources.

Phantom References

The point of finalization is that it allows one last chance at cleanup after an object becomes unreachable, but before it’s actually collected. One of the problems with finalization is that it allows “resurrection” of an object. When an object’s finalize method is called, it has a reference to this — the object about to be collected. It can hook the this reference back into the object graph, preventing the object from being collected. As a result, the object can’t simply be collected after the finalize method returns. Instead, the garbage collector must run again in order to determine whether the object is truly unreachable and can therefore be collected.

The reference package java.lang.ref was introduced all the way back in JDK 1.2. This package includes several different reference types, including PhantomReference. The salient feature of PhantomReference is that it doesn’t allow the object to be “resurrected.” It does this by making the contained reference inaccessible. A holder of a phantom reference gets notified that the referent has become unreachable (strictly speaking, phantom-reachable) but there’s no way to get the referent out and hook it back into the object graph. This makes the garbage collector’s job easier.
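The no-resurrection property is easy to see concretely: unlike the other reference types, PhantomReference.get() always returns null, even while the referent is still strongly reachable. A tiny illustration:

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

public class PhantomDemo {
    public static void main(String[] args) {
        Object resource = new Object();          // strongly reachable referent
        ReferenceQueue<Object> queue = new ReferenceQueue<>();
        PhantomReference<Object> ref = new PhantomReference<>(resource, queue);

        // get() ALWAYS returns null for a PhantomReference, so the referent
        // can never be hooked back into the object graph through it.
        System.out.println(ref.get());  // prints "null"

        // Explicitly clearing the reference tells the collector that no
        // further reference processing is needed for this object.
        ref.clear();
    }
}
```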

Another advantage of a PhantomReference is that, like the other reference types, it can be cleared explicitly. Suppose there’s an object that holds some external resource like a file descriptor. Typically, such objects have a close method the application should call in order to release the descriptor. Prior to the introduction of the reference types, such objects also needed a finalize method in order to clean up if the application had failed to call close. The problem is, even if the application has called close, the collector needs to do finalization processing and then run again, as described above, in order to collect the object.

PhantomReference and the other reference types have a clear method that explicitly clears the contained reference. An object that has released its native resources via an explicit call to a close method would call PhantomReference.clear. This avoids a subsequent reference processing step, allowing the object to be collected immediately when it becomes unreachable.

Why Deprecate Object.finalize Now?

A couple of things have changed. First, JEP 277 has clarified the meaning of deprecation in Java 9 so that it doesn’t imply that the API will be removed unless forRemoval=true is specified. The deprecation of Object.finalize is an “ordinary” deprecation in that it’s not being deprecated for removal. (At least, not yet.)

A second thing that’s changed in Java 9 is the introduction of a class, java.lang.ref.Cleaner. Reference processing is often fairly subtle, and there’s a lot of work to be done to create a reference queue and a thread to process references from that queue. Cleaner is basically a wrapper around ReferenceQueue and PhantomReference that makes reference handling easier.
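Here’s a minimal sketch of the usual Cleaner pattern with an explicit close method. The Resource class is hypothetical, and its “native state” is simulated with a flag so the cleanup action is observable. Note that the cleanup action must not reference the Resource instance itself, or the instance could never become unreachable.

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

public class CleanerDemo {
    static final Cleaner CLEANER = Cleaner.create();

    static class Resource implements AutoCloseable {
        // The cleanup action: holds only the state to release, never `this`.
        private static class State implements Runnable {
            final AtomicBoolean released;
            State(AtomicBoolean released) { this.released = released; }
            @Override public void run() { released.set(true); }  // "free" it
        }

        final AtomicBoolean released = new AtomicBoolean(false);
        private final Cleaner.Cleanable cleanable;

        Resource() {
            cleanable = CLEANER.register(this, new State(released));
        }

        // An explicit close runs the cleanup action at most once; if close
        // is never called, the Cleaner runs it after the object becomes
        // unreachable.
        @Override
        public void close() {
            cleanable.clean();
        }
    }

    public static void main(String[] args) {
        Resource r = new Resource();
        r.close();
        System.out.println("released: " + r.released.get());
    }
}
```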

What hasn’t changed is that for years, it’s been part of Java lore that using finalization is discouraged. It’s time to make a formal declaration, and the way to do this is to deprecate it.

Has Anything Ever Been Removed from Java SE?

The podcast episode mentioned a Quora answer by Cameron Purdy written in 2014, where he said that nothing had ever been removed from Java. When he wrote it, the statement was correct. Various features of the JDK had been removed (such as apt, the annotation processing tool), but public APIs had never been removed.

However, the following six APIs were deprecated in Java SE 8, and they have been removed from Java SE 9:

  1. java.util.jar.Pack200.Packer.addPropertyChangeListener
  2. java.util.jar.Pack200.Unpacker.addPropertyChangeListener
  3. java.util.logging.LogManager.addPropertyChangeListener
  4. java.util.jar.Pack200.Packer.removePropertyChangeListener
  5. java.util.jar.Pack200.Unpacker.removePropertyChangeListener
  6. java.util.logging.LogManager.removePropertyChangeListener

In addition, in Java SE 9, about 20 methods and six modules have been deprecated with forRemoval=true, indicating our intent to remove them from the next major Java SE release. Some of the classes and methods to be removed include:

  • java.lang.Compiler
  • Thread.destroy
  • System.runFinalizersOnExit
  • Thread.stop(Throwable)

The modules deprecated for removal are the following:

  1. java.activation
  2. java.corba
  3. java.transaction
  4. java.xml.bind
  5. java.xml.ws
  6. java.xml.ws.annotation

So yes, we are getting serious about removing stuff!

Will Finalization Be Removed?

As mentioned earlier, Object.finalize is not being deprecated for removal at this time. As such, its deprecation is merely a recommendation that developers consider migrating to alternative cleanup mechanisms. The recommended replacements are PhantomReference and the new Cleaner class.

That said, we do eventually want to get rid of finalization. It adds extra complexity to the garbage collector, and there are recurring cases where it causes performance problems.

Before we can get rid of it, though, we need to remove uses of it from the JDK. That’s more than just removing the overrides of finalize and rewriting the code to use Cleaner instead. The problem is that there are some public API classes in the JDK that override finalize and specify its behavior. In turn, their subclasses might override finalize and rely on the existing behavior of super.finalize(). Removing the finalize method would expose these subclasses to a potentially incompatible behavior change. This will need to be investigated carefully.

There might also be a transition period where calling of the finalize method is controlled by a command-line option. This would allow testing of applications to see if they can cope without finalization. Only after a transition period would we consider removing the finalization mechanism entirely. We might even leave the finalize method declaration around for binary compatibility purposes, even after the mechanism for calling it has been removed.

As you can see, removing finalization would require a long transition period spanning several JDK releases, taking several years. That’s all the more reason to start with deprecation now.

Read Full Post »

Now that Devoxx US is imminent, it’s about time for me to post about Devoxx BE 2016, which took place in November 2016 in Antwerp. That was several months ago, which was ages in conference time, so this post is mainly a placeholder to host slides and links to the videos.

Array Linked to a List, the Full Story! – José Paumard (video)

I was surprised to find that I was mentioned by name in the abstract for this university session. José Paumard took a tweet of mine from a year earlier (actually one by my alter ego, Dr Deprecator) and turned it into an entire university session. José was happy to have me attend the session, and he was gracious enough to invite me on stage for a few comments and questions.

Streams in JDK 8: The Good, The Bad and the Ugly – Simon Ritter (BOF)

This was another of my impromptu appearances. Simon had submitted this session, and he asked me to join him in presenting it. I said that I wasn’t sure what I would speak about. He said to me, “I’ll put up a slide and say a few words about it. I’m sure you’ll have an opinion.” (He was right.) This was a BOF, so it was pretty informal, but Simon came up with some really interesting examples, and we had a good discussion and covered a lot of issues.

Simon and I will be repeating this BOF at Devoxx US this coming week.

Ask the JDK Architects – panel session (video)

This was a panel session featuring Mark Reinhold and Brian Goetz (the actual JDK architects) along with Alan Bateman and myself (JDK Core Libraries engineers). This session consisted entirely of answering questions from the audience.

Optional: The Mother of All Bikesheds – conference session (slides, video)

This was a conference session about a single Java 8 API, java.util.Optional. Some were skeptical that I could talk for an entire hour about a single API. I proved them wrong. Credit for the title goes to my übermanager at Oracle, Jeannette Hung. It refers to the many protracted mailing list discussions (“centithreads”) about the design of Optional.

Thinking in Parallel – joint conference session with Brian Goetz (slides, video)

This was an amazing experience because the auditorium was so full that people were sitting on the steps. Brian Goetz was the big draw here, but I also think it was packed because there were fewer sessions running at the same time.

* * *

I was pleased to learn that both of my conference sessions were in the top 20 talks for the conference. Thanks for your support!

Read Full Post »

Older Posts »