I’m watching the latest JEP Café video from my colleague José Paumard, where he talks about the Comparator interface. One of the things you can do with a Comparator is to use it to sort a list:
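The snippet from the video isn’t reproduced here, but it’s along these lines (a minimal sketch; the data and the length-based Comparator are illustrative):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SortDemo {
    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("pear", "fig", "apple"));

        // Sort by length, using a Comparator built from a key extractor.
        words.sort(Comparator.comparing(String::length));
        System.out.println(words); // [fig, pear, apple]
    }
}
```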


If a class implements the Comparable interface, that means instances of that class know how to compare themselves. (The “comparison of themselves” is referred to as the natural order.) If the list contains elements that are Comparable, you don’t need to pass a Comparator argument to the List.sort() method. Instead, you pass null:
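For instance (a sketch with illustrative data; String implements Comparable, so strings sort alphabetically by natural order):

```java
import java.util.ArrayList;
import java.util.List;

class NaturalOrderDemo {
    public static void main(String[] args) {
        List<String> words = new ArrayList<>(List.of("pear", "fig", "apple"));

        // Passing null means: sort by the elements' natural order.
        words.sort(null);
        System.out.println(words); // [apple, fig, pear]
    }
}
```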


and the list will be sorted according to the elements’ natural order. José quite reasonably observes that it’s somewhat unpleasant to have to pass “null” there, and he suggests that it would be cleaner to have a no-arg overload that sorts the list in natural order.

Yeah, we should add that!

After all, there are two overloads of the Stream.sorted() method: a no-arg overload that sorts by natural order, and an overload that takes a Comparator argument. We should clearly do the same for List.sort(). Maybe its omission was just an oversight. On the other hand, it’s such an obvious thing; maybe the omission was deliberate.
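For reference, the two Stream.sorted() overloads look like this in use (illustrative data):

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Stream;

class StreamSortedDemo {
    public static void main(String[] args) {
        // No-arg overload: sorts by natural order.
        List<String> natural = Stream.of("pear", "fig", "apple")
                .sorted()
                .toList();
        System.out.println(natural); // [apple, fig, pear]

        // Comparator overload: sorts by a caller-supplied order.
        List<String> byLength = Stream.of("pear", "fig", "apple")
                .sorted(Comparator.comparing(String::length))
                .toList();
        System.out.println(byLength); // [fig, pear, apple]
    }
}
```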

What about compatibility?

The default methods feature was added in Java 8. This allowed the addition of new methods to interfaces. Prior to Java 8, adding a method to an interface was an incompatible change. With default methods, it’s possible to add a new method in a compatible way. However, incompatibility is still possible when adding a default method. For example, consider the List.sort(Comparator) method again. Its return type is void. Suppose there is a Java 7 application that has this List implementation:

class MyList<E> implements List<E> {
    public MyList<E> sort(Comparator<? super E> comparator) {
        // sort the list
        return this;
    }
    // ... other List methods ...
}
If this class were recompiled on Java 8, an error would occur, because Java doesn’t allow overrides with a differing return type.

(Java does allow covariant return types in an overriding method, where the return type of the override is a subtype of the return type of the overridden method. That doesn’t apply here, because MyList<E> isn’t a subtype of void.)
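To illustrate covariant returns (hypothetical classes):

```java
class Animal {
    Animal self() { return this; }
}

class Dog extends Animal {
    // Allowed: Dog is a subtype of Animal, so this override is covariant.
    @Override
    Dog self() { return this; }
}

class CovariantDemo {
    public static void main(String[] args) {
        Dog d = new Dog().self(); // no cast needed, thanks to the covariant return
        System.out.println(d.getClass().getSimpleName()); // Dog
    }
}
```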

Adding a default method to an interface can be compatible, but it might be incompatible if there is a conflict with a method in an existing class. It’s therefore necessary to be quite careful when adding default methods to interfaces. It’s even more important to be careful if the interface is widely implemented (like List), if the method name is short and common (like sort), and if it has few or no arguments. Thus, it seems likely that adding a no-arg List.sort() default method would conflict with a method in some existing List implementation. Maybe adding this default method isn’t such a good idea after all.

This line of reasoning seems valuable. Maybe I should write it down somewhere!

This issue is fairly subtle, and it’s worth writing down so that somebody in the future doesn’t make a mistake. A blog post (like this one) is one way to preserve this information. However, this blog isn’t connected to the OpenJDK project, so somebody working on the JDK wouldn’t know to search here. Someplace closer to the JDK would be preferable.

Another place to store this information is the JDK Bug System (JBS), which is the bug database for the JDK. It contains a lot of history, including bug reports converted from the old Sun bug database dating back to the pre-JDK-1.0 era in the 1990s. It seems likely that information in JBS will persist longer than this blog. Since JBS is associated with the JDK project, it’s also more likely that somebody working on the JDK will find it. Plus, JBS is a database, with nice categories and querying capabilities, making it easy to find information.

How should this kind of information be recorded in a bug database? I could file a request to add this API, close it out, and put the rationale for not implementing it into the comments. Filing a request and closing it immediately might seem excessively fussy. Once it’s in the database, though, it would be easy for future maintainers to rediscover the request if a similar issue were to arise.

Before filing a new issue, it’s always good practice to search the database to see if something similar exists already. Indeed, upon searching, I found this:

JDK-8078076 Create Overload List#sort() for Natural Ordering Case

Huh. That seems like it covers exactly the same issue. It was submitted in April 2015 by “Webbug Group”, which is the JBS username that’s used when a bug is received from an anonymous person on the internet. The bug’s status is Closed, and its resolution value is Won’t Fix. Who did that and why? Looking through the bug report, the last comment (also from April 2015) is this one:

This was considered and rejected during JDK 8 development. We were fairly minimal with the addition of default methods. There have already been incompatibilities with the List.sort(Comparator) method; adding a no-arg List.sort() method would likely cause additional incompatibilities while adding very little value. Closing as Won’t Fix.

This comment was written by … me! Wow, I had completely forgotten about this. Not surprisingly, it turns out that the bug database has a better memory than I do. I just went through this line of reasoning and reached a conclusion. Then I found the same line of reasoning and the same conclusion that I had written down nearly eight years earlier. Fortunately, present me agrees with past me.

Suppose that I had watched José’s video and immediately decided to implement the new default method. Every change to the JDK requires a JBS entry, so I would have started by searching for an existing issue or filing a new one. It seems likely I would have run across the 2015 issue at that time. (There are only 14 collections bugs in the database that have both “list” and “sort” in the title.) Even if I had missed it, one of the reviewers of the change probably would have noticed the 2015 bug and called my attention to it. Either way, it’s clear that writing down the reasoning in 2015 is valuable to a future maintainer in 2023, whether that maintainer is me or somebody else. And it seems likely that having this issue in the database, along with other similar issues, will be of value to future maintainers.

Anyway, sorry about that José, that’s why we won’t be adding a no-arg List.sort() overload.

Now that JEP 421 (Deprecate Finalization for Removal) has been delivered in JDK 18, it seems like more people are talking about finalization and how to migrate to alternatives such as Cleaner. I had an interesting Twitter conversation about this with Heinz Kabutz the other day:

The code from SunGraphics2D that Heinz pointed out is this:

public void finalize() {
    // DO NOT REMOVE THIS METHOD
}

Why did somebody bother to write an empty finalize() method, and why is it so important that there is a comment warning not to remove it?

The answer is that an empty finalizer disables finalization for all instances of that class and for all instances of subclasses (unless overridden by a subclass). Depending on the usage of that class, this can be a significant optimization.

To understand this, let’s recap the Java object life cycle.

A Java object without a finalizer is created, is used for a while, and eventually becomes unreachable. Some time later, the garbage collector notices that the object is unreachable and garbage collects it.

An object with a finalizer is created, is used for a while, and eventually becomes unreachable. Some time later, the object’s finalize() method is run. This is regular Java code, so the object is now actually reachable. Some additional time later, the object becomes unreachable again, and this time, the garbage collector collects the object. Thus, objects with finalizers live longer than objects without finalizers, and the garbage collector needs to do more work to garbage collect them. Using a lot of objects with finalizers increases memory pressure and potentially increases the memory requirements of the system.

Why would you need to disable finalization for some objects?

Let’s look at the case that Heinz pointed out. Instances of java.awt.Graphics (actually, its subclasses) keep a pointer to native resources used by that object. The dispose() method frees those native resources. It also has a finalizer that calls dispose() as a “safety net” in case the program didn’t call dispose(). Note that when a Graphics object becomes unreachable, it’s kept around in order for it to be finalized, even if the program had already called dispose().

The SunGraphics2D subclass is a “lightweight” object that never has any associated native resources. If it were to inherit the finalizer from Graphics, instances would need to be kept around longer in order to run the finalizers, which would call dispose(), which would do nothing. To prevent this, SunGraphics2D provides an empty finalize() method. An empty method has no visible side effects; therefore, it’s pointless for the JVM to extend the lifetime of an object in order to run an empty finalize() method. Instead, the JVM garbage collects such objects as soon as it can determine they are unreachable, skipping their finalization step.

Let’s see this in action. It’s pretty easy to tell when an object is finalized by putting a print statement into its finalizer. But how can we tell whether an object with an empty finalizer was actually finalized or whether it was garbage collected immediately? This is fairly simple to do, using a new JFR event added in JDK 18.

Here’s a program with a small class hierarchy. Class A has a finalizer; B inherits it; C overrides with an empty finalizer; D inherits the empty finalizer; and E overrides with a non-empty finalizer. (I’ve made them static classes nested inside a top-level class EmptyFinalizer so they’re all in one file, but otherwise this doesn’t affect finalization. See the full program.)

    static class A {
        protected void finalize() {
            System.out.println(this + " was finalized");
        }
    }

    static class B extends A { }

    static class C extends B {
        protected void finalize() { }
    }

    static class D extends C { }

    static class E extends D {
        protected void finalize() {
            System.out.println(this + " was finalized");
        }
    }
The main program creates a bunch of instances but doesn’t keep references to them. It calls System.gc() a few times and sleeps to let the garbage collector run. The output is something like the following:

$ java EmptyFinalizer
EmptyFinalizer$E@cd4e940 was finalized
EmptyFinalizer$B@8eb6c02 was finalized
EmptyFinalizer$A@4de9e37b was finalized
EmptyFinalizer$E@57db5523 was finalized
EmptyFinalizer$B@7cee2871 was finalized
EmptyFinalizer$A@2f36c092 was finalized
EmptyFinalizer$E@2dc61c34 was finalized
EmptyFinalizer$B@203936e2 was finalized
EmptyFinalizer$A@2d193f34 was finalized
EmptyFinalizer$E@34324855 was finalized
EmptyFinalizer$B@2988c55b was finalized
EmptyFinalizer$A@40ef68ae was finalized
EmptyFinalizer$E@246b0f18 was finalized
EmptyFinalizer$B@23d8b20 was finalized
EmptyFinalizer$A@6df02421 was finalized

We can see that instances of A, B, and E were finalized, but C and D were not. Well, we can’t really tell, can we? Their empty finalizers might have been called. Starting in JDK 18, we can use JFR to determine whether these objects were finalized. First, enable JFR during the run:

$ java -XX:StartFlightRecording:filename=recording.jfr EmptyFinalizer
[0.365s][info][jfr,startup] Started recording 1. No limit specified, using maxsize=250MB as default.
[0.365s][info][jfr,startup] Use jcmd 56793 JFR.dump name=1 to copy recording data to file.
EmptyFinalizer$A@cd4e940 was finalized
EmptyFinalizer$E@8eb6c02 was finalized
EmptyFinalizer$B@4de9e37b was finalized
EmptyFinalizer$A@57db5523 was finalized
EmptyFinalizer$E@7cee2871 was finalized
EmptyFinalizer$B@2f36c092 was finalized
EmptyFinalizer$A@2dc61c34 was finalized
EmptyFinalizer$E@203936e2 was finalized
EmptyFinalizer$B@2d193f34 was finalized
EmptyFinalizer$E@34324855 was finalized
EmptyFinalizer$B@2988c55b was finalized
EmptyFinalizer$A@40ef68ae was finalized
EmptyFinalizer$E@246b0f18 was finalized
EmptyFinalizer$B@23d8b20 was finalized
EmptyFinalizer$A@6df02421 was finalized

Now we have a file recording.jfr with a bunch of events. Next, we print this file in a readable form with the following command:

$ jfr print --events FinalizerStatistics recording.jfr
jdk.FinalizerStatistics {
  startTime = 16:43:37.379 (2022-04-27)
  finalizableClass = EmptyFinalizer$A (classLoader = app)
  codeSource = "file:///private/tmp/"
  objects = 0
  totalFinalizersRun = 5
}

jdk.FinalizerStatistics {
  startTime = 16:43:37.379 (2022-04-27)
  finalizableClass = EmptyFinalizer$B (classLoader = app)
  codeSource = "file:///private/tmp/"
  objects = 0
  totalFinalizersRun = 5
}

jdk.FinalizerStatistics {
  startTime = 16:43:37.379 (2022-04-27)
  finalizableClass = jdk.jfr.internal.RepositoryChunk (classLoader = bootstrap)
  codeSource = N/A
  objects = 1
  totalFinalizersRun = 0
}

jdk.FinalizerStatistics {
  startTime = 16:43:37.379 (2022-04-27)
  finalizableClass = EmptyFinalizer$E (classLoader = app)
  codeSource = "file:///private/tmp/"
  objects = 0
  totalFinalizersRun = 5
}

We can easily see that classes A, B, and E each had five instances finalized, with zero instances remaining on the heap. Classes C and D aren’t listed, so no finalization was performed for them. Also, it looks like the JFR internal class RepositoryChunk uses a finalizer, and there was one live instance, and none were finalized. (We’ll have to get the JFR team to convert this class to use Cleaner instead!)

JEP 421 has deprecated finalization for removal. Eventually it will be disabled and removed from the JDK. If your system uses finalizers — or, perhaps more crucially, if you don’t know whether your system uses finalizers — use JFR to help find out. See the JDK Flight Recorder documentation for more information about JFR.
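As a rough sketch of what migrating a dispose()-style safety net to Cleaner can look like (the NativeResource class and its AtomicBoolean flag are illustrative, not from the JDK), the key rule is that the cleanup state must not hold a reference back to the object being tracked:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

class NativeResource implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must NOT capture the NativeResource itself,
    // or the object could never become unreachable.
    private static final class State implements Runnable {
        final AtomicBoolean released = new AtomicBoolean();
        @Override
        public void run() {
            released.set(true); // free the native resource here
        }
    }

    private final State state = new State();
    private final Cleaner.Cleanable cleanable;

    NativeResource() {
        // If close() is never called, the Cleaner runs State after
        // this object becomes unreachable -- the "safety net".
        this.cleanable = CLEANER.register(this, state);
    }

    @Override
    public void close() {
        cleanable.clean(); // runs State at most once, immediately
    }

    boolean isReleased() {
        return state.released.get();
    }

    public static void main(String[] args) {
        NativeResource r = new NativeResource();
        System.out.println(r.isReleased()); // false: close() not yet called
        r.close();
        System.out.println(r.isReleased()); // true: cleanup ran deterministically
    }
}
```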

(Updated with suggestions from Kim Barrett. Thanks, Kim!)

A new default method CharSequence.isEmpty() was added in the just-released JDK 15. This broke the Eclipse Collections project. Fortunately, the EC developers were testing the JDK 15 early access builds. They noticed the incompatibility, and they were able to ship a patch release (Eclipse Collections 10.4.0) before JDK 15 shipped. They also reported this to the OpenJDK Quality Outreach program. As a result, we were able to document this change in a release note for JDK 15.

Kudos to Nikhil Nanivadekar and Don Raab and the Eclipse Collections team for getting on top of this issue!

What’s the story here? Aren’t new JDK releases supposed to be compatible? In general, yes, we try really hard to keep everything compatible. But sometimes incompatibilities are unavoidable, and sometimes we just miss stuff. To understand what happened, we need to discuss two distinct concepts: source incompatibility and binary incompatibility.

A source incompatible change is one where a source file compiles just fine on an earlier JDK release but fails to compile on a more recent JDK release. A binary incompatible change is one where a compiled class file runs fine on an earlier JDK release but fails at runtime on a more recent JDK release.

In development of the JDK, we put in quite a bit of effort to avoid binary incompatible changes, since it’s unreasonable to force people to recompile everything, and potentially maintain different artifacts, for different JDK releases. Ideally, we’d like to enable people to provide a single binary artifact (e.g., a jar file) that runs on all of the JDK releases that their project supports.

We are somewhat more tolerant of source incompatible changes. If you’re recompiling something, then presumably you have access to the source code in order to make a few minor adjustments. We’re willing to make minor source incompatible changes to the JDK if the change provides enough value to justify the incompatibility.

It turns out that adding a default method to an interface is potentially both a source and binary incompatible change. I was a bit surprised by this. What’s going on?

Let’s first set aside default methods on interfaces and look just at adding methods to classes. Making changes to a class potentially affects subclasses. In most cases, adding a method to a class is a binary compatible change, even if the subclass has methods that are apparently in conflict with the new method in the superclass. For example, consider this class compiled on JDK 8:

class MyInputStream extends InputStream {
    public String readAllBytes() { ... }
    // ...
}

This works fine. However, a method was added to InputStream on JDK 9:

public byte[] readAllBytes()

Now there is a conflict between InputStream and MyInputStream, since they have methods with the same name, the same parameters (none), but different return types. Despite this conflict, this is a binary compatible change. Any already-compiled classes that invoke the readAllBytes() method on an instance of MyInputStream will do so using this bytecode:

invokevirtual #6 // Method MyInputStream.readAllBytes:()Ljava/lang/String;

(I determined this by compiling a program that uses MyInputStream on JDK 8, and then running the javap -c command on the resulting class file.) Roughly, this says “invoke the method named «readAllBytes» that takes no arguments and returns a String.” That method exists on MyInputStream and not on InputStream, so the method invocation works even on JDK 9.

However, this is a source incompatible change. When I try to recompile MyInputStream.java on JDK 9, the result is this:

MyInputStream.java:13: error: readAllBytes() in MyInputStream cannot override readAllBytes() in InputStream
    public String readAllBytes() {
  return type String is not compatible with byte[]

The compatibility analysis of adding methods to classes is fairly straightforward. There is only one path from the current class up the superclass chain to the root class, java.lang.Object. Any conflicts among methods can only occur on this path.

Analysis of adding default methods to interfaces is more complicated, because a class or interface can inherit from multiple interfaces. This means that, looking upward from the current class, instead of there being a linear chain of superclasses up to Object, there is a branching tree (actually a DAG) of interface inheritance. This gives rise to several inheritance possibilities that cannot occur with class-only inheritance.

Also, since default methods are a relatively recent feature, the Java community has less experience evolving APIs that use them. Default methods were added in Java 8, which was released in 2014, so we have “only” six years of experience with them.

It was possible to have conflicts among interfaces, even before Java 8, for example, if two unrelated interfaces declared the same method but with different return types. Prior to Java 8, though, interfaces were essentially impossible to evolve, and so having such conflicts arise from interface evolution hardly occurred. Finally, in the pre-Java 8 world, interface methods were all abstract. If a class inherited the “same” method (same name, parameters, and return type) from different interfaces, that was OK, as both could be satisfied by a single implementation provided by the class or one of its superclasses.
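For instance (hypothetical names), a single concrete method can satisfy the “same” abstract method declared by two unrelated interfaces:

```java
interface Named { String name(); }
interface Labeled { String name(); } // same signature, unrelated interface

// One concrete method satisfies both abstract declarations.
class Widget implements Named, Labeled {
    @Override
    public String name() { return "widget"; }

    public static void main(String[] args) {
        Named n = new Widget();
        Labeled l = new Widget();
        System.out.println(n.name() + " " + l.name()); // widget widget
    }
}
```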

With the addition of default methods in Java 8, a new problem arose: what if a default method were added to an interface somewhere, such that conflicts between method implementations might arise somewhere in the superclass and superinterface graph? More specifically, what if the superinterface graph contains two default implementations for the same method? The full rules are described in the Java Language Specification, section 8.4.8, and there are lots of edge cases, but briefly, the rules are as follows:

  • Methods inherited from the class hierarchy take precedence over default methods inherited from interfaces.
  • Default methods in interfaces are allowed to override each other; the most specific override takes precedence.
  • If multiple default methods are inherited from unrelated interfaces (that is, one doesn’t override the others), that’s a compile-time error.

Here are some examples of these rules in action:

class S {
    public void foo() { ... }
}

interface I {
    default void foo() { ... }
}

interface J extends I {
    default void foo() { ... }
}

interface K {
    default void foo() { ... }
}

Given this class and these interfaces, how do the inheritance rules work?

class C extends S implements I { }
// ok: class wins, S::foo inherited

class D implements I, J { }
// ok: overriding default method wins, J::foo inherited

class E implements I, K { }
// ERROR: types I and K are incompatible;
// class E inherits unrelated defaults for foo() from types I and K

So now we have to think harder about the compatibility impact of adding a default method. If a class already has the method, we’re OK. If there’s another interface that has a default method that overrides or is overridden by the default method we’re adding, that’s OK too. A problem can only occur if there is another default method somewhere in the interface graph inherited by some class.

That’s what’s going on with source compatibility. If you run through the examples above, you can see the kind of compilation error that might arise. What about binary compatibility? It turns out that the rules for binary compatibility with default methods are actually quite similar to those for source compatibility.

Here’s what the Java Virtual Machine Specification says about how invokevirtual finds the method to call. It first talks about method selection:

A method is selected with respect to [the class] and the resolved method (§5.4.6).

Section 5.4.6 says:

The maximally-specific superinterface methods of [the receiver class] are determined. If exactly one matches [the method]’s name and descriptor and is not abstract, then it is the selected method.

OK, what if there isn’t exactly one match? In particular, what if there are multiple matches? Back in the specification of invokevirtual, it says:

If no method is selected, and there are multiple maximally-specific superinterface methods of [the class] that match the resolved method’s name and descriptor and are not abstract, invokevirtual throws an IncompatibleClassChangeError.

Thus, the JVM has to do quite a bit of analysis at runtime. When a method is invoked on some class, it has to not only search for that method up the class hierarchy. It also has to search the graph of interface inheritance to see if a default method might have been inherited, and that there is exactly one such method. Thus, adding a default method to an interface can easily cause problems for existing, compiled classes — a binary incompatibility.

We always examine the JDK for incompatibilities and avoid them if possible. In addition, we look at popular non-JDK libraries to see if problems might occur with them. This kind of incompatibility can occur only if a non-JDK library has a signature-compatible default method in an interface that is unrelated to the JDK interface being modified. It also requires that there be some class that inherits both that interface and the JDK interface. That seems pretty rare, but it can happen.

In fact, this is exactly the case that came up in Eclipse Collections! The Eclipse Collections library has an interface PrimitiveIterable that declares a default method isEmpty(), and it also has a class CharAdapter that implements PrimitiveIterable and CharSequence:

interface PrimitiveIterable {
    default boolean isEmpty() { ... }
}

class CharAdapter implements PrimitiveIterable, CharSequence {
    // ...
}

This works perfectly fine in JDK 14 and earlier releases. Consider some code that calls CharAdapter.isEmpty(). The bytecode generated would be as follows:

invokevirtual #13 // Method org/eclipse/collections/impl/string/immutable/CharAdapter.isEmpty:()Z

This works on JDK 14, because invokevirtual searches all the superclasses and superinterfaces of CharAdapter, and it finds exactly one default method: the one in PrimitiveIterable.

On JDK 15, the situation is different. A new default method isEmpty() was added to CharSequence. Thus, when the same invokevirtual bytecode is executed, it searches the superclasses and superinterfaces of CharAdapter, but this time it finds two matching default methods: the one in PrimitiveIterable and the one in CharSequence. That’s an error according to the JVM Specification, and that’s exactly what happens:

java.lang.IncompatibleClassChangeError: Conflicting default methods: org/eclipse/collections/api/PrimitiveIterable.isEmpty java/lang/CharSequence.isEmpty

What’s to be done about this? Fortunately, the fix is pretty simple: just add an implementation of isEmpty() to the CharAdapter class. (A couple of other classes, CodePointAdapter and CodePointList, are in a similar situation and were also fixed.) In this case the implementations of isEmpty() are so simple that the code this.length() == 0 was just inlined. If for some reason it were necessary to have CharAdapter inherit the implementation from PrimitiveIterable, then the implementation in CharAdapter could have been written like this:

public boolean isEmpty() {
    return PrimitiveIterable.super.isEmpty();
}

As mentioned above, this fix was delivered in Eclipse Collections 10.4.0, which was delivered in time for JDK 15. Again, thanks to the EC team for their quick work on this.

OK, that’s how the JVM behaves. Why does the JVM behave this way? That is, why does it throw an exception (really, an Error) if it detects multiple default methods among the superinterfaces? Couldn’t it, for example, remember what method was called on JDK 14 (the one on PrimitiveIterable), and then continue to call that method even on JDK 15?

The explanation requires understanding of some background about virtual methods. Consider a simple class hierarchy in a library:

class A { }

class B extends A {
    void m() { }
}

class C extends B { }

Suppose further that an application has this code:

void exampleCode(B b) {
    b.m();
}

What method is called? Clearly, this will invoke B::m. Now suppose that the library is modified as follows:

class A {
    void m() { } // method "promoted" from B
}

class B extends A { }

class C extends B {
    void m() { } // a new overriding method
}

and the application is run again. Even though the code is invoking method m on B, we don’t know which method will actually be invoked. If the variable b is an instance of B, then A::m will be invoked. But if variable b is an instance of C, then C::m will be invoked.

The method that actually gets invoked depends on the class of the receiver object and the class hierarchy that has been loaded into this JVM. There is nothing written down anywhere that says that the application used to call B::m. In fact, it would be a mistake for something to be written down that causes B::m to continue to be invoked. When an overriding method is added to class C, calls that used to end up at B::m should now be calling C::m. That’s what we want virtual method calls to do.

It’s similar with superinterfaces (though more complicated of course). The JVM needs to do a search at runtime to determine what method to call. If it finds two default methods, such as PrimitiveIterable::isEmpty and CharSequence::isEmpty, there is no information to tell the JVM that the code used to call PrimitiveIterable::isEmpty and that the CharSequence::isEmpty method was added in the most recent release. All the JVM knows is that it’s been asked to invoke a method, it found two, and it has no further information about which to call. Therefore, the only thing it can do is throw an error.

Finally, could this problem have been avoided in the first place? The JDK team had done some analysis to determine whether adding CharSequence.isEmpty() would cause any incompatibilities. The analysis probably looked for no-arg methods with the same name but with a different return type. It might have looked for a method named isEmpty() with a non-public access level, another cause of incompatibilities. But these are both source incompatibilities. Or maybe the analysis missed Eclipse Collections entirely.

One thing that future analyses ought to look for is unrelated interfaces with a matching default method; that would have turned up PrimitiveIterable. By itself such a match isn’t a problem, but it runs the risk of binary incompatibility: it would cause a problem for any class that implements both interfaces. It turns out that CharAdapter (and related classes) do implement both, so that’s clearly a binary incompatibility.

Even if CharAdapter and friends didn’t exist (and even now, after they’ve been fixed), there is still a possibility that further incompatibilities exist. Consider some application class that happens to implement both PrimitiveIterable and CharSequence. That class might work perfectly fine with Eclipse Collections 10.3.0 and JDK 14. But it will fail with JDK 15. The problem will persist even if the application upgrades to Eclipse Collections 10.4.0, since the incompatibility is with the application class, not with CharAdapter and friends. So, that application will have to be fixed, too.

Now that we’ve described the problem and the possibility of incompatibilities, does it mean that it was a mistake to have added CharSequence.isEmpty()? Not necessarily. Even if we had noticed the incompatibility in Eclipse Collections prior to the addition of the isEmpty() default method, we might have gone ahead with it anyway. The criterion isn’t to avoid incompatibility at all costs. Instead, it’s whether the value of adding the new default method outweighs the cost and risk of incompatibility. That said, it would have been better to have noticed the incompatibility earlier and discussed it before proceeding, instead of putting an external project like Eclipse Collections into the position of having to fix something in response to a change in the JDK.

In summary, adding a default method to an interface can result in source and binary incompatibilities. The possibility of the source incompatibility is perhaps obvious, but the binary incompatibility is quite subtle. Both of these have been possibilities since Java 8 was delivered in 2014. But to my knowledge this is the first time that the addition of a default method has resulted in a binary incompatibility with a real project (as opposed to a theoretical exercise or a toy program). It behooves us to do a more rigorous search for potentially conflicting methods the next time we decide to add a default method to an interface in the JDK.

I was saddened to hear news of Bill Shannon’s recent passing. He joined Sun very early, as employee number 11, soon after Sun’s founding in 1982. As far as I know, he was the earliest Sun employee remaining at Oracle. He was an engineering leader already by the time I joined Sun in 1986. I had the privilege of working with him — and sometimes against him — on several occasions.

Back in the day, people at Sun would refer to each other by their Unix logins. (I was “smarks”, and to some extent, I still am.) To this day I think of Bill simply as “shannon”. The other day I tweeted a few memorable quotes from shannon. Each of them is backed by a funny story, which I’ll relate here. If you ever heard Bill speak, please imagine these spoken in his imperious baritone.

Sometime in the 1990s, Sun’s internal network was organized into domains that corresponded to the overall functional area in which one worked. The engineering groups were under “eng.sun.com”, the corporate management was under “corp.sun.com”, and so forth. We had email addresses that were tied to the domain name, so I was smarks@eng.sun.com. At some point it was decided that everything would be reorganized into geographic domains. I worked in the San Francisco Bay Area region, so the old domains would be replaced with the sfbay.sun.com domain. An announcement went out that described this change, and it said something like,

Please inform all of your contacts that your new email address will be login@sfbay.sun.com instead of the old login@eng.sun.com. The eng.sun.com email addresses will stop working in 90 days.

I thought, this is ridiculous. I’ve handed out countless business cards that have my eng.sun.com email address on them, and I can’t track down everybody I’ve ever given a business card to. I’ve written that email address on papers that have been published in conference proceedings, and those can’t be changed. I can’t be the only one with this problem, either. But, I thought naïvely, it should be pretty simple to set up an MX record (a DNS mail exchanger record) to handle email sent to addresses in the old eng.sun.com domain. I filed a ticket to request that, but it was summarily closed by the network administrators with some explanation like, “Mail forwarding is not possible.” Oh well, I guess I don’t know anything about running a corporate network with thousands of nodes, and I let it drop.

A couple days later, shannon sent mail to all of engineering, describing exactly the same problem I was concerned about. I replied to him, saying that I had requested an MX record be published, but the ticket had been closed. He said, “Yes, that’s what should be done. I’ll talk to the network administrators about it.”

A couple days later, he followed up with this:

    You're right, these people are idiots.

A project that shannon and I worked on together was a large joint development project with another company (which I won’t name, but whose initials are H.P.). Well, OtherCompany had a penchant for coming up with incredibly complex, fragile designs that tried to solve problems that didn’t really need solving.

In desktop systems, it’s pretty common to have a portion of a window that lets users edit text. This is usually implemented by a “text editor” widget provided by the window toolkit library, but created and managed by the application. Apparently this was unsatisfactory for OtherCompany, so they wanted to have a single, “daemon” process that managed all of the text editor widgets for every application on the desktop. At Sun we all thought this was a terrible idea, but OtherCompany wouldn’t let go of it.

At one point there was a conference call where shannon and others at Sun reviewed this design with OtherCompany. It went something like this.

shannon: Now let me get this straight. Instead of each
application owning its own text widgets, all the text
editing functions are centralized into a single process?

OtherCompany: Yes.

shannon: And instead of each application process handling
keyboard events for its text widgets, those events will be
handled by this centralized daemon process?

OtherCompany: Yes.

shannon: So all the text data that the user has entered will
be in this daemon process, not in the application?

OtherCompany: Yes.

shannon: And if this other process crashes, what happens
to that data?

OtherCompany: (discussion) All the text data is lost.

shannon: And if this daemon process hangs, then what will
happen to the applications on the desktop?

OtherCompany: (discussion) They will all hang.

shannon: ...

OtherCompany: ...

shannon: Do you see anything wrong with this architecture?

Bill made a big impression on me early on, well before I actually met him. I joined Sun in 1986, as an impressionable young engineer fresh out of school. Fairly early on I heard about some guy “shannon” who was a bigwig in the Systems group. I was in a separate group, the Windows group, so we didn’t interact.

Some time soon after I joined, shannon sent an email to all of engineering, with a policy statement. (This was before I started to save email compulsively, otherwise I would have dug up the original.) As I recall, it went something like this:

This is a statement on the Systems Group's policy for code
that is checked into SunOS. The policy is:

    * All code must conform to the Sun C Style Guide

Non-conforming code that is posted for review will be
rejected until it does conform.

Non-conforming code that is checked into the source base
will be backed out and will not be permitted to be checked
in until it does conform.

If you do not understand this policy, I will come to your
office and explain it until you do.

This only applied to SunOS code, not Windows code, so it didn’t affect my day-to-day work. But as a young engineer I found it to be hair-raising! The lesson I took from this was, you do not want to cross shannon.

It’s a lesson that served me well over the years. 🙂

Like Bill, I stayed on at Sun all the way up until the 2010 acquisition by Oracle, and we stayed at Oracle until the present day. We didn’t work together too closely in recent years, though we both worked on Java – he worked on Java EE, and I worked on Java ME and Java SE. We were even in the same building on Sun’s (later Oracle’s) Santa Clara campus for several years. It’s amazing that he’s been around nearby for literally my entire career. It’s a huge loss that he’s gone. Bye shannon, we’ll miss you.

Here are some links to other pages about Bill.

The other day on Twitter I said, “Scanner is a weird beast. I wouldn’t necessarily use it as a good example for anything.” The context was a discussion about classes that are both an Iterator and AutoCloseable. As it happens, Scanner is such an example. It’s an Iterator, because it allows iteration over a sequence of tokens, and it’s also AutoCloseable, because it might have an external resource (like a file) contained within it. I wouldn’t hold it up as an example of good object design, though. This article explains why.

Scanner has a pretty complicated API, but once you figure out how to use it, it’s incredibly useful. Its main issue is that it’s trying to do too many things at once. The good news is that you can use parts of the API for stylized uses and mostly ignore other parts of the API.

At its core, Scanner is about regex pattern matching. Unlike the Pattern and Matcher classes, which can only match on a fixed input such as a String, Scanner allows you to match over arbitrary input that might not even exist in memory. There are several Scanner constructors that allow input to be read from various sources such as files, InputStreams, or channels. Scanner handles buffering, and it reads additional input as necessary, and it discards any input that was skipped over during matching. This is really cool. It means you can do matching over arbitrarily sized input data using just a few KB of memory.

(Naturally this depends on the patterns used for matching as well as the well-formedness of input. For example, you can attempt to read a file line by line, and this will work for an arbitrarily sized file if it’s broken up into reasonably sized lines. If the file doesn’t have any line separators, Scanner will bring the whole file into memory, as the file conceptually contains one long line.)
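For instance, line-by-line reading looks like the following sketch. (The StringReader here stands in for a file or stream source; with a real file, Scanner buffers only as much input as it needs for the next match.)

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class LineScanning {
    // Reads lines one at a time. Scanner handles the buffering and
    // discards input that has already been consumed, so memory use
    // stays bounded by the length of the longest line.
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        try (Scanner sc = new Scanner(source)) {
            while (sc.hasNextLine()) {
                lines.add(sc.nextLine());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // A small in-memory source, standing in for a large file
        List<String> lines = readLines(new StringReader("first\nsecond\nthird"));
        System.out.println(lines);  // [first, second, third]
    }
}
```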

Scanner has two fundamental modes of matching. The first mode is to break the input into tokens that are separated by delimiters. The delimiters are defined by the regex pattern you provide. (This is rather like the String.split method.) The second mode is to find chunks of text that result from matching the regex pattern you provide. In other words, the token mode provides the text between matches, and the find mode provides the text of the matches themselves. What’s odd about the Scanner API is that there are groups of methods that apply in one mode but not the other.

The methods that apply to the tokens mode are:

  • delimiter
  • locale
  • hasNext* (excluding hasNextLine)
  • next* (excluding nextLine)
  • radix
  • tokens
  • useDelimiter
  • useLocale
  • useRadix

The methods that apply to the find mode are:

  • findAll
  • findInLine
  • findWithinHorizon
  • hasNextLine
  • nextLine
  • skip

(Additional Scanner methods apply to both modes.)

Here’s an example of using Scanner for matching tokens:

    String story = """
        "When I use a word," Humpty Dumpty said,
        in rather a scornful tone, "it means just what I
        choose it to mean - neither more nor less."
        "The question is," said Alice, "whether you
        can make words mean so many different things."
        "The question is," said Humpty Dumpty,
        "which is to be master - that's all."
        """;

    List<String> words = new Scanner(story)
        .useDelimiter("[- \\.\n\",]+")
        .tokens()
        .collect(Collectors.toList());

(Note, this example uses the new Text Blocks feature, which was previewed in JDK 13 and 14 and which is scheduled to be final in JDK 15.)

Here, we set the delimiter pattern to match whitespace and various punctuation marks, so the tokens consist of text between the delimiters. The results are:

    [When, I, use, a, word, Humpty, Dumpty, said, in, rather, a, scornful,
    tone, it, means, just, what, I, choose, it, to, mean, neither, more,
    nor, less, The, question, is, said, Alice, whether, you, can, make,
    words, mean, so, many, different, things, The, question, is, said,
    Humpty, Dumpty, which, is, to, be, master, that's, all]

In this example I used the tokens() method to provide a stream of tokens. Scanner implements Iterator<String>, which allows you to iterate over the tokens that were found, using the typical hasNext/next methods. Unfortunately, Scanner does not implement Iterable, which would allow you to use it within a for-loop.
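There is a workaround, though it’s a trick rather than anything the Scanner API provides: Iterable is a functional interface whose single method returns an Iterator, so a lambda that returns the Scanner itself adapts it for the for-loop. A sketch (valid for a single pass only, since the Scanner is consumed as it goes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerForEach {
    // Adapts a Scanner (an Iterator<String>) to Iterable<String> via a
    // lambda, so it can be used in an enhanced for-loop. One pass only.
    static List<String> collect(Scanner sc) {
        Iterable<String> tokens = () -> sc;
        List<String> seen = new ArrayList<>();
        for (String tok : tokens) {
            seen.add(tok);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(collect(new Scanner("alpha beta gamma")));
        // [alpha, beta, gamma]
    }
}
```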

Scanner also provides pairs of hasNext/next methods for converting tokens to data. For example, it provides hasNextInt and nextInt methods that search for the next token and convert it to an int (if available). Corresponding pairs of methods are also available for BigInteger, boolean, byte, double, float, long, and short. These pairs of methods are “iterator-like” in that the hasNextX/nextX method pairs are just like the hasNext/next method pair of an Iterator, with the addition of data conversion. But there’s no way to wrap them in an Iterator, like Iterator<BigInteger> or Iterator<Double>, without writing your own adapter code. This is unfortunate, since Scanner is an Iterator<String> but its Iterator is only over tokens, not the value-added iterator-like constructs that include data conversions.
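Such an adapter is straightforward to write yourself, though. Here’s a sketch of an Iterator<Integer> wrapper over the hasNextInt/nextInt pair (the class name and design are mine, not part of the JDK):

```java
import java.util.Iterator;
import java.util.Scanner;

public class IntTokenIterator implements Iterator<Integer> {
    private final Scanner scanner;

    IntTokenIterator(Scanner scanner) {
        this.scanner = scanner;
    }

    // Delegates to the iterator-like hasNextInt/nextInt pair,
    // adding the int conversion that plain Iterator<String> lacks.
    @Override
    public boolean hasNext() {
        return scanner.hasNextInt();
    }

    @Override
    public Integer next() {
        return scanner.nextInt();
    }

    public static void main(String[] args) {
        Iterator<Integer> it = new IntTokenIterator(new Scanner("10 20 30"));
        int sum = 0;
        while (it.hasNext()) {
            sum += it.next();
        }
        System.out.println(sum);  // 60
    }
}
```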

The other main mode of Scanner is the find mode, which provides a succession of matches from a pattern you provide. Here’s an example of that:

    List<String> words = new Scanner(story)
        .findAll("[a-zA-Z']+")
        .map(MatchResult::group)
        .collect(Collectors.toList());

Here, instead of matching delimiters between tokens, I’ve provided a pattern that matches the results I want to get. Note that the return value of findAll() is a Stream<MatchResult>, which must be converted to strings; that’s what the MatchResult::group method does. The resulting list is the exact same list of words as in the previous example. Personally, I find this mode more useful than the tokens mode. You’re providing the pattern for the text you’re interested in, as opposed to a pattern for the delimiters between the text you’re interested in. Also, you get back MatchResult objects, which are useful for extracting substrings of what you matched. This isn’t available in tokens mode.
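For example, capture groups on the MatchResult let you pull out substructure that tokens mode can’t provide. (This is an illustrative sketch; the key=value input format and the method name are made up.)

```java
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

public class FindModeGroups {
    // Extracts just the keys from key=value pairs. Group 1 of each
    // MatchResult is the text captured by the first parenthesized group.
    static List<String> keys(String input) {
        try (Scanner sc = new Scanner(input)) {
            return sc.findAll("(\\w+)=(\\w+)")
                     .map(mr -> mr.group(1))
                     .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        System.out.println(keys("host=local port=8080"));  // [host, port]
    }
}
```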

I started off this article saying that Scanner is weird but useful. It’s weird because it has these two distinct modes. It has groups of methods that apply to one mode but not the other. If you look at the API carefully (or at the implementation) you’ll also see that there is also a bunch of internal state that applies to one mode but not the other. It seems like Scanner should have been split into two classes. Another weird thing about Scanner is that it’s an Iterator<String>, which elevates one part of one of the modes to the top level of the API and relegates the other parts to second-class status.

That said, Scanner provides some very useful services. It does I/O and buffering for you, and if regex matching needs more input, it handles that automatically. I’m also partial to the streams-returning methods like findAll() and tokens() — I have to admit, I added them — but they make bulk processing of arbitrary input quite easy. I hope you find these aspects of Scanner useful as well.

And what we might do about it

Brian Goetz and I gave this presentation in Antwerp, Belgium on November 7, 2019. The original title of this talk (as posted on the conference program) was “Why We Hate Java Serialization And What We’re Doing About It.” We made a slight adjustment to the title just before the presentation.

I initially proposed this talk to Brian because I felt we needed to correct the record about Java serialization. It’s very easy to criticize Java serialization in retrospect. We hear a lot of comments like “Just get rid of it!” but in fact serialization was introduced because it solves — and continues to solve — a very important problem. Like many complex systems, it has flaws, not because its designers were stupid, but because of typical software project difficulties: disagreements over the fundamental goals, being designed and implemented in a hurry, and a healthy dose of corporate politics.

We wanted to document very precisely where we think Java serialization’s flaws are: at the binding to the object model. In addition, Brian and the Java team had been thinking a lot about what the future of serialization would be, and we wanted to present that as well. Those ideas are described in more detail on Towards Better Serialization (June, 2019).

(This is a backdated article, posted on April 2, 2021. I’ve represented to the best of my ability my perspective at the time the presentation was given in November, 2019.)

Oracle Code One 2019

Here’s a quick summary of Oracle Code One 2019, which was last week.

It essentially started the previous week at the “Chinascaria”, Steve Chin’s Community BBQ for JUG leaders and friends. Although Steve is now at JFrog, he’s continuing the BBQ tradition. Of course Bruno Souza, Edson Yanaga, and some other cohorts from Brazil were manning the BBQ, and there was plenty of meat to be had. I didn’t get many photos, but Ruslan from JUG.RU was there and he insisted that we take a selfie:

Hi Ruslan! Oh, here’s a tweet with the chefs from the BBQ:

Java Keynote

The conference kicked off with the Java keynote, The Future of Java is Now, led by Georges Saab. The pace was pretty brisk, with several walk-on guests. We heard Jessica Pointing talk about quantum computing, and Aimee Lucido talk about her new book, Emmy in the Key of Code. This sounds really cool, a book written in Java-code-like verse. This should be interesting to my ten-year-old daughter, since she’s reading the Girls Who Code series right now. I have to say this is the first time I’ve shown a segment of a conference keynote to my family!

Naturally a good section of the keynote covered technical issues. Mikael Vidstedt and Brian Goetz ably covered the evolution of the JVM and the Java programming language. Notably, Mark Reinhold did not appear; he’s taking a break from conferences to refocus on hard technical problems.

My Sessions

This year, I had two technical sessions and a lab. This was a pretty good workload, compared with previous years where I had half a dozen sessions. I felt like I made a good contribution to the audience, but it left time for me to have conversations with colleagues (the “hallway track”) and to attend other sessions I was interested in.

My sessions were:

Collections Corner Cases (slides, video)

This session covered Map’s view collections (keySet, values, entrySet) and topics regarding comparators being “inconsistent with equals.”

Local Variable Type Inference: Friend or Foe? (slides, video)

(with Simon Ritter)

When Simon and I did an earlier version of this talk at another conference, we called it “Threat or Menace.” This probably doesn’t translate too well; to me, it has a 1950s red scare connotation, which is distinctly American. I think that’s why Simon changed it to Friend or Foe. It turns out that Venkat Subramaniam also had a talk on the same subject, entitled “Type Inference: Friend or Foe”!

Lambda, Streams, and Collectors Programming Laboratory (lab repository)

(with Maurice Naftalin and José Paumard)

This lab continues to evolve; there are now over 100 exercises. Thanks to Maurice and José for continuing to maintain and develop the lab materials. I recalled that we first did a Lambda Lab at Devoxx UK in 2013, which was before Java 8 was released. Maurice and Richard Warburton and I got together an hour beforehand and came up with about half a dozen exercises. It was a bit ad hoc, but we managed to keep a dozen or so people busy for an hour and a half.

More recently we (mostly José) have added and reorganized the exercises, converted the project to maven, and converted the test assertions to AssertJ. I’ve finally come around to the idea that maven is the way to go. However, the lab attendees still had their fair share of configuration problems. I think the main problem is the mismatch between maven and the IDE. It’s possible to build the project on the command line using maven, but hitting the “Test” button in the IDE does some magic that doesn’t necessarily invoke maven, so it might or might not work.

Meet the Experts

One thing that was new this year was the “Meet the Experts” sessions. In the past we’d be asked to sign up for “booth duty” which consisted of standing around for a couple hours waiting for people to ask questions. This was mostly a waste of time, since we didn’t have flashy demos. Instead, we scheduled informal, half-hour time slots at a station in the Groundbreakers Hub, and these were put onto the conference program. The result was that people showed up! I signed up for two of these. I didn’t have a formal presentation; I just answered people’s questions. This seemed considerably more useful than past “booth duty.” People had good questions, and I had some good conversations.

Everything You Ever Wanted To Know About Java And Didn’t Know Whom To Ask (video)

I hadn’t signed up for this session, but the day before the session, Bruno Souza corralled me (and several others) into participating in this. Essentially it’s an impromptu “ask me anything” panel. He convinced about 15 people to be on the panel. This included various JUG leaders, conference speakers, and experts in various areas. During the first part of the session, Bruno gathered questions from the audience and a colleague typed them into a document that was projected on the screen. Then he called the panelists up on stage. The rest of the session was the panel picking questions and answering them. I thought this turned out quite well. People got their questions answered, we covered quite a variety of topics, and it provoked some interesting discussions.

Other Sessions of Interest

I attended a few other sessions that were quite useful. I also watched on video some of the sessions that I had missed. Here they are, in no particular order:

Robert Seacord, Serialization Vulnerabilities (video)

Mike Duigou, Exceptions 2020 (slide download available)

Sergey Kuksenko, Does Java Need Value Types? Performance Perspective (video)

Brian Goetz, Java Language Futures, 2019 Edition (video)

Venkat Subramaniam, Type Inference: Friend or Foe? (video)

Robert Scholte, Broken Build Tools and Bad Behaviors (slide download available)

Nikhil Nanivadekar, Do It Yourself: Collections

Here’s the playlist of Code One sessions that were recorded.

Unfortunately, not all of the sessions were recorded. Some of the speakers’ slide decks are available for download via the conference catalog.


It was recently announced that Jakarta EE will not be allowed to evolve APIs in the javax.* namespace. (See Mike Milinkovich’s announcement and his followup Twitter thread.) Shortly thereafter, David Blevins posted a proposal and call for discussion about how Jakarta EE should transition its APIs into the new jakarta.* namespace. There seem to be two general approaches to the transition: a “big bang” (do it all at once) approach and an incremental approach. I don’t have much to add to the discussion about how this transition should take place, except to say that I’m pleasantly surprised at the amount of energy and focus that has emerged in the Jakarta EE community around this effort.

I’m a Java SE guy, so the details of Java EE and Jakarta EE specifications are pretty much outside my bailiwick. However, as Dr Deprecator, I should point out that there is one area of overlap: the dependence of Java EE / Jakarta EE APIs on deprecated Java SE APIs. One example in particular that I’m aware of was brought to my attention by my colleague Sean Mullan, who is tech lead of the Java SE Security Libraries group.

The Java SE API in question is java.security.Identity, which was deprecated in JDK 1.2 (released 1998) and deprecated for removal in Java 9. This API has been deprecated for a very long time, and we’d like to remove it from Java SE. For most purposes, it can be replaced by java.security.Principal, which was added in JDK 1.1 (released 1997).

The EJB specification uses the Identity type in a couple methods of the EJBContext class. If we were to remove Identity from some release of Java SE, it would mean that EJB — and any Java EE, Jakarta EE, or any other framework that includes EJB — would no longer be compatible with that release of Java SE. We’ve thus held off removing this type for the time being, in order to avoid pulling the rug out from underneath the EE specs.

Identity is used in only two methods of the EJBContext class. It appears that these methods were deprecated in EJB 1.2, and replacements that use Principal were introduced at that time. Since J2EE 1.2 was introduced in 1999, things have been this way for about 20 years. I think it’s time to do some cleanup! (See EJB-spec issue #130.)

For better or for worse, these methods still appear in Java EE 8. As I understand things, the next specification release will be Jakarta EE 9, which will be the earliest opportunity to change the EE specification to remove the dependency on the deprecated SE APIs.

The usual argument against removing stuff is that it’s both source and binary incompatible. If something falls over because of a missing API, it’s pretty hard to work around. This is the reason that deprecated stuff has stayed around for so many years. On the other hand, if these deprecated APIs aren’t removed now, when will they be removed?

I’d argue that the upcoming package renaming (whether incremental or big bang) is an opportunity to remove obsolete APIs, because such renaming is inherently both source and binary incompatible. People will have to run migration tools and change their code when they transition it from Java EE 8 to Jakarta EE 9. There can be no expectation that old jar files will run unchanged in the new Jakarta world. Thus, the package renaming is an opportunity to shed these obsolete APIs.

I’m not aware of any EE APIs other than EJBContext that depend on Java SE APIs that are deprecated for removal. I did a quick check of GlassFish 5 using the jdeprscan tool, and this one was the only API-to-API dependency that I found. However, I’m not an expert in EE and GlassFish, so I’m not sure I checked the right set of jars. (I did find a bunch of other stuff, though. Contact me if you’re interested in details.)

I had a brief Twitter exchange with David Blevins on this topic the other day. He pointed me at the parts of the TomEE implementation that implements EJBContext, and it turns out that the two methods in question simply throw UnsupportedOperationException. This is good news, in that it means TomEE applications aren’t using these methods, which means that those applications won’t break if these methods are removed.

However, that doesn’t mean these methods can simply be removed from EE implementations! The TCKs have what is called a “signature test,” which scans the libraries for the public classes, fields, and methods, to make sure that all the APIs required by the specifications are present and that there are no extra APIs. I’m fairly sure that the EE TCK signature test contains entries for those methods. Thus, what needs to happen is that the Jakarta EE specification needs to remove these methods, the EE TCK needs to be updated to match, and then implementations can remove — in fact, will be required to remove — these methods when they’re brought into conformance with the new specification.

Note that all of this is separate from the question of what to do with other deprecated Jakarta EE APIs that don’t depend on deprecated Java SE APIs. Deprecated Jakarta EE APIs might have been deprecated for their own reasons, not because of their dependency on SE APIs. These should be considered on their own merits and an appropriate removal plan developed. Naturally, as Dr Deprecator, I like removing old, obsolete APIs. But the deprecation and potential removal plan for deprecated Jakarta EE APIs needs to be developed with the particular evolution path of those APIs in mind.

This is a very belated post that covers a session that took place at the JavaOne conference in San Francisco, October 2017.

Here’s a recap of the BOF (“birds-of-a-feather”) session I led on software maintenance. The title was Maintenance – The Silent Killer. This was my feeble attempt at clickbait. This was an evening session that was held during the dinner hour, and maintenance isn’t the most scintillating topic, so I figured attendance needed all the help I could give it.

When the start time arrived, I was standing on the podium in an empty room. I thought, well, if nobody shows up then I can go home early. Then about fifty people flooded in! It turns out they had lined up outside waiting for their badges to be scanned, but then a conference staffer came by and told them that badges weren’t scanned for the evening sessions and that they should just go in.

Overall I thought it went quite well. I gave a brief presentation, and then set up some discussion questions for the audience. The people who showed up really were interested in maintenance, they offered a variety of interesting insights and views, and they were quite serious about the topic. There was enough discussion to fill the allotted time, and there was plenty of interaction between me and the audience and among audience members themselves. I’ll declare the session to have been successful, though it’s difficult for me to draw any grand conclusions from it. I was heartened by the amount of participation. I was really concerned that nobody would show up, or perhaps that three people would show up, since most tech conferences are about the latest and greatest new shiny thing.

The session wasn’t recorded. What follows is some notes on my slide presentation, followed by some additional notes from the discussion that followed. These are unfortunately rather sparse, as I was participating at the same time. However, I did capture a few ideas that I hadn’t considered previously, which I found quite beneficial.

Slide Presentation (PDF)

Slide 2: Golden Gate Bridge. I grew up in Marin County, which is connected to San Francisco by the Golden Gate Bridge. We crossed the bridge frequently. Back in 1974 or so the toll was raised from 50¢ to 75¢, and my parents complained incessantly about this. At one point I had the following conversation with my Dad about the toll:

Me: Why do they collect tolls?
Dad: To pay off the bridge.
Me: When will the bridge be paid off?
Dad: Never!

As a kid I was kind of perplexed by this. If you take out a loan, and make regular payments on it, won’t it eventually be paid off? (Sub-prime mortgages weren’t invented until much later.) Of course, the original construction loans have long since been paid off. What the tolls are used for, and which indeed will never be paid off, is the continuous maintenance that the bridge requires.

Slide 3: This is me driving my car through Tunnel Log in Sequoia National Park. The point isn’t about a tunnel through a tree, but the cost of owning and operating a car. The first time I used my car for business expenses, I was surprised by the per-mile reimbursement amount. If you consider the 2017 numbers, this car’s gasoline costs about 14¢-20¢ per mile, and the IRS standard reimbursement rate is 53.5¢ per mile. Hey, I’m making money on this deal!

No. This is a 1998 BMW, and you will not be surprised to learn that the cost of maintenance on this car is quite significant. Indeed, I’ve added up the maintenance costs over the lifetime of the car, and they outweigh the cost of gasoline. Counting maintenance and depreciation, I’m decidedly not making money on mileage reimbursement.

Slide 4 has some points on maintenance as a general phenomenon. One point that bears further explanation is my claim that “deferred maintenance costs can grow superlinearly.” Continuing with the car example, consider oil changes. It might cost a couple hundred dollars a year for regular oil changes. You could save money for a couple years by not changing the oil. This might eventually result in a several thousand dollar engine rebuild. “Superlinear” isn’t very precise, but the point is that the cost of remediating problems caused by deferred maintenance is often much greater than the sum of incremental maintenance costs.

Slide 5, quotation from Kurt Vonnegut. Perhaps profound if you’ve never heard it before, but a cliché if you pay attention to maintenance. It does seem to be true that in general creative activities get all the attention at the expense of maintenance activities.

Slides 6-7. Physical systems exhibit wear and friction and this contributes to the need to do regular maintenance. Software doesn’t “wear” out. But there are a bunch of phenomena that cause software systems to require maintenance. Primarily these seem to be related to the environment in which the software exists, not the software itself.

Slides 8-9. Most planning and costing efforts around software projects are concerned with software construction. Maintenance is a significant cost, accounting for perhaps 50% to 75% (Boehm) or 40% to 80% (Glass) of the total life cycle costs. However, comparatively little planning and budgeting effort goes toward maintenance.

Glass points out that software maintenance and construction are essentially the same activity, except that maintenance requires additional effort to “understand the existing product.” As a programmer, when you’re developing software, you know what you’re trying to do and you’re familiar with the code you’re developing at the moment. When maintaining software, you often have to deal with code that you might never have seen before and figure out what it does before you can modify it successfully. The cost incurred in re-acquiring knowledge and understanding of an existing software system is significant.

Slide 10. OpenJDK is an open source implementation of Java. It’s an old code base; Java 1.0 was released in 1996, and it was in development for a couple years prior to that. It’s been continually evolved and maintained since then. Evolution consists of usual software activities such as adding features, improving performance, fixing bugs, mitigating security vulnerabilities, and maintaining old releases. Maintenance activities are a large portion of the team’s activities. I’m not sure how to measure it, but the estimates from Boehm and Glass above are quite plausible.

In addition to the above development activities the team also puts effort into deprecation and removal of obsolete features. This is important because, among other things, it helps to reduce the long-term maintenance burden. See some of my prior materials on the topic of deprecation:

The cost of knowledge re-acquisition mentioned previously is somewhat mitigated by systems maintained by the JDK group that preserve history.

The open version of the JDK source code is in the Mercurial version control system, and it includes changesets dating back to December 2007. The earlier source code history is in a closed, Oracle-internal system and dates back to August 1994.

The JDK Bug System (a JIRA instance) contains over 265,000 bugs and feature requests dating back to late 1994. Many of these bugs were converted from a Sun Microsystems internal bug database.

Personally, I’ve found the ability to search over 20 years of source code history and bug history to be of immense value in understanding existing code and diagnosing problems.

Slide 11. A big driver of software maintenance is security vulnerabilities. This has gotten worse in recent years, as “everything” is connected to the internet. Another significant contributor to maintenance issues is the large number of dependencies among software components, many of which are open source. By reusing external software components, you can reduce development time; however, doing so means taking on the maintenance burden of those components. Either you have to keep all the external components up to date, or you have to maintain them yourself.

Slide 12. Questions and Audience Discussion

The slide has several questions to spark discussion with the audience. We didn’t address them directly, but there was a relatively free-flowing conversation. Here are some notes from that conversation.

One audience member compared maintenance to a fence. Suppose you have a pasture, and wolves keep coming to it and attacking your sheep. So you put up a fence. The fence just sits there. The sheep graze peacefully. Wolves stay away because they realize they can’t get past the fence. Nothing happens. The fact that nothing is happening is a huge benefit! Like a fence, a well-maintained system just does its thing without calling attention to itself. This may lead people to forget about it. A poorly-maintained system is constantly breaking, attracting lots of attention.

An attendee suggested thinking about maintenance planning the same way a project manager thinks about risk management. With less maintenance there is a greater risk of failure, and vice-versa.

Another attendee suggested insurance as a model for maintenance. Maintenance costs are like insurance premiums: you pay them regularly, and you’re protected. Not paying them saves money temporarily, until some disaster strikes. (Rather like my car oil change example above.) Of course, insurance is closely related to risk management, and as a social institution it seems poorly understood by most lay individuals.

An audience member suggested just biting the bullet and declaring that maintenance is just a cost of doing business. There’s no use complaining about it; you just have to accept it. Another audience member said that his department allocated 10% of its budget to maintenance costs.

Regarding keeping up with software updates, one attendee pointed out that it’s not necessarily important to be on the latest release, but it is important to be on the latest patch or update level even if you’re on an old release. Many commercial software products have support contracts under which old releases are maintained for many years. Old releases don’t have the most features or the highest performance, but they are maintained with fixes for current security vulnerabilities and other high-priority problems.

(This is a big component of the business of my company, Oracle. This is also true of products from many other software companies.)

Last week, Paige Niedringhaus posted an article Using Java to Read Really, Really Large Files. While one can quibble about whether the file to be processed is indeed “really, really large,” it’s large enough to expose some interesting concerns and to present some interesting opportunities for optimizations and use of newer APIs. There was some discussion on Reddit /r/java and /r/programming and a PR with an alternative implementation. Earlier today, Morgen Peschke posted an analysis and comparison to a Scala version of the program. (This is posted as a comment at the bottom of the original article.) This article is my contribution to the discussion.

When I ran Niedringhaus’ program on my machine using JDK 8, I ran into the same memory issues as did Peschke; the program consumed so much memory that it spent all its time in garbage collection. Increasing the heap size worked around this problem. Interestingly, using JDK 11, I was able to run Niedringhaus’ version successfully without increasing the heap size. (I suspect the reason is that JDK 11 uses G1GC as the default collector, and its different collection scheme avoids the pathological behavior of the Parallel GC, which is the default collector in JDK 8.)

The approach I’ll take is to retain the large lists accumulated by the original program. My presumption is that the lists are loaded into memory in order to do further analysis that isn’t part of the original program. Instead of reducing memory consumption, I focus on changing aspects of the computation to improve runtime performance. After establishing the program’s baseline performance, I proceed to show several variations on the code that successively improve its performance, along with some discussion describing the reasons for the improvement. I present a diff for each variation. Each variation, along with my final version, is also available in a gist.

I downloaded indiv18.zip on 2019-01-04 and extracted the itcont.txt file. It’s about 3.3GB in size and has 18,245,416 lines. I started with the 2019-01-05 version of Niedringhaus’ test program:


For comparing execution times, I’m using the last time reported by the program, the “Most common name time.” The benchmark times all showed a fair amount of variability. In most cases I reran the program a few times and chose the fastest time. This isn’t very rigorous, but it should at least give an idea of the relative speeds of execution. I’ve rounded to whole seconds, because the high variability makes milliseconds superfluous.

Niedringhaus’ article reported a runtime of about 150 seconds for this version of the program. I ran the program on my laptop (MacBook Pro, mid 2014, 3GHz Intel Core i7, 2 cores, 16GB) and the execution time was about 108 seconds. I’ll use that figure as the baseline against which subsequent optimizations are compared.

Variation 1

--- ReadFileJavaApplicationBufferedReader0.java
+++ ReadFileJavaApplicationBufferedReader1.java
@@ -57,8 +57,8 @@
                // System.out.println(readLine);
                // get all the names
-               String array1[] = readLine.split("\\s*\\|\\s*");
-               String name = array1[7];
+               String array1[] = readLine.split("\\|", 9);
+               String name = array1[7].strip();
                        System.out.println("Name: " + names.get(lines - 1) + " at index: " + (lines - 1));
@@ -80,7 +80,7 @@
-               String rawDate = array1[4];
+               String rawDate = array1[4].strip();
                String month = rawDate.substring(4,6);
                String year = rawDate.substring(0,4);
                String formattedDate = month + "-" + year;

Applying this patch reduced the execution time from 108 seconds to about 44 seconds.

This change is actually two optimizations. String splitting is quite expensive, and it’s done once for each of the 18 million lines in the file. It’s thus quite beneficial to remove work from the program’s main loop. The String.split() call uses a regex that splits the line into fields, where the separator is a vertical bar including any adjacent whitespace. The regex pattern is compiled each time through the loop. It would save some time to compile the regex once before the loop and to reuse it. But it turns out that using a regex here is unnecessary. We can instead split on a vertical bar alone. The split() method has a fast path for single-character split patterns which avoids regexes entirely. (Since the vertical bar is a regex metacharacter, it still counts as a single character even with the backslash escapes.) Thus we don’t need to worry about pre-compiling the split pattern.

Changing the split pattern can leave unwanted whitespace in some of the fields we’re interested in. To deal with this, we call the String.strip() method to remove it from those fields. The strip() method is new in Java 11. It removes whitespace from both ends of a string, where whitespace is defined using Unicode semantics. This differs from the older String.trim() method, which uses an anachronistic definition of whitespace based on ASCII control characters.

The second optimization applies a limit to the number of splits performed. Each line of the file has 21 fields. Without the limit parameter, the split() method will split the entire line into 21 fields and create string objects for them. However, the program is only interested in data from the 5th and 8th fields (array indexes 4 and 7). It’s a lot of extra work to split the remaining fields and then just to throw them away. Supplying a limit argument of 9 will stop splitting after the eighth field, leaving the remainder of the line unsplit in the last array element (at index 8). This reduces the amount of splitting work considerably.
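The effect of the two optimizations can be sketched on a small made-up line with the same pipe-separated shape (the field values here are hypothetical, not real itcont.txt data):

```java
// Sketch of the two optimizations: splitting on a literal "|" with a limit
// of 9, then stripping whitespace from only the two fields we actually use.
public class SplitLimitDemo {
    public static void main(String[] args) {
        // Hypothetical line in the pipe-separated itcont.txt format
        String line = "C001|X|Y|Z| 20180215 |A|B| SMITH, JOHN A |rest|of|the|line";

        // With a limit of 9, fields 0..7 are split out and everything after
        // the eighth separator is left unsplit in element 8.
        String[] fields = line.split("\\|", 9);
        System.out.println(fields.length);   // 9
        System.out.println(fields[8]);       // rest|of|the|line

        // strip() (new in Java 11) removes Unicode whitespace from both ends.
        System.out.println(fields[7].strip());  // SMITH, JOHN A
        System.out.println(fields[4].strip());  // 20180215
    }
}
```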

Variation 2

--- ReadFileJavaApplicationBufferedReader1.java
+++ ReadFileJavaApplicationBufferedReader2.java
@@ -29,17 +29,12 @@
        // get total line count
        Instant lineCountStart = Instant.now();
-       int lines = 0;
        Instant namesStart = Instant.now();
        ArrayList<String> names = new ArrayList<>();
        // get the 432nd and 43243 names
-       ArrayList<Integer> indexes = new ArrayList<>();
-       indexes.add(1);
-       indexes.add(433);
-       indexes.add(43244);
+       int[] indexes = { 0, 432, 43243 };
        // count the number of donations by month
        Instant donationsStart = Instant.now();
@@ -53,16 +48,12 @@
          System.out.println("Reading file using " + Caller.getName());
        while ((readLine = b.readLine()) != null) {
-               lines++;
                // System.out.println(readLine);
                // get all the names
                String array1[] = readLine.split("\\|", 9);
                String name = array1[7].strip();
-               if(indexes.contains(lines)){
-                       System.out.println("Name: " + names.get(lines - 1) + " at index: " + (lines - 1));
-               }
                if(name.contains(", ")) {
@@ -88,11 +79,15 @@
+         for (int i : indexes) {
+             System.out.println("Name: " + names.get(i) + " at index: " + (i));
+         }
        Instant namesEnd = Instant.now();
        long timeElapsedNames = Duration.between(namesStart, namesEnd).toMillis();
        System.out.println("Name time: " + timeElapsedNames + "ms");
-       System.out.println("Total file line count: " + lines);
+       System.out.println("Total file line count: " + names.size());
        Instant lineCountEnd = Instant.now();
        long timeElapsedLineCount = Duration.between(lineCountStart, lineCountEnd).toMillis();
        System.out.println("Line count time: " + timeElapsedLineCount + "ms");

This patch reduces the execution time from 44 seconds to about 40 seconds.

This is perhaps a bit of a cheat, but it’s another example of removing work from the inner loop. The original code maintained a list of indexes (line numbers) for which names are to be printed out. During the loop, a counter would keep track of the current line, and the current line would be queried against the list of indexes to determine if the name is to be printed out. The list is short, with only 3 items, so searching it is pretty quick. There are 18,245,416 lines in the file and only 3 indexes in the list, so searching the list for the current line number will fail 18,245,413 times. Since we’re storing all the names in a list, we can just print out the names we’re interested in after we’ve loaded them all. This avoids having to check the list within the inner loop.

The patch also stores the indexes in an array since the syntax for initializing an array is a bit more concise. It also avoids boxing overhead. Boxing of three elements isn’t a significant overhead, so it’s unlikely this makes any measurable difference in the performance. In general, I prefer to avoid boxing unless it’s necessary.

Variation 3

--- ReadFileJavaApplicationBufferedReader2.java
+++ ReadFileJavaApplicationBufferedReader3.java
@@ -44,6 +45,7 @@
        Instant commonNameStart = Instant.now();
        ArrayList<String> firstNames = new ArrayList<>();
+       var namePat = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
        System.out.println("Reading file using " + Caller.getName());
@@ -55,20 +57,13 @@
                String name = array1[7].strip();
-               if(name.contains(", ")) {
-                       String array2[] = (name.split(", "));
-                       String firstHalfOfName = array2[1].trim();
-                       if (!firstHalfOfName.isEmpty()) {
-                               if (firstHalfOfName.contains(" ")) {
-                                       String array3[] = firstHalfOfName.split(" ");
-                                       String firstName = array3[0].trim();
-                                       firstNames.add(firstName);
-                               } else {
-                                       firstNames.add(firstHalfOfName);
-                               }
+               var matcher = namePat.matcher(name);
+               if (matcher.find()) {
+                   String s = matcher.group(2);
+                   if (s == null) {
+                       s = matcher.group(3);
+                   }
+                   firstNames.add(s);
                String rawDate = array1[4].strip();

This patch reduces the execution time from 40 to about 38 seconds.

Whereas in variation 1 we saw that reducing a regex to a single character split pattern helped provide a large speedup, in this case we’re replacing some fairly involved string splitting logic with a regex. Note that this code compiles the regex outside the loop and uses it repeatedly within the loop. In this patch I’m attempting to provide similar semantics to the splitting logic, but I’m sure there are cases where it doesn’t produce the same result. (For the input data in this file, the regex produces the same result as the splitting logic.) Unfortunately the complexity is moved out of the logic and into the regex. I’m not going to explain the regex in great detail, since it’s actually fairly ad hoc itself. One problem is that extracting a “first name” from a name field relies on European name conventions, and those conventions don’t apply to all names in this file. A second problem is that the data itself isn’t well-formed. For example, one name in the file is “FOWLER II, COL. RICHARD”. Both the splitting logic and the regex extract the first name as “COL.” which is clearly a title, not a name. It’s unclear what can be done in this case. Nevertheless, the vast majority of records in the file are well-formed, and applying European name conventions works for them. For a name record such as “SMITH, JOHN A” both the splitting logic and the regex extract “JOHN” as the first name, which is the intended behavior.

Variation 4

--- ReadFileJavaApplicationBufferedReader3.java
+++ ReadFileJavaApplicationBufferedReader4.java
@@ -45,7 +45,7 @@
        Instant commonNameStart = Instant.now();
        ArrayList<String> firstNames = new ArrayList<>();
-       var namePat = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
+       var namePat = Pattern.compile(", \\s*([^, ]+)");
          System.out.println("Reading file using " + Caller.getName());
@@ -59,11 +59,7 @@
                  var matcher = namePat.matcher(name);
                  if (matcher.find()) {
-                     String s = matcher.group(2);
-                     if (s == null) {
-                         s = matcher.group(3);
-                     }
-                     firstNames.add(s);
+                     firstNames.add(matcher.group(1));
                String rawDate = array1[4].strip();

This patch reduces the runtime from 38 seconds to about 35 seconds.

For reasons discussed previously, it’s difficult in general to extract the correct “first name” from a name field. Since most of the data in this file is well-formed, I took the liberty of making some simplifying assumptions. Instead of trying to replicate the original splitting logic, here I’m using a simplified regex that extracts the first non-comma, non-space sequence of characters that follows a comma-space separator. In most cases this will extract the same first name from the name field, but there are some edge cases where it returns a different result. Assuming this is acceptable, it allows a simplification of the regex and also of the logic to extract the desired substring from the match. The result is another small speedup.
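A quick sketch of the simplified regex against the two example names discussed earlier shows it extracting “JOHN” for the well-formed record and, like the original splitting logic, “COL.” for the malformed one:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the simplified first-name extraction: grab the first sequence of
// non-comma, non-space characters that follows a comma-space separator.
public class FirstNameDemo {
    public static void main(String[] args) {
        Pattern namePat = Pattern.compile(", \\s*([^, ]+)");
        for (String name : new String[] { "SMITH, JOHN A", "FOWLER II, COL. RICHARD" }) {
            Matcher matcher = namePat.matcher(name);
            if (matcher.find()) {
                System.out.println(matcher.group(1));  // JOHN, then COL.
            }
        }
    }
}
```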

Variation 5

--- ReadFileJavaApplicationBufferedReader4.java
+++ ReadFileJavaApplicationBufferedReader5.java
@@ -46,6 +46,8 @@
        ArrayList<String> firstNames = new ArrayList<>();
        var namePat = Pattern.compile(", \\s*([^, ]+)");
+       char[] chars = new char[6];
+       StringBuilder sb = new StringBuilder(7);
        System.out.println("Reading file using " + Caller.getName());
@@ -63,11 +65,12 @@
                String rawDate = array1[4].strip();
-               String month = rawDate.substring(4,6);
-               String year = rawDate.substring(0,4);
-               String formattedDate = month + "-" + year;
-               dates.add(formattedDate);
+               rawDate.getChars(0, 6, chars, 0);
+               sb.setLength(0);
+               sb.append(chars, 0, 4)
+                 .append('-')
+                 .append(chars, 4, 2);
+               dates.add(sb.toString());
          for (int i : indexes) {

This patch reduces the runtime from 35 seconds to about 33 seconds.

This change is primarily to reduce the amount of memory allocation within the inner loop. The previous code extracts two substrings from the raw date, creating two objects. It then appends the strings with a “-” separator, which requires creation of a temporary StringBuilder object. (This is likely still true even with JEP 280 – Indify String Concatenation in place.) Finally, the StringBuilder is converted to a String, allocating a fourth object. This last object is stored in a collection, but the first three objects are garbage.

To reduce object allocation, the patch code creates a char array and a StringBuilder outside the loop and reuses them. The character data is extracted into the char array, pieces of which are appended to the StringBuilder along with the “-” separator. The StringBuilder’s contents are then converted to a String, which is then stored into the collection. This String object is the only allocation that occurs in this step, so the patch code avoids creating any garbage.

I’m of two minds about this optimization. It provides a few percentage points of speedup, but it’s decidedly non-idiomatic Java: it’s rare to reuse objects this way. However, this code doesn’t introduce much additional complexity, and the speedup is measurable, so I decided to keep it in. It also illustrates some techniques for dealing with character data that can reduce memory allocation, which can become expensive if done within an inner loop.
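The reuse technique can be sketched in isolation (the date values here are made up; the output format follows the patch, which emits year-month):

```java
// Sketch of the allocation-reducing date reformatting: the char[] and
// StringBuilder are allocated once and reused for every line, so the only
// per-iteration allocation is the final String that gets stored.
public class DateFormatDemo {
    public static void main(String[] args) {
        char[] chars = new char[6];
        StringBuilder sb = new StringBuilder(7);
        for (String rawDate : new String[] { "20180215", "20171101" }) {
            rawDate.getChars(0, 6, chars, 0);  // copy the yyyymm prefix into the buffer
            sb.setLength(0);                   // reset the builder without reallocating
            sb.append(chars, 0, 4)             // year
              .append('-')
              .append(chars, 4, 2);            // month
            System.out.println(sb.toString()); // 2018-02, then 2017-11
        }
    }
}
```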

Variation 6

--- ReadFileJavaApplicationBufferedReader5.java
+++ ReadFileJavaApplicationBufferedReader6.java
@@ -115,16 +115,9 @@
-       LinkedList<Entry<String, Integer>> list = new LinkedList<>(map.entrySet());
+       Entry<String, Integer> common = Collections.max(map.entrySet(), Entry.comparingByValue());
-       Collections.sort(list, new Comparator<Map.Entry<String, Integer> >() {
-               public int compare(Map.Entry<String, Integer> o1,
-                                  Map.Entry<String, Integer> o2)
-               {
-                       return (o2.getValue()).compareTo(o1.getValue());
-               }
-       });
-       System.out.println("The most common first name is: " + list.get(0).getKey() + " and it occurs: " + list.get(0).getValue() + " times.");
+       System.out.println("The most common first name is: " + common.getKey() + " and it occurs: " + common.getValue() + " times.");
        Instant commonNameEnd = Instant.now();
        long timeElapsedCommonName = Duration.between(commonNameStart, commonNameEnd).toMillis();
        System.out.println("Most common name time: " + timeElapsedCommonName + "ms");

This patch reduces the runtime from 33 seconds to about 32 seconds.

The task here is to find the most frequently occurring first name. Instead of sorting a list of map entries, we can simply use Collections.max() to find the maximum entry according to some criterion. Also, instead of having to write out a comparator that compares the values of two map entries, we can use the Entry.comparingByValue() method to obtain such a comparator. This doesn’t result in much of a speedup. The reason is that, despite there being 18 million names in the file, there are only about 65,000 unique first names in the file, and thus only that many entries in the map. Computing the maximum entry saves a little bit of time compared to doing a full sort, but not that much.
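In isolation, the max-by-value idiom looks like this (using a small made-up frequency map):

```java
import java.util.Collections;
import java.util.Map;
import java.util.Map.Entry;

// Sketch of finding the most frequent entry with Collections.max() and a
// value-based comparator, instead of sorting the entire entry set.
public class MaxEntryDemo {
    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("JOHN", 5, "MARY", 9, "ALEX", 2);
        Entry<String, Integer> common =
            Collections.max(counts.entrySet(), Entry.comparingByValue());
        // prints: MARY occurs 9 times
        System.out.println(common.getKey() + " occurs " + common.getValue() + " times");
    }
}
```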

Variation 7

This isn’t a patch, but instead I did a general cleanup and refactoring pass. I’ll describe the changes here. The revised source file is in this gist:


The changes didn’t significantly affect the runtime, which remained at about 32 seconds.

There are a couple places in the original code where a frequency table is generated. The general algorithm is to create a map of items to counts (typically Integer) to hold the results. Then, for each item, if there’s no entry for it in the map, insert it with the value 1, otherwise add 1 to the value that’s already there. Several commenters have suggested using Map.merge() to make the put-or-update logic within the loop more concise. This will indeed work, but there’s a better way to do this using streams. For example, the list firstNames contains all the first names extracted from the file. To generate a frequency table of these names, one can use this code:

Map<String, Long> nameMap = firstNames.stream()
                                      .collect(groupingBy(name -> name, counting()));

(This assumes a static import of java.util.stream.Collectors.* or individual names.) See the JDK Collectors documentation for more information. Note that the count value is a Long, not an Integer. Note also that we must use boxed values instead of primitives here, because we’re storing the values into collections.
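For comparison, the Map.merge() approach that commenters suggested might look like this (a sketch with made-up names; the stream collector above produces the same counts):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the Map.merge() put-or-update idiom: insert 1 for a new key,
// otherwise combine the existing count with 1 using Long::sum.
public class MergeDemo {
    public static void main(String[] args) {
        List<String> firstNames = List.of("JOHN", "MARY", "JOHN");
        Map<String, Long> nameMap = new HashMap<>();
        for (String name : firstNames) {
            nameMap.merge(name, 1L, Long::sum);
        }
        System.out.println(nameMap.get("JOHN")); // 2
        System.out.println(nameMap.get("MARY")); // 1
    }
}
```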

I also use this same technique to generate the frequency table for dates:

Map<String, Long> dateMap = dates.stream()
                                 .collect(groupingBy(date -> date, counting()));

The typical technique to loop over a map involves looping the map’s entry set, and extracting the key and value from the entry using the getKey() and getValue() methods. Often, a more convenient way to loop over the entries of a Map is to use the Map.forEach() method. I used this to print out the map entries from the date map:

dateMap.forEach((date, count) ->
    System.out.println("Donations per month and year: " + date + " and donation count: " + count));

What makes this quite convenient is that the key and value are provided as individual arguments to the lambda expression, avoiding the need to call methods to extract them from an Entry.

Instead of creating a File object, opening a FileReader on it, and then wrapping it in a BufferedReader, I used the NIO newBufferedReader() method:

BufferedReader b = Files.newBufferedReader(Path.of(FILENAME))

It’s a bit more convenient than the wrapping approach.
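In context, that might be sketched as follows, with try-with-resources closing the reader automatically (the file name here is a placeholder):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of opening the input with Files.newBufferedReader() inside
// try-with-resources, replacing the File/FileReader/BufferedReader wrapping.
public class ReaderDemo {
    public static void main(String[] args) throws IOException {
        Path path = Path.of("itcont.txt");  // hypothetical input file
        try (BufferedReader b = Files.newBufferedReader(path)) {
            String readLine;
            while ((readLine = b.readLine()) != null) {
                // process each line here
            }
        }  // reader is closed automatically, even on exception
    }
}
```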

Other changes I made include the following:

  • Unified the start time into a single Instant variable, and refactored the elapsed time reporting into a separate between() method.
  • Removed the outermost try statement whose catch block does nothing other than printing a stack trace. I see this a lot; I suspect it exists in some code template somewhere. It’s completely superfluous, because simply letting the exception propagate will cause the default exception handler to print the stack trace anyway. The only thing you might need to do is to add a throws IOException to the main() method, which is what I did in this case.
  • Used interface types instead of implementation types. I used List and Map in variable declarations instead of ArrayList and HashMap. This is an example of programming to an interface, not an implementation. This is not of great consequence in a small program, but it’s a good habit to get into, especially when defining fields and methods. I could also have used var in more places, but I wanted to be explicit when I changed type arguments, e.g., from Integer to Long.
  • Reindented the code. The JDK style is to use spaces for indentation, in multiples of 4. This avoids lines that are indented halfway off the right edge of the screen, but mainly I’m more comfortable with it.

Performance Recap

Version             Time (sec)      Description
-------             ----------      -----------
Original               108          baseline
Variation 1             44          optimize line splitting
Variation 2             40          rearrange printing lines by index
Variation 3             38          use regex for extracting first name
Variation 4             35          simplified first name regex
Variation 5             33          reuse StringBuilder/char[] for date extraction
Variation 6             32          use max() instead of sort()
Variation 7             32          cleanup

Summary & Comment

The first several optimizations involved removing work from the inner loop of the program. This is fairly obvious. Since the loop is executed a lot (18 million times) even a small reduction in the amount of work can affect the program’s runtime significantly.

What’s less obvious is the effect of reducing the amount of garbage generated within a loop. When more garbage is generated, it fills up the heap more quickly, causing GC to run more frequently. The more GC runs, the less time the program can spend getting work done. Thus, reducing the amount of garbage generated can also speed up a program.

I didn’t do any profiling of this program. Normally when you want to optimize a program, profiling is one of the first things you should do. This program is small enough, and I think I have a good enough eye for spotting potential improvements, that I was able to find some significant speedups. However, if somebody were to profile my revised version, they might be able to find more things to optimize.

Typically it’s a bad idea to do ad hoc benchmarking by finding the difference between before-and-after times. This is often the case with microbenchmarking. In such cases it’s preferable to use a benchmarking framework such as JMH. I didn’t think it was strictly necessary to use a framework to benchmark this program, though, since it runs long enough to avoid the usual benchmarking pitfalls. However, the differences in the runtimes between the later optimizations are getting smaller and smaller, and it’s possible that I was misled by my informal timing techniques.

Several commenters have suggested using the Files.lines() method to get a stream of lines, and then running this stream in parallel. I’ve made a few attempts to do this, but I haven’t shown any here. One issue is with program organization. As it stands, this program’s main loop extracts data into three lists. Doing this using streams involves either operations with side effects (which are not recommended for parallel streams) or creating an aggregate object that can accumulate the results. These are certainly reasonable approaches, but I wasn’t able to get any speedup from using parallel streams — at least on my 2-core system. The additional overhead of aggregation seemed to more than offset the benefit gained from running on two cores. It’s quite possible that, with more work or by running the program on a system with more cores, one could realize a benefit from running in parallel.
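The general shape of the aggregation approach can be sketched for just the first-name frequency table, using a concurrent collector to avoid side effects (the file name is a placeholder, and this is not one of my timed attempts):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Stream;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingByConcurrent;

// Sketch of a parallel-stream variant: Files.lines() feeds a parallel
// pipeline, and groupingByConcurrent() does the aggregation without any
// side-effecting operations.
public class ParallelCountDemo {
    public static void main(String[] args) throws IOException {
        try (Stream<String> lines = Files.lines(Path.of("itcont.txt"))) {
            ConcurrentMap<String, Long> nameMap = lines.parallel()
                .map(line -> line.split("\\|", 9)[7].strip())
                .collect(groupingByConcurrent(name -> name, counting()));
            System.out.println("distinct names: " + nameMap.size());
        }
    }
}
```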

I believe the changes I’ve shown improve the quality of the code as well as its performance. But it’s possible to optimize this program even further. I have some additional changes that get the runtime consistently down to about 26 seconds. These changes involve replacing some library calls with hand-written, special-purpose Java code. I don’t usually recommend making such changes, as they result in programs that are more complicated, less maintainable, and more error-prone. That’s why I’m not showing them. The last variation shows, I think, the “sweet spot” that represents the best tradeoff between code quality and performance. It is often possible, though, to make programs go faster at the expense of making them more complicated.

With this article, I hope that I’ve been able to illustrate several programming techniques and APIs that anybody can use to speed up and improve the quality of their code, and to help people improve their Java development skills.