The other day on Twitter I said, “Scanner is a weird beast. I wouldn’t necessarily use it as a good example for anything.” The context was a discussion about classes that are both an Iterator and are AutoCloseable. As it happens, Scanner is such an example. It’s an Iterator, because it allows iteration over a sequence of tokens, and it’s also AutoCloseable, because it might have an external resource (like a file) contained within it. I wouldn’t hold it up as an example of good object design, though. This article explains why.

Scanner has a pretty complicated API, but once you figure out how to use it, it’s incredibly useful. Its main issue is that it’s trying to do too many things at once. The good news is that you can use parts of the API for stylized uses and mostly ignore other parts of the API.

At its core, Scanner is about regex pattern matching. Unlike the Pattern and Matcher classes, which can only match on a fixed input such as a String, Scanner allows you to match over arbitrary input that might not even exist in memory. There are several Scanner constructors that allow input to be read from various sources such as files, InputStreams, or channels. Scanner handles buffering, and it reads additional input as necessary, and it discards any input that was skipped over during matching. This is really cool. It means you can do matching over arbitrarily sized input data using just a few KB of memory.

(Naturally this depends on the patterns used for matching as well as the well-formedness of input. For example, you can attempt to read a file line by line, and this will work for an arbitrarily sized file if it’s broken up into reasonably sized lines. If the file doesn’t have any line separators, Scanner will bring the whole file into memory, as the file conceptually contains one long line.)
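Here's a minimal sketch of the line-by-line case, under a few assumptions: an in-memory stream stands in for a file, and the class and method names (`LineCountSketch`, `countLines`) are made up for illustration. The point is only that each line is pulled through Scanner and then dropped, so the whole input never has to be held at once.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class LineCountSketch {
    // Counts lines by reading them one at a time. Only the current line
    // (plus Scanner's internal buffer) needs to be in memory.
    static long countLines(InputStream in) {
        long count = 0;
        try (Scanner sc = new Scanner(in, "UTF-8")) {
            while (sc.hasNextLine()) {
                sc.nextLine();   // the line is read and then discarded
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for a file; a real program might pass a FileInputStream instead.
        byte[] data = "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(countLines(new ByteArrayInputStream(data)));
    }
}
```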

Scanner has two fundamental modes of matching. The first mode is to break the input into tokens that are separated by delimiters. The delimiters are defined by the regex pattern you provide. (This is rather like the String.split method.) The second mode is to find chunks of text that result from matching the regex pattern you provide. In other words, the token mode provides the text between matches, and the find mode provides the text of the matches themselves. What’s odd about the Scanner API is that there are groups of methods that apply in one mode but not the other.

The methods that apply to the tokens mode are:

  • delimiter
  • locale
  • hasNext* (excluding hasNextLine)
  • next* (excluding nextLine)
  • radix
  • tokens
  • useDelimiter
  • useLocale
  • useRadix

The methods that apply to the find mode are:

  • findAll
  • findInLine
  • findWithinHorizon
  • hasNextLine
  • nextLine
  • skip

(Additional Scanner methods apply to both modes.)

Here’s an example of using Scanner for matching tokens:

    String story = """
        "When I use a word," Humpty Dumpty said,
        in rather a scornful tone, "it means just what I
        choose it to mean - neither more nor less."
        "The question is," said Alice, "whether you
        can make words mean so many different things."
        "The question is," said Humpty Dumpty,
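        "which is to be master - that's all."
        """;

    // Uses java.util.Scanner, java.util.List, and java.util.stream.Collectors.
    List<String> words = new Scanner(story)
        .useDelimiter("[- \\.\n\",]+")
        .tokens()
        .collect(Collectors.toList());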

(Note, this example uses the new Text Blocks feature, which was previewed in JDK 13 and 14 and which is scheduled to be final in JDK 15.)

Here, we set the delimiter pattern to match whitespace and various punctuation marks, so the tokens consist of text between the delimiters. The results are:

    [When, I, use, a, word, Humpty, Dumpty, said, in, rather, a, scornful,
    tone, it, means, just, what, I, choose, it, to, mean, neither, more,
    nor, less, The, question, is, said, Alice, whether, you, can, make,
    words, mean, so, many, different, things, The, question, is, said,
    Humpty, Dumpty, which, is, to, be, master, that's, all]

In this example I used the tokens() method to provide a stream of tokens. Scanner implements Iterator<String>, which allows you to iterate over the tokens that were found, using the typical hasNext/next methods. Unfortunately, Scanner does not implement Iterable, which would allow you to use it within a for-loop.
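One possible workaround, shown here as a sketch rather than a recommendation: since Iterable has a single abstract method, a lambda that returns the Scanner itself can serve as a one-shot Iterable. The class and method names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerForLoopSketch {
    static List<String> tokensViaForLoop(String input) {
        List<String> result = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            // Iterable.iterator() is the only abstract method, and Scanner is an
            // Iterator<String>, so this lambda satisfies Iterable<String>.
            // Caveat: it is single-use. A second loop would find no tokens left.
            Iterable<String> tokens = () -> sc;
            for (String token : tokens) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Default delimiter is whitespace.
        System.out.println(tokensViaForLoop("alpha beta  gamma"));
    }
}
```

This bends the usual expectation that an Iterable can be iterated more than once, so it's probably best kept local to a single loop.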

Scanner also provides pairs of hasNext/next methods for converting tokens to data. For example, it provides hasNextInt and nextInt methods that search for the next token and convert it to an int (if available). Corresponding pairs of methods are also available for BigInteger, boolean, byte, double, float, long, and short. These pairs of methods are “iterator-like” in that the hasNextX/nextX method pairs are just like the hasNext/next method pair of an Iterator, with the addition of data conversion. But there’s no way to wrap them in an Iterator, like Iterator<BigInteger> or Iterator<Double>, without writing your own adapter code. This is unfortunate, since Scanner is an Iterator<String> but its Iterator is only over tokens, not the value-added iterator-like constructs that include data conversions.
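The adapter code doesn't have to be long. Here's a rough sketch of one for int values; `IntIteratorSketch`, `ints`, and `readInts` are names invented for this example, not part of Scanner. Note that iteration stops at the first token that isn't an int, because hasNextInt returns false there.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Scanner;

class IntIteratorSketch {
    // Hypothetical adapter: exposes the hasNextInt/nextInt pair as an Iterator<Integer>.
    static Iterator<Integer> ints(Scanner sc) {
        return new Iterator<Integer>() {
            @Override public boolean hasNext() { return sc.hasNextInt(); }
            @Override public Integer next()    { return sc.nextInt(); }
        };
    }

    // Collects leading int tokens from the input, using the adapter above.
    static List<Integer> readInts(String input) {
        List<Integer> result = new ArrayList<>();
        try (Scanner sc = new Scanner(input)) {
            ints(sc).forEachRemaining(result::add);
        }
        return result;
    }

    public static void main(String[] args) {
        // "x" is not an int, so iteration ends before it and 40 is never reached.
        System.out.println(readInts("10 20 30 x 40"));
    }
}
```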

The other main mode of Scanner is the find mode, which provides a succession of matches from a pattern you provide. Here’s an example of that:

    List<String> words = new Scanner(story)

Here, instead of matching delimiters between tokens, I’ve provided a pattern that matches the results I want to get. Note that findAll() returns a Stream<MatchResult>, whose elements must be converted to strings; that’s what the MatchResult::group method does. The resulting list is the exact same list of words as the previous example. Personally, I find this mode more useful than the tokens mode. You’re providing the pattern for the text you’re interested in, as opposed to a pattern for the delimiters between the text you’re interested in. Also, you get back MatchResult objects, which are useful for extracting substrings of what you matched. This isn’t available in tokens mode.
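As a self-contained sketch of find mode: the pattern `[A-Za-z']+` below is an assumption chosen for illustration (runs of letters and apostrophes), and the class and method names are hypothetical. Only findAll() and MatchResult::group come from the discussion above.

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.stream.Collectors;

class FindAllSketch {
    static List<String> words(String text) {
        try (Scanner sc = new Scanner(text)) {
            return sc.findAll("[A-Za-z']+")        // assumed pattern: letters and apostrophes
                     .map(MatchResult::group)      // MatchResult -> matched text
                     .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) {
        System.out.println(words("\"The question is,\" said Alice - that's all."));
    }
}
```

Because each element is a MatchResult, a pattern with capturing groups could use `m -> m.group(1)` in place of `MatchResult::group` to pull out just part of each match.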

I started off this article saying that Scanner is weird but useful. It’s weird because it has these two distinct modes. It has groups of methods that apply to one mode but not the other. If you look at the API carefully (or at the implementation) you’ll also see that there is also a bunch of internal state that applies to one mode but not the other. It seems like Scanner should have been split into two classes. Another weird thing about Scanner is that it’s an Iterator<String>, which elevates one part of one of the modes to the top level of the API and relegates the other parts to second-class status.

That said, Scanner provides some very useful services. It does I/O and buffering for you, and if regex matching needs more input, it handles that automatically. I’m also partial to the streams-returning methods like findAll() and tokens() — I have to admit, I added them — but they make bulk processing of arbitrary input quite easy. I hope you find these aspects of Scanner useful as well.

Oracle Code One 2019

Here’s a quick summary of Oracle Code One 2019, which was last week.

It essentially started the previous week at the “Chinascaria”, Steve Chin‘s Community BBQ for JUG leaders and friends. Although Steve is now at JFrog, he’s continuing the BBQ tradition. Of course Bruno Souza, Edson Yanaga, and some other cohorts from Brazil were manning the BBQ, and there was plenty of meat to be had. I didn’t get many photos, but Ruslan from JUG.RU was there and he insisted that we take a selfie:

Hi Ruslan! Oh, here’s a tweet with the chefs from the BBQ:

Java Keynote

The conference kicked off with the Java keynote, The Future of Java is Now, led by Georges Saab. The pace was pretty brisk, with several walk-on guests. We heard from Jessica Pointing on quantum computing, and from Aimee Lucido on her new book, Emmy in the Key of Code. This sounds really cool, a book written in Java-code-like verse. This should be interesting to my ten-year-old daughter, since she’s reading the Girls Who Code series right now. I have to say this is the first time I’ve shown a segment of a conference keynote to my family!

Naturally a good section of the keynote covered technical issues. Mikael Vidstedt and Brian Goetz ably covered the evolution of the JVM and the Java programming language. Notably, Mark Reinhold did not appear; he’s taking a break from conferences to refocus on hard technical problems.

My Sessions

This year, I had two technical sessions and a lab. This was a pretty good workload, compared with previous years where I had half a dozen sessions. I felt like I made a good contribution to the audience, but it left time for me to have conversations with colleagues (the “hallway track”) and to attend other sessions I was interested in.

My sessions were:

Collections Corner Cases (slides, video)

This session covered Map’s view collections (keySet, values, entrySet) and topics regarding comparators being “inconsistent with equals.”

Local Variable Type Inference: Friend or Foe? (slides, video)

(with Simon Ritter)

When Simon and I did an earlier version of this talk at another conference, we called it “Threat or Menace.” This probably doesn’t translate too well; to me, it has a 1950s red scare connotation, which is distinctly American. I think that’s why Simon changed it to Friend or Foe. It turns out that Venkat Subramaniam also had a talk on the same subject, entitled “Type Inference: Friend or Foe”!

Lambda, Streams, and Collectors Programming Laboratory (lab repository)

(with Maurice Naftalin and José Paumard)

This lab continues to evolve; there are now over 100 exercises. Thanks to Maurice and José for continuing to maintain and develop the lab materials. I recalled that we first did a Lambda Lab at Devoxx UK in 2013, which was before Java 8 was released. Maurice and Richard Warburton and I got together an hour beforehand and came up with about half a dozen exercises. It was a bit ad hoc, but we managed to keep a dozen or so people busy for an hour and a half.

More recently we (mostly José) have added and reorganized the exercises, converted the project to maven, and converted the test assertions to AssertJ. I’ve finally come around to the idea that maven is the way to go. However, the lab attendees still had their fair share of configuration problems. I think the main problem is the mismatch between maven and the IDE. It’s possible to build the project on the command line using maven, but hitting the “Test” button in the IDE does some magic that doesn’t necessarily invoke maven, so it might or might not work.

Meet the Experts

One thing that was new this year was the “Meet the Experts” sessions. In the past we’d be asked to sign up for “booth duty” which consisted of standing around for a couple hours waiting for people to ask questions. This was mostly a waste of time, since we didn’t have flashy demos. Instead, we scheduled informal, half-hour time slots at a station in the Groundbreakers Hub, and these were put onto the conference program. The result was that people showed up! I signed up for two of these. I didn’t have a formal presentation; I just answered people’s questions. This seemed considerably more useful than past “booth duty.” People had good questions, and I had some good conversations.

Everything You Ever Wanted To Know About Java And Didn’t Know Whom To Ask (video)

I hadn’t signed up for this session, but the day before the session, Bruno Souza corralled me (and several others) into participating in this. Essentially it’s an impromptu “ask me anything” panel. He convinced about 15 people to be on the panel. This included various JUG leaders, conference speakers, and experts in various areas. During the first part of the session, Bruno gathered questions from the audience and a colleague typed them into a document that was projected on the screen. Then he called the panelists up on stage. The rest of the session was the panel picking questions and answering them. I thought this turned out quite well. People got their questions answered, we covered quite a variety of topics, and it provoked some interesting discussions.

Other Sessions of Interest

I attended a few other sessions that were quite useful. I also watched on video some of the sessions that I had missed. Here they are, in no particular order:

Robert Seacord, Serialization Vulnerabilities (video)

Mike Duigou, Exceptions 2020 (slide download available)

Sergey Kuksenko, Does Java Need Value Types? Performance Perspective (video)

Brian Goetz, Java Language Futures, 2019 Edition (video)

Venkat Subramaniam, Type Inference: Friend or Foe? (video)

Robert Scholte, Broken Build Tools and Bad Behaviors (slide download available)

Nikhil Nanivadekar, Do It Yourself: Collections

Here’s the playlist of Code One sessions that were recorded.

Unfortunately, not all of the sessions were recorded. Some of the speakers’ slide decks are available for download via the conference catalog.


It was recently announced that Jakarta EE will not be allowed to evolve APIs in the javax.* namespace. (See Mike Milinkovich’s announcement and his followup Twitter thread.) Shortly thereafter, David Blevins posted a proposal and call for discussion about how Jakarta EE should transition its APIs into the new jakarta.* namespace. There seem to be two general approaches to the transition: a “big bang” (do it all at once) approach and an incremental approach. I don’t have much to add to the discussion about how this transition should take place, except to say that I’m pleasantly surprised at the amount of energy and focus that has emerged in the Jakarta EE community around this effort.

I’m a Java SE guy, so the details of Java EE and Jakarta EE specifications are pretty much outside my bailiwick. However, as Dr Deprecator, I should point out that there is one area of overlap: the dependence of Java EE / Jakarta EE APIs on deprecated Java SE APIs. One example in particular that I’m aware of was brought to my attention by my colleague Sean Mullan, who is tech lead of the Java SE Security Libraries group.

The Java SE API in question is java.security.Identity, which was deprecated in JDK 1.2 (released 1998) and deprecated for removal in Java 9. This API has been deprecated for a very long time, and we’d like to remove it from Java SE. For most purposes, it can be replaced by java.security.Principal, which was added in JDK 1.1 (released 1997).

The EJB specification uses the Identity type in a couple methods of the EJBContext class. If we were to remove Identity from some release of Java SE, it would mean that EJB — and any Java EE, Jakarta EE, or any other framework that includes EJB — would no longer be compatible with that release of Java SE. We’ve thus held off removing this type for the time being, in order to avoid pulling the rug out from underneath the EE specs.

Identity is used only in two methods of the EJBContext class. It appears that these methods were deprecated in EJB 1.2, and replacements that use Principal were introduced at that time. Since J2EE 1.2 was introduced in 1999, things have been this way for about 20 years. I think it’s time to do some cleanup! (See EJB-spec issue #130.)

For better or for worse, these methods still appear in Java EE 8. As I understand things, the next specification release will be Jakarta EE 9, which will be the earliest opportunity to change the EE specification to remove the dependency on the deprecated SE APIs.

The usual argument against removing stuff is that it’s both source and binary incompatible. If something falls over because of a missing API, it’s pretty hard to work around. This is the reason that deprecated stuff has stayed around for so many years. On the other hand, if these deprecated APIs aren’t removed now, when will they be removed?

I’d argue that the upcoming package renaming (whether incremental or big bang) is an opportunity to remove obsolete APIs, because such renaming is inherently both source and binary incompatible. People will have to run migration tools and change their code when they transition it from Java EE 8 to Jakarta EE 9. There can be no expectation that old jar files will run unchanged in the new Jakarta world. Thus, the package renaming is an opportunity to shed these obsolete APIs.

I’m not aware of any EE APIs other than EJBContext that depend on Java SE APIs that are deprecated for removal. I did a quick check of GlassFish 5 using the jdeprscan tool, and this one was the only API-to-API dependency that I found. However, I’m not an expert in EE and GlassFish, so I’m not sure I checked the right set of jars. (I did find a bunch of other stuff, though. Contact me if you’re interested in details.)

I had a brief Twitter exchange with David Blevins on this topic the other day. He pointed me at the parts of the TomEE implementation that implement EJBContext, and it turns out that the two methods in question simply throw UnsupportedOperationException. This is good news, in that it means TomEE applications aren’t using these methods, which means that those applications won’t break if these methods are removed.

However, that doesn’t mean these methods can simply be removed from EE implementations! The TCKs have what is called a “signature test,” which scans the libraries for the public classes, fields, and methods, to make sure that all the APIs required by the specifications are present and that there are no extra APIs. I’m fairly sure that the EE TCK signature test contains entries for those methods. Thus, what needs to happen is that the Jakarta EE specification needs to remove these methods, the EE TCK needs to be updated to match, and then implementations can remove — in fact, will be required to remove — these methods when they’re brought into conformance with the new specification.
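To make the idea of a signature test concrete, here's a deliberately simplified sketch using reflection. Everything in it is hypothetical: the `SampleContext` interface, its method names, and the check itself, which compares only public method names. A real TCK signature test covers classes, fields, and full method signatures, and this is not how the EE TCK is implemented.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

class SignatureCheckSketch {
    // Hypothetical stand-in for a spec-defined API type.
    interface SampleContext {
        String getCallerName();
        boolean isCallerInRole(String role);
    }

    // Collects the names of the public, non-synthetic methods declared by a type.
    static Set<String> publicMethodNames(Class<?> type) {
        Set<String> names = new TreeSet<>();
        for (Method m : type.getDeclaredMethods()) {
            if (Modifier.isPublic(m.getModifiers()) && !m.isSynthetic()) {
                names.add(m.getName());
            }
        }
        return names;
    }

    // True only if the type declares exactly the required methods:
    // nothing missing and nothing extra.
    static boolean matches(Class<?> type, Set<String> required) {
        return publicMethodNames(type).equals(required);
    }

    public static void main(String[] args) {
        System.out.println(matches(SampleContext.class,
                Set.of("getCallerName", "isCallerInRole")));
        // Dropping a method from the required set makes the existing one "extra".
        System.out.println(matches(SampleContext.class, Set.of("getCallerName")));
    }
}
```

The "no extras" half of the check is why an implementation can't keep a method around after the specification drops it, and can't drop it before.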

Note that all of this is separate from the question of what to do with other deprecated Jakarta EE APIs that don’t depend on deprecated Java SE APIs. Deprecated Jakarta EE APIs might have been deprecated for their own reasons, not because of their dependency on SE APIs. These should be considered on their own merits and an appropriate removal plan developed. Naturally, as Dr Deprecator, I like removing old, obsolete APIs. But the deprecation and potential removal plan for deprecated Jakarta EE APIs needs to be developed with the particular evolution path of those APIs in mind.

This is a very belated post that covers a session that took place at the JavaOne conference in San Francisco, October 2017.

Here’s a recap of the BOF (“birds-of-a-feather”) session I led on software maintenance. The title was Maintenance – The Silent Killer. This was my feeble attempt at clickbait. This was an evening session that was held during the dinner hour, and maintenance isn’t the most scintillating topic, so I figured attendance needed all the help I could give it.

When the start time arrived, I was standing on the podium in an empty room. I thought, well, if nobody shows up then I can go home early. Then about fifty people flooded in! It turns out they had lined up outside waiting for their badges to be scanned, but then a conference staffer came by and told them that badges weren’t scanned for the evening sessions and that they should just go in.

Overall I thought it went quite well. I gave a brief presentation, and then set up some discussion questions for the audience. The people who showed up really were interested in maintenance, they offered a variety of interesting insights and views, and they were quite serious about the topic. There was enough discussion to fill the allotted time, and there was plenty of interaction between me and the audience and among audience members themselves. I’ll declare the session to have been successful, though it’s difficult for me to draw any grand conclusions from it. I was heartened by the amount of participation. I was really concerned that nobody would show up, or perhaps that three people would show up, since most tech conferences are about the latest and greatest new shiny thing.

The session wasn’t recorded. What follows is some notes on my slide presentation, followed by some additional notes from the discussion that followed. These are unfortunately rather sparse, as I was participating at the same time. However, I did capture a few ideas that I hadn’t considered previously, which I found quite beneficial.

Slide Presentation (PDF)

Slide 2: Golden Gate Bridge. I grew up in Marin County, which is connected to San Francisco by the Golden Gate Bridge. We crossed the bridge frequently. Back in 1974 or so the toll was raised from 50¢ to 75¢, and my parents complained incessantly about this. At one point I had the following conversation with my Dad about the toll:

Me: Why do they collect tolls?
Dad: To pay off the bridge.
Me: When will the bridge be paid off?
Dad: Never!

As a kid I was kind of perplexed by this. If you take out a loan, and make regular payments on it, won’t it eventually be paid off? (Sub-prime mortgages weren’t invented until much later.) Of course, the original construction loans have long since been paid off. What the tolls are used for, and which indeed will never be paid off, is the continuous maintenance that the bridge requires.

Slide 3: This is me driving my car through Tunnel Log in Sequoia National Park. The point isn’t about a tunnel through a tree, but the cost of owning and operating a car. The first time I used my car for business expenses, I was surprised by the per-mile reimbursement amount. If you consider the 2017 numbers, this car’s gasoline costs about 14¢-20¢ per mile, and the IRS standard reimbursement rate is 53.5¢ per mile. Hey, I’m making money on this deal!

No. This is a 1998 BMW, and you will not be surprised to learn that the cost of maintenance on this car is quite significant. Indeed, I’ve added up the maintenance costs over the lifetime of the car, and they outweigh the cost of gasoline. Counting maintenance and depreciation, I’m decidedly not making money on mileage reimbursement.

Slide 4 has some points on maintenance as a general phenomenon. One point that bears further explanation is my claim that “deferred maintenance costs can grow superlinearly.” Continuing with the car example, consider oil changes. It might cost a couple hundred dollars a year for regular oil changes. You could save money for a couple years by not changing the oil. This might eventually result in a several thousand dollar engine rebuild. “Superlinear” isn’t very precise, but the point is that the cost of remediating problems caused by deferred maintenance is often much greater than the sum of incremental maintenance costs.

Slide 5, quotation from Kurt Vonnegut. Perhaps profound if you’ve never heard it before, but a cliché if you pay attention to maintenance. It does seem to be true that in general creative activities get all the attention at the expense of maintenance activities.

Slides 6-7. Physical systems exhibit wear and friction and this contributes to the need to do regular maintenance. Software doesn’t “wear” out. But there are a bunch of phenomena that cause software systems to require maintenance. Primarily these seem to be related to the environment in which the software exists, not the software itself.

Slides 8-9. Most planning and costing efforts around software projects are concerned with software construction. Maintenance is a significant cost, accounting for perhaps 50% to 75% (Boehm) or 40% to 80% (Glass) of the total life cycle costs. However, comparatively little planning and budgeting effort goes toward maintenance.

Glass points out that software maintenance and construction are essentially the same activity, except that maintenance requires additional effort to “understand the existing product.” As a programmer, when you’re developing software, you know what you’re trying to do and you’re familiar with the code you’re developing at the moment. When maintaining software, you often have to deal with code that you might never have seen before and figure out what it does before you can modify it successfully. The cost incurred in re-acquiring knowledge and understanding of an existing software system is significant.

Slide 10. OpenJDK is an open source implementation of Java. It’s an old code base; Java 1.0 was released in 1996, and it was in development for a couple years prior to that. It’s been continually evolved and maintained since then. Evolution consists of usual software activities such as adding features, improving performance, fixing bugs, mitigating security vulnerabilities, and maintaining old releases. Maintenance activities are a large portion of the team’s activities. I’m not sure how to measure it, but the estimates from Boehm and Glass above are quite plausible.

In addition to the above development activities the team also puts effort into deprecation and removal of obsolete features. This is important because, among other things, it helps to reduce the long-term maintenance burden. See some of my prior materials on the topic of deprecation:

The cost of knowledge re-acquisition mentioned previously is somewhat mitigated by systems maintained by the JDK group that preserve history.

The open version of the JDK source code is in the Mercurial version control system, and it includes changesets dating back to December 2007. The earlier source code history is in a closed, Oracle-internal system and dates back to August 1994.

The JDK Bug System (a JIRA instance) contains over 265,000 bugs and feature requests dating back to late 1994. Many of these bugs were converted from a Sun Microsystems internal bug database.

Personally, I’ve found the ability to search over 20 years of source code history and bug history to be of immense value in understanding existing code and diagnosing problems.

Slide 11. A big driver of software maintenance is security vulnerabilities. This has gotten worse in recent years, as “everything” is connected to the internet. Another significant contributor to maintenance issues is the large number of dependencies among software components, many of which are in open source. By reusing external software components, you can reduce development time. However, doing so takes on the maintenance burden of those components. Either you have to keep all the external components up to date, or you have to maintain them yourself.

Slide 12. Questions and Audience Discussion

The slide has several questions to spark discussion with the audience. We didn’t address them directly, but there was a relatively free-flowing conversation. Here are some notes from that conversation.

One audience member compared maintenance to a fence. Suppose you have a pasture, and wolves keep coming to it and attacking your sheep. So you put up a fence. The fence just sits there. The sheep graze peacefully. Wolves stay away because they realize they can’t get past the fence. Nothing happens. The fact that nothing is happening is a huge benefit! Like a fence, a well-maintained system just does its thing without calling attention to itself. This may lead people to forget about it. A poorly-maintained system is constantly breaking, attracting lots of attention.

An attendee suggested thinking about maintenance planning the same way a project manager thinks about risk management. With less maintenance there is a greater risk of failure, and vice-versa.

Another attendee suggested insurance as a model for maintenance. Maintenance costs are like insurance premiums: you pay them regularly, and you’re protected. Not paying them saves money temporarily, until some disaster strikes. (Rather like my car oil change example above.) Of course, insurance is closely related to risk management, and as a social institution it seems poorly understood by most lay individuals.

An audience member suggested just biting the bullet and declaring that maintenance is just a cost of doing business. There’s no use complaining about it; you just have to accept it. Another audience member said that his department allocated 10% of its budget to maintenance costs.

Regarding keeping up with the software updates, one attendee pointed out that it’s not necessarily important to be on the latest software release, but instead it’s important to be on the latest patch or update level even if you’re on an old release. Many commercial software products have support contracts where they will maintain old releases for many years. They don’t have the most features or the highest performance, but they are maintained with fixes for current security vulnerabilities and other high priority problems.

(This is a big component of the business of my company, Oracle. This is also true of products from many other software companies.)

Last week, Paige Niedringhaus posted an article Using Java to Read Really, Really Large Files. While one can quibble about whether the file to be processed is indeed “really, really large,” it’s large enough to expose some interesting concerns and to present some interesting opportunities for optimizations and use of newer APIs. There was some discussion on Reddit /r/java and /r/programming and a PR with an alternative implementation. Earlier today, Morgen Peschke posted an analysis and comparison to a Scala version of the program. (This is posted as a comment at the bottom of the original article.) This article is my contribution to the discussion.

When I ran Niedringhaus’ program on my machine using JDK 8, I ran into the same memory issues as did Peschke; the program consumed so much memory that it spent all its time in garbage collection. Increasing the heap size worked around this problem. Interestingly, using JDK 11, I was able to run Niedringhaus’ version successfully without increasing the heap size. (I suspect the reason is that JDK 11 uses G1GC as the default collector, and its different collection scheme avoids the pathological behavior of the Parallel GC, which is the default collector in JDK 8.)

The approach I’ll take is to retain the large lists accumulated by the original program. My presumption is that the lists are loaded into memory in order to do further analysis that isn’t part of the original program. Instead of reducing memory consumption, I focus on changing aspects of the computation to improve runtime performance. After establishing the program’s baseline performance, I proceed to show several variations on the code that successively improve its performance, along with some discussion describing the reasons for the improvement. I present a diff for each variation. Each variation, along with my final version, is also available in a gist.

I downloaded indiv18.zip on 2019-01-04 and extracted the itcont.txt file. It’s about 3.3GB in size and has 18,245,416 lines. I started with the 2019-01-05 version of Niedringhaus’ test program:


For comparing execution times, I’m using the last time reported by the program, the “Most common name time.” The benchmark times all showed a fair amount of variability. In most cases I reran the program a few times and chose the fastest time. This isn’t very rigorous, but it should at least give an idea of the relative speeds of execution. I’ve rounded to whole seconds, because the high variability makes milliseconds superfluous.

Niedringhaus’ article reported a runtime of about 150 seconds for this version of the program. I ran the program on my laptop (MacBook Pro, mid 2014, 3GHz Intel Core i7, 2 cores, 16GB) and the execution time was about 108 seconds. I’ll use that figure as the baseline against which subsequent optimizations are compared.

Variation 1

--- ReadFileJavaApplicationBufferedReader0.java
+++ ReadFileJavaApplicationBufferedReader1.java
@@ -57,8 +57,8 @@
                // System.out.println(readLine);
                // get all the names
-               String array1[] = readLine.split("\\s*\\|\\s*");
-               String name = array1[7];
+               String array1[] = readLine.split("\\|", 9);
+               String name = array1[7].strip();
                        System.out.println("Name: " + names.get(lines - 1) + " at index: " + (lines - 1));
@@ -80,7 +80,7 @@
-               String rawDate = array1[4];
+               String rawDate = array1[4].strip();
                String month = rawDate.substring(4,6);
                String year = rawDate.substring(0,4);
                String formattedDate = month + "-" + year;

Applying this patch reduced the execution time from 108 seconds to about 44 seconds.

This change is actually two optimizations. String splitting is quite expensive, and it’s done once for each of the 18 million lines in the file. It’s thus quite beneficial to remove work from the program’s main loop. The String.split() call uses a regex that splits the line into fields, where the separator is a vertical bar including any adjacent whitespace. The regex pattern is compiled each time through the loop. It would save some time to compile the regex once before the loop and to reuse it. But it turns out that using a regex here is unnecessary. We can instead split on a vertical bar alone. The split() method has a fast path for single-character split patterns which avoids regexes entirely. (Since the vertical bar is a regex metacharacter, it still counts as a single character even with the backslash escapes.) Thus we don’t need to worry about pre-compiling the split pattern.

Changing the split pattern can leave unwanted whitespace in some of the fields we’re interested in. To deal with this, we call the String.strip() method to remove it from those fields. The strip() method is new in Java 11. It removes whitespace from both ends of a string, where whitespace is defined using Unicode semantics. This differs from the older String.trim() method, which uses an anachronistic definition of whitespace based on ASCII control characters.
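To make the difference between the two methods concrete, here’s a minimal sketch. The class name and the sample string are made up for illustration; the string is padded with U+2003 (EM SPACE), which counts as whitespace under Unicode rules but has a code point above U+0020, so trim() leaves it alone.

```java
// Sketch only: StripVsTrim is a hypothetical class name used for illustration.
class StripVsTrim {
    // U+2003 EM SPACE is Unicode whitespace, but its code point is above U+0020.
    static final String PADDED = "\u2003SMITH, JOHN\u2003";

    // trim() only removes leading/trailing characters with code points <= U+0020.
    static String viaTrim() {
        return PADDED.trim();
    }

    // strip() removes leading/trailing characters for which
    // Character.isWhitespace() returns true.
    static String viaStrip() {
        return PADDED.strip();
    }

    public static void main(String[] args) {
        System.out.println(viaTrim().length());   // 13: the em spaces are still there
        System.out.println(viaStrip().length());  // 11: the em spaces are gone
    }
}
```

For plain ASCII spaces and tabs the two methods behave the same, so the choice only matters when the data might contain other kinds of whitespace.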

The second optimization applies a limit to the number of splits performed. Each line of the file has 21 fields. Without the limit parameter, the split() method will split the entire line into 21 fields and create string objects for them. However, the program is only interested in data from the 5th and 8th fields (array indexes 4 and 7). It’s a lot of extra work to split the remaining fields and then just to throw them away. Supplying a limit argument of 9 will stop splitting after the eighth field, leaving the remainder of the line unsplit in the last array element (at index 8). This reduces the amount of splitting work considerably.
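Here’s a small sketch of what the limit argument does. The record below is made up and has only 12 fields instead of 21, and the class and method names are mine, but the split call is the same one used in the patch.

```java
// Sketch only: the class name and the sample record are made up for illustration.
class SplitLimitDemo {
    // A made-up record with 12 pipe-separated fields; the real file has 21.
    static final String LINE =
        "C001|N|Q1|P|20180315|15|IND| SMITH, JOHN A |CITY|ST|ZIP|EMPLOYER";

    // Split on a literal vertical bar, stopping after the eighth field.
    static String[] splitLine(String line) {
        return line.split("\\|", 9);
    }

    public static void main(String[] args) {
        String[] fields = splitLine(LINE);
        System.out.println(fields.length);       // 9
        System.out.println(fields[4].strip());   // 20180315
        System.out.println(fields[7].strip());   // SMITH, JOHN A
        System.out.println(fields[8]);           // CITY|ST|ZIP|EMPLOYER
    }
}
```

The last array element holds the unsplit remainder of the line, vertical bars and all, which is fine because the program never looks at it.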

Variation 2

--- ReadFileJavaApplicationBufferedReader1.java
+++ ReadFileJavaApplicationBufferedReader2.java
@@ -29,17 +29,12 @@
        // get total line count
        Instant lineCountStart = Instant.now();
-       int lines = 0;
        Instant namesStart = Instant.now();
        ArrayList<String> names = new ArrayList<>();
        // get the 432nd and 43243 names
-       ArrayList<Integer> indexes = new ArrayList<>();
-       indexes.add(1);
-       indexes.add(433);
-       indexes.add(43244);
+       int[] indexes = { 0, 432, 43243 };
        // count the number of donations by month
        Instant donationsStart = Instant.now();
@@ -53,16 +48,12 @@
          System.out.println("Reading file using " + Caller.getName());
        while ((readLine = b.readLine()) != null) {
-               lines++;
                // System.out.println(readLine);
                // get all the names
                String array1[] = readLine.split("\\|", 9);
                String name = array1[7].strip();
-               if(indexes.contains(lines)){
-                       System.out.println("Name: " + names.get(lines - 1) + " at index: " + (lines - 1));
-               }
                if(name.contains(", ")) {
@@ -88,11 +79,15 @@
+         for (int i : indexes) {
+             System.out.println("Name: " + names.get(i) + " at index: " + (i));
+         }
        Instant namesEnd = Instant.now();
        long timeElapsedNames = Duration.between(namesStart, namesEnd).toMillis();
        System.out.println("Name time: " + timeElapsedNames + "ms");
-       System.out.println("Total file line count: " + lines);
+       System.out.println("Total file line count: " + names.size());
        Instant lineCountEnd = Instant.now();
        long timeElapsedLineCount = Duration.between(lineCountStart, lineCountEnd).toMillis();
        System.out.println("Line count time: " + timeElapsedLineCount + "ms");

This patch reduces the execution time from 44 seconds to about 40 seconds.

This is perhaps a bit of a cheat, but it’s another example of removing work from the inner loop. The original code maintained a list of indexes (line numbers) for which names are to be printed out. During the loop, a counter would keep track of the current line, and the current line would be queried against the list of indexes to determine if the name is to be printed out. The list is short, with only 3 items, so searching it is pretty quick. There are 18,245,416 lines in the file and only 3 indexes in the list, so searching the list for the current line number will fail 18,245,413 times. Since we’re storing all the names in a list, we can just print out the names we’re interested in after we’ve loaded them all. This avoids having to check the list within the inner loop.

The patch also stores the indexes in an array since the syntax for initializing an array is a bit more concise. It also avoids boxing overhead. Boxing of three elements isn’t a significant overhead, so it’s unlikely this makes any measurable difference in the performance. In general, I prefer to avoid boxing unless it’s necessary.

Variation 3

--- ReadFileJavaApplicationBufferedReader2.java
+++ ReadFileJavaApplicationBufferedReader3.java
@@ -44,6 +45,7 @@
        Instant commonNameStart = Instant.now();
        ArrayList<String> firstNames = new ArrayList<>();
+       var namePat = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
        System.out.println("Reading file using " + Caller.getName());
@@ -55,20 +57,13 @@
                String name = array1[7].strip();
-               if(name.contains(", ")) {
-                       String array2[] = (name.split(", "));
-                       String firstHalfOfName = array2[1].trim();
-                       if (!firstHalfOfName.isEmpty()) {
-                               if (firstHalfOfName.contains(" ")) {
-                                       String array3[] = firstHalfOfName.split(" ");
-                                       String firstName = array3[0].trim();
-                                       firstNames.add(firstName);
-                               } else {
-                                       firstNames.add(firstHalfOfName);
-                               }
+               var matcher = namePat.matcher(name);
+               if (matcher.find()) {
+                   String s = matcher.group(2);
+                   if (s == null) {
+                       s = matcher.group(3);
+                   }
+                   firstNames.add(s);
                String rawDate = array1[4].strip();

This patch reduces the execution time from 40 to about 38 seconds.

Whereas in variation 1 we saw that reducing a regex to a single character split pattern helped provide a large speedup, in this case we’re replacing some fairly involved string splitting logic with a regex. Note that this code compiles the regex outside the loop and uses it repeatedly within the loop. In this patch I’m attempting to provide similar semantics to the splitting logic, but I’m sure there are cases where it doesn’t produce the same result. (For the input data in this file, the regex produces the same result as the splitting logic.) Unfortunately the complexity is moved out of the logic and into the regex. I’m not going to explain the regex in great detail, since it’s actually fairly ad hoc itself. One problem is that extracting a “first name” from a name field relies on European name conventions, and those conventions don’t apply to all names in this file. A second problem is that the data itself isn’t well-formed. For example, one name in the file is “FOWLER II, COL. RICHARD”. Both the splitting logic and the regex extract the first name as “COL.” which is clearly a title, not a name. It’s unclear what can be done in this case. Nevertheless, the vast majority of records in the file are well-formed, and applying European name conventions works for them. For a name record such as “SMITH, JOHN A” both the splitting logic and the regex extract “JOHN” as the first name, which is the intended behavior.
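As a quick check of the two names mentioned above, here’s a sketch that applies the same regex outside the program. The class and method names are hypothetical; the pattern and the group handling are taken from the patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: FirstNameRegexDemo and firstName() are hypothetical names.
class FirstNameRegexDemo {
    // Compiled once and reused, as in the patch.
    static final Pattern NAME_PAT = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");

    // Returns the extracted "first name", or null if the pattern doesn't match.
    static String firstName(String name) {
        Matcher matcher = NAME_PAT.matcher(name);
        if (!matcher.find()) {
            return null;
        }
        // Group 2 is set when the token is followed by ", ", otherwise group 3 is set.
        String s = matcher.group(2);
        return (s != null) ? s : matcher.group(3);
    }

    public static void main(String[] args) {
        System.out.println(firstName("SMITH, JOHN A"));            // JOHN
        System.out.println(firstName("FOWLER II, COL. RICHARD"));  // COL.
        System.out.println(firstName("SMITH"));                    // null
    }
}
```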

Variation 4

--- ReadFileJavaApplicationBufferedReader3.java
+++ ReadFileJavaApplicationBufferedReader4.java
@@ -45,7 +45,7 @@
        Instant commonNameStart = Instant.now();
        ArrayList<String> firstNames = new ArrayList<>();
-       var namePat = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
+       var namePat = Pattern.compile(", \\s*([^, ]+)");
          System.out.println("Reading file using " + Caller.getName());
@@ -59,11 +59,7 @@
                  var matcher = namePat.matcher(name);
                  if (matcher.find()) {
-                     String s = matcher.group(2);
-                     if (s == null) {
-                         s = matcher.group(3);
-                     }
-                     firstNames.add(s);
+                     firstNames.add(matcher.group(1));
                String rawDate = array1[4].strip();

This patch reduces the runtime from 38 seconds to about 35 seconds.

For reasons discussed previously, it’s difficult in general to extract the correct “first name” from a name field. Since most of the data in this file is well-formed, I took the liberty of making some simplifying assumptions. Instead of trying to replicate the original splitting logic, here I’m using a simplified regex that extracts the first non-comma, non-space sequence of characters that follows a comma-space separator. In most cases this will extract the same first name from the name field, but there are some edge cases where it returns a different result. Assuming this is acceptable, it allows a simplification of the regex and also of the logic to extract the desired substring from the match. The result is another small speedup.
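To illustrate the kind of edge case I mean, here’s a sketch that runs both regexes side by side. The class and method names are hypothetical, and the input “SMITH, JOHN,JR” is a made-up example of a malformed name, not one I’m quoting from the file.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: class and method names are hypothetical, inputs are made up.
class SimplifiedNameRegexDemo {
    static final Pattern VARIATION_3 = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
    static final Pattern VARIATION_4 = Pattern.compile(", \\s*([^, ]+)");

    // Extraction logic from variation 3: try group 2, fall back to group 3.
    static String viaVariation3(String name) {
        Matcher m = VARIATION_3.matcher(name);
        if (!m.find()) {
            return null;
        }
        String s = m.group(2);
        return (s != null) ? s : m.group(3);
    }

    // Extraction logic from variation 4: a single capturing group.
    static String viaVariation4(String name) {
        Matcher m = VARIATION_4.matcher(name);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Well-formed name: both agree.
        System.out.println(viaVariation3("SMITH, JOHN A"));   // JOHN
        System.out.println(viaVariation4("SMITH, JOHN A"));   // JOHN
        // Comma with no following space: the results differ.
        System.out.println(viaVariation3("SMITH, JOHN,JR"));  // JOHN,JR
        System.out.println(viaVariation4("SMITH, JOHN,JR"));  // JOHN
    }
}
```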

Variation 5

--- ReadFileJavaApplicationBufferedReader4.java
+++ ReadFileJavaApplicationBufferedReader5.java
@@ -46,6 +46,8 @@
        ArrayList<String> firstNames = new ArrayList<>();
        var namePat = Pattern.compile(", \\s*([^, ]+)");
+       char[] chars = new char[6];
+       StringBuilder sb = new StringBuilder(7);
        System.out.println("Reading file using " + Caller.getName());
@@ -63,11 +65,12 @@
                String rawDate = array1[4].strip();
-               String month = rawDate.substring(4,6);
-               String year = rawDate.substring(0,4);
-               String formattedDate = month + "-" + year;
-               dates.add(formattedDate);
+               rawDate.getChars(0, 6, chars, 0);
+               sb.setLength(0);
+               sb.append(chars, 0, 4)
+                 .append('-')
+                 .append(chars, 4, 2);
+               dates.add(sb.toString());
          for (int i : indexes) {

This patch reduces the runtime from 35 seconds to about 33 seconds.

This change is primarily to reduce the amount of memory allocation within the inner loop. The previous code extracts two substrings from the raw date, creating two objects. It then appends the strings with a “-” separator, which requires creation of a temporary StringBuilder object. (This is likely still true even with JEP 280 – Indify String Concatenation in place.) Finally, the StringBuilder is converted to a String, allocating a fourth object. This last object is stored in a collection, but the first three objects are garbage.

To reduce object allocation, the patch code creates a char array and a StringBuilder outside the loop and reuses them. The character data is extracted into the char array, pieces of which are appended to the StringBuilder along with the “-” separator. The StringBuilder’s contents are then converted to a String, which is then stored into the collection. This String object is the only allocation that occurs in this step, so the patch code avoids creating any garbage.

I’m of two minds about this optimization. It does provide a few percentage points of optimization. On the other hand, it’s decidedly non-idiomatic Java: it’s rare to reuse objects this way. However, this code doesn’t introduce much additional complexity, and it does provide a measurable speedup, so I decided to keep it in. It does illustrate some techniques for dealing with character data that can reduce memory allocation, which can become expensive if done within an inner loop.
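Here’s the reuse technique pulled out into a standalone sketch. The class and method names are hypothetical, and the raw date values are made up. The append order follows the patch as written, so the first four characters come out ahead of the separator.

```java
// Sketch only: DateFormatDemo and format() are hypothetical names.
class DateFormatDemo {
    // Allocated once and reused for every call, instead of once per line.
    private final char[] chars = new char[6];
    private final StringBuilder sb = new StringBuilder(7);

    // Not thread-safe: the scratch buffers are shared across calls.
    String format(String rawDate) {
        rawDate.getChars(0, 6, chars, 0);   // copy the first six characters
        sb.setLength(0);                    // reset the builder without reallocating
        sb.append(chars, 0, 4)
          .append('-')
          .append(chars, 4, 2);
        return sb.toString();               // the only allocation in this step
    }

    public static void main(String[] args) {
        DateFormatDemo demo = new DateFormatDemo();
        System.out.println(demo.format("20180315"));  // 2018-03
        System.out.println(demo.format("20171231"));  // 2017-12
    }
}
```

The price of the reuse is that the method now depends on mutable state that lives outside it, which is part of why this style is unusual in Java.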

Variation 6

--- ReadFileJavaApplicationBufferedReader5.java
+++ ReadFileJavaApplicationBufferedReader6.java
@@ -115,16 +115,9 @@
-       LinkedList<Entry<String, Integer>> list = new LinkedList<>(map.entrySet());
+       Entry<String, Integer> common = Collections.max(map.entrySet(), Entry.comparingByValue());
-       Collections.sort(list, new Comparator<Map.Entry<String, Integer> >() {
-               public int compare(Map.Entry<String, Integer> o1,
-                                  Map.Entry<String, Integer> o2)
-               {
-                       return (o2.getValue()).compareTo(o1.getValue());
-               }
-       });
-       System.out.println("The most common first name is: " + list.get(0).getKey() + " and it occurs: " + list.get(0).getValue() + " times.");
+       System.out.println("The most common first name is: " + common.getKey() + " and it occurs: " + common.getValue() + " times.");
        Instant commonNameEnd = Instant.now();
        long timeElapsedCommonName = Duration.between(commonNameStart, commonNameEnd).toMillis();
        System.out.println("Most common name time: " + timeElapsedCommonName + "ms");

This patch reduces the runtime from 33 seconds to about 32 seconds.

The task here is to find the most frequently occurring first name. Instead of sorting a list of map entries, we can simply use Collections.max() to find the maximum entry according to some criterion. Also, instead of having to write out a comparator that compares the values of two map entries, we can use the Entry.comparingByValue() method to obtain such a comparator. This doesn’t result in much of a speedup. The reason is that, despite there being 18 million names in the file, there are only about 65,000 unique first names in the file, and thus only that many entries in the map. Computing the maximum entry saves a little bit of time compared to doing a full sort, but not that much.
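A minimal sketch of the same call on a tiny map, with made-up counts and hypothetical class and method names:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Map.Entry;

// Sketch only: MaxEntryDemo and mostCommon() are hypothetical names.
class MaxEntryDemo {
    // Single pass over the entries, with no sorting and no intermediate list.
    static Entry<String, Integer> mostCommon(Map<String, Integer> counts) {
        return Collections.max(counts.entrySet(), Entry.comparingByValue());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = Map.of("JOHN", 3, "MARY", 5, "ANN", 2);
        Entry<String, Integer> common = mostCommon(counts);
        System.out.println(common.getKey());    // MARY
        System.out.println(common.getValue());  // 5
    }
}
```

Note that Collections.max() throws NoSuchElementException on an empty collection, so this assumes the map has at least one entry.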

Variation 7

This isn’t a patch, but instead I did a general cleanup and refactoring pass. I’ll describe the changes here. The revised source file is in this gist:


The changes didn’t significantly affect the runtime, which remained at about 32 seconds.

There are a couple places in the original code where a frequency table is generated. The general algorithm is to create a map of items to counts (typically Integer) to hold the results. Then, for each item, if there’s no entry for it in the map, insert it with the value 1, otherwise add 1 to the value that’s already there. Several commenters have suggested using Map.merge() to make the put-or-update logic within the loop more concise. This will indeed work, but there’s a better way to do this using streams. For example, there is a list firstNames with a list of all first names extracted from the file. To generate a frequency table of these names, one can use this code:

Map<String, Long> nameMap = firstNames.stream()
                                      .collect(groupingBy(name -> name, counting()));

(This assumes a static import of java.util.stream.Collectors.* or individual names.) See the JDK Collectors documentation for more information. Note that the count value is a Long, not an Integer. Note also that we must use boxed values instead of primitives here, because we’re storing the values into collections.

I also use this same technique to generate the frequency table for dates:

Map<String, Long> dateMap = dates.stream()
                                 .collect(groupingBy(date -> date, counting()));
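For comparison, the Map.merge() approach that commenters suggested might look something like the following sketch. The class and method names are hypothetical and the input list is made up. Unlike the streams version, the counts here are Integer values.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: MergeCountDemo and countWithMerge() are hypothetical names.
class MergeCountDemo {
    static Map<String, Integer> countWithMerge(List<String> items) {
        Map<String, Integer> counts = new HashMap<>();
        for (String item : items) {
            // Inserts 1 if the key is absent, otherwise adds 1 to the existing value.
            counts.merge(item, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWithMerge(List.of("JOHN", "MARY", "JOHN"));
        System.out.println(counts.get("JOHN"));  // 2
        System.out.println(counts.get("MARY"));  // 1
    }
}
```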

The typical technique to loop over a map involves looping over the map’s entry set, and extracting the key and value from each entry using the getKey() and getValue() methods. Often, a more convenient way to loop over the entries of a Map is to use the Map.forEach() method. I used this to print out the map entries from the date map:

dateMap.forEach((date, count) ->
    System.out.println("Donations per month and year: " + date + " and donation count: " + count));

What makes this quite convenient is that the key and value are provided as individual arguments to the lambda expression, avoiding the need to call methods to extract them from an Entry.

Instead of creating a File object, opening a FileReader on it, and then wrapping it in a BufferedReader, I used the NIO newBufferedReader() method:

BufferedReader b = Files.newBufferedReader(Path.of(FILENAME))

It’s a bit more convenient than the wrapping approach.
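Here’s a self-contained sketch of how this might be used, assuming a try-with-resources statement to close the reader. The class and method names are hypothetical, and the program reads a temporary file it creates itself rather than the real data file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Sketch only: ReaderDemo and countLines() are hypothetical names.
class ReaderDemo {
    // Counts lines using a BufferedReader obtained from the NIO Files class.
    static long countLines(Path path) throws IOException {
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader b = Files.newBufferedReader(path)) {
            long count = 0;
            while (b.readLine() != null) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        // A small temporary file stands in for the real input.
        Path tmp = Files.createTempFile("demo", ".txt");
        try {
            Files.write(tmp, List.of("a|b", "c|d", "e|f"));
            System.out.println(countLines(tmp));  // 3
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Note that newBufferedReader() decodes the file as UTF-8 by default and throws an exception on malformed input, which can differ from what FileReader does with the platform default charset.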

Other changes I made include the following:

  • Unified the start time into a single Instant variable, and refactored the elapsed time reporting into a separate between() method.
  • Removed the outermost try statement whose catch block does nothing other than printing a stack trace. I see this a lot; I suspect it exists in some code template somewhere. It’s completely superfluous, because simply letting the exception propagate will cause the default exception handler to print the stack trace anyway. The only thing you might need to do is to add a throws IOException to the main() method, which is what I did in this case.
  • Used interface types instead of implementation types. I used List and Map in variable declarations instead of ArrayList and HashMap. This is an example of programming to an interface, not an implementation. This is not of great consequence in a small program, but it’s a good habit to get into, especially when defining fields and methods. I could also have used var in more places, but I wanted to be explicit when I changed type arguments, e.g., from Integer to Long.
  • Reindented the code. The JDK style is to use spaces for indentation, in multiples of 4. This avoids lines that are indented halfway off the right edge of the screen, but mainly I’m more comfortable with it.

Performance Recap

Version             Time (sec)      Description
-------             ----------      -----------
Original               108          baseline
Variation 1             44          optimize line splitting
Variation 2             40          rearrange printing lines by index
Variation 3             38          use regex for extracting first name
Variation 4             35          simplified first name regex
Variation 5             33          reuse StringBuilder/char[] for date extraction
Variation 6             32          use max() instead of sort()
Variation 7             32          cleanup

Summary & Comment

The first several optimizations involved removing work from the inner loop of the program. This is fairly obvious. Since the loop is executed a lot (18 million times) even a small reduction in the amount of work can affect the program’s runtime significantly.

What’s less obvious is the effect of reducing the amount of garbage generated within a loop. When more garbage is generated, it fills up the heap more quickly, causing GC to run more frequently. The more GC runs, the less time the program can spend getting work done. Thus, reducing the amount of garbage generated can also speed up a program.

I didn’t do any profiling of this program. Normally when you want to optimize a program, profiling is one of the first things you should do. This program is small enough, and I think I have a good enough eye for spotting potential improvements, that I was able to find some significant speedups. However, if somebody were to profile my revised version, they might be able to find more things to optimize.

Typically it’s a bad idea to do ad hoc benchmarking by finding the difference between before-and-after times. This is often the case with microbenchmarking. In such cases it’s preferable to use a benchmarking framework such as JMH. I didn’t think it was strictly necessary to use a framework to benchmark this program, though, since it runs long enough to avoid the usual benchmarking pitfalls. However, the differences in the runtimes between the later optimizations are getting smaller and smaller, and it’s possible that I was misled by my informal timing techniques.

Several commenters have suggested using the Files.lines() method to get a stream of lines, and then running this stream in parallel. I’ve made a few attempts to do this but I haven’t shown any here. One issue is with program organization. As it stands, this program’s main loop extracts data into three lists. Doing this using streams involves operations with side effects (which are not recommended for parallel streams) or creating an aggregate object that can be used to accumulate the results. These are certainly reasonable approaches, but I wasn’t able to get any speedup from using parallel streams, at least on my 2-core system. The additional overhead of aggregation seemed to more than offset the benefit gained from running on two cores. It’s quite possible that more work, or running the program on a system with more cores, could realize a benefit from running in parallel.

I believe the changes I’ve shown improve the quality of the code as well as improving its performance. But it’s possible to optimize this program even further. I have some additional changes that get the runtime consistently down to about 26 seconds. These changes involve replacing some library calls with hand-written, special-purpose Java code. I don’t usually recommend making such changes, as they result in programs that are more complicated, less maintainable, and more error-prone. That’s why I’m not showing them. The last variation shows, I think, the “sweet spot” that represents the best tradeoff between code quality and performance. It is often possible, though, to make programs go faster at the expense of making them more complicated.

With this article, I hope that I’ve been able to illustrate several programming techniques and APIs that anybody can use to speed up and improve the quality of their code, and to help people improve their Java development skills.

Oracle Code One 2018

Oracle Code One 2018 was the week before last. Overall I thought it was a good conference, though I was a bit sad to see the retirement of the JavaOne name. It was in Moscone West, which has much better conference facilities than the hotels where JavaOne had been for the previous several years. Unfortunately there seem to be fewer places to hang out where you’d just run into people. The main part of Moscone center is still under construction; perhaps when that finishes things can be rearranged a bit.

Unusually for me, I presented only two talks and one lab this year:

Var With Style: Local Variable Type Inference in Java (slides, video)

Collections Refueled (slides, video)

Lambda Programming Laboratory (exercises)

A “problem” with the conference was that there were a lot of sessions I wanted to attend, but they either conflicted with each other, they conflicted with a talk I was giving, or I ended up talking to people I had run into instead of attending sessions. But this last bit, the “hallway track,” is really what a conference is all about: meeting with and talking to people.

Fortunately, some of the sessions were recorded, so I can still see some of the ones I missed. Here’s a playlist of recordings of those sessions, along with the keynotes. I still have a bunch of them on my “watch later” list.

Some photos I took at the conference are shown below.


Just arrived at Oracle Code One, on Sunday, before the conference. With the “Usual Suspects” ringleader, Amelia:



Dr. Deprecator is in the house!



All ready for my first talk, Var With Style. I guess I look pretty grim here. I was fine, though. I guess I was concentrating on getting the selfie and I forgot to smile!



Getting ready for my second talk. This time I was kinda trying to smile, but it turned out more like a grimace!



With conference buddies Trisha and Simon!


Devoxx US 2017 Recap


Devoxx US 2017 was back on March 21-23 of this year, and I’m only now getting around to posting an article about it.

The conference was in the San Jose McEnery Convention Center, which is quite a convenient venue for me. It’s only a little bit farther from home than my office. The session rooms and exhibition space were pretty nice too.

Unfortunately, the attendance seemed fairly light, which might have had something to do with the postponement of the next Devoxx US until 2019, skipping 2018.

An uncrowded conference meant there was more time for conversations with other speakers and other conference attendees. This was really great. I remember one conversation in particular with Trisha Gee where we had time to talk about nulls and Optional in detail. Some of the ideas from this conversation wound up in an article Code Smells: Null that she wrote recently.

As is typical, I had several sessions at the conference.

Ten Simple Rules for Writing Great Test Cases
– conference session with Steve Poole | slides | video

This is a somewhat refreshed and updated version of the BOF Ten Things You Should Know When Writing Good Unit Test Cases in Java that Paul Thwaite (Steve’s colleague at IBM) and I had at JavaOne 2013. We didn’t actually update it all that much; I think most of the advice here is quite broadly applicable and doesn’t go obsolete. Actually, we did update it – “now with added cloud.”

Streams in JDK 8: The Good, The Bad, and the Ugly
– BOF with Simon Ritter | slides

This was a reprise of the BOF that Simon gave at Devoxx BE 2016 where he pulled me up front and asked me to provide some extemporaneous commentary. This worked so well that we decided to have me as an official co-speaker for the BOF this time.

Collections Refueled – conference session | slides | video

This is my talk about the new stuff in the Collections Framework in Java 8 and 9. Unfortunately, I didn’t prepare for this very well, and I had 60 minutes of material but only 45 minutes to present it. I ended up having to skip a bunch of the Java 9 material towards the end. (My JavaOne 2016 version of this talk is probably better.)

Optional: The Mother of all Bikesheds – conference session | slides | video

I’m happy to say that this was the second-highest rated talk at Devoxx US, according to the ratings shown by the Java Posse during the closing keynote:


Hm, these are Devoxx alternative facts, so maybe they’re alternative ratings as well.

There is a YouTube playlist of all Devoxx US 2017 sessions, so if you missed anything you can always go back and replay it.