
Last week, Paige Niedringhaus posted an article Using Java to Read Really, Really Large Files. While one can quibble about whether the file to be processed is indeed “really, really large,” it’s large enough to expose some interesting concerns and to present some interesting opportunities for optimizations and use of newer APIs. There was some discussion on Reddit /r/java and /r/programming and a PR with an alternative implementation. Earlier today, Morgen Peschke posted an analysis and comparison to a Scala version of the program. (This is posted as a comment at the bottom of the original article.) This article is my contribution to the discussion.

When I ran Niedringhaus’ program on my machine using JDK 8, I ran into the same memory issues as did Peschke; the program consumed so much memory that it spent all its time in garbage collection. Increasing the heap size worked around this problem. Interestingly, using JDK 11, I was able to run Niedringhaus’ version successfully without increasing the heap size. (I suspect the reason is that JDK 11 uses G1GC as the default collector, and its different collection scheme avoids the pathological behavior of the Parallel GC, which is the default collector in JDK 8.)

The approach I’ll take is to retain the large lists accumulated by the original program. My presumption is that the lists are loaded into memory in order to do further analysis that isn’t part of the original program. Instead of reducing memory consumption, I focus on changing aspects of the computation to improve runtime performance. After establishing the program’s baseline performance, I proceed to show several variations on the code that successively improve its performance, along with some discussion describing the reasons for the improvement. I present a diff for each variation. Each variation, along with my final version, is also available in a gist.

I downloaded indiv18.zip on 2019-01-04 and extracted the itcont.txt file. It’s about 3.3GB in size and has 18,245,416 lines. I started with the 2019-01-05 version of Niedringhaus’ test program:

ReadFileJavaApplicationBufferedReader.java

For comparing execution times, I’m using the last time reported by the program, the “Most common name time.” The benchmark times all showed a fair amount of variability. In most cases I reran the program a few times and chose the fastest time. This isn’t very rigorous, but it should at least give an idea of the relative speeds of execution. I’ve rounded to whole seconds, because the high variability makes milliseconds superfluous.

Niedringhaus’ article reported a runtime of about 150 seconds for this version of the program. I ran the program on my laptop (MacBook Pro, mid 2014, 3GHz Intel Core i7, 2 cores, 16GB) and the execution time was about 108 seconds. I’ll use that figure as the baseline against which subsequent optimizations are compared.

Variation 1

--- ReadFileJavaApplicationBufferedReader0.java
+++ ReadFileJavaApplicationBufferedReader1.java
@@ -57,8 +57,8 @@
                // System.out.println(readLine);
 
                // get all the names
-               String array1[] = readLine.split("\\s*\\|\\s*");
-               String name = array1[7];
+               String array1[] = readLine.split("\\|", 9);
+               String name = array1[7].strip();
                names.add(name);
                if(indexes.contains(lines)){
                        System.out.println("Name: " + names.get(lines - 1) + " at index: " + (lines - 1));
@@ -80,7 +80,7 @@
                        }
                }
 
-               String rawDate = array1[4];
+               String rawDate = array1[4].strip();
                String month = rawDate.substring(4,6);
                String year = rawDate.substring(0,4);
                String formattedDate = month + "-" + year;

Applying this patch reduced the execution time from 108 seconds to about 44 seconds.

This change is actually two optimizations. String splitting is quite expensive, and it’s done once for each of the 18 million lines in the file. It’s thus quite beneficial to remove work from the program’s main loop. The String.split() call uses a regex that splits the line into fields, where the separator is a vertical bar including any adjacent whitespace. The regex pattern is compiled each time through the loop. It would save some time to compile the regex once before the loop and to reuse it. But it turns out that using a regex here is unnecessary. We can instead split on a vertical bar alone. The split() method has a fast path for single-character split patterns which avoids regexes entirely. (Since the vertical bar is a regex metacharacter, it still counts as a single character even with the backslash escapes.) Thus we don’t need to worry about pre-compiling the split pattern.
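To make the effect concrete, here’s a small illustration using a made-up line (not actual data from the file):

String line = "AAA | BBB |SMITH, JOHN A|20180101|rest";

// The original regex split strips whitespace around the bars, but runs
// the full regex machinery on every line:
String[] slow = line.split("\\s*\\|\\s*");  // ["AAA", "BBB", "SMITH, JOHN A", ...]

// Splitting on the escaped bar alone takes split()'s single-character fast
// path, but whitespace adjacent to the bars stays in the fields:
String[] fast = line.split("\\|");          // ["AAA ", " BBB ", "SMITH, JOHN A", ...]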

Changing the split pattern can leave unwanted whitespace in some of the fields we’re interested in. To deal with this, we call the String.strip() method to remove it from those fields. The strip() method is new in Java 11. It removes whitespace from both ends of a string, where whitespace is defined using Unicode semantics. This differs from the older String.trim() method, which uses an anachronistic definition of whitespace based on ASCII control characters.
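The difference is easy to demonstrate. For example, U+2002 (EN SPACE) is Unicode whitespace but lies above U+0020:

String s = "\u2002JOHN\u2002";  // EN SPACE on both ends
String a = s.strip();           // "JOHN"; strip() uses Character.isWhitespace()
String b = s.trim();            // unchanged; trim() removes only chars <= U+0020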

The second optimization applies a limit to the number of splits performed. Each line of the file has 21 fields. Without the limit parameter, the split() method will split the entire line into 21 fields and create string objects for all of them. However, the program is only interested in data from the 5th and 8th fields (array indexes 4 and 7). It’s a lot of extra work to split the remaining fields and then just throw them away. Supplying a limit argument of 9 will stop splitting after the eighth field, leaving the remainder of the line unsplit in the last array element (at index 8). This reduces the amount of splitting work considerably.
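A quick sketch of the limit’s effect, again on a made-up line:

String line = "f0|f1|f2|f3|f4|f5|f6|f7|f8|f9|f10";
String[] all  = line.split("\\|");     // 11 elements; every field gets its own String
String[] some = line.split("\\|", 9);  // 9 elements; some[8] is "f8|f9|f10", left unsplit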

Variation 2

--- ReadFileJavaApplicationBufferedReader1.java
+++ ReadFileJavaApplicationBufferedReader2.java
@@ -29,17 +29,12 @@
 
        // get total line count
        Instant lineCountStart = Instant.now();
-       int lines = 0;
 
        Instant namesStart = Instant.now();
        ArrayList<String> names = new ArrayList<>();
 
        // get the 432nd and 43243 names
-       ArrayList<Integer> indexes = new ArrayList<>();
-
-       indexes.add(1);
-       indexes.add(433);
-       indexes.add(43244);
+       int[] indexes = { 0, 432, 43243 };
 
        // count the number of donations by month
        Instant donationsStart = Instant.now();
@@ -53,16 +48,12 @@
          System.out.println("Reading file using " + Caller.getName());
 
        while ((readLine = b.readLine()) != null) {
-               lines++;
                // System.out.println(readLine);
 
                // get all the names
                String array1[] = readLine.split("\\|", 9);
                String name = array1[7].strip();
                names.add(name);
-               if(indexes.contains(lines)){
-                       System.out.println("Name: " + names.get(lines - 1) + " at index: " + (lines - 1));
-               }
 
                if(name.contains(", ")) {
 
@@ -88,11 +79,15 @@
 
        }
 
+         for (int i : indexes) {
+             System.out.println("Name: " + names.get(i) + " at index: " + (i));
+         }
+
        Instant namesEnd = Instant.now();
        long timeElapsedNames = Duration.between(namesStart, namesEnd).toMillis();
        System.out.println("Name time: " + timeElapsedNames + "ms");
 
-       System.out.println("Total file line count: " + lines);
+       System.out.println("Total file line count: " + names.size());
        Instant lineCountEnd = Instant.now();
        long timeElapsedLineCount = Duration.between(lineCountStart, lineCountEnd).toMillis();
        System.out.println("Line count time: " + timeElapsedLineCount + "ms");

This patch reduces the execution time from 44 seconds to about 40 seconds.

This is perhaps a bit of a cheat, but it’s another example of removing work from the inner loop. The original code maintained a list of indexes (line numbers) whose names are to be printed out. During the loop, a counter kept track of the current line number, which was checked against the list of indexes to determine whether to print the name. The list is short, with only 3 items, so searching it is pretty quick. But there are 18,245,416 lines in the file and only 3 indexes in the list, so searching the list for the current line number will fail 18,245,413 times. Since we’re storing all the names in a list, we can just print out the names we’re interested in after we’ve loaded them all. This avoids having to check the list within the inner loop.

The patch also stores the indexes in an array, since the syntax for initializing an array is a bit more concise. Using an int[] also avoids boxing, though boxing of three elements isn’t significant, so it’s unlikely this makes any measurable difference in performance. In general, I prefer to avoid boxing unless it’s necessary.

Variation 3

--- ReadFileJavaApplicationBufferedReader2.java
+++ ReadFileJavaApplicationBufferedReader3.java
@@ -44,6 +45,7 @@
        Instant commonNameStart = Instant.now();
        ArrayList<String> firstNames = new ArrayList<>();
 
+       var namePat = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
 
        System.out.println("Reading file using " + Caller.getName());
 
@@ -55,20 +57,13 @@
                String name = array1[7].strip();
                names.add(name);
 
-               if(name.contains(", ")) {
-
-                       String array2[] = (name.split(", "));
-                       String firstHalfOfName = array2[1].trim();
-
-                       if (!firstHalfOfName.isEmpty()) {
-                               if (firstHalfOfName.contains(" ")) {
-                                       String array3[] = firstHalfOfName.split(" ");
-                                       String firstName = array3[0].trim();
-                                       firstNames.add(firstName);
-                               } else {
-                                       firstNames.add(firstHalfOfName);
-                               }
+               var matcher = namePat.matcher(name);
+               if (matcher.find()) {
+                   String s = matcher.group(2);
+                   if (s == null) {
+                       s = matcher.group(3);
                    }
+                   firstNames.add(s);
                }
 
                String rawDate = array1[4].strip();

This patch reduces the execution time from 40 to about 38 seconds.

Whereas in variation 1 we saw that reducing a regex to a single character split pattern helped provide a large speedup, in this case we’re replacing some fairly involved string splitting logic with a regex. Note that this code compiles the regex outside the loop and uses it repeatedly within the loop. In this patch I’m attempting to provide similar semantics to the splitting logic, but I’m sure there are cases where it doesn’t produce the same result. (For the input data in this file, the regex produces the same result as the splitting logic.) Unfortunately the complexity is moved out of the logic and into the regex. I’m not going to explain the regex in great detail, since it’s actually fairly ad hoc itself. One problem is that extracting a “first name” from a name field relies on European name conventions, and those conventions don’t apply to all names in this file. A second problem is that the data itself isn’t well-formed. For example, one name in the file is “FOWLER II, COL. RICHARD”. Both the splitting logic and the regex extract the first name as “COL.” which is clearly a title, not a name. It’s unclear what can be done in this case. Nevertheless, the vast majority of records in the file are well-formed, and applying European name conventions works for them. For a name record such as “SMITH, JOHN A” both the splitting logic and the regex extract “JOHN” as the first name, which is the intended behavior.

Variation 4

--- ReadFileJavaApplicationBufferedReader3.java
+++ ReadFileJavaApplicationBufferedReader4.java
@@ -45,7 +45,7 @@
        Instant commonNameStart = Instant.now();
        ArrayList<String> firstNames = new ArrayList<>();
 
-       var namePat = Pattern.compile(", \\s*(([^ ]*), |([^ ]+))");
+       var namePat = Pattern.compile(", \\s*([^, ]+)");
 
          System.out.println("Reading file using " + Caller.getName());
 
@@ -59,11 +59,7 @@
 
                  var matcher = namePat.matcher(name);
                  if (matcher.find()) {
-                     String s = matcher.group(2);
-                     if (s == null) {
-                         s = matcher.group(3);
-                     }
-                     firstNames.add(s);
+                     firstNames.add(matcher.group(1));
                  }
 
                String rawDate = array1[4].strip();

This patch reduces the runtime from 38 seconds to about 35 seconds.

For reasons discussed previously, it’s difficult in general to extract the correct “first name” from a name field. Since most of the data in this file is well-formed, I took the liberty of making some simplifying assumptions. Instead of trying to replicate the original splitting logic, here I’m using a simplified regex that extracts the first non-comma, non-space sequence of characters that follows a comma-space separator. In most cases this will extract the same first name from the name field, but there are some edge cases where it returns a different result. Assuming this is acceptable, it allows a simplification of the regex and also of the logic to extract the desired substring from the match. The result is another small speedup.
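Here’s a quick demonstration of what the simplified pattern extracts from the example names discussed earlier:

var namePat = Pattern.compile(", \\s*([^, ]+)");

var m1 = namePat.matcher("SMITH, JOHN A");
if (m1.find()) System.out.println(m1.group(1));  // JOHN

var m2 = namePat.matcher("FOWLER II, COL. RICHARD");
if (m2.find()) System.out.println(m2.group(1));  // COL. (same imperfect result as before)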

Variation 5

--- ReadFileJavaApplicationBufferedReader4.java
+++ ReadFileJavaApplicationBufferedReader5.java
@@ -46,6 +46,8 @@
        ArrayList<String> firstNames = new ArrayList<>();
 
        var namePat = Pattern.compile(", \\s*([^, ]+)");
+       char[] chars = new char[6];
+       StringBuilder sb = new StringBuilder(7);
 
        System.out.println("Reading file using " + Caller.getName());
 
@@ -63,11 +65,12 @@
                  }
 
                String rawDate = array1[4].strip();
-               String month = rawDate.substring(4,6);
-               String year = rawDate.substring(0,4);
-               String formattedDate = month + "-" + year;
-               dates.add(formattedDate);
-
+               rawDate.getChars(0, 6, chars, 0);
+               sb.setLength(0);
+               sb.append(chars, 0, 4)
+                 .append('-')
+                 .append(chars, 4, 2);
+               dates.add(sb.toString());
        }
 
          for (int i : indexes) {

This patch reduces the runtime from 35 seconds to about 33 seconds.

This change is primarily to reduce the amount of memory allocation within the inner loop. The previous code extracts two substrings from the raw date, creating two objects. It then appends the strings with a “-” separator, which requires creation of a temporary StringBuilder object. (This is likely still true even with JEP 280 – Indify String Concatenation in place.) Finally, the StringBuilder is converted to a String, allocating a fourth object. This last object is stored in a collection, but the first three objects are garbage.

To reduce object allocation, the patch code creates a char array and a StringBuilder outside the loop and reuses them. The character data is extracted into the char array, pieces of which are appended to the StringBuilder along with the “-” separator. The StringBuilder’s contents are then converted to a String, which is then stored into the collection. This String object is the only allocation that occurs in this step, so the patch code avoids creating any garbage.

I’m of two minds about this optimization. It does provide a few percentage points of improvement. On the other hand, it’s decidedly non-idiomatic Java: it’s rare to reuse objects this way. However, this code doesn’t introduce much additional complexity, and it does provide a measurable speedup, so I decided to keep it in. It does illustrate some techniques for dealing with character data that can reduce memory allocation, which can become expensive if done within an inner loop.

Variation 6

--- ReadFileJavaApplicationBufferedReader5.java
+++ ReadFileJavaApplicationBufferedReader6.java
@@ -115,16 +115,9 @@
                }
        }
 
-       LinkedList<Entry<String, Integer>> list = new LinkedList<>(map.entrySet());
+       Entry<String, Integer> common = Collections.max(map.entrySet(), Entry.comparingByValue());
 
-       Collections.sort(list, new Comparator<Map.Entry<String, Integer> >() {
-               public int compare(Map.Entry<String, Integer> o1,
-                                  Map.Entry<String, Integer> o2)
-               {
-                       return (o2.getValue()).compareTo(o1.getValue());
-               }
-       });
-       System.out.println("The most common first name is: " + list.get(0).getKey() + " and it occurs: " + list.get(0).getValue() + " times.");
+       System.out.println("The most common first name is: " + common.getKey() + " and it occurs: " + common.getValue() + " times.");
        Instant commonNameEnd = Instant.now();
        long timeElapsedCommonName = Duration.between(commonNameStart, commonNameEnd).toMillis();
        System.out.println("Most common name time: " + timeElapsedCommonName + "ms");

This patch reduces the runtime from 33 seconds to about 32 seconds.

The task here is to find the most frequently occurring first name. Instead of sorting a list of map entries, we can simply use Collections.max() to find the maximum entry according to some criterion. Also, instead of having to write out a comparator that compares the values of two map entries, we can use the Entry.comparingByValue() method to obtain such a comparator. This doesn’t result in much of a speedup. The reason is that, despite there being 18 million names in the file, there are only about 65,000 unique first names in the file, and thus only that many entries in the map. Computing the maximum entry saves a little bit of time compared to doing a full sort, but not that much.

Variation 7

This isn’t a patch, but instead I did a general cleanup and refactoring pass. I’ll describe the changes here. The revised source file is in this gist:

ReadFileJavaApplicationBufferedReader7.java

The changes didn’t significantly affect the runtime, which remained at about 32 seconds.

There are a couple places in the original code where a frequency table is generated. The general algorithm is to create a map of items to counts (typically Integer) to hold the results. Then, for each item, if there’s no entry for it in the map, insert it with the value 1, otherwise add 1 to the value that’s already there. Several commenters have suggested using Map.merge() to make the put-or-update logic within the loop more concise. This will indeed work (a sketch of the merge() approach appears after the stream examples below), but there’s a better way to do this using streams. For example, the program has a list firstNames containing all the first names extracted from the file. To generate a frequency table of these names, one can use this code:

Map<String, Long> nameMap = firstNames.stream()
                                      .collect(groupingBy(name -> name, counting()));

(This assumes a static import of java.util.stream.Collectors.* or individual names.) See the JDK Collectors documentation for more information. Note that the count value is a Long, not an Integer. Note also that we must use boxed values instead of primitives here, because we’re storing the values into collections.

I also use this same technique to generate the frequency table for dates:

Map<String, Long> dateMap = dates.stream()
                                 .collect(groupingBy(date -> date, counting()));
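For comparison, here’s a sketch of the Map.merge() approach that the commenters suggested; it’s equivalent in effect to the original put-or-update logic:

Map<String, Integer> nameMap = new HashMap<>();
for (String name : firstNames) {
    nameMap.merge(name, 1, Integer::sum);  // insert 1, or add 1 to the existing count
}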

The typical technique for looping over a map involves looping over the map’s entry set and extracting the key and value from each entry using the getKey() and getValue() methods. Often, a more convenient way to loop over the entries of a Map is to use the Map.forEach() method. I used this to print out the entries from the date map:

dateMap.forEach((date, count) ->
    System.out.println("Donations per month and year: " + date + " and donation count: " + count));

What makes this quite convenient is that the key and value are provided as individual arguments to the lambda expression, avoiding the need to call methods to extract them from an Entry.

Instead of creating a File object, opening a FileReader on it, and then wrapping it in a BufferedReader, I used the NIO newBufferedReader() method:

BufferedReader b = Files.newBufferedReader(Path.of(FILENAME))

It’s a bit more convenient than the wrapping approach.

Other changes I made include the following:

  • Unified the start time into a single Instant variable, and refactored the elapsed time reporting into a separate between() method.
  • Removed the outermost try statement whose catch block does nothing other than printing a stack trace. I see this a lot; I suspect it exists in some code template somewhere. It’s completely superfluous, because simply letting the exception propagate will cause the default exception handler to print the stack trace anyway. The only thing you might need to do is to add a throws IOException to the main() method, which is what I did in this case.
  • Used interface types instead of implementation types. I used List and Map in variable declarations instead of ArrayList and HashMap. This is an example of programming to an interface, not an implementation. This is not of great consequence in a small program, but it’s a good habit to get into, especially when defining fields and methods. I could also have used var in more places, but I wanted to be explicit when I changed type arguments, e.g., from Integer to Long.
  • Reindented the code. The JDK style is to use spaces for indentation, in multiples of 4. This avoids lines that are indented halfway off the right edge of the screen, but mainly I’m more comfortable with it.

Performance Recap


Version             Time (sec)      Description
-------             ----------      -----------
Original               108          baseline
Variation 1             44          optimize line splitting
Variation 2             40          rearrange printing lines by index
Variation 3             38          use regex for extracting first name
Variation 4             35          simplified first name regex
Variation 5             33          reuse StringBuilder/char[] for date extraction
Variation 6             32          use max() instead of sort()
Variation 7             32          cleanup

Summary & Comment

The first several optimizations involved removing work from the inner loop of the program. This is fairly obvious. Since the loop is executed a lot (18 million times), even a small reduction in the amount of work can affect the program’s runtime significantly.

What’s less obvious is the effect of reducing the amount of garbage generated within a loop. When more garbage is generated, it fills up the heap more quickly, causing GC to run more frequently. The more GC runs, the less time the program can spend getting work done. Thus, reducing the amount of garbage generated can also speed up a program.

I didn’t do any profiling of this program. Normally when you want to optimize a program, profiling is one of the first things you should do. This program is small enough, and I think I have a good enough eye for spotting potential improvements, that I was able to find some significant speedups. However, if somebody were to profile my revised version, they might be able to find more things to optimize.

Typically it’s a bad idea to do ad hoc benchmarking by simply taking the difference between before-and-after times; this is especially true of microbenchmarks. In such cases it’s preferable to use a benchmarking framework such as JMH. I didn’t think it was strictly necessary to use a framework to benchmark this program, though, since it runs long enough to avoid the usual benchmarking pitfalls. However, the differences in the runtimes between the later optimizations are getting smaller and smaller, and it’s possible that I was misled by my informal timing techniques.

Several commenters have suggested using the Files.lines() method to get a stream of lines, and then running this stream in parallel. I’ve made a few attempts to do this, but I haven’t shown any of them here. One issue is with program organization. As it stands, this program’s main loop extracts data into three lists. Doing this using streams involves operations with side effects (which are not recommended for parallel streams) or creating an aggregate object that can be used to accumulate the results. These are certainly reasonable approaches, but I wasn’t able to get any speedup from using parallel streams — at least on my 2-core system. The additional overhead of aggregation seemed to more than offset the benefit gained from running on two cores. It’s quite possible that with more work, or on a system with more cores, one could realize a benefit from running in parallel.
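For the record, here’s roughly the shape such an attempt takes, sketched for a single side-effect-free aggregation (a frequency table over the name field). This is illustrative only, not one of my actual attempts, and the choice of collector is an assumption:

Map<String, Long> nameFreq;
try (var lines = Files.lines(Path.of(FILENAME))) {
    nameFreq = lines.parallel()
                    .map(line -> line.split("\\|", 9)[7].strip())
                    .collect(groupingByConcurrent(n -> n, counting()));  // static imports as before
}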

I believe the changes I’ve shown improve the quality of the code as well as its performance. But it’s possible to optimize this program even further. I have some additional changes that get the runtime consistently down to about 26 seconds. These changes involve replacing some library calls with hand-written, special-purpose Java code. I don’t usually recommend making such changes, as they result in programs that are more complicated, less maintainable, and more error-prone. That’s why I’m not showing them. The last variation shows, I think, the “sweet spot” that represents the best tradeoff between code quality and performance. It is often possible, though, to make programs go faster at the expense of making them more complicated.

With this article, I hope that I’ve been able to illustrate several programming techniques and APIs that anybody can use to speed up their code and improve its quality, and to help people improve their Java development skills.


Oracle Code One 2018

Oracle Code One 2018 was the week before last. Overall I thought it was a good conference, though I was a bit sad to see the retirement of the JavaOne name. It was in Moscone West, which has much better conference facilities than the hotels where JavaOne had been for the previous several years. Unfortunately there seem to be fewer places to hang out where you’d just run into people. The main part of Moscone center is still under construction; perhaps when that finishes things can be rearranged a bit.

Unusually for me, I presented only two talks and one lab this year:

Var With Style: Local Variable Type Inference in Java (slides, video)

Collections Refueled (slides, video)

Lambda Programming Laboratory (exercises)

A “problem” with the conference was that there were a lot of sessions I wanted to attend, but they either conflicted with each other, they conflicted with a talk I was giving, or I ended up talking to people I had run into instead of attending sessions. But this last bit, the “hallway track,” is really what a conference is all about: meeting with and talking to people.

Fortunately, some of the sessions were recorded, so I can still see some of the ones I missed. Here’s a playlist of recordings of those sessions, along with the keynotes. I still have a bunch of them on my “watch later” list.

Some photos I took at the conference are shown below.

 

Just arrived at Oracle Code One, on Sunday, before the conference. With the “Usual Suspects” ringleader, Amelia:

[photo: IMG_3140.JPG]

Dr. Deprecator is in the house!

[photo: IMG_3145.JPG]

All ready for my first talk, Var With Style. I guess I look pretty grim here. I was fine, though. I guess I was concentrating on getting the selfie and I forgot to smile!

[photo: IMG_3174.JPG]

Getting ready for my second talk. This time I was kinda trying to smile, but it turned out more like a grimace!

[photo: IMG_3177.JPG]

With conference buddies Trisha and Simon!

[photo: IMG_3179.JPG]

Devoxx US 2017 Recap

 

Devoxx US 2017 took place back on March 21-23 of this year, and I’m only now getting around to posting an article about it.

The conference was in the San Jose McEnery Convention Center, which is quite a convenient venue for me. It’s only a little bit farther from home than my office. The session rooms and exhibition space were pretty nice too.

Unfortunately, the attendance seemed fairly light, which might have had something to do with the postponement of the next Devoxx US until 2019, skipping 2018.

An uncrowded conference meant there was more time for conversations with other speakers and other conference attendees. This was really great. I remember one conversation in particular with Trisha Gee where we had time to talk about nulls and Optional in detail. Some of the ideas from this conversation wound up in an article Code Smells: Null that she wrote recently.

As is typical, I had several sessions at the conference.

Ten Simple Rules for Writing Great Test Cases
– conference session with Steve Poole | slides | video

This is a somewhat refreshed and updated version of the BOF Ten Things You Should Know When Writing Good Unit Test Cases in Java that Paul Thwaite (Steve’s colleague at IBM) and I had at JavaOne 2013. We didn’t actually update it all that much; I think most of the advice here is quite broadly applicable and doesn’t go obsolete. Actually, we did update it – “now with added cloud.”

Streams in JDK 8: The Good, The Bad, and the Ugly
– BOF with Simon Ritter | slides

This was a reprise of the BOF that Simon gave at Devoxx BE 2016 where he pulled me up front and asked me to provide some extemporaneous commentary. This worked so well that we decided to have me as an official co-speaker for the BOF this time.

Collections Refueled – conference session | slides | video

This is my talk about the new stuff in the Collections Framework in Java 8 and 9. Unfortunately, I didn’t prepare for this very well, and I had 60 minutes of material but only 45 minutes to present it. I ended up having to skip a bunch of the Java 9 material towards the end. (My JavaOne 2016 version of this talk is probably better.)

Optional: The Mother of all Bikesheds – conference session | slides | video

I’m happy to say that this was the second-highest rated talk at Devoxx US, according to the ratings shown by the Java Posse during the closing keynote:

[image: JavaPosse-TopRatedTalks]

Hm, these are Devoxx alternative facts, so maybe they’re alternative ratings as well.

There is a YouTube playlist of all Devoxx US 2017 sessions, so if you missed anything you can always go back and replay it.

This evening, I presented Collections Refueled at the Silicon Valley JUG. Thanks to the JUG for having me, and to the attendees for all the interesting questions!

Here are the slides for my presentation: CollectionsRefueled.pdf

 

The first segment of Episode 23 of the Java Off-Heap podcast covered the deprecation of Object.finalize in Java 9 and deprecation and finalization in general. Deprecation is a subject near and dear to my heart. The hosts even mentioned me by name. Thanks for the shout-out, guys!

I wanted to clarify a few points and to answer some of the questions that weren’t resolved in that segment of the show.

Java Finalizers vs. C++ Destructors

The role of Java’s finalizers differs from C++ destructors. In C++ (prior to the introduction of mechanisms like shared_ptr), anytime you created something with new in a constructor, you were required to call delete on it in the destructor. People mistakenly carried this thinking over to Java and thought that it was necessary to write finalize methods to null out references to other objects. (This was never necessary, and fortunately the practice seems to have died out long ago.) In Java, the garbage collector cleans up anything that resides on the heap, so it’s rarely necessary to write a finalizer.

Finalizers are useful if an object creates resources that aren’t managed by the garbage collector. Examples of this are things like file descriptors or natively allocated (“off-heap”) memory. The garbage collector doesn’t clean these up, so something else has to. In the early days of Java, finalization was the only mechanism available for cleaning up non-heap resources.

Phantom References

The point of finalization is that it allows one last chance at cleanup after an object becomes unreachable, but before it’s actually collected. One of the problems with finalization is that it allows “resurrection” of an object. When an object’s finalize method is called, it has a reference to this — the object about to be collected. It can hook the this reference back into the object graph, preventing the object from being collected. As a result, the object can’t simply be collected after the finalize method returns. Instead, the garbage collector must run again in order to determine whether the object is truly unreachable and can therefore be collected.
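Here’s a contrived illustration of resurrection:

class Zombie {
    static Zombie instance;  // a reachable root

    @Override
    protected void finalize() {
        instance = this;     // hooks the dying object back into the object graph
    }
}

Once finalize has run, the object is reachable again, and the collector must run again later to determine whether it can be collected.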

The reference package java.lang.ref was introduced all the way back in JDK 1.2. This package includes several different reference types, including PhantomReference. The salient feature of PhantomReference is that it doesn’t allow the object to be “resurrected.” It does this by making the contained reference inaccessible. A holder of a phantom reference gets notified that the referent has become unreachable (strictly speaking, phantom-reachable) but there’s no way to get the referent out and hook it back into the object graph. This makes the garbage collector’s job easier.
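The inaccessibility is easy to demonstrate (PhantomReference and ReferenceQueue are both in java.lang.ref):

ReferenceQueue<Object> queue = new ReferenceQueue<>();
PhantomReference<Object> ref = new PhantomReference<>(new Object(), queue);
System.out.println(ref.get());  // always prints null; the referent can never be retrieved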

Another advantage of a PhantomReference is that, like the other reference types, it can be cleared explicitly. Suppose there’s an object that holds some external resource like a file descriptor. Typically, such objects have a close method the application should call in order to release the descriptor. Prior to the introduction of the reference types, such objects also needed a finalize method in order to clean up if the application had failed to call close. The problem is, even if the application has called close, the collector needs to do finalization processing and then run again, as described above, in order to collect the object.

PhantomReference and the other reference types have a clear method that explicitly clears the contained reference. An object that has released its native resources via an explicit call to a close method would call PhantomReference.clear. This avoids a subsequent reference processing step, allowing the object to be collected immediately when it becomes unreachable.

Why Deprecate Object.finalize Now?

A couple of things have changed. First, JEP 277 has clarified the meaning of deprecation in Java 9 so that it doesn’t imply that the API will be removed unless forRemoval=true is specified. The deprecation of Object.finalize is an “ordinary” deprecation in that it’s not being deprecated for removal. (At least, not yet.)

A second thing that’s changed in Java 9 is the introduction of a class java.lang.ref.Cleaner. Reference processing is often fairly subtle, and there’s a lot of work to be done to create a reference queue and a thread to process references from that queue. Cleaner is basically a wrapper around ReferenceQueue and PhantomReference that makes reference handling easier.
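Here’s a minimal sketch of how a class might use Cleaner; the class and its names are hypothetical:

import java.lang.ref.Cleaner;

class Resource implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup action must not hold a reference to the Resource itself;
    // otherwise the Resource could never become phantom-reachable.
    private static class ReleaseAction implements Runnable {
        public void run() {
            // release the file descriptor or off-heap memory here
        }
    }

    private final Cleaner.Cleanable cleanable;

    Resource() {
        cleanable = CLEANER.register(this, new ReleaseAction());
    }

    public void close() {
        cleanable.clean();  // runs the action at most once and unregisters it
    }
}

If close() is never called, the Cleaner runs the action after the Resource becomes phantom-reachable; if close() is called, the explicit clean() means no reference processing is needed later.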

What hasn’t changed is that for years, it’s been part of Java lore that using finalization is discouraged. It’s time to make a formal declaration, and the way to do this is to deprecate it.

Has Anything Ever Been Removed from Java SE?

The podcast episode mentioned a Quora answer by Cameron Purdy written in 2014, where he said that nothing had ever been removed from Java. When he wrote it, the statement was correct. Various features of the JDK had been removed (such as apt, the annotation processing tool), but public APIs had never been removed.

However, the following six APIs were deprecated in Java SE 8, and they have been removed from Java SE 9:

  1. java.util.jar.Pack200.Packer.addPropertyChangeListener
  2. java.util.jar.Pack200.Unpacker.addPropertyChangeListener
  3. java.util.logging.LogManager.addPropertyChangeListener
  4. java.util.jar.Pack200.Packer.removePropertyChangeListener
  5. java.util.jar.Pack200.Unpacker.removePropertyChangeListener
  6. java.util.logging.LogManager.removePropertyChangeListener

In addition, in Java SE 9, about 20 methods and six modules have been deprecated with forRemoval=true, indicating our intent to remove them from the next major Java SE release. Some of the classes and methods to be removed include:

  • java.lang.Compiler
  • Thread.destroy
  • System.runFinalizersOnExit
  • Thread.stop(Throwable)

The modules deprecated for removal are the following:

  1. java.activation
  2. java.corba
  3. java.transaction
  4. java.xml.bind
  5. java.xml.ws
  6. java.xml.ws.annotation

So yes, we are getting serious about removing stuff!

Will Finalization Be Removed?

As mentioned earlier, Object.finalize is not being deprecated for removal at this time. As such, its deprecation is merely a recommendation that developers consider migrating to alternative cleanup mechanisms. The recommended replacements are PhantomReference and the new Cleaner class.

That said, we do eventually want to get rid of finalization. It adds extra complexity to the garbage collector, and there are recurring cases where it causes performance problems.

Before we can get rid of it, though, we need to remove uses of it from the JDK. That’s more than just removing the overrides of finalize and rewriting the code to use Cleaner instead. The problem is that there are some public API classes in the JDK that override finalize and specify its behavior. In turn, their subclasses might override finalize and rely on the existing behavior of super.finalize(). Removing the finalize method would expose these subclasses to a potentially incompatible behavior change. This will need to be investigated carefully.

There might also be a transition period where calling of the finalize method is controlled by a command-line option. This would allow testing of applications to see if they can cope without finalization. Only after a transition period would we consider removing the finalization mechanism entirely. We might even leave the finalize method declaration around for binary compatibility purposes, even after the mechanism for calling it has been removed.

As you can see, removing finalization would require a long transition period spanning several JDK releases, taking several years. That’s all the more reason to start with deprecation now.

Now that Devoxx US is imminent, it’s about time for me to post about Devoxx BE 2016, which took place in November 2016 in Antwerp. That was several months ago, which was ages in conference time, so this post is mainly a placeholder to host slides and links to the videos.

Array Linked to a List, the Full Story! – José Paumard (video)

I was surprised to find that I was mentioned by name in the abstract for this university session. José Paumard took a tweet of mine from a year earlier (actually one by my alter ego, Dr Deprecator) and turned it into an entire university session. José was happy to have me attend the session, and he was gracious enough to invite me on stage for a few comments and questions.

Streams in JDK 8: The Good, The Bad and the Ugly – Simon Ritter (BOF)

This was another of my impromptu appearances. Simon had submitted this session, and he asked me to join him in presenting it. I said that I wasn’t sure what I would speak about. He said to me, “I’ll put up a slide and say a few words about it. I’m sure you’ll have an opinion.” (He was right.) This was a BOF, so it was pretty informal, but Simon came up with some really interesting examples, and we had a good discussion and covered a lot of issues.

Simon and I will be repeating this BOF at Devoxx US this coming week.

Ask the JDK Architects – panel session (video)

This was a panel session featuring Mark Reinhold and Brian Goetz (the actual JDK architects) along with Alan Bateman and myself (JDK Core Libraries engineers). This session consisted entirely of answering questions from the audience.

Optional: The Mother of All Bikesheds – conference session (slides, video)

This was a conference session about a single Java 8 API, java.util.Optional. Some were skeptical that I could talk for an entire hour about a single API. I proved them wrong. Credit for the title goes to my übermanager at Oracle, Jeannette Hung. It refers to the many protracted mailing list discussions (“centithreads”) about the design of Optional.

Thinking in Parallel – joint conference session with Brian Goetz (slides, video)

This was an amazing experience because the auditorium was so full that people were sitting on the steps. Brian Goetz was the big draw here, but I also think it was packed because there were fewer sessions running at the same time.

* * *

I was pleased to learn that both of my conference sessions were in the top 20 talks for the conference. Thanks for your support!

I just finished a vJUG24 session entitled Optional: The Mother of All Bikesheds.

Video: YouTube

Slide deck: PDF

This slide deck has a few minor updates relative to what I presented in the vJUG24 session:

  • slides 21-22: clarify problem statement (the before and after code is correct)
  • slide 26: mention flatMap() for completeness
  • slide 31: add link to Stack Overflow question
  • slide 36: clarify reason for not deprecating Optional.get()
  • slide 42: new slide describing new methods in Java 9

For convenience, here are the six style rules I proposed in the session:

  1. Never, ever, use null for an Optional variable or return value.
  2. Never use Optional.get() unless you can prove that the Optional is present.
  3. Prefer alternative APIs over Optional.isPresent() and Optional.get().
  4. It’s generally a bad idea to create an Optional for the specific purpose of chaining methods from it to get a value.
  5. If an Optional chain has a nested Optional chain, or has an intermediate result of Optional, it’s probably too complex.
  6. Avoid using Optional in fields, method parameters, and collections.

On a related note, I thought of another rule after I presented the session:

  7. Don’t use an Optional to wrap any collection type (List, Set, Map). Instead, use an empty collection to represent the absence of values.
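For example (rule 7), a hypothetical lookup would return an empty list rather than wrapping the list in an Optional:

// Instead of Optional<List<String>>:
List<String> namesFor(String key) {
    List<String> result = lookup(key);             // lookup() is hypothetical
    return (result != null) ? result : List.of();  // an empty list represents "no values"
}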