The other day on Twitter I said, “Scanner is a weird beast. I wouldn’t necessarily use it as a good example for anything.” The context was a discussion about classes that are both an Iterator and AutoCloseable. As it happens, Scanner is such an example. It’s an Iterator, because it allows iteration over a sequence of tokens, and it’s also AutoCloseable, because it might have an external resource (like a file) contained within it. I wouldn’t hold it up as an example of good object design, though. This article explains why.
Scanner has a pretty complicated API, but once you figure out how to use it, it’s incredibly useful. Its main issue is that it’s trying to do too many things at once. The good news is that you can use parts of the API for stylized uses and mostly ignore other parts of the API.
At its core, Scanner is about regex pattern matching. Unlike the Pattern and Matcher classes, which can only match on a fixed input such as a String, Scanner allows you to match over arbitrary input that might not even exist in memory. There are several Scanner constructors that allow input to be read from various sources such as files, InputStreams, or channels. Scanner handles buffering, reading additional input as necessary and discarding any input that was skipped over during matching. This is really cool. It means you can do matching over arbitrarily sized input data using just a few KB of memory.
(Naturally this depends on the patterns used for matching as well as the well-formedness of input. For example, you can attempt to read a file line by line, and this will work for an arbitrarily sized file if it’s broken up into reasonably sized lines. If the file doesn’t have any line separators, Scanner will bring the whole file into memory, as the file conceptually contains one long line.)
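As a minimal sketch of that line-by-line case (the file name here is just a placeholder), Scanner does the I/O and buffering for you while you process each line:

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Scanner;

public class ScanLines {
    public static void main(String[] args) throws IOException {
        // Scanner reads and buffers from the file itself, so each line is processed
        // without the whole file ever being held in memory (as long as the lines
        // themselves are reasonably sized).
        try (Scanner sc = new Scanner(Path.of("big.log"))) {   // placeholder file name
            while (sc.hasNextLine()) {
                System.out.println(sc.nextLine().length());
            }
        }
    }
}
```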
Scanner has two fundamental modes of matching. The first mode is to break the input into tokens that are separated by delimiters. The delimiters are defined by the regex pattern you provide. (This is rather like the String.split method.) The second mode is to find chunks of text that result from matching the regex pattern you provide. In other words, the token mode provides the text between matches, and the find mode provides the text of the matches themselves. What’s odd about the Scanner API is that there are groups of methods that apply in one mode but not the other.
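To make the distinction concrete before getting to the method lists, here’s a minimal sketch (using a made-up string of letters and digits) of the same pattern used in each mode:

```java
import java.util.List;
import java.util.Scanner;
import java.util.regex.MatchResult;
import static java.util.stream.Collectors.toList;

public class TwoModes {
    public static void main(String[] args) {
        // Tokens mode: the digits act as delimiters, so you get the letters between them.
        List<String> betweenMatches = new Scanner("a1b22c333d")
                .useDelimiter("\\d+")
                .tokens()
                .collect(toList());
        System.out.println(betweenMatches);   // [a, b, c, d]

        // Find mode: the digits are what the pattern matches, so you get the digits themselves.
        List<String> matches = new Scanner("a1b22c333d")
                .findAll("\\d+")
                .map(MatchResult::group)
                .collect(toList());
        System.out.println(matches);          // [1, 22, 333]
    }
}
```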
The methods that apply to the tokens mode are:
- delimiter
- locale
- hasNext* (excluding hasNextLine)
- next* (excluding nextLine)
- radix
- tokens
- useDelimiter
- useLocale
- useRadix
The methods that apply to the find mode are:
- findAll
- findInLine
- findWithinHorizon
- hasNextLine
- nextLine
- skip
(Additional Scanner methods apply to both modes.)
Here’s an example of using Scanner for matching tokens:
String story = """ "When I use a word," Humpty Dumpty said, in rather a scornful tone, "it means just what I choose it to mean - neither more nor less." "The question is," said Alice, "whether you can make words mean so many different things." "The question is," said Humpty Dumpty, "which is to be master - that's all." """; List<String> words = new Scanner(story) .useDelimiter("[- \\.\n\",]+") .tokens() .collect(toList());
(Note, this example uses the new Text Blocks feature, which was previewed in JDK 13 and 14 and which is scheduled to be final in JDK 15.)
Here, we set the delimiter pattern to match whitespace and various punctuation marks, so the tokens consist of text between the delimiters. The results are:
[When, I, use, a, word, Humpty, Dumpty, said, in, rather, a, scornful, tone, it, means, just, what, I, choose, it, to, mean, neither, more, nor, less, The, question, is, said, Alice, whether, you, can, make, words, mean, so, many, different, things, The, question, is, said, Humpty, Dumpty, which, is, to, be, master, that's, all]
In this example I used the tokens() method to provide a stream of tokens. Scanner implements Iterator<String>, which allows you to iterate over the tokens that were found, using the typical hasNext/next methods. Unfortunately, Scanner does not implement Iterable, which would allow you to use it within a for-loop.
Scanner also provides pairs of hasNext/next methods for converting tokens to data. For example, it provides hasNextInt and nextInt methods that search for the next token and convert it to an int (if available). Corresponding pairs of methods are also available for BigInteger, boolean, byte, double, float, long, and short. These pairs of methods are “iterator-like” in that the hasNextX/nextX method pairs are just like the hasNext/next method pair of an Iterator, with the addition of data conversion. But there’s no way to wrap them in an Iterator, like Iterator<BigInteger> or Iterator<Double>, without writing your own adapter code. This is unfortunate, since Scanner is an Iterator<String> but its Iterator is only over tokens, not the value-added iterator-like constructs that include data conversions.
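To illustrate, here’s a rough sketch of the kind of adapter you’d have to write yourself (the class name and structure here are just illustrative):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Scanner;

// Wraps Scanner's hasNextInt/nextInt pair so it can be used as an Iterator<Integer>.
class IntTokenIterator implements Iterator<Integer> {
    private final Scanner scanner;

    IntTokenIterator(Scanner scanner) {
        this.scanner = scanner;
    }

    @Override
    public boolean hasNext() {
        return scanner.hasNextInt();
    }

    @Override
    public Integer next() {
        if (!scanner.hasNextInt()) {
            throw new NoSuchElementException();
        }
        return scanner.nextInt();
    }
}
```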
The other main mode of Scanner is the find mode, which provides a succession of matches from a pattern you provide. Here’s an example of that:
```java
List<String> words = new Scanner(story)
        .findAll("[A-Za-z']+")
        .map(MatchResult::group)
        .collect(toList());
```
Here, instead of matching delimiters between tokens, I’ve provided a pattern that matches the results I want to get. Note that findAll() returns a Stream<MatchResult>, whose elements must be converted to strings; that’s what the MatchResult::group method reference does. The resulting list is exactly the same list of words as in the previous example. Personally, I find this mode more useful than the tokens mode. You’re providing the pattern for the text you’re interested in, as opposed to a pattern for the delimiters between the text you’re interested in. Also, you get back MatchResult objects, which are useful for extracting substrings of what you matched. This isn’t available in tokens mode.
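For example, here’s a small sketch (the “key=value” input is just an illustration) where the capturing groups in the pattern let you pull substrings out of each match:

```java
import java.util.Map;
import java.util.Scanner;
import static java.util.stream.Collectors.toMap;

public class FindGroups {
    public static void main(String[] args) {
        // Each MatchResult carries the capturing groups, so group(1) and group(2)
        // extract the key and value from every match.
        Map<String, String> settings = new Scanner("width=80 height=24 tabs=4")
                .findAll("(\\w+)=(\\w+)")
                .collect(toMap(m -> m.group(1), m -> m.group(2)));
        System.out.println(settings);   // {width=80, height=24, tabs=4} (map order may vary)
    }
}
```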
I started off this article saying that Scanner is weird but useful. It’s weird because it has these two distinct modes. It has groups of methods that apply to one mode but not the other. If you look at the API carefully (or at the implementation) you’ll see that there is also a bunch of internal state that applies to one mode but not the other. It seems like Scanner should have been split into two classes. Another weird thing about Scanner is that it’s an Iterator<String>, which elevates one part of one of the modes to the top level of the API and relegates the other parts to second-class status.
That said, Scanner provides some very useful services. It does I/O and buffering for you, and if regex matching needs more input, it handles that automatically. I’m also partial to the streams-returning methods like findAll() and tokens() — I have to admit, I added them — but they make bulk processing of arbitrary input quite easy. I hope you find these aspects of Scanner useful as well.
> Unfortunately, Scanner does not implement Iterable, which would allow you to use it within a for-loop.
It’s unfortunate from a convenience perspective, but Iterator is for single-iteration use cases, and Iterable for multiple “resettable” iterations, right? So while convenient from an enhanced for loop perspective, it would be a misuse of the API to let Scanner implement Iterable. Would you agree?
Yeah, I didn’t want to get into the single- vs multiple-use issues surrounding Iterable in the main article. I was wondering if somebody would raise the issue. 🙂 Scanner is weird in yet another way, though. It’s not necessarily resettable, for example if you’re reading from the network. However, with token-based parsing, the termination of the iterator (hasNext == false) doesn’t necessarily exhaust the Scanner. It terminates when there are no *matching* tokens remaining. You can iterate again using a different pattern. For example, you could read all the ints out of a file with one for-loop, and then you could read the remaining non-int tokens with a second for-loop. You’d probably need some kind of adapter to get Iterables of different things out of the Scanner. Not quite the same thing. Hmm, I should play around with this some more.
Wow, that is funny. Just tried it out, and it does indeed work. So if the Scanner finds no matches, it will still load the whole input into memory, I guess.
Another question I have: How is the Scanner’s “streaming pattern matching” implemented internally? It seems to me like Pattern/Matcher work only on CharSequence (and CharSequence has a length method).
It depends on the exact circumstances, but for token processing, if there’s no match, the Scanner is left at the next token that didn’t match. So you can have a while loop that calls hasNextInt() / nextInt() and it will process tokens as long as they can be converted to ints. You can then process subsequent tokens using hasNext() / next() or whatever.
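A minimal sketch of that two-pass pattern (with a made-up token sequence) looks like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class TwoPasses {
    public static void main(String[] args) {
        Scanner sc = new Scanner("1 2 3 foo bar baz");

        // First pass: consume tokens only as long as they convert to ints.
        List<Integer> ints = new ArrayList<>();
        while (sc.hasNextInt()) {        // stops at "foo", which isn't an int
            ints.add(sc.nextInt());
        }

        // Second pass: the Scanner isn't exhausted, so keep going with plain tokens.
        List<String> rest = new ArrayList<>();
        while (sc.hasNext()) {
            rest.add(sc.next());
        }

        System.out.println(ints);        // [1, 2, 3]
        System.out.println(rest);        // [foo, bar, baz]
    }
}
```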
The matching streams do use Pattern/Matcher internally. It turns out that CharBuffer implements CharSequence! Scanner keeps an internal Matcher pointed at the buffer. If a match fails in the buffer, or it succeeds but matcher.hitEnd is true, the scanner will discard any text before the current point and then read more input into the buffer and try again. If it still hasn’t found anything it will expand the buffer and keep trying. It is possible that the buffer expands to include the entire input, but if it’s successfully finding tokens or delimiters, it will report a stream of matches while keeping only a limited amount of input in the buffer at any given time. Kind of clever, actually, but it’s also really fiddly code.
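This is not Scanner’s actual code, but here’s a rough conceptual sketch of that refill-and-retry idea, using a StringBuilder as the buffer (it’s a CharSequence too, like CharBuffer):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RefillSketch {
    public static void main(String[] args) throws IOException {
        Reader source = new StringReader("one two three four five six seven");
        Pattern word = Pattern.compile("[A-Za-z]+");

        StringBuilder buf = new StringBuilder();   // the window of input held in memory
        char[] chunk = new char[8];                // deliberately tiny reads to force refills
        boolean eof = false;
        Matcher m = word.matcher(buf);

        while (true) {
            m.reset(buf);
            if (m.find() && (!m.hitEnd() || eof)) {
                // The match can't be extended by reading more input, so report it
                // and discard everything up to the end of the match.
                System.out.println(m.group());
                buf.delete(0, m.end());
            } else if (!eof) {
                // No match, or the match touched the end of the buffer:
                // read more input and try again.
                int n = source.read(chunk);
                if (n < 0) { eof = true; } else { buf.append(chunk, 0, n); }
            } else {
                break;   // no more input and no further matches
            }
        }
    }
}
```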
Cool! And clever indeed 🙂 Although it sounds like there’s a waste of CPU cycles when the buffer needs to be expanded. The Matcher presumably has a DFA for the pattern, and would be able to work on a stream (Reader), had only the interface allowed for it. Possible RFE?
I think the presumption is that matches are small compared to the buffer size, so the compact() operation that’s done prior to refilling the buffer is relatively inexpensive. If the scanner needs to keep searching for a match, then yes, the buffer resizing and copying is relatively expensive. But it’s doubled in size each time, which amortizes to linear time for reading the entire input: the copying forms a geometric series whose total is at most about twice the final buffer size.
I’m not sure Scanner could use a Matcher that works on a single-pass input like a Reader. The Scanner is continually resetting the Matcher’s region and retrying. As for the Matcher, its matching engine uses backtracking so it would have to buffer up input from the beginning of a potential match. I don’t know if an NFA could match from something like a Reader.
It sounds like you would like Scanner to be two subclasses, TokenScanner and MatchScanner, instead of the one overloaded (traditional “doing too much” overloaded, not OOP overloaded) class.
I don’t know if those two subclasses are exactly the right factoring, but the diagnosis of Scanner doing “too much” is correct. The nugget of functionality buried inside of Scanner is the ability to match over an input buffer instead of a fixed string. It might have been interesting if that had been exposed. Then, things like token matching, splitting, line reading, etc. could be layered on top of that. To a certain extent you can see that layering inside of Scanner’s implementation, if you look through the source code, but it’s all brought out to the API and the result is something of a jumble.