Java 9 has a lot to offer besides modularity: new language features and a lot of new or improved APIs, GNU-style command options, multi-release JARs, improved logging, and more. Let's explore this "more" and look at performance improvements (many of them thanks to string trickery), the compiler, garbage collection, and JavaDoc.
Performance Improvements
Java becomes more performant from release to release and 9 is no exception. There are a couple of interesting changes targeted at reducing CPU cycles or saving memory.
Compact Strings
When you look at a Java application's heap and rip out all the object headers and pointers that we use to organize state, only raw data remains. What does it consist of? Primitives of course - many, many, many of which are `char`s, lumped together in `char` arrays that back `String` instances. As it turns out, these arrays occupy somewhere between 20% and 30% of an average application's live data (including headers and pointers). Any improvement in this area would be a big win for a huge portion of Java programs! And indeed, there is room for improvement.
A `char` takes up two bytes because it represents a full UTF-16 code unit, but as it turns out, the overwhelming majority of strings only require ISO-8859-1, which needs just a single byte per character. This is huge! With a new representation that only uses a single byte where possible, the memory footprint caused by strings could be cut almost in half. This would reduce memory consumption of average applications by 10% to 15% and also reduce runtime by spending less time collecting garbage.
Of course that's only true if it comes without overhead. Free lunch, anyone? JEP 254 gave it a try...
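To make the one-byte-versus-two-bytes distinction concrete before looking at the implementation, here's a tiny sketch (using nothing but the standard charset APIs) that counts how many bytes the same four characters need in each encoding:

import java.nio.charset.StandardCharsets;

public class EncodingFootprint {

    public static void main(String[] args) {
        // "café" is covered by ISO-8859-1, so one byte per character suffices
        String latin = "café";
        // the Greek "γειά" falls outside ISO-8859-1 and needs full UTF-16 code units
        String greek = "γειά";

        System.out.println(latin.getBytes(StandardCharsets.ISO_8859_1).length); // 4
        System.out.println(greek.getBytes(StandardCharsets.UTF_16LE).length);   // 8
    }
}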
Implementation
In Java 8, `String` has a field `char[] value` - that's the array we just discussed, which holds the string's characters. The idea is to use a `byte` array instead and spend either one or two bytes per character, depending on the required encoding.
This may sound like a case for a variable-width encoding like UTF-8, where the distinction between one and two bytes is made per character. But then there would be no way to predict which array slots a single character occupies, forcing random access (e.g. `charAt(int)`) to perform a linear scan. Degrading random access from constant to linear time was an unacceptable regression.
Instead, either every character of a string can be encoded with a single byte, in which case that is the chosen representation, or, if at least one of them requires two bytes, two bytes are used for all of them. A new field `coder` denotes which of the two encodings applies and many methods in `String` evaluate it to pick the correct code path.
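To illustrate what such a dual code path can look like, here's a simplified sketch of a coder-based character lookup. It is only illustrative, not the actual JDK source; the field names and the byte order are assumptions:

// simplified sketch of coder-based dispatch - illustrative, not the real JDK code
char charAt(int index) {
    if (coder == LATIN1) {
        // one byte per character: the index maps directly to one array slot
        return (char) (value[index] & 0xFF);
    }
    // two bytes per character: combine the two slots that back this character
    int first = value[2 * index] & 0xFF;
    int second = value[2 * index + 1] & 0xFF;
    return (char) ((first << 8) | second);
}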
When a new string is constructed in Java 8, the `char` array is usually created afresh and then populated from the constructor parameters. For example, when `new String(myChars)` is called, `Arrays.copyOf` is used to assign a copy of `myChars` to `value`. This is done to prevent sharing the array with user code and there are only a select few cases where the array is not copied, for example when a string is created from another. So since the `value` array is never shared with code outside of `String`, the refactoring to a `byte` array is safe (yay for encapsulation). And because constructor arguments are copied anyway, transforming them adds no prohibitive overhead.
Here's how that looks:
// this is a simplified version of a String constructor,
// where `char[] value` is the argument
if (COMPACT_STRINGS) {
    byte[] val = StringUTF16.compress(value);
    if (val != null) {
        this.value = val;
        this.coder = LATIN1;
        return;
    }
}
this.coder = UTF16;
this.value = StringUTF16.toBytes(value);
There are a couple of things to note here:
- The boolean flag `COMPACT_STRINGS`, which is the implementation of the command line flag `-XX:-CompactStrings` and with which the entire feature can be disabled.
- The utility class `StringUTF16` is first used to try and compress the `value` array to single bytes and, should that fail and return `null`, to convert it to double bytes instead.
- The `coder` field is assigned the respective constant that marks which case applies.
If you find this topic so interesting that you're still awake at this point, I highly recommend watching Aleksey Shipilev's instructive and entertaining talk on compact strings and indyfied string concatenation with the great subtitle:
Why those [expletive] [expletive] [expletive] cannot do the feature in a month, but spend a year instead?!
Performance
Before we really look at performance, there's a nifty little detail to observe. The JVM 8-byte-aligns objects in memory, which means that when an object takes up less than a multiple of 8 bytes, the rest is wasted. In the JVM's most common configuration, a 64-bit VM with compressed references, a `String` requires 20 bytes (12 for the object header, 4 for the reference to the `value` array, and a final 4 for the cached hash) - which leaves 4 more bytes to squeeze in the `coder` field without adding to the footprint. Nice.
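If you want to verify that layout on your own VM, the OpenJDK JOL tool can print it. This is a minimal sketch and assumes the `org.openjdk.jol:jol-core` dependency on the class path, which is not part of the JDK:

import org.openjdk.jol.info.ClassLayout;

public class StringLayout {

    public static void main(String[] args) {
        // prints header size, field offsets, and alignment padding for String
        System.out.println(ClassLayout.parseClass(String.class).toPrintable());
    }
}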
Compact strings are largely a memory optimization so it would make sense to observe the garbage collector. Trying to make sense of G1's logs was beyond the scope of this post, though, so I focused on runtime performance. This makes sense because if strings require less memory, creating them should also be faster.
To gauge runtime performance I ran this code:
import java.util.List;
import java.util.stream.IntStream;
import static java.util.stream.Collectors.toList;

// generate ten million strings and measure how long that takes
long launchTime = System.currentTimeMillis();
List<String> strings = IntStream.rangeClosed(1, 10_000_000)
        .mapToObj(Integer::toString)
        .collect(toList());
long runTime = System.currentTimeMillis() - launchTime;
System.out.println("Generated " + strings.size() + " strings in " + runTime + " ms.");

// naively concatenate the first 100'000 of them and measure again
launchTime = System.currentTimeMillis();
String appended = strings.stream()
        .limit(100_000)
        .reduce("", (left, right) -> left + right);
runTime = System.currentTimeMillis() - launchTime;
System.out.println("Created string of length " + appended.length() + " in " + runTime + " ms.");
First it creates a list of ten million strings, then it concatenates the first 100'000 of them in a spectacularly naive way. And indeed, running the code either with compact strings (which is the default on Java 9) or without (with `-XX:-CompactStrings`), I observed a considerable difference:
# with compact strings
Generated 10000000 strings in 1044 ms.
Created string of length 488895 in 3244 ms.
# without compact strings
Generated 10000000 strings in 1075 ms.
Created string of length 488895 in 7005 ms.
Now, whenever somebody talks about microbenchmarks like this, you should immediately mistrust them if they don't use JMH. But in this case I didn't want to go through the potential trouble of running JMH with Java 9, so I took the easy way out. This means that the results could be total rubbish because some optimization or other screwed me over. Hence take the results with a truckload of salt and see them as a first indication rather than proof of improved performance.
But you don't have to trust me. In the talk linked above Aleksey shows his measurements, starting at 36:30, citing 1.36x better throughput and 45 % less garbage.
Indified String Concatenation
Quick recap of how string concatenation works... Say you write the following:
String s = greeting + ", " + place + "!";
Then the compiler will create bytecode that uses a `StringBuilder` to create `s` by first appending the individual parts and then calling `toString` to get the result. At runtime, the JIT compiler may recognize these append chains and, if it does, it can boost performance considerably: it generates code that checks the arguments' lengths, creates an array of the correct size, copies the characters straight into that array, and, et voilà, wraps it in a String.
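The compiler's half of that story, roughly desugared back to source level, looks something like this (a sketch of the generated append chain, not the exact bytecode):

// roughly what the compiler emits for the concatenation before Java 9
String s = new StringBuilder()
        .append(greeting)
        .append(", ")
        .append(place)
        .append("!")
        .toString();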
That JIT-generated code is as good as it gets, but recognizing these append chains and proving that they can be replaced with the optimized code is not trivial and breaks down quickly. Apparently all you need is a `long` or a `double` in that concatenation and the JIT will not be able to optimize it.
But why so much effort? Why not just have a method `String.concat(String... args)` that the bytecode calls? Because creating a varargs array on a performance-critical path is not the best idea. Also, primitives don't really go well with that unless you `toString` all of them beforehand, which in turn prevents stringifying them straight into the target array. And don't even think about `String.concat(Object... args)`, which would box every primitive.
So another solution is needed to get better performance. The next best thing is to let `javac` emit better bytecode, but that has drawbacks as well:
- Every time a new optimization is implemented, the bytecode changes again.
- For users to profit from these optimizations, they have to recompile their code - something Java generally avoids if feasible.
- Since all JVMs should be able to JIT compile all variants, the testing matrix explodes.
So what else can be done? Maybe an abstraction is missing here? Can't the bytecode just declare the intent of "concat these things" and let the JVM handle the rest?
Yes, this is pretty much the solution employed by JEP 280 - at least for the former part. Thanks to the magic of `invokedynamic`, the bytecode can express the intent and the arguments (without boxing), but the JVM does not have to provide that functionality itself and can instead route back into the JDK for an implementation. This is great because within the JDK all kinds of private APIs can be used for various tricks (`javac` can only use public APIs).
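To get a feel for the machinery behind that invokedynamic call site, here's a sketch that links a concatenation by hand via `StringConcatFactory`, the factory the Java 9 bytecode delegates to. You would normally never call it yourself; the literal strings are just example values:

import java.lang.invoke.CallSite;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.StringConcatFactory;

public class IndyConcat {

    public static void main(String[] args) throws Throwable {
        // in the recipe, each 0x01 placeholder character marks a dynamic argument;
        // all other characters are constants baked into the call site
        CallSite concat = StringConcatFactory.makeConcatWithConstants(
                MethodHandles.lookup(),
                "concat", // arbitrary name, has no meaning for this factory
                MethodType.methodType(String.class, String.class, String.class),
                "\u0001, \u0001!");
        // invoking the call site behaves like `greeting + ", " + place + "!"`
        String s = (String) concat.getTarget().invokeExact("Hello", "Weird World");
        System.out.println(s); // Hello, Weird World!
    }
}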
Let me once again refer you to Aleksey's talk - the second half, starting at 37:58, covers this part. It also contains some numbers, which show a speed-up of up to 2.6x and up to 70 % less garbage - and this is without compact strings!
Another Mixed Bag
There's another string-related improvement, but this one I didn't quite get. As I understand it, different JVM processes can share loaded classes via class-data sharing (CDS) archives. In these archives, strings in the class data (more precisely, in the constant pool) are represented as UTF-8 and turned into `String` instances on demand. The memory footprint can be reduced by not always creating new instances but sharing them across different JVMs. For the garbage collector to cooperate with this mechanism, it needs to provide a feature called pinned regions, which only G1 has. This understanding seems to clash with the JEP's title Store Interned Strings in CDS Archives, so if this interests you, you should take a look for yourself. (JEP 250)
A basic building block of Java concurrency is the monitor: each object has one and each monitor can be owned by at most one thread at a time. For a thread to gain ownership of a monitor, it must either call a `synchronized` method declared by that object or enter a `synchronized` block that synchronizes on the object. If several threads try to do that at the same time, all but one are placed in a wait set and the monitor is said to be contended, which creates a performance bottleneck. For one, the application itself wastes time waiting, but on top of that, the JVM has to do some work orchestrating the lock contention and choosing a new thread once the monitor becomes available again. This orchestration by the JVM has been refined, which should improve performance in highly contended code. (JEP 143)
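As a reminder of what such contention looks like, here's a minimal example in which eight threads compete for the same monitor. There is nothing Java 9 specific about it; it is simply the kind of orchestration JEP 143 makes cheaper:

public class ContendedCounter {

    private final Object lock = new Object();
    private long count = 0;

    private void increment() {
        // every thread synchronizes on the same object, so its monitor becomes contended
        synchronized (lock) {
            count++;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ContendedCounter counter = new ContendedCounter();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 1_000_000; j++) {
                    counter.increment();
                }
            });
            threads[i].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        System.out.println(counter.count); // 8000000
    }
}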
In Java 2D, all anti-aliasing (except for fonts) is performed by a so-called rasterizer. This is an internal subsystem with no API available to Java developers, but it lies on the hot path and its performance is crucial for many graphics-intensive applications. OpenJDK uses Pisces, Oracle JDK uses Ductus, and the former shows much poorer performance than the latter. Pisces is now being replaced with the Marlin graphics renderer, which promises superior performance at the same quality and accuracy. It is likely that Marlin will match Ductus in terms of quality, accuracy, and single-threaded performance and even surpass it in multi-threaded scenarios. (JEP 265, some history and context)
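While the rasterizer itself has no public API, it is exercised by ordinary anti-aliased shape rendering. A small sketch of the kind of Java 2D code whose hot path runs through it:

import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.geom.Ellipse2D;
import java.awt.image.BufferedImage;

public class AntiAliasedShape {

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(200, 200, BufferedImage.TYPE_INT_ARGB);
        Graphics2D g = image.createGraphics();
        // turning on shape anti-aliasing routes the fill through the rasterizer
        g.setRenderingHint(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_ON);
        g.fill(new Ellipse2D.Double(10, 10, 180, 180));
        g.dispose();
    }
}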
Anecdotal evidence suggests that running an application with an active security manager degrades performance by 10 % to 15 %. An effort was undertaken to reduce this gap with various small optimizations. (JEP 232)
SPARC and Intel CPUs have recently introduced instructions that are well suited to cryptographic operations. These are now used to improve the performance of GHASH and RSA computations. (JEP 246)
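These intrinsics kick in transparently behind the standard JCA APIs, so regular code benefits without any changes. A hedged sketch of the kind of calls that profit (RSA directly, and AES-GCM via its GHASH authentication step):

import java.security.KeyPair;
import java.security.KeyPairGenerator;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;

public class AcceleratedCrypto {

    public static void main(String[] args) throws Exception {
        // RSA computations benefit from the new big-number intrinsics
        KeyPairGenerator rsaGenerator = KeyPairGenerator.getInstance("RSA");
        rsaGenerator.initialize(2048);
        KeyPair keys = rsaGenerator.generateKeyPair();
        Cipher rsa = Cipher.getInstance("RSA");
        rsa.init(Cipher.ENCRYPT_MODE, keys.getPublic());
        byte[] encrypted = rsa.doFinal("secret".getBytes());

        // AES-GCM authenticates with GHASH, which the new intrinsic speeds up
        KeyGenerator aesGenerator = KeyGenerator.getInstance("AES");
        aesGenerator.init(128);
        Cipher gcm = Cipher.getInstance("AES/GCM/NoPadding");
        gcm.init(Cipher.ENCRYPT_MODE, aesGenerator.generateKey());
        byte[] sealed = gcm.doFinal("secret".getBytes());

        System.out.println(encrypted.length + " / " + sealed.length);
    }
}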