Sunday 14 April 2013

How can JVM jitter get caused by a for loop with no object allocations?

It is not uncommon for Java programs to pause for periods of time, sometimes long periods. Garbage Collection is usually the culprit, and there is a lot of documentation on its effects. However it is not the only cause, and while benchmarking I/O performance on Java I hit an unusual case where summing up the contents of an array of bytes caused every other Java thread to block for 2.5 seconds, while the summing thread itself was neither blocked nor slowed down. Very curious, and it highlights that the side effects between Java threads can be very subtle and non-obvious. For this blog post I have documented how I performed the test, and how I tracked down the offender. Along the way I cover some aspects of how Hotspot works its magic, including compilation of Java code and when it will halt the execution of all Java threads.

The Problem

Here is the offending code. It consists of two loops that together sum up the contents of an array of bytes 'n' times. There are no obvious causes for blocking other JVM threads here: it does not synchronise, or use volatile, CAS, wait, or notify. In fact the data structures are all created on one thread and accessed only from that same thread; nothing is shared.

 
public long invoke( int targetCount ) {
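    // 'array' is a byte[] instance field with 'arraySize' elements, initialised by the benchmark's setup (not shown)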
    long sum = 0;
    for ( int i=0; i<targetCount; i++ ) {
        for ( int j=0; j<arraySize; j++ ) {
            sum += array[j];
        }
    }

    return sum;
}
This code runs on my laptop in 2.4 seconds, averaging 2.5 array reads and summations every nanosecond. This benchmark did not become interesting until I decided to also measure the JVM jitter. My end goal was to benchmark disk IO, and JVM jitter is a way to see the side effects of overloading the IO system and kernel (say by memory mapping too much data); however it showed something very interesting in the totally in-memory case that I had not expected. The jitter thread got blocked for 2.5s; it was a one-off block that did not repeat within the same run, but it did occur at the same point of every run. Why would a for loop cause other threads to block? EVER?

Measuring JVM Jitter

Perhaps the problem was with how I measured the jitter. To measure it, I started a background daemon thread that repeatedly sleeps for 1ms and then measures how long after that 1ms period it actually took to wake up. It records the maximum delay witnessed so far and repeats. Java is well known for having limitations in how it reads time, and most thread schedulers are optimised for throughput over accuracy, so it is expected that the thread will almost always wake up some time after the 1ms period. The question is, how long after. I hope that you agree that this code works well and does what it says on the tin.

import java.util.concurrent.atomic.AtomicLong;

public class MeasureJitter extends Thread {
    // Worst oversleep (in nanoseconds) witnessed since the last reset; read from other threads.
    private final AtomicLong maxJitterWitnessedNS = new AtomicLong(0);

    public MeasureJitter() {
        setDaemon( true );
    }

    public void reset() {
        maxJitterWitnessedNS.set( 0 );
    }

    public double getMaxJitterMillis() {
        return maxJitterWitnessedNS.get()/1000000.0;
    }

    public void printMaxJitterMillis() {
        System.out.println( "getMaxJitterMillis() = " + getMaxJitterMillis() );
    }

    @Override
    public void run() {
        super.run();

        long preSleepNS = System.nanoTime();
        while( true ) {
            try {
                Thread.sleep( 1 );
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            long wakeupNS = System.nanoTime();
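            // Jitter = how far past the requested 1ms sleep the thread actually woke up.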
            long jitterNS = Math.max(0, wakeupNS - (preSleepNS+1000000));

            long max = Math.max( maxJitterWitnessedNS.get(), jitterNS );
            maxJitterWitnessedNS.lazySet( max );

            preSleepNS = wakeupNS;
        }
    }
}
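For context, here is a minimal sketch of how the monitor can be wired around a benchmark run. The benchmark class name is taken from the PrintCompilation output shown later in this post; its no-argument constructor, the warm-up step and the iteration counts are assumptions for illustration only.

MeasureJitter jitterThread = new MeasureJitter();
jitterThread.start();

// Warm up first, then reset so that start-up noise is excluded (the warm-up protocol here is an assumption).
UncontendedByteArrayReadBM bm = new UncontendedByteArrayReadBM();
bm.invoke( 10 );
jitterThread.reset();

long sum = bm.invoke( 1000 );
System.out.println( sum );

jitterThread.printMaxJitterMillis();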
It turns out that on my system the normal latency was 0.4ms, however it always jumped to 2.5s at the same point of the test run. Every time. That is a jump from 0.4ms to 2500ms, a huge jump. When benchmarking Java one always skips the first few values, as they are quite erratic and slow; however this piqued my interest because, while the background jitter-measuring thread was delayed by 2.5 seconds, the throughput of the array summing was not slowed at all. So the question was, why? And in order to understand it, I had to peel back the layers of the JVM.

Java and Stop the World events

There are times when Java has to halt all threads. The most famous of these is Garbage Collection. A recorded statistic for Java is an average of 1 second of halting all Java threads per 1-2GB of heap space; this statistic is obviously highly variable. In this case the heap space was 500MB, there were no objects being allocated, and the main work thread was not stopped; it was the background daemon thread that was seeing the jitter. So this did not sound like GC, but just to be sure I added -XX:+PrintGCDetails to the JVM start-up command and confirmed that there was indeed no GC going on.

My next suspicion was that the daemon thread was not getting scheduled for some reason. However the Unix thread scheduler is very efficient, and while it may assign the jitter thread to the same CPU core that was doing the work, that would not block it from running for two and a half seconds. I would have expected either 1) another core to have stolen the work before 2.5 seconds was up, or 2) the busy core to have been paused and the other thread run (this is called preemption). In either case the result should have been variable, with variations in the jitter time and, where time was stolen from the worker thread, a slowdown of the worker thread. None of this was the case, so my suspicion fell upon Hotspot itself. Was Hotspot blocking other threads as a side effect of generating native assembler code?
Alexey Ragozin reports that the Oracle Hotspot implementation will perform a request to stop all threads when:
  • Garbage collection pauses (depends on the GC algorithm being used)
  • Code deoptimization (in response to new information from newly loaded classes, null pointer exceptions, division by zero traps and so forth)
  • Flushing code cache
  • Class redefinition (e.g. hot swap or instrumentation)
  • Biased lock revocation (and as we learned in this post, turning the BiasedLock mechanism on)
  • Various debug operations (e.g. deadlock check or stacktrace dump)

Hotspot Compilation on modern JVMs

The latest Oracle VMs use tiered compilation. That is, they start executing Java bytecode via an interpreter, which is a flashback to Java's origins, back before the days of JIT and Hotspot, when Java had a deserved reputation for being very slow. Very, very slow ;) The VM still starts out interpreting code so that it can start up faster, and so that it can gather statistics to inform what it should compile/optimise and how. Every method call has a counter associated with it, as does the back edge of every for loop that has not been unrolled (including nested for loops). When the sum of these two counters exceeds a threshold, Hotspot schedules a request to compile the method. By default this compilation happens in the background, and the compiled version replaces the interpreted version of the method when it is next called.
For methods that take a very long time to complete, or perhaps never complete because they stay within a loop, the JVM is also capable of compiling the method and swapping the new implementation in while the method is still running, at a safe point within the loop. HotSpot calls this 'On Stack Replacement', or OSR for short. My assumption had always been that this hot replacement would only slow down the thread executing the code, and that in most cases not even that thread should suffer.
Hotspot's compilation can be monitored by adding the following JVM flag: '-XX:+PrintCompilation'
This flag resulted in the following output:
2632   2%      com.mosaic.benchmark.datastructures.array.UncontendedByteArrayReadBM::invoke @ 14 (65 bytes)
6709   2%     made not entrant  com.mosaic.benchmark.datastructures.array.UncontendedByteArrayReadBM::invoke @ -2 (65 bytes)
The format for this output is documented here. In short, it says that the method holding the for loops above was compiled to native code and swapped into the running invocation via OSR (the '%' flag), and that 4.1 seconds later the optimised version of the code was flagged as sub-optimal and scheduled for replacement (made not entrant). Was this the cause? According to the documentation of this output, marking a method not entrant does not stop the invocation that is currently running; it only prevents new calls from entering that version of the code. At some point the JVM will swap the running method over to a newly compiled version and mark the old version as a zombie, ready to be cleaned out. Could this be blocking other threads? I'd hope not, as events that invalidate optimised code are fairly common in Java (a typical trigger is sketched below). Java is cited as having cases where it can run faster than statically compiled C code, precisely because it can look at how the code is really being used at run time and re-optimise it on the fly; causing big latencies here would kill that selling point of Java. So the question is, how do we get a smoking gun that proves that this is the problem?
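As an aside, here is a minimal, hypothetical sketch (not taken from the benchmark) of the kind of event that commonly makes compiled code not entrant: HotSpot compiles a hot call site assuming that only one implementation of an interface has ever been loaded, and has to throw that compiled code away when a second implementation turns up.

interface Shape { double area(); }

class Circle implements Shape {
    public double area() { return Math.PI * 2 * 2; }
}

class Square implements Shape {
    public double area() { return 2.0 * 2.0; }
}

public class DeoptExample {
    static double area( Shape s ) {
        return s.area();   // hot call site; compiled assuming Circle is the only Shape loaded
    }

    public static void main( String[] args ) {
        double total = 0;
        Shape circle = new Circle();
        for ( int i=0; i<1000000; i++ ) {
            total += area( circle );        // warms the call site up until it gets compiled
        }
        total += area( new Square() );      // loading Square invalidates the single-implementation
                                            // assumption and typically shows up as 'made not entrant'
                                            // when run with -XX:+PrintCompilation
        System.out.println( total );
    }
}

Deoptimisations like this are routine, which is part of why it seemed unlikely to explain a 2.5 second pause on its own.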


Debugging Hotspot Safe Points

The JVM cannot just stop all threads instantly. When it wants to perform an exclusive change it has to register its desire to stop all threads and then wait for every thread to stop. Threads only check whether they need to pause at 'safe points', which have been nicely documented by Alexey Ragozin in his blog post titled Safepoints in HotSpot JVM. The safepoints are at JNI boundaries (enter or exit) and around Java method calls, which means a tight loop that never calls out to another method can run for a long time without ever reaching one.
To see when Java is halting the world, add the following flag to the command line: '-XX:+PrintGCApplicationStoppedTime'. Despite its misleading name, this option prints out the time for which application threads were stopped at a safepoint; it is not just about GC time.
For me, this flag resulted in the following smoking gun.

Total time for which application threads were stopped: 2.5880809 seconds
The length of this pause matched the jitter being reported. To understand it further, the following two JVM flags can be used: '-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1'.
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
4.144: EnableBiasedLocking              [      10          1              1    ]      [  2678     0  2678     0     0    ]  0   
Total time for which application threads were stopped: 2.6788891 seconds
This output tells us that the JVM waited 2.678 seconds for all threads to stop, and that the reason for wanting to stop the world was to enable biased locking. Biased locking, as it turns out, is an optimisation for locks that are usually acquired repeatedly by the same thread (but could be acquired by other threads). It reduces the cost of the same thread re-acquiring the same lock, at the expense of increasing the cost for any other thread that later wants to acquire it. This, then, is our smoking gun.
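As a quick illustration (not code from the benchmark), this is the pattern biased locking is designed to speed up: one thread repeatedly acquiring a monitor that no other thread contends for.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BiasedLockFriendly {
    public static void main( String[] args ) {
        // Every add() acquires and releases the same, uncontended monitor. With biased
        // locking that lock is biased towards this thread, making re-acquisition almost
        // free; revoking the bias later, if another thread ever wants the lock, is one
        // of the stop-the-world operations listed earlier in this post.
        List<Integer> list = Collections.synchronizedList( new ArrayList<Integer>() );
        for ( int i=0; i<1000000; i++ ) {
            list.add( i );
        }
        System.out.println( list.size() );
    }
}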

It turns out that during start up of the JVM, Oracle keeps biased locking disabled for the first four seconds. This is because there is a lot of lock contention inside the JVM during this startup window, so they delay turning biased locking on; and turning it on is a Stop the World event. That stop-the-world request then had to wait for a safe point to be reached, and it so happens that the running thread had no safe point within its for loops, which took a little while to run to completion.
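Two possible mitigations follow from this; neither was benchmarked in the original run, so treat the sketch below as an assumption rather than a measured fix. The biased-locking switch-on can be moved or disabled with HotSpot's -XX:BiasedLockingStartupDelay=0 or -XX:-UseBiasedLocking flags, or the hot loop itself can be given a safepoint poll. A commonly reported trick for the latter is to index the inner loop with a long, which stops HotSpot treating it as a simple counted loop and so keeps a safepoint check on the loop's back edge.

// Hypothetical variant of the benchmark loop: with a long inner index HotSpot is
// commonly reported to keep a safepoint poll on the back edge, so a stop-the-world
// request no longer has to wait for the whole summation to finish. Unmeasured sketch.
public long invokeWithSafepointPolls( int targetCount ) {
    long sum = 0;
    for ( int i=0; i<targetCount; i++ ) {
        for ( long j=0; j<arraySize; j++ ) {
            sum += array[(int) j];
        }
    }
    return sum;
}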

Conclusions

JVM jitter occurs any time the JVM wants to stop all threads so that it can gain exclusive access to its internal data structures. How long this jitter lasts depends on how long it takes to stop all of the threads, and this in turn depends on when each thread will reach its next safe point. Calling native methods that take a long time to finish, or running long loops that do not call out to any other method, can result in very long delays. We saw a 2.5s delay here due to a for loop iterating over an array of bytes. Admittedly this is a contrived micro benchmark; however Gil Tene, CTO of Azul Systems, has reported seeing delays like this of 10s or more in production systems. The 10s example that he gave was from a Clojure program that involved a large memory copy.
