- Java 7: interned Strings moved from PermGen to the heap
- Java 8: moves PermGen out of the heap into a native memory region called 'Metaspace', which grows automatically in size by default
Friday, 10 May 2013
Notes on Java 8's removal of PermSpace
The details were posted to the openjdk mailing list, linked below. The short version is that PermSpace is being renamed ;)
http://mail.openjdk.java.net/pipermail/hotspot-dev/2012-September/006679.html
Labels:
java
Sunday, 14 April 2013
How can JVM jitter get caused by a for loop with no object allocations?
It is not uncommon for Java programs to pause for periods of time, sometimes long periods. Garbage Collection is usually the culprit, and there is a lot of documentation on its effects. However it is not the only cause, and while benchmarking I/O performance on Java I hit an unusual case where summing the contents of an array of bytes caused every other Java thread to block for 2.5 seconds, while itself neither blocking nor slowing down. Very curious, and it highlights that the side effects between Java threads can be very subtle and non-obvious. For this blog post I have documented how I performed the test, and how I tracked down the offender. Along the way I cover some aspects of how Hotspot works its magic, including the compilation of Java code and when it will halt the execution of all Java threads.
The Problem
Here is the offending code. It consists of two loops that together sum the contents of an array of bytes 'n' times. There are no obvious causes for blocking other JVM threads here: it does not synchronise, or use volatile, CAS, wait, or notify. In fact the data structures are all created on one thread and accessed from that one thread, without any sharing.
public long invoke( int targetCount ) {
    long sum = 0;

    for ( int i=0; i<targetCount; i++ ) {
        for ( int j=0; j<arraySize; j++ ) {
            sum += array[j];
        }
    }

    return sum;
}
This code runs on my laptop in 2.4 seconds, averaging 2.5 array reads and summations every nanosecond. The benchmark did not become interesting until I decided to also measure JVM jitter. My end goal was to benchmark disk IO, and JVM jitter is a way to see the side effects of overloading the IO system and kernel (say by memory mapping too much data); however it showed something very interesting in the totally in-memory case that I had not expected. The jitter thread was blocked for 2.5s; it was a one-off block that did not repeat within the same run, but it did occur at the same point of every run. Why would a for loop cause other threads to block? Ever?
Measuring JVM Jitter
Perhaps the problem was with how I measured the jitter. To measure it I started a background daemon thread that repeatedly sleeps for 1ms and then measures how long after the 1ms period it actually took to wake up. It then stores the max length of blocked time and repeats. Java is well known for having limitations in its reading of time, and most thread schedulers are optimised for speed over accuracy, thus it is expected that the thread will pretty much always wake up some time after the 1ms period. The question is, how long after. I hope that you agree that this code works well and does what it says on the tin.

public class MeasureJitter extends Thread {
    private AtomicLong maxJitterWitnessedNS = new AtomicLong(0);

    public MeasureJitter() {
        setDaemon( true );
    }

    public void reset() {
        maxJitterWitnessedNS.set( 0 );
    }

    public double getMaxJitterMillis() {
        return maxJitterWitnessedNS.get()/1000000.0;
    }

    public void printMaxJitterMillis() {
        System.out.println( "getMaxJitterMillis() = " + getMaxJitterMillis() );
    }

    @Override
    public void run() {
        super.run();

        long preSleepNS = System.nanoTime();
        while( true ) {
            try {
                Thread.sleep( 1 );
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            long wakeupNS = System.nanoTime();
            long jitterNS = Math.max(0, wakeupNS - (preSleepNS+1000000));

            long max = Math.max( maxJitterWitnessedNS.get(), jitterNS );
            maxJitterWitnessedNS.lazySet( max );

            preSleepNS = wakeupNS;
        }
    }
}
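For completeness, here is a minimal, self-contained sketch (class and method names are mine, not from the post) of how such a jitter thread gets used: start it as a daemon, let the workload run, then read off the worst observed oversleep. It uses a lambda rather than subclassing Thread, but it measures the same thing.

```java
import java.util.concurrent.atomic.AtomicLong;

public class JitterDemo {

    // Runs a daemon jitter thread for roughly windowMillis and returns the
    // worst observed oversleep, in milliseconds.
    public static double measureMaxJitterMillis( long windowMillis ) throws InterruptedException {
        AtomicLong maxJitterNS = new AtomicLong(0);

        Thread jitterThread = new Thread(() -> {
            long preSleepNS = System.nanoTime();
            while ( true ) {
                try {
                    Thread.sleep( 1 );
                } catch ( InterruptedException e ) {
                    return;   // measurement window is over
                }

                long wakeupNS = System.nanoTime();
                long jitterNS = Math.max( 0, wakeupNS - (preSleepNS + 1_000_000) );

                maxJitterNS.accumulateAndGet( jitterNS, Math::max );
                preSleepNS = wakeupNS;
            }
        });
        jitterThread.setDaemon( true );
        jitterThread.start();

        Thread.sleep( windowMillis );   // the workload being measured would run here
        jitterThread.interrupt();

        return maxJitterNS.get() / 1_000_000.0;
    }

    public static void main( String[] args ) throws InterruptedException {
        System.out.println( "max jitter (ms) = " + measureMaxJitterMillis( 50 ) );
    }
}
```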
It turns out that on my system the normal latency was 0.4ms, however it always jumped to 2.5s at the same point of the test run. Every time. That is a jump from 0.4ms to 2500ms, a huge jump. When benchmarking Java, one always skips the first values as they are quite erratic and slow. However this piqued my interest, because while the background jitter-measuring thread was delayed by 2.5 seconds, the throughput of the array summing was not slowed at all. So the question was, why? And in order to understand it, I had to peel back the layers of the JVM.
Java and Stop the World events
There are times when Java has to halt all threads. The most famous of these is Garbage Collection. A commonly recorded statistic for Java is an average of 1 second of halting all Java threads per 1-2GB of heap space, though this figure is obviously highly variable. In this case the heap space was 500MB, there were no objects being allocated, and the main work thread was not stopped; it was the background daemon thread that was getting the jitter. So this did not sound like GC, but just to be sure I added -XX:+PrintGCDetails to the JVM start up command and confirmed that there was indeed no GC going on. My next suspicion was that the daemon thread was not getting scheduled for some reason. However the Unix thread schedulers are very efficient, and while the scheduler may assign the jitter thread to the same CPU core that was doing the work, that would not block it from being run for two and a half seconds. I would have expected either 1) another core to have stolen the work before 2.5 seconds was up, or 2) the busy core to have been paused and the other thread run (this is called preemption). In either case the result should have been variable, with variations in the jitter time and, in the case of stealing time from the worker thread, a slowing down of the worker thread. None of this was the case, so my suspicion fell upon Hotspot itself. Was Hotspot blocking other threads as a side effect of generating native assembler code?

Alexey Ragozin reports that the Oracle Hotspot implementation will perform a request to stop all threads when:
- Garbage collection pauses (depends on the GC algorithm being used)
- Code deoptimization (in response to new information from newly loaded classes, null pointer exceptions, division by zero traps and so forth)
- Flushing code cache
- Class redefinition (e.g. hot swap or instrumentation)
- Biased lock revocation (and as we learned in this post, turning the BiasedLock mechanism on)
- Various debug operation (e.g. deadlock check or stacktrace dump)
Hotspot Compilation on modern JVMs
The latest Oracle VMs use tiered compilation. That is, they start executing Java byte codes via an interpreter, which is a flashback to Java's origins, back before the days of JIT and Hotspot, to a time when Java had a deserved reputation for being very slow. Very very slow ;) The VM still executes code via the interpreter so that it can start up faster, and so that it can gather statistics to inform what it should compile/optimise and how. Every method call has a counter associated with it, as does the back edge of every for loop that has not been unrolled (including nested for loops). When the sum of these two counters exceeds a threshold, Hotspot will schedule a request to compile the method. By default this compilation happens in the background, and the compiled code then replaces the interpreted version of the method when it is next called.

For methods that take a very long time to complete, or perhaps never complete because they stay within a loop, the JVM is capable of optimising the method from a safe point in the loop. It can then swap in the new implementation when the method is at that same point. HotSpot calls this 'On Stack Replacement', or OSR for short. Now my assumption has always been that this hot replacement would only slow down the thread that was executing the code, and in most cases not even that thread should suffer.
Hotspot's compilation can be monitored by adding the following JVM flag: '-XX:+PrintCompilation'
This flag resulted in the following output:
2632 2% com.mosaic.benchmark.datastructures.array.UncontendedByteArrayReadBM::invoke @ 14 (65 bytes)
6709 2% made not entrant com.mosaic.benchmark.datastructures.array.UncontendedByteArrayReadBM::invoke @ -2 (65 bytes)

The format of this output is documented here. In short it says that the method that held the for loops above was compiled to native code and swapped in to the running method via OSR, and then 4.1 seconds later the optimised version of the code was flagged as sub-optimal and scheduled to be replaced (made not entrant). Was this the cause? According to the documentation of this output, marking a method not entrant does not stop an invocation that is currently running. At some point the JVM will swap the method over to a newly compiled version and mark the old version as a zombie, ready to be cleaned out; marking it not entrant is only designed to stop any new invocations from entering that version of the code. Could this be blocking other threads? I'd hope not, as events that invalidate optimised code are fairly common in Java. Java is cited as having cases where it can run faster than statically compiled C code; this happens because Java can look at how the code is really being used at run time and re-optimise it on the fly. Causing big latencies here would kill that selling point of Java. So the question is, how do we get a smoking gun that proves that this is the problem?
Debugging Hotspot Safe Points
The JVM cannot just stop all threads instantly. When the JVM wants to perform an exclusive change it has to register its desire to stop all threads, and then it has to wait for the threads to stop. Threads only check the need to pause at 'safe points', which have been nicely documented by Alexey Ragozin in his blog post titled Safepoints in HotSpot JVM. The safepoints are on JNI boundaries (enter or exit), and on calling a Java method.

To see when Java is halting the world, add the following flag to the command line: '-XX:+PrintGCApplicationStoppedTime'. Despite its misleading name, this option prints out the time that the JVM sits waiting for all threads to stop. It is not just about GC time.
For me, this flag resulted in the following smoking gun.
Total time for which application threads were stopped: 2.5880809 seconds

The pause event was identical to the jitter being reported. To understand this further the following two JVM flags have to be used: '-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1'.
vmop [threads: total initially_running wait_to_block] [time: spin block sync cleanup vmop] page_trap_count
4.144: EnableBiasedLocking [ 10 1 1 ] [ 2678 0 2678 0 0 ] 0
Total time for which application threads were stopped: 2.6788891 seconds

This output tells us that the JVM waited 2.678 seconds to stop the world, and that the reason for wanting to stop the world was to enable biased locking. Biased locking, as it turns out, is an optimisation for locks that are usually only acquired by the same thread (but could be acquired by others). It reduces the cost of the same thread re-acquiring the same lock, at the expense of increasing the cost for any other thread that wants to acquire it. This then is our smoking gun.
It turns out that during start up of the JVM, Oracle keeps biased locking disabled for the first four seconds. This is because there is a lot of lock contention within the JVM during this startup window, so they delay turning the feature on; and turning it on is a stop-the-world event. The stop-the-world event then had to wait for a safe point to be reached, and it so happens that the running thread had no safe point within its for loop, which took a little while to run to completion.
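A commonly cited workaround (my sketch, not from the post): HotSpot omits safepoint polls on the back edge of int-counted loops, so switching the loop counter to a long reintroduces a poll inside the loop and lets a stop-the-world request proceed without waiting for the whole loop to finish, at some cost to loop optimisation. The class and names below are mine.

```java
// Hypothetical reworking of the benchmark loop from this post: a long
// induction variable means the loop is no longer a 'counted' int loop,
// so HotSpot keeps a safepoint poll on its back edge.
public class SafepointFriendlySum {

    private final byte[] array;

    public SafepointFriendlySum( byte[] array ) {
        this.array = array;
    }

    public long invoke( int targetCount ) {
        long sum = 0;

        for ( long i = 0; i < targetCount; i++ ) {   // long counter: back edge keeps a safepoint poll
            for ( int j = 0; j < array.length; j++ ) {
                sum += array[j];
            }
        }

        return sum;
    }
}
```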
Conclusions
JVM jitter occurs any time the JVM wants to stop all threads so that it can gain exclusive access to its internal data structures. How long this jitter lasts depends on how long it takes to stop all of the threads, and this in turn depends on when each thread will reach its next safe point. Calling native methods that take a long time to finish, or using long loops that do not call out to any other method, can result in very long delays. We saw a 2.5s delay here due to a for loop iterating over an array of bytes. Admittedly this is a contrived micro benchmark, however Gil Tene, CTO of Azul Systems, has reported seeing delays like this of 10s or more in production systems. The 10s example that he gave was from a Clojure program that involved a large memory copy.
Labels:
hotspot,
java,
microbenchmarks,
tuning
Sunday, 24 February 2013
Are Java Assertions Enabled?
A quick tip if you ever need to test whether assertions are enabled in Java, without having to throw and catch an assertion exception.
boolean ASSERTIONS_ENABLED = false;
assert ASSERTIONS_ENABLED = true;
This code sets ASSERTIONS_ENABLED to false and then, if and only if assertions are enabled, runs the assert expression. The assert expression sets ASSERTIONS_ENABLED to true and evaluates to true, so no exception is thrown.
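Wrapped into a helper method (the class and method names below are mine, not from the post), the idiom looks like this, and is safe to call whether or not the JVM was started with -ea:

```java
public class AssertionStatus {

    // Returns true iff assertions are enabled for this class.
    public static boolean assertionsEnabled() {
        boolean enabled = false;
        assert enabled = true;   // the assignment only runs when assertions are on
        return enabled;
    }

    public static void main( String[] args ) {
        System.out.println( "assertions enabled: " + assertionsEnabled() );
    }
}
```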
Labels:
java,
unit testing
Sunday, 29 July 2012
A neat tool for detecting Memory Leaks from within a Unit Test (Java)
On the train back from work today I crafted an implementation of a HashWheel (a very cool algorithm for scheduling work), and while planning my unit tests I wanted to verify that when a task had fired or been cancelled the HashWheel released the job; that is, no memory leak. So I knocked up the following few lines, which I thought were cool enough to share. The best bit was that it immediately found a memory leak that I had missed! Result! :)
I also added it to a github repository where you can find the full source code, take a look at TestTools.java under the public github repository threadtesting-mosaic.
The tool waits for up to three seconds (usually less than 40 milliseconds on my laptop; if it is going to pass that is) before throwing an exception.
To use the tool, just instantiate a WeakReference and call TestTools.spinUntilReleased( ref ).
public static void spinUntilReleased( final Reference ref ) {
    Runtime.getRuntime().gc();

    spinUntilTrue( new Predicate() {
        public boolean eval() {
            return ref.get() == null;
        }
    });
}
Here is an example unit test that makes use of this tool:
@Test
public void scheduleTask_cancelViaTicketAndLetGoOfTask_expectTaskToBeGCd() {
    MyTask task1 = new MyTask();
    HashWheel.Ticket ticket = hashWheel.register( 120, task1 );

    ticket.cancel();

    final Reference ref = new WeakReference( task1 );
    task1 = null;

    TestTools.spinUntilReleased( ref );
}
For completeness, here is the implementation of spinUntilTrue. Which can also be found in the threadtesting-mosaic repository.
public static void spinUntilTrue( Predicate predicate ) {
    spinUntilTrue( 3000, predicate );
}

public static void spinUntilTrue( long timeoutMillis, Predicate predicate ) {
    long startMillis = System.currentTimeMillis();

    while ( !predicate.eval() ) {
        long durationMillis = System.currentTimeMillis() - startMillis;
        if ( durationMillis > timeoutMillis ) {
            throw new IllegalStateException( "Timeout" );
        }

        Thread.yield();
    }
}
Labels:
best practices,
java,
unit testing
Tuesday, 29 May 2012
Java Modulo vs Bitmask benchmarks
We all like things to run fast. A well known optimisation for ring buffers, consistent hashes and hash tables alike is to replace the modulo operator with a bitmask, the trick being to keep the length of the arrays a power of two, thus replacing a costly operation with a much cheaper one. The question that I wanted to answer here is not whether this optimisation works, but how much of a difference it makes on the Java stack.
The following timings were taken running JDK 1.6.0_31 on a 2GHz hyper-threaded quad core MacBook, repeating each operation from cold ten million times. The table shows that so long as the array length is a power of two, Hotspot will eventually replace % with & on its own, making this optimisation only worth applying by hand to speed up the pre-optimisation window. It also shows that the speed up is roughly a factor of three.
|       | Min (ms) | Max (ms) | Per Iteration                     |
| v % 5 | 174.812  | 273.81   | 1.5ns                             |
| v % 4 | 68.684   | 127.508  | 0.6ns before hotspot, 0.4ns after |
| v & 3 | 64.45    | 89.134   | 0.4ns                             |
Conclusion: When using % n, aim to have n be a power of two. This gave a threefold performance improvement within this benchmark. The interesting result here is that Hotspot is capable of optimising the modulo operator, rewriting it to a bitmask for us on the fly, so long as the divisor is a power of two.
Addendum (May 2013): Over the last year I have learnt more about how Intel CPUs are structured, and a few interesting facts have emerged. Firstly, a divide instruction takes approximately 80 CPU cycles to execute (ouch!) and can be pipelined at a throughput of 20-30 cycles. This is compared to a logical AND, which takes 1 cycle to execute and has a throughput of 0.5 cycles. It also turns out that only one of the CPU's execution ports is capable of performing division, while several are capable of bitmasks; thus using a bitmask is not only much faster than division, but more of them may be run in parallel too. CPU timings taken from the x86 Optimisation guide (see table C-16 in the appendices). This level of parallelism was not seen in this microbenchmark because the instructions were all modifying the same data. Also note that this benchmark focused on measuring throughput interlaced with additions and for-loop iteration, rather than instruction latency, thus hiding some of the cost of performing a division.
View the benchmark code
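As a minimal sketch of the trick being measured (the class and names are mine): when the capacity is a power of two, i % capacity and i & (capacity - 1) agree for all non-negative i, so a ring buffer or hash table can use the cheaper mask.

```java
public class RingIndex {

    // General wrap: works for any capacity, but costs a division.
    public static int wrapModulo( int i, int capacity ) {
        return i % capacity;
    }

    // Fast wrap: valid only when capacity is a power of two.
    public static int wrapMask( int i, int capacity ) {
        return i & (capacity - 1);
    }
}
```

Note that for negative i the two differ (Java's % can return negative values), which is one more reason ring buffers keep their indices non-negative.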
Labels:
java,
performance
Wednesday, 28 December 2011
Handling Java nulls in Scala
Null handling in Scala can be painful. To ease this pain I like to avoid the problem entirely by not using nulls at all. However sometimes one has to cross talk between Java and Scala, and then one has to face the problem head on.
The cleanest approach that I have found, one that does not result in a deranged mix of generics or implicits, is to use an anti-corruption layer as close to Java as possible, converting Java's use of null into an Option as soon as possible.
For example:
Option(o).map(_.asInstanceOf[T])
If o is null, then the above snippet will evaluate to None, else Some(o).
Stringing Option together with map and closures is rarely readable, at least not until you have bathed in it for long enough to have Scala'ized your mind. So I like to wrap this up in a function, as follows.

def asInstanceOfNbl[T]( o:AnyRef ) : Option[T] = Option(o).map(_.asInstanceOf[T])
Sunday, 12 June 2011
Java 8 wishlist
I have blogged before about my dissatisfaction with the rate of improvement of the Java platform. While the situation is understandable, it still leaves me annoyed. Below is an updated wish list that I would like to see in Java 8.
I am increasingly considering getting my hands dirty in adding some of these features. If only I had more time. So my first wish is for a clock with a big 'stop the world' button on it. Having an infinite amount of time, with high energy levels, no risk of rsi and no risk of interruptions would be great. Please create this for me Mr S. Hawking.
jvm
support jar level constant pools to reduce class sizes
enhanced gc
transactional memory support
add method parameter names to reflection
access to javadoc from the API (provided in optional packaging that can be accessed via reflection); include comments inherited from abstract methods.
Simple syntax sugar
.?
multi line strings
versioned jar file/module system
drop the annoying diamond operator (<>) added in Java 7 but keep the functionality
optional method parameters with default values
language syntax for reflection
language support for collections
pre condition support on method signatures
More involved language features
mix ins
closures
Language support for custom types
replace try-close with keyword; keyword declares that an object is not to escape the stack
Inferred type support (eg var s = "abc"; s is strongly typed to String, so why double declare it?)
Labels:
java,
language design,
musing
Saturday, 7 May 2011
Test tip with Maven Surefire
When starting a new Maven module, I strongly encourage everybody to set the unit tests to execute in parallel.
There are a few benefits to this:
- most obviously the unit tests run faster on modern multi-cpu boxes
- it helps to tease out concurrent bugs earlier
- it helps to enforce good unit testing practices; specifically that there should be no side effects between tests and the order of test execution should not matter
- enabling this feature later on in a project is possible, but it is nearly always a pain so start it early and you will get its benefits almost for free
I recently introduced this practice to a new team that I am working with. Upon enabling it on a recent module whose unit tests were already in a good state, it highlighted a previously undetected concurrency bug in a third party library that was being pulled in. Who would have guessed it? Early detection of problems is awesome.
To enable this in Maven, just add the following runes to your pom.xml file:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-surefire-plugin</artifactId>
    <configuration>
        <parallel>methods</parallel>
        <threadCount>2</threadCount>
    </configuration>
</plugin>
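As a hypothetical illustration (mine, not from the post) of the kind of hidden ordering dependency that parallel execution flushes out: two tests that share mutable static state and assert absolute values pass when run serially in declaration order, but can reorder or interleave and fail once surefire runs methods in parallel.

```java
// Hypothetical order-dependent "tests": each asserts an absolute value of
// shared static state, so its result depends on which tests ran before it.
public class OrderDependentTests {

    private static int rows = 0;   // shared mutable static state: the hazard

    // Passes only when it runs first.
    public static boolean testInsertOne() {
        rows++;
        return rows == 1;
    }

    // Passes only when it runs second.
    public static boolean testInsertAnother() {
        rows++;
        return rows == 2;
    }
}
```

The fix is for each test to build its own fixture (or assert relative changes), at which point execution order, and therefore parallelism, no longer matters.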
Labels:
best practices,
java,
maven,
testing tips
Java Wish List
When I was young (oh man, I sound old just for saying that)... when I was younger (eeee, not any better; oh dear). Back when I used to write compilers and hack the Linux kernel, I never thought twice about adding a new language feature to make my job easier. Over the years, given the amount of effort that requires (usually in tooling across various IDEs) and the lack of will at the companies I have worked for to develop such tools, I have fallen out of the habit of asking what language feature would make a job easier; the question has instead been replaced with 'what framework feature or library will make this job easier'.
So as I become increasingly itchy with Java, I have found myself once again asking that question: 'what language features do I want?'. I was surprised at how long the list became. I include short descriptions of the main features that came to mind, and I will look to expand on some of them in future posts, as short descriptions do not do them justice.
- Improved GC behaviour where performance does not drop off for big heaps.
- Software/Hardware Transactional Memory (STM/HTM) support built into the language, ala Clojure.
- Simplify the try-with-resource syntax. Something like the following should suffice, using the usual curly braces and the stack for scope.
- autoclose InputStream in = new FileInputStream("a.txt");
- Type inference, Scala style. This one feature, in my opinion, makes Scala the language to look at these days.
- A JVM enhancement, support constant pools across an entire jar file and not just a single class. It would reduce the file size greatly.
- To give extra design intent information to the compiler, which would in turn give extra compile time checking, allow the developer to declare that an object is not to leave the stack. References that escape the stack can cause thread safety problems, so receiving compile time checks when it matters would be a big help.
- local Context ctx = new Context();
- Compiler to error when references to 'this' escape a constructor during instantiation. This is a common cause of concurrency bugs and so should just not be allowed.
- Closures closures closures, functions as objects, everything as objects, objects objects objects. Closures. – steps down off of his soap box –
- Remove statics. Naturally not achievable for Java, but like global variables they break many of today's best practice design patterns and they should be put out of their misery. Another good point of Scala.
- Build immutable and lockable objects into the language syntax
- Support lazy evaluation of expressions passed as arguments to methods.
- Build object field and method reflection into the syntax of the language.
- Clean and elegant closure support. Okay, easier said than done. I'll expand my ten pence worth on how this should look later, but as a teaser: I don't like the Groovy or Scala syntax, and prefer the style of using anonymous method syntax. It just feels more uniform to me.
- Reflection can retrieve method parameter names, and even java doc
- Refresh the internationalisation frameworks and build them more cleanly into the language. I like how Apple have done it in Objective-C over the Java resource approach.
- language eval support
- simple dynamic SQL syntax; Nick Reeves taught me a wonderful syntax for this a decade ago and I would love to see it become mainstream.
- Another theft from Scala: implicit type conversions and operator overloading (okay, so that's C++ style; we all borrow from each other). So I can write a Radian class, pass it around, and get compile time type checking, use of language operators, conversions etc.
- for ( T a : list ) should not throw NPE when list is null
- Groovy had it right when they added the ?. operator for avoiding NPEs when calling a.b().c()
- Default values for method parameters (geesh is Java the only language without this)
- Multiple line strings
- Language support for templates
- Language support for SQL and XML
- Add Interner support
- Add cache framework api built up using the delegate design pattern
- one keyword property declaration
- cleaner support for hashCode, equals and toString.
Labels:
java,
language design
Monday, 31 January 2011
Advice on learning Spring
I was recently asked for advice on how to go about learning Spring, specifically for materials beyond the main Spring documentation. This got me thinking about why some people pick up Spring in a matter of hours while others have to work at it. As with most material there are two parts to Spring: firstly the nuts and bolts of the framework itself, and secondly the patterns and practices behind it. This split is so common in computing (and beyond) that it shaped my choice of university; I found a long time ago that focusing on the patterns, practices and theory gave me skills that transferred easily between different technologies. Spring is no different.
Given this line of thought I broke my reply into two parts. For the nuts and bolts of Spring, read the first chapter of the Spring documentation, 'Core Technologies - The IoC container'. I have not found anything better than this. However do not expect to fully 'get' it on the first pass, or depending on your experience even the third or fourth. Despite Spring's reputed simplicity, it is obscured behind both its own size and, as hinted previously, 'patterns, practices and theory'. No real prizes for guessing that the main practices behind Spring are Dependency Injection (DI) and Inversion of Control (IoC); they are both referred to frequently within the Spring documentation, and dependency injection in particular is a technique that dates back long before Spring and its ilk.
So how does one really learn Dependency Injection and Inversion of Control? The academically inclined may go back to studying Object Oriented (OO) techniques such as abstraction, encapsulation and polymorphism, and concepts like object cohesion and coupling. The more practically minded will benefit from doing, and for this I strongly recommend Test Driven Design (TDD) and mocking. Every test that you write following TDD principles practices the core design skills needed for efficient and effective use of Spring. When you have mastered the ability to break a design cleanly along its lines of abstraction, the tests will become focused, simple and robust. At that point the objects being tested will slide into Spring, or any other IoC container, very easily.
These skills are easy to describe, but difficult to master. Don't be scared to try once, then throw it away and to redo. Most people are scared of the time that this represents, especially if they are working on a project with tight deadlines. However I prefer to view it as an investment, and I make sure that I build time into any estimates that I provide. One can never expect to become quick at this if they always stick to their first design. A form of deadlock happens when we stick to our first designs, we never experiment to find better solutions and eventually we enter a vicious cycle of fire fighting spaghetti code. Once that happens a project will become paralysed. So do yourself a favour, come up with three designs to every problem; and then go with the forth. Lastly I recommend that when you have had a few goes at this, find somebody who is more practiced than yourself and you respect, ask them for feedback. Rome was not built by one man alone.
Other material
Logging best practices
This post is a brain dump of the practices that have proved useful to me over time while maintaining production Java systems. Please share your experiences, especially if you have come across any great time savers not already mentioned here.
The Practices
The scale of the system that you are working on affects how valuable each of the individual practices below will be to you. Some of the practices that I mention were introduced through experience of server farms in excess of 150 JVMs running in a very 'chatty' SOA architecture, and so those practices may not be as valuable in, say, a single-process Swing application. However I have come to the conclusion that once these practices are wrapped into a reusable set of classes and bedded into the mindset of the developers, they cost next to nothing to use; and so I recommend putting in place at least a placeholder for each practice at the start of a project, as retrofitting large code bases is often more expensive than putting in a good placeholder early.
- Easy log file access
- Self contained messages
- Logs telling the Story of what happened
- Zero tolerance to errors
- Logging as part of QA
- Logging levels
- Minimising Logging Overhead
- Machine parsable logs
- Runtime adjustable levels
- Archive the logs
Easy log file access
The more effort and time involved in gaining access to a system's log files, the less frequently those logs will be used. I have worked in environments where developers had to request a copy of the production log files in order to begin investigating a problem. Such requests would take between 1 hour and 3 days to be actioned. This barrier to entry blocks all pre-emptive problem solving, as well as timely feedback from experiments used to reproduce and track down a problem.
My ideal is to have real-time search access to all production logs; tools such as Splunk make this very straightforward, especially as they are capable of combining logs from multiple servers. If a licensed solution is not viable then roll up your sleeves and write a few scripts to achieve the same results.
Self contained messages
Each individual message needs to be complete in its own right. That is, there should be one call to the LOG class per key event being reported on. All of the information for that event must be contained in that call, such as stack traces and the values of key data that describe the uniqueness of the event that has just occurred. Do not place stack traces in a separate call to the log from the message describing their context.
If the information or data is spread across multiple calls to the LOG class then, in a multithreaded environment, it becomes likely that the messages will get interlaced, making them more difficult to decipher. Naturally this interlacing does not occur often during the development cycle, as a developer working alone tends to send only one request into the system at a time.
Logs telling the Story of what happened
Context is King
A single line in the log file describes a single event; a series of events tells a story, and in order to understand the meaning of the story it needs to flow: the relationships between each event need to be clear and concise.
To achieve this with Log4J I use the MDC class to push context information that is important to the story being described, such as: the name of the user who made the request, a unique request id that was generated when the request started, the thread id at the time the message was generated, and the security principal that the request was running as at that time. In environments consisting of multiple co-operating servers you may also find the following values useful in the MDC: the name of the SOA service that generated the logged event, and the machine name where the logged event was generated. Which pieces of data you place on the MDC will depend on your context; remember to make sure that each item pays for itself. It is not a free journey, and we don't want to give away any free tickets.
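The MDC idea can be pictured as a per-thread map that the log layout reads from. Below is a minimal self-contained sketch of that mechanism; the class and method names are illustrative, not Log4J's actual API, though org.apache.log4j.MDC works along the same lines:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of a mapped diagnostic context: a per-thread map of context values
// that a log formatter can prefix onto every message.
public class DiagnosticContext {
    private static final ThreadLocal<Map<String, String>> CTX =
            ThreadLocal.withInitial(LinkedHashMap::new);

    public static void put(String key, String value) { CTX.get().put(key, value); }
    public static void remove(String key)            { CTX.get().remove(key); }

    // Prefix the current thread's context onto a message, e.g. "[user=jim] msg"
    public static String format(String message) {
        StringBuilder sb = new StringBuilder("[");
        CTX.get().forEach((k, v) -> {
            if (sb.length() > 1) sb.append(", ");
            sb.append(k).append('=').append(v);
        });
        return sb.append("] ").append(message).toString();
    }

    public static void main(String[] args) {
        put("user", "jim");
        put("requestId", "42");
        System.out.println(format("deposited 53 GBP"));
        // [user=jim, requestId=42] deposited 53 GBP
        remove("requestId");
    }
}
```

In Log4J itself the values would be pushed with MDC.put and rendered via the %X conversion pattern rather than by hand.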
Context across machine boundaries
In multi-machine environments being able to tie the story together becomes harder, and yet all the more important. Consider passing some of the values that you push onto the MDC between machine boundaries as part of the remote calls; I make a point of always passing the user name and request id across machine boundaries. This means that if an error is reported to the user, they can be given an incident number to report to the support staff. This incident number is the request id mentioned previously, and it will allow a developer to reconstruct from the log files the full story of what the user was doing as the error occurred.
Zero tolerance to errors
On projects that I have worked on that have had no compile-time warnings, it has always been easy to spot when one has introduced a warning, and not wanting to be the one caught 'peeing in the proverbial pool' I tend to fix new warnings fairly promptly. If on the other hand there are lots of warnings already in place then my brain shuts off and I no longer notice them. I won't carry on the previous comparison, but I do find that once a build has more than a few warnings they quickly start to multiply without even being noticed. The same phenomenon happens with errors in production logs: it is very easy to spot and respond to them when they are rare, but once they have become frequent they become the norm, and in my experience that is a significant sign that the project is struggling with a vicious circle that needs to be broken. If you are lucky enough to be working on a green-field project, it is best to start this practice early.
Logging as part of QA
The log files are part of the interface to a system; think of them as an API in their own right. They may not be used by the end customer, but in circumstances when they are used to support the end customer they do have an audience. To ensure the quality of any interface to a system it must be both tested and used frequently. Failure to perform these checks results in the rapid build-up of dust, decay and entropy. Nobody enjoys house cleaning, so automating these tests can greatly reduce the burden. The usual suspects apply here: unit testing, mocking of the Log interface and TDD.
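One way to bring the logs under test, as suggested above, is to hide the logging framework behind a small interface and record messages in memory during unit tests. A sketch with illustrative names (the interface, classes and message format are assumptions, not from the post):

```java
import java.util.ArrayList;
import java.util.List;

// Hiding the logging framework behind a small interface so that unit tests
// can assert on exactly what was logged.
interface EventLog {
    void error(String message);
}

// Test double: records messages in memory instead of writing to a file.
class RecordingLog implements EventLog {
    final List<String> messages = new ArrayList<>();
    @Override public void error(String message) { messages.add(message); }
}

public class AccountService {
    private final EventLog log;

    public AccountService(EventLog log) { this.log = log; }

    public boolean deposit(long pence) {
        if (pence <= 0) {
            // Self-contained message: the offending value travels with the event.
            log.error("rejected deposit [amount=" + pence + "]");
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        RecordingLog log = new RecordingLog();
        AccountService service = new AccountService(log);
        service.deposit(-5);
        System.out.println(log.messages);
        // [rejected deposit [amount=-5]]
    }
}
```

In production the EventLog implementation would simply delegate to Log4J; the tests never need a real log file.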
Logging levels
Tools such as Log4J support several coarse-grained levels for categorising logged messages. It makes a significant difference to the quality of the log files if developers agree early in the project when to use each of the common log levels, and adhere to that agreement. This consistency helps to reduce spam in the log files and helps turn off the noise when going into production. Personally I start off using DEBUG for all of my messages and then increase the level when I can describe the benefit of doing so. The following table captures the type of explanations that I look for with each logging level. This table is intended to help get thoughts flowing, and is by no means a summary of the only rules that a team could use.
| Level | Usage | Expected action | Example |
| FATAL | Reporting problems that will cost the company money and will carry on costing money until resolved. Once a problem has been reported do not immediately repeat the same message. | Fix ASAP, even if that means waking people up in the middle of the night. As such, false alarms and frequent alerts will not be taken too kindly. | The database is down and all user requests are being rejected. |
| ERROR | A problem has occurred that was not automatically recovered, but it is not critical to the revenue stream of the company. | In low to medium volumes ERROR messages will trigger investigation in a timely manner. High volumes may be escalated to FATAL. NB: A user entering 'foo' into a number-only field should never be reported as an error at this level; it should be recovered automatically at the GUI level. | Updating a user's details failed due to an unexpected database error. |
| WARN | A problem has occurred that was both unexpected and automatically recovered. | Actively monitor with the goal of pre-empting problems that could escalate. Frequently recurring problems will need to be fed back to development for resolution. | Connection to the exchange rate server has gone down but caching is masking the problem, or disk space is down to the last 10%, etc. |
| INFO | Report key system state changes and business level events. | Used to answer infrequent queries about the behaviour of the system and its users. | Jim deposited 53 pounds sterling. |
| DEBUG | Detailed information about each request. | Used to investigate the causes of a problem. Will usually be disabled in production by default as it will output a lot of contextual information. | Value of field x is 92.4, or 'starting process Y' and 'finished process Y in 95ms' |
Minimising Logging Overhead
Logging is not free. It costs the developer time to add it to a system, it costs the system time to execute the log statements, and it costs the poor soul who has to investigate/support the system time to understand the logs. As a consequence, think twice before logging a new event: is it really needed? If it is, then how expensive is it to produce the event, and how often will it appear in the logs?
- don't log unless the event adds to the story being told. What, you don't know what the story is? Go back to 'Context is King' and do not collect two hundred pounds
- if the event is very common, bulk occurrences together; don't spam
- if the message to be logged is a constant string, use LOG.level("msg")
- if building the message requires expensive string concatenation and runtime performance is important, guard it with isXXXEnabled
- to improve readability and performance I sometimes use an interface to reduce the number of ifs; this is also useful if the logs are to be translatable
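The isXXXEnabled guard can be illustrated with the JDK's own java.util.logging, whose isLoggable plays the same role as Log4J's isDebugEnabled. The describe method below is a hypothetical stand-in for an expensive message build:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class GuardedLogging {
    private static final Logger LOG = Logger.getLogger(GuardedLogging.class.getName());

    // Stand-in for an expensive message build (big concatenations, toString
    // on large objects) that we only want to pay for when DEBUG/FINE is on.
    static String describe(int[] data) {
        StringBuilder sb = new StringBuilder("data=");
        for (int d : data) sb.append(d).append(',');
        return sb.toString();
    }

    static void process(int[] data) {
        // Guard: describe() is skipped entirely unless the level is enabled.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine("processing " + describe(data));
        }
        // ... real work ...
    }

    public static void main(String[] args) {
        process(new int[] {1, 2, 3});
    }
}
```

With Log4J the equivalent is `if (LOG.isDebugEnabled()) { LOG.debug(...); }`.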
Machine parsable logs
To make processing logs easier, write to the logs in a consistent format. When embedding data into the logs, use consistent names for the data.
For example, one approach that I have worked with is to output log messages like this:
[user=Jim] has logged on.
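A minimal helper that enforces one key=value layout across the code base might look like this (the helper and its exact format are an assumption, not taken from the post):

```java
public class LogFormat {
    // Render key/value pairs in a fixed "[k1=v1, k2=v2] message" shape so
    // that grep/awk scripts can rely on one layout across the whole system.
    public static String event(String message, String... kv) {
        if (kv.length % 2 != 0) {
            throw new IllegalArgumentException("keys and values must pair up");
        }
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < kv.length; i += 2) {
            if (i > 0) sb.append(", ");
            sb.append(kv[i]).append('=').append(kv[i + 1]);
        }
        return sb.append("] ").append(message).toString();
    }

    public static void main(String[] args) {
        System.out.println(event("has logged on.", "user", "Jim"));
        // [user=Jim] has logged on.
    }
}
```

A matching one-liner in grep (`grep 'user=Jim'`) then pulls out everything a single user did.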
Runtime adjustable levels
As discussed under 'Minimising logging overheads' we want to avoid spamming the log files. However when diagnosing a problem we need data, and when reproducing problems we sometimes need a lot of data. For this reason it is very useful to have a mechanism that can increase the log levels on a machine, or group of machines, without having to reboot the machine. The most flexible approach is to be able to do this per user; remember not to give untrusted users access to such a mechanism.
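A sketch of runtime level adjustment using the JDK's own java.util.logging (Log4J's Logger.setLevel works similarly); the logger name here is a made-up example:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class RuntimeLevels {
    public static void main(String[] args) {
        // Hypothetical logger name for one subsystem of the application.
        Logger orders = Logger.getLogger("com.example.orders");
        orders.setLevel(Level.INFO);

        // Later, while diagnosing a live problem, raise the detail without
        // a reboot; expose this through JMX or an admin-only endpoint,
        // never to untrusted users.
        orders.setLevel(Level.FINE);
        System.out.println(orders.getLevel());
        // FINE
    }
}
```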
Archive the logs
You do not want to run out of disk space in production. Monitor the amount of data on the disk drives, and archive off logs for future reference.
Useful Tools
Log4J
Custom shell scripts written in grep/sed/awk, perl, ruby, etc
Splunk
Labels:
best practices,
java,
logging
Saturday, 8 January 2011
Performance comparison of Java2D image operations
Abstract
The motivation for this blog item is to capture notes during some spikes that I made while figuring out which approach to manipulating photographs performs the best in Java2D. For the purposes of this article I am reducing the brightness of a picture as a test case, a simple operation that merely requires the colour values of each pixel to be divided by two. More complicated algorithms can be built up upon the back of this article once we have an understanding of which approach to reading and writing pixels performs the best.
The Test Setup
The test picture
An 11.7MB PNG file, measuring 3296 pixels in width and 2472 pixels in height.
The test machines
- Mac Airbook
- Intel DualCore 2.1GHz, 2GB RAM, Nvidia GeForce 9400M
- Win Desktop
- AMD DualCore 2.6GHz, 2GB RAM, Nvidia Quadro FX1500
Both machines are running Java 1.6.
The Code Spikes
Eight different approaches to accessing pixels in Java2D will be described below, with the results of running them on two different machines provided; one a Windows-based machine and the other OSX, both running Java 1.6. The JVM will also have an effect on the performance of each of these algorithms; to keep this article focused I will keep all of the JVM options set to default.
Approach 1
Read a single pixel in as a series of integers from WritableRaster.getPixel, and write them out using setPixel. An instance of WritableRaster can be retrieved from BufferedImage.getRaster().
final WritableRaster inRaster = inImage.getRaster();
final WritableRaster outRaster = outImage.getRaster();
int[] pixel = new int[3];
for ( int x=0; x<imageWidth ; x++ ) {
    for ( int y=0; y<imageHeight; y++ ) {
        pixel = inRaster.getPixel( x, y, pixel );
        pixel[0] = pixel[0]/2;
        pixel[1] = pixel[1]/2;
        pixel[2] = pixel[2]/2;
        outRaster.setPixel( x, y, pixel );
    }
}
- MacOSX (airbook)
- 718 ms
- Windows XP
- 985 ms
Approach 2
Read each pixel in as an encoded integer from BufferedImage.getRGB(x,y), and write them out using setRGB.
for ( int x=0; x<imageWidth ; x++ ) {
for ( int y=0; y<imageHeight; y++ ) {
int rgb = inImage.getRGB( x, y );
int alpha = ((rgb >> 24) & 0xff);
int red = ((rgb >> 16) & 0xff);
int green = ((rgb >> 8) & 0xff);
int blue = ((rgb ) & 0xff);
int rgb2 = (alpha << 24) | ((red/2) << 16) | ((green/2) << 8) | (blue/2);
outImage.setRGB(x, y, rgb2);
}
}
- MacOSX (airbook)
- 1495 ms
- Windows XP
- 2219 ms
Using getRGB is twice as slow as getPixel, and provides no particular benefits. Let's avoid this approach and see if we can optimise getPixel any further.
Approach 3
As Approach 1, except that rather than reading a pixel as a set of three integers, it reads them in as three floats.
final WritableRaster inRaster = inImage.getRaster();
final WritableRaster outRaster = outImage.getRaster();
float[] pixel = new float[3];
for ( int x=0; x<imageWidth ; x++ ) {
for ( int y=0; y<imageHeight; y++ ) {
pixel = inRaster.getPixel( x, y, pixel );
pixel[0] = pixel[0]/2;
pixel[1] = pixel[1]/2;
pixel[2] = pixel[2]/2;
outRaster.setPixel( x, y, pixel );
}
}
- MacOSX (airbook)
- 901 ms
- Windows XP
- 1203 ms
A little slower than reading the values per component. This surprised me: either the noise of the test has hidden the improvement, or there is some overhead here that is not good. Further trials are needed to differentiate these two possibilities. But before we explore this further, let's try the approach provided for by Java2D, the ConvolveOp.
Approach 4
Swing provides a class specifically designed for convolution operations, such as reducing the brightness of a picture. Convolution is the term given to a group of image processing algorithms that average a group of pixels together to create a new value for a single pixel. The following code uses a 1x1 matrix to take an average of only the pixel that will be replaced.
float[] DARKEN = {1.0f};
Kernel kernel = new Kernel(1, 1, DARKEN);
ConvolveOp cop = new ConvolveOp(kernel,ConvolveOp.EDGE_NO_OP, null);
cop.filter(inImage, outImage);
- MacOSX (airbook)
- 184 ms
- Windows XP
- 172 ms
The convolution implementation provided by Swing was significantly faster than the per-pixel baseline that was tried first. This was only to be expected, given that the Sun engineers would have spent time tuning the code for precisely this type of use.
Approach 5
To see if the per-pixel approach can be improved on, I took the faster of the two approaches tried (the one using integers) and used the getPixels method, which is capable of reading multiple pixels at a time. This spike will show whether there is much overhead in accessing the pixels one at a time via getPixel versus a bulk fetch and set of multiple pixels. The batch size has been set to match the width of the picture.
final WritableRaster inRaster = inImage.getRaster();
final WritableRaster outRaster = outImage.getRaster();
int[] pixels = new int[3*imageWidth];
for ( int y=0; y<imageHeight ; y++ ) {
pixels = inRaster.getPixels( 0, y, imageWidth, 1, pixels );
for ( int x=0; x<imageWidth; x++ ) {
int m = x*3;
pixels[m+0] = pixels[m+0]/2;
pixels[m+1] = pixels[m+1]/2;
pixels[m+2] = pixels[m+2]/2;
}
outRaster.setPixels( 0, y, imageWidth, 1, pixels );
}
- MacOSX (airbook)
- 534 ms
- Windows XP
- 578 ms
Approach 6
Reading an entire row of pixels in at a time was faster than accessing a single pixel at a time. However it is still not approaching the performance of the ConvolveOp. Perhaps fetching two rows at a time will be faster still?
final WritableRaster inRaster = inImage.getRaster();
final WritableRaster outRaster = outImage.getRaster();
int[] pixels = new int[3*imageWidth*2];
for ( int y=0; y<imageHeight ; y+=2 ) {
pixels = inRaster.getPixels( 0, y, imageWidth, 2, pixels );
for ( int x=0; x<imageWidth; x++ ) {
int m = x*3;
pixels[m+0] = pixels[m+0]/2;
pixels[m+1] = pixels[m+1]/2;
pixels[m+2] = pixels[m+2]/2;
int n = m+imageWidth*3;
pixels[n+0] = pixels[n+0]/2;
pixels[n+1] = pixels[n+1]/2;
pixels[n+2] = pixels[n+2]/2;
}
outRaster.setPixels( 0, y, imageWidth, 2, pixels );
}
- MacOSX (airbook)
- 429 ms
- Windows XP
- 453 ms
Approach 7
Reading in two rows of pixels at a time was for the most part faster than reading a row at a time. As with all of the timings taken, Java's times vary greatly each run, so out of curiosity I wanted to know how much slower processing half a row at a time would be.
final WritableRaster inRaster = inImage.getRaster();
final WritableRaster outRaster = outImage.getRaster();
int halfWidth = imageWidth/2;
int[] pixels = new int[3*halfWidth];
for ( int y=0; y<imageHeight ; y++ ) {
pixels = inRaster.getPixels( 0, y, halfWidth, 1, pixels );
for ( int x=0; x<halfWidth; x++ ) {
int m = x*3;
pixels[m+0] = pixels[m+0]/2;
pixels[m+1] = pixels[m+1]/2;
pixels[m+2] = pixels[m+2]/2;
}
outRaster.setPixels( 0, y, halfWidth, 1, pixels );
pixels = inRaster.getPixels( halfWidth, y, halfWidth, 1, pixels );
for ( int x=0; x<halfWidth; x++ ) {
int m = x*3;
pixels[m+0] = pixels[m+0]/2;
pixels[m+1] = pixels[m+1]/2;
pixels[m+2] = pixels[m+2]/2;
}
outRaster.setPixels( halfWidth, y, halfWidth, 1, pixels );
}
- MacOSX (airbook)
- 418 ms
- Windows XP
- 453 ms
This time the spike was faster than processing two rows of pixels at a time; so it would appear that two rows is faster than one row at a time, and half a row is faster still. Huh? What is going on here? It would appear that the noise in the timing of the operations is greater than the performance improvement seen by varying the number of pixels processed at a time. Clearly reading multiple pixels is preferable to one at a time, but after that it is not competing significantly with the ConvolveOp, which is still king.
Approach 8
Investigating the getPixel methods was interesting, but if we are to see a significant improvement in performance a totally different approach is going to be needed. For the last spike I tried to bypass as many of Java2D's layers as possible and access the picture data directly via the DataBuffer class. It is still a long way from the hardware, as is common in Java, however it will give us an idea of how much overhead the Java2D classes BufferedImage and WritableRaster add.
// For this approach to work it is important that both DataBuffers use the same picture encoding under the hood as each other,
// otherwise the picture will corrupt
final WritableRaster inRaster = inImage.getRaster();
final WritableRaster outRaster = inRaster.createCompatibleWritableRaster(imageWidth, imageHeight);
DataBuffer in = inRaster.getDataBuffer();
DataBuffer out = outRaster.getDataBuffer();
int size = in.getSize();
for ( int i=0; i<size ; i++ ) {
out.setElem( 0, i, in.getElem(0, i)/2 );
}
BufferedImage outImage = new BufferedImage(inImage.getColorModel(), outRaster, true, null);
- MacOSX (airbook)
- 71 ms
- Windows XP
- 63 ms
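The timings above jump around considerably from run to run. A harness along the following lines smooths them out with a System.gc() between runs and a median that ignores wild values; this is an assumption about methodology, as the post does not show its own timing code:

```java
import java.util.Arrays;

// Sketch of a timing harness for the spikes above: repeat each spike,
// collect wall-clock samples, and report the median to damp JVM jitter.
public class SpikeTimer {
    public static long medianMillis(Runnable spike, int runs) {
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            System.gc();                       // reduce GC noise between runs
            long start = System.nanoTime();
            spike.run();
            samples[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(samples);                  // the median ignores wild values
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        long ms = medianMillis(() -> {
            long sum = 0;                      // stand-in workload
            for (int i = 0; i < 1_000_000; i++) sum += i;
        }, 5);
        System.out.println("median: " + ms + " ms");
    }
}
```

The first couple of runs should be discarded as warm-up, since HotSpot will still be compiling the spike's code.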
Conclusions
Comparing the performance of different image processing approaches in Java has been a challenge. Java is unable to give anything close to constant time for processing the same image; during the course of running the above code fragments I saw variations ranging from 700ms to 7000ms. To help smooth the results I placed a call to System.gc() between each spike, reran the tests many times to bed the system in, and took an average ignoring any really wild values. This behaviour makes Java very unsuitable for any type of real-time image processing application. The variance of Java's performance being said, there are some very clear trends in the results. Specifically, if you are implementing a convolution algorithm and performance is not your key concern then the ConvolveOp is excellent; however if performance is vital then, with some extra effort in understanding the encoding of the DataBuffer used by your image, you can get a 2-5x performance boost over ConvolveOp by accessing the DataBuffer directly. If the algorithm that you are working with does not boil down to a convolution then you will do okay to read in a line of pixels at a time using getPixels (avoid getRGB); however the clear winner was the DataBuffer.
After thoughts
I wrote this article out of curiosity about Java's image processing capabilities; it is widely recognised that there are better languages for the job. I have found it possible to do reasonable image processing effects for Java applications in pure Java, but I would not consider it for anything that needed real-time response times. Staying in the Java realm for the moment, it would be interesting to also compare SWT with the approaches tested in this article, as well as using JNI access to hardware such as graphics cards. However at the point of accessing hardware directly, it removes the main reason why I considered Java: platform independence. Perhaps I will try C# or good ol' C next.
Appendix
Labels:
java