Tuesday, 29 May 2012

Java Modulo vs Bitmask benchmarks

We all like things to run fast. A well known optimisation for ring buffers, consistent hashes and hash tables alike is to replace the modulo operator with a bitmask. The trick being to keep the length of the arrays to being a power of two. Thus replacing a costly operation with a much cheaper one. The question that I wanted to answer here is not whether this optimisation works, but on the Java stack how much of a difference does it make. The following timings were taken running JDK 1.6.0_31 on a 2GHz hyper threaded quad core MacBook, which repeated each operation from cold ten million times each. The table shows that so long as the array length is a multiple of two then hotspot will eventually replace % with & making this optimisation only worth applying to speed up the pre-optimisation window. It also shows that the speed up is roughly a factor of three.

Min (ms)Max (ms)Per Iteration
v % 5174.812273.811.5ns
v % 468.684127.5080.6ns before hotspot, 0.4ns after
v & 364.4589.1340.4ns

Conclusion: When using % n, aim to have n to a power of two. This gave a threefold performance improvement within this benchmark.  The interesting result here is that Hotspot is capable of optimising the modulo operator, rewriting it to a bitmask for us on the fly so long as the division is to the power of two.

Appendium (May 2013): Over the last year I have learnt more about how Intel CPUs are structured, and a few interesting facts have emerged. Firstly a divide instruction takes approximately 80 CPU cycles to execute (ouch!) and can be pipelined at the throughput rate of 20-30 cycles. This is compared to a logical AND which takes 1 cycle to execute, and has a throughput of 0.5.  It also turns out that only one of the CPU ports is capable of performing division but multiple are capable of bitmasks.  Thus using a bitmask is not only much faster than division but more of them may be run in parallel too.   CPU time taken from x86 Optimisation guide (checkout table C-16 in the appendices).  This level of parallelism was not seen in this microbenchmark because the instructions were all modifying the same data.  Also note that this benchmark focused on measuring throughput interlaced with additions and for loop iteration rather than instruction latency; thus hiding some of the cost of performing a division.

View the benchmark code

No comments:

Post a Comment