Min (ms) | Max (ms) | Per Iteration | |
v % 5 | 174.812 | 273.81 | 1.5ns |
v % 4 | 68.684 | 127.508 | 0.6ns before hotspot, 0.4ns after |
v & 3 | 64.45 | 89.134 | 0.4ns |
Conclusion: When using % n, aim to have n to a power of two. This gave a threefold performance improvement within this benchmark. The interesting result here is that Hotspot is capable of optimising the modulo operator, rewriting it to a bitmask for us on the fly so long as the division is to the power of two.
Appendium (May 2013): Over the last year I have learnt more about how Intel CPUs are structured, and a few interesting facts have emerged. Firstly a divide instruction takes approximately 80 CPU cycles to execute (ouch!) and can be pipelined at the throughput rate of 20-30 cycles. This is compared to a logical AND which takes 1 cycle to execute, and has a throughput of 0.5. It also turns out that only one of the CPU ports is capable of performing division but multiple are capable of bitmasks. Thus using a bitmask is not only much faster than division but more of them may be run in parallel too. CPU time taken from x86 Optimisation guide (checkout table C-16 in the appendices). This level of parallelism was not seen in this microbenchmark because the instructions were all modifying the same data. Also note that this benchmark focused on measuring throughput interlaced with additions and for loop iteration rather than instruction latency; thus hiding some of the cost of performing a division.
View the benchmark code
This comment has been removed by a blog administrator.
ReplyDelete