Helping ordinary people create extraordinary websites!
GET OUR NEWSLETTER
Your Email:
 

Java Theory and Practice: Anatomy of a Flawed Microbenchmark

By Brian Goetz
2005-04-27


Benchmark Code Doesn't Look Like Real Code

The flaws discussed so far can be fixed with a reasonable amount of reworking of the benchmark code, introducing some test parameters such as the degree of contention, and running it on a greater variety of systems and for multiple values of the test parameters. But it also suffers from some flaws in approach, which may not be resolvable with any amount of tweaking. To see why this is so, you need to think like the JVM, and understand what is going to happen when SyncLockTest is compiled.

The Heisenbenchmark principle
When writing a microbenchmark to measure the performance of a language primitive like synchronization, you're grappling with the Heisenberg principle. You want to measure how fast operation X is, so you don't want to do anything else besides X. But often, the result is a do-nothing benchmark, which the compiler can optimize away partially or completely without you realizing it, making the test run faster than expected. If you put extraneous code Y into your benchmark, you're now measuring the performance of X+Y, introducing noise into your measurement of X, and worse, the presence of Y changes how the JIT will optimize X. Writing a good microbenchmark means finding that elusive balance between not enough filler and dataflow dependency to prevent the compiler from optimizing away your entire program, and so much filler that what you're trying to measure gets lost in the noise.

Because runtime compilation uses profiling data to guide its optimization, the JIT may well optimize the test code differently than it would real code. As with all benchmarks, there is a significant risk that the compiler will be able to optimize away the whole thing, because it will (correctly) realize that the benchmark code doesn't actually do anything or produce a result that is used for anything. Writing effective benchmarks requires that we "fool" the compiler into not pruning away code as dead, even though it really is. The use of the counter variables in the Incrementer classes is a failed attempt to fool the compiler, but compilers are often smarter than we give them credit for when it comes to eliminating dead code.

The problem is compounded by the fact that synchronization is a built-in language feature. JIT compilers are allowed to take some liberties with synchronized blocks to reduce their performance cost. There are cases when synchronization can be removed completely, and adjacent synchronization blocks that synchronize on the same monitor may be merged. If we're measuring the cost of synchronization, these optimizations really hurt us, as we don't know how many synchronizations (potentially all of them, in this case!) are being optimized away. Worse, the way the JIT optimizes the do-nothing code in SyncTest.increment() may be very different from how it optimizes a real-world program.

But it gets worse. The ostensible purpose of this microbenchmark was to test which was faster, synchronization or ReentrantLock. Because synchronization is built into the language, wheras ReentrantLock is an ordinary Java class, the compiler will optimize a do-nothing synchronization differently than it will a do-nothing ReentrantLock-acquire. This optimization makes the do-nothing synchronization appear faster. Because the compiler will optimize the two cases both differently from each other and differently from the way it would in a real-world situation, the results of this program tell us very little about the difference between their relative performance in a real-world situation.

Dead code elimination
In December's article, I discussed the problem of dead-code elimination in benchmarks -- that because benchmarks often do nothing of value, the compiler can often eliminate whole chunks of the benchmark, distorting the timing measurements. This benchmark has that problem in several ways. The fact that the compiler can eliminate some code as dead is not necessarily fatal to our cause, but the problem here is that it will be able to perform different degrees of optimization on the two code paths, systematically distorting our measurements.

The two Incrementer classes purport to do some do-nothing work (incrementing a variable). But a smart JVM will observe that the counter variables are never accessed, and can therefore eliminate the code associated with incrementing them. And here is where we have a serious problem -- now the synchronized block in the SyncIncrementer.increment() method is empty, and the compiler may be able to entirely eliminate it, whereas LockIncrementer.increment() still contains locking code that the compiler may or may not be able to entirely eliminate. You might think that this code is a point in favor of synchronization -- that the compiler can more readily eliminate it -- but it's something that is far more likely to happen in do-nothing benchmarks than in real-world, well-written code.

It is this problem -- that the compiler can optimize one alternative better than the other, but the difference only becomes apparent in do-nothing benchmarks -- that makes this approach to comparing the performance of synchronization and ReentrantLock so difficult.

Loop unrolling and lock coalescing
Even if the compiler does not eliminate the counter management, it may still decide to optimize the two increment() methods differently. A standard optimization is loop unrolling; the compiler will unroll the loops to reduce the number of branches. How many iterations to unroll depends on how much code is inside the loop body, and there is "more" code inside the loop body of LockIncrementer.increment() than SyncIncrementer.increment(). Further, when SyncIncrementer.increment() is unrolled and the method call inlined, the unrolled loop will be a sequence of lock-increment-unlock groups. Because these all lock the same monitor, the compiler can perform lock coalescing (also called lock coarsening) to merge adjacent synchronized blocks, meaning that SyncIncrementer will be performing fewer synchronizations than might be expected. (And it gets worse; after coalescing the locks, the synchronized body will contain only a sequence of increments, which can be strength-reduced into a single addition. And, if this process is applied repeatedly, the entire loop can be collapsed into a single synchronized block with a single "counter=10000000" operation. And yes, real-world JVMs can perform these optimizations.)

Again, the problem is not strictly that the optimizer is optimizing away our benchmark, but that it is able to apply a different degree of optimization to one alternative than to another, and that the types of optimizations that it can apply to each alternative would not likely be applicable in real-world code.

Flaw scorecard
This list is not exhaustive, but here are just some of the reasons why this benchmark didn't do what its creator thought it did:

• No warmup was performed, and it made no account for the time taken by JIT execution.

• The test was susceptible to errors induced by monomorphic call transformation and subsequent deoptimization.

• The code protected by the synchronized block or ReentrantLock is effectively dead, which distorts how the JIT will optimize the code; it may be able to eliminate the entirety of the synchronization test.

• The test program wants to measure the performance of a locking primitive, but it did so without including the effects of contention, and was run only on a single-processor system.

• The test program was not run on a large enough variety of platforms.

• The compiler will be able to perform more optimization on the synchronization test than the ReentrantLock test, but not in a way that will help real-world programs that use synchronization.

Tutorial Pages:
» Is There Any Other Kind?
» A Flawed Microbenchmark
» Benchmark Code Doesn't Look Like Real Code
» Ask the Wrong Question, Get the Wrong Answer
» How to Write a Perfect Microbenchmark
» Resources


First published by IBM developerWorks


 | Bookmark
Related Tutorials:
» All about JAXP, Part 1
» Make Database Queries Without the Database
» Load List Values for Improved Efficiency
» 2 Ways To Implement Session Tracking
» A Simple Way to Read an XML File in Java
» Develop Aspect-Oriented Java Applications with Eclipse and AJDT

Advertise with Us!


Tutorials Scripts Web Hosting Developer Manuals
Resources