|
Helping ordinary people create extraordinary websites! |
Java Theory and Practice: Anatomy of a Flawed MicrobenchmarkBy Brian Goetz2005-04-27
A Flawed Microbenchmark After publication of my October article, "More flexible, scalable locking in JDK 5.0," a colleague sent me the SyncLockTest benchmark (shown in Listing 1), which was purported to determine whether the synchronized primitive or the new ReentrantLock class was "faster." After running it on his laptop, he came to the conclusion that synchronization was faster, contrary to the conclusion of the article, and presented his benchmark as "evidence." The entire process -- the design of the microbenchmark, its implementation, its execution, and the interpretation of its results were flawed in a number ways. The colleague in this case is a pretty smart guy who's been around the block a few times, which goes to show how difficult this stuff can be. Listing 1. Flawed SyncLockTest microbenchmark interface Incrementer {
SyncLockTest defines two implementations of an interface, and uses System.nanoTime() to time the execution of each implementation 10,000,000 times. Each implementation increments a counter in a thread-safe manner; one using built-in synchronization, and the other using the new ReentrantLock class. The stated goal was to answer the question, "Which is faster, synchronization or ReentrantLock?" Let's look at why this seemingly harmless-looking benchmark fails to measure what it purports to measure or, indeed, whether it measures anything useful at all.
Flaws in conception Ignoring for the moment the implementation flaws, SyncLockTest also suffers from a more serious, conceptual flaw -- it misunderstands the question it is trying to answer. It purports to measure the performance costs of synchronization and ReentrantLock, techniques used to coordinate the action of multiple threads. However, the test program contains only a single thread, and therefore is guaranteed never to experience contention. It omits testing the very situation that makes locking relevant in the first place! In early JVM implementations, uncontended synchronization was slow, a fact that was well-publicized. However, the performance of uncontended synchronization has improved substantially since then. (See Resources for a paper that describes some of the techniques used by JVMs to optimize the performance of uncontended synchronization.) On the other hand, contended synchronization was and still is far more expensive than uncontended synchronization. When a lock is contended, not only does the JVM have to maintain a queue of waiting threads, but it must use system calls to block and unblock the threads that are not able to acquire the lock immediately. Further, applications that experience a high degree of contention usually exhibit lower throughput, not only because more time is spent scheduling threads and less time doing actual work, but because CPUs may remain idle when threads are blocked waiting for a lock. A benchmark to measure the performance of a synchronization primitive should take into account a realistic degree of contention. Flaws in methodology This design failing was compounded by at least two failings of execution -- it was run only on a single-processor system (an unusual environment for a highly concurrent program, and whose synchronization performance may well differ substantially from multiprocessor systems), and only on a single platform. When testing the performance of a given primitive or idiom, especially one that interacts with the underlying hardware to such a significant degree, it is necessary to run benchmarks on many platforms before drawing a conclusion about its performance. When testing something as complicated as concurrency, a dozen different test systems, spanning multiple processors, and a range of processor counts (to say nothing of memory configurations and processor generations) would be advisable to start to get a picture of the overall performance of a given idiom. Flaws in implementation As to implementation, SyncLockTest ignores a number of aspects of dynamic compilation. As you saw in the December installment, the HotSpot JVM first executes a code path in interpreted mode, and only compiles it to machine code after a certain amount of execution. Failing to properly "warm up" the JVM can result in a significant skewing of performance measurements in two ways. First, the time taken by the JIT to analyze and compile the code path is included in the runtime of the test. More importantly, if compilation runs in the middle of your test run, your test result will be the sum of some amount of interpreted execution, plus the JIT compilation time, plus some amount of optimized execution, which doesn't give you much useful insight into the actual performance of your code. And if the code is not compiled prior to running your test, and not compiled during the test, then the entirety of your test run will be interpreted, which also doesn't give you much insight into the real-world performance of the idiom you are testing. SyncLockTest also falls victim to the inlining and deoptimization problem discussed in December, where the first timing pass measures code that has been aggressively inlined with monomorphic call transformation, and the second pass measures code that has been subsequently deoptimized due to the JVM loading another class that extends the same base class or interface. When the timing test method is called with an instance of SyncIncrementer, the runtime will recognize that there is only one class loaded that implements Incrementer, and will convert virtual method calls to increment() into direct method calls to SyncIncrementer. When the timing test method is then run with an instance of LockIncrementer, test() will be recompiled to use virtual method calls, meaning that the second pass through the test() harness method will be doing more work on each iteration than the first, turning our test into a comparison between apples and oranges. This can seriously distort the results, making whichever alternative that runs first look faster. Tutorial Pages: » Is There Any Other Kind? » A Flawed Microbenchmark » Benchmark Code Doesn't Look Like Real Code » Ask the Wrong Question, Get the Wrong Answer » How to Write a Perfect Microbenchmark » Resources First published by IBM developerWorks |
|