Java Theory and Practice: Anatomy of a Flawed Microbenchmark
By Brian Goetz
2005-04-27
Ask the Wrong Question, Get the Wrong Answer

The scary thing about microbenchmarks is that they always produce a number, even if that number is meaningless. They measure something; we're just not sure what. Very often, they measure only the performance of the specific microbenchmark, and nothing more. But it is very easy to convince yourself that your benchmark measures the performance of a specific construct, and then to conclude something erroneous about the performance of that construct.

Even when you write an excellent benchmark, your results may be valid only on the system you ran it on. If you run your tests on a single-processor laptop system with a small amount of memory, you may not be able to conclude anything about the performance on a server system. The performance of low-level hardware concurrency primitives, like compare-and-swap (CAS), differs fairly substantially from one hardware architecture to another.

The reality is that trying to measure something like "synchronization performance" with a single number is just not possible. Synchronization performance varies with the JVM, processor, workload, JIT activity, number of processors, and the amount and character of code being executed using synchronization. The best you can do is run a series of benchmarks on a series of different platforms, and look for similarity in the results. Only then can you begin to conclude something about the performance of synchronization.

In benchmarks run as part of the JSR 166 (java.util.concurrent) testing process, the shape of the performance curves varied substantially between platforms. The cost of hardware constructs such as CAS varies from platform to platform and with the number of processors (for example, uniprocessor systems will never have a CAS fail). The memory barrier performance of a single Intel P4 with hyperthreading (two logical processors on one die) is faster than with two P4s, and both have different performance characteristics than SPARC.
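To make the point about CAS cost concrete, here is a minimal sketch (not part of the JSR 166 test suite) that counts how often a CAS retry loop fails as the number of threads changes. The class name, the `run` helper, and the `INCREMENTS_PER_THREAD` constant are all hypothetical, chosen only for illustration. The sketch is itself a microbenchmark, so it inherits every flaw discussed here: treat the numbers it prints as a property of one machine and one JVM, not as a measurement of "CAS performance."

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

class CasContentionSketch {
    // Hypothetical workload size: increments performed by each thread.
    static final int INCREMENTS_PER_THREAD = 200_000;

    /**
     * Increments a shared counter from the given number of threads using a
     * CAS retry loop. Returns {finalCount, failedCasAttempts, elapsedNanos}.
     */
    static long[] run(int threads) throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        AtomicLong failures = new AtomicLong();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] workers = new Thread[threads];

        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                try {
                    start.await(); // release all workers at roughly the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                long localFailures = 0;
                for (int n = 0; n < INCREMENTS_PER_THREAD; n++) {
                    while (true) {
                        long current = counter.get();
                        if (counter.compareAndSet(current, current + 1)) {
                            break; // CAS succeeded
                        }
                        localFailures++; // another thread changed the value first
                    }
                }
                failures.addAndGet(localFailures);
            });
            workers[i].start();
        }

        long begin = System.nanoTime();
        start.countDown();
        for (Thread t : workers) {
            t.join();
        }
        long elapsed = System.nanoTime() - begin;
        return new long[] { counter.get(), failures.get(), elapsed };
    }

    public static void main(String[] args) throws InterruptedException {
        run(2); // warm-up run whose result is discarded, so the JIT has seen the loop
        int cpus = Runtime.getRuntime().availableProcessors();
        for (int threads : new int[] { 1, 2, cpus }) {
            long[] r = run(threads);
            System.out.printf("threads=%d count=%d failedCAS=%d elapsedMs=%d%n",
                    threads, r[0], r[1], r[2] / 1_000_000);
        }
    }
}
```

With a single thread, no CAS can fail, because nothing else modifies the counter. With more threads, the failure count and the elapsed time depend on the hardware, the JVM, and the scheduler, which is why a single run on a single machine says little about any other platform.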
So the best you can do is try to build "typical" examples, run them on "typical" hardware, and hope that the results yield some insight into the performance of our real programs on our real hardware. What constitutes a "typical" example? One whose mix of computing, I/O, synchronization, and contention, and whose memory locality, allocation behavior, context switching, system calls, and inter-thread communication, approximate those of a real-world application. Which is to say that a realistic benchmark looks an awful lot like a real-world program.

First published by IBM developerWorks