High-Performance Programming Techniques on Linux and Windows
In his previous column, Ed began measuring memory-to-memory transfers. He looked in some detail at transferring 16 MB of memory on Linux and Windows 2000. Due to the limited scope of last month’s measurements, no useful conclusions could be drawn. In this column, however, Ed expands his view to include varying the block size of the transfer.
My previous column focused exclusively on 16-MB blocks of memory; this time I will be discussing block sizes varying from 4 bytes to 64 MB. Previously, I examined various methods of performing the memory transfers and determined that using the system-supplied memcpy() routine is probably a good idea. At least until we learn otherwise, I will proceed with that assumption.
In the tests described here, I ran several passes to assure myself that the data is reproducible. My tests were run only on a ThinkPad 600X (650 MHz) with 576 MB of memory; other dual-booting systems were not available.
I encourage you to run these tests on dual-booting machines and report the results. I also encourage you to critique the programming methodology in the programs and suggest ways to improve performance. Our goal here is to demonstrate best programming practices, not to prove that one system is better than another.
Testing Memory Copy Time
I will be using almost the same program as last month. In keeping with my cheap-and-simple source code management system, I renamed that program memxfer5c.cpp.
The changes I've made allow for block sizes smaller than 32 bytes. Last month's program only allowed transfers of 32 bytes or more because of the partial loop unrolling done on "double *" transfers; memxfer5c.cpp simply doesn't do "double *" transfers for block sizes smaller than 32 bytes. Also, an error in the usage message was corrected.
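To make the 32-byte restriction concrete, here is a minimal sketch of a partially unrolled "double *" copy loop. The function name and structure are my illustration, not the actual memxfer5c.cpp source; because the test block sizes are powers of two, the count of doubles is a multiple of four whenever this path runs.

#include <cstddef>

// Sketch of a partially unrolled "double *" transfer. Each pass moves
// four doubles (32 bytes), which is why this method cannot handle
// blocks smaller than 32 bytes.
void copy_double(double *d, const double *s, size_t size_in_bytes)
{
    size_t n = size_in_bytes / sizeof(double);   // doubles to move
    for (size_t i = 0; i < n; i += 4) {          // 32 bytes per iteration
        d[i]     = s[i];
        d[i + 1] = s[i + 1];
        d[i + 2] = s[i + 2];
        d[i + 3] = s[i + 3];
    }
}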
Now let’s examine block size and programming technique. The test script for our test runs is test2c.sh. It consists mostly of a long list of memxfer5c command lines with varying block size and repetition counts.
Memxfer5c.cpp has almost the same usage message:
memxfer5c Usage message
Usage: memxfer5c.exe [-f] [-w] [-s] [-p] [-csv] size cnt [method]
-f flag says to malloc and free each of the "cnt" times.
-w = set process min and max working set size to "size"
-s = silent; only print averages
-p = prep; "freshen" cache before; -w disables
-csv = print output in CSV format
0: "memcpy (default)"
1: "char •"
2: "short •"
3: "int •"
4: "long •"
5: "__int64 •"
6: "double •"
A typical line in the test script file looks like:
memxfer5c -csv -s -p 4 64m 0 1 2 3 4 5 6
The idea here is to output the data in comma-separated-value format ("-csv"), print only the summary ("-s"), "prime" the cache ("-p"), use 4-byte transfers, do 64M (67,108,864) of them, and perform that test for each of the methods listed in the usage message.
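The usage message describes "-p" as "freshen" cache before. My reading of that flag, sketched below, is a pass that touches both buffers once outside the timed region so that the first timed copy is not charged for page faults and cold caches. This is an assumption about the flag's intent, not the actual memxfer5c.cpp code.

#include <cstring>
#include <cstddef>

// Touch both buffers once, untimed, before the measured copies begin.
void prep_buffers(char *dst, const char *src, size_t size)
{
    volatile char sink = 0;
    for (size_t i = 0; i < size; i += 4096)   // one read per 4-KB page
        sink = sink + src[i];
    (void)sink;
    std::memset(dst, 0, size);                // fault in destination pages
}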
Memxfer5c.cpp compiles with the following commands:
gcc -O2 memxfer5c.cpp -o memxfer5c
cl -O2 memxfer5c.cpp -o memxfer5c.exe
(Note: The support routines used by memxfer5c.cpp were described in the introductory column of this series.)
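For readers who do not have that column handy, the harness essentially needs a wall-clock timer that works on both systems. Here is a minimal sketch of such a helper; the name seconds_now() is mine, not the support library's.

#ifdef _WIN32
#include <windows.h>
// Windows: use the high-resolution performance counter.
double seconds_now(void)
{
    LARGE_INTEGER freq, now;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&now);
    return (double)now.QuadPart / (double)freq.QuadPart;
}
#else
#include <sys/time.h>
// Linux: gettimeofday() gives microsecond-granularity wall-clock time.
double seconds_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + tv.tv_usec / 1e6;
}
#endif

The timed region then reduces to reading seconds_now() before and after the copy loop and subtracting.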
Since the memxfer5c.cpp program is available and there are few differences from last month's program, no listings are provided. Instead, I will show the graphical results of our test runs. The test was run with the following command:
bash test2c.sh a
The a argument is an annoying little thing that forces me to think about what the test does. If the script is run with no arguments, it just prints a usage message and exits. Any argument will work.
We ran the script on a ThinkPad 600X Model 2645-9FU with 576 MB of memory and a 12-GB disk. The 600X is a 648-MHz Pentium III machine. (My previous column showed how to find the processor model and the MHz rating in Windows and Linux.) The operating systems tested were:
• Linux 2.2.16-22
• Linux 2.4.4
• Windows 2000 SP1
The Linux kernels were hosted in a Red Hat 7.0 environment, so we used the gcc included with the Red Hat 7.0 distribution (version 2.96). On Windows 2000, we used Microsoft C++ Version 12.00.8168 from Visual Studio 6.0.
A typical output of one of our runs looks like this (after straightening the columns):
memxfer5c.exe -s -p -csv 4 67108864 Win2k
"memcpy", 4,268435456, 5.862, 45.789
"char *", 4,268435456, 3.243, 82.771
"short *",4,268435456, 3.139, 85.507
"int *", 4,268435456, 2.721, 98.667
"long *", 4,268435456, 2.720, 98.672
"memcpy", 4,268435456, 5.862, 45.795
"memcpy", 4,268435456, 5.858, 45.828
Of this information, only the first, second, and last columns are used in plotting the graphs. The middle two columns, the total bytes moved and the elapsed seconds, are present to verify that the last column, the transfer rate in MB/sec, is correct. For instance, from the first "memcpy" line above, we see:
45.792 = 268435456/(5.862*1E6)
The printed numbers are truncated, so this verifies the 45.789 figure only to four significant digits; that is sufficient for the graphical work. (The two extra "memcpy" lines at the end are presumably methods 5 and 6: "__int64 *" and "double *" cannot operate on 4-byte blocks, so those requests revert to memcpy.)
Take a look at the data. The first three graphs (made with Microsoft Excel) represent the individual runs on each of the operating systems, and show very similar behavior between the systems.
Figure 1. Linux 2.2.16-22
Figure 2. Linux 2.4.4
Figure 3. Win2k
I have also produced a plot for each method showing the different operating systems:
Figure 4. Memcpy Linux and Win2k
Figure 5. "Char" Linux and Win2k
Figure 6. "Short" Linux and Win2k
Figure 7. "Int" Linux and Win2k
Figure 8. "Long" Linux and Win2k
Figure 9. "_int64 Linux and Win2k
Figure 10. "Double" Linux and Win2k
There are some oddities about these graphs. First, using "long *" seems to be a better idea than "int *" on Linux. Since int and long variables are the same size on both Linux and Windows, something must be amiss. A disassembly of the relevant portions of main (cases 3 and 4) shows the following:
Disassembly of cases 3 and 4 ("int *" and "long *")
0x8048ec0 <main+1124>: xor %esi,%esi
0x8048ec2 <main+1126>: cmp %edi,%esi
0x8048ec4 <main+1128>: mov 0xffffffdc(%ebp),%ecx
0x8048ec7 <main+1131>: mov 0xffffffd8(%ebp),%edx
0x8048eca <main+1134>: jae 0x8048f87 <main+1323>
0x8048ed0 <main+1140>: mov (%edx),%eax
0x8048ed2 <main+1142>: add $0x4,%esi
0x8048ed5 <main+1145>: mov %eax,(%ecx)
0x8048ed7 <main+1147>: add $0x4,%edx
0x8048eda <main+1150>: add $0x4,%ecx
0x8048edd <main+1153>: cmp %edi,%esi
0x8048edf <main+1155>: jb 0x8048ed0 <main+1140>
0x8048ee8 <main+1164>: xor %esi,%esi
0x8048eea <main+1166>: cmp %edi,%esi
0x8048eec <main+1168>: mov 0xffffffdc(%ebp),%ecx
0x8048eef <main+1171>: mov 0xffffffd8(%ebp),%edx
0x8048ef2 <main+1174>: jae 0x8048f87 <main+1323>
0x8048ef8 <main+1180>: mov (%edx),%eax
0x8048efa <main+1182>: add $0x4,%esi
0x8048efd <main+1185>: mov %eax,(%ecx)
0x8048eff <main+1187>: add $0x4,%edx
0x8048f02 <main+1190>: add $0x4,%ecx
0x8048f05 <main+1193>: cmp %edi,%esi
0x8048f07 <main+1195>: jb 0x8048ef8 <main+1180>
As you can see, the code is identical. I copied the memxfer5c.cpp source file to x.cpp and started fiddling. The first thing I did was verify that the correct cases were actually being executed. After that, I tried switching the order of cases 3 and 4. The result was that "int *" became faster and "long *" became slower; the same two speeds were present, but swapping the locations of the code also swapped the performance. My only conclusion is that the code must sit on a cache boundary of some kind. As can be seen above, no page boundaries are crossed by the two sections of code.
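The addresses in the listing are consistent with that theory: the two inner-loop tops, 0x8048ed0 and 0x8048ef8, sit at offsets 16 and 24 within a 32-byte line (the Pentium III's L1 cache-line size), so the otherwise-identical loops interact differently with instruction fetch. Here is a hedged sketch of how one could check such offsets from inside a running program using GCC's labels-as-values extension; this diagnostic is my own idea, not part of memxfer5c.cpp.

#include <cstdio>

// Report where two points in the code land relative to 32-byte cache
// lines, using GCC's &&label extension (gcc/g++ only).
int main()
{
    void *p3, *p4;

loop3:
    p3 = &&loop3;                 // address of this spot in the code
loop4:
    p4 = &&loop4;

    std::printf("loop3 at %p, offset %lu in a 32-byte line\n",
                p3, (unsigned long)p3 % 32);
    std::printf("loop4 at %p, offset %lu in a 32-byte line\n",
                p4, (unsigned long)p4 % 32);
    return 0;
}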
A second mystery is the behavior of Linux on "int *" and "long *" when compared to Windows. Windows scales appropriately: as the transfer width doubles from "char *" to "short *" to "int *", each graph shows Windows roughly doubling in memory transfer performance. Linux, however, stalls with the 4-byte transfers at around 600 MB/sec. Why it does this is unclear.
An additional curiosity is the "jaggies" in the memcpy graph for Linux and Windows: they show up in the Linux curves at block sizes from 4 KB to 32 KB. Here is an expanded picture of that region without the log scale on the block-size axis.
Figure 11. Memcpy Linux and Win2k
This behavior looks like a beat frequency between cache levels. A system architect would have to say for sure why Linux behaves like this.
A note about compiler optimization
One reader, Matteo Ianeselli, pointed out the "-funroll-loops" option on the gcc compiler. Because the loop limits are variables entered on the command line, it was not clear to me how a compiler could unroll one of these loops, or whether it might unroll too far for a given command-line value (a sketch of the standard transformation follows the list below). I ran a quick test on a ThinkPad 770X using the "-funroll-loops" option. My results showed:
• For transfer sizes smaller than 1.5 MB, unrolling loops was slightly slower than not using the option.
• For transfer sizes larger than 1.5 MB, unrolling loops produced approximately a 135/131 (roughly 3 percent) speedup.
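The answer to my puzzlement about variable loop limits is the standard unroll-plus-remainder transformation: the compiler unrolls the body by a fixed factor and adds a cleanup loop for the leftover iterations. Here is a conceptual sketch of the general technique, not gcc's actual output.

#include <cstddef>

// Unroll by a fixed factor of 8, then finish the leftovers one at a time.
void copy_ints_unrolled(int *d, const int *s, size_t n)
{
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {     // main loop: 8 copies per iteration
        d[i]     = s[i];
        d[i + 1] = s[i + 1];
        d[i + 2] = s[i + 2];
        d[i + 3] = s[i + 3];
        d[i + 4] = s[i + 4];
        d[i + 5] = s[i + 5];
        d[i + 6] = s[i + 6];
        d[i + 7] = s[i + 7];
    }
    for (; i < n; ++i)               // remainder loop: at most 7 copies
        d[i] = s[i];
}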
I have previously looked at the various options on the Microsoft C++ compiler to see if there is anything better than "-O2". In my tests, I found nothing, and I abandoned further searching. However, if one of our readers compiles memxfer5c.cpp with his or her favorite optimization parameters on cl.exe AND it yields better performance, please let us all know in the discussion forum. Searching the parameter space of cl.exe is more efficient when a large number of people participate!
It seems this time that I have created more questions than answers. My results look predictable except for the surprises noted above.
Previously, I cautiously concluded that using memcpy() on both Linux and Windows is a good idea. This month's measurements affirm that conclusion: memcpy() produced better results than any of the other methods on both Windows and Linux. As for comparing Windows and Linux, Windows transfers memory faster when using a 4-byte pointer and transferring less than 200 KB. The partially unrolled "double *" method is also faster on Windows when the transfer is less than 10 KB. In all other cases, Linux seems to move memory faster.
• Examine the source code for the memory transfer program, the test script, and an Excel spreadsheet with the graphs and raw data.
• Read the introductory column that launched this series; it defines the measurement tools Ed uses.
• Read Ed's previous column in this series.
• Find more Linux resources in the developerWorks Linux zone.