High Performance Programming Techniques on Linux and Windows
In his introductory column, Dr. Bradford introduced measurement tools and plans for future performance investigations on Linux and Windows 2000. His focus this month is a simple operation, the memory-to-memory copy, and how long it takes to move bytes around in memory.
Memory copies are frequent and pervasive in computing. They show up in networked applications, database applications, scientific applications, and almost any other application or service you can think of. Because they are so universal, programmers are somewhat cavalier about memory copies and perform them with a variety of programming techniques. A memory copy can be a simple movement of an entire block of memory, or it can follow an access pattern that is not really a copy at all.
One such access pattern arises when accessing the columns of a matrix. You may, for instance, need every twelfth word of a 12×12 matrix; those twelve words represent the first column of the matrix. Frequently a column access requires only 32 or 64 bits per column element; rarely are more than 64 bits required. Complex matrices might require two 64-bit values per column element. The number of bytes between accesses is called a stride, and access to memory is parameterized with a stride value.
Measuring Block Memory Copy Speeds
In this installment we will look at only block memory copies, which are pretty common even in first-year programming courses. Measuring memory copy speeds is tricky because computers have level-1, level-2, and sometimes level-3 caches, so the tests have to deal with cache sizes. We will simply look at the end result, which is how fast the data can be moved. The point is to understand the code path length rather than system buffering or cache issues, both of which will be addressed in a future article. Code path is important because it gives an indication of how fast bytes can be moved and how much overhead is associated with such a move.
Memory transfers are carried out with simple programming techniques. Simple for loops and the memcpy() library routine are the most frequent mechanisms. Less frequent are structure assignments and file IO techniques.
Memory transfers often play only a small part in the overall performance picture. When they do play a part, it is good to know what the operating system and hardware can deliver as a team.
There are too many dimensions to investigate and cover in one column, so the focus here will be narrowed down. This means that a usable conclusion won’t be available until we expand our scope. Some of the parameterizations available for examination include:
• Stride – as mentioned above, relates frequently to matrix access
• Total transfer size – total amount of data moved
• Block size – how much data is moved in a single operation (closely related to the next parameter)
• Programming technique – such as memcpy() or a typed-pointer copy (char *, double *, and so on)
• System page size
• Translation Lookaside Buffer (TLB) size
Here we fix the total transfer size at 16 MB and the stride at 0, which means simple block memory transfers. The block size is varied by varying the programming technique. The system page size is 4K on both Windows 2000 and Linux, and since the same computer is used throughout, the TLB size is also the same. There are no threading issues, because only a single thread moves memory.
The program we will use is called memxfer5b.cpp, which has gone through several versions. Its usage message follows:
memxfer5b usage message
Usage: memxfer5b.exe [-f] [-w] [-s] [-p] size cnt [method]
-f flag says to malloc and free on each of the "cnt" runs.
-w = set process min and max working set size to "size"
-s = silent; only print averages
-p = prep; "freshen" cache before; -w disables
-csv = print output in CSV format
0: "memcpy (default)"
1: "char *"
2: "short *"
3: "int *"
4: "long *"
5: "__int64 *"
6: "double *"
The “-p” option is useful because the initial copy in most of the test runs was slower than the remaining runs. The implication is that the cache was being loaded. The measurements here are specifically aimed at code paths and not memory transfer speeds. For that reason, we use the “-p” option to “prime” memory. The goal is to achieve the very fastest (shortest time) memory transfers possible. In real situations, it is more likely that the initial transfer with the un-primed cache would be the environment the program would experience. Nevertheless, if we choose the code path that leads to the best performance in the tests, we are likely to be on the right track for optimal production code performance.
Memxfer5b can use seven different memory transfer techniques: the “recommended” memcpy() routine, available on both Linux and Windows, plus six different pointer types.
Memxfer5b.cpp compiles easily with either of the following commands:
gcc -O2 memxfer5b.cpp -o memxfer5b
cl /O2 memxfer5b.cpp
Memxfer5b.cpp uses the same support routines as described in my introductory column. Add a routine called Malloc() to the list of support routines. Malloc() behaves just like malloc(), but it prints an error message and exits if it cannot allocate memory. For our purposes this is sufficient. We are not measuring the speed of memory allocation; any failure to allocate memory in this test probably means there is a bug in the program, or we have reached a system limit. In either case we wouldn’t want to proceed. (The Malloc() routine was also mentioned in my introductory column, but it was not included in any of the source code. It is included in this installment — see Resources.)
Previously, we looked at the various options of the Microsoft C++ compiler to see if there is anything better than “-O2”. In our tests we found nothing, and the search was abandoned. However, if one of our readers compiles memxfer5b.cpp with his or her favorite optimization parameterization on cl.exe AND it yields better performance, please let us all know in the discussion forum; click the Discuss icon at the top or bottom of the article. Searching the parameter space of cl.exe is more efficient when a large number of people contribute.
The main loop of memxfer5b.cpp is the part that actually moves memory, bracketing the moves with calls to our timing routines. Memxfer5b.cpp can explore other areas of block memory transfers; a future version will add striding so we can simulate matrix operations.
We will compile and run this test on Windows 2000 Advanced Server with Service Pack 1 installed, and on Linux 2.2.16 and Linux 2.4.4, both running in a Red Hat 7.0 environment. On Linux we will use the gcc included with the Red Hat 7.0 distribution; on Windows 2000, Microsoft C++ Version 12.00.8168 from Visual Studio 6.0. Our test system will be a ThinkPad 600X Model 2645-9FU with 576 MB of memory and a 12 GB disk drive. The 600X is a 648 MHz Pentium III machine according to Windows 2000, and 647.767 MHz according to Linux. The Windows 2000 and Linux mechanisms for determining that information are:
Windows 2000 navigation path to MHz:
Linux command to display MHz: cat /proc/cpuinfo
The program prints the resulting memory speed in megabytes per second. If the “-s” flag is specified, the memory copy speed is averaged over the “cnt” runs; otherwise, each run is printed individually. Our test will be done as follows:
memxfer5b -p -s -csv 16m 8 0 1 2 3 4 5 6
memxfer5b -p -s -csv 16m 8 0 1 2 3 4 5 6
memxfer5b -p -s -csv 16m 8 0 1 2 3 4 5 6
memxfer5b -p -s -csv 16m 8 0 1 2 3 4 5 6
This gives us four sets of eight runs. Don’t expect to see a lot of variation; the repeated attempts will either verify our expectations or give us some bounds on what to expect.
The results are shown in the following table. Both the older and the newer version of Linux appear much faster at memory transfers than Windows 2000. It is not clear what is going on here; further investigation is required.
An examination of the assembly listing for the Windows memxfer5b.exe program shows that the Microsoft C++ compiler uses a “rep movs” instruction when it sees a call to the memcpy API. Otherwise, it dutifully uses the character-by-character move for “char *”, a word-by-word move for “short *”, and, surprisingly, a dword-by-dword move for “int *”, “long *”, “__int64 *”, and “double *”. (Memxfer5b avoids all mis-aligned boundary conditions.)
A similar examination of the Linux binary shows (almost) the same code. The only exception is that gcc actually uses the memcpy() routine. Both compilers move bytes, 16-bit words and 32-bit words only. The generated assembly code for both Linux and Windows is quite similar.
One possibility for the difference in memory copy performance is that Windows does more work in the background than Linux. To test that theory, let’s write a very small, computationally intensive program that calculates a single fractal point. With no arguments, fract2.cpp iterates the fractal formula until either 100,000,000 iterations have transpired or the point “escapes”. Fract2.cpp is a simple program with a short floating-point (double) loop computation.
We found the following:
Operating system     Time to complete (seconds)
Linux 2.2.16-22      4.065
Linux 2.4.4          4.087
Windows 2000 AS      4.300
The times suggest that more computation can be accomplished on the same hardware when it runs either version of Linux. Again, when the two optimized compiled programs were disassembled, we found the generated loops to be almost identical; in fact, the Linux loop contained two more instructions than the Windows-generated loop.
The fract2.cpp overhead might account for some of the large difference in memory copy speeds, but not all of it.
At this point in our investigations, we don’t have enough information to fully understand the memory copy anomaly. Furthermore, we have only looked at one small aspect of block memory moves. For that reason, no justifiable conclusions can yet be drawn about the merits of Linux versus Windows. We can conclude that using memcpy on both platforms is a good idea for sixteen megabyte transfers. We might also conclude that there is a small amount of additional overhead with the Windows operating system as demonstrated by the fract2.cpp program. However, it would be best to let the readers comment on the techniques used here and offer suggestions as to how to make either system more efficient on these two small and insignificant tests.
This article has raised more questions than it has answered, and we will be looking more closely at memory performance in upcoming articles. Until that time, the question of whether one system can move blocks of memory better than the other remains open.
• Participate in the discussion forum on this article. (You can also click Discuss at the top or bottom of the article to access the forum.)
• Read the introductory article that launched this column; it defines the measurement tools Ed uses.
• Get files mentioned in this article:
• memxfer5b.cpp is the test program; it can use seven different memory transfer techniques.
• The main loop of memxfer5b.cpp is the part that actually moves memory.
• fract2.cpp is a simple program with a short floating point (double) loop computation. It iterates the fractal formula until either 100,000,000 iterations has transpired or the point “escapes”.