Optimizing AMD Opteron Memory Bandwidth, Part 1: single-thread, read-only
Posted by John D. McCalpin, Ph.D. on November 3, 2010
Background
The memory hierarchy on modern computers is complex almost beyond belief. For well over 20 years I have been working on the subject of memory bandwidth in computer systems, and despite this experience (including 12 years on the microprocessor & system design teams at SGI, IBM, and AMD), the complexity might be beyond what my brain can handle. But since I enjoy a challenge, I am embarking on a set of posts to see if I can explain the memory bandwidth of a modern system and use my understanding of the details to create an implementation with superior performance.
System Under Test
For this set of posts the primary system under test is a Dell PowerEdge M605 blade with two quad-core AMD Opteron model 2389 processors. Each processor chip has two channels of DDR2/800 DRAM, with two 2 GiB dual-rank DIMMs installed on each channel. 2 sockets * 2 channels/socket * 2 DIMMs/channel * 2 GiB/DIMM = 16 GiB total memory installed on the blade.
The peak memory bandwidth of each channel is 8 Bytes * 800 MHz = 6.4 GB/s, giving a peak memory bandwidth of 12.8 GB/s per socket and 25.6 GB/s for the blade.
Choice of Test Kernel
I will begin with what appears to be a ridiculously simple example — the summation of all the elements of a single contiguous array of 64-bit numbers stored in local memory using a single thread of execution. By the time I am finished, I suspect you will agree that this simple start was a good choice….
In pseudo-code, one might write the basic kernel as:
sum = 0.0;
for (i=0; i<N; i++) sum += array[i];
The execution time of the computation is measured, and the data transfer rate is computed in MB/s. Note that MB/s is 10^6 Bytes per second, which gives numerical values almost 5% higher than would be obtained if I were computing transfer rate in MiB/s (=2^20 Bytes per second).
The system under test has a 64 KiB Level 1 data cache and a 512 KiB unified Level 2 cache per core, plus a 6144 KiB Level 3 cache shared across the chip, for a total of 6720 KiB of cache. Since I am interested in bandwidth from DRAM, the test uses N=32,768,000, which corresponds to an array size of 256,000 KiB — slightly over 38 times the total cache size. The kernel is repeated 100 times to “flush” the cache, and the average bandwidth is computed and presented.
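As a quick sanity check on those sizes, the following trivial calculation (using only the figures just quoted) reproduces the array footprint and the ratio to total cache:

#include <stdio.h>

int main(void)
{
    long   n         = 32768000;                /* array elements (64-bit doubles)      */
    double array_kib = n * 8.0 / 1024.0;        /* = 256,000 KiB                        */
    double cache_kib = 64.0 + 512.0 + 6144.0;   /* L1D + L2 + shared L3 = 6720 KiB      */
    printf("array = %.0f KiB, total cache = %.0f KiB, ratio = %.1f\n",
           array_kib, cache_kib, array_kib / cache_kib);   /* ratio is about 38.1 */
    return 0;
}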
A Sequence of Implementations
Implementation 1: Simple with Serial Compilation
The following source code was compiled with the Intel version 11.1 C compiler, using the commands:
icc -O2 ReadOnly.c -S
as ReadOnly.s -o ReadOnly.o
icc -O2 ReadOnly.o -o ReadOnly.icc.serial
Splitting the compilation up like this allows me to see the assembly code every time I compile, so I can monitor what the compiler is doing.
— ReadOnly.c —
#include <stdio.h>

#define N 32768000
#define NTIMES 100

extern double mysecond();   // a simple wall-clock timer -- appended

double a[N];                // the data array

int main()
{
    int i, j;
    double sum;
    double t0, bw, times[NTIMES];

    for (i=0; i<NTIMES; i++) {
        times[i] = 0.0;
    }
    for (i=0; i<N; i++) {
        a[i] = 1.0;
    }

    sum = 0.0;
    for (j=0; j<NTIMES; j++) {
        t0 = mysecond();
        for (i=0; i<N; i++) {
            sum += a[i];
        }
        times[j] = mysecond() - t0;
    }

    printf("sum = %f\n", sum);
    for (i=0; i<NTIMES; i++) {
        bw = sizeof(double) * (double) N / times[i] / 1e6;
        printf("iter, time, bw (MB/s) %d, %f, %f\n", i, times[i], bw);
    }
    return 0;
}

/* A gettimeofday routine to give access to the wall clock
   timer on most UNIX-like systems. */

#include <sys/time.h>

double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp, &tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}
— End of ReadOnly.c —
Running the code under the “time” command gives output like:
sum = 3276800000.000000
iter, time, bw (MB/s) 0, 0.063371, 4136.659284
iter, time, bw (MB/s) 1, 0.063181, 4149.100482
iter, time, bw (MB/s) 2, 0.063225, 4146.205961
iter, time, bw (MB/s) 3, 0.063187, 4148.693441
iter, time, bw (MB/s) 4, 0.063210, 4147.191209
iter, time, bw (MB/s) 5, 0.063176, 4149.429305
iter, time, bw (MB/s) 6, 0.063195, 4148.176926
iter, time, bw (MB/s) 7, 0.063240, 4145.221181
iter, time, bw (MB/s) 8, 0.063204, 4147.582311
[...]
iter, time, bw (MB/s) 94, 0.063249, 4144.643036
iter, time, bw (MB/s) 95, 0.063239, 4145.283693
iter, time, bw (MB/s) 96, 0.063278, 4142.737862
iter, time, bw (MB/s) 97, 0.063261, 4143.846398
iter, time, bw (MB/s) 98, 0.063239, 4145.283693
iter, time, bw (MB/s) 99, 0.063240, 4145.236809

real    0m6.519s
user    0m6.412s
sys     0m0.105s
It is important to save and use the times for each iteration so that the compiler does not optimize the timed loops away. It is also helpful to have quick visual feedback on the iteration-to-iteration variability of the memory bandwidth, which is clearly small here.
So the system under test delivers a very steady 4.145 GB/s using this version of the code. This is only 32% of the peak memory bandwidth of 12.8 GB/s for the socket, which is an uninspiring result. Don’t worry — it will get a lot better before I am through!
Analysis of Implementation 1
So why does this sample program deliver such a small fraction of the peak memory bandwidth of the node?
Instead of looking at all the possible performance limiters (most of which we will get to in due time), I will cut to the chase and give you the answer:
The performance limit here is directly due to the limited number of outstanding cache misses available to a single core.
The relevant formula is often referred to as “Little’s Law”, which in this case reduces to the simple statement that
Latency * Bandwidth = Concurrency
where Latency is the time required to load a cache line from memory (about 74 ns on the system under test, as I reported earlier), Bandwidth is the 12.8 GB/s peak transfer rate of the DRAM on one processor chip, and Concurrency is the quantity of data that must be “in flight” in order to “fill the pipeline” or “tolerate the latency”. For the system under test, the required concurrency is 74 ns * 12.8 GB/s = 947 Bytes = 14.8 cache lines. Unfortunately, each core in the Family 10h Opteron processor supports a maximum of only 8 outstanding cache read misses.
Rearranging the formula to Bandwidth = Concurrency/Latency allows us to estimate how much bandwidth we think a processor should be able to get for a given Latency and a given Concurrency. Using 8 cache lines (512 Bytes) and 74 ns suggests that the maximum sustainable bandwidth will be about 6.92 GB/s. Our observed result of 4.145 GB/s is well below this value. Substituting the observed bandwidth allows us to compute the effective concurrency, which is 4.145 GB/s * 74 ns = 306 Bytes = 4.8 Cache Lines.
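These numbers are easy to reproduce. The following is a minimal sketch of the Little’s Law arithmetic, using only the latency, peak bandwidth, and measured bandwidth quoted above:

#include <stdio.h>

int main(void)
{
    double latency_s   = 74e-9;     /* measured memory latency on the system under test */
    double peak_bw     = 12.8e9;    /* peak DRAM bandwidth per socket (Bytes/s)          */
    double line_bytes  = 64.0;      /* cache line size                                   */
    double measured_bw = 4.145e9;   /* observed bandwidth for Implementation 1           */

    /* Concurrency required to sustain peak bandwidth at this latency */
    double needed = latency_s * peak_bw;            /* ~947 Bytes = ~14.8 cache lines */

    /* Bandwidth attainable with only 8 outstanding cache-line misses */
    double max_bw = 8.0 * line_bytes / latency_s;   /* ~6.92 GB/s                     */

    /* Effective concurrency implied by the measured bandwidth */
    double effective = measured_bw * latency_s;     /* ~307 Bytes = ~4.8 cache lines  */

    printf("required concurrency:     %.0f Bytes (%.1f lines)\n", needed, needed/line_bytes);
    printf("max bandwidth @ 8 misses: %.2f GB/s\n", max_bw/1e9);
    printf("effective concurrency:    %.0f Bytes (%.1f lines)\n", effective, effective/line_bytes);
    return 0;
}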
Some insight into the limited concurrency is available by re-compiling the code at optimization level “-O1”, which reduces the performance to 3.393 GB/s, corresponding to an effective concurrency of 251 Bytes or 3.9 cache lines.
The assembly code for the innermost loop at “-O1” is:
..B1.8:
        addsd     a(,%rax,8), %xmm2
        incq      %rax
        cmpq      $32768000, %rax
        jl        ..B1.8
while the assembly code for the same loop at “-O2” is:
..B1.7:
        addpd     a(,%rax,8), %xmm0
        addq      $2, %rax
        cmpq      $32768000, %rax
        jl        ..B1.7
In the first case the use of the “addsd” (Add Scalar, Double Precision) instruction shows that the compiler is using a single summation variable, while in the second case, the “addpd” (Add Packed, Double Precision) shows that the compiler is using two summation variables — the upper and lower halves of a “packed double” SSE register. Because of the data dependence on the sequence of summations, the code at “-O1” experiences 32,768,000 pipeline stalls (one per addition), while the code at “-O2” experiences 16,384,001 pipeline stalls — half as many (plus one at the end to add the two partial sums together). The floating-point add instructions used here have a dependent operation latency of four cycles. Some of this is overlapped with the pointer update, compare, and branch, but not all of it. The results at “-O1” correspond to about 6.84 CPU cycles per element, while the results at “-O2” correspond to about 5.60 CPU cycles per element, a difference of 1.24 cycles per element.
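Written out in C, the effect of the “-O2” code generation is the same as splitting the accumulation into two explicit partial sums. The fragment below is a hand-written sketch of a drop-in replacement for the timed inner loop in ReadOnly.c above (it assumes N is even, which it is here); the compiler achieves the same thing implicitly through the two halves of the packed SSE register:

    double sum0 = 0.0, sum1 = 0.0;
    for (i = 0; i < N; i += 2) {
        sum0 += a[i];       /* even elements accumulate in one dependence chain ... */
        sum1 += a[i+1];     /* ... odd elements in an independent chain             */
    }
    sum += sum0 + sum1;     /* one extra add to combine the partial sums            */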
The surprising (and important) result here is that these extra floating-point pipeline latencies are not overlapped with the memory latencies — after all, a few extra stall cycles in the core should be negligible compared to the ~215 cycles of memory latency (74 ns * 2.9 GHz). The problem is that these floating-point pipeline stalls delay the execution of the subsequent memory load references that are needed to allow additional hardware prefetches to be issued from the core to the memory system.
In my next entry, I will show how software prefetch instructions can directly increase the number of outstanding cache misses, and how explicitly coding for more partial sum variables can indirectly allow more outstanding prefetches by eliminating the floating-point pipeline stalls and allowing subsequent memory references to be issued more quickly….
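Purely as a preview, and only as an illustrative sketch (the best prefetch distance and the best number of partial sums on this processor are exactly what the next post will quantify), an inner loop combining both ideas might look like the following. The 512-Byte (8 cache line) prefetch distance is a guess for illustration, not a tuned value, and the loop again replaces the timed inner loop of ReadOnly.c (with <xmmintrin.h> included at the top of the file):

    /* Four independent partial sums hide the 4-cycle FP add latency, and one
       software prefetch per 64-Byte cache line requests data well ahead of use.
       Prefetching past the end of the array at the tail is harmless because
       prefetch instructions do not fault on out-of-range addresses. */
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
    for (i = 0; i < N; i += 8) {                            /* 8 doubles = one cache line */
        _mm_prefetch((const char *)&a[i+64], _MM_HINT_T0);  /* 8 lines (512 Bytes) ahead  */
        sum0 += a[i]   + a[i+1];
        sum1 += a[i+2] + a[i+3];
        sum2 += a[i+4] + a[i+5];
        sum3 += a[i+6] + a[i+7];
    }
    sum += sum0 + sum1 + sum2 + sum3;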
November 6th, 2010 at 1:35 pm
> The kernel is repeated 100 times to “flush” the cache,

I think that to flush the cache, one should write or read another array larger than the cache size.
November 8th, 2010 at 3:37 pm
There are some times when using a different address range to flush the cache is useful, but it does not make any difference in this case.
I did not really run the kernel 100 times to flush the cache — the array is large enough that it flushes the cache every iteration. I ran 100 iterations for two reasons: (1) it allows me to gather performance counters around the whole program with overhead from initialization and error checking limited to ~1%, and (2) monitoring variability across the 100 iterations helps quantify the amount of noise in the data (typically caused by other processes competing for memory bandwidth).
November 11th, 2010 at 8:11 am
John, if I google for Little’s Law I get explanations that do not involve latency but something like time-in-the-system. Can you give me a link for your interpretation?
November 11th, 2010 at 11:19 am
In its original form, Little’s Law pertains to average occupancy in a queue. The form that I am using is from Burton Smith, who uses it in every one of his presentations on parallelism in computer systems. David Bailey discusses this version and provides a proof of correctness in this 1997 report: http://crd.lbl.gov/~dhbailey/dhbpapers/little.pdf
November 11th, 2010 at 11:27 am
I should add that it is somewhat dangerous to recast Little’s Law in terms of Latency rather than Occupancy, but in a system with a single-level queue, Latency provides a lower bound on Occupancy (and therefore an upper bound on attainable bandwidth). There are certainly computer systems out there whose queue Occupancy is significantly higher than the corresponding Latency, but that does not apply to the system under test (or its close relatives). The memory controller prefetcher in the system under test provides a second level of queueing, which can make the Occupancy slightly less than the Latency. I will provide direct measurements of queue occupancy for these test cases in a future blog entry.
September 27th, 2011 at 7:31 pm
Thanks John, I’ve been looking everywhere for help with AMD optimization and this is exactly what I was looking for. You’re a real life saver! 🙂