John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Computer Hardware' Category

Optimizing AMD Opteron Memory Bandwidth, Part 2: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 8th November 2010

In a previous entry, I started discussing the issues related to memory bandwidth for a read-only kernel on a sample AMD Opteron system. The naive implementation gave a performance of 3.393 GB/s when compiled at “-O1” (hereafter “Version 001”) and 4.145 GB/s when compiled at “-O2” (hereafter “Version 002”). Today I will see how far this single-thread performance can be enhanced.

The surprising result from the previous experiments was that the floating-point pipeline latency made visible by the dependent floating-point add operations was quite important in limiting the number of outstanding cache line fetches, and therefore constituted an important limiter in overall performance. The dependent operation latency of the floating-point pipeline in the AMD Opteron Family10h processor is 4 cycles, so four add operations must be operating concurrently to fill the pipeline.

Filling the Floating-Point Pipeline

Scalar SSE

The code was modified to produce “Version 003”, which declares four separate summation variables (sum0, sum1, sum2, sum3), and unrolls the inner loop to handle a cache line at a time:

        for (i=0; i<N; i+=8) {
            sum0 += a[i+0];
            sum1 += a[i+1];
            sum2 += a[i+2];
            sum3 += a[i+3];
            sum0 += a[i+4];
            sum1 += a[i+5];
            sum2 += a[i+6];
            sum3 += a[i+7];
        }
        sum = sum0 + sum1 + sum2 + sum3;

The modified code was compiled (again with the Intel version 11.1 compiler) at “-O2”. The assembly code for the inner loop was:

        addsd     a(,%rax,8), %xmm3
        addsd     8+a(,%rax,8), %xmm2
        addsd     16+a(,%rax,8), %xmm1
        addsd     24+a(,%rax,8), %xmm0 
        addsd     32+a(,%rax,8), %xmm3  
        addsd     40+a(,%rax,8), %xmm2
        addsd     48+a(,%rax,8), %xmm1
        addsd     56+a(,%rax,8), %xmm0
        addq      $8, %rax
        cmpq      $32768000, %rax 
        jl        ..B1.7  

The assembly code follows the C source code exactly. I was a little surprised that the compiler did not combine these 8 scalar operations into 4 packed operations, but it is good to remember that compilers are unpredictable beasts, and need to be monitored closely. Performance for Version 003 was 4.511 GB/s, about 8.8% faster than Version 002 (4.511 / 4.145 = 1.088). In terms of execution time per floating-point addition operation, this optimization saved about 0.5 cycles per element.
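As a hedged sketch, the four-accumulator loop can be wrapped in a function that also copes with array lengths that are not a multiple of 8. The name `sum4_scalar` and the tail loop are illustrative assumptions, not part of the original benchmark:

```c
#include <stddef.h>

/* Sum n doubles with four independent accumulators, so that four
 * dependent-add chains can be in the floating-point pipeline at once.
 * Hypothetical helper: the name and the tail handling are illustrative. */
double sum4_scalar(const double *a, size_t n)
{
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
    size_t i;

    /* Main loop: one 64-byte cache line (8 doubles) per iteration. */
    for (i = 0; i + 8 <= n; i += 8) {
        sum0 += a[i + 0];
        sum1 += a[i + 1];
        sum2 += a[i + 2];
        sum3 += a[i + 3];
        sum0 += a[i + 4];
        sum1 += a[i + 5];
        sum2 += a[i + 6];
        sum3 += a[i + 7];
    }

    /* Tail: any elements left over when n is not a multiple of 8. */
    for (; i < n; i++) {
        sum0 += a[i];
    }

    return sum0 + sum1 + sum2 + sum3;
}
```

Note that splitting the sum across accumulators changes the order of the floating-point additions, so results can differ in the last bits from a single-accumulator loop on general data.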

Vector SSE

Continuing further in this direction, it is time to force the generation of packed double SSE arithmetic operations. The floating-point add unit is two 64-bit elements wide, so to fill the pipeline the four add operations really need to be ADDPD — packed double adds. While it may be possible to convince the compiler to generate the desired code with portable code, I decided to bite the bullet here and use some compiler extensions to get what I wanted. Version 006 (don’t worry — I have not forgotten 004 & 005) includes these declarations that the compiler interprets as packed double floating-point variables:

    __m128d sum0,sum1,sum2,sum3;
    __m128d x0,x1,x2,x3;

Note that these variables can only be used in limited ways — primarily as sources or destinations of assignment functions or SSE intrinsic functions. For example, to set the initial values I use a compiler intrinsic:

        x0 = _mm_set_pd(0.0,0.0);
        x1 = _mm_set_pd(0.0,0.0);
        x2 = _mm_set_pd(0.0,0.0);
        x3 = _mm_set_pd(0.0,0.0);
        sum0 = _mm_set_pd(0.0,0.0);
        sum1 = _mm_set_pd(0.0,0.0);
        sum2 = _mm_set_pd(0.0,0.0);
        sum3 = _mm_set_pd(0.0,0.0);

The inner loop of the summation is also coded with special intrinsic functions (defined in <emmintrin.h> and similarly-named files):

        for (i=0; i<N; i+=8) {
            x0 = _mm_load_pd(&a[i+0]);
            sum0 = _mm_add_pd(sum0,x0);
            x1 = _mm_load_pd(&a[i+2]);
            sum1 = _mm_add_pd(sum1,x1);
            x2 = _mm_load_pd(&a[i+4]);
            sum2 = _mm_add_pd(sum2,x2);
            x3 = _mm_load_pd(&a[i+6]);
            sum3 = _mm_add_pd(sum3,x3);
        }

The _mm_load_pd() intrinsic is read as “multi-media load packed double”. It expects a 16-byte aligned pointer as its argument, and returns a value of type __m128d. The _mm_add_pd() intrinsic is the “multi-media add packed double” operation. It takes two arguments of type __m128d and returns their element-wise sum. Assigning the result back to the variable passed as the first argument, as I do here, mimics the two-operand x86 ADDPD instruction, which overwrites its destination register. The arguments are passed by value, so if the left-hand side does not match the first argument the compiler simply has to find another register for the result; I have not checked what code it generates in that case. Caveat Emptor.

The assembly code generated for this loop is exactly what I wanted:

        addpd     a(,%rax,8), %xmm3 
        addpd     16+a(,%rax,8), %xmm2 
        addpd     32+a(,%rax,8), %xmm1
        addpd     48+a(,%rax,8), %xmm0
        addq      $8, %rax 
        cmpq      $32768000, %rax  
        jl        ..B1.10

There are a couple of ways to “unpack” packed double variables in order to perform the final summation across the partial sums. In this case the vector is very long, so the time required to perform the last couple of summations is tiny and the code does not need to be efficient. I picked the first approach that I could figure out how to code:

            x0 = _mm_set_pd(0.0,0.0);
            x0 = _mm_add_pd(x0,sum0);
            x0 = _mm_add_pd(x0,sum1);
            x0 = _mm_add_pd(x0,sum2);
            x0 = _mm_add_pd(x0,sum3);
            _mm_storel_pd(&temp1, x0);
            _mm_storeh_pd(&temp2, x0);
            sum = temp1 + temp2;

This code clears a packed double variable, then adds the four (packed double) partial sums to generate a final pair of partial sums in the upper and lower halves of x0. The _mm_storel_pd() intrinsic stores the 64-bit double in the “low” half of x0 into the memory location pointed to by the first argument, while _mm_storeh_pd() stores the 64-bit double in the “high” half of x0 into the memory location pointed to by its first argument. These two doubles are then added together to build the final sum value.

The performance improvement provided by optimizing the inner loop was bigger than I expected: Version 006 delivered 5.246 GB/s, a full 27% faster than Version 002 (naive code compiled at “-O2”) and 16% faster than Version 003 (4 scalar partial sums). This optimization saved an additional 0.72 cycles per element relative to the scalar SSE Version 003. On the down side, this is still only about 41% of the peak memory bandwidth available to each processor chip, so there is a long way to go.
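For reference, the pieces of the packed-double version can be assembled into one function. This is a sketch under stated assumptions rather than the benchmark source: the name `sum4_packed`, the alignment check, the tail loop, and the use of `_mm_setzero_pd()` in place of `_mm_set_pd(0.0,0.0)` are all illustrative choices.

```c
#include <assert.h>
#include <emmintrin.h>   /* SSE2 intrinsics: __m128d, _mm_load_pd, _mm_add_pd, ... */
#include <stddef.h>
#include <stdint.h>

/* Sum n doubles using four packed-double accumulators (8 partial sums).
 * Hypothetical helper; the pointer must be 16-byte aligned because
 * _mm_load_pd() requires aligned addresses. */
double sum4_packed(const double *a, size_t n)
{
    __m128d sum0 = _mm_setzero_pd();
    __m128d sum1 = _mm_setzero_pd();
    __m128d sum2 = _mm_setzero_pd();
    __m128d sum3 = _mm_setzero_pd();
    __m128d total;
    double lo, hi, sum;
    size_t i;

    assert(((uintptr_t)a & 15u) == 0);   /* precondition: aligned input */

    /* Main loop: one 64-byte cache line (four packed loads) per iteration. */
    for (i = 0; i + 8 <= n; i += 8) {
        sum0 = _mm_add_pd(sum0, _mm_load_pd(&a[i + 0]));
        sum1 = _mm_add_pd(sum1, _mm_load_pd(&a[i + 2]));
        sum2 = _mm_add_pd(sum2, _mm_load_pd(&a[i + 4]));
        sum3 = _mm_add_pd(sum3, _mm_load_pd(&a[i + 6]));
    }

    /* Reduce the four packed accumulators to one, then add its two halves. */
    total = _mm_add_pd(_mm_add_pd(sum0, sum1), _mm_add_pd(sum2, sum3));
    _mm_storel_pd(&lo, total);
    _mm_storeh_pd(&hi, total);
    sum = lo + hi;

    /* Scalar tail for lengths that are not a multiple of 8. */
    for (; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

Whether the compiler folds each load into the memory operand of ADDPD, as in the assembly shown above, depends on the compiler and flags, so the generated code is still worth inspecting.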

Next time — all about prefetching….

Posted in Computer Hardware | 1 Comment »

Optimizing AMD Opteron Memory Bandwidth, Part 1: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 3rd November 2010


The memory hierarchy on modern computers is complex almost beyond belief.  For well over 20 years, I have been working on the subject of memory bandwidth in computer systems and despite this experience (including 12 years on the microprocessor & system design teams at SGI, IBM, and AMD) the complexity might be beyond what my brain can handle.   But since I enjoy a challenge, I am embarking on a set of posts to see if I can explain the memory bandwidth of a modern system and use my understanding of the details to create an implementation with superior performance.

System Under Test

For this set of posts the primary system under test is a Dell PowerEdge M605 blade with two quad-core AMD Opteron model 2389 processors.  Each processor chip has two channels of DDR2/800 DRAM, with two 2 GiB dual-rank DIMMs installed on each channel.  2 sockets * 2 channels/socket * 2 DIMMs/channel * 2 GiB/DIMM = 16 GiB total memory installed on the blade.

The peak memory bandwidth of each channel is 8 Bytes * 800 MT/s (million transfers per second) = 6.4 GB/s, giving a peak memory bandwidth of 12.8 GB/s per socket and 25.6 GB/s for the blade.
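The peak-bandwidth arithmetic can be written out as a small helper. The function names are hypothetical, and the calculation assumes only the figures quoted above (8 bytes per transfer, 800 million transfers per second, two channels per socket, two sockets):

```c
/* Peak bandwidth of one DRAM channel in GB/s (10^9 bytes per second).
 * bytes_per_transfer: channel width in bytes (8 for a 64-bit channel)
 * mega_transfers_per_s: transfer rate in millions per second (800 for DDR2/800) */
double channel_peak_gb_s(double bytes_per_transfer, double mega_transfers_per_s)
{
    return bytes_per_transfer * mega_transfers_per_s / 1000.0;
}

/* Aggregate peak bandwidth across channels and sockets, in GB/s. */
double system_peak_gb_s(double channel_gb_s, int channels_per_socket, int sockets)
{
    return channel_gb_s * channels_per_socket * sockets;
}
```

With these inputs, `channel_peak_gb_s(8, 800)` gives 6.4, one socket (two channels) gives 12.8, and the two-socket blade gives 25.6.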

Choice of Test Kernel

I will begin with what appears to be a ridiculously simple example — the summation of all the elements of a single contiguous array of 64-bit numbers stored in local memory using a single thread of execution.  By the time I am finished, I suspect you will agree that this simple start was a good choice….

In pseudo-code, one might write the basic kernel as:

sum = 0.0;
for (i=0; i<N; i++) sum += array[i];

The execution time of the computation is measured, and the data transfer rate is computed in MB/s.  Note that MB/s is 10^6 Bytes per second, which gives numerical values almost 5% higher than would be obtained if I were computing transfer rate in MiB/s (=2^20 Bytes per second).
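A minimal sketch of the unit conversion, with hypothetical function names, shows where the "almost 5%" difference between MB/s and MiB/s comes from:

```c
/* Transfer rate in MB/s, where 1 MB = 10^6 bytes. */
double bandwidth_mb_s(double bytes, double seconds)
{
    return bytes / seconds / 1.0e6;
}

/* Transfer rate in MiB/s, where 1 MiB = 2^20 = 1,048,576 bytes. */
double bandwidth_mib_s(double bytes, double seconds)
{
    return bytes / seconds / 1048576.0;
}
```

For any measurement the ratio of the two results is 1048576 / 1000000 = 1.048576, so the MB/s figure is about 4.9% larger than the MiB/s figure.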

The system under test makes use of a 64KiB Data Cache, a 512 KiB unified Level 2 cache, and a 6144 KiB shared Level 3 cache, for a total of 6720 KiB of cache.   Since I am interested in bandwidth from DRAM, the test uses N=32,768,000, which corresponds to an array size of 256,000 KiB — slightly over 38 times the total cache size.   The kernel is repeated 100 times to “flush” the cache, and the average bandwidth is computed and presented.

A Sequence of Implementations

Implementation 1: Simple with Serial Compilation

The following source code was compiled with the Intel version 11.1 C compiler, using the commands:

icc -O2 ReadOnly.c -S
as ReadOnly.s -o ReadOnly.o
icc -O2 ReadOnly.o -o ReadOnly.icc.serial

Splitting the compilation up like this allows me to see the assembly code every time I compile, so I can monitor what the compiler is doing.

— ReadOnly.c —

#include <stdio.h>

#define N 32768000
#define NTIMES 100

extern double mysecond();      // a simple wall-clock timer -- appended
double a[N];                   // the data array

int main()
{
    int i,j;
    double sum;
    double t0, bw, times[NTIMES];

    for (i=0; i<NTIMES; i++) {
        times[i] = 0.0;
    }
    for (i=0; i<N; i++) {
        a[i] = 1.0;
    }

    sum = 0.0;
    for (j=0; j<NTIMES; j++) {
        t0 = mysecond();
        for (i=0; i<N; i++) {
           sum += a[i];
        }
        times[j] = mysecond()-t0;
    }
    printf("sum = %f\n",sum);
    for (i=0; i<NTIMES; i++) {
        bw = sizeof(double)*(double) N / times[i]/1e6;
        printf("iter, time, bw (MB/s) %d, %f, %f\n",i,times[i],bw);
    }
    return 0;
}

/* A gettimeofday routine to give access to the wall
 clock timer on most UNIX-like systems.  */

#include <sys/time.h>
double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

— End of ReadOnly.c —

Running the code under the “time” command gives output like:

sum = 3276800000.000000
iter, time, bw (MB/s) 0, 0.063371, 4136.659284
iter, time, bw (MB/s) 1, 0.063181, 4149.100482
iter, time, bw (MB/s) 2, 0.063225, 4146.205961
iter, time, bw (MB/s) 3, 0.063187, 4148.693441
iter, time, bw (MB/s) 4, 0.063210, 4147.191209
iter, time, bw (MB/s) 5, 0.063176, 4149.429305
iter, time, bw (MB/s) 6, 0.063195, 4148.176926
iter, time, bw (MB/s) 7, 0.063240, 4145.221181
iter, time, bw (MB/s) 8, 0.063204, 4147.582311
iter, time, bw (MB/s) 94, 0.063249, 4144.643036
iter, time, bw (MB/s) 95, 0.063239, 4145.283693
iter, time, bw (MB/s) 96, 0.063278, 4142.737862
iter, time, bw (MB/s) 97, 0.063261, 4143.846398
iter, time, bw (MB/s) 98, 0.063239, 4145.283693
iter, time, bw (MB/s) 99, 0.063240, 4145.236809
real    0m6.519s
user    0m6.412s
sys    0m0.105s

It is important to save and use the times for each iteration so that the compiler will actually execute them. It is also helpful to have a quick visual feedback on the iteration-to-iteration variability of the memory bandwidth, which is clearly small here.

So the system under test delivers a very steady 4.145 GB/s using this version of the code. This is only 32% of the peak memory bandwidth of 12.8 GB/s for the socket, which is an uninspiring result. Don’t worry — it will get a lot better before I am through!

Analysis of Implementation 1

So why does this sample program deliver such a small fraction of the peak memory bandwidth of the node?

Instead of looking at all the possible performance limiters (most of which we will get to in due time), I will cut to the chase and give you the answer: the performance limit here is directly due to the limited number of outstanding cache misses available to a single core.

The relevant formula is often referred to as “Little’s Law”, which in this case reduces to the simple statement that

 Latency * Bandwidth = Concurrency

where Latency is the time required to load a cache line from memory (about 74 ns on the system under test, as I reported earlier), Bandwidth is the 12.8 GB/s peak transfer rate of the DRAM on one processor chip, and Concurrency is the quantity of data that must be “in flight” in order to “fill the pipeline” or “tolerate the latency”. For the system under test, the required concurrency is 74 ns * 12.8 GB/s = 947 bytes = 14.8 cache lines. Unfortunately, each core in the Opteron Family10h processor only supports a maximum of 8 outstanding cache read misses.

Rearranging the formula to Bandwidth = Concurrency/Latency allows us to estimate how much bandwidth we think a processor should be able to get for a given Latency and a given Concurrency. Using 8 cache lines (512 Bytes) and 74 ns suggests that the maximum sustainable bandwidth will be about 6.92 GB/s. Our observed result of 4.145 GB/s is well below this value.  Substituting the observed bandwidth allows us to compute the effective concurrency, which is 4.145 GB/s * 74 ns = 306 Bytes = 4.8 Cache Lines.

Some insight into the limited concurrency is available by re-compiling the code at optimization level “-O1”, which reduces the performance to 3.393 GB/s, corresponding to an effective concurrency of 251 Bytes or 3.9 cache lines.
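The Little's Law arithmetic used in this analysis can be captured in a few helper functions. This is only a sketch: the function names are hypothetical, and the 64-byte cache line size is taken from the "8 cache lines (512 Bytes)" figure above.

```c
#define CACHE_LINE_BYTES 64.0

/* Concurrency = Latency * Bandwidth.
 * With latency in ns and bandwidth in GB/s the product is in bytes,
 * because 1e-9 s * 1e9 bytes/s = 1 byte. */
double concurrency_bytes(double latency_ns, double bandwidth_gb_s)
{
    return latency_ns * bandwidth_gb_s;
}

/* The same quantity expressed in cache lines in flight. */
double concurrency_lines(double latency_ns, double bandwidth_gb_s)
{
    return concurrency_bytes(latency_ns, bandwidth_gb_s) / CACHE_LINE_BYTES;
}

/* Rearranged form: Bandwidth = Concurrency / Latency, in GB/s. */
double sustainable_gb_s(double bytes_in_flight, double latency_ns)
{
    return bytes_in_flight / latency_ns;
}
```

Plugging in the values from the text: `concurrency_lines(74, 12.8)` is about 14.8, `sustainable_gb_s(512, 74)` is about 6.92, and the effective concurrency at 4.145 GB/s and 3.393 GB/s comes out at about 4.8 and 3.9 cache lines respectively.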

The assembly code for the innermost loop at “-O1” is:

        addsd     a(,%rax,8), %xmm2  
        incq      %rax       
        cmpq      $32768000, %rax  
        jl        ..B1.8

while the assembly code for the same loop at “-O2” is:

        addpd     a(,%rax,8), %xmm0     
        addq      $2, %rax 
        cmpq      $32768000, %rax 
        jl        ..B1.7

In the first case the use of the “addsd” (Add Scalar, Double Precision) instruction shows that the compiler is using a single summation variable, while in the second case, the “addpd” (Add Packed, Double Precision) shows that the compiler is using two summation variables: the upper and lower halves of a “packed double” SSE register. Because of the data dependence on the sequence of summations, the code at “-O1” experiences 32,768,000 pipeline stalls (one per addition), while the code at “-O2” experiences 16,384,001 pipeline stalls, half as many (plus one at the end to add the two partial sums together). The floating-point add instructions used here have a dependent operation latency of four cycles. Some of this is overlapped with the pointer update, compare, and branch, but not all of it. The results at “-O1” correspond to about 6.84 CPU cycles per element, while the results at “-O2” correspond to about 5.60 CPU cycles per element, a difference of 1.24 cycles per element.

The surprising (and important) result here is that these extra floating point pipeline latencies are not overlapped with the memory latencies. After all, a few extra stall cycles in the core should be negligible compared to the ~215 cycles of memory latency (74 ns * 2.9 GHz). The problem is that these floating-point pipeline stalls are delaying the execution of the subsequent memory load references that are necessary to allow additional hardware prefetches to be issued from the core to the memory system.
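The cycles-per-element figures follow directly from the measured bandwidth and the core clock. A minimal sketch, with hypothetical function names and assuming the 2.9 GHz core frequency and 8-byte elements used above:

```c
/* CPU cycles spent per array element at a given sustained bandwidth.
 * cpu_ghz * 1e9 cycles/s divided by (bandwidth_gb_s * 1e9 / bytes_per_element)
 * elements/s; the factors of 1e9 cancel. */
double cycles_per_element(double cpu_ghz, double bytes_per_element,
                          double bandwidth_gb_s)
{
    return cpu_ghz * bytes_per_element / bandwidth_gb_s;
}

/* Memory latency expressed in CPU cycles (ns * GHz = cycles). */
double latency_cycles(double latency_ns, double cpu_ghz)
{
    return latency_ns * cpu_ghz;
}
```

For example, `cycles_per_element(2.9, 8, 3.393)` is about 6.84 and `cycles_per_element(2.9, 8, 4.145)` is about 5.60, while `latency_cycles(74, 2.9)` is about 215.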

In my next entry, I will show how software prefetch instructions can directly increase the number of outstanding cache misses and how explicitly coding for more partial sum variables can indirectly allow more outstanding prefetches by eliminating the floating-point pipeline stalls and allowing subsequent memory references to be issued more quickly….

Posted in Computer Hardware | 6 Comments »

Opteron/PhenomII STREAM Bandwidth vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In a recent post (link) I showed that local memory latency is weakly dependent on processor and DRAM frequency in a single-socket Phenom II system. Here are some preliminary results on memory bandwidth as a function of CPU frequency and DRAM frequency in the same system.

This table does not include the lowest CPU frequency (0.8 GHz) or the lowest DRAM frequency (DDR3/800) since these results tend to be relatively poor and I prefer to look at big numbers. 🙂

Single-Thread STREAM Triad Bandwidth (MB/s)

Processor Frequency   DDR3/1600     DDR3/1333     DDR3/1066
3.2 GHz               10,101 MB/s   9,813 MB/s    9,081 MB/s
2.5 GHz               10,077 MB/s   9,784 MB/s    9,053 MB/s
2.1 GHz               10,075 MB/s   9,783 MB/s    9,051 MB/s

Two-Thread STREAM Triad Bandwidth (MB/s)

Processor Frequency   DDR3/1600     DDR3/1333     DDR3/1066
3.2 GHz               13,683 MB/s   12,851 MB/s   11,490 MB/s
2.5 GHz               13,337 MB/s   12,546 MB/s   11,252 MB/s
2.1 GHz               13,132 MB/s   12,367 MB/s   11,112 MB/s

So from these results it is clear that the sustainable memory bandwidth as measured by the STREAM benchmark is very weakly dependent on the CPU frequency, but moderately strongly dependent on the DRAM frequency. It is clear that a single thread is not enough to drive the memory system to maximum performance, but the modest speedup from 1 thread to 2 threads suggests that more threads will not be very helpful. More details to follow, along with some attempts to optimize the STREAM benchmark for better sustained performance.

Posted in Computer Hardware, Reference | 4 Comments »

Opteron Memory Latency vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In an earlier post (link), I made the off-hand comment that local memory latency was only weakly dependent on CPU and DRAM frequency in Opteron systems. Being of the scientific mindset, I went back and gathered a clean set of data to show the extent to which this assertion is correct.

The following table presents my most reliable & accurate measurements of open page local memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller. These results were obtained using a pointer-chasing stride of 320 Bytes, modulo 32768 Bytes. This scheme results in five passes through each 32 kB region of memory, resulting in loading every cache line but without activating the core prefetcher (which requires the stride to be no more than one cache line) or the DRAM prefetcher (which requires that the stride be no more than four cache lines on the system under test). Large pages were used to ensure that each 32kB region was mapped to contiguous DRAM locations to maximize DRAM page hits.
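The coverage property of this access pattern can be checked with a short simulation. The sketch below is not the benchmark itself (which is not shown here); it assumes the pattern is "add 320 bytes, wrap at 32768 bytes" within one region and that cache lines are 64 bytes. The function name and constants are illustrative.

```c
#include <stddef.h>
#include <string.h>

#define REGION_BYTES 32768   /* size of one pointer-chasing region */
#define STRIDE_BYTES 320     /* five 64-byte cache lines per step */
#define LINE_BYTES   64

/* Walk one region with the given stride until the offset returns to 0.
 * Returns the number of distinct cache lines touched and stores the
 * number of wrap-arounds (passes through the region) in *passes. */
int chase_coverage(int *passes)
{
    unsigned char seen[REGION_BYTES / LINE_BYTES];
    size_t offset = 0;
    int distinct = 0;
    int wraps = 0;

    memset(seen, 0, sizeof seen);
    do {
        size_t line = offset / LINE_BYTES;
        if (!seen[line]) {
            seen[line] = 1;
            distinct++;
        }
        offset += STRIDE_BYTES;
        if (offset >= REGION_BYTES) {   /* the modulo step */
            offset -= REGION_BYTES;
            wraps++;
        }
    } while (offset != 0);

    *passes = wraps;
    return distinct;
}
```

Because the stride is 5 cache lines and the region holds 512 lines (and 5 and 512 share no common factor), the walk touches all 512 lines exactly once and wraps 5 times before returning to its starting point, which matches the "five passes" description.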

The system used for these tests is a home-built system:

  • AMD Phenom II 555 Processor
    • Two cores
    • Supported CPU Frequencies of 3.2, 2.5, 2.1, and 0.8 GHz
    • Silicon Revision C2 (equivalent to “Shanghai” Opteron server parts)
  • 4 GB of DDR3/1600 DRAM
    • Two 2GB unregistered, unbuffered DIMMs
    • Dual-rank DIMMs
    • Each rank is composed of eight 1 Gbit parts (128Mbit x8)

The array sizes used were 256 MB, 512 MB, and 1024 MB, with each test being run three times. Of these nine results, the value chosen for the table below was an “eye-ball” average of the 2nd-best through 5th-best results. (There was very little scatter in the results — typically less than 0.1 ns variation across the majority of the nine results for each configuration.)

Processor Frequency   DDR3/1600   DDR3/1333   DDR3/1066   DDR3/800
3.2 GHz               51.58 ns    54.18 ns    57.46 ns    66.18 ns
2.5 GHz               52.77 ns    54.97 ns    58.30 ns    65.74 ns
2.1 GHz               53.94 ns    55.37 ns    58.72 ns    65.79 ns
0.8 GHz               82.86 ns    87.42 ns    94.96 ns    94.51 ns


  1. The default frequencies for the system are 3.2 GHz for the CPU cores and 667 MHz for the DRAM (DDR3/1333)

The results above show that the variations in (local) memory latency are relatively small for this single-socket system when the CPU frequency is varied between 2.1 and 3.2 GHz and the DRAM frequency is varied between 533 MHz (DDR3/1066) and 800 MHz (DDR3/1600). Relative to the default frequencies, the lowest latency (obtained by overclocking the DRAM by 20%) is almost 5% better. The highest latency (obtained by dropping the CPU core frequency by ~1/3 and the DRAM frequency by about 20%) is only ~8% higher than the default value.

Posted in Computer Hardware, Reference | 2 Comments »

AMD Opteron Local Memory Latency Chart

Posted by John D. McCalpin, Ph.D. on 1st October 2010

I need to use the whiteboard in my office for something else, so it is time to transcribe the table of Opteron system memory latencies from there to here.

The following table presents my most reliable & accurate measurements of open page memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller.

Processor Family/Revision               1-socket systems     2-socket systems   4-socket systems
Opteron K8, RevE/RevF (2.4 – 3.0 GHz)   60 ns                60 ns              95 ns
Family10h, Rev B3 (2.3 GHz)             (not yet measured)   85 ns              100 ns / 130 ns
Family10h, Rev C2 (2.9 GHz)             54 ns                74 ns              (not yet measured)
Family10h, Rev D0 (2.6 GHz)             (not yet measured)   78 ns              54 ns


  1. Results Updated 2010-10-04 to provide separate 1 socket and 2 socket numbers!!
  2. Memory latency is weakly dependent on CPU frequency, Northbridge frequency, and DRAM frequency in Opteron systems.
  3. Memory latency is controlled by the longer of the time required to get the data from DRAM and the time required to receive “snoop responses” from all the other chips in the system.
  4. Family10h Revision B3 Opterons have higher latency than the Family K8 Opterons for a variety of reasons:
    • The memory controller in K8 Opterons runs at the full CPU speed, while the memory controller in Family10h Opterons runs at a lower frequency.
    • The difference in frequencies between the CPU and memory controller in Family10h Opterons requires an asynchronous boundary between the two. This increases latency.
    • Family10h Opterons have a shared L3 cache that must be checked before sending load requests to memory. Since the L3 cache is physically larger than the L2 caches and is shared across all the cores on the chip, extra latency is incurred for requests that miss in the L3 and go to memory.
    • Family10h Opterons support the HyperTransport version 3 protocol (though Revisions B & C run in HyperTransport version 1 compatibility mode), which appears to add some latency to the probe responses.
  5. Family10h Revision B and C Opterons in 4-socket systems may have different latencies for different sockets, depending on the details of the HyperTransport interconnect. On the TACC “Ranger” system, the SunBlade x6420 nodes have two sockets that have direct connections to all the other sockets, and two sockets that are only connected to two of the other three sockets. The sockets with direct connections to all other sockets display a local memory latency of 100 ns, while the sockets that are not directly connected to all other sockets have to wait longer for some probe responses and show a latency of 130 ns.
  6. Family10h Revision D0 4-socket systems have decreased local memory latency because of the “HT Assist” feature, which uses 1 MB of the shared L3 to maintain a directory of lines that are potentially in a modified state in another processor’s cache. If the cache line is not listed in the directory, then the value in memory is current and probes do not need to be broadcast to other chips if all you want to do is read the data.

If anyone has a working 4-socket Family10h Revision C2 (“Shanghai”) system, I would like to run this benchmark to fill in the last entry in the table!

Posted in Computer Hardware | 1 Comment »