## Optimizing AMD Opteron Memory Bandwidth, Part 2: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 8th November 2010

In a previous entry, I started discussing the issues related to memory bandwidth for a read-only kernel on a sample AMD Opteron system. The naive implementation gave a performance of 3.393 GB/s when compiled at “-O1” (hereafter “Version 001”) and 4.145 GB/s when compiled at “-O2” (hereafter “Version 002”). Today I will see how far this single-thread performance can be enhanced.

The surprising result from the previous experiments was that the floating-point pipeline latency made visible by the dependent floating-point add operations was quite important in limiting the number of outstanding cache line fetches, and therefore constituted an important limiter in overall performance. The dependent operation latency of the floating-point pipeline in the AMD Opteron Family10h processor is 4 cycles, so four add operations must be operating concurrently to fill the pipeline.

## Filling the Floating-Point Pipeline

### Scalar SSE

The code was modified to produce “Version 003”, which declares four separate summation variables (*sum0, sum1, sum2, sum3*), and unrolls the inner loop to handle a cache line at a time:

for (i=0; i<N; i+=8) { sum0 += a[i+0]; sum1 += a[i+1]; sum2 += a[i+2]; sum3 += a[i+3]; sum0 += a[i+4]; sum1 += a[i+5]; sum2 += a[i+6]; sum3 += a[i+7]; } sum = sum0 + sum1 + sum2 + sum3;

The modified code was compiled (again with the Intel version 11.1 compiler) at “-O2”. The assembly code for the inner loop was:

..B1.7: addsd a(,%rax,8), %xmm3 addsd 8+a(,%rax,8), %xmm2 addsd 16+a(,%rax,8), %xmm1 addsd 24+a(,%rax,8), %xmm0 addsd 32+a(,%rax,8), %xmm3 addsd 40+a(,%rax,8), %xmm2 addsd 48+a(,%rax,8), %xmm1 addsd 56+a(,%rax,8), %xmm0 addq $8, %rax cmpq $32768000, %rax jl ..B1.7

The assembly code follows the C source code exactly. I was a little surprised that the compiler did not combine these 8 scalar operations into 4 packed operations, but it is good to remember that compilers are unpredictable beasts, and need to be monitored closely. Performance for Version 003 was 4.511 GB/s — about 9.5% faster than Version 002. In terms of execution time per floating-point addition operation, this optimization saved about 0.5 cycles per element.

### Vector SSE

Continuing further in this direction, it is time to force the generation of packed double SSE arithmetic operations. The floating-point add unit is two 64-bit elements wide, so to fill the pipeline the four add operations really need to be *ADDPD* — packed double adds. While it may be possible to convince the compiler to generate the desired code with portable code, I decided to bite the bullet here and use some compiler extensions to get what I wanted. Version 006 (don’t worry — I have not forgotten 004 & 005) includes these declarations that the compiler interprets as packed double floating-point variables:

__m128d sum0,sum1,sum2,sum3; __m128d x0,x1,x2,x3;

Note that these variables can only be used in limited ways — primarily as sources or destinations of assignment functions or SSE intrinsic functions. For example, to set the initial values I use a compiler intrinsic:

x0 = _mm_set_pd(0.0,0.0); x1 = _mm_set_pd(0.0,0.0); x2 = _mm_set_pd(0.0,0.0); x3 = _mm_set_pd(0.0,0.0); sum0 = _mm_set_pd(0.0,0.0); sum1 = _mm_set_pd(0.0,0.0); sum2 = _mm_set_pd(0.0,0.0); sum3 = _mm_set_pd(0.0,0.0);

The inner loop of the summation is also coded with special intrinsic functions (defined in and similarly-named files):

for (i=0; i<N; i+=8) { x0 = _mm_load_pd(&a[i+0]); sum0 = _mm_add_pd(sum0,x0); x1 = _mm_load_pd(&a[i+2]); sum1 = _mm_add_pd(sum1,x1); x2 = _mm_load_pd(&a[i+4]); sum2 = _mm_add_pd(sum2,x2); x3 = _mm_load_pd(&a[i+6]); sum3 = _mm_add_pd(sum3,x3); }

The *_mm_load_pd()* intrinsic is read as “multi-media load packed double”. It expects a 16-byte aligned pointer as its argument, and returns a value into a variable declared as type __m128d. The *_mm_add_pd()* intrinsic is the “multi-media add packed double” instruction. It has two arguments of type __m128 which are added together and written back into the first argument — this behavior mimics the x86 ADDPD assembly language function. The left-hand-side of the assignment is also used for the output variable — I don’t know what happens if this does not match the first argument. *Caveat Emptor.*

The assembly code generated for this loop is exactly what I wanted:

..B1.10: addpd a(,%rax,8), %xmm3 addpd 16+a(,%rax,8), %xmm2 addpd 32+a(,%rax,8), %xmm1 addpd 48+a(,%rax,8), %xmm0 addq $8, %rax cmpq $32768000, %rax jl ..B1.10

There are a couple of ways to “unpack” packed double variables in order to perform the final summation across the partial sums. In this case the vector is very long, so the time required to perform the last couple of summations is tiny and the code does not need to be efficient. I picked the first approach that I could figure out how to code:

x0 = _mm_set_pd(0.0,0.0); x0 = _mm_add_pd(x0,sum0); x0 = _mm_add_pd(x0,sum1); x0 = _mm_add_pd(x0,sum2); x0 = _mm_add_pd(x0,sum3); _mm_storel_pd(&temp1, x0); _mm_storeh_pd(&temp2, x0); sum = temp1 + temp2;

This code clears a packed double variable, then adds the four (packed double) partial sums to generate a final pair of partial sums in the upper and lower halves of x0. The *_mm_storel_pd()* intrinsic stores the 64-bit double in the “low” half of x0 into the memory location pointed to by the first argument, while *_mm_storeh_pd()* stores the 64-bit double in the “high” half of x0 into the memory location pointed to by its first argument. These two doubles are then added together to build the final sum value.

The performance improvement provided by optimizing the inner loop was bigger than I expected — Version 006 delivered 5.246 GB/s — a full 27% faster than Version 002 (naive code compiled at “-O2”) and 16% faster than Version 004 (4 scalar partial sums). This optimization saved an addition 0.72 cycles per element relative to the scalar SSE Version 004. On the down side, this is still only about 41% of the peak memory bandwidth available to each processor chip, so there is a long way to go.

Next time — all about prefetching….

Posted in Computer Hardware | 1 Comment »