In a previous entry, I started discussing the issues related to memory bandwidth for a read-only kernel on a sample AMD Opteron system. The naive implementation gave a performance of 3.393 GB/s when compiled at “-O1” (hereafter “Version 001”) and 4.145 GB/s when compiled at “-O2” (hereafter “Version 002”). Today I will see how far this single-thread performance can be enhanced.
The surprising result from the previous experiments was that the floating-point pipeline latency, exposed by the chain of dependent floating-point add operations, was quite important in limiting the number of outstanding cache line fetches, and therefore constituted an important limiter on overall performance. The dependent-operation latency of the floating-point add pipeline in the AMD Opteron Family10h processor is 4 cycles, so four add operations must be in flight concurrently to fill the pipeline.
Filling the Floating-Point Pipeline
Scalar SSE
The code was modified to produce “Version 003”, which declares four separate summation variables (sum0, sum1, sum2, sum3), and unrolls the inner loop to handle a cache line at a time:
    for (i=0; i<N; i+=8) {
        sum0 += a[i+0];
        sum1 += a[i+1];
        sum2 += a[i+2];
        sum3 += a[i+3];
        sum0 += a[i+4];
        sum1 += a[i+5];
        sum2 += a[i+6];
        sum3 += a[i+7];
    }
    sum = sum0 + sum1 + sum2 + sum3;
The modified code was compiled (again with the Intel version 11.1 compiler) at “-O2”. The assembly code for the inner loop was:
    ..B1.7:
        addsd     a(,%rax,8), %xmm3
        addsd     8+a(,%rax,8), %xmm2
        addsd     16+a(,%rax,8), %xmm1
        addsd     24+a(,%rax,8), %xmm0
        addsd     32+a(,%rax,8), %xmm3
        addsd     40+a(,%rax,8), %xmm2
        addsd     48+a(,%rax,8), %xmm1
        addsd     56+a(,%rax,8), %xmm0
        addq      $8, %rax
        cmpq      $32768000, %rax
        jl        ..B1.7
The assembly code follows the C source code exactly. I was a little surprised that the compiler did not combine these 8 scalar operations into 4 packed operations, but it is good to remember that compilers are unpredictable beasts and need to be monitored closely. Performance for Version 003 was 4.511 GB/s — almost 9% faster than Version 002. In terms of execution time per floating-point addition operation, this optimization saved about 0.5 cycles per element.
Vector SSE
Continuing further in this direction, it is time to force the generation of packed double SSE arithmetic operations. The floating-point add unit is two 64-bit elements wide, so to fill the pipeline the four add operations really need to be ADDPD — packed double adds. While it may be possible to coax the compiler into generating the desired instructions from portable source code, I decided to bite the bullet here and use some compiler extensions to get what I wanted. Version 006 (don’t worry — I have not forgotten 004 & 005) includes these declarations that the compiler interprets as packed double floating-point variables:
    __m128d sum0, sum1, sum2, sum3;
    __m128d x0, x1, x2, x3;
Note that these variables can only be used in limited ways — primarily as sources or destinations of assignment functions or SSE intrinsic functions. For example, to set the initial values I use a compiler intrinsic:
    x0 = _mm_set_pd(0.0,0.0);
    x1 = _mm_set_pd(0.0,0.0);
    x2 = _mm_set_pd(0.0,0.0);
    x3 = _mm_set_pd(0.0,0.0);
    sum0 = _mm_set_pd(0.0,0.0);
    sum1 = _mm_set_pd(0.0,0.0);
    sum2 = _mm_set_pd(0.0,0.0);
    sum3 = _mm_set_pd(0.0,0.0);
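(As an aside of mine, not part of the original code: SSE2 also provides _mm_setzero_pd(), which zeroes both 64-bit lanes of a packed-double variable and reads a little more directly.)

    /* Equivalent zero-initialization using the SSE2 _mm_setzero_pd() intrinsic. */
    sum0 = _mm_setzero_pd();
    sum1 = _mm_setzero_pd();
    sum2 = _mm_setzero_pd();
    sum3 = _mm_setzero_pd();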
The inner loop of the summation is also coded with special intrinsic functions (defined in <emmintrin.h> and similarly-named header files):
    for (i=0; i<N; i+=8) {
        x0 = _mm_load_pd(&a[i+0]);
        sum0 = _mm_add_pd(sum0, x0);
        x1 = _mm_load_pd(&a[i+2]);
        sum1 = _mm_add_pd(sum1, x1);
        x2 = _mm_load_pd(&a[i+4]);
        sum2 = _mm_add_pd(sum2, x2);
        x3 = _mm_load_pd(&a[i+6]);
        sum3 = _mm_add_pd(sum3, x3);
    }
The _mm_load_pd() intrinsic is read as “multi-media load packed double”. It expects a 16-byte aligned pointer as its argument, and returns a value into a variable declared as type __m128d. The _mm_add_pd() intrinsic is the “multi-media add packed double” operation. It has two arguments of type __m128d which are added together and written back into the first argument — this behavior mimics the two-operand x86 ADDPD assembly language instruction. The left-hand-side of the assignment is also used for the output variable — I don’t know what happens if this does not match the first argument. Caveat Emptor.
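One detail the fragments above gloss over: because _mm_load_pd() will fault on a pointer that is not 16-byte aligned, the array itself has to be aligned. A minimal sketch of two common ways to arrange that, using a gcc/icc-style alignment attribute for a static array or _mm_malloc()/_mm_free() for a dynamic allocation (the names below are my own illustration, not from the original code):

    #include <emmintrin.h>

    #define N 32768000   /* matches the loop bound visible in the generated assembly */

    /* Static allocation: ask the compiler for 16-byte alignment. */
    static double a_static[N] __attribute__((aligned(16)));

    void example(void)
    {
        /* Dynamic allocation: _mm_malloc() returns 16-byte aligned storage
         * and must be released with _mm_free(). */
        double *a_dynamic = (double *) _mm_malloc(N * sizeof(double), 16);
        /* ... fill and sum the array ... */
        _mm_free(a_dynamic);
    }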
The assembly code generated for this loop is exactly what I wanted:
    ..B1.10:
        addpd     a(,%rax,8), %xmm3
        addpd     16+a(,%rax,8), %xmm2
        addpd     32+a(,%rax,8), %xmm1
        addpd     48+a(,%rax,8), %xmm0
        addq      $8, %rax
        cmpq      $32768000, %rax
        jl        ..B1.10
There are a couple of ways to “unpack” packed double variables in order to perform the final summation across the partial sums. In this case the vector is very long, so the time required to perform the last couple of summations is tiny and the code does not need to be efficient. I picked the first approach that I could figure out how to code:
    x0 = _mm_set_pd(0.0,0.0);
    x0 = _mm_add_pd(x0, sum0);
    x0 = _mm_add_pd(x0, sum1);
    x0 = _mm_add_pd(x0, sum2);
    x0 = _mm_add_pd(x0, sum3);
    _mm_storel_pd(&temp1, x0);
    _mm_storeh_pd(&temp2, x0);
    sum = temp1 + temp2;
This code clears a packed double variable, then adds the four (packed double) partial sums to generate a final pair of partial sums in the upper and lower halves of x0. The _mm_storel_pd() intrinsic stores the 64-bit double in the “low” half of x0 into the memory location pointed to by the first argument, while _mm_storeh_pd() stores the 64-bit double in the “high” half of x0 into the memory location pointed to by its first argument. These two doubles are then added together to build the final sum value.
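As a footnote of mine (not part of the measured code): Family10h also supports SSE3, so the same horizontal reduction could be written with the _mm_hadd_pd() intrinsic from <pmmintrin.h>, which adds the two halves of a packed-double register in one step:

    #include <pmmintrin.h>                 /* SSE3: _mm_hadd_pd() */

    /* Combine the four packed partial sums, then add the high and low
     * halves of the result with a single horizontal add. */
    x0 = _mm_add_pd(_mm_add_pd(sum0, sum1), _mm_add_pd(sum2, sum3));
    x0 = _mm_hadd_pd(x0, x0);              /* both lanes now hold low + high */
    _mm_storel_pd(&sum, x0);               /* write the final scalar sum */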
The performance improvement provided by optimizing the inner loop was bigger than I expected — Version 006 delivered 5.246 GB/s — a full 27% faster than Version 002 (naive code compiled at “-O2”) and 16% faster than Version 003 (the four scalar partial sums). This optimization saved an additional 0.72 cycles per element relative to the scalar SSE Version 003. On the down side, this is still only about 41% of the peak memory bandwidth available to each processor chip, so there is a long way to go.
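For context, the arithmetic behind that 41% figure, assuming (as I believe was the case for the system described earlier in this series) two DDR2-800 DRAM channels per chip:

    peak per chip = 2 channels * 8 bytes/transfer * 800 MT/s = 12.8 GB/s
    5.246 GB/s / 12.8 GB/s = 0.41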
Next time — all about prefetching….
This seems to be turning into a great series, thanks. A couple of comments:
An explanation for why the compiler is not producing packed instructions for v003: it cannot assume that the array is 16-byte aligned, so it would need to produce fringe code to get the main loop onto an aligned boundary. That causes some code growth, though not much for this simple problem. More importantly, any such packed implementation has to assume that floating-point addition is associative. Since the standards prohibit that assumption, and the unsafe-math compiler options are rarely used, the compiler developers do not seem to put a great deal of effort into generating packed instructions (except perhaps when the compiler is in control of alignment and the sizes are statically known). FWIW, gcc (>= 4.3, IIRC) will “vectorize” the naive loop
    for (i=0; i<N; i++) sum += a[i];
with -ftree-vectorize (enabled by default at -O3) and -ffast-math. Unfortunately, up through 4.5 (but not in 4.6, which uses addpd with a memory operand) it still does not fix up alignment, so although the math is done with addpd, the loads take two instructions (movsd + movhpd). You can get quite a lot of diagnostics about why gcc is or is not vectorizing by using -ftree-vectorizer-verbose=n, where 0 <= n <= 9.
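A hypothetical test file and compile line (an illustration added here, not from the original comment) that exercises those flags:

    /* sum.c: a minimal file for inspecting gcc's vectorizer on the naive reduction.
     * Example compile line:
     *
     *     gcc -O3 -ffast-math -ftree-vectorizer-verbose=2 -S sum.c
     *
     * -O3 enables -ftree-vectorize, -ffast-math permits the reassociation the
     * reduction requires, and -S keeps the generated assembly for inspection. */
    #define N 32768000
    double a[N];

    double sum_naive(void)
    {
        long i;
        double sum = 0.0;
        for (i = 0; i < N; i++)
            sum += a[i];
        return sum;
    }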
In any case, the fact that you change the ABI for this function when moving from v003 to v006 (the input array must now be 16-byte aligned) is a very big deal for library code.
Finally, you mention using an output variable that does not match the first argument of an intrinsic. This is not a problem, because there is no particular correspondence between the names of the __m128d variables in your code and physical registers. The compiler is still in complete control of register allocation and will spill values to memory if necessary, as well as duplicating and renaming them within the body of the function, so shoot for clarity and don't worry too much about making them match (except that if they happen to match, it is somewhat easier to line up the generated assembly with the C).