John McCalpin's blog

Dr. Bandwidth explains all….

Optimizing AMD Opteron Memory Bandwidth, Part 5: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 11th November 2010

Single Thread, Read Only Results Comparison Across Systems

In Part 1, Part 2, Part 3, and Part 4, I reviewed performance issues for a single-thread program executing a long vector sum-reduction — a single-array read-only computational kernel — on a 2-socket system with a pair of AMD Family10h Opteron Revision C2 (“Shanghai”) quad-core processors. In today’s post, I will present the results for the same set of 15 implementations run on four additional systems.

Test Systems

  1. 2-socket AMD Family10h Opteron Revision C2 (“Shanghai”), 2.9 GHz quad-core, dual-channel DDR2/800 per socket. (This is the reference system.)
  2. 2-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
  3. 4-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
  4. 4-socket AMD Family10h Opteron 6174, Revision E0 (“Magny-Cours”), 2.2 GHz twelve-core, four-channel DDR3/1333 per socket.
  5. 1-socket AMD PhenomII 555, Revision C2, 3.2 GHz dual-core, dual-channel DDR3/1333

All systems were running TACC’s customized Linux kernel, except for the PhenomII, which was running Fedora 13. The same set of binaries, generated by the Intel version 11.1 C compiler, was used in all cases.

The source code, scripts, and results are all available in a tar file: ReadOnly_2010-11-12.tar.bz2

Results

| Code Version | Notes | Vector SSE | Large Page | SW Prefetch | 4 KiB pages accessed | Ref System (2p Shanghai) | 2-socket Istanbul | 4-socket Istanbul | 4-socket Magny-Cours | 1-socket PhenomII |
|---|---|---|---|---|---|---|---|---|---|---|
| Version 001 | "-O1" | | | | 1 | 3.401 GB/s | 3.167 GB/s | 4.311 GB/s | 3.734 GB/s | 4.586 GB/s |
| Version 002 | "-O2" | | | | 1 | 4.122 GB/s | 4.035 GB/s | 5.719 GB/s | 5.120 GB/s | 5.688 GB/s |
| Version 003 | 8 partial sums | | | | 1 | 4.512 GB/s | 4.373 GB/s | 5.946 GB/s | 5.476 GB/s | 6.207 GB/s |
| Version 004 | add SW prefetch | | | Y | 1 | 6.083 GB/s | 5.732 GB/s | 6.489 GB/s | 6.389 GB/s | 7.571 GB/s |
| Version 005 | add vector SSE | Y | | Y | 1 | 6.091 GB/s | 5.765 GB/s | 6.600 GB/s | 6.398 GB/s | 7.580 GB/s |
| Version 006 | remove prefetch | Y | | | 1 | 5.247 GB/s | 5.159 GB/s | 6.787 GB/s | 6.403 GB/s | 6.976 GB/s |
| Version 007 | add large pages | Y | Y | | 1 | 5.392 GB/s | 5.234 GB/s | 7.149 GB/s | 6.653 GB/s | 7.117 GB/s |
| Version 008 | split into triply-nested loop | Y | Y | | 1 | 4.918 GB/s | 4.914 GB/s | 6.661 GB/s | 6.180 GB/s | 6.616 GB/s |
| Version 009 | add SW prefetch | Y | Y | Y | 1 | 6.173 GB/s | 5.901 GB/s | 6.646 GB/s | 6.568 GB/s | 7.736 GB/s |
| Version 010 | multiple pages/loop | Y | Y | Y | 2 | 6.417 GB/s | 6.174 GB/s | 7.569 GB/s | 6.895 GB/s | 7.913 GB/s |
| Version 011 | multiple pages/loop | Y | Y | Y | 4 | 7.063 GB/s | 6.804 GB/s | 8.319 GB/s | 7.245 GB/s | 8.583 GB/s |
| Version 012 | multiple pages/loop | Y | Y | Y | 8 | 7.260 GB/s | 6.960 GB/s | 8.378 GB/s | 7.205 GB/s | 8.642 GB/s |
| Version 013 | Version 010 minus SW prefetch | Y | Y | | 2 | 5.864 GB/s | 6.009 GB/s | 7.667 GB/s | 6.676 GB/s | 7.469 GB/s |
| Version 014 | Version 011 minus SW prefetch | Y | Y | | 4 | 6.743 GB/s | 6.483 GB/s | 8.136 GB/s | 6.946 GB/s | 8.291 GB/s |
| Version 015 | Version 012 minus SW prefetch | Y | Y | | 8 | 6.978 GB/s | 6.578 GB/s | 8.112 GB/s | 6.937 GB/s | 8.463 GB/s |

Comments

There are lots of results in the table above, and I freely admit that I don’t understand all of the details. There are a couple of important patterns in the data that are instructive….

  • For the most part, the 2p Istanbul results are slightly slower than the 2p Shanghai results. This is exactly what is expected given the slightly better memory latency of the Shanghai system (74 ns vs 78 ns). The effective concurrency (Measured Bandwidth * Idle Latency) is almost identical across all fifteen implementations.
  • The 4-socket Istanbul system gets a large boost in performance from the activation of the “HT Assist” feature — AMD’s implementation of what are typically referred to as “probe filters”. By tracking potentially modified cache lines, this feature allows a reduction in memory latency for the common case of data that is not modified in other caches. The local memory latency on the 4p Istanbul box is about 54 ns, compared to 78 ns on the 2p Istanbul box (where the “HT Assist” feature is not activated by default). The performance boost seen is not as large as the latency ratio, but the improvements are still large.
  • This is my first set of microbenchmark measurements on a “Magny-Cours” system, so there are probably some details that I need to learn about. Idle memory latency on the system is 56.4 ns — slightly higher than on the 4p Istanbul system (as is expected with the slower processor cores: 2.2 GHz vs 2.6 GHz), but the bandwidth reduction relative to the 4p Istanbul box is larger than the straight latency ratio would predict. Overall, however, the performance profile of the Magny-Cours is similar to that of the 4p Istanbul box, but with slightly lower effective concurrency in most of the code versions tested here. Note that the Magny-Cours system is configured with much faster DRAM: DDR3/1333 compared to DDR2/800. The similarity of the results strongly supports the hypothesis that sustained bandwidth is controlled by concurrency when running a single thread.
  • The best performance is provided by the cheapest box — a single-socket desktop system. This is not surprising given the low memory latency on the single socket system.

Several of the comments above refer to the “Effective Concurrency”, which I compute as the product of the measured Bandwidth and the idle memory Latency (see my earlier post for some example data). For the test cases and systems mentioned above, the effective concurrency (measured in cache lines) is presented below:

[Figure: Computed “Effective Concurrency” (= memory latency * measured bandwidth) for all 15 versions of the ReadOnly code on five test systems.]
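As a worked example of this calculation: the best result on the reference system (Version 012 at 7.260 GB/s) combined with the 74 ns idle latency gives 7.260 GB/s * 74 ns = 537 Bytes, or about 8.4 cache lines. That is slightly more than the 8 outstanding misses a single core can generate on its own, which is consistent with the memory controller prefetcher reducing the effective latency below the idle value.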

Posted in Computer Hardware | 5 Comments »

Optimizing AMD Opteron Memory Bandwidth, Part 4: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 9th November 2010

Following up on Part 1, Part 2, and Part 3, it is time to get into the ugly stuff — trying to control DRAM bank and rank access patterns and working to improve the effectiveness of the memory controller prefetcher.

Background: Banks and Ranks

The DRAM installed in the system under test consists of two dual-rank 2 GiB DIMMs in each channel of each chip. Each “rank” is composed of 9 DRAM chips, each with 1 Gbit capacity and each driving 8 bits of the 72-bit output of the DIMM (64 bits data + 8 bits ECC). Each of these 1 Gbit DRAM chips is divided into 8 “banks” of 128 Mbits each, and each of these banks has a 1 KiB “page size”. This “DRAM page size” is unrelated to the “virtual memory page size” discussed in Part 3, but it is easy to get confused! The DRAM page size defines the amount of information transferred from the DRAM array into the “open page” buffer amps in each DRAM bank as part of the two-step (row/column) addressing used to access the DRAM memory. In the system under test, the DRAM page size is thus 8 KiB (8 data DRAM chips * 1 KiB/DRAM chip), with contiguous cache lines distributed between the two DRAM channels (using a 6-bit hash function that I won’t go into here). Each DRAM chip has 8 banks, so each rank maps 128 KiB of contiguous addresses (2 channels * 8 banks * 8 KiB/bank).
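For reference when reading the loop code later in this post, these sizes show up as loop-bound constants expressed in 8-Byte array elements (the loops index an array of doubles). The macro names match the listings below, but these particular definitions are my reconstruction, consistent with the loop bounds in the text and the 4096-Byte offsets in the assembly listings, not a quote from the posted source:

    #define PAGESIZE    512    /* 4 KiB   = 512 doubles: one virtual memory (and DRAM-bank-friendly) page */
    #define RANKSIZE  16384    /* 128 KiB = 16384 doubles: one DRAM rank                                   */
                               /*           (2 channels * 8 banks * 8 KiB/bank)                            */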

Why does this matter?

  1. Every time a reference is made to a new DRAM page, the full 8 KiB is transferred from the DRAM array to the DRAM sense amps. This uses a fair amount of power, and it makes sense to try to read the entire 8 KiB while it is in the sense amps. Although it is beyond the scope of today’s discussion, reading data from “open pages” uses only about 1/4 to 1/5 of the power required to read the same amount of data in “closed page” mode, where only one cache line is retrieved from each DRAM page.
  2. Data in the sense amps can be accessed at significantly lower latency — this is called “open page” access (or a “page hit”), and the latency is referred to as the “CAS latency”. On most systems the CAS latency is between 12.5 ns and 15 ns, independent of the frequency of the DRAM bus. If the data is not in the sense amps (a “page miss”), loading data from the array to the sense amps takes an additional latency of 12.5 to 15 ns. If the sense amps were holding other valid data, it would be necessary to write that data back to the array, taking an additional 12.5 to 15 ns. If the DRAM latency is not completely overlapped with the cache coherence latency, these increases will reduce the sustainable bandwidth according to Little’s Law: Bandwidth = Concurrency / Latency
  3. The DRAM bus is a shared bus with multiple transmitters and multiple receivers — five of each in the system under test: the memory controller and four DRAM ranks. Every time the device driving the bus needs to be switched, the bus must be left idle for a short period of time to ensure that the receivers can synchronize with the next driver. When the memory controller switches from reading to writing this is called a “read/write turnaround”. When the memory controller switches from writing to reading this is called a “write/read turnaround”. When the memory controller switches from reading from one rank to reading from a different rank this is called a “chip select turnaround” or a “chip select stall”, or sometimes a “read-to-read” stall or delay or turnaround. These idle periods depend on the electrical properties of the bus, including the length of the traces on the motherboard, the number of DIMM sockets, and the number and type of DIMMs installed. The idle periods are only very weakly dependent on the bus frequency, so as the bus gets faster and the transfer time of a cache line gets faster, these delays become proportionately more expensive. It is common for the “chip select stall” period to be as large as the cache line transfer time, meaning that a code that performs consecutive reads from different ranks will only be able to use about 50% of the DRAM bandwidth.
  4. Although it could have been covered in the previous post on prefetching, the memory controller prefetcher on AMD Family10h Opteron systems appears to only look for one address stream in each 4 KiB region. This suggests that interleaving fetches from different 4 KiB pages might allow the memory controller prefetcher to produce more outstanding prefetches. The extra loops that I introduce to allow control of DRAM rank access are also convenient for allowing interleaving of fetches from different 4 KiB pages.

Experiments

Starting with Version 007 (packed SSE arithmetic, large pages, no software prefetching) at 5.392 GB/s, I split the single loop over the array into a set of three loops — the innermost loop over the 64 cache lines in a 4 KiB page, a middle loop for the 32 4 KiB pages in a 128 KiB DRAM rank, and an outer loop for the (many) 128 KiB DRAM rank ranges in the full array. The resulting Version 008 loop structure looks like:

            for (k=0; k<N; k+=RANKSIZE) {                 // outer loop: one 128 KiB DRAM rank range per iteration
                for (j=k; j<k+RANKSIZE; j+=PAGESIZE) {    // middle loop: the 32 4 KiB pages in the rank
                    for (i=j; i<j+PAGESIZE; i+=8) {       // inner loop: the 64 cache lines (8 doubles each) in the page
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                    }
                }
            }

The resulting inner loop looks the same as before — but with the much shorter iteration count of 64 instead of 4,096,000:

..B1.30:
        addpd     (%r14,%rsi,8), %xmm3 
        addpd     16(%r14,%rsi,8), %xmm2 
        addpd     32(%r14,%rsi,8), %xmm1
        addpd     48(%r14,%rsi,8), %xmm0 
        addq      $8, %rsi 
        cmpq      %rcx, %rsi  
        jl        ..B1.30

The overall performance of Version 008 drops almost 9% compared to Version 007 — 4.918 GB/s vs 5.392 GB/s — for reasons that are unclear to me.

Interleaving Fetches Across Multiple 4 KiB Pages

The first set of optimizations based on Version 008 will be to interleave the accesses so that multiple 4 KiB pages are accessed concurrently. This is implemented by a technique that compiler writers call “unroll and jam”. I unroll the middle loop (the one that covers the 32 4 KiB pages in a rank) and interleave (“jam”) the iterations. This is done once for Version 013 (i.e., concurrently accessing two 4 KiB pages), three times for Version 014 (i.e., concurrently accessing four 4 KiB pages), and seven times for Version 015 (i.e., concurrently accessing eight 4 KiB pages).
To keep the listing short, I will just show the inner loop structure of the first of these — Version 013:

            for (k=0; k<N; k+=RANKSIZE) {                   // outer loop: one 128 KiB DRAM rank range per iteration
                for (j=k; j<k+RANKSIZE; j+=2*PAGESIZE) {    // middle loop: two 4 KiB pages per iteration ("unroll and jam")
                    for (i=j; i<j+PAGESIZE; i+=8) {         // inner loop: one cache line from each of the two pages
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                        x0 = _mm_load_pd(&a[i+PAGESIZE+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+PAGESIZE+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+PAGESIZE+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+PAGESIZE+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                    }
                }
            }

The assembly code for the inner loop looks like what one would expect:

..B1.30:
        addpd     (%r15,%r11,8), %xmm3  
        addpd     16(%r15,%r11,8), %xmm2 
        addpd     32(%r15,%r11,8), %xmm1
        addpd     48(%r15,%r11,8), %xmm0 
        addpd     4096(%r15,%r11,8), %xmm3 
        addpd     4112(%r15,%r11,8), %xmm2
        addpd     4128(%r15,%r11,8), %xmm1 
        addpd     4144(%r15,%r11,8), %xmm0  
        addq      $8, %r11
        cmpq      %r10, %r11 
        jl        ..B1.30

Performance for these three versions improves dramatically as the number of 4 KiB pages accessed increases:

  • Version 008: one 4 KiB page accessed: 4.918 GB/s
  • Version 013: two 4 KiB pages accessed: 5.864 GB/s
  • Version 014: four 4 KiB pages accessed: 6.743 GB/s
  • Version 015: eight 4 KiB pages accessed: 6.978 GB/s

Going one step further, to 16 pages accessed in the inner loop, causes a sharp drop in performance — down to 5.314 GB/s.

So this idea of accessing multiple 4 KiB pages seems to have significant merit for improving single-thread read bandwidth. Next we see if explicit prefetching can push performance even higher.

Multiple 4 KiB Pages Plus Explicit Software Prefetching

A number of new versions were produced:

  • Version 008 (single page accessed with triple loops) + SW prefetch -> Version 009
  • Version 013 (two pages accessed) + SW prefetch -> Version 010 (a sketch of the resulting inner loop is shown below)
  • Version 014 (four pages accessed) + SW prefetch -> Version 011
  • Version 015 (eight pages accessed) + SW prefetch -> Version 012
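For concreteness, the inner loop of Version 010 presumably combines the two-page loop of Version 013 with one software prefetch per 4 KiB page stream, along the lines of the sketch below. This is my reconstruction, not a listing from the tar file linked in the post above; AHEAD is the prefetch distance in 8-Byte elements, as in Version 004:

            for (k=0; k<N; k+=RANKSIZE) {
                for (j=k; j<k+RANKSIZE; j+=2*PAGESIZE) {
                    for (i=j; i<j+PAGESIZE; i+=8) {
                        _mm_prefetch((char *)&a[i+AHEAD],_MM_HINT_T0);          // prefetch stream for the first page
                        _mm_prefetch((char *)&a[i+PAGESIZE+AHEAD],_MM_HINT_T0); // prefetch stream for the second page
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                        // ... loads and adds for &a[i+PAGESIZE+0] through &a[i+PAGESIZE+6],
                        //     exactly as in the Version 013 loop shown earlier ...
                    }
                }
            }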

In each case the prefetch was set AHEAD of the current pointer by a distance of 0 to 1024 8-Byte elements and all results were tabulated. Unlike the initial tests with SW prefetching, the results here are more variable and less easy to understand.
Starting with Version 009, the addition of SW prefetching restores the performance loss introduced by the triple-loop structure and provides a strong additional boost.

| loop structure | no SW prefetch | with SW prefetch |
|---|---|---|
| single loop | Version 007: 5.392 GB/s | Version 005: 6.091 GB/s |
| triple loop | Version 008: 4.917 GB/s | Version 009: 6.173 GB/s |

All of these are large page results except Version 005. The slight increase from Version 005 to Version 009 is consistent with the improvements seen in adding large pages from Version 006 to Version 007, so it looks like adding the SW prefetch negates the performance loss introduced by the triple-loop structure.

Combining SW prefetching with interleaved 4 KiB page access produces some intriguing patterns.
Version 010 — fetching two 4 KiB pages concurrently:
[Figure: Version 010 bandwidth as a function of the software prefetch AHEAD distance.]

Version 011 — fetching four 4 KiB pages concurrently:
[Figure: Version 011 bandwidth as a function of the software prefetch AHEAD distance.]

Version 012 — fetching eight 4 KiB pages concurrently:
[Figure: Version 012 bandwidth as a function of the software prefetch AHEAD distance.]

When reviewing these figures, keep in mind that the baseline performance number is the 6.173 GB/s from Version 009, so all of the results are good. What is most unusual about these results (speaking from ~25 years of experience in performance analysis) is that the last two figures show some fairly strong upward spikes in performance. It is entirely commonplace to see decreases in performance due to alignment issues, but it is quite rare to see increases in performance that depend sensitively on alignment yet remain repeatable and usable.

Choosing a range of optimum AHEAD distances for each case allows me to summarize:

| Pages Accessed in Inner Loop | no SW prefetch | with SW prefetch |
|---|---|---|
| 1 | Version 007: 5.392 GB/s | Version 009: 6.173 GB/s |
| 2 | Version 013: 5.864 GB/s | Version 010: 6.417 GB/s |
| 4 | Version 014: 6.743 GB/s | Version 011: 7.063 GB/s |
| 8 | Version 015: 6.978 GB/s | Version 012: 7.260 GB/s |

It looks like this is about as far as I can go with a single thread. Performance started at 3.401 GB/s and gradually increased to 7.260 GB/s — an increase of 113%. The resulting code is not terribly long, but it is important to remember that I only implemented the case that starts on a 128 KiB boundary (which is necessary to ensure that the accesses to multiple 4 KiB pages are all in the same rank). Extra prolog and epilog code will be required for a general-purpose sum reduction routine.
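As an illustration of what such a wrapper might look like, here is my own sketch, under the assumption that the array is at least 8-Byte aligned; sum_rank_kernel() is a hypothetical name standing in for the optimized Version 012-style kernel, which requires a 128 KiB-aligned starting address and a whole number of ranks:

    #include <stddef.h>
    #include <stdint.h>

    #define RANKBYTES (128*1024)    /* one DRAM rank = 128 KiB of contiguous addresses */

    double sum_rank_kernel(const double *p, size_t nelem);   /* hypothetical optimized kernel */

    static double sum_simple(const double *p, size_t nelem)  /* plain loop for the ragged ends */
    {
        double s = 0.0;
        for (size_t i = 0; i < nelem; i++) s += p[i];
        return s;
    }

    double sum_general(const double *a, size_t n)
    {
        /* elements before the first 128 KiB boundary */
        size_t prolog = ((RANKBYTES - ((uintptr_t)a % RANKBYTES)) % RANKBYTES) / sizeof(double);
        if (prolog > n) prolog = n;
        /* whole ranks that the optimized kernel can handle */
        size_t body = ((n - prolog) / (RANKBYTES/sizeof(double))) * (RANKBYTES/sizeof(double));

        return sum_simple(a, prolog)                                /* unaligned head       */
             + sum_rank_kernel(a + prolog, body)                    /* whole 128 KiB ranks  */
             + sum_simple(a + prolog + body, n - prolog - body);    /* tail                 */
    }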

So why did I spend all of this time? Why not just start by running multiple threads to increase performance?

First, the original code was slow enough that even perfect scaling across four threads would barely provide enough concurrency to reach the 12.8 GB/s read bandwidth of the chip.
Second, the original code is not set up to allow control over which pages are being accessed. Using multiple threads that are accessing different ranks will significantly increase the number of chip-select stalls, and may not produce the desired level of performance. (More on this soon.)
Third, the system under test has only two DDR2/800 channels per chip — not exactly state of the art. Newer systems have two or four channels of DDR3 at 1333 or even 1600 MHz. Given similar memory latencies, Little’s Law tells us that these systems will only deliver more sustained bandwidth if they are able to maintain more outstanding cache misses. The experiments presented here should provide some insights into how to structure code to increase the memory concurrency.
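As a concrete example (assuming, purely for illustration, a similar ~74 ns local latency): a socket with two DDR3/1333 channels has a peak read bandwidth of 2 * 8 Bytes * 1333 MHz = 21.3 GB/s, so Little's Law requires 21.3 GB/s * 74 ns = about 1580 Bytes, or roughly 25 cache lines in flight, to sustain that rate. That is about three times what the 8 miss buffers of a single core can provide by themselves.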

But that does not mean that I won’t look at the performance of multi-threaded implementations — that is coming up anon…

Posted in Computer Hardware | Comments Off on Optimizing AMD Opteron Memory Bandwidth, Part 4: single-thread, read-only

Optimizing AMD Opteron Memory Bandwidth, Part 3: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 9th November 2010

Following up on Part 1 and Part 2, it is time to look at adding explicit prefetching to try to increase read bandwidth.

About Prefetching

The AMD Opteron Family10h processors have two different “hardware” prefetch mechanisms, and also allow “software” prefetch instructions. The “core prefetcher” is (as the name implies) located in the processor core, and monitors L1 Data Cache misses. (There is a similar prefetch engine for Instruction Cache misses, but that is not today’s topic.) This “core prefetcher” monitors a large number of data access streams and prefetches along detected address streams consisting of contiguous ascending or descending cache line addresses. I believe that the core prefetcher only fetches one cache line ahead of the most recent load to each stream, waiting until a cache line is returned from memory before sending out the next request, so it will not “automagically” create enough prefetches to fill the memory pipeline.
(Note that these sorts of details are hard to tease out of vendor documentation, and are subject to frequent change.) The core prefetcher is operational in all of the results presented here, with many of the code modifications intended to make it operate more effectively.
The Opteron Family10h processors also have a “Memory Controller Prefetcher” that prefetches from DRAM to special buffers in the memory controller. This can reduce the latency for subsequent memory accesses, and therefore increase the bandwidth available with a fixed level of concurrency. This part is really important, so I will repeat it:

Bandwidth = Concurrency / Latency

The concurrency is limited by the number of buffers available for outstanding cache misses (8 per core, in this case), while the latency is determined by where the data ends up getting found. In the system under test the latency is actually bounded below by the time required to probe the other caches in the system, not by the time required to obtain the data from the DRAMs.

In the first generation of AMD Opteron Family10h processors (Revisions B0 & B3), the memory controller prefetcher simply read the data from the DRAMs to a set of buffers in the memory controller. In these systems the cache coherency check was not initiated until the processor actually sent a load for a particular cache line to the memory controller. Unfortunately in this case the time required to check the other caches to see if they had a copy of the line was considerably longer than the time required to get the data from the DRAM, so getting the data from the DRAM earlier did not help at all — it just meant that the processor had to wait longer after receiving the data before it could use it.

In Revision C of the AMD Opteron Family10h processors, the memory controller prefetcher was enhanced to make “coherent” prefetches — the memory controller began the coherence transaction when it performed the prefetch. In the best case the coherence transaction will be complete by the time the load request from the processor arrives, which significantly reduces the latency observed by the processor. From the formula above, it should be clear that for a fixed level of concurrency, the only way to increase sustained bandwidth is to reduce the effective latency. In Revisions D and E of the AMD Opteron Family10h processors, this “coherent prefetcher” is retained with some additional improvements.
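To put rough numbers on this sensitivity, purely as an illustration: with the per-core concurrency pinned at 8 cache lines (512 Bytes), an effective latency of 74 ns allows at most 512 Bytes / 74 ns = 6.9 GB/s, while an effective latency of 54 ns (the local latency measured on the 4-socket Istanbul system in the Part 5 post above) would raise that bound to 512 Bytes / 54 ns = 9.5 GB/s.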

Finally we get to “Software Prefetch”. The AMD64/Intel64 instruction set includes a set of explicit prefetch instructions. There are two versions that matter, the “PREFETCH T0” instruction and the “PREFETCH NTA” instruction. The former fetches data from memory much like a load, while the latter fetches data from memory and marks it as “non-temporal” — meaning that it is unlikely to be reused. For Family10h Opterons “non-temporal” fetches go into the L1 cache but when they are chosen to be replaced in the L1 Cache, they are simply dropped (if clean) rather than being sent to the L2 Cache as “victims”. This allows the L2 Cache to be used more effectively for data that is likely to be reused. Since the array used in this benchmark is much larger than the cache, I could use the PREFETCH_NTA instruction when explicitly prefetching data. The choice of prefetch instruction does not make much difference in performance here, though it can sharply reduce performance if the code does end up reusing the data, so in these examples I use PREFETCH_T0 just as a habit.
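Both forms are issued through the same compiler intrinsic (declared in <xmmintrin.h>); only the hint argument differs:

        _mm_prefetch((char *)&a[i+AHEAD], _MM_HINT_T0);    // ordinary prefetch into the normal cache hierarchy
        _mm_prefetch((char *)&a[i+AHEAD], _MM_HINT_NTA);   // non-temporal hint: on Family10h, dropped (if clean) when evicted from L1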

Different revisions of hardware treat software prefetches differently — again it is hard to obtain this information from vendor documentation. There are several issues relating to software prefetch behavior that can influence how you want to use them:

  1. Do software prefetches cause access violations if the address is out-of-bounds?
    No they do not. This makes them easier to use since you can prefetch beyond the end of an array without worrying about access violations.
  2. Do software prefetches trigger the hardware page table walker in the event of a TLB miss?
    Yes, on Family10h Opterons. This is usually a good thing, though there is only one hardware page table walker per core. The SW prefetch will only cause a hardware table walk — if the page table entry is not found by the hardware table walker, the request will be silently dropped. This is different than a load, which will cause a trap to O/S software if the page table entry is not found by the hardware table walker (for example if the page has been swapped to disk).
  3. Do the addresses used in software prefetches trigger the hardware prefetchers like loads do?
    In earlier Opterons I think that the answer was “no”, and in Family10h Opterons I think that the answer is “yes”, but it does not matter in this particular test case.
  4. Do software prefetches combine in the cache miss buffers with load misses, or do they allocate separate buffers?
    For AMD Family10h processors the Miss Address Buffers combine all accesses to the same cache line, whether due to hardware prefetches, load misses, or software prefetches. This makes it less critical that the code avoid issuing both load misses and SW prefetches to the same cache line.

Prefetching Experiments

I repeated most of the previous experiments with explicit software prefetch instructions added. In each case I varied the distance between the current loop pointer and the target of the prefetch from 0 to 1024 8-Byte elements. The prefetch instructions were added using yet another compiler intrinsic function, executed once per loop iteration (which explains why I prefer to unroll the inner loop to handle a cache line at a time). The inner loop of Version 003 (with 4 scalar variables as partial sums) then becomes Version 004:

            for (i=0; i<N; i+=8) {
                _mm_prefetch((char *)&a[i+AHEAD],_MM_HINT_T0);   // prefetch AHEAD elements (AHEAD*8 Bytes) past the current position
                sum0 += a[i+0];
                sum1 += a[i+1];
                sum2 += a[i+2];
                sum3 += a[i+3];
                sum0 += a[i+4];
                sum1 += a[i+5];
                sum2 += a[i+6];
                sum3 += a[i+7];
            }

and the resulting assembly code for the inner loop is as expected:

..B1.10: 
        addsd     a(%rdx), %xmm3 
        addsd     8+a(%rdx), %xmm2
        addsd     16+a(%rdx), %xmm1
        addsd     24+a(%rdx), %xmm0
        addsd     32+a(%rdx), %xmm3
        addsd     40+a(%rdx), %xmm2 
        addsd     48+a(%rdx), %xmm1
        addsd     56+a(%rdx), %xmm0  
        prefetcht0 (%rax)
        addq      $64, %rax 
        addq      $64, %rdx 
        addq      $8, %rcx  
        cmpq      $32768000, %rcx 
        jl        ..B1.10

The performance of Version 004 is now dependent on the AHEAD distance, as shown in Figure 1.

[Figure 1: Single thread bandwidth (Version 004) as a function of the distance between the current pointer and the software prefetch address.]

Comparing to Version 003, the explicit software prefetching increases the bandwidth dramatically, from ~4.5 GB/s to over 6.0 GB/s. Without trying to understand the details, it is clear that it helps a great deal to prefetch at least 96 elements ahead, with 96 elements = 768 Bytes = 12 cache lines. There are nice wide ranges that show steady levels of performance, with (for example) the average of AHEAD=416 to AHEAD=448 coming to 6.083 GB/s. Given the 74 ns nominal memory latency, this corresponds to an effective concurrency of 450 Bytes, or slightly over 7.0 cache lines. Note that this is approaching the maximum value that a single thread should be able to attain unless the memory controller prefetcher is able to significantly reduce the effective memory latency.

Combining the explicit software prefetching of Version 004 with the packed double SSE code of Version 006 produces Version 005. Unfortunately the performance of Version 004 and Version 005 is essentially identical — the reduction in pipeline latency provided by the packed SSE arithmetic overlaps with the improvement in latency due to software prefetching, producing no additional gain.

Putting Data on Large Pages

The AMD Opteron Family10h processors support a standard virtual memory page size of 4 KiB, with large page sizes of 2 MiB and 1 GiB. Most versions of Linux support only one large page size, typically the 2 MiB version. I configured the system under test to reserve 512 large pages behind each chip and modified the benchmark to use these large pages. Some compilers (notably the Open64 compilers) have compile/link options to put the data on large pages, but I prefer doing it a bit more manually using shared memory segments. This has the advantage of portability across all compilers and can be switched back to the default page size by a simple change to the parameters of the shmget() call. One slightly tricky issue is that when using large pages, the size requested in the shmget() call needs to be rounded up to the nearest multiple of the large page size. The code to allocate the array on large pages looks like:

    // requires <math.h> for ceil(), and <sys/ipc.h> + <sys/shm.h> for shmget()/shmat() and SHM_HUGETLB
    i = total/(2*1024*1024);                        // how many full 2 MiB pages in "total" bytes?
    sum = ceil((double)total/(2.*1024.*1024.));     // round up to the next integer if needed
    j = (int) sum;                                  // ceil() returns a double -- convert to integer
    SEGSIZE = j*(2*1024*1024);                      // SEGSIZE is now the smallest multiple of 2 MiB needed to hold "total" bytes

    shmida = shmget(IPC_PRIVATE,SEGSIZE,IPC_CREAT|SHM_HUGETLB|0666);    // simply eliminate SHM_HUGETLB to get the default page size
    a = (double *) shmat(shmida, NULL, 0);                              // *real* code should check for error returns on both the shmget and shmat calls!

Taking the packed double SSE Version 006 and putting the data on large pages gives us Version 007. This version does not include software prefetching. Performance is improved slightly by the use of large pages, from 5.247 GB/s (Version 006) to 5.392 GB/s (Version 007) — a bit under 3% improvement. Not to worry — the main goal of using large pages is not to directly improve performance, but to allow control over which DRAM banks and ranks are being accessed.

Posted in Computer Hardware | Comments Off on Optimizing AMD Opteron Memory Bandwidth, Part 3: single-thread, read-only

Optimizing AMD Opteron Memory Bandwidth, Part 2: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 8th November 2010

In a previous entry, I started discussing the issues related to memory bandwidth for a read-only kernel on a sample AMD Opteron system. The naive implementation gave a performance of 3.393 GB/s when compiled at “-O1” (hereafter “Version 001”) and 4.145 GB/s when compiled at “-O2” (hereafter “Version 002”). Today I will see how far this single-thread performance can be enhanced.

The surprising result from the previous experiments was that the floating-point pipeline latency made visible by the dependent floating-point add operations was quite important in limiting the number of outstanding cache line fetches, and therefore constituted an important limiter in overall performance. The dependent operation latency of the floating-point pipeline in the AMD Opteron Family10h processor is 4 cycles, so four add operations must be operating concurrently to fill the pipeline.

Filling the Floating-Point Pipeline

Scalar SSE

The code was modified to produce “Version 003”, which declares four separate summation variables (sum0, sum1, sum2, sum3), and unrolls the inner loop to handle a cache line at a time:

        for (i=0; i<N; i+=8) {
            sum0 += a[i+0];
            sum1 += a[i+1];
            sum2 += a[i+2];
            sum3 += a[i+3];
            sum0 += a[i+4];
            sum1 += a[i+5];
            sum2 += a[i+6];
            sum3 += a[i+7];
        }
       sum = sum0 + sum1 + sum2 + sum3;

The modified code was compiled (again with the Intel version 11.1 compiler) at “-O2”. The assembly code for the inner loop was:

..B1.7:
        addsd     a(,%rax,8), %xmm3
        addsd     8+a(,%rax,8), %xmm2
        addsd     16+a(,%rax,8), %xmm1
        addsd     24+a(,%rax,8), %xmm0 
        addsd     32+a(,%rax,8), %xmm3  
        addsd     40+a(,%rax,8), %xmm2
        addsd     48+a(,%rax,8), %xmm1
        addsd     56+a(,%rax,8), %xmm0
        addq      $8, %rax
        cmpq      $32768000, %rax 
        jl        ..B1.7  

The assembly code follows the C source code exactly. I was a little surprised that the compiler did not combine these 8 scalar operations into 4 packed operations, but it is good to remember that compilers are unpredictable beasts, and need to be monitored closely. Performance for Version 003 was 4.511 GB/s — about 9.5% faster than Version 002. In terms of execution time per floating-point addition operation, this optimization saved about 0.5 cycles per element.

Vector SSE

Continuing further in this direction, it is time to force the generation of packed double SSE arithmetic operations. The floating-point add unit is two 64-bit elements wide, so to fill the pipeline the four add operations really need to be ADDPD — packed double adds. While it may be possible to convince the compiler to generate the desired code with portable code, I decided to bite the bullet here and use some compiler extensions to get what I wanted. Version 006 (don’t worry — I have not forgotten 004 & 005) includes these declarations that the compiler interprets as packed double floating-point variables:

    __m128d sum0,sum1,sum2,sum3;
    __m128d x0,x1,x2,x3;

Note that these variables can only be used in limited ways — primarily as sources or destinations of assignment functions or SSE intrinsic functions. For example, to set the initial values I use a compiler intrinsic:

        x0 = _mm_set_pd(0.0,0.0);
        x1 = _mm_set_pd(0.0,0.0);
        x2 = _mm_set_pd(0.0,0.0);
        x3 = _mm_set_pd(0.0,0.0);
        sum0 = _mm_set_pd(0.0,0.0);
        sum1 = _mm_set_pd(0.0,0.0);
        sum2 = _mm_set_pd(0.0,0.0);
        sum3 = _mm_set_pd(0.0,0.0);

The inner loop of the summation is also coded with special intrinsic functions (defined in <emmintrin.h> and similarly-named header files):

            for (i=0; i<N; i+=8) {
                x0 = _mm_load_pd(&a[i+0]);
                sum0 = _mm_add_pd(sum0,x0);
                x1 = _mm_load_pd(&a[i+2]);
                sum1 = _mm_add_pd(sum1,x1);
                x2 = _mm_load_pd(&a[i+4]);
                sum2 = _mm_add_pd(sum2,x2);
                x3 = _mm_load_pd(&a[i+6]);
                sum3 = _mm_add_pd(sum3,x3);
            }

The _mm_load_pd() intrinsic is read as “multi-media load packed double”. It expects a 16-byte aligned pointer as its argument, and returns a value into a variable declared as type __m128d. The _mm_add_pd() intrinsic is the “multi-media add packed double” operation. It takes two arguments of type __m128d and returns their sum, which the assignment then places in the left-hand-side variable; the compiler maps this onto the x86 ADDPD instruction, which overwrites its first (register) operand. Because the intrinsic simply returns a value, the left-hand side of the assignment does not have to match either argument; the compiler takes care of the register allocation.
The assembly code generated for this loop is exactly what I wanted:

..B1.10: 
        addpd     a(,%rax,8), %xmm3 
        addpd     16+a(,%rax,8), %xmm2 
        addpd     32+a(,%rax,8), %xmm1
        addpd     48+a(,%rax,8), %xmm0
        addq      $8, %rax 
        cmpq      $32768000, %rax  
        jl        ..B1.10

There are a couple of ways to “unpack” packed double variables in order to perform the final summation across the partial sums. In this case the vector is very long, so the time required to perform the last couple of summations is tiny and the code does not need to be efficient. I picked the first approach that I could figure out how to code:

            x0 = _mm_set_pd(0.0,0.0);
            x0 = _mm_add_pd(x0,sum0);
            x0 = _mm_add_pd(x0,sum1);
            x0 = _mm_add_pd(x0,sum2);
            x0 = _mm_add_pd(x0,sum3);
            _mm_storel_pd(&temp1, x0);
            _mm_storeh_pd(&temp2, x0);
            sum = temp1 + temp2;

This code clears a packed double variable, then adds the four (packed double) partial sums to generate a final pair of partial sums in the upper and lower halves of x0. The _mm_storel_pd() intrinsic stores the 64-bit double in the “low” half of x0 into the memory location pointed to by the first argument, while _mm_storeh_pd() stores the 64-bit double in the “high” half of x0 into the memory location pointed to by its first argument. These two doubles are then added together to build the final sum value.
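For what it is worth, a slightly more compact alternative (my own sketch, using only SSE2 intrinsics from <emmintrin.h>) performs the final reduction entirely in registers:

            __m128d t  = _mm_add_pd(_mm_add_pd(sum0,sum1), _mm_add_pd(sum2,sum3));
            __m128d hi = _mm_unpackhi_pd(t,t);         // copy the high 64-bit half into the low half
            sum = _mm_cvtsd_f64(_mm_add_sd(t,hi));     // add the two halves and extract the low double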
The performance improvement provided by optimizing the inner loop was bigger than I expected — Version 006 delivered 5.246 GB/s — a full 27% faster than Version 002 (naive code compiled at “-O2”) and 16% faster than Version 003 (4 scalar partial sums). This optimization saved an additional 0.72 cycles per element relative to the scalar SSE Version 003. On the down side, this is still only about 41% of the peak memory bandwidth available to each processor chip, so there is a long way to go.

Next time — all about prefetching….

Posted in Computer Hardware | 1 Comment »

Optimizing AMD Opteron Memory Bandwidth, Part 1: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 3rd November 2010

Optimizing AMD Opteron Memory Bandwidth, Part 1: Single-Thread, Read-Only

Background

The memory hierarchy on modern computers is complex almost beyond belief.  For well over 20 years, I have been working on the subject of memory bandwidth in computer systems and despite this experience (including 12 years on the microprocessor & system design teams at SGI, IBM, and AMD) the complexity might be beyond what my brain can handle.   But since I enjoy a challenge, I am embarking on a set of posts to see if I can explain the memory bandwidth of a modern system and use my understanding of the details to create an implementation with superior performance.

System Under Test

For this set of posts the primary system under test is a Dell PowerEdge M605 blade with two quad-core AMD Opteron model 2389 processors. Each processor chip has two channels of DDR2/800 DRAM, with two 2 GiB dual-rank DIMMs installed on each channel. 2 sockets * 2 channels/socket * 2 DIMMs/channel * 2 GiB/DIMM = 16 GiB total memory installed on the blade.
The peak memory bandwidth of each channel is 8 Bytes * 800 MHz = 6.4 GB/s, giving a peak memory bandwidth of 12.8 GB/s per socket and 25.6 GB/s for the blade.

Choice of Test Kernel

I will begin with what appears to be a ridiculously simple example — the summation of all the elements of a single contiguous array of 64-bit numbers stored in local memory using a single thread of execution.  By the time I am finished, I suspect you will agree that this simple start was a good choice….

In pseudo-code, one might write the basic kernel as:

sum = 0.0;
for (i=0; i<N; i++) sum += array[i];

The execution time of the computation is measured, and the data transfer rate is computed in MB/s.  Note that MB/s is 10^6 Bytes per second, which gives numerical values almost 5% higher than would be obtained if I were computing transfer rate in MiB/s (=2^20 Bytes per second).

The system under test makes use of a 64KiB Data Cache, a 512 KiB unified Level 2 cache, and a 6144 KiB shared Level 3 cache, for a total of 6720 KiB of cache.   Since I am interested in bandwidth from DRAM, the test uses N=32,768,000, which corresponds to an array size of 256,000 KiB — slightly over 38 times the total cache size.   The kernel is repeated 100 times to “flush” the cache, and the average bandwidth is computed and presented.

A Sequence of Implementations

Implementation 1: Simple with Serial Compilation

The following source code was compiled with the Intel version 11.1 C compiler, using the commands:

icc -O2 ReadOnly.c -S
as ReadOnly.s -o ReadOnly.o
icc -O2 ReadOnly.o -o ReadOnly.icc.serial

Splitting the compilation up like this allows me to see the assembly code every time I compile, so I can monitor what the compiler is doing.

— ReadOnly.c —

#include <stdio.h>

#define N 32768000
#define NTIMES 100

extern double mysecond();      // a simple wall-clock timer -- appended
double a[N];                   // the data array

int main()
{
    int i,j;
    double sum;
    double t0, bw, times[NTIMES];

    for (i=0; i<NTIMES; i++) {
        times[i] = 0.0;
    }
    for (i=0; i<N; i++) {
        a[i] = 1.0;
    }

    sum = 0.0;
    for (j=0; j<NTIMES; j++) {
        t0 = mysecond();
        for (i=0; i<N; i++) {
           sum += a[i];
        }
        times[j] = mysecond()-t0;
    }
    printf("sum = %f\n",sum);
    for (i=0; i<NTIMES; i++) {
        bw = sizeof(double)*(double) N / times[i]/1e6;
        printf("iter, time, bw (MB/s) %d, %f, %f\n",i,times[i],bw);
    }
}

/* A gettimeofday routine to give access to the wall
 clock timer on most UNIX-like systems.  */

#include <sys/time.h>
double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

— End of ReadOnly.c —

Running the code under the “time” command gives output like:

sum = 3276800000.000000
iter, time, bw (MB/s) 0, 0.063371, 4136.659284
iter, time, bw (MB/s) 1, 0.063181, 4149.100482
iter, time, bw (MB/s) 2, 0.063225, 4146.205961
iter, time, bw (MB/s) 3, 0.063187, 4148.693441
iter, time, bw (MB/s) 4, 0.063210, 4147.191209
iter, time, bw (MB/s) 5, 0.063176, 4149.429305
iter, time, bw (MB/s) 6, 0.063195, 4148.176926
iter, time, bw (MB/s) 7, 0.063240, 4145.221181
iter, time, bw (MB/s) 8, 0.063204, 4147.582311
           [...]
iter, time, bw (MB/s) 94, 0.063249, 4144.643036
iter, time, bw (MB/s) 95, 0.063239, 4145.283693
iter, time, bw (MB/s) 96, 0.063278, 4142.737862
iter, time, bw (MB/s) 97, 0.063261, 4143.846398
iter, time, bw (MB/s) 98, 0.063239, 4145.283693
iter, time, bw (MB/s) 99, 0.063240, 4145.236809
real    0m6.519s
user    0m6.412s
sys    0m0.105s

It is important to save and use the times for each iteration so that the compiler will actually execute them. It is also helpful to have a quick visual feedback on the iteration-to-iteration variability of the memory bandwidth, which is clearly small here.

So the system under test delivers a very steady 4.145 GB/s using this version of the code. This is only 32% of the peak memory bandwidth of 12.8 GB/s for the socket, which is an uninspiring result. Don’t worry — it will get a lot better before I am through!

Analysis of Implementation 1

So why does this sample program deliver such a small fraction of the peak memory bandwidth of the node?
Instead of looking at all the possible performance limiters (most of which we will get to in due time), I will cut to the chase and give you the answer:
The performance limit here is directly due to the limited number of outstanding cache misses available to a single core.
The relevant formula is often referred to as “Little’s Law”, which in this case reduces to the simple statement that

 Latency * Bandwidth = Concurrency

where Latency is the time required to load a cache line from memory (about 74 ns on the system under test, as I reported earlier), Bandwidth is the 12.8 GB/s peak transfer rate of the DRAM on one processor chip, and Concurrency is the quantity of data that must be “in flight” in order to “fill the pipeline” or “tolerate the latency”. For the system under test, the required concurrency is 74 ns * 12.8 GB/s = 947 bytes = 14.8 cache lines. Unfortunately, each core in the Opteron Family10h processor only supports a maximum of 8 cache read misses.

Rearranging the formula to Bandwidth = Concurrency/Latency allows us to estimate how much bandwidth we think a processor should be able to get for a given Latency and a given Concurrency. Using 8 cache lines (512 Bytes) and 74 ns suggests that the maximum sustainable bandwidth will be about 6.92 GB/s. Our observed result of 4.145 GB/s is well below this value.  Substituting the observed bandwidth allows us to compute the effective concurrency, which is 4.145 GB/s * 74 ns = 306 Bytes = 4.8 Cache Lines.
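For readers who want to plug in their own numbers, here is a minimal stand-alone sketch of the same calculation (the constants are the ones quoted above; GB/s multiplied by ns conveniently gives Bytes):

    #include <stdio.h>

    int main(void)
    {
        double latency_ns = 74.0;   /* local memory latency reported earlier            */
        double peak_GBs   = 12.8;   /* peak DRAM bandwidth per socket (GB/s = Bytes/ns) */
        double line_bytes = 64.0;   /* cache line size                                  */
        double max_lines  = 8.0;    /* outstanding cache read misses per core           */

        printf("required concurrency: %.0f Bytes = %.1f cache lines\n",
               latency_ns * peak_GBs, latency_ns * peak_GBs / line_bytes);
        printf("bandwidth bound with %.0f lines in flight: %.2f GB/s\n",
               max_lines, max_lines * line_bytes / latency_ns);
        return 0;
    }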

Some insight into the limited concurrency is available by re-compiling the code at optimization level “-O1”, which reduces the performance to 3.393 GB/s, corresponding to an effective concurrency of 251 Bytes or 3.9 cache lines.

The assembly code for the innermost loop at “-O1” is:

..B1.8:   
        addsd     a(,%rax,8), %xmm2  
        incq      %rax       
        cmpq      $32768000, %rax  
        jl        ..B1.8

while the assembly code for the same loop at “-O2” is:

..B1.7: 
        addpd     a(,%rax,8), %xmm0     
        addq      $2, %rax 
        cmpq      $32768000, %rax 
        jl        ..B1.7

In the first case the use of the “addsd” (Add Scalar, Double Precision) instruction shows that the compiler is using a single summation variable, while in the second case, the “addpd” (Add Packed, Double Precision) shows that the compiler is using two summation variables — the upper and lower halves of a “packed double” SSE register. Because of the data dependence on the sequence of summations, the code at “-O1” experiences 32,768,000 pipeline stalls (one per addition), while the code at “-O2” experiences 16,384,001 pipeline stalls — half as many (plus one at the end to add the two partial sums together). The floating-point add instructions used here have a dependent operation latency of four cycles. Some of this is overlapped with the pointer update, compare, and branch, but not all of it. The results at “-O1” correspond to about 6.84 CPU cycles per element, while the results at “-O2” correspond to about 5.60 CPU cycles per element, a difference of 1.24 cycles per element.
The surprising (and important) result here is that these extra floating point pipeline latencies are not overlapped with the memory latencies — after all a few extra stall cycles in the core should be negligible compared to the ~215 cycles of memory latency (74 ns * 2.9 GHz). The problem is that these floating-point pipeline stalls are delaying the execution of the subsequent memory load references that are necessary to allow additional hardware prefetches to be issued from the core to the memory system.

In my next entry, I will show how software prefetch instructions can directly increase the number of outstanding cache misses, and how explicitly coding for more partial sum variables can indirectly allow more outstanding prefetches by eliminating the floating-point pipeline stalls and allowing subsequent memory references to be issued more quickly….

Posted in Computer Hardware | 6 Comments »

Opteron/PhenomII STREAM Bandwidth vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In a recent post (link) I showed that local memory latency is weakly dependent on processor and DRAM frequency in a single-socket Phenom II system. Here are some preliminary results on memory bandwidth as a function of CPU frequency and DRAM frequency in the same system.

This table does not include the lowest CPU frequency (0.8 GHz) or the lowest DRAM frequency (DDR3/800) since these results tend to be relatively poor and I prefer to look at big numbers. 🙂

Single-Thread STREAM Triad Bandwidth (MB/s)

| Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066 |
|---|---|---|---|
| 3.2 GHz | 10,101 MB/s | 9,813 MB/s | 9,081 MB/s |
| 2.5 GHz | 10,077 MB/s | 9,784 MB/s | 9,053 MB/s |
| 2.1 GHz | 10,075 MB/s | 9,783 MB/s | 9,051 MB/s |

Two-Thread STREAM Triad Bandwidth (MB/s)

| Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066 |
|---|---|---|---|
| 3.2 GHz | 13,683 MB/s | 12,851 MB/s | 11,490 MB/s |
| 2.5 GHz | 13,337 MB/s | 12,546 MB/s | 11,252 MB/s |
| 2.1 GHz | 13,132 MB/s | 12,367 MB/s | 11,112 MB/s |

So from these results it is clear that the sustainable memory bandwidth as measured by the STREAM benchmark is very weakly dependent on the CPU frequency, but moderately strongly dependent on the DRAM frequency. It is clear that a single thread is not enough to drive the memory system to maximum performance, but the modest speedup from 1 thread to 2 threads suggests that more threads will not be very helpful. More details to follow, along with some attempts to optimize the STREAM benchmark for better sustained performance.

Posted in Computer Hardware, Reference | 4 Comments »

Welcome to Dr. Bandwidth’s Blog

Posted by John D. McCalpin, Ph.D. on 1st October 2010

Welcome to the University of Texas blog of John D. McCalpin, PhD — aka Dr. Bandwidth.

This is the beginning of a series of posts on the exciting topic of memory bandwidth in computer systems! I hope these posts will serve as a reference for folks interested in the increasingly complex issues regarding memory bandwidth — it seems silly for me to write all this stuff down for my own use when I suspect that there is at least one person out there who may find this information useful.

Although I currently work at the Texas Advanced Computing Center of the University of Texas at Austin, my professional career has been split about 50-50 between academia and the computer hardware industry. These blog posts will primarily deal with concrete information about real systems that (in my experience) does not fit well with the priorities of most academic publications.  Most journals or conferences require some level of “research” or “novelty”, while most of these notes are more explanatory in nature.   That is not to say that none of this will ever be published, but much of the material to be posted here is valuable for other reasons.

My interest in memory bandwidth extends back to the late 1980’s, leading to the development of the STREAM Benchmark which I have maintained since 1991.   Here is what I have to say about STREAM:


What is STREAM?

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.


Why should I care?

Computer CPUs are getting faster much more quickly than computer memory systems. As this progresses, more and more programs will be limited in performance by the memory bandwidth of the system, rather than by the computational performance of the CPU.

As an extreme example, most current high-end machines run simple arithmetic kernels for out-of-cache operands at 1-2% of their rated peak speeds — that means that they are spending 98-99% of their time idle and waiting for cache misses to be satisfied.

The STREAM benchmark is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are (presumably) more indicative of the performance of very large, vector style applications.
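For reference, the four STREAM kernels are simple single-loop vector operations over long arrays; schematically (see the official STREAM source for the exact definitions):

    for (j=0; j<N; j++) c[j] = a[j];                 /* Copy  */
    for (j=0; j<N; j++) b[j] = scalar*c[j];          /* Scale */
    for (j=0; j<N; j++) c[j] = a[j]+b[j];            /* Add   */
    for (j=0; j<N; j++) a[j] = b[j]+scalar*c[j];     /* Triad */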


Posted in Reference | Comments Off on Welcome to Dr. Bandwidth’s Blog