John McCalpin's blog

Dr. Bandwidth explains all….

Notes on “non-temporal” (aka “streaming”) stores

Posted by John D. McCalpin, Ph.D. on 1st January 2018

Memory systems using caches have a lot more potential flexibility than most implementations are able to exploit – you get the standard behavior all the time, even if an alternative behavior would be allowable and desirable in a specific circumstance.  One area in which many vendors have provided an alternative to the standard behavior is in “non-temporal” or “streaming” stores. These are sometimes also referred to as “cache-bypassing” or “non-allocating” stores, and even “Non-Globally-Ordered stores” in some cases.

The availability of streaming stores on various processors can often be tied to the improvements in performance that they provide on the STREAM benchmark, but there are several approaches to implementing this functionality, with some subtleties that may not be obvious.  (The underlying mechanism of write-combining was developed long before, and for other reasons, but exposing it to the user in standard cached memory space required the motivation provided by a highly visible benchmark.)

Before getting into details, it is helpful to review the standard behavior of a “typical” cached system for store operations.  Most recent systems use a “write-allocate” policy — when a store misses the cache, the cache line containing the store’s target address is read into the cache, then the parts of the line that are receiving the new data are updated.   These “write-allocate” cache policies have several advantages (mostly related to implementation correctness), but if the code overwrites all the data in the cache line, then reading the cache line from memory could be considered an unnecessary performance overhead.    It is this extra overhead of read transactions that makes the subject of streaming stores important for the STREAM benchmark.
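
As a minimal (hypothetical) illustration of the issue, consider a simple array-fill loop. With a write-allocate cache and ordinary stores, each 64-Byte line of the target array is read from memory before it is overwritten, so the loop generates roughly twice the DRAM traffic that it logically requires:

    /* Hypothetical example (not from the original post): a store-only loop.
     * With write-allocate caching, every 64-Byte line of a[] is first read
     * from memory and then overwritten, so ~2*N*8 Bytes of DRAM traffic are
     * generated even though only N*8 Bytes logically need to be written.   */
    void fill(double *a, long N)
    {
        for (long i = 0; i < N; i++)
            a[i] = 0.0;
    }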

As in many subjects, a lot of mistakes can be avoided by looking carefully at what the various terms mean — considering both similarities and differences.

  • “Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”).
    • So there is no benefit to keeping the data in the processor’s cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).
  • “Streaming store” is suggestive of stores to contiguous addresses (i.e., high “spatial locality”).
  • “Cache-bypassing store” says that at least some aspects of the transaction bypass the cache(s).
  • “Non-allocating store” says that a store that misses in a cache will not load the corresponding cache line into the cache before performing the store.
    • This implies that there is some other sort of structure to hold the data from the stores until it is sent to memory.  This may be called a “store buffer”, a “write buffer”, or a “write-combining buffer”.
  • “Non-globally-ordered store” says that the results of an NGO store might appear in a different order (relative to ordinary stores or to other NGO stores) on other processors.
    • All stores must appear to execute in program order from the perspective of the processor performing them, but they need to appear in that same order on other processors only in special cases, such as stores related to interprocessor communication and synchronization.

There are at least three issues raised by these terms:

  1. Caching: Do we want the result of the store to displace something else from the cache?
  2. Allocating: Can we tolerate the extra memory traffic incurred by reading the cache line before overwriting it?
  3. Ordering: Do the results of this store need to appear in order with respect to other stores?

Different applications may have very different requirements with respect to these issues and may be best served by different implementations.   For example, if the priority is to prevent the stored data from displacing other data in the processor cache, then it may suffice to put the data in the cache, but mark it as Least-Recently-Used, so that it will displace as little useful data as possible.   In the case of the STREAM benchmark, the usual priority is the question of allocation — we want to avoid reading the data before over-writing it.

While I am not quite apologizing for that, it is clear that STREAM over-represents stores in general, and store misses in particular, relative to the high-bandwidth applications that I intended it to represent.

To understand the benefit of non-temporal stores, you need to understand whether you are operating in (A) a concurrency-limited regime or (B) a bandwidth-limited regime. (This is not a short story, but you can learn a lot about how real systems work from studying it.)

(Note: The performance numbers below are from the original private version of this post created on 2015-05-29.   Newer systems require more concurrency — numbers will be updated when I have a few minutes to dig them up….)

(A) For a single thread you are almost always working in a concurrency-limited regime. For example, my Intel Xeon E5-2680 (Sandy Bridge EP) dual-socket systems have an idle memory latency to local memory of 79 ns and a peak DRAM bandwidth of 51.2 GB/s (per socket). If we assume that some cache miss handling buffer must be occupied for approximately one latency per transaction, then queuing theory dictates that you must maintain 79 ns * 51.2 GB/s = 4045 Bytes “in flight” at all times to “fill the memory pipeline” or “tolerate the latency”. This rounds up to 64 cache lines in flight, while a single core only supports 10 concurrent L1 Data Cache misses. In the absence of other mechanisms to move data, this limits a single thread to a read bandwidth of 10 lines * 64 Bytes/line / 79 ns = 8.1 GB/s.

Store misses are essentially the same as load misses in this analysis.
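
For concreteness, here is the same arithmetic written out in code form (a trivial sketch that just restates the numbers above: 79 ns latency, 51.2 GB/s peak bandwidth, 64-Byte cache lines, 10 L1 Data Cache miss buffers):

    /* Little's Law sketch using the numbers quoted above (assumptions:
     * 79 ns idle latency, 51.2 GB/s peak, 64-Byte lines, 10 L1 miss buffers). */
    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double latency_s = 79e-9;                             /* idle local memory latency */
        double peak_bw   = 51.2e9;                            /* peak DRAM bandwidth (B/s) */
        double line      = 64.0;                              /* cache line size in Bytes  */
        double bytes_in_flight = latency_s * peak_bw;         /* ~4045 Bytes               */
        double lines_in_flight = ceil(bytes_in_flight / line);/* rounds up to 64 lines     */
        double one_core_bw = 10.0 * line / latency_s;         /* 10 buffers -> ~8.1 GB/s   */
        printf("Bytes in flight: %.0f\n", bytes_in_flight);
        printf("Lines in flight: %.0f\n", lines_in_flight);
        printf("1-core read BW:  %.1f GB/s\n", one_core_bw / 1e9);
        return 0;
    }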

L1 hardware prefetchers don’t help performance here because they share the same 10 L1 cache miss buffers. L2 hardware prefetchers do help because bringing the data closer reduces the occupancy required by each cache miss transaction. Unfortunately, L2 hardware prefetchers are not directly controllable, so experimentation is required. The best read bandwidth that I have been able to achieve on this system is about 17.8 GB/s, which corresponds to an “effective concurrency” of about 22 cache lines or an “effective latency” of about 36 ns (for each of the 10 concurrent L1 misses).

L2 hardware prefetchers on this system are also able to perform prefetches for store miss streams, thus reducing the occupancy for store misses and increasing the store miss bandwidth.

For non-temporal stores there is no concept corresponding to “prefetch”, so you are stuck with whatever buffer occupancy the hardware gives you. Note that since the non-temporal store does not bring data *to* the processor, there is no reason for its occupancy to be tied to the memory latency. One would expect a cache miss buffer holding a non-temporal store to have a significantly lower occupancy than one holding an ordinary cache miss, since it is allocated, filled, and then simply transfers its contents out to the memory controller (which presumably has many more buffers than the 10 cache miss buffers that the core supports).

But, while non-temporal stores are expected to have a shorter buffer occupancy than that of a cache miss that goes all the way to memory, it is not at all clear whether they will have a shorter buffer occupancy than a store miss that activates an L2 hardware prefetch engine. It turns out that on the Xeon E5-2680 (Sandy Bridge EP), non-temporal stores have *higher* occupancy than store misses that activate the L2 hardware prefetcher, so using non-temporal stores slows down the performance of each of the four STREAM kernels. I don’t have all the detailed numbers in front of me, but IIRC, STREAM Triad runs at about 10 GB/s with one thread on a Xeon E5-2680 when using non-temporal stores and between 12 and 14 GB/s when *not* using non-temporal stores.

This result is not a general rule. On the Intel Xeon E3-1270 (also a Sandy Bridge core, but with the “client” L3 and memory controller), the occupancy of non-temporal stores in the L1 cache miss buffers appears to be much shorter, so there is not an occupancy-induced slowdown. On the older AMD processors (K8 and Family 10h), non-temporal stores used a set of four “write-combining registers” that were independent of the eight buffers used for L1 data cache misses. In this case the use of non-temporal stores allowed one thread to have more concurrent memory transactions, so it almost always helped performance.

(B) STREAM is typically run using as many cores as is necessary to achieve the maximum sustainable memory bandwidth. In this bandwidth-limited regime non-temporal stores play a very different role. When using all the cores there are no longer concurrency limits, but there is a big difference in bulk memory traffic. A store miss must read the target line into the L1 cache before overwriting it, while a non-temporal store avoids this (semantically unnecessary) load from memory. Thus for the STREAM Copy and Scale kernels, using cached stores results in two memory reads and one memory write, while using non-temporal stores requires only one memory read and one memory write — a 3:2 ratio. Similarly, the STREAM Add and Triad kernels transfer 4:3 as much data when using cached stores as when using non-temporal stores.
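
As a concrete sketch (my own illustration, not the actual STREAM source code), here is the Copy kernel written with ordinary cached stores and with SSE2 streaming stores; with a write-allocate cache the first version moves three cache lines of DRAM traffic for every two that the kernel logically needs, while the second moves only the two:

    /* Sketch only -- not the actual STREAM source.  Assumes a[] and c[]
     * are 16-Byte aligned and N is a multiple of 2.                      */
    #include <emmintrin.h>

    void copy_cached(double * restrict c, const double * restrict a, long N)
    {
        for (long i = 0; i < N; i++)
            c[i] = a[i];        /* store miss allocates the line of c[]:
                                   read a[] + read c[] + write c[]        */
    }

    void copy_streaming(double * restrict c, const double * restrict a, long N)
    {
        for (long i = 0; i < N; i += 2)
            _mm_stream_pd(&c[i], _mm_load_pd(&a[i]));   /* non-allocating:
                                   read a[] + write c[] only              */
        _mm_sfence();           /* drain/order the write-combining buffers
                                   before any subsequent ordinary stores  */
    }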

On most systems the reported performance ratios for STREAM (using all processors) with and without non-temporal stores are very close to these 3:2 and 4:3 ratios. It is also typically the case that STREAM results (again using all processors) are about the same for all four kernels when non-temporal stores are used, while (due to the differing ratios of extra read traffic) the Add and Triad kernels typically run only ~1/3 faster with non-temporal stores than with cached stores, versus ~1/2 faster for Copy and Scale.


Invited Talk at SuperComputing 2016!

Posted by John D. McCalpin, Ph.D. on 16th October 2016

“Memory Bandwidth and System Balance in HPC Systems”

If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic).   The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.

The official announcement:

SC16 Invited Talk Spotlight: Dr. John D. McCalpin Presents “Memory Bandwidth and System Balance in HPC Systems”


Memory Latency Components

Posted by John D. McCalpin, Ph.D. on 10th March 2011

A reader of this site asked me if I had a detailed breakdown of the components of memory latency for a modern microprocessor-based system. Since the only real data I have is confidential/proprietary and obsolete, I decided to try to build up a latency equation from memory….

Preliminary Comments:

It is possible to estimate pieces of the latency equation on various systems if you combine carefully controlled microbenchmarks with a detailed understanding of the cache hierarchy, the coherence protocol, and the hardware performance monitors. Being able to control the CPU, DRAM, and memory controller frequencies independently is a big help.

On the other hand, if you have not worked in the design team of a modern microprocessor it is unlikely that you will be able to anticipate all the steps that are required in making a “simple” memory access. I spent most of 12 years in design teams at SGI, IBM, and AMD, and I am pretty sure that I cannot think of all the required steps.

Memory Latency Components: Abridged

Here is a sketch of some of the components for a simple, single-chip system (my AMD Phenom II model 555), for which I quoted a pointer-chasing memory latency of 51.58 ns at 3.2 GHz with DDR3/1600 memory. I will start counting when the load instruction is issued (ignoring instruction fetch, decode, and queuing).
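
(For reference, the pointer-chasing measurement itself is conceptually just a dependent load chain through a buffer much larger than the caches, with the pointers shuffled so that hardware prefetchers cannot follow the pattern. The sketch below is a generic illustration of that technique, not the exact code used to obtain the 51.58 ns figure.)

    /* Generic pointer-chasing sketch (illustration only).  Each load depends
     * on the value returned by the previous load, so the average time per
     * iteration of this loop approximates the load-to-use memory latency.   */
    void *chase(void **p, long iterations)
    {
        for (long i = 0; i < iterations; i++)
            p = (void **)*p;    /* next address comes from the previous load */
        return p;               /* keep the chain live so it is not optimized away */
    }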

  1. The load instruction queries the (virtually addressed) L1 cache tags — this probably occurs one cycle after the load instruction executes.
    Simultaneously, the virtual address is looked up in the TLB. Assuming an L1 Data TLB hit, the corresponding physical address is available ~1 cycle later and is used to check for aliasing in the L1 Data Cache (this is rare). Via sneakiness, the Opteron manages to perform both queries with only a single access to the L1 Data Cache tags.
  2. Once the physical address is available and it has been determined that the virtual address missed in the L1, the hardware initiates a query of the (private) L2 cache tags and the core’s Miss Address Buffers. In parallel with this, the Least Recently Used entry in the corresponding congruence class of the L1 Data Cache is selected as the “victim” and migrated to the L2 cache (unless the chosen victim entry in the L1 is in the “invalid” state or was originally loaded into the L1 Data Cache using the PrefetchNTA instruction).
  3. While the L2 tags are being queried, a Miss Address Buffer is allocated and a speculative query is sent to the L3 cache directory.
  4. Since the L3 is both larger than the L2 and shared, its response time will constitute the critical path. I did not measure L3 latency on the Phenom II system, but other AMD Family 10h Revision C processors have an average L3 hit latency of 48.4 CPU clock cycles. (The non-integer average is no surprise at the L3 level, since the 6 MB L3 is composed of several different blocks that almost certainly have slightly different latencies.)
    I can’t think of a way to precisely determine the time required to identify an L3 miss, but estimating it as 1/2 of the L3 hit latency is probably in the right ballpark. So 24.2 clock cycles at 3.2 GHz contributes the first 7.56 ns to the latency.
  5. Once the L3 miss is confirmed, the processor can begin to set up a memory access. The core sends the load request to the “System Request Interface”, where the address is compared against various tables to determine where to send the request (local chip, remote chip, or I/O), so that the message can be prepended with the correct crossbar output address. This probably takes another few cycles, so we are up to about 9.0 ns.
  6. The load request must cross an asynchronous clock boundary on the way from the core to the memory controller, since they run at different clock frequencies. Depending on the implementation, this can add a latency of several cycles on each side of the clock boundary. An aggressive implementation might take as few as 3 cycles in the CPU clock domain plus 5 cycles in the memory controller clock domain, for a total of ~3.5 ns in the outbound direction (assuming a 3.2 GHz core clock and a 2.0 GHz NorthBridge clock).
  7. At this point the memory controller begins to do two things in parallel.  (Either of these could constitute the critical path in the latency equation, depending on the details of the chip implementation and the system configuration.)
    • probe the other caches on the chip, and
    • begin to set up the DRAM access.
  8. For the probes, it looks like four asynchronous crossings are required (requesting core to memory controller, memory controller to other core(s), other cores to memory controller, memory controller to requesting core). (Probe responses from the various cores on each chip are gathered by the chip’s memory controller and then forwarded to the requesting core as a single message per memory controller.) Again assuming 3 cycles on the source side of the interface and 5 cycles on the destination side of the interface, these four crossings take 3.5+3.1+3.5+3.1 = 13.2 ns. Each of the other cores on the chip will take a few cycles to probe its L1 and L2 caches — I will assume that this takes about 1/2 of the 15.4 cycle average L2 hit latency, so about 2.4 ns. If there is no overhead in collecting the probe response(s) from the other core(s) on the chip, this adds up to 15.6 ns from the time the System Request Interface is ready to send the request until the probe response is returned to the requesting core. Obviously the core won’t be able to process the probe response instantaneously — it will have to match the probe response with the corresponding load buffer, decide what the probe response means, and send the appropriate signal to any functional units waiting for the register that was loaded to become valid. This is probably pretty fast, especially at core frequencies, but probably kicks the overall probe response latency up to ~17 ns.
  9. For the memory access path, there are also four asynchronous crossings required — requesting core to memory controller, memory controller to DRAM, DRAM to memory controller, and memory controller to core. I will assume 3.5 and 3.1 ns for the core to memory controller boundaries. If I assume the same 3+5 cycle latency for the asynchronous boundary at the DRAMs the numbers are quite high — 7.75 ns for the outbound path and 6.25 ns for the inbound path (assuming 2 GHz for the memory controller and 0.8 GHz for the DRAM channel).
  10. There is additional latency associated with the time-of-flight of the commands from the memory controller to the DRAM and of the data from the DRAM back to the memory controller on the DRAM bus. These vary with the physical details of the implementation, but typically add on the order of 1 ns in each direction.
  11. I did not record the CAS latency settings for my system, but CAS 9 is typical for DDR3/1600. This contributes 11.25 ns.
  12. On the inbound trip, the data has to cross two asynchronous boundaries, as discussed above.
  13. Most systems are set up to perform “critical word first” memory accesses, so the memory controller returns the 8 to 128 bits requested in the first DRAM transfer cycle (independent of where they are located in the cache line). Once this first burst of data is returned to the core clock domain, it must be matched with the corresponding load request and sent to the corresponding processor register (which then has its “valid” bit set, allowing the out-of-order instruction scheduler to pick any dependent instructions for execution in the next cycle). In parallel with this, the critical burst and the remainder of the cache line are transferred to the previously chosen “victim” location in the L1 Data Cache and the L1 Data Cache tags are updated to mark the line as Most Recently Used. Again, it is hard to know exactly how many cycles will be required to get the data from the “edge” of the core clock domain into a valid register, but 3-5 cycles gives another 1.0-1.5 ns.

The preceding steps add up all the outbound and inbound latency components that I can think of off the top of my head.

Let’s see what they add up to:

  • Core + System Request Interface: outbound: ~9 ns
  • Cache Coherence Probes: (~17 ns) — smaller than the memory access path, so probably completely overlapped
  • Memory Access Asynchronous interface crossings: ~21 ns
  • DRAM CAS latency: 11.25 ns
  • Core data forwarding: ~1.5 ns

This gives:

  • Total non-overlapped: ~43 ns
  • Measured latency: 51.6 ns
  • Unaccounted: ~9 ns = 18 memory controller clock cycles (assuming 2.0 GHz)
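
For convenience, here is the same tally written out as a trivial calculation (same assumptions as in the text: 3.2 GHz core, 2.0 GHz NorthBridge, 0.8 GHz DRAM channel, CAS 9; the ~1 ns time-of-flight terms from step 10 are left out, as in the summary above):

    /* Back-of-the-envelope tally of the estimates above. */
    #include <stdio.h>

    int main(void)
    {
        double core_and_sri = 9.0;                       /* steps 1-5: tags, TLB, L2/L3, SRI  */
        double async_total  = 3.5 + 7.75 + 6.25 + 3.1;   /* four asynchronous crossings       */
        double cas          = 9.0 / 0.8;                 /* CAS 9 at 0.8 GHz = 11.25 ns       */
        double forwarding   = 1.5;                       /* step 13: data into a register     */

        double estimate = core_and_sri + async_total + cas + forwarding;
        double measured = 51.58;
        printf("estimated %.1f ns, measured %.2f ns, unaccounted %.1f ns "
               "(~%.0f NorthBridge cycles at 2.0 GHz)\n",
               estimate, measured, measured - estimate,
               (measured - estimate) * 2.0);
        return 0;
    }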

Final Comments:

  • I don’t know how much of the above is correct, but the match to observed latency is closer than I expected when I started….
  • The inference of 18 memory controller clock cycles seems quite reasonable given all the queues that need to be checked & such.
  • I have a feeling that my estimates of the asynchronous interface delays on the DRAM channels are too high, but I can’t find any good references on this topic at the moment.

Comments and corrections are always welcome.  In my career I have found that a good way to learn is to try to explain something badly and have knowledgeable people correct me!   🙂


Optimizing AMD Opteron Memory Bandwidth, Part 4: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 9th November 2010

Following up on Part 1, Part 2, and Part 3, it is time to get into the ugly stuff — trying to control DRAM bank and rank access patterns and working to improve the effectiveness of the memory controller prefetcher.

Background: Banks and Ranks

The DRAM installed in the system under test consists of 2 dual-rank 2 GiB DIMMs in each channel of each chip. Each “rank” is composed of 9 DRAM chips, each with 1 Gbit capacity and each driving 8 bits of the 72-bit output of the DIMM (64 bits data + 8 bits ECC). Each of these 1 Gbit DRAM chips is divided into 8 “banks” of 128 Mbits each, and each of these banks has a 1 KiB “page size”. This “DRAM page size” is unrelated to the “virtual memory page size” discussed above, but it is easy to get confused! The DRAM page size defines the amount of information transferred from the DRAM array into the “open page” buffer amps in each DRAM bank as part of the two-step (row/column) addressing used to access the DRAM memory. In the system under test, the DRAM page size is thus 8 KiB (8 DRAM chips * 1 KiB/DRAM chip), with contiguous cache lines distributed between the two DRAM channels (using a 6-bit hash function that I won’t go into here). Each DRAM chip has 8 banks, so each rank maps 128 KiB of contiguous addresses (2 channels * 8 banks * 8 KiB/bank).

Why does this matter?

  1. Every time a reference is made to a new DRAM page, the full 8 KiB is transferred from the DRAM array to the DRAM sense amps. This uses a fair amount of power, and it makes sense to try to read the entire 8 KiB while it is in the sense amps. Although it is beyond the scope of today’s discussion, reading data from “open pages” uses only about 1/4 to 1/5 of the power required to read the same amount of data in “closed page” mode, where only one cache line is retrieved from each DRAM page.
  2. Data in the sense amps can be accessed at significantly lower latency — this is called “open page” access (or a “page hit”), and the latency is referred to as the “CAS latency”. On most systems the CAS latency is between 12.5 ns and 15 ns, independent of the frequency of the DRAM bus. If the data is not in the sense amps (a “page miss”), loading data from the array to the sense amps takes an additional latency of 12.5 to 15 ns. If the sense amps were holding other valid data, it would be necessary to write that data back to the array, taking an additional 12.5 to 15 ns. If the DRAM latency is not completely overlapped with the cache coherence latency, these increases will reduce the sustainable bandwidth according to Little’s Law: Bandwidth = Concurrency / Latency
  3. The DRAM bus is a shared bus with multiple transmitters and multiple receivers — five of each in the system under test: the memory controller and four DRAM ranks. Every time the device driving the bus needs to be switched, the bus must be left idle for a short period of time to ensure that the receivers can synchronize with the next driver. When the memory controller switches from reading to writing this is called a “read/write turnaround”. When the memory controller switches from writing to reading this is called a “write/read turnaround”. When the memory controller switches from reading from one rank to reading from a different rank this is called a “chip select turnaround” or a “chip select stall”, or sometimes a “read-to-read” stall or delay or turnaround. These idle periods depend on the electrical properties of the bus, including the length of the traces on the motherboard, the number of DIMM sockets, and the number and type of DIMMs installed. The idle periods are only very weakly dependent on the bus frequency, so as the bus gets faster and the transfer time of a cache line gets faster, these delays become proportionately more expensive. It is common for the “chip select stall” period to be as large as the cache line transfer time, meaning that a code that performs consecutive reads from different banks will only be able to use about 50% of the DRAM bandwidth.
  4. Although it could have been covered in the previous post on prefetching, the memory controller prefetcher on AMD Family 10h Opteron systems appears to only look for one address stream in each 4 KiB region. This suggests that interleaving fetches from different 4 KiB pages might allow the memory controller prefetcher to produce more outstanding prefetches. The extra loops that I introduce to allow control of DRAM rank access are also convenient for allowing interleaving of fetches from different 4 KiB pages.

Experiments

Starting with Version 007 (packed SSE arithmetic, large pages, no software prefetching) at 5.392 GB/s, I split the single loop over the array into a set of three loops — the innermost loop over the 64 cache lines in a 4 KiB page, a middle loop for the 32 4 KiB pages in a 128 KiB DRAM rank, and an outer loop for the (many) 128 KiB DRAM rank ranges in the full array. The resulting Version 008 loop structure looks like:

            for (k=0; k<N; k+=RANKSIZE) {
                for (j=k; j<k+RANKSIZE; j+=PAGESIZE) {
                    for (i=j; i<j+PAGESIZE; i+=8) {
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                    }
                }
            }
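
The post does not show the constant definitions, but from the loop structure described above (64 cache lines per 4 KiB page, 32 4 KiB pages per 128 KiB rank) and the fact that the loop indices count 8-Byte doubles, plausible definitions (my assumption, not the original source) would be:

    /* Assumed definitions -- not shown in the original post.
     * All loop indices count 8-Byte doubles.                              */
    #define PAGESIZE  512                 /* 4 KiB   = 512 doubles         */
    #define RANKSIZE  (32 * PAGESIZE)     /* 128 KiB = 32 contiguous 4 KiB pages */
    /* N is the total array length in doubles, assumed to be a multiple of
       RANKSIZE, with the array starting on a 128 KiB boundary.            */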

The resulting inner loop looks the same as before — but with the much shorter iteration count of 64 instead of 512,000:

..B1.30:
        addpd     (%r14,%rsi,8), %xmm3 
        addpd     16(%r14,%rsi,8), %xmm2 
        addpd     32(%r14,%rsi,8), %xmm1
        addpd     48(%r14,%rsi,8), %xmm0 
        addq      $8, %rsi 
        cmpq      %rcx, %rsi  
        jl        ..B1.30

The overall performance of Version 008 drops almost 9% compared to Version 007 — 4.918 GB/s vs 5.392 GB/s — for reasons that are unclear to me.

Interleaving Fetches Across Multiple 4 KiB Pages

The first set of optimizations based on Version 008 will be to interleave the accesses so that multiple 4 KiB pages are accessed concurrently. This is implemented by a technique that compiler writers call “unroll and jam”. I unroll the middle loop (the one that covers the 32 4 KiB pages in a rank) and interleave (“jam”) the iterations. This is done once for Version 013 (i.e., concurrently accessing 2 4 KiB pages), three times for Version 014 (i.e., concurrently accessing 4 4 KiB pages), and seven times for Version 015 (i.e., concurrently accessing 8 4 KiB pages).
To keep the listing short, I will just show the inner loop structure of the first of these — Version 013:

            for (k=0; k<N; k+=RANKSIZE) {
                for (j=k; j<k+RANKSIZE; j+=2*PAGESIZE) {
                    for (i=j; i<j+PAGESIZE; i+=8) {
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                        x0 = _mm_load_pd(&a[i+PAGESIZE+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+PAGESIZE+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+PAGESIZE+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+PAGESIZE+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                    }
                }
            }

The assembly code for the inner loop looks like what one would expect:

..B1.30:
        addpd     (%r15,%r11,8), %xmm3  
        addpd     16(%r15,%r11,8), %xmm2 
        addpd     32(%r15,%r11,8), %xmm1
        addpd     48(%r15,%r11,8), %xmm0 
        addpd     4096(%r15,%r11,8), %xmm3 
        addpd     4112(%r15,%r11,8), %xmm2
        addpd     4128(%r15,%r11,8), %xmm1 
        addpd     4144(%r15,%r11,8), %xmm0  
        addq      $8, %r11
        cmpq      %r10, %r11 
        jl        ..B1.30

Performance for these three versions improves dramatically as the number of 4 KiB pages accessed increases:

  • Version 008: one 4 KiB page accessed: 4.918 GB/s
  • Version 013: two 4 KiB pages accessed: 5.864 GB/s
  • Version 014: four 4 KiB pages accessed: 6.743 GB/s
  • Version 015: eight 4 KiB pages accessed: 6.978 GB/s

Going one step further, to 16 pages accessed in the inner loop, causes a sharp drop in performance — down to 5.314 GB/s.

So this idea of accessing multiple 4 KiB pages seems to have significant merit for improving single-thread read bandwidth. Next we see if explicit prefetching can push performance even higher.

Multiple 4 KiB Pages Plus Explicit Software Prefetching

A number of new versions were produced:

  • Version 008 (single page accessed with triple loops) + SW prefetch -> Version 009
  • Version 013 (two pages accessed) + SW prefetch -> Version 010
  • Version 014 (four pages accessed) + SW prefetch -> Version 011
  • Version 015 (eight pages accessed) + SW prefetch -> Version 012

In each case the prefetch was set AHEAD of the current pointer by a distance of 0 to 1024 8-Byte elements and all results were tabulated. Unlike the initial tests with SW prefetching, the results here are more variable and less easy to understand.
Starting with Version 009, the addition of SW prefetching restores the performance loss introduced by the triple-loop structure and provides a strong additional boost.

Loop structure | no SW prefetch           | with SW prefetch
single loop    | Version 007: 5.392 GB/s  | Version 005: 6.091 GB/s
triple loop    | Version 008: 4.917 GB/s  | Version 009: 6.173 GB/s

All of these are large page results except Version 005. The slight increase from Version 005 to Version 009 is consistent with the improvements seen in adding large pages from Version 006 to Version 007, so it looks like adding the SW prefetch negates the performance loss introduced by the triple-loop structure.
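
Before looking at the combined results, here is a plausible reconstruction of what a Version-010-style kernel (two 4 KiB pages accessed concurrently, plus software prefetch) might look like. The original source is not shown in the post, so the prefetch placement and the _MM_HINT_T0 hint are my assumptions; AHEAD is the prefetch distance in 8-Byte elements:

    /* Sketch of a Version-010-style kernel (my reconstruction, not the
     * original source): two 4 KiB pages per inner-loop iteration, with one
     * software prefetch per page per cache line.  Prefetches that run past
     * the end of the array are harmless (PREFETCH does not fault).         */
    #include <emmintrin.h>

    #define PAGESIZE 512                    /* 4 KiB in doubles (assumed)   */
    #define RANKSIZE (32*PAGESIZE)          /* 128 KiB in doubles (assumed) */

    double sum_v010(const double *a, long N, long AHEAD)
    {
        __m128d sum0 = _mm_setzero_pd(), sum1 = _mm_setzero_pd();
        __m128d sum2 = _mm_setzero_pd(), sum3 = _mm_setzero_pd();
        for (long k = 0; k < N; k += RANKSIZE) {
            for (long j = k; j < k + RANKSIZE; j += 2*PAGESIZE) {
                for (long i = j; i < j + PAGESIZE; i += 8) {
                    _mm_prefetch((const char *)&a[i+AHEAD], _MM_HINT_T0);
                    _mm_prefetch((const char *)&a[i+PAGESIZE+AHEAD], _MM_HINT_T0);
                    sum0 = _mm_add_pd(sum0, _mm_load_pd(&a[i+0]));
                    sum1 = _mm_add_pd(sum1, _mm_load_pd(&a[i+2]));
                    sum2 = _mm_add_pd(sum2, _mm_load_pd(&a[i+4]));
                    sum3 = _mm_add_pd(sum3, _mm_load_pd(&a[i+6]));
                    sum0 = _mm_add_pd(sum0, _mm_load_pd(&a[i+PAGESIZE+0]));
                    sum1 = _mm_add_pd(sum1, _mm_load_pd(&a[i+PAGESIZE+2]));
                    sum2 = _mm_add_pd(sum2, _mm_load_pd(&a[i+PAGESIZE+4]));
                    sum3 = _mm_add_pd(sum3, _mm_load_pd(&a[i+PAGESIZE+6]));
                }
            }
        }
        double partial[2];
        _mm_storeu_pd(partial, _mm_add_pd(_mm_add_pd(sum0, sum1),
                                          _mm_add_pd(sum2, sum3)));
        return partial[0] + partial[1];
    }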

Combining SW prefetching with interleaved 4 KiB page access produces some intriguing patterns.
Version 010 — fetching 2 4 KiB pages concurrently:
[Figure: Version 010 bandwidth vs. prefetch AHEAD distance]

Version 011 — fetching 4 4 KiB pages concurrently:
[Figure: Version 011 bandwidth vs. prefetch AHEAD distance]

Version 012 — fetching 8 4 KiB pages concurrently:
[Figure: Version 012 bandwidth vs. prefetch AHEAD distance]

When reviewing these figures, keep in mind that the baseline performance number is the 6.173 GB/s from Version 009, so all of the results are good. What is most unusual about these results (speaking from ~25 years of experience in performance analysis) is that the last two figures show some fairly strong upward spikes in performance. It is entirely commonplace to see decreases in performance due to alignment issues, but it is quite rare to see increases in performance that depend sensitively on alignment yet remain repeatable and usable.

Choosing a range of optimum AHEAD distances for each case allows me to summarize:

Pages accessed in inner loop | no SW prefetch           | with SW prefetch
1                            | Version 007: 5.392 GB/s  | Version 009: 6.173 GB/s
2                            | Version 013: 5.864 GB/s  | Version 010: 6.417 GB/s
4                            | Version 014: 6.743 GB/s  | Version 011: 7.063 GB/s
8                            | Version 015: 6.978 GB/s  | Version 012: 7.260 GB/s

It looks like this is about as far as I can go with a single thread. Performance started at 3.401 GB/s and gradually increased to 7.260 GB/s — an increase of 113%. The resulting code is not terribly long, but it is important to remember that I only implemented the case that starts on a 128 KiB boundary (which is necessary to ensure that the accesses to multiple 4 KiB pages are all in the same rank). Extra prolog and epilog code will be required for a general-purpose sum reduction routine.

So why did I spend all of this time? Why not just start by running multiple threads to increase performance?

First, the original code was slow enough that even perfect scaling across four threads would barely provide enough concurrency to reach the 12.8 GB/s read bandwidth of the chip.
Second, the original code is not set up to allow control over which pages are being accessed. Using multiple threads that are accessing different ranks will significantly increase the number of chip-select stalls, and may not produce the desired level of performance. (More on this soon.)
Third, the system under test has only two DDR2/800 channels per chip — not exactly state of the art. Newer systems have two or four channels of DDR3 at 1333 or even 1600 MHz. Given similar memory latencies, Little’s Law tells us that these systems will only deliver more sustained bandwidth if they are able to maintain more outstanding cache misses. The experiments presented here should provide some insights into how to structure code to increase the memory concurrency.

But that does not mean that I won’t look at the performance of multi-threaded implementations — that is coming up anon…
