John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Computer Hardware' Category

TACC Ranger Node Local and Remote Memory Latency Tables

Posted by John D. McCalpin, Ph.D. on 26th July 2012

In the previous post, I published my best set of numbers for local memory latency on a variety of AMD Opteron system configurations. Here I expand that to include remote memory latency on some of the systems that I have available for testing.

Ranger is the oldest system still operational here at TACC.  It was brought on-line in February 2008 and is currently scheduled to be decommissioned in early 2013.  Each of the 3936 SunBlade X6420 nodes contains four AMD “Barcelona” quad-core Opteron processors (model 8356), running at a core frequency of 2.3 GHz and a NorthBridge frequency of 1.6 GHz.  (The Opteron 8356 processor supports a higher NorthBridge frequency, but this requires a different motherboard with  “split-plane” power supply support — not available on the SunBlade X6420.)

The on-node interconnect topology of the SunBlade X6420 is asymmetric, making maximum use of the three HyperTransport links on each Opteron processor while still allowing two HyperTransport links to be used for I/O.

As seen in the figure below, chips 1 & 2 on each node are directly connected to each of the other three chips, while chips 0 & 3 are only connected to two other chips — requiring two “hops” on the HyperTransport network to access the third remote chip.  Memory latency on this system is bounded below by the time required to “snoop” the caches on the other chips.  Chips 1 & 2 are directly connected to the other chips, so they get their snoop responses back more quickly and therefore have lower memory latency.

[Figure: Ranger compute node processor interconnect topology.]

A variant of the “lat_mem_rd.c” program from “lmbench” (version 2) was used to measure the memory access latency.  The benchmark chases a chain of pointers that have been set up with a fixed stride of 128 Bytes (so that the core hardware prefetchers are not activated) and with a total size that significantly exceeds the size of the 2 MiB L3 cache.  For the table below, array sizes of 32 MiB to 1024 MiB were used, with negligible variations in observed latency.  For this particular system the memory controller prefetchers were active at the 128-Byte stride used, but since the effective latency is limited by the snoop response time, there is no change to the effective latency even when the memory controller prefetchers fetch the data early.  (I.e., the processors might get the data earlier due to memory controller prefetch, but they cannot use the data until all the snoop responses have been received.)
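
The measurement kernel itself is simple. The sketch below is a stripped-down illustration of the pointer-chasing technique, not the actual lat_mem_rd source; the array size, iteration count, and timing harness are my own assumptions.

        /* Simplified pointer-chasing latency kernel (illustration only, not lat_mem_rd).
         * A chain of pointers is laid out with a fixed 128-Byte stride through a large
         * array; each load depends on the previous one, so the average time per
         * iteration approximates the memory latency rather than the bandwidth. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define ARRAY_BYTES (256UL*1024*1024)       /* much larger than any cache */
        #define STRIDE      (128/sizeof(void *))    /* 128-Byte stride between chain elements */

        int main(void)
        {
            size_t n = ARRAY_BYTES / sizeof(void *);
            void **a = malloc(n * sizeof(void *));
            if (a == NULL) { perror("malloc"); return 1; }
            for (size_t i = 0; i < n; i += STRIDE)          /* build a simple circular chain */
                a[i] = (void *) &a[(i + STRIDE) % n];

            long iters = 10000000;
            void **p = (void **) a[0];
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < iters; i++)
                p = (void **) *p;                           /* serialized chain of dependent loads */
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
            printf("average load-to-use latency: %.1f ns (checksum %p)\n", ns / iters, (void *) p);
            free(a);
            return 0;
        }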

Memory latency for all combinations of (chip making request) and (chip holding data) is shown in the table below:

Memory Latency (ns)  | Data on Chip 0 | Data on Chip 1 | Data on Chip 2 | Data on Chip 3
Request from Chip 0  |     133.2      |     136.9      |     136.4      |     145.4
Request from Chip 1  |     140.3      |     100.3      |     122.8      |     139.3
Request from Chip 2  |     140.4      |     122.2      |     100.4      |     139.3
Request from Chip 3  |     146.4      |     137.4      |     137.4      |     134.9
[Figure: Cache latency and local and remote memory latency for Ranger compute nodes.]

Posted in Computer Hardware | Comments Off on TACC Ranger Node Local and Remote Memory Latency Tables

What good are “Large Pages” ?

Posted by John D. McCalpin, Ph.D. on 12th March 2012

I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.

  • The size of the page controls how many bits are translated between virtual and physical addresses, and so represents a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
  • A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks.

The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).

To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.
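
For readers who want to experiment, one common way to get 2 MiB pages on Linux is an anonymous mmap() with the MAP_HUGETLB flag (the benchmark code later on this page uses shared-memory segments instead). The sketch below is a hedged example of that approach, not code from the original post; it only succeeds if huge pages have been reserved by the administrator (e.g., via /proc/sys/vm/nr_hugepages).

        /* Hypothetical example (not from the original post): request a 64 MiB
         * anonymous mapping backed by 2 MiB pages on Linux. */
        #define _GNU_SOURCE
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t bytes = 64UL << 20;            /* 64 MiB, a multiple of the 2 MiB page size */
            double *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (a == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }

            for (size_t i = 0; i < bytes / sizeof(double); i++)   /* touch the pages */
                a[i] = (double) i;
            printf("allocated %zu bytes on 2 MiB pages at %p\n", bytes, (void *) a);
            munmap(a, bytes);
            return 0;
        }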

Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):

AMD Opteron Family 10h Revision D (“Istanbul”):

  • L1 DTLB:
    • 4KiB pages: 48 entries;
    • 2MiB pages: 48 entries;
    • 1GiB pages: 48 entries
  • L2 TLB:
    • 4KiB pages: 512 entries;
    • 2MiB pages: 128 entries;
    • 1GiB pages: 16 entries

AMD Opteron Family 15h Model 6220 (“Interlagos”):

  • L1 DTLB
    • 4KiB, 32 entry, fully associative
    • 2MiB, 32 entry, fully associative
    • 1GiB, 32 entry, fully associative
  • L2 DTLB: (none)
  • Unified L2 TLB:
    • Data entries: 4KiB/2MiB/4MiB/1GiB, 1024 entries, 8-way associative
    • “An entry allocated by one core is not visible to the other core of a compute unit.”

Intel Xeon 56xx (“Westmere”):

  • L1 DTLB:
    • 4KiB pages: 64 entries;
    • 2MiB pages: 32 entries
  • L2 TLB:
    • 4KiB pages: 512 entries;
    • 2MiB pages: none

Intel Xeon E5 26xx (“Sandy Bridge EP”):

  • L1 DTLB
    • 4KiB, 64 entries
    • 2MiB/4MiB, 32 entries
    • 1GiB, 4 entries
  • STLB (second-level TLB)
    • 4KiB, 512 entries
    • (There are no entries for 2MiB pages or 1GiB pages in the STLB)

Xeon Phi Coprocessor SE10P: (Note 1)

  • L1 DTLB
    • 4KiB, 64 entries, 4-way associative
    • 2MiB, 8 entries, 4-way associative
  • L2 TLB
    • 4KiB, 64 Page Directory Entries, 4-way associative (Note 2)
    • 2MiB, 64 entries, 4-way associative

Most of these cores can map at least 2 MiB (512*4KiB) using small pages before suffering level 2 TLB misses, and at least 64 MiB (32*2MiB) using large pages.  All of these systems should see a performance increase when performing random accesses over memory ranges that are much larger than 2 MiB and less than 64 MiB.

What you are trying to avoid in all these cases is the worst case (Note 3) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:

  • 5 trips to memory to load data mapped on a 4KiB page,
  • 4 trips to memory to load data mapped on a 2MiB page, and
  • 3 trips to memory to load data mapped on a 1GiB page.

In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593).  Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)
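
To make the depth of the walk concrete, the sketch below (my illustration, not from the original post) splits a 48-bit x86_64 virtual address into the four 9-bit table indices used by the hierarchical translation. With 2 MiB pages the walk stops at the Page Directory level and bits 20:0 become the page offset, which is why one memory reference drops out of the walk; 1 GiB pages stop one level earlier still.

        /* Illustrative only: decompose a 48-bit x86_64 virtual address into the
         * PML4/PDPT/PD/PT indices plus page offset used by the hierarchical walk. */
        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint64_t va = 0x00007f1234567abcULL;          /* arbitrary example address */
            unsigned pml4 = (va >> 39) & 0x1ff;           /* bits 47:39 */
            unsigned pdpt = (va >> 30) & 0x1ff;           /* bits 38:30 */
            unsigned pd   = (va >> 21) & 0x1ff;           /* bits 29:21 */
            unsigned pt   = (va >> 12) & 0x1ff;           /* bits 20:12 (4 KiB pages only) */
            printf("PML4=%u PDPT=%u PD=%u PT=%u offset4K=0x%llx offset2M=0x%llx\n",
                   pml4, pdpt, pd, pt,
                   (unsigned long long)(va & 0xfff),      /* 12-bit offset within a 4 KiB page */
                   (unsigned long long)(va & 0x1fffff));  /* 21-bit offset within a 2 MiB page */
            return 0;
        }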

A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite.  Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.
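
The core of such a test is tiny. The following is a simplified single-node sketch of the RandomAccess access pattern, not the official HPC Challenge code; the table size, update count, and the plain LCG random-number generator are my own simplifications.

        /* Simplified single-node illustration of the RandomAccess ("GUPS") pattern.
         * Each update XORs a pseudo-random value into a pseudo-random 64-bit word of
         * a large table, so nearly every update misses in the TLBs when the table is
         * mapped with 4 KiB pages. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>

        int main(void)
        {
            size_t n = 1UL << 27;                         /* 128M words = 1 GiB table */
            uint64_t *table = malloc(n * sizeof(uint64_t));
            if (table == NULL) { perror("malloc"); return 1; }
            for (size_t i = 0; i < n; i++) table[i] = i;

            uint64_t x = 0x123456789abcdef0ULL;
            for (long u = 0; u < 10000000; u++) {         /* 10M random updates */
                x = x * 6364136223846793005ULL + 1442695040888963407ULL;  /* simple LCG */
                table[x & (n - 1)] ^= x;                  /* widely spaced random update */
            }
            printf("table[0] = 0x%016llx\n", (unsigned long long) table[0]);
            free(table);
            return 0;
        }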


Note 1:

The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency.  The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth.  The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker.   The third unusual feature is unusual enough to get a separate discussion in Note 2.

Note 2:

Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB.  Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation — responsible for translating virtual address bits 29:21.  The benefit here is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB=256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup for the Page Table Walk for an address range of 64*2MiB=128MiB.  In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB.  Combining this with the caching for the first two levels of the hierarchical address translation (see Note 4) and a high probability of finding the Page Table Entry in the L1 or L2 caches, this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.

Note 3:

The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.

Note 4:

Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.

Posted in Computer Architecture, Computer Hardware, Performance, Reference | Comments Off on What good are “Large Pages” ?

Is “ordered summation” a hard problem to speed up?

Posted by John D. McCalpin, Ph.D. on 15th February 2012

Sometimes things that seem incredibly difficult aren’t really that bad….

I have been reviewing technology challenges for “exascale” computing and ran across an interesting comment in the 2008 “Technology Challenges in Achieving Exascale Systems” report.

In Section 5.8 “Application Assessments”, Figure 5.16 on page 82 places “Ordered Summation” in the upper right hand corner (serial and non-local) with the annotation “Just plain hard to speed up”.
The most obvious use for ordered summation is in computing sums or dot products in such a way that the result does not depend on the order of the computations, or on the number of partial sums used in intermediate stages.

Interestingly, “ordered summation” is not necessary to obtain sums or dot products that are “exact” independent of ordering or grouping. For the very important case of “exact” computation of inner products, the groundwork was laid out 30 years ago by Kulisch (e.g., Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice, Academic Press 1981, and US Patents 4,622,650 and 4,866,653). Kulisch proposes using a very long fixed point accumulator that can handle the full range of products of 64-bit IEEE values — from minimum denorm times minimum denorm to maximum value times maximum value. Working out the details and allowing extra bits to prevent overflow in the case of adding lots of maximum values, Kulisch proposed an accumulator of 4288 bits to handle the accumulation of products of 64-bit IEEE floating-point values. (ref and ref).
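
To see roughly where a number like 4288 comes from: the smallest denormal double is 2^-1074 and the largest finite double is just under 2^1024, so products of two doubles span from about 2^-2148 up to nearly 2^2048. Representing that span as a fixed-point value takes about 2148 + 2048 = 4196 bits, and the remaining 4288 - 4196 = 92 bits provide the headroom against overflow during accumulation mentioned above.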

For a long time this proposal of a humongously long accumulator (4288 bits = 536 Bytes = 67 64-bit words) was considered completely impractical, but as technology has changed, I think the approach makes a fair amount of sense now — trading computation of these exact inner products for the potentially much more expensive communication required to re-order the summation.

I have not looked at the current software implementations of this exact accumulator in detail, but it appears that on a current Intel microprocessor you can add two exact accumulators in ~133 cycles — one cycle for the first 64-bit addition, and 2 cycles for each of the next 66 64-bit add with carry operations. (AMD processors provide similar capability, with slightly different latency and throughput details.) Although the initial bit-twiddling to convert from two IEEE 64-bit numbers to a 106-bit fixed point value is ugly, the operations should not take very long in the common case, so presumably a software implementation of the exact accumulator would spend most of its time updating the exact accumulator from some “base” point (where the low-order bits of the product sit) up to the top of the accumulator. (It is certainly possible to employ various tricks to know when you can stop carrying bits “upstream”, but I am trying to be conservative in the performance estimates here.)
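
For concreteness, here is a minimal sketch of the multi-word addition being estimated above, assuming a 67-word (4288-bit) little-endian accumulator layout. A production version would use the hardware add-with-carry instruction directly (for example via the _addcarry_u64 intrinsic) and could stop propagating once the carry dies out; this portable variant uses a 128-bit intermediate (a GCC/Clang extension) instead.

        /* Sketch only: add one 4288-bit (67 x 64-bit word) fixed-point accumulator
         * into another using a software carry chain, low-order word first. */
        #include <stdint.h>

        #define ACC_WORDS 67                        /* 4288 bits / 64 bits per word */

        void acc_add(uint64_t dst[ACC_WORDS], const uint64_t src[ACC_WORDS])
        {
            unsigned carry = 0;
            for (int i = 0; i < ACC_WORDS; i++) {
                __uint128_t t = (__uint128_t) dst[i] + src[i] + carry;   /* 64-bit add plus carry-in */
                dst[i] = (uint64_t) t;
                carry  = (unsigned) (t >> 64);                           /* carry-out to the next word */
            }
        }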

Since these exact accumulations are order-independent, you can use all of the cores on the chip to run multiple accumulators in parallel. You can also get a bit of speedup by pipelining two accumulations on one core (1.5 cycles per Add with Carry throughput versus 2 cycles per Add with Carry Latency). To keep the control flow separate, this is probably done most easily via HyperThreading. Assuming each pair of 64-bit IEEE inputs generates outputs that average ~1/3 of the way up the exponent range, a naïve implementation would require ~44 Add With Carry operations per accumulate, or about 30 cycles per update in a pipelined implementation. Adding another ~25 cycles per element for bit twiddling and control overhead gives ~55 cycles per element on one core, or ~7.5 cycles per element on an 8-core processor. Assume a 2.5GHz clock to get ~3ns per update. Note that the update is associated with 16 Bytes of memory traffic to read the two input arrays, and that the resulting 5.3 GB/s of DRAM bandwidth is well within the chip’s capability. Interestingly, the chip’s sustained bandwidth limitation of 10-15 GB/s means that accumulating into a 64-bit IEEE value is only going to be 2-3 times as fast as this exact technique.

Sending exact accumulators between nodes for a tree-based summation is easy — with current interconnect fabrics the time required to send 536 Bytes is almost the same as the time required to send the 8 Byte IEEE partial sums currently in use. I.e., with QDR Infiniband, the time required to send a message via MPI is something like 1 microsecond plus (message length / 3.2 GB/s). This works out to 1.0025 microseconds with an 8 Byte partial sum and 1.167 microseconds for a 536 Byte partial sum, with the difference expected to decrease as FDR Infiniband is introduced.

I don’t know of anyone using these techniques in production, but it looks like we are getting close to the point where we pay a slight performance penalty (on what was almost certainly a small part of the code’s overall execution time) and never again need to worry about ordering or grouping leading to slightly different answers in sum reductions or dot products. This sounds like a step in the right direction….

Posted in Algorithms, Computer Hardware | Comments Off on Is “ordered summation” a hard problem to speed up?

AMD Opteron Processor models, families, and revisions

Posted by John D. McCalpin, Ph.D. on 2nd April 2011

Opteron Processor models, families, and revisions/steppings

Opteron naming is not that confusing, but AMD seems intent on making it difficult by rearranging their web site in mysterious ways….

I am creating this blog entry to make it easier for me to find my own notes on the topic!

The Wikipedia page has a pretty good listing:
List of AMD Opteron microprocessors

AMD has useful product comparison reference pages at:
AMD Opteron Processor Solutions
AMD Desktop Processor Solutions
AMD Opteron First Generation Reference (pdf)

Borrowing from those pages, a simple summary is:

First Generation Opteron: models 1xx, 2xx, 8xx.

  • These are all Family K8, and are described in AMD pub 26094.
  • They are usually referred to as “Rev E” or “K8, Rev E” processors.
    This is usually OK since most of the 130 nm parts are gone, but there is a new Family 10h rev E (below).
  • They are characterized by having DDR DRAM interfaces, supporting DDR 266, 333, and (Revision E) 400 MHz.
  • This also includes Athlon 64 and Athlon 64 X2 in sockets 754 and 939.
  • Versions:
    • Single core, 130 nm process: K8 revisions B3, C0, CG
    • Single core, 90 nm process: K8 revisions E4, E6
    • Dual core, 90 nm process: K8 revisions E1, E6

Second Generation Opteron: models 12xx, 22xx, 82xx

  • These are upgraded Family K8 cores, with a DDR2 memory controller.
  • They are usually referred to as “Revision F”, or “K8, Rev F”, and are described in AMD pub 32559 (where they are referred to as “Family NPT 0Fh”, with NPT meaning “New Platform Technology” and referring to the infrastructure related to socket F (aka socket 1207), and socket AM2 )
  • This also includes socket AM2 models of Athlon and most Athlon X2 processors (some are Family 11h, described below).
  • There is only one server version, with two steppings:
    • Dual core, 90 nm process: K8 revisions F2, F3

Upgraded Second Generation Opteron: Athlon X2, Sempron, Turion, Turion X2

  • These are very similar to Family 0Fh, revision G (not used in server parts), and are described in AMD document 41256.
  • The memory controller has less functionality.
  • The HyperTransport interface is upgraded to support HyperTransport generation 3.
    This allows a higher frequency connection between the processor chip and the external PCIe controller, so that PCIe gen2 speeds can be supported.

Third Generation Opteron: models 13xx, 23xx, 83xx

  • These are Family 10h cores with an enhanced DDR2 memory controller and are described in AMD publication 41322.
  • All server and most desktop versions have a shared L3 cache.
  • This also includes Phenom X2, X3, and X4 (Rev B3) and Phenom II X2, X3, X4 (Rev C)
  • Versions:
    • Barcelona: Dual core & Quad core, 65 nm process: Family 10h revisions B0, B2, B3, BA
    • Shanghai: Dual core & Quad core, 45 nm process: Family 10h revision C2
    • Istanbul: Up to 6-core, 45 nm process: Family 10h, revision D0
  • Revision D (“Istanbul”) introduced the “HT Assist” probe filter feature to improve scalability in 4-socket and 8-socket systems.

Upgraded Third Generation Opteron: models 41xx & 61xx

  • These are Family 10h cores with an enhanced DDR3-capable memory controller and are also described in AMD publication 41322.
  • All server and most desktop versions have a shared L3 cache.
  • It does not appear that any of the desktop parts use this same stepping as the server parts (D1).
  • There are two versions — both manufactured using a 45nm process:
    • Lisbon: 41xx series have one Family10h revision D1 die per package (socket C32).
    • Magny-Cours: 61xx series have two Family10h revision D1 dice per package (socket G34).
  • Family 10h, Revision E0 is used in the Phenom II X6 products.
    • This revision is the first to offer the “Core Performance Boost” feature.
    • It is also the first to generate confusion about the label “Rev E”.
    • It should be referred to as “Family 10h, Revision E” to avoid ambiguity.

Fourth Generation Opteron: server processor models 42xx & 62xx, and “AMD FX” desktop processors

  • These are socket-compatible with the 41xx and 61xx series, but with the “Bulldozer” core rather than the Family 10h core.
  • The Bulldozer core adds support for:
    • AVX — the extension of SSE from 128 bits wide to 256 bits wide, plus many other improvements. (First introduced in Intel “Sandy Bridge” processors.)
    • AES — additional instructions to dramatically improve performance of AES encryption/decryption. (First introduced in Intel “Westmere” processors.)
    • FMA4 — AMD’s 4-operand multiply-accumulate instructions. (32-bit & 64-bit arithmetic, with 64b, 128b, or 256b vectors.)
    • XOP — AMD’s set of extra integer instructions that were not included in AVX: multiply/accumulate, shift/rotate/permute, etc.
  • All current parts are produced in a 32 nm semiconductor process.
  • Valencia: 42xx series have one Bulldozer revision B2 die per package (socket C32)
  • Interlagos: 62xx series have two Bulldozer revision B2 dice per package (socket G34)
  • “AMD FX”: desktop processors have one Bulldozer revision B2 die per package (socket AM3+)
  • Counting cores and chips is getting more confusing…
    • Each die has 1, 2, 3, or 4 “Bulldozer modules”.
    • Each “Bulldozer module” has two processor cores.
    • The two processor cores in a module share the instruction cache (64kB), some of the instruction fetch logic, the pair of floating-point units, and the 2MB L2 cache.
    • The two processor cores in a module each have a private data cache (16kB), private fixed point functional and address generation units, and schedulers.
    • All modules on a die share an 8 MB L3 cache and the dual-channel DDR3 memory controller.
  • Bulldozer-based systems are characterized by a much larger “turbo” boost frequency increase than previous processors, with almost all models supporting an automatic frequency boost of over 20% when not using all the cores, and some models supporting frequency boosts of more than 30%.

Posted in Computer Hardware, Reference | 4 Comments »

Memory Latency Components

Posted by John D. McCalpin, Ph.D. on 10th March 2011

A reader of this site asked me if I had a detailed breakdown of the components of memory latency for a modern microprocessor-based system. Since the only real data I have is confidential/proprietary and obsolete, I decided to try to build up a latency equation from memory….

Preliminary Comments:

It is possible to estimate pieces of the latency equation on various systems if you combine carefully controlled microbenchmarks with a detailed understanding of the cache hierarchy, the coherence protocol, and the hardware performance monitors. Being able to control the CPU, DRAM, and memory controller frequencies independently is a big help.

On the other hand, if you have not worked in the design team of a modern microprocessor it is unlikely that you will be able to anticipate all the steps that are required in making a “simple” memory access. I spent most of 12 years in design teams at SGI, IBM, and AMD, and I am pretty sure that I cannot think of all the required steps.

Memory Latency Components: Abridged

Here is a sketch of some of the components for a simple, single-chip system (my AMD Phenom II model 555), for which I quoted a pointer-chasing memory latency of 51.58 ns at 3.2 GHz with DDR3/1600 memory. I will start counting when the load instruction is issued (ignoring instruction fetch, decode, and queuing).

  1. The load instruction queries the (virtually addressed) L1 cache tags — this probably occurs one cycle after the load instruction executes.
    Simultaneously, the virtual address is looked up in the TLB. Assuming an L1 Data TLB hit, the corresponding physical address is available ~1 cycle later and is used to check for aliasing in the L1 Data Cache (this is rare). Via sneakiness, the Opteron manages to perform both queries with only a single access to the L1 Data Cache tags.
  2. Once the physical address is available and it has been determined that the virtual address missed in the L1, the hardware initiates a query of the (private) L2 cache tags and the core’s Miss Address Buffers. In parallel with this, the Least Recently Used entry in the corresponding congruence class of the L1 Data Cache is selected as the “victim” and migrated to the L2 cache (unless the chosen victim entry in the L1 is in the “invalid” state or was originally loaded into the L1 Data Cache using the PrefetchNTA instruction).
  3. While the L2 tags are being queried, a Miss Address Buffer is allocated and a speculative query is sent to the L3 cache directory.
  4. Since the L3 is both larger than the L2 and shared, its response time will constitute the critical path. I did not measure L3 latency on the Phenom II system, but other AMD Family 10h Revision C processors have an average L3 hit latency of 48.4 CPU clock cycles. (The non-integer average is no surprise at the L3 level, since the 6 MB L3 is composed of several different blocks that almost certainly have slightly different latencies.)
    I can’t think of a way to precisely determine the time required to identify an L3 miss, but estimating it as 1/2 of the L3 hit latency is probably in the right ballpark. So 24.2 clock cycles at 3.2 GHz contributes the first 7.56 ns to the latency.
  5. Once the L3 miss is confirmed, the processor can begin to set up a memory access. The core sends the load request to the “System Request Interface”, where the address is compared against various tables to determine where to send the request (local chip, remote chip, or I/O), so that the message can be prepended with the correct crossbar output address. This probably takes another few cycles, so we are up to about 9.0 ns.
  6. The load request must cross an asynchronous clock boundary on the way from the core to the memory controller, since they run at different clock frequencies. Depending on the implementation, this can add a latency of several cycles on each side of the clock boundary. An aggressive implementation might take as few as 3 cycles in the CPU clock domain plus 5 cycles in the memory controller clock domain, for a total of ~3.5 ns in the outbound direction (assuming a 3.2 GHz core clock and a 2.0 GHz NorthBridge clock; this arithmetic is sketched in code after this list).
  7. At this point the memory controller begins to do two things in parallel.  (Either of these could constitute the critical path in the latency equation, depending on the details of the chip implementation and the system configuration.)
    • probe the other caches on the chip, and
    • begin to set up the DRAM access.
  8. For the probes, it looks like four asynchronous crossings are required (requesting core to memory controller, memory controller to other core(s), other cores to memory controller, memory controller to requesting core). (Probe responses from the various cores on each chip are gathered by the chip’s memory controller and then forwarded to the requesting core as a single message per memory controller.) Again assuming 3 cycles on the source side of the interface and 5 cycles on the destination side of the interface, these four crossings take 3.5+3.1+3.5+3.1 = 13.2 ns. Each of the other cores on the chip will take a few cycles to probe its L1 and L2 caches — I will assume that this takes about 1/2 of the 15.4 cycle average L2 hit latency, so about 2.4 ns. If there is no overhead in collecting the probe response(s) from the other core(s) on the chip, this adds up to 15.6 ns from the time the System Request Interface is ready to send the request until the probe response is returned to the requesting core. Obviously the core won’t be able to process the probe response instantaneously — it will have to match the probe response with the corresponding load buffer, decide what the probe response means, and send the appropriate signal to any functional units waiting for the register that was loaded to become valid. This is probably pretty fast, especially at core frequencies, but probably kicks the overall probe response latency up to ~17 ns.
  9. For the memory access path, there are also four asynchronous crossings required — requesting core to memory controller, memory controller to DRAM, DRAM to memory controller, and memory controller to core. I will assume 3.5 and 3.1 ns for the core to memory controller boundaries. If I assume the same 3+5 cycle latency for the asynchronous boundary at the DRAMs the numbers are quite high — 7.75 ns for the outbound path and 6.25 ns for the inbound path (assuming 2 GHz for the memory controller and 0.8 GHz for the DRAM channel).
  10. There is additional latency associated with the time-of-flight of the commands from the memory controller to the DRAM and of the data from the DRAM back to the memory controller on the DRAM bus. These vary with the physical details of the implementation, but typically add on the order of 1 ns in each direction.
  11. I did not record the CAS latency settings for my system, but CAS 9 is typical for DDR3/1600. This contributes 11.25 ns.
  12. On the inbound trip, the data has to cross two asynchronous boundaries, as discussed above.
  13. Most systems are set up to perform “critical word first” memory accesses, so the memory controller returns the 8 to 128 bits requested in the first DRAM transfer cycle (independent of where they are located in the cache line). Once this first burst of data is returned to the core clock domain, it must be matched with the corresponding load request and sent to the corresponding processor register (which then has its “valid” bit set, allowing the out-of-order instruction scheduler to pick any dependent instructions for execution in the next cycle.) In parallel with this, the critical burst and the remainder of the cache line are transferred to the previously chosen “victim” location in the L1 Data Cache and the L1 Data Cache tags are updated to mark the line as Most Recently Used. Again, it is hard to know exactly how many cycles will be required to get the data from the “edge” of the core clock domain into a valid register, but 3-5 cycles gives another 1.0-1.5 ns.
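
The asynchronous-crossing estimates in steps 6, 8, and 9 all follow the same pattern of a few cycles in the source clock domain plus a few in the destination clock domain. The small sketch below simply restates that arithmetic; the 3-cycle/5-cycle split is the assumption made above, not a measured value.

        /* Estimated one-way cost of an asynchronous clock-domain crossing, using the
         * assumed 3 cycles in the source domain plus 5 cycles in the destination domain. */
        #include <stdio.h>

        static double crossing_ns(double src_ghz, double dst_ghz)
        {
            return 3.0 / src_ghz + 5.0 / dst_ghz;
        }

        int main(void)
        {
            double core = 3.2, nb = 2.0, dram = 0.8;                     /* clock frequencies in GHz */
            printf("core -> NB   %.2f ns\n", crossing_ns(core, nb));     /* ~3.4 ns (step 6)   */
            printf("NB -> core   %.2f ns\n", crossing_ns(nb, core));     /* ~3.1 ns (steps 8, 9) */
            printf("NB -> DRAM   %.2f ns\n", crossing_ns(nb, dram));     /* ~7.75 ns (step 9)  */
            printf("DRAM -> NB   %.2f ns\n", crossing_ns(dram, nb));     /* ~6.25 ns (step 9)  */
            return 0;
        }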

The preceding steps add up all the outbound and inbound latency components that I can think of off the top of my head.

Let’s see what they add up to:

  • Core + System Request Interface: outbound: ~9 ns
  • Cache Coherence Probes: (~17 ns) — smaller than the memory access path, so probably completely overlapped
  • Memory Access Asynchronous interface crossings: ~21 ns
  • DRAM CAS latency: 11.25 ns
  • Core data forwarding: ~1.5 ns

This gives:

  • Total non-overlapped: ~43 ns
  • Measured latency: 51.6 ns
  • Unaccounted: ~9 ns = 18 memory controller clock cycles (assuming 2.0 GHz)

Final Comments:

  • I don’t know how much of the above is correct, but the match to observed latency is closer than I expected when I started….
  • The inference of 18 memory controller clock cycles seems quite reasonable given all the queues that need to be checked & such.
  • I have a feeling that my estimates of the asynchronous interface delays on the DRAM channels are too high, but I can’t find any good references on this topic at the moment.

Comments and corrections are always welcome.  In my career I have found that a good way to learn is to try to explain something badly and have knowledgeable people correct me!   🙂

Posted in Computer Hardware | 4 Comments »

Optimizing AMD Opteron Memory Bandwidth, Part 5: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 11th November 2010

Single Thread, Read Only Results Comparison Across Systems

In Part1, Part2, Part3, and Part4, I reviewed performance issues for a single-thread program executing a long vector sum-reduction — a single-array read-only computational kernel — on a 2-socket system with a pair of AMD Family10h Opteron Revision C2 (“Shanghai”) quad-core processors. In today’s post, I will present the results for the same set of 15 implementations run on four additional systems.

Test Systems

  1. 2-socket AMD Family10h Opteron Revision C2 (“Shanghai”), 2.9 GHz quad-core, dual-channel DDR2/800 per socket. (This is the reference system.)
  2. 2-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
  3. 4-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
  4. 4-socket AMD Family10h Opteron 6174, Revision E0 (“Magny-Cours”), 2.2 GHz twelve-core, four-channel DDR3/1333 per socket.
  5. 1-socket AMD PhenomII 555, Revision C2, 3.2 GHz dual-core, dual-channel DDR3/1333

All systems were running TACC’s customized Linux kernel, except for the PhenomII, which was running Fedora 13. The same set of binaries, generated by the Intel version 11.1 C compiler, were used in all cases.

The source code, scripts, and results are all available in a tar file: ReadOnly_2010-11-12.tar.bz2

Results

Code Version | Notes | Vector SSE | Large Page | SW Prefetch | 4 KiB pages accessed | Ref System (2p Shanghai) | 2-socket Istanbul | 4-socket Istanbul | 4-socket Magny-Cours | 1-socket PhenomII
Version001 | “-O1” |   |   |   | 1 | 3.401 GB/s | 3.167 GB/s | 4.311 GB/s | 3.734 GB/s | 4.586 GB/s
Version002 | “-O2” |   |   |   | 1 | 4.122 GB/s | 4.035 GB/s | 5.719 GB/s | 5.120 GB/s | 5.688 GB/s
Version003 | 8 partial sums |   |   |   | 1 | 4.512 GB/s | 4.373 GB/s | 5.946 GB/s | 5.476 GB/s | 6.207 GB/s
Version004 | add SW prefetch |   |   | Y | 1 | 6.083 GB/s | 5.732 GB/s | 6.489 GB/s | 6.389 GB/s | 7.571 GB/s
Version005 | add vector SSE | Y |   | Y | 1 | 6.091 GB/s | 5.765 GB/s | 6.600 GB/s | 6.398 GB/s | 7.580 GB/s
Version006 | remove prefetch | Y |   |   | 1 | 5.247 GB/s | 5.159 GB/s | 6.787 GB/s | 6.403 GB/s | 6.976 GB/s
Version007 | add large pages | Y | Y |   | 1 | 5.392 GB/s | 5.234 GB/s | 7.149 GB/s | 6.653 GB/s | 7.117 GB/s
Version008 | split into triply-nested loop | Y | Y |   | 1 | 4.918 GB/s | 4.914 GB/s | 6.661 GB/s | 6.180 GB/s | 6.616 GB/s
Version009 | add SW prefetch | Y | Y | Y | 1 | 6.173 GB/s | 5.901 GB/s | 6.646 GB/s | 6.568 GB/s | 7.736 GB/s
Version010 | multiple pages/loop | Y | Y | Y | 2 | 6.417 GB/s | 6.174 GB/s | 7.569 GB/s | 6.895 GB/s | 7.913 GB/s
Version011 | multiple pages/loop | Y | Y | Y | 4 | 7.063 GB/s | 6.804 GB/s | 8.319 GB/s | 7.245 GB/s | 8.583 GB/s
Version012 | multiple pages/loop | Y | Y | Y | 8 | 7.260 GB/s | 6.960 GB/s | 8.378 GB/s | 7.205 GB/s | 8.642 GB/s
Version013 | Version010 minus SW prefetch | Y | Y |   | 2 | 5.864 GB/s | 6.009 GB/s | 7.667 GB/s | 6.676 GB/s | 7.469 GB/s
Version014 | Version011 minus SW prefetch | Y | Y |   | 4 | 6.743 GB/s | 6.483 GB/s | 8.136 GB/s | 6.946 GB/s | 8.291 GB/s
Version015 | Version012 minus SW prefetch | Y | Y |   | 8 | 6.978 GB/s | 6.578 GB/s | 8.112 GB/s | 6.937 GB/s | 8.463 GB/s

Comments

There are lots of results in the table above, and I freely admit that I don’t understand all of the details. There are a couple of important patterns in the data that are instructive….

  • For the most part, the 2p Istanbul results are slightly slower than the 2p Shanghai results. This is exactly what is expected given the slightly better memory latency of the Shanghai system (74 ns vs 78 ns). The effective concurrency (Measured Bandwidth * Idle Latency) is almost identical across all fifteen implementations.
  • The 4-socket Istanbul system gets a large boost in performance from the activation of the “HT Assist” feature — AMD’s implementation of what are typically referred to as “probe filters”. By tracking potentially modified cache lines, this feature allows reduction in memory latency for the common case of data that is not modified in other caches. The local memory latency on the 4p Istanbul box is about 54 ns, compared to 78 ns on the 2p Istanbul box (where the “HT Assist” feature is not activated by default). The performance boost seen is not as large as the latency ratio, but the improvements are still large.
  • This is my first set of microbenchmark measurements on a “Magny-Cours” system, so there are probably some details that I need to learn about. Idle memory latency on the system is 56.4 ns — slightly higher than on the 4p Istanbul system (as is expected with the slower processor cores: 2.2 GHz vs 2.6 GHz), but the bandwidth slow-down is larger than the latency ratio alone would predict. Overall, however, the performance profile of the Magny-Cours is similar to that of the 4p Istanbul box, but with slightly lower effective concurrency in most of the code versions tested here. Note that the Magny-Cours system is configured with much faster DRAM: DDR3/1333 compared to DDR2/800. The similarity of the results strongly supports the hypothesis that sustained bandwidth is controlled by concurrency when running a single thread.
  • The best performance is provided by the cheapest box — a single-socket desktop system. This is not surprising given the low memory latency on the single socket system.

Several of the comments above refer to the “Effective Concurrency”, which I compute as the product of the measured Bandwidth and the idle memory Latency (see my earlier post for some example data). For the test cases and systems mentioned above, the effective concurrency (measured in cache lines) follows directly from the bandwidth table and the latencies quoted above.
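
As a worked example of that computation: for the reference 2p Shanghai system with its 74 ns idle latency, Version 001 at 3.401 GB/s corresponds to 3.401 GB/s * 74 ns = ~252 Bytes, or roughly 4 cache lines in flight, while Version 012 at 7.260 GB/s corresponds to ~537 Bytes, or roughly 8.4 cache lines.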

Posted in Computer Hardware | 5 Comments »

Optimizing AMD Opteron Memory Bandwidth, Part 4: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 9th November 2010

Following up on Part 1, Part 2, and Part 3, it is time to get into the ugly stuff — trying to control DRAM bank and rank access patterns and working to improve the effectiveness of the memory controller prefetcher.

Background: Banks and Ranks

The DRAM installed in the system under test consists of 2 dual-rank 2GiB DIMMs in each channel of each chip. Each “rank” is composed of 9 DRAM chips, each with 1 Gbit capacity and each driving 8 bits of the 72-bit output of the DIMM (64 bits data + 8 bits ECC). Each of these 1 Gbit DRAM chips is divided into 8 “banks” of 128 Mbits each, and each of these banks has a 1 KiB “page size”. This “DRAM page size” is unrelated to the “virtual memory page size” discussed above, but it is easy to get confused! The DRAM page size defines the amount of information transferred from the DRAM array into the “open page” buffer amps in each DRAM bank as part of the two-step (row/column) addressing used to access the DRAM memory. In the system under test, the DRAM page size is thus 8 KiB — 8 DRAM chips * 1 KiB/DRAM chip — with contiguous cache lines distributed between the two DRAM channels (using a 6-bit hash function that I won’t go into here). Each DRAM chip has 8 banks, so each rank maps 128 KiB of contiguous addresses (2 channels * 8 banks * 8 KiB/bank).

Why does this matter?

  1. Every time a reference is made to a new DRAM page, the full 8 KiB is transferred from the DRAM array to the DRAM sense amps. This uses a fair amount of power, and it makes sense to try to read the entire 8 KiB while it is in the sense amps. Although it is beyond the scope of today’s discussion, reading data from “open pages” uses only about 1/4 to 1/5 of the power required to read the same amount of data in “closed page” mode, where only one cache line is retrieved from each DRAM page.
  2. Data in the sense amps can be accessed at significantly lower latency — this is called “open page” access (or a “page hit”), and the latency is referred to as the “CAS latency”. On most systems the CAS latency is between 12.5 ns and 15 ns, independent of the frequency of the DRAM bus. If the data is not in the sense amps (a “page miss”), loading data from the array to the sense amps takes an additional latency of 12.5 to 15 ns. If the sense amps were holding other valid data, it would be necessary to write that data back to the array, taking an additional 12.5 to 15 ns. If the DRAM latency is not completely overlapped with the cache coherence latency, these increases will reduce the sustainable bandwidth according to Little’s Law: Bandwidth = Concurrency / Latency
  3. The DRAM bus is a shared bus with multiple transmitters and multiple receivers — five of each in the system under test: the memory controller and four DRAM ranks. Every time the device driving the bus needs to be switched, the bus must be left idle for a short period of time to ensure that the receivers can synchronize with the next driver. When the memory controller switches from reading to writing this is called a “read/write turnaround”. When the memory controller switches from writing to reading this is called a “write/read turnaround”. When the memory controller switches from reading from one rank to reading from a different rank this is called a “chip select turnaround” or a “chip select stall”, or sometimes a “read-to-read” stall or delay or turnaround. These idle periods depend on the electrical properties of the bus, including the length of the traces on the motherboard, the number of DIMM sockets, and the number and type of DIMMs installed. The idle periods are only very weakly dependent on the bus frequency, so as the bus gets faster and the transfer time of a cache line gets faster, these delays become proportionately more expensive. It is common for the “chip select stall” period to be as large as the cache line transfer time, meaning that a code that performs consecutive reads from different banks will only be able to use about 50% of the DRAM bandwidth.
  4. Although it could have been covered in the previous post on prefetching, the memory controller prefetcher on AMD Family10h Opteron systems appears to only look for one address stream in each 4 KiB region. This suggests that interleaving fetches from different 4 KiB pages might allow the memory controller prefetcher to produce more outstanding prefetches. The extra loops that I introduce to allow control of DRAM rank access are also convenient for allowing interleaving of fetches from different 4 KiB pages.

Experiments

Starting with Version 007 (packed SSE arithmetic, large pages, no software prefetching) at 5.392 GB/s, I split the single loop over the array into a set of three loops — the innermost loop over the 64 cache lines in a 4 KiB page, a middle loop for the 32 4 KiB pages in a 128 KiB DRAM rank, and an outer loop for the (many) 128 KiB DRAM rank ranges in the full array. The resulting Version 008 loop structure looks like:

            for (k=0; k<N; k+=RANKSIZE) {
                for (j=k; j<k+RANKSIZE; j+=PAGESIZE) {
                    for (i=j; i<j+PAGESIZE; i+=8) {
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                    }
                }
            }
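
The PAGESIZE and RANKSIZE constants are not shown in the listing. Since the loop indices count 8-Byte array elements, they are presumably defined along these lines (my assumption, chosen to match the iteration counts quoted below):

            #define PAGESIZE (4096/8)          /* 512 elements per 4 KiB page */
            #define RANKSIZE (32*PAGESIZE)     /* 16384 elements per 128 KiB rank range */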

The resulting inner loop looks the same as before — but with the much shorter iteration count of 64 instead of 512,000:

..B1.30:
        addpd     (%r14,%rsi,8), %xmm3 
        addpd     16(%r14,%rsi,8), %xmm2 
        addpd     32(%r14,%rsi,8), %xmm1
        addpd     48(%r14,%rsi,8), %xmm0 
        addq      $8, %rsi 
        cmpq      %rcx, %rsi  
        jl        ..B1.30

The overall performance of Version 008 drops almost 9% compared to Version 007 — 4.918 GB/s vs 5.392 GB/s — for reasons that are unclear to me.

Interleaving Fetches Across Multiple 4 KiB Pages

The first set of optimizations based on Version 008 will be to interleave the accesses so that multiple 4 KiB pages are accessed concurrently. This is implemented by a technique that compiler writers call “unroll and jam”. I unroll the middle loop (the one that covers the 32 4 KiB pages in a rank) and interleave (“jam”) the iterations. This is done once for Version 013 (i.e., concurrently accessing two 4 KiB pages), three times for Version 014 (four 4 KiB pages), and seven times for Version 015 (eight 4 KiB pages).
To keep the listing short, I will just show the inner loop structure of the first of these — Version 013:

            for (k=0; k<N; k+=RANKSIZE) {
                for (j=k; j<k+RANKSIZE; j+=2*PAGESIZE) {
                    for (i=j; i<j+PAGESIZE; i+=8) {
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                        x0 = _mm_load_pd(&a[i+PAGESIZE+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+PAGESIZE+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+PAGESIZE+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+PAGESIZE+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                    }
                }
            }

The assembly code for the inner loop looks like what one would expect:

..B1.30:
        addpd     (%r15,%r11,8), %xmm3  
        addpd     16(%r15,%r11,8), %xmm2 
        addpd     32(%r15,%r11,8), %xmm1
        addpd     48(%r15,%r11,8), %xmm0 
        addpd     4096(%r15,%r11,8), %xmm3 
        addpd     4112(%r15,%r11,8), %xmm2
        addpd     4128(%r15,%r11,8), %xmm1 
        addpd     4144(%r15,%r11,8), %xmm0  
        addq      $8, %r11
        cmpq      %r10, %r11 
        jl        ..B1.30

Performance for these three versions improves dramatically as the number of 4 KiB pages accessed increases:

  • Version 008: one 4 KiB page accessed: 4.918 GB/s
  • Version 013: two 4 KiB pages accessed: 5.864 GB/s
  • Version 014: four 4 KiB pages accessed: 6.743 GB/s
  • Version 015: eight 4 KiB pages accessed: 6.978 GB/s

Going one step further, to 16 pages accessed in the inner loop, causes a sharp drop in performance — down to 5.314 GB/s.

So this idea of accessing multiple 4 KiB pages seems to have significant merit for improving single-thread read bandwidth. Next we see if explicit prefetching can push performance even higher.

Multiple 4 KiB Pages Plus Explicit Software Prefetching

A number of new versions were produced:

  • Version 008 (single page accessed with triple loops) + SW prefetch -> Version 009
  • Version 013 (two pages accessed) + SW prefetch -> Version 010
  • Version 014 (four pages accessed) + SW prefetch -> Version 011
  • Version 015 (eight pages accessed) + SW prefetch -> Version 012

In each case the prefetch was set AHEAD of the current pointer by a distance of 0 to 1024 8-Byte elements and all results were tabulated. Unlike the initial tests with SW prefetching, the results here are more variable and less easy to understand.
Starting with Version 009, the addition of SW prefetching restores the performance loss introduced by the triple-loop structure and provides a strong additional boost.

loop structure | no SW prefetch           | with SW prefetch
single loop    | Version 007: 5.392 GB/s  | Version 005: 6.091 GB/s
triple loop    | Version 008: 4.917 GB/s  | Version 009: 6.173 GB/s

All of these are large page results except Version 005. The slight increase from Version 005 to Version 009 is consistent with the improvements seen in adding large pages from Version 006 to Version 007, so it looks like adding the SW prefetch negates the performance loss introduced by the triple-loop structure.

Combining SW prefetching with interleaved 4 KiB page access produces some intriguing patterns.
Version 010 — fetching two 4 KiB pages concurrently: [Version 010 bandwidth vs. prefetch AHEAD distance figure]

Version 011 — fetching four 4 KiB pages concurrently: [Version 011 bandwidth vs. prefetch AHEAD distance figure]

Version 012 — fetching eight 4 KiB pages concurrently: [Version 012 bandwidth vs. prefetch AHEAD distance figure]

When reviewing these figures, keep in mind that the baseline performance number is the 6.173 GB/s from Version 009, so all of the results are good. What is most unusual about these results (speaking from ~25 years of experience in performance analysis) is that the last two figures show some fairly strong upward spikes in performance. It is entirely commonplace to see decreases in performance due to alignment issues, but it is quite rare to see increases in performance that depend sensitively on alignment yet remain repeatable and usable.

Choosing a range of optimum AHEAD distances for each case allows me to summarize:

Pages Accessed in Inner Loop | no SW prefetch           | with SW prefetch
1                            | Version 007: 5.392 GB/s  | Version 009: 6.173 GB/s
2                            | Version 013: 5.864 GB/s  | Version 010: 6.417 GB/s
4                            | Version 014: 6.743 GB/s  | Version 011: 7.063 GB/s
8                            | Version 015: 6.978 GB/s  | Version 012: 7.260 GB/s

It looks like this is about as far as I can go with a single thread. Performance started at 3.401 GB/s and gradually increased to 7.260 GB/s — an increase of 113%. The resulting code is not terribly long, but it is important to remember that I only implemented the case that starts on a 128 KiB boundary (which is necessary to ensure that the accesses to multiple 4 KiB pages are all in the same rank). Extra prolog and epilog code will be required for a general-purpose sum reduction routine.

So why did I spend all of this time? Why not just start by running multiple threads to increase performance?

First, the original code was slow enough that even perfect scaling across four threads would barely provide enough concurrency to reach the 12.8 GB/s read bandwidth of the chip.
Second, the original code is not set up to allow control over which pages are being accessed. Using multiple threads that are accessing different ranks will significantly increase the number of chip-select stalls, and may not produce the desired level of performance. (More on this soon.)
Third, the system under test has only two DDR2/800 channels per chip — not exactly state of the art. Newer systems have two or four channels of DDR3 at 1333 or even 1600 MHz. Given similar memory latencies, Little’s Law tells us that these systems will only deliver more sustained bandwidth if they are able to maintain more outstanding cache misses. The experiments presented here should provide some insights into how to structure code to increase the memory concurrency.

But that does not mean that I won’t look at the performance of multi-threaded implementations — that is coming up anon…

Posted in Computer Hardware | Comments Off on Optimizing AMD Opteron Memory Bandwidth, Part 4: single-thread, read-only

Optimizing AMD Opteron Memory Bandwidth, Part 3: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 9th November 2010

Following up on Part 1 and Part 2, it is time to look at adding explicit prefetching to try to increase read bandwidth.

About Prefetching

The AMD Opteron Family10h processors have two different “hardware” prefetch mechanisms, and also allow “software” prefetch instructions. The “core prefetcher” is (as the name implies) located in the processor core, and monitors L1 Data Cache misses. (There is a similar prefetch engine for Instruction Cache misses, but that is not today’s topic.) This “core prefetcher” monitors a large number of data access streams and prefetches along detected address streams consisting of contiguous ascending or descending cache line addresses. I believe that the core prefetcher only fetches one cache line ahead of the most recent load to each stream, waiting until a cache line is returned from memory before sending out the next request, so it will not “automagically” create enough prefetches to fill the memory pipeline.
(Note that these sorts of details are hard to tease out of vendor documentation, and are subject to frequent change.) The core prefetcher is operational in all of the results presented here, with many of the code modifications intended to make it operate more effectively.
The Opteron Family10h processors also have a “Memory Controller Prefetcher” that prefetches from DRAM to special buffers in the memory controller. This can reduce the latency for subsequent memory accesses, and therefore increase the bandwidth available with a fixed level of concurrency. This part is really important, so I will repeat it:

Bandwidth = Concurrency / Latency

The concurrency is limited by the number of buffers available for outstanding cache misses (8 per core, in this case), while the latency is determined by where the data ends up getting found. In the system under test the latency is actually bounded below by the time required to probe the other caches in the system, not by the time required to obtain the data from the DRAMs. In the first generation of AMD Opteron Family10h processors (Revisions B0 & B3), the memory controller prefetcher simply read the data from the DRAMs to a set of buffers in the memory controller. In these systems the cache coherency check was not initiated until the processor actually sent a load for a particular cache line to the memory controller. Unfortunately in this case the time required to check the other caches to see if they had a copy of the line was considerably longer than the time required to get the data from the DRAM, so getting the data from the DRAM earlier did not help at all — it just meant that the processor had to wait longer after receiving the data before it could use it. In Revision C of the AMD Opteron Family10h processors, the memory controller prefetcher was enhanced to make “coherent” prefetches — the memory controller began the coherence transaction when it performed the prefetch. In the best case the coherence transaction will be complete by the time the load request from the processor arrives, which significantly reduces the latency observed by the processor. From the formula above, it should be clear that for a fixed level of concurrency, the only way to increase sustained bandwidth is to reduce the effective latency. In Revisions D and E of the AMD Opteron Family10h processors, this “coherent prefetcher” is retained with some additional improvements.
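
Plugging in the numbers for the system under test makes the point concrete: with 8 outstanding cache misses of 64 Bytes each, a single core has at most 512 Bytes in flight, and at the ~74 ns nominal memory latency Little’s Law caps the sustained single-thread bandwidth near 512 Bytes / 74 ns = ~6.9 GB/s. Any result much above that level requires the memory controller prefetcher to reduce the effective latency seen by the core’s demand loads.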

Finally we get to “Software Prefetch”. The AMD64/Intel64 instruction set includes a set of explicit prefetch instructions. There are two versions that matter, the “PREFETCH T0” instruction and the “PREFETCH NTA” instruction. The former fetches data from memory much like a load, while the latter fetches data from memory and marks it as “non-temporal” — meaning that it is unlikely to be reused. For Family10h Opterons “non-temporal” fetches go into the L1 cache but when they are chosen to be replaced in the L1 Cache, they are simply dropped (if clean) rather than being sent to the L2 Cache as “victims”. This allows the L2 Cache to be used more effectively for data that is likely to be reused. Since the array used in this benchmark is much larger than the cache, I could use the PREFETCH_NTA instruction when explicitly prefetching data. The choice of prefetch instruction does not make much difference in performance here, though it can sharply reduce performance if the code does end up reusing the data, so in these examples I use PREFETCH_T0 just as a habit.

Different revisions of hardware treat software prefetches differently — again it is hard to obtain this information from vendor documentation. There are several issues relating to software prefetch behavior that can influence how you want to use them:

  1. Do software prefetches cause access violations if the address is out-of-bounds?
    No they do not. This makes them easier to use since you can prefetch beyond the end of an array without worrying about access violations.
  2. Do software prefetches trigger the hardware page table walker in the event of a TLB miss?
    Yes, on Family10h Opterons. This is usually a good thing, though there is only one hardware page table walker per core. The SW prefetch will only cause a hardware table walk; if the page table entry is not found by the hardware table walker, the request will be silently dropped. This is different than a load, which will cause a trap to O/S software if the page table entry is not found by the hardware table walker (for example if the page has been swapped to disk).
  3. Do the addresses used in software prefetches trigger the hardware prefetchers like loads do?
    In earlier Opterons I think that the answer was “no”, and in Family10h Opterons I think that the answer is “yes”, but it does not matter in this particular test case.
  4. Do software prefetches combine in the cache miss buffers with load misses, or do they allocate separate buffers?
    For AMD Family10h processors the Miss Address Buffers combine all accesses to the same cache line, whether due to hardware prefetches, load misses, or software prefetches. This makes it less critical that the code avoid issuing both load misses and SW prefetches to the same cache line.

Prefetching Experiments

I repeated most of the previous experiments with explicit software prefetch instructions added. In each case I varied the distance between the current loop pointer and the target of the prefetch from 0 to 1024 8-Byte elements. The prefetch instructions were added using yet another compiler intrinsic function executed once per loop iteration (which explains why I prefer to unroll the inner loop to handle a cache line at a time). The inner loop of Version 003 (with 4 scalar variables as partial sums) then becomes Version 004:

            for (i=0; i<N; i+=8) {
                _mm_prefetch((char *)&a[i+AHEAD],_MM_HINT_T0);
                sum0 += a[i+0];
                sum1 += a[i+1];
                sum2 += a[i+2];
                sum3 += a[i+3];
                sum0 += a[i+4];
                sum1 += a[i+5];
                sum2 += a[i+6];
                sum3 += a[i+7];
            }

and the resulting assembly code for the inner loop is as expected:

..B1.10: 
        addsd     a(%rdx), %xmm3 
        addsd     8+a(%rdx), %xmm2
        addsd     16+a(%rdx), %xmm1
        addsd     24+a(%rdx), %xmm0
        addsd     32+a(%rdx), %xmm3
        addsd     40+a(%rdx), %xmm2 
        addsd     48+a(%rdx), %xmm1
        addsd     56+a(%rdx), %xmm0  
        prefetcht0 (%rax)
        addq      $64, %rax 
        addq      $64, %rdx 
        addq      $8, %rcx  
        cmpq      $32768000, %rcx 
        jl        ..B1.10

The performance of Version 004 is now dependent on the AHEAD distance, as shown in Figure 1.

Comparing to Version 003, the explicit software prefetching increases the bandwidth dramatically, from ~4.5 GB/s to over 6.0 GB/s. Without trying to understand the details, it is clear that it helps a great deal to prefetch at least 96 elements ahead, with 96 elements = 768 Bytes = 12 cache lines. There are nice wide ranges that show steady levels of performance, with (for example) the average of AHEAD=416 to AHEAD=448 coming to 6.083 GB/s. Given the 74 ns nominal memory latency, this corresponds to an effective concurrency of 450 Bytes, or slightly over 7.0 cache lines. Note that this is approaching the maximum value that a single thread should be able to attain unless the memory controller prefetcher is able to significantly reduce the effective memory latency.

Combining the explicit software prefetching of Version 004 with the packed double SSE of Version 006 produces Version 005. Unfortunately the performance of Version 004 and Version 005 is essentially identical — the reduction in floating-point pipeline stalls provided by the packed double code is hidden behind the latency improvement due to software prefetching, so there is no additional gain.
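
The Version 005 source is not listed here, but it is presumably just the packed double loop of Version 006 with the same software prefetch added; a sketch along those lines:

            for (i=0; i<N; i+=8) {
                _mm_prefetch((char *)&a[i+AHEAD],_MM_HINT_T0);   // software prefetch, as in Version 004
                x0 = _mm_load_pd(&a[i+0]);                       // packed double loads/adds, as in Version 006
                sum0 = _mm_add_pd(sum0,x0);                      // (reconstruction -- not the exact Version 005 source)
                x1 = _mm_load_pd(&a[i+2]);
                sum1 = _mm_add_pd(sum1,x1);
                x2 = _mm_load_pd(&a[i+4]);
                sum2 = _mm_add_pd(sum2,x2);
                x3 = _mm_load_pd(&a[i+6]);
                sum3 = _mm_add_pd(sum3,x3);
            }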

Putting Data on Large Pages

The AMD Opteron Family10h processors support a standard virtual memory page size of 4 KiB with large page sizes of 2 MiB and 1 GiB. Most versions of Linux support only one option for large page sizes, typically the 2 MiB version. I configured the system under test to reserve 512 large pages behind each chip and modified the benchmark to use these large pages. Some compilers (notably the Open64 compilers) have compile/link options to put the data on large pages, but I prefer doing it a bit more manually using shared memory segments. This has the advantage of portability across all compilers and can be switched back to the default page size by a simple change to the parameters to the shmget() call. One slightly tricky issue is that when using large pages the size requested in the shmget call needs to be rounded up to the nearest multiple of the page size. The code to allocate the array on large pages looks like:

    i = total/(2*1024*1024);                        // how many full 2MiB pages in "total" bytes?
    sum = ceil((double)total/(2.*1024.*1024.));     // round up to next integer if needed
    j = (int) sum;                                  // ceil() returns a double -- convert to integer
    SEGSIZE = j*(2*1024*1024);                      // now SEGSIZE is the smallest multiple of 2MiB needed to hold "total" bytes

    shmida = shmget(IPC_PRIVATE,SEGSIZE,IPC_CREAT|SHM_HUGETLB|0666);   // simply eliminate SHM_HUGETLB to get the default page size
    a = (double *) shmat (shmida, NULL, 0);                            // *real* code should check for error returns on both the shmget and shmat calls!
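
As the comment above notes, real code should check both calls; a minimal sketch of that checking (the error-handling policy here is illustrative, not part of the original benchmark):

    /* requires <sys/ipc.h>, <sys/shm.h>, <stdio.h>, and <stdlib.h> */
    shmida = shmget(IPC_PRIVATE, SEGSIZE, IPC_CREAT|SHM_HUGETLB|0666);
    if (shmida == -1) {                              // fails if, e.g., not enough huge pages are reserved
        perror("shmget(SHM_HUGETLB)");
        exit(1);
    }
    a = (double *) shmat(shmida, NULL, 0);
    if (a == (double *) -1) {                        // shmat() returns (void *) -1 on failure
        perror("shmat");
        exit(1);
    }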

Taking the packed double SSE Version 006 and putting the data on large pages gives us Version 007. This version does not include software prefetching. Performance is improved slightly by the use of large pages, from 5.247 GB/s (Version 006) to 5.392 GB/s (Version 007) — a bit under 3% improvement. Not to worry — the main goal of using large pages is not to directly improve performance, but to allow control over which DRAM banks and ranks are being accessed.

Posted in Computer Hardware | Comments Off on Optimizing AMD Opteron Memory Bandwidth, Part 3: single-thread, read-only

Optimizing AMD Opteron Memory Bandwidth, Part 2: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 8th November 2010

In a previous entry, I started discussing the issues related to memory bandwidth for a read-only kernel on a sample AMD Opteron system. The naive implementation gave a performance of 3.393 GB/s when compiled at “-O1” (hereafter “Version 001”) and 4.145 GB/s when compiled at “-O2” (hereafter “Version 002”). Today I will see how far this single-thread performance can be enhanced.

The surprising result from the previous experiments was that the floating-point pipeline latency made visible by the dependent floating-point add operations was quite important in limiting the number of outstanding cache line fetches, and therefore constituted an important limiter in overall performance. The dependent operation latency of the floating-point pipeline in the AMD Opteron Family10h processor is 4 cycles, so four add operations must be operating concurrently to fill the pipeline.

Filling the Floating-Point Pipeline

Scalar SSE

The code was modified to produce “Version 003”, which declares four separate summation variables (sum0, sum1, sum2, sum3), and unrolls the inner loop to handle a cache line at a time:

        for (i=0; i<N; i+=8) {
            sum0 += a[i+0];
            sum1 += a[i+1];
            sum2 += a[i+2];
            sum3 += a[i+3];
            sum0 += a[i+4];
            sum1 += a[i+5];
            sum2 += a[i+6];
            sum3 += a[i+7];
        }
        sum = sum0 + sum1 + sum2 + sum3;

The modified code was compiled (again with the Intel version 11.1 compiler) at “-O2”. The assembly code for the inner loop was:

..B1.7:
        addsd     a(,%rax,8), %xmm3
        addsd     8+a(,%rax,8), %xmm2
        addsd     16+a(,%rax,8), %xmm1
        addsd     24+a(,%rax,8), %xmm0 
        addsd     32+a(,%rax,8), %xmm3  
        addsd     40+a(,%rax,8), %xmm2
        addsd     48+a(,%rax,8), %xmm1
        addsd     56+a(,%rax,8), %xmm0
        addq      $8, %rax
        cmpq      $32768000, %rax 
        jl        ..B1.7  

The assembly code follows the C source code exactly. I was a little surprised that the compiler did not combine these 8 scalar operations into 4 packed operations, but it is good to remember that compilers are unpredictable beasts, and need to be monitored closely. Performance for Version 003 was 4.511 GB/s — about 9% faster than Version 002. In terms of execution time per floating-point addition operation, this optimization saved about 0.5 cycles per element.

Vector SSE

Continuing further in this direction, it is time to force the generation of packed double SSE arithmetic operations. The floating-point add unit is two 64-bit elements wide, so to fill the pipeline the four add operations really need to be ADDPD — packed double adds. While it may be possible to coax the desired instructions out of portable source code, I decided to bite the bullet here and use some compiler extensions to get what I wanted. Version 006 (don’t worry — I have not forgotten 004 & 005) includes these declarations that the compiler interprets as packed double floating-point variables:

    __m128d sum0,sum1,sum2,sum3;
    __m128d x0,x1,x2,x3;

Note that these variables can only be used in limited ways — primarily as sources or destinations of assignment functions or SSE intrinsic functions. For example, to set the initial values I use a compiler intrinsic:

        x0 = _mm_set_pd(0.0,0.0);
        x1 = _mm_set_pd(0.0,0.0);
        x2 = _mm_set_pd(0.0,0.0);
        x3 = _mm_set_pd(0.0,0.0);
        sum0 = _mm_set_pd(0.0,0.0);
        sum1 = _mm_set_pd(0.0,0.0);
        sum2 = _mm_set_pd(0.0,0.0);
        sum3 = _mm_set_pd(0.0,0.0);
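
As an aside, the SSE2 intrinsic _mm_setzero_pd() produces the same all-zero value and is arguably the more idiomatic way to write these initializations:

        sum0 = _mm_setzero_pd();    // equivalent to _mm_set_pd(0.0,0.0)
        sum1 = _mm_setzero_pd();
        sum2 = _mm_setzero_pd();
        sum3 = _mm_setzero_pd();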

The inner loop of the summation is also coded with special intrinsic functions (defined in emmintrin.h and similarly-named header files):

            for (i=0; i<N; i+=8) {
                x0 = _mm_load_pd(&a[i+0]);
                sum0 = _mm_add_pd(sum0,x0);
                x1 = _mm_load_pd(&a[i+2]);
                sum1 = _mm_add_pd(sum1,x1);
                x2 = _mm_load_pd(&a[i+4]);
                sum2 = _mm_add_pd(sum2,x2);
                x3 = _mm_load_pd(&a[i+6]);
                sum3 = _mm_add_pd(sum3,x3);
            }

The _mm_load_pd() intrinsic is read as “multi-media load packed double”. It expects a 16-byte-aligned pointer as its argument and returns a value of type __m128d. The _mm_add_pd() intrinsic is the “multi-media add packed double” operation. It takes two arguments of type __m128d and returns their element-wise sum; the compiler maps it onto the x86 ADDPD instruction, which overwrites its first (destination) operand. Because the intrinsic returns its result as a value, the variable on the left-hand side of the assignment does not have to be the same as the first argument; the compiler inserts register moves as needed.
The assembly code generated for this loop is exactly what I wanted:

..B1.10: 
        addpd     a(,%rax,8), %xmm3 
        addpd     16+a(,%rax,8), %xmm2 
        addpd     32+a(,%rax,8), %xmm1
        addpd     48+a(,%rax,8), %xmm0
        addq      $8, %rax 
        cmpq      $32768000, %rax  
        jl        ..B1.10

There are a couple of ways to “unpack” packed double variables in order to perform the final summation across the partial sums. In this case the vector is very long, so the time required to perform the last couple of summations is tiny and the code does not need to be efficient. I picked the first approach that I could figure out how to code:

            x0 = _mm_set_pd(0.0,0.0);
            x0 = _mm_add_pd(x0,sum0);
            x0 = _mm_add_pd(x0,sum1);
            x0 = _mm_add_pd(x0,sum2);
            x0 = _mm_add_pd(x0,sum3);
            _mm_storel_pd(&temp1, x0);
            _mm_storeh_pd(&temp2, x0);
            sum = temp1 + temp2;

This code clears a packed double variable, then adds the four (packed double) partial sums to generate a final pair of partial sums in the upper and lower halves of x0. The _mm_storel_pd() intrinsic stores the 64-bit double in the “low” half of x0 into the memory location pointed to by the first argument, while _mm_storeh_pd() stores the 64-bit double in the “high” half of x0 into the memory location pointed to by its first argument. These two doubles are then added together to build the final sum value.
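
If SSE3 is available (Family10h supports it), the horizontal-add intrinsic _mm_hadd_pd() from pmmintrin.h offers an alternative to the storel/storeh pair; a sketch, with sum being the ordinary double used elsewhere in the code:

            x0 = _mm_add_pd(_mm_add_pd(sum0,sum1), _mm_add_pd(sum2,sum3));   // reduce to one packed pair
            x0 = _mm_hadd_pd(x0,x0);       // add the high and low halves; both lanes now hold the total
            _mm_storel_pd(&sum, x0);       // store the final scalar result
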
The performance improvement provided by optimizing the inner loop was bigger than I expected — Version 006 delivered 5.246 GB/s — a full 27% faster than Version 002 (naive code compiled at “-O2”) and 16% faster than Version 003 (4 scalar partial sums). This optimization saved an additional 0.72 cycles per element relative to the scalar SSE Version 003. On the down side, this is still only about 41% of the peak memory bandwidth available to each processor chip, so there is a long way to go.

Next time — all about prefetching….

Posted in Computer Hardware | 1 Comment »

Optimizing AMD Opteron Memory Bandwidth, Part 1: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 3rd November 2010

Background

The memory hierarchy on modern computers is complex almost beyond belief.  For well over 20 years, I have been working on the subject of memory bandwidth in computer systems and despite this experience (including 12 years on the microprocessor & system design teams at SGI, IBM, and AMD) the complexity might be beyond what my brain can handle.   But since I enjoy a challenge, I am embarking on a set of posts to see if I can explain the memory bandwidth of a modern system and use my understanding of the details to create an implementation with superior performance.

System Under Test

For this set of posts the primary system under test is a Dell PowerEdge M605 blade with two quad-core AMD Opteron model 2389 processors.  Each processor chip has two channels of DDR2/800 DRAM, with two 2GiB dual-rank DIMMs installed on each channel.  2 sockets * 2 channels/socket * 2 DIMMs/channel * 2 GiB/DIMM = 16 GiB total memory installed on the blade.
The peak memory bandwidth of each channel is 8 Bytes * 800 MHz = 6.4 GB/s, giving a peak memory bandwidth of 12.8 GB/s per socket and 25.6 GB/s for the blade.

Choice of Test Kernel

I will begin with what appears to be a ridiculously simple example — the summation of all the elements of a single contiguous array of 64-bit numbers stored in local memory using a single thread of execution.  By the time I am finished, I suspect you will agree that this simple start was a good choice….

In pseudo-code, one might write the basic kernel as:

sum = 0.0;
for (i=0; i<N; i++) sum += array[i];

The execution time of the computation is measured, and the data transfer rate is computed in MB/s.  Note that MB/s is 10^6 Bytes per second, which gives numerical values almost 5% higher than would be obtained if I were computing transfer rate in MiB/s (=2^20 Bytes per second).

The system under test makes use of a 64KiB Data Cache, a 512 KiB unified Level 2 cache, and a 6144 KiB shared Level 3 cache, for a total of 6720 KiB of cache.   Since I am interested in bandwidth from DRAM, the test uses N=32,768,000, which corresponds to an array size of 256,000 KiB — slightly over 38 times the total cache size.   The kernel is repeated 100 times to “flush” the cache, and the average bandwidth is computed and presented.

A Sequence of Implementations

Implementation 1: Simple with Serial Compilation

The following source code was compiled with the Intel version 11.1 C compiler, using the commands:

icc -O2 ReadOnly.c -S
as ReadOnly.s -o ReadOnly.o
icc -O2 ReadOnly.o -o ReadOnly.icc.serial

Splitting the compilation up like this allows me to see the assembly code every time I compile, so I can monitor what the compiler is doing.

— ReadOnly.c —

#include <stdio.h>             // needed for the printf() calls below

#define N 32768000
#define NTIMES 100

extern double mysecond();      // a simple wall-clock timer -- appended
double a[N];                   // the data array

int main()
{
    int i,j;
    double sum;
    double t0, bw, times[NTIMES];

    for (i=0; i<NTIMES; i++) {
        times[i] = 0.0;
    }
    for (i=0; i<N; i++) {
        a[i] = 1.0;
    }

    sum = 0.0;
    for (j=0; j<NTIMES; j++) {
        t0 = mysecond();
        for (i=0; i<N; i++) {
           sum += a[i];
        }
        times[j] = mysecond()-t0;
    }
    printf("sum = %f\n",sum);
    for (i=0; i<NTIMES; i++) {
        bw = sizeof(double)*(double) N / times[i]/1e6;
        printf("iter, time, bw (MB/s) %d, %f, %f\n",i,times[i],bw);
    }
}

/* A gettimeofday routine to give access to the wall
 clock timer on most UNIX-like systems.  */

#include <sys/time.h>
double mysecond()
{
    struct timeval tp;
    struct timezone tzp;
    int i;

    i = gettimeofday(&tp,&tzp);
    return ( (double) tp.tv_sec + (double) tp.tv_usec * 1.e-6 );
}

— End of ReadOnly.c —

Running the code under the “time” command gives output like:

sum = 3276800000.000000
iter, time, bw (MB/s) 0, 0.063371, 4136.659284
iter, time, bw (MB/s) 1, 0.063181, 4149.100482
iter, time, bw (MB/s) 2, 0.063225, 4146.205961
iter, time, bw (MB/s) 3, 0.063187, 4148.693441
iter, time, bw (MB/s) 4, 0.063210, 4147.191209
iter, time, bw (MB/s) 5, 0.063176, 4149.429305
iter, time, bw (MB/s) 6, 0.063195, 4148.176926
iter, time, bw (MB/s) 7, 0.063240, 4145.221181
iter, time, bw (MB/s) 8, 0.063204, 4147.582311
           [...]
iter, time, bw (MB/s) 94, 0.063249, 4144.643036
iter, time, bw (MB/s) 95, 0.063239, 4145.283693
iter, time, bw (MB/s) 96, 0.063278, 4142.737862
iter, time, bw (MB/s) 97, 0.063261, 4143.846398
iter, time, bw (MB/s) 98, 0.063239, 4145.283693
iter, time, bw (MB/s) 99, 0.063240, 4145.236809
real    0m6.519s
user    0m6.412s
sys    0m0.105s

It is important to save and use the times for each iteration so that the compiler will actually execute them. It is also helpful to have a quick visual feedback on the iteration-to-iteration variability of the memory bandwidth, which is clearly small here.

So the system under test delivers a very steady 4.145 GB/s using this version of the code. This is only 32% of the peak memory bandwidth of 12.8 GB/s for the socket, which is an uninspiring result. Don’t worry — it will get a lot better before I am through!

Analysis of Implementation 1

So why does this sample program deliver such a small fraction of the peak memory bandwidth of the node?
Instead of looking at all the possible performance limiters (most of which we will get to in due time), I will cut to the chase and give you the answer:
The performance limit here is directly due to the limited number of outstanding cache misses available to a single core.
The relevant formula is often referred to as “Little’s Law”, which in this case reduces to the simple statement that

 Latency * Bandwidth = Concurrency

where Latency is the time required to load a cache line from memory (about 74 ns on the system under test, as I reported earlier), Bandwidth is the 12.8 GB/s peak transfer rate of the DRAM on one processor chip, and Concurrency is the quantity of data that must be “in flight” in order to “fill the pipeline” or “tolerate the latency”. For the system under test, the required concurrency is 74 ns * 12.8 GB/s = 947 bytes = 14.8 cache lines. Unfortunately, each core in the Opteron Family10h processor only supports a maximum of 8 cache read misses.

Rearranging the formula to Bandwidth = Concurrency/Latency allows us to estimate how much bandwidth we think a processor should be able to get for a given Latency and a given Concurrency. Using 8 cache lines (512 Bytes) and 74 ns suggests that the maximum sustainable bandwidth will be about 6.92 GB/s. Our observed result of 4.145 GB/s is well below this value.  Substituting the observed bandwidth allows us to compute the effective concurrency, which is 4.145 GB/s * 74 ns = 306 Bytes = 4.8 Cache Lines.

Some insight into the limited concurrency is available by re-compiling the code at optimization level “-O1”, which reduces the performance to 3.393 GB/s, corresponding to an effective concurrency of 251 Bytes or 3.9 cache lines.
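
The arithmetic in the last few paragraphs (and the cycles-per-element numbers quoted below) can be packaged into a few lines of C; the constants are simply the values given in the text, with 2.9 GHz as the core frequency:

#include <stdio.h>

int main(void)
{
    double latency = 74e-9;                  // measured local memory latency (seconds)
    double peak_bw = 12.8e9;                 // peak DRAM bandwidth per socket (Bytes/s)
    double freq    = 2.9e9;                  // core frequency (Hz)
    double bw[]    = { 3.393e9, 4.145e9 };   // measured bandwidth at -O1 and -O2 (Bytes/s)

    printf("required concurrency at peak: %.0f Bytes = %.1f cache lines\n",
           peak_bw * latency, peak_bw * latency / 64.0);
    for (int i = 0; i < 2; i++) {
        double conc = bw[i] * latency;       // Little's Law: effective Bytes "in flight"
        printf("BW %5.3f GB/s: concurrency %3.0f Bytes = %.1f lines, %.2f cycles/element\n",
               bw[i] / 1e9, conc, conc / 64.0, 8.0 / bw[i] * freq);
    }
    return 0;
}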

The assembly code for the innermost loop at “-O1” is:

..B1.8:   
        addsd     a(,%rax,8), %xmm2  
        incq      %rax       
        cmpq      $32768000, %rax  
        jl        ..B1.8

while the assembly code for the same loop at “-O2” is:

..B1.7: 
        addpd     a(,%rax,8), %xmm0     
        addq      $2, %rax 
        cmpq      $32768000, %rax 
        jl        ..B1.7

In the first case the use of the “addsd” (Add Scalar, Double Precision) instruction shows that the compiler is using a single summation variable, while in the second case, the “addpd” (Add Packed, Double Precision) shows that the compiler is using two summation variables — the upper and lower halves of a “packed double” SSE register. Because of the data dependence on the sequence of summations, the code at “-O1” experiences 32,768,000 pipeline stalls (one per addition), while the code at “-O2” experiences 16,384,001 pipeline stalls — half as many (plus one at the end to add the two partial sums together). The floating-point add instructions used here have a dependent operation latency of four cycles. Some of this is overlapped with the pointer update, compare, and branch, but not all of it. The results at “-O1” correspond to about 6.84 CPU cycles per element, while the results at “-O2” correspond to about 5.60 CPU cycles per element, a difference of 1.24 cycles per element.
The surprising (and important) result here is that these extra floating point pipeline latencies are not overlapped with the memory latencies — after all a few extra stall cycles in the core should be negligible compared to the ~215 cycles of memory latency (74 ns * 2.9 GHz). The problem is that these floating-point pipeline stalls are delaying the execution of the subsequent memory load references that are necessary to allow additional hardware prefetches to be issued from the core to the memory system.

In my next entry, I will show how software prefetch instructions can directly increase the number of outstanding cache misses and how explicitly coding for more partial sum variables can indirectly allow more outstanding prefetches by eliminating the floating-point pipeline stalls and allowing subsequent memory references to be issued more quickly….

Posted in Computer Hardware | 6 Comments »