John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Computer Hardware' Category

Notes on Cached Access to Memory-Mapped IO Regions

Posted by John D. McCalpin, Ph.D. on 29th May 2013

When attempting to build heterogeneous computers with “accelerators” or “coprocessors” on PCIe interfaces, one quickly runs into asymmetries between the data transfer capabilities of processors and IO devices.  These asymmetries are often surprising — the tremendously complex processor is actually less capable of generating precisely controlled high-performance IO transactions than the simpler IO device.   This leads to ugly, high-latency implementations in which the processor has to program the IO unit to perform the required DMA transfers and then interrupt the processor when the transfers are complete.

For tightly-coupled acceleration, it would be nice to have the option of having the processor directly read and write to memory locations on the IO device.  The fundamental capability exists in all modern processors through the feature called “Memory-Mapped IO” (MMIO), but for historical reasons this provides the desired functionality without the desired performance.   As discussed below, it is generally possible to set up an MMIO mapping that allows high-performance writes to IO space, but setting up mappings that allow high-performance reads from IO space is much more problematic.

Processors only support high-performance reads when executing loads to cached address ranges.   Such reads transfer data in cache-line-sized blocks (64 Bytes on x86 architectures) and can support multiple concurrent read transactions for high throughput.  When executing loads to uncached address ranges (such as MMIO ranges), each read fetches only the specific bits requested (1, 2, 4, or 8 Bytes), and all reads to uncached address ranges are completely serialized with respect to each other and with respect to any other memory references.   So even if the latency to the IO device were the same as the latency to memory, using cache-line accesses could easily be (for example) 64 times as fast as using uncached accesses — 8 concurrent transfers of 64 Bytes using cache-line accesses versus one serialized transfer of 8 Bytes.
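The 64x figure follows directly from the amount of data in flight per round trip. A minimal sketch of that arithmetic is below; the sizes and concurrency come from the paragraph above, while `ASSUMED_LATENCY_S` is a made-up latency used only to turn the byte counts into illustrative bandwidths.

```python
# Bytes delivered per round-trip latency under each access mode (figures from the text).
CACHE_LINE_BYTES = 64         # cached loads move whole x86 cache lines
CACHED_READS_IN_FLIGHT = 8    # example concurrency used in the text
UNCACHED_BYTES = 8            # largest single uncached load
UNCACHED_READS_IN_FLIGHT = 1  # uncached loads are fully serialized

ASSUMED_LATENCY_S = 1e-6      # hypothetical round-trip latency, for illustration only


def bytes_per_round_trip(bytes_per_read, reads_in_flight):
    """Bytes that arrive during one latency period with this much concurrency."""
    return bytes_per_read * reads_in_flight


cached = bytes_per_round_trip(CACHE_LINE_BYTES, CACHED_READS_IN_FLIGHT)
uncached = bytes_per_round_trip(UNCACHED_BYTES, UNCACHED_READS_IN_FLIGHT)
speedup = cached / uncached

print(f"cached:   {cached} B per round trip, {cached / ASSUMED_LATENCY_S / 1e6:.0f} MB/s")
print(f"uncached: {uncached} B per round trip, {uncached / ASSUMED_LATENCY_S / 1e6:.0f} MB/s")
print(f"ratio:    {speedup:.0f}x")
```

The ratio does not depend on the assumed latency, only on bytes per read and reads in flight.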

But is it possible to get modern processors to use their cache-line access mechanisms to read data from MMIO addresses?   The answer is a resounding “yes, but….”    The notes below provide an introduction to some of the issues….

It is possible to map IO devices to cacheable memory on at least some processors, but the accesses have to be very carefully controlled to keep within the capabilities of the hardware — some of the transactions to cacheable memory can map to IO transactions and some cannot.
I don’t know the details for Intel processors, but I did go through all the combinations in great detail as the technology lead of the “Torrenza” project at AMD.

Speaking generically, some examples of things that should and should not work (though the details will depend on the implementation):

  • Load miss — generates a cache line read — converted to a 64 Byte IO read — works OK.
    BUT, there is no way for the IO device to invalidate that line in the processor(s) cache(s), so coherence must be maintained manually using the CLFLUSH instruction. NOTE also that the CLFLUSH instruction may or may not work as expected when applied to addresses that are mapped to MMIO, since the coherence engines are typically associated with the memory controllers, not the IO controllers. At the very least you will need to pin threads doing cached MMIO to a single core to maximize the chances that the CLFLUSH instructions will actually clear the (potentially stale) copies of the cache lines mapped to the MMIO range.
  • Streaming Store (aka Write-Combining store, aka Non-temporal store) — generates one or more uncached stores — works OK.
    This is the only mode that is “officially” supported for MMIO ranges by x86 and x86-64 processors. It was added in the olden days to allow a processor core to execute high-speed stores into a graphics frame buffer (i.e., before there was a separate graphics processor). These stores do not use the caches, but do allow you to write to the MMIO range using full cache line writes and (typically) allow multiple concurrent stores in flight.
    The Linux “ioremap_wc()” function maps a region so that all stores are translated to streaming stores, but because the hardware allows this, it is typically possible to explicitly generate streaming stores (using the MOVNT family of instructions, such as MOVNTI or MOVNTDQ) for MMIO regions that are mapped as cached.
  • Store Miss (aka “Read For Ownership”/RFO) — generates a request for exclusive access to a cache line — probably won’t work.
    The reason that it probably won’t work is that RFO requires that the line be invalidated in all the other caches, with the requesting core not allowed to use the data until it receives acknowledgements from all the other cores that the line has been invalidated — but an IO controller is not a coherence controller, so it (typically) cannot generate the required probe/snoop transactions.
    It is possible to imagine implementations that would convert this transaction to an ordinary 64 Byte IO read, but then some component of the system would have to “remember” that this translation took place and would have to lie to the core and tell it that all the other cores had responded with invalidate acknowledgements, so that the core could place the line in “M” state and have permission to write to it.
  • Victim Writeback — writes back a dirty line from cache to memory — probably won’t work.
    Assuming that you could get past the problems with the “store miss” and get the line in “M” state in the cache, eventually the cache will need to evict the dirty line. Although this superficially resembles a 64 Byte store, from the coherence perspective it is quite a different transaction. A Victim Writeback actually has no coherence implications — all of the coherence was handled by the RFO up front, and the Victim Writeback is just the delayed completion of that operation. Again, it is possible to imagine an implementation that simply mapped the Victim Writeback to a 64 Byte IO store, but when you get into the details there are features that just don’t fit. I don’t know of any processor implementation for which a mapping of Victim Writeback operations to MMIO space is supported.

There is one set of mappings that can be made to work on at least some x86-64 processors, and it is based on mapping the MMIO space *twice*, with one mapping used only for reads and the other mapping used only for writes:

  • Map the MMIO range with a set of attributes that allow write-combining stores (but only uncached reads). This mode is supported by x86-64 processors and is provided by the Linux “ioremap_wc()” kernel function, which generates an MTRR (“Memory Type Range Register”) of “WC” (write-combining).  In this case all stores are converted to write-combining stores, but the use of explicit write-combining store instructions (the MOVNT family, such as MOVNTI or MOVNTDQ) makes the usage clearer.
  • Map the MMIO range a second time with a set of attributes that allow cache-line reads (but only uncached, non-write-combined stores).
    For x86 & x86-64 processors, the MTRR type(s) that allow this are “Write-Through” (WT) and “Write-Protect” (WP).
    These might be mapped to the same behavior internally, but the nominal difference is that in WT mode stores *update* the corresponding line if it happens to be in the cache, while in WP mode stores *invalidate* the corresponding line if it happens to be in the cache. In our current application it does not matter, since we will not be executing any stores to this region. On the other hand, we will need to execute CLFLUSH operations to this region, since that is the only way to ensure that (potentially) stale cache lines are removed from the cache and that the subsequent read operation to a line actually goes to the MMIO-mapped device and reads fresh data.

On the particular device that I am fiddling with now, the *device* exports two address ranges using the PCIe BAR functionality. These both map to the same memory locations on the device, but each BAR is mapped to a different *physical* address by the Linux kernel. The different *physical* addresses allow the MTRRs to be set differently (WC for the write range and WT/WP for the read range). These are also mapped to different *virtual* addresses so that the PATs can be set up with values that are consistent with the MTRRs.

Because the IO device has no way to generate transactions to invalidate copies of MMIO-mapped addresses in processor caches, it is the responsibility of the software to ensure that cache lines in the “read” region are invalidated (using the CLFLUSH instruction on x86) if the data is updated either by the IO device or by writes to the corresponding (aliased) address in the “write” region.   This software-based coherence functionality can be implemented at many different levels of complexity, for example:

  • For some applications the data access patterns are based on clear “phases”, so in a “phase” you can leave the data in the cache and simply invalidate the entire block of cached MMIO addresses at the end of the “phase”.
  • If you expect only a small fraction of the MMIO addresses to actually be updated during a phase, this approach is overly conservative and will lead to excessive read traffic.  In such a case, a simple “directory-based coherence” mechanism can be used.  The IO device can keep a bit map of the cache-line-sized addresses that are modified during a “phase”.  The processor can read this bit map (presumably packed into a single cache line by the IO device) and only invalidate the specific cache lines that the directory indicates have been updated.   Lines that have not been updated are still valid, so copies that stay in the processor cache will be safe to use.
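The directory idea in the second bullet can be sketched in a few lines. This is only a model of the bookkeeping, under assumptions that are mine rather than the device's: the bitmap layout (bit i covers line i, least-significant bit first), the function name `lines_to_invalidate`, and the base address are all hypothetical. In real code each returned address would be handed to CLFLUSH (for example through the `_mm_clflush` intrinsic in C) before the next read.

```python
CACHE_LINE_BYTES = 64


def lines_to_invalidate(dirty_bitmap, region_base):
    """Return the addresses of cache lines whose directory bit is set.

    Bit i of the bitmap (LSB first within each byte) covers line i of the region.
    """
    addrs = []
    for byte_index, byte in enumerate(dirty_bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                line = byte_index * 8 + bit
                addrs.append(region_base + line * CACHE_LINE_BYTES)
    return addrs


# A 64-byte bitmap (one cache line) tracks 512 lines, i.e. 32 KiB of MMIO space.
bitmap = bytearray(CACHE_LINE_BYTES)
for line in (0, 3, 511):                 # pretend the device modified these lines
    bitmap[line // 8] |= 1 << (line % 8)

REGION_BASE = 0x1000_0000                # hypothetical base of the cached "read" mapping
stale = lines_to_invalidate(bytes(bitmap), REGION_BASE)
print([hex(a) for a in stale])
```

Only three lines need to be flushed here instead of all 512, which is the saving the directory is meant to provide.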

Giving the processor the capability of reading from an IO device at low latency and high throughput allows a designer to think about interacting with the device in new ways, and should open up new possibilities for fine-grained off-loading in heterogeneous systems….


Posted in Accelerated Computing, Computer Hardware, Linux

Some comments on the Xeon Phi coprocessor

Posted by John D. McCalpin, Ph.D. on 17th November 2012

As many of you know, the Texas Advanced Computing Center is in the midst of installing “Stampede” — a large supercomputer using both Intel Xeon E5 (“Sandy Bridge”) and Intel Xeon Phi (aka “MIC”, aka “Knights Corner”) processors.

In his blog “The Perils of Parallel”, Greg Pfister commented on the Xeon Phi announcement and raised a few questions that I thought I should address here.

I am not in a position to comment on Greg’s first question about pricing, but “Dr. Bandwidth” is happy to address Greg’s second question on memory bandwidth!
This has two pieces — local memory bandwidth and PCIe bandwidth to the host. Greg also raised some issues regarding ECC and regarding performance relative to the Xeon E5 processors that I will address below. Although Greg did not directly raise issues of comparisons with GPUs, several of the topics below seemed to call for comments on similarities and differences between Xeon Phi and GPUs as coprocessors, so I have included some thoughts there as well.

Local Memory Bandwidth

The Intel Xeon Phi 5110P is reported to have 8 GB of local memory supporting 320 GB/s of peak bandwidth. The TACC Stampede system employs a slightly different model Xeon Phi, referred to as the Xeon Phi SE10P — this is the model used in the benchmark results reported in the footnotes of the announcement of the Xeon Phi 5110P. The Xeon Phi SE10P runs its memory slightly faster than the Xeon Phi 5110P, but memory performance is primarily limited by available concurrency (more on that later), so the sustained bandwidth is expected to be essentially the same.

Background: Memory Balance

Since 1991, I have been tracking (via the STREAM benchmark) the “balance” between sustainable memory bandwidth and peak double-precision floating-point performance. This is often expressed in “Bytes/FLOP” (or more correctly “Bytes/second per FP Op/second”), but these numbers have been getting too small (<< 1), so for the STREAM benchmark I use "FLOPS/Word" instead (again, more correctly "FLOPs/second per Word/second", where "Word" is whatever size was used in the FP operation). The design target for the traditional "vector" systems was about 1 FLOP/Word, while cache-based systems have been characterized by ratios anywhere between 10 FLOPS/Word and 200 FLOPS/Word. Systems delivering the high sustained memory bandwidth of 10 FLOPS/Word are typically expensive and applications are often compute-limited, while systems delivering the low sustained memory bandwidth of 200 FLOPS/Word are typically strongly memory bandwidth-limited, with throughput scaling poorly as processors are added.

Some real-world examples from TACC's systems:

  • TACC’s Ranger system (4-socket quad-core Opteron Family 10h “Barcelona” processors) sustains about 17.5 GB/s (2.19 GW/s for 8-Byte Words) per node, and has a peak FP rate of 2.3 GHz * 4 FP Ops/Hz/core * 4 cores/socket * 4 sockets = 147.2 GFLOPS per node. The ratio is therefore about 67 FLOPS/Word.
  • TACC’s Lonestar system (2-socket 6-core Xeon 5600 “Westmere” processors) sustains about 41 GB/s (5.125 GW/s) per node, and has a peak FP rate of 3.33 GHz * 4 FP Ops/Hz/core * 6 cores/socket * 2 sockets = 160 GFLOPS per node. The ratio is therefore about 31 FLOPS/Word.
  • TACC’s forthcoming Stampede system (2-socket 8-core Xeon E5 “Sandy Bridge” processors) sustains about 78 GB/s (9.75 GW/s) per node, and has a peak FP rate of 2.7 GHz * 8 FP Ops/Hz/core * 8 cores/socket * 2 sockets = 345.6 GFLOPS per node. The ratio is therefore a bit over 35 FLOPS/Word.
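The three ratios above can be reproduced with a short calculation. This is just the arithmetic from the bullets; the helper name `flops_per_word` is my own.

```python
WORD_BYTES = 8  # 8-Byte (double-precision) words, as in the text


def flops_per_word(peak_gflops, stream_gb_per_s, word_bytes=WORD_BYTES):
    """Peak GFLOPS divided by sustained GWords/s."""
    gwords_per_s = stream_gb_per_s / word_bytes
    return peak_gflops / gwords_per_s


# name: (peak GFLOPS per node, sustained STREAM GB/s per node)
systems = {
    "Ranger":   (2.3 * 4 * 4 * 4, 17.5),
    "Lonestar": (3.33 * 4 * 6 * 2, 41.0),
    "Stampede": (2.7 * 8 * 8 * 2, 78.0),
}

ratios = {name: flops_per_word(peak, bw) for name, (peak, bw) in systems.items()}
for name, (peak, bw) in systems.items():
    print(f"{name:9s} {peak:6.1f} GFLOPS  {bw / WORD_BYTES:6.3f} GW/s  "
          f"{ratios[name]:5.1f} FLOPS/Word")
```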

Again, the Xeon Phi SE10P coprocessors being installed at TACC are not identical to the announced product version, but the differences are not particularly large. According to footnote 7 of Intel’s announcement web page, the Xeon Phi SE10P has a peak performance of about 1.06 TFLOPS, while footnote 8 reports a STREAM benchmark performance of up to 175 GB/s (21.875 GW/s). The ratio is therefore about 48 FLOPS/Word — a bit less bandwidth per FLOP than the Xeon E5 nodes in the TACC Stampede system (or the TACC Lonestar system), but a bit more bandwidth per FLOP than provided by the nodes in the TACC Ranger system. (I will have a lot more to say about sustained memory bandwidth on the Xeon Phi SE10P over the next few weeks.)

The earlier GPUs had relatively high ratios of bandwidth to peak double-precision FP performance, but as the double-precision FP performance was increased, the ratios have shifted to relatively low amounts of sustainable bandwidth per peak FLOP. For the NVIDIA M2070 “Fermi” GPGPU, the peak double-precision performance is reported as 515.2 GFLOPS, while I measured sustained local bandwidth of about 105 GB/s (13.125 GW/s) using a CUDA port of the STREAM benchmark (with ECC enabled). This corresponds to about 39 FLOPS/Word. I don’t have sustained local bandwidth numbers for the new “Kepler” K20X product, but the data sheet reports that the peak memory bandwidth has been increased by 1.6x (250 GB/s vs 150 GB/s) while the peak FP rate has been increased by 2.5x (1.31 TFLOPS vs 0.515 TFLOPS), so the ratio of peak FLOPS to sustained local bandwidth must be significantly higher than the 39 for the “Fermi” M2070, and is likely in the 55-60 range — slightly higher than the value for the Xeon Phi SE10P.
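The K20X estimate can be checked the same way. The sketch below assumes that sustained bandwidth scales in proportion to peak bandwidth between the two products, which is my simplification and not a measured result.

```python
WORD_BYTES = 8

# NVIDIA M2070 "Fermi": figures from the text.
m2070_peak_gflops = 515.2
m2070_stream_gb_s = 105.0
m2070_ratio = m2070_peak_gflops / (m2070_stream_gb_s / WORD_BYTES)

# Data-sheet scaling from M2070 to K20X, as quoted in the text.
bandwidth_scale = 250.0 / 150.0      # peak memory bandwidth
flops_scale = 1310.0 / 515.2         # peak double-precision rate

# ASSUMPTION: sustained bandwidth grows by the same factor as peak bandwidth.
k20x_estimate = m2070_ratio * flops_scale / bandwidth_scale

print(f"M2070: {m2070_ratio:.1f} FLOPS/Word")
print(f"K20X (estimated): {k20x_estimate:.1f} FLOPS/Word")
```

Under that assumption the estimate lands at the upper end of the 55-60 range quoted above.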

Although the local memory bandwidth ratios are similar between GPUs and Xeon Phi, the Xeon Phi has a lot more cache to facilitate data reuse (thereby decreasing bandwidth demand). The architectures are quite different, but the NVIDIA Kepler K20x appears to have a total of about 2MB of registers, L1 cache, and L2 cache per chip. In contrast, the Xeon Phi has a 32kB data cache and a private 512kB L2 cache per core, giving a total of more than 30 MB of cache per chip. As the community develops experience with these products, it will be interesting to see how effective the two approaches are for supporting applications.

PCIe Interface Bandwidth

There is no doubt that the PCIe interface between the host and a Xeon Phi has a lot less sustainable bandwidth than what is available for either the Xeon Phi to its local memory or for the host processor to its local memory. This will certainly limit the classes of algorithms that can map effectively to this architecture — just as it limits the classes of algorithms that can be mapped to GPU architectures.

Although many programming models are supported for the Xeon Phi, one that looks interesting (and which is not available on current GPUs) is to run MPI tasks on the Xeon Phi card as well as on the host.

  • MPI codes are typically structured to minimize external bandwidth, so the PCIe interface is used only for MPI messages and not for additional offloading traffic between the host and coprocessor.
  • If the application allows different amounts of “work” to be allocated to each MPI task, then you can use performance measurements for your application to balance the work allocated to each processing component.
  • If the application scales well with OpenMP parallelism, then one can place one MPI task on each Xeon E5 chip on the host (with 8 threads per task) and one MPI task on the Xeon Phi (with anywhere from 60-240 threads per task, depending on how your particular application scales).
  • Xeon Phi supports multiple MPI tasks concurrently (with environment variables to control which cores an MPI task’s threads can run on), so applications that do not easily allow different amounts of work to be allocated to each MPI task might run multiple MPI tasks on the Xeon Phi, with the number chosen to balance performance with the performance of the host processors. For example if the Xeon Phi delivers approximately twice the performance of a Xeon E5 host chip, then one might allocate one MPI task on each Xeon E5 (with OpenMP threading internal to the task) and two MPI tasks on the Xeon Phi (again with OpenMP threading internal to the task). If the Xeon Phi delivers three times the performance of the Xeon E5, then one would allocate three MPI tasks to the Xeon Phi, etc….
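The task-count rule in the last bullet amounts to rounding a measured performance ratio. A minimal sketch follows, where the function names and the relative performance numbers are hypothetical and would come from measurements of your own application.

```python
def phi_mpi_tasks(phi_perf, host_chip_perf):
    """MPI tasks to place on the coprocessor so each task gets roughly equal work."""
    return max(1, round(phi_perf / host_chip_perf))


def layout(phi_perf, host_chip_perf, host_chips=2):
    """One MPI task per host chip, plus a balanced number on the Xeon Phi."""
    return {
        "host_tasks": host_chips,
        "phi_tasks": phi_mpi_tasks(phi_perf, host_chip_perf),
    }


# Hypothetical measurements, in units of "one Xeon E5 chip".
print(layout(phi_perf=2.0, host_chip_perf=1.0))
print(layout(phi_perf=3.0, host_chip_perf=1.0))
```

This only helps when the application cannot give different amounts of work to different tasks; otherwise a single task on the Xeon Phi with a larger share of the work is simpler.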

Running a full operating system on the Xeon Phi allows more flexibility in code structure than is available on (current) GPU-based coprocessors. Possibilities include:

  • Run on host and offload loops/functions to the Xeon Phi.
  • Run on Xeon Phi and offload loops/functions to the host.
  • Run on Xeon Phi and host as peers, for example with MPI.
  • Run only on the host and ignore the Xeon Phi.
  • Run only on the Xeon Phi and use the host only for launching jobs and providing external network and file system access.

Lots of things to try….


Like most (all?) GPUs that support ECC, the Xeon Phi implements ECC “inline” — using a fraction of the standard memory space to hold the ECC bits. This requires memory controller support to perform the ECC checks and to hide the “holes” in memory that contain the ECC bits, but it allows the feature to be turned on and off without incurring extra hardware expense for widening the memory interface to support the ECC bits.

Note that widening the memory interface from 64 bits to 72 bits is straightforward with x4 and x8 DRAM parts — just use 18 x4 chips instead of 16, or use 9 x8 chips instead of 8 — but is problematic with the x32 GDDR5 DRAMs used in GPUs and in Xeon Phi. A single x32 GDDR5 chip has a minimum burst of 32 Bytes so a cache line can be easily delivered with a single transfer from two “ganged” channels. If one wanted to “widen” the interface to hold the ECC bits, the minimum overhead is one extra 32-bit channel — a 50% overhead. This is certainly an unattractive option compared to the 12.5% overhead for the standard DDR3 ECC DIMMs. There are a variety of tricky approaches that might be used to reduce this overhead, but the inline approach seems quite sensible for early product generations.
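The overhead comparison is simple enough to write out. The sketch below uses only the widths given in the paragraph above; the helper name `ecc_overhead` is my own.

```python
def ecc_overhead(data_bits, extra_bits):
    """Extra interface width needed for ECC, as a fraction of the data width."""
    return extra_bits / data_bits


# Standard DDR3 ECC DIMM: 64 data bits widened to 72.
ddr3_overhead = ecc_overhead(data_bits=64, extra_bits=8)
chips_x4 = (64 + 8) // 4     # number of x4 DRAM parts on a 72-bit interface
chips_x8 = (64 + 8) // 8     # number of x8 DRAM parts on a 72-bit interface

# GDDR5: two ganged x32 channels carry the data, and the smallest
# possible addition is one more whole x32 channel.
gddr5_overhead = ecc_overhead(data_bits=2 * 32, extra_bits=32)

print(f"DDR3:  {ddr3_overhead:.1%} overhead ({chips_x4} x4 chips or {chips_x8} x8 chips)")
print(f"GDDR5: {gddr5_overhead:.1%} overhead")
```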

Intel has not disclosed details about the implementation of ECC on Xeon Phi, but my current understanding of their implementation suggests that the performance penalty (in terms of bandwidth) is actually rather small. I don’t know enough to speculate on the latency penalty yet. All of TACC’s Xeon Phis have been running with ECC enabled, but any Xeon Phi owner should be able to reboot a node with ECC disabled to perform direct latency and bandwidth comparisons. (I have added this to my “To Do” list….)

Speedup relative to Xeon E5

Greg noted the surprisingly reasonable claims for speedup relative to Xeon E5. I agree that this is a good thing, and that it is much better to pay attention to application speedup than to the peak performance ratios. Computer performance history has shown that every approach used to double performance results in less than doubling of actual application performance.

Looking at some specific microarchitectural performance factors:

  1. Xeon Phi supports a 512-bit vector instruction set, which can be expected to be slightly less efficient than the 256-bit vector instruction set on Xeon E5.
  2. Xeon Phi has slightly lower L1 cache bandwidth (in terms of Bytes/Peak FP Op) than the Xeon E5, resulting in slightly lower efficiency for overlapping compute and data transfers to/from the L1 data cache.
  3. Xeon Phi has ~60 cores per chip, which can be expected to give less efficient throughput scaling than the 8 cores per Xeon E5 chip.
  4. Xeon Phi has slightly less bandwidth per peak FP Op than the Xeon E5, so the memory bandwidth will result in a higher overhead and a slightly lower percentage of peak FP utilization.
  5. Xeon Phi has no L3 cache, so the total cache per core (32kB L1 + 512kB L2) is lower than that provided by the Xeon E5 (32kB L1 + 256kB L2 + 2.5MB L3, i.e., 1/8 of the 20 MB shared L3).
  6. Xeon Phi has higher local memory latency than the Xeon E5, which has some impact on sustained bandwidth (already considered), and results in additional stall cycles in the occasional case of a non-prefetchable cache miss that cannot be overlapped with other memory transfers.

None of these are “problems” — they are intrinsic to the technology required to obtain higher peak performance per chip and higher peak performance per unit power. (That is not to say that the implementation cannot be improved, but it is saying that any implementation using comparable design and fabrication technology can be expected to show some level of efficiency loss due to each of these factors.)

The combined result of all these factors is that the Xeon Phi (or any processor obtaining its peak performance using much more parallelism with lower-power, less complex processors) will typically deliver a lower percentage of peak on real applications than a state-of-the-art Xeon E5 processor. Again, this is not a “problem” — it is intrinsic to the technology. Every application will show different sensitivity to each of these specific factors, but few applications will be insensitive to all of them.

Similar issues apply to comparisons between the “efficiency” of GPUs vs state-of-the-art processors like the Xeon E5. These comparisons are not as uniformly applicable because the fundamental architecture of GPUs is quite different than that of traditional CPUs. For example, we have all seen the claims of 50x and 100x speedups on GPUs. In these cases the algorithm is typically a poor match to the microarchitecture of the traditional CPU and a reasonable match to the microarchitecture of the GPU. We don’t expect to see similar speedups on Xeon Phi because it is based on a traditional microprocessor architecture and shows similar performance characteristics.

On the other hand, something that we don’t typically see is the list of 0x speedups for algorithms that do not map well enough to the GPU to make the porting effort worthwhile. Xeon Phi is not better than Xeon E5 on all workloads, but because it is based on general-purpose microprocessor cores it will run any general-purpose workload. The same cannot be said of GPU-based coprocessors.

Of course these are all general considerations. Performing careful direct comparisons of real application performance will take some time, but it should be a lot of fun!

Posted in Computer Hardware

TACC Ranger Node Local and Remote Memory Latency Tables

Posted by John D. McCalpin, Ph.D. on 26th July 2012

In the previous post, I published my best set of numbers for local memory latency on a variety of AMD Opteron system configurations. Here I expand that to include remote memory latency on some of the systems that I have available for testing.

Ranger is the oldest system still operational here at TACC.  It was brought on-line in February 2008 and is currently scheduled to be decommissioned in early 2013.  Each of the 3936 SunBlade X6420 nodes contains four AMD “Barcelona” quad-core Opteron processors (model 8356), running at a core frequency of 2.3 GHz and a NorthBridge frequency of 1.6 GHz.  (The Opteron 8356 processor supports a higher NorthBridge frequency, but this requires a different motherboard with  “split-plane” power supply support — not available on the SunBlade X6420.)

The on-node interconnect topology of the SunBlade X6420 is asymmetric, making maximum use of the three HyperTransport links on each Opteron processor while still allowing 2 HyperTransport links to be used for I/O.

As seen in the figure below, chips 1 & 2 on each node are directly connected to each of the other three chips, while chips 0 & 3 are only connected to two other chips — requiring two “hops” on the HyperTransport network to access the third remote chip.  Memory latency on this system is bounded below by the time required to “snoop” the caches on the other chips.  Chips 1 & 2 are directly connected to the other chips, so they get their snoop responses back more quickly and therefore have lower memory latency.

[Figure: Ranger compute node inter-processor topology.]

A variant of the “lat_mem_rd.c” program from “lmbench” (version 2) was used to measure the memory access latency.  The benchmark chases a chain of pointers that have been set up with a fixed stride of 128 Bytes (so that the core hardware prefetchers are not activated) and with a total size that significantly exceeds the size of the 2MiB L3 cache.  For the table below, array sizes of 32MiB to 1024MiB were used, with negligible variations in observed latency.    For this particular system, the memory controller prefetchers were active with the stride of 128 used, but since the effective latency is limited by the snoop response time, there is no change to the effective latency even when the memory controller prefetchers fetch the data early.  (I.e., the processors might get the data earlier due to memory controller prefetch, but they cannot use the data until all the snoop responses have been received.)
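The structure of the pointer chain can be illustrated as follows. This is not the lmbench code and it is not a usable timing tool, since interpreter overhead in Python dwarfs memory latency; it only shows how a fixed-stride chain is laid out and why each load depends on the previous one. The 8-Byte slot size and the function names are my assumptions.

```python
ELEM_BYTES = 8                          # assumed size of one pointer slot
STRIDE_BYTES = 128                      # stride from the text
STRIDE_ELEMS = STRIDE_BYTES // ELEM_BYTES


def build_chain(n_elems, stride_elems):
    """Slot i stores the index of slot i + stride, wrapping at the end of the array."""
    return [(i + stride_elems) % n_elems for i in range(n_elems)]


def chase(chain, steps, start=0):
    """Follow the chain; each load needs the previous result, so loads cannot overlap."""
    idx = start
    for _ in range(steps):
        idx = chain[idx]
    return idx


n_elems = 1024 * STRIDE_ELEMS           # small demo array; the text used 32MiB to 1024MiB
chain = build_chain(n_elems, STRIDE_ELEMS)
visited_per_lap = n_elems // STRIDE_ELEMS
end = chase(chain, visited_per_lap)     # one full lap returns to the starting slot
print(f"{visited_per_lap} dependent loads per lap, ended at index {end}")
```

In the real benchmark the loop is timed and the elapsed time is divided by the number of dependent loads to get the latency per access.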

Memory latency for all combinations of (chip making request) and (chip holding data) are shown in the table below:

Memory Latency (ns)    Data on Chip 0   Data on Chip 1   Data on Chip 2   Data on Chip 3
Request from Chip 0        133.2            136.9            136.4            145.4
Request from Chip 1        140.3            100.3            122.8            139.3
Request from Chip 2        140.4            122.2            100.4            139.3
Request from Chip 3        146.4            137.4            137.4            134.9

Cache latency and local and remote memory latency for Ranger compute nodes.
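A short script makes the asymmetry in the table easier to see. The numbers are copied from the table; the summary statistics computed from them are my own choice.

```python
# latency_ns[requesting_chip][chip_holding_data], copied from the table.
latency_ns = [
    [133.2, 136.9, 136.4, 145.4],
    [140.3, 100.3, 122.8, 139.3],
    [140.4, 122.2, 100.4, 139.3],
    [146.4, 137.4, 137.4, 134.9],
]
chips = range(len(latency_ns))

local = [latency_ns[c][c] for c in chips]
remote_mean = [
    sum(latency_ns[c][d] for d in chips if d != c) / (len(latency_ns) - 1)
    for c in chips
]
worst = max((latency_ns[r][d], r, d) for r in chips for d in chips)

for c in chips:
    print(f"chip {c}: local {local[c]:.1f} ns, mean remote {remote_mean[c]:.1f} ns")
print(f"worst case: {worst[0]} ns (request from chip {worst[1]}, data on chip {worst[2]})")
```

Local latency on chips 1 and 2 is roughly 33 ns lower than on chips 0 and 3, consistent with the snoop-response argument above, and the largest entries are the two-hop accesses between chips 0 and 3.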

Posted in Computer Hardware

What good are “Large Pages” ?

Posted by John D. McCalpin, Ph.D. on 12th March 2012

I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.

  • The size of the page controls how many bits are translated between virtual and physical addresses, and so represents a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
  • A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks.

The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).

To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.

Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):

AMD Opteron Family 10h Revision D (“Istanbul”):

  • L1 DTLB:
    • 4kB pages: 48 entries;
    • 2MB pages: 48 entries;
    • 1GB pages: 48 entries
  • L2 TLB:
    • 4kB pages: 512 entries;
    • 2MB pages: 128 entries;
    • 1GB pages: 16 entries

AMD Opteron Family 15h Model 6220 (“Interlagos”):

  • L1 DTLB
    • 4KiB, 32 entry, fully associative
    • 2MiB, 32 entry, fully associative
    • 1GiB, 32 entry, fully associative
  • L2 DTLB: (none)
  • Unified L2 TLB:
    • Data entries: 4KiB/2MiB/4MiB/1GiB, 1024 entries, 8-way associative
    • “An entry allocated by one core is not visible to the other core of a compute unit.”

Intel Xeon 56xx (“Westmere”):

  • L1 DTLB:
    • 4KiB pages: 64 entries;
    • 2MiB pages: 32 entries
  • L2 TLB:
    • 4KiB pages: 512 entries;
    • 2MB pages: none

Intel Xeon E5 26xx (“Sandy Bridge EP”):

  • L1 DTLB
    • 4KiB, 64 entries
    • 2MiB/4MiB, 32 entries
    • 1GiB, 4 entries
  • STLB (second-level TLB)
    • 4KiB, 512 entries
    • (There are no entries for 2MiB pages or 1GiB pages in the STLB)

Xeon Phi Coprocessor SE10P: (Note 1)

  • L1 DTLB
    • 4KiB, 64 entries, 4-way associative
    • 2MiB, 8 entries, 4-way associative
  • L2 TLB
    • 4KiB, 64 Page Directory Entries, 4-way associative (Note 2)
    • 2MiB, 64 entries, 4-way associative

Most of these cores can map at least 2MiB (512*4KiB) using small pages before suffering level 2 TLB misses, and at least 64MiB (32*2MiB) using large pages.  All of these systems should see a performance increase when performing random accesses over memory ranges that are much larger than 2MiB and less than 64MiB.
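The "reach" figures quoted here are just entries times page size. A minimal sketch, using the 512-entry and 32-entry counts from the paragraph above (the helper name `tlb_reach` is mine):

```python
KiB = 1024
MiB = 1024 * KiB


def tlb_reach(entries, page_bytes):
    """Bytes of address space that can be mapped without a TLB miss."""
    return entries * page_bytes


small_page_reach = tlb_reach(entries=512, page_bytes=4 * KiB)
large_page_reach = tlb_reach(entries=32, page_bytes=2 * MiB)

print(f"512 x 4KiB entries: {small_page_reach // MiB} MiB")
print(f"32 x 2MiB entries:  {large_page_reach // MiB} MiB")
print(f"large pages extend the reach by {large_page_reach // small_page_reach}x")
```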

What you are trying to avoid in all these cases is the worst case (Note 3) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:

  • 5 trips to memory to load data mapped on a 4KiB page,
  • 4 trips to memory to load data mapped on a 2MiB page, and
  • 3 trips to memory to load data mapped on a 1GiB page.

In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593).  Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)
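The trip counts follow from how many levels of the translation hierarchy must be read for each page size. In the sketch below the level counts are the standard x86-64 ones; `ASSUMED_MEMORY_LATENCY_NS` is a made-up round number used only to show how quickly the worst case adds up.

```python
# Levels of the x86-64 page-table hierarchy that must be read for each page size.
# 4KiB pages use all four levels; 2MiB and 1GiB pages end the walk one or two levels early.
TRANSLATION_LEVELS = {"4KiB": 4, "2MiB": 3, "1GiB": 2}

ASSUMED_MEMORY_LATENCY_NS = 100  # hypothetical latency per trip, for illustration only


def worst_case_memory_trips(page_size):
    """Trips to memory when no translation is cached: the page walk plus the data itself."""
    return TRANSLATION_LEVELS[page_size] + 1


for size in TRANSLATION_LEVELS:
    trips = worst_case_memory_trips(size)
    print(f"{size}: {trips} trips, about {trips * ASSUMED_MEMORY_LATENCY_NS} ns "
          f"at {ASSUMED_MEMORY_LATENCY_NS} ns per trip")
```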

A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite.  Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.

Note 1:

The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency.  The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth.  The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker.   The third unusual feature is unusual enough to get a separate discussion in Note 2.

Note 2:

Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB.  Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation — responsible for translating virtual address bits 29:21.  The benefit here is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB=256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup for the Page Table Walk for an address range of 64*2MiB=128MiB.  In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB.  Combining this with the caching for the first two levels of the hierarchical address translation (see Note 4) and a high probability of finding the Page Table Entry in the L1 or L2 caches, this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.

Note 3:

The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.

Note 4:

Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.

Posted in Computer Architecture, Computer Hardware, Performance, Reference | Comments Off on What good are “Large Pages” ?

Is “ordered summation” a hard problem to speed up?

Posted by John D. McCalpin, Ph.D. on 15th February 2012

Sometimes things that seem incredibly difficult aren’t really that bad….

I have been reviewing technology challenges for “exascale” computing and ran across an interesting comment in the 2008 “Technology Challenges in Achieving Exascale Systems” report.

In Section 5.8 “Application Assessments”, Figure 5.16 on page 82 places “Ordered Summation” in the upper right hand corner (serial and non-local) with the annotation “Just plain hard to speed up”.
The most obvious use for ordered summation is in computing sums or dot products in such a way that the result does not depend on the order of the computations, or on the number of partial sums used in intermediate stages.

Interestingly, “ordered summation” is not necessary to obtain sums or dot products that are “exact” independent of ordering or grouping. For the very important case of “exact” computation of inner products, the groundwork was laid out 30 years ago by Kulisch (e.g., Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice, Academic Press 1981, and US Patents 4,622,650 and 4,866,653). Kulisch proposes using a very long fixed point accumulator that can handle the full range of products of 64-bit IEEE values — from minimum denorm times minimum denorm to maximum value times maximum value. Working out the details and allowing extra bits to prevent overflow in the case of adding lots of maximum values, Kulisch proposed an accumulator of 4288 bits to handle the accumulation of products of 64-bit IEEE floating-point values. (ref and ref).

For a long time this proposal of a humongously long accumulator (4288 bits = 536 Bytes = 67 64-bit words) was considered completely impractical, but as technology has changed, I think the approach makes a fair amount of sense now — trading computation of these exact inner products for the potentially much more expensive communication required to re-order the summation.

I have not looked at the current software implementations of this exact accumulator in detail, but it appears that on a current Intel microprocessor you can add two exact accumulators in ~133 cycles — one cycle for the first 64-bit addition, and 2 cycles for each of the next 66 64-bit add-with-carry operations. (AMD processors provide similar capability, with slightly different latency and throughput details.) Although the initial bit-twiddling to convert from two IEEE 64-bit numbers to a 106-bit fixed point value is ugly, the operations should not take very long in the common case, so presumably a software implementation of the exact accumulator would spend most of its time updating the exact accumulator from some “base” point (where the low-order bits of the product sit) up to the top of the accumulator. (It is certainly possible to employ various tricks to know when you can stop carrying bits “upstream”, but I am trying to be conservative in the performance estimates here.)
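
To make the "one add plus 66 add-with-carry operations" picture concrete, here is a minimal portable sketch of adding one 67-word accumulator into another. It is not taken from any existing implementation: the name `acc_add` and the least-significant-word-first layout are assumptions, and the carry is detected through unsigned wraparound rather than a hardware carry flag, so a compiler may or may not turn the loop into an add-with-carry chain.

```c
#include <stdint.h>

#define ACC_WORDS 67  /* 4288 bits / 64 bits per word */

/* dst += src across all 67 words, carrying from low to high.
 * Word 0 is assumed to be the least significant word.
 * Returns the carry out of the most significant word. */
unsigned acc_add(uint64_t dst[ACC_WORDS], const uint64_t src[ACC_WORDS])
{
    unsigned carry = 0;
    for (int i = 0; i < ACC_WORDS; i++) {
        uint64_t s = dst[i] + src[i];  /* may wrap around */
        unsigned c1 = s < dst[i];      /* wrapped: carry out of the word add */
        uint64_t t = s + carry;        /* add the incoming carry */
        unsigned c2 = t < s;           /* wrapped again: carry out */
        dst[i] = t;
        carry = c1 | c2;               /* at most one of the two can be set */
    }
    return carry;
}
```

The loop always walks all 67 words, which matches the conservative estimate in the text; the "tricks" for stopping early would replace the fixed loop bound.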

Since these exact accumulations are order-independent, you can use all of the cores on the chip to run multiple accumulators in parallel. You can also get a bit of speedup by pipelining two accumulations on one core (1.5 cycles per Add with Carry throughput versus 2 cycles per Add with Carry latency). To keep the control flow separate, this is probably done most easily via HyperThreading. Assuming each pair of 64-bit IEEE inputs generates outputs that average ~1/3 of the way up the exponent range, a naïve implementation would require ~44 Add With Carry operations per accumulate, or about 30 cycles per update in a pipelined implementation. Adding another ~25 cycles per element for bit twiddling and control overhead gives ~55 cycles per element on one core, or ~7.5 cycles per element on an 8-core processor. Assume a 2.5GHz clock to get ~3ns per update. Note that the update is associated with 16 Bytes of memory traffic to read the two input arrays, and that the resulting 5.3 GB/s of DRAM bandwidth is well within the chip’s capability. Interestingly, the chip’s sustained bandwidth limitation of 10-15 GB/s means that accumulating into a 64-bit IEEE value is only going to be 2-3 times as fast as this exact technique.

Sending exact accumulators between nodes for a tree-based summation is easy — with current interconnect fabrics the time required to send 536 Bytes is almost the same as the time required to send the 8 Byte IEEE partial sums currently in use. I.e., with QDR Infiniband, the time required to send a message via MPI is something like 1 microsecond plus (message length / 3.2 GB/s). This works out to 1.0025 microseconds with an 8 Byte partial sum and 1.167 microseconds for a 536 Byte partial sum, with the difference expected to decrease as FDR Infiniband is introduced.
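
The message-time estimate above is a fixed latency plus size divided by bandwidth. A minimal sketch of that model follows; the function name `message_time_us` is hypothetical, and 1 GB/s is taken as 10^9 bytes per second.

```c
/* Time in microseconds to send one message under a simple
 * "fixed latency + size / bandwidth" model. */
double message_time_us(double bytes, double latency_us, double bandwidth_gb_per_s)
{
    /* 1 GB/s is 1000 bytes per microsecond */
    return latency_us + bytes / (bandwidth_gb_per_s * 1000.0);
}
```

With the 1 microsecond and 3.2 GB/s figures from the text, an 8 Byte message takes 1.0025 microseconds and a 536 Byte message takes about 1.1675 microseconds.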

I don’t know of anyone using these techniques in production, but it looks like we are getting close to the point where we pay a slight performance penalty (on what was almost certainly a small part of the code’s overall execution time) and never again need to worry about ordering or grouping leading to slightly different answers in sum reductions or dot products. This sounds like a step in the right direction….

Posted in Algorithms, Computer Hardware | Comments Off on Is “ordered summation” a hard problem to speed up?

AMD Opteron Processor models, families, and revisions

Posted by John D. McCalpin, Ph.D. on 2nd April 2011

Opteron Processor models, families, and revisions/steppings

Opteron naming is not that confusing, but AMD seems intent on making it difficult by rearranging their web site in mysterious ways….

I am creating this blog entry to make it easier for me to find my own notes on the topic!

The Wikipedia page has a pretty good listing:
List of AMD Opteron microprocessors

AMD has useful product comparison reference pages at:
AMD Opteron Processor Solutions
AMD Desktop Processor Solutions
AMD Opteron First Generation Reference (pdf)

Borrowing from those pages, a simple summary is:

First Generation Opteron: models 1xx, 2xx, 8xx.

  • These are all Family K8, and are described in AMD pub 26094.
  • They are usually referred to as “Rev E” or “K8, Rev E” processors.
    This is usually OK since most of the 130 nm parts are gone, but there is a new Family 10h rev E (below).
  • They are characterized by having DDR DRAM interfaces, supporting DDR 266, 333, and (Revision E) 400 MHz.
  • This also includes Athlon 64 and Athlon 64 X2 in sockets 754 and 939.
  • Versions:
    • Single core, 130 nm process: K8 revisions B3, C0, CG
    • Single core, 90 nm process: K8 revisions E4, E6
    • Dual core, 90 nm process: K8 revisions E1, E6

Second Generation Opteron: models 12xx, 22xx, 82xx

  • These are upgraded Family K8 cores, with a DDR2 memory controller.
  • They are usually referred to as “Revision F”, or “K8, Rev F”, and are described in AMD pub 32559 (where they are referred to as “Family NPT 0Fh”, with NPT meaning “New Platform Technology” and referring to the infrastructure related to socket F (aka socket 1207) and socket AM2).
  • This also includes socket AM2 models of Athlon and most Athlon X2 processors (some are Family 11h, described below).
  • There is only one server version, with two steppings:
    • Dual core, 90 nm process: K8 revisions F2, F3

Upgraded Second Generation Opteron: Athlon X2, Sempron, Turion, Turion X2

  • These are very similar to Family 0Fh, revision G (not used in server parts), and are described in AMD document 41256.
  • The memory controller has less functionality.
  • The HyperTransport interface is upgraded to support HyperTransport generation 3.
    This allows a higher frequency connection between the processor chip and the external PCIe controller, so that PCIe gen2 speeds can be supported.

Third Generation Opteron: models 13xx, 23xx, 83xx

  • These are Family 10h cores with an enhanced DDR2 memory controller and are described in AMD publication 41322.
  • All server and most desktop versions have a shared L3 cache.
  • This also includes Phenom X2, X3, and X4 (Rev B3) and Phenom II X2, X3, X4 (Rev C)
  • Versions:
    • Barcelona: Dual core & Quad core, 65 nm process: Family 10h revisions B0, B2, B3, BA
    • Shanghai: Dual core & Quad core, 45 nm process: Family 10h revision C2
    • Istanbul: Up to 6-core, 45 nm process: Family 10h, revision D0
  • Revision D (“Istanbul”) introduced the “HT Assist” probe filter feature to improve scalability in 4-socket and 8-socket systems.

Upgraded Third Generation Opteron: models 41xx & 61xx

  • These are Family 10h cores with an enhanced DDR3-capable memory controller and are also described in AMD publication 41322.
  • All server and most desktop versions have a shared L3 cache.
  • It does not appear that any of the desktop parts use the same stepping as the server parts (D1).
  • There are two versions — both manufactured using a 45nm process:
    • Lisbon: 41xx series have one Family 10h revision D1 die per package (socket C32).
    • Magny-Cours: 61xx series have two Family 10h revision D1 dice per package (socket G34).
  • Family 10h, Revision E0 is used in the Phenom II X6 products.
    • This revision is the first to offer the “Core Performance Boost” feature.
    • It is also the first to generate confusion about the label “Rev E”.
    • It should be referred to as “Family 10h, Revision E” to avoid ambiguity.

Fourth Generation Opteron: server processor models 42xx & 62xx, and “AMD FX” desktop processors

  • These are socket-compatible with the 41xx and 61xx series, but with the “Bulldozer” core rather than the Family 10h core.
  • The Bulldozer core adds support for:
    • AVX — the extension of SSE from 128 bits wide to 256 bits wide, plus many other improvements. (First introduced in Intel “Sandy Bridge” processors.)
    • AES — additional instructions to dramatically improve performance of AES encryption/decryption. (First introduced in Intel “Westmere” processors.)
    • FMA4 — AMD’s 4-operand multiply-accumulate instructions. (32-bit & 64-bit arithmetic, with 64b, 128b, or 256b vectors.)
    • XOP — AMD’s set of extra integer instructions that were not included in AVX: multiply/accumulate, shift/rotate/permute, etc.
  • All current parts are produced in a 32 nm semiconductor process.
  • Valencia: 42xx series have one Bulldozer revision B2 die per package (socket C32)
  • Interlagos: 62xx series have two Bulldozer revision B2 dice per package (socket G34)
  • “AMD FX”: desktop processors have one Bulldozer revision B2 die per package (socket AM3+)
  • Counting cores and chips is getting more confusing…
    • Each die has 1, 2, 3, or 4 “Bulldozer modules”.
    • Each “Bulldozer module” has two processor cores.
    • The two processor cores in a module share the instruction cache (64kB), some of the instruction fetch logic, the pair of floating-point units, and the 2MB L2 cache.
    • The two processor cores in a module each have a private data cache (16kB), private fixed point functional and address generation units, and schedulers.
    • All modules on a die share an 8 MB L3 cache and the dual-channel DDR3 memory controller.
  • Bulldozer-based systems are characterized by a much larger “turbo” boost frequency increase than previous processors, with almost all models supporting an automatic frequency boost of over 20% when not using all the cores, and some models supporting frequency boosts of more than 30%.

Posted in Computer Hardware, Reference | 4 Comments »

Memory Latency Components

Posted by John D. McCalpin, Ph.D. on 10th March 2011

A reader of this site asked me if I had a detailed breakdown of the components of memory latency for a modern microprocessor-based system. Since the only real data I have is confidential/proprietary and obsolete, I decided to try to build up a latency equation from memory….

Preliminary Comments:

It is possible to estimate pieces of the latency equation on various systems if you combine carefully controlled microbenchmarks with a detailed understanding of the cache hierarchy, the coherence protocol, and the hardware performance monitors. Being able to control the CPU, DRAM, and memory controller frequencies independently is a big help.

On the other hand, if you have not worked in the design team of a modern microprocessor it is unlikely that you will be able to anticipate all the steps that are required in making a “simple” memory access. I spent most of 12 years in design teams at SGI, IBM, and AMD, and I am pretty sure that I cannot think of all the required steps.

Memory Latency Components: Abridged

Here is a sketch of some of the components for a simple, single-chip system (my AMD Phenom II model 555), for which I quoted a pointer-chasing memory latency of 51.58 ns at 3.2 GHz with DDR3/1600 memory. I will start counting when the load instruction is issued (ignoring instruction fetch, decode, and queuing).

  1. The load instruction queries the (virtually addressed) L1 cache tags — this probably occurs one cycle after the load instruction executes.
    Simultaneously, the virtual address is looked up in the TLB. Assuming an L1 Data TLB hit, the corresponding physical address is available ~1 cycle later and is used to check for aliasing in the L1 Data Cache (this is rare). Via sneakiness, the Opteron manages to perform both queries with only a single access to the L1 Data Cache tags.
  2. Once the physical address is available and it has been determined that the virtual address missed in the L1, the hardware initiates a query of the (private) L2 cache tags and the core’s Miss Address Buffers. In parallel with this, the Least Recently Used entry in the corresponding congruence class of the L1 Data Cache is selected as the “victim” and migrated to the L2 cache (unless the chosen victim entry in the L1 is in the “invalid” state or was originally loaded into the L1 Data Cache using the PrefetchNTA instruction).
  3. While the L2 tags are being queried, a Miss Address Buffer is allocated and a speculative query is sent to the L3 cache directory.
  4. Since the L3 is both larger than the L2 and shared, its response time will constitute the critical path. I did not measure L3 latency on the Phenom II system, but other AMD Family 10h Revision C processors have an average L3 hit latency of 48.4 CPU clock cycles. (The non-integer average is no surprise at the L3 level, since the 6 MB L3 is composed of several different blocks that almost certainly have slightly different latencies.)
    I can’t think of a way to precisely determine the time required to identify an L3 miss, but estimating it as 1/2 of the L3 hit latency is probably in the right ballpark. So 24.2 clock cycles at 3.2 GHz contributes the first 7.56 ns to the latency.
  5. Once the L3 miss is confirmed, the processor can begin to set up a memory access. The core sends the load request to the “System Request Interface”, where the address is compared against various tables to determine where to send the request (local chip, remote chip, or I/O), so that the message can be prepended with the correct crossbar output address. This probably takes another few cycles, so we are up to about 9.0 ns.
  6. The load request must cross an asynchronous clock boundary on the way from the core to the memory controller, since they run at different clock frequencies. Depending on the implementation, this can add a latency of several cycles on each side of the clock boundary. An aggressive implementation might take as few as 3 cycles in the CPU clock domain plus 5 cycles in the memory controller clock domain, for a total of ~3.5 ns in the outbound direction (assuming a 3.2 GHz core clock and a 2.0 GHz NorthBridge clock).
  7. At this point the memory controller begins to do two things in parallel.  (Either of these could constitute the critical path in the latency equation, depending on the details of the chip implementation and the system configuration.)
    • probe the other caches on the chip, and
    • begin to set up the DRAM access.
  8. For the probes, it looks like four asynchronous crossings are required (requesting core to memory controller, memory controller to other core(s), other cores to memory controller, memory controller to requesting core). (Probe responses from the various cores on each chip are gathered by the chip’s memory controller and then forwarded to the requesting core as a single message per memory controller.) Again assuming 3 cycles on the source side of the interface and 5 cycles on the destination side of the interface, these four crossings take 3.5+3.1+3.5+3.1 = 13.2 ns. Each of the other cores on the chip will take a few cycles to probe its L1 and L2 caches — I will assume that this takes about 1/2 of the 15.4 cycle average L2 hit latency, so about 2.4 ns. If there is no overhead in collecting the probe response(s) from the other core(s) on the chip, this adds up to 15.6 ns from the time the System Request Interface is ready to send the request until the probe response is returned to the requesting core. Obviously the core won’t be able to process the probe response instantaneously — it will have to match the probe response with the corresponding load buffer, decide what the probe response means, and send the appropriate signal to any functional units waiting for the register that was loaded to become valid. This is probably pretty fast, especially at core frequencies, but probably kicks the overall probe response latency up to ~17ns.
  9. For the memory access path, there are also four asynchronous crossings required — requesting core to memory controller, memory controller to DRAM, DRAM to memory controller, and memory controller to core. I will assume 3.5 and 3.1 ns for the core to memory controller boundaries. If I assume the same 3+5 cycle latency for the asynchronous boundary at the DRAMs the numbers are quite high — 7.75 ns for the outbound path and 6.25 ns for the inbound path (assuming 2 GHz for the memory controller and 0.8 GHz for the DRAM channel).
  10. There is additional latency associated with the time-of-flight of the commands from the memory controller to the DRAM and of the data from the DRAM back to the memory controller on the DRAM bus. These vary with the physical details of the implementation, but typically add on the order of 1 ns in each direction.
  11. I did not record the CAS latency settings for my system, but CAS 9 is typical for DDR3/1600. This contributes 11.25 ns.
  12. On the inbound trip, the data has to cross two asynchronous boundaries, as discussed above.
  13. Most systems are set up to perform “critical word first” memory accesses, so the memory controller returns the 8 to 128 bits requested in the first DRAM transfer cycle (independent of where they are located in the cache line). Once this first burst of data is returned to the core clock domain, it must be matched with the corresponding load request and sent to the corresponding processor register (which then has its “valid” bit set, allowing the out-of-order instruction scheduler to pick any dependent instructions for execution in the next cycle.) In parallel with this, the critical burst and the remainder of the cache line are transferred to the previously chosen “victim” location in the L1 Data Cache and the L1 Data Cache tags are updated to mark the line as Most Recently Used. Again, it is hard to know exactly how many cycles will be required to get the data from the “edge” of the core clock domain into a valid register, but 3-5 cycles gives another 1.0-1.5 ns.
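
Most of the numbers in the steps above come from converting cycle counts in different clock domains to nanoseconds. The sketch below reproduces that arithmetic under the same "3 cycles on the source side, 5 cycles on the destination side" assumption; the function names are hypothetical.

```c
/* Convert a cycle count at a given clock frequency (GHz) to nanoseconds. */
double cycles_to_ns(double cycles, double ghz)
{
    return cycles / ghz;
}

/* Latency in ns of one asynchronous clock-domain crossing:
 * src_cycles spent in the source domain plus dst_cycles spent
 * in the destination domain. */
double crossing_ns(double src_cycles, double src_ghz,
                   double dst_cycles, double dst_ghz)
{
    return cycles_to_ns(src_cycles, src_ghz) + cycles_to_ns(dst_cycles, dst_ghz);
}
```

With a 3.2 GHz core, 2.0 GHz NorthBridge, and 0.8 GHz DRAM channel, `crossing_ns(3, 3.2, 5, 2.0)` is about 3.4 ns, `crossing_ns(3, 2.0, 5, 0.8)` is 7.75 ns, and `cycles_to_ns(9, 0.8)` gives the 11.25 ns CAS latency.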

The preceding steps add up all the outbound and inbound latency components that I can think of off the top of my head.

Let’s see what they add up to:

  • Core + System Request Interface: outbound: ~9 ns
  • Cache Coherence Probes: (~17 ns) — smaller than the memory access path, so probably completely overlapped
  • Memory Access Asynchronous interface crossings: ~21 ns
  • DRAM CAS latency: 11.25 ns
  • Core data forwarding: ~1.5 ns

This gives:

  • Total non-overlapped: ~43 ns
  • Measured latency: 51.6 ns
  • Unaccounted: ~9 ns = 18 memory controller clock cycles (assuming 2.0 GHz)
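
The bookkeeping above can be written down as a small sketch. The struct and function names are hypothetical; the cache coherence probe path is left out of the sum because it is assumed to be fully overlapped with the memory access path.

```c
/* Non-overlapped latency components, in nanoseconds. */
struct latency_budget {
    double core_and_sri;     /* core + System Request Interface, outbound */
    double async_crossings;  /* memory access asynchronous crossings */
    double dram_cas;         /* DRAM CAS latency */
    double data_forwarding;  /* core data forwarding */
};

/* Sum of the non-overlapped components. */
double total_ns(const struct latency_budget *b)
{
    return b->core_and_sri + b->async_crossings + b->dram_cas + b->data_forwarding;
}

/* Latency not explained by the model, expressed in clock cycles
 * at the given frequency (GHz). */
double unaccounted_cycles(double measured_ns, double modeled_ns, double clock_ghz)
{
    return (measured_ns - modeled_ns) * clock_ghz;
}
```

Filling in 9, 21, 11.25, and 1.5 ns gives a total of 42.75 ns, and the gap to the measured 51.6 ns works out to a little under 18 cycles at 2.0 GHz.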

Final Comments:

  • I don’t know how much of the above is correct, but the match to observed latency is closer than I expected when I started….
  • The inference of 18 memory controller clock cycles seems quite reasonable given all the queues that need to be checked & such.
  • I have a feeling that my estimates of the asynchronous interface delays on the DRAM channels are too high, but I can’t find any good references on this topic at the moment.

Comments and corrections are always welcome.  In my career I have found that a good way to learn is to try to explain something badly and have knowledgeable people correct me!   🙂

Posted in Computer Hardware | 4 Comments »

Optimizing AMD Opteron Memory Bandwidth, Part 5: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 11th November 2010

Single Thread, Read Only Results Comparison Across Systems

In Part1, Part2, Part3, and Part4, I reviewed performance issues for a single-thread program executing a long vector sum-reduction — a single-array read-only computational kernel — on a 2-socket system with a pair of AMD Family10h Opteron Revision C2 (“Shanghai”) quad-core processors. In today’s post, I will present the results for the same set of 15 implementations run on four additional systems.

Test Systems

  1. 2-socket AMD Family10h Opteron Revision C2 (“Shanghai”), 2.9 GHz quad-core, dual-channel DDR2/800 per socket. (This is the reference system.)
  2. 2-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
  3. 4-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
  4. 4-socket AMD Family10h Opteron 6174, Revision E0 (“Magny-Cours”), 2.2 GHz twelve-core, four-channel DDR3/1333 per socket.
  5. 1-socket AMD PhenomII 555, Revision C2, 3.2 GHz dual-core, dual-channel DDR3/1333

All systems were running TACC’s customized Linux kernel, except for the PhenomII which was running Fedora 13. The same set of binaries, generated by the Intel version 11.1 C compiler, were used in all cases.

The source code, scripts, and results are all available in a tar file: ReadOnly_2010-11-12.tar.bz2


All bandwidth values are in GB/s. “Y” marks an enabled feature and “-” a feature that is not enabled.

Code        Notes                          Vector  Large  SW        4 KiB pages  Ref System     2-socket  4-socket  4-socket     1-socket
Version                                    SSE     Page   Prefetch  accessed     (2p Shanghai)  Istanbul  Istanbul  Magny-Cours  PhenomII
Version001  “-O1”                          -       -      -         1            3.401          3.167     4.311     3.734        4.586
Version002  “-O2”                          -       -      -         1            4.122          4.035     5.719     5.120        5.688
Version003  8 partial sums                 -       -      -         1            4.512          4.373     5.946     5.476        6.207
Version004  add SW prefetch                -       -      Y         1            6.083          5.732     6.489     6.389        7.571
Version005  add vector SSE                 Y       -      Y         1            6.091          5.765     6.600     6.398        7.580
Version006  remove prefetch                Y       -      -         1            5.247          5.159     6.787     6.403        6.976
Version007  add large pages                Y       Y      -         1            5.392          5.234     7.149     6.653        7.117
Version008  split into triply-nested loop  Y       Y      -         1            4.918          4.914     6.661     6.180        6.616
Version009  add SW prefetch                Y       Y      Y         1            6.173          5.901     6.646     6.568        7.736
Version010  multiple pages/loop            Y       Y      Y         2            6.417          6.174     7.569     6.895        7.913
Version011  multiple pages/loop            Y       Y      Y         4            7.063          6.804     8.319     7.245        8.583
Version012  multiple pages/loop            Y       Y      Y         8            7.260          6.960     8.378     7.205        8.642
Version013  Version010 minus SW prefetch   Y       Y      -         2            5.864          6.009     7.667     6.676        7.469
Version014  Version011 minus SW prefetch   Y       Y      -         4            6.743          6.483     8.136     6.946        8.291
Version015  Version012 minus SW prefetch   Y       Y      -         8            6.978          6.578     8.112     6.937        8.463


There are lots of results in the table above, and I freely admit that I don’t understand all of the details. There are a couple of important patterns in the data that are instructive….

  • For the most part, the 2p Istanbul results are slightly slower than the 2p Shanghai results. This is exactly what is expected given the slightly better memory latency of the Shanghai system (74 ns vs 78 ns). The effective concurrency (Measured Bandwidth * Idle Latency) is almost identical across all fifteen implementations.
  • The 4-socket Istanbul system gets a large boost in performance from the activation of the “HT Assist” feature — AMD’s implementation of what are typically referred to as “probe filters”. By tracking potentially modified cache lines, this feature allows reduction in memory latency for the common case of data that is not modified in other caches. The local memory latency on the 4p Istanbul box is about 54 ns, compared to 78 ns on the 2p Istanbul box (where the “HT Assist” feature is not activated by default). The performance boost seen is not as large as the latency ratio, but the improvements are still large.
  • This is my first set of microbenchmark measurements on a “Magny-Cours” system, so there are probably some details that I need to learn about. Idle memory latency on the system is 56.4 ns — slightly higher than on the 4p Istanbul system (as is expected with the slower processor cores: 2.2 GHz vs 2.6 GHz), but the slow-down is worse than would be expected from straight latency ratios. Overall, however, the performance profile of the Magny-Cours is similar to that of the 4p Istanbul box, but with slightly lower effective concurrency in most of the code versions tested here. Note that the Magny-Cours system is configured with much faster DRAM: DDR3/1333 compared to DDR2/800. The similarity of the results strongly supports the hypothesis that sustained bandwidth is controlled by concurrency when running a single thread.
  • The best performance is provided by the cheapest box — a single-socket desktop system. This is not surprising given the low memory latency on the single socket system.
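
The effective concurrency used in the bullets above (measured bandwidth times idle latency) can be sketched as a one-line conversion. The function name is hypothetical, 1 GB/s is taken as 10^9 bytes per second, and the result is expressed in cache lines.

```c
/* Effective concurrency in cache lines.
 * GB/s times ns gives bytes in flight; dividing by the line size
 * converts that to a number of cache lines. */
double effective_concurrency_lines(double bandwidth_gb_per_s,
                                   double idle_latency_ns,
                                   double line_bytes)
{
    return bandwidth_gb_per_s * idle_latency_ns / line_bytes;
}
```

Using 64-byte lines, the Version012 results give about 8.4 lines on the 2p Shanghai system (7.260 GB/s at 74 ns) and about 8.5 lines on the 2p Istanbul system (6.960 GB/s at 78 ns), which is the near-identical concurrency noted in the first bullet.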

Several of the comments above refer to the “Effective Concurrency”, which I compute as the product of the measured Bandwidth and the idle memory Latency (see my earlier post for some example data). For the test cases and systems mentioned above, the effective concurrency (measured in cache lines) is presented below:

Posted in Computer Hardware | 5 Comments »

Optimizing AMD Opteron Memory Bandwidth, Part 4: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 9th November 2010

Following up on Part 1, Part 2, and Part 3, it is time to get into the ugly stuff — trying to control DRAM bank and rank access patterns and working to improve the effectiveness of the memory controller prefetcher.

Background: Banks and Ranks

The DRAM installed in the system under test consists of 2 dual-rank 2GiB DIMMs in each channel of each chip. Each “rank” is composed of 9 DRAM chips, each with 1 Gbit capacity and each driving 8 bits of the 72-bit output of the DIMM (64 bits data + 8 bits ECC). Each of these 1 Gbit DRAM chips is divided into 8 “banks” of 128 Mbits each, and each of these banks has a 1 KiB “page size”. This “DRAM page size” is unrelated to the “virtual memory page size” discussed above, but it is easy to get confused! The DRAM page size defines the amount of information transferred from the DRAM array into the “open page” buffer amps in each DRAM bank as part of the two-step (row/column) addressing used to access the DRAM memory. In the system under test, the DRAM page size is thus 8 KiB (8 DRAM chips * 1 KiB/DRAM chip), with contiguous cache lines distributed between the two DRAM channels (using a 6-bit hash function that I won’t go into here). Each DRAM chip has 8 banks, so each rank maps 128 KiB of contiguous addresses (2 channels * 8 banks * 8 KiB/bank).
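
The two sizes derived above can be checked with a short sketch. The struct and function names are hypothetical; only the 8 data-carrying chips are counted toward the page size, as in the text.

```c
#include <stdint.h>

/* DRAM organization parameters for the system under test. */
struct dram_config {
    unsigned data_chips_per_rank;  /* 8 chips carry data (a 9th carries ECC bits) */
    unsigned chip_page_bytes;      /* 1 KiB page per bank in each chip */
    unsigned banks_per_chip;       /* 8 banks */
    unsigned channels;             /* 2 DRAM channels per processor chip */
};

/* Bytes brought into the sense amps when one DRAM page is opened. */
uint64_t dram_page_bytes(const struct dram_config *c)
{
    return (uint64_t)c->data_chips_per_rank * c->chip_page_bytes;
}

/* Contiguous address range mapped by one rank across all channels and banks. */
uint64_t rank_span_bytes(const struct dram_config *c)
{
    return (uint64_t)c->channels * c->banks_per_chip * dram_page_bytes(c);
}
```

With `{8, 1024, 8, 2}` this gives an 8 KiB DRAM page and a 128 KiB rank span.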

Why does this matter?

  1. Every time a reference is made to a new DRAM page, the full 8 KiB is transferred from the DRAM array to the DRAM sense amps. This uses a fair amount of power, and it makes sense to try to read the entire 8 KiB while it is in the sense amps. Although it is beyond the scope of today’s discussion, reading data from “open pages” uses only about 1/4 to 1/5 of the power required to read the same amount of data in “closed page” mode, where only one cache line is retrieved from each DRAM page.
  2. Data in the sense amps can be accessed at significantly lower latency — this is called “open page” access (or a “page hit”), and the latency is referred to as the “CAS latency”. On most systems the CAS latency is between 12.5 ns and 15 ns, independent of the frequency of the DRAM bus. If the data is not in the sense amps (a “page miss”), loading data from the array to the sense amps takes an additional latency of 12.5 to 15 ns. If the sense amps were holding other valid data, it would be necessary to write that data back to the array, taking an additional 12.5 to 15 ns. If the DRAM latency is not completely overlapped with the cache coherence latency, these increases will reduce the sustainable bandwidth according to Little’s Law: Bandwidth = Concurrency / Latency
  3. The DRAM bus is a shared bus with multiple transmitters and multiple receivers — five of each in the system under test: the memory controller and four DRAM ranks. Every time the device driving the bus needs to be switched, the bus must be left idle for a short period of time to ensure that the receivers can synchronize with the next driver. When the memory controller switches from reading to writing this is called a “read/write turnaround”. When the memory controller switches from writing to reading this is called a “write/read turnaround”. When the memory controller switches from reading from one rank to reading from a different rank this is called a “chip select turnaround” or a “chip select stall”, or sometimes a “read-to-read” stall or delay or turnaround. These idle periods depend on the electrical properties of the bus, including the length of the traces on the motherboard, the number of DIMM sockets, and the number and type of DIMMs installed. The idle periods are only very weakly dependent on the bus frequency, so as the bus gets faster and the transfer time of a cache line gets faster, these delays become proportionately more expensive. It is common for the “chip select stall” period to be as large as the cache line transfer time, meaning that a code that performs consecutive reads from different ranks will only be able to use about 50% of the DRAM bandwidth.
  4. Although it could have been covered in the previous post on prefetching, the memory controller prefetcher on AMD Family10h Opteron systems appears to only look for one address stream in each 4 KiB region. This suggests that interleaving fetches from different 4 KiB pages might allow the memory controller prefetcher to produce more outstanding prefetches. The extra loops that I introduce to allow control of DRAM rank access are also convenient for allowing interleaving of fetches from different 4 KiB pages.
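
The 50% figure quoted for chip select stalls follows from simple arithmetic: if every cache line transfer is followed by an idle period, the fraction of time the bus carries data is transfer / (transfer + stall). A minimal sketch in C; the function names are hypothetical, and any timings passed in are illustrative values rather than measurements from the system under test:

```c
#include <stdio.h>

/* Fraction of DRAM bus time spent moving data when every cache line
 * transfer is followed by an idle (turnaround) period.
 * Both arguments are in nanoseconds. */
double bus_efficiency(double transfer_ns, double stall_ns)
{
    return transfer_ns / (transfer_ns + stall_ns);
}

/* Print the efficiency for one pair of (assumed) timings. */
void report_efficiency(double transfer_ns, double stall_ns)
{
    printf("transfer %.1f ns, stall %.1f ns -> %.0f%% of peak\n",
           transfer_ns, stall_ns,
           100.0 * bus_efficiency(transfer_ns, stall_ns));
}
```

When the stall is as long as the transfer, bus_efficiency() returns 0.5 whatever the absolute times are, which is why faster buses with unchanged turnaround times lose proportionately more.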


Starting with Version 007 (packed SSE arithmetic, large pages, no software prefetching) at 5.392 GB/s, I split the single loop over the array into a set of three loops — the innermost loop over the 64 cache lines in a 4 KiB page, a middle loop for the 32 4 KiB pages in a 128 KiB DRAM rank, and an outer loop for the (many) 128 KiB DRAM rank ranges in the full array. The resulting Version 008 loop structure looks like:

            for (k=0; k<N; k+=RANKSIZE) {
                for (j=k; j<k+RANKSIZE; j+=PAGESIZE) {
                    for (i=j; i<j+PAGESIZE; i+=8) {
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                    }
                }
            }

The resulting inner loop looks the same as before — but with the much shorter iteration count of 64 instead of 512,000:

        addpd     (%r14,%rsi,8), %xmm3 
        addpd     16(%r14,%rsi,8), %xmm2 
        addpd     32(%r14,%rsi,8), %xmm1
        addpd     48(%r14,%rsi,8), %xmm0 
        addq      $8, %rsi 
        cmpq      %rcx, %rsi  
        jl        ..B1.30

The overall performance of Version 008 drops almost 9% compared to Version 007 — 4.918 GB/s vs 5.392 GB/s — for reasons that are unclear to me.
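
For readers who want to experiment with this loop structure on a machine without SSE intrinsics, here is a scalar sketch of the same triple loop. It is not the benchmark code: the function name is hypothetical, the sizes are expressed in 8-Byte elements (512 per 4 KiB page, 32 pages per 128 KiB rank range), and it assumes the array length is a multiple of RANKSIZE, as it is for the 32,768,000-element array used here.

```c
#include <stddef.h>

/* Sizes in 8-Byte elements, matching the text. */
#define PAGESIZE  512              /* 4 KiB / 8 Bytes          */
#define RANKSIZE  (32 * PAGESIZE)  /* 128 KiB / 8 Bytes        */

/* Scalar stand-in for the triple-loop structure.
 * Assumes n is a multiple of RANKSIZE. */
double sum_triple_loop(const double *a, size_t n)
{
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;

    for (size_t k = 0; k < n; k += RANKSIZE) {                 /* 128 KiB rank ranges   */
        for (size_t j = k; j < k + RANKSIZE; j += PAGESIZE) {  /* 4 KiB pages in a rank */
            for (size_t i = j; i < j + PAGESIZE; i += 8) {     /* cache lines in a page */
                sum0 += a[i+0] + a[i+1];
                sum1 += a[i+2] + a[i+3];
                sum2 += a[i+4] + a[i+5];
                sum3 += a[i+6] + a[i+7];
            }
        }
    }
    return sum0 + sum1 + sum2 + sum3;
}
```

The loop nest visits every element exactly once and in the same order as the single loop, so any performance difference comes from loop overhead and its interaction with the hardware, not from a change in the access pattern.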

Interleaving Fetches Across Multiple 4 KiB Pages

The first set of optimizations based on Version 008 will be to interleave the accesses so that multiple 4 KiB pages are accessed concurrently. This is implemented by a technique that compiler writers call “unroll and jam”. I unroll the middle loop (the one that covers the 32 4 KiB pages in a rank) and interleave (“jam”) the iterations. This is done once for Version 013 (i.e., concurrently accessing two 4 KiB pages), three times for Version 014 (four 4 KiB pages), and seven times for Version 015 (eight 4 KiB pages).
To keep the listing short, I will just show the inner loop structure of the first of these — Version 013:

            for (k=0; k<N; k+=RANKSIZE) {
                for (j=k; j<k+RANKSIZE; j+=2*PAGESIZE) {
                    for (i=j; i<j+PAGESIZE; i+=8) {
                        x0 = _mm_load_pd(&a[i+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                        x0 = _mm_load_pd(&a[i+PAGESIZE+0]);
                        sum0 = _mm_add_pd(sum0,x0);
                        x1 = _mm_load_pd(&a[i+PAGESIZE+2]);
                        sum1 = _mm_add_pd(sum1,x1);
                        x2 = _mm_load_pd(&a[i+PAGESIZE+4]);
                        sum2 = _mm_add_pd(sum2,x2);
                        x3 = _mm_load_pd(&a[i+PAGESIZE+6]);
                        sum3 = _mm_add_pd(sum3,x3);
                    }
                }
            }

The assembly code for the inner loop looks like what one would expect:

        addpd     (%r15,%r11,8), %xmm3  
        addpd     16(%r15,%r11,8), %xmm2 
        addpd     32(%r15,%r11,8), %xmm1
        addpd     48(%r15,%r11,8), %xmm0 
        addpd     4096(%r15,%r11,8), %xmm3 
        addpd     4112(%r15,%r11,8), %xmm2
        addpd     4128(%r15,%r11,8), %xmm1 
        addpd     4144(%r15,%r11,8), %xmm0  
        addq      $8, %r11
        cmpq      %r10, %r11 
        jl        ..B1.30

Performance for these three versions improves dramatically as the number of 4 KiB pages accessed increases:

  • Version 008: one 4 KiB page accessed: 4.918 GB/s
  • Version 013: two 4 KiB pages accessed: 5.864 GB/s
  • Version 014: four 4 KiB pages accessed: 6.743 GB/s
  • Version 015: eight 4 KiB pages accessed: 6.978 GB/s

Going one step further, to 16 pages accessed in the inner loop, causes a sharp drop in performance — down to 5.314 GB/s.

So this idea of accessing multiple 4 KiB pages seems to have significant merit for improving single-thread read bandwidth. Next we see if explicit prefetching can push performance even higher.
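
To make the “unroll and jam” transformation concrete, here is a scalar sketch of the two-page case. It is an illustration under stated assumptions, not the benchmark source: the function name is hypothetical, the sizes are in 8-Byte elements, and the array length is assumed to be a multiple of RANKSIZE. The middle loop advances two pages at a time and the inner loop body touches the same cache line offset in both pages.

```c
#include <stddef.h>

#define PAGESIZE  512              /* 4 KiB page in 8-Byte elements         */
#define RANKSIZE  (32 * PAGESIZE)  /* 128 KiB rank range in 8-Byte elements */

/* Middle loop unrolled once and jammed: two 4 KiB pages are read
 * in an interleaved fashion. Assumes n is a multiple of RANKSIZE. */
double sum_two_pages(const double *a, size_t n)
{
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;

    for (size_t k = 0; k < n; k += RANKSIZE) {
        for (size_t j = k; j < k + RANKSIZE; j += 2 * PAGESIZE) {
            for (size_t i = j; i < j + PAGESIZE; i += 8) {
                const double *p = &a[i];             /* cache line in first page  */
                const double *q = &a[i + PAGESIZE];  /* same offset, second page  */
                sum0 += p[0] + p[1];
                sum1 += p[2] + p[3];
                sum2 += p[4] + p[5];
                sum3 += p[6] + p[7];
                sum0 += q[0] + q[1];
                sum1 += q[2] + q[3];
                sum2 += q[4] + q[5];
                sum3 += q[6] + q[7];
            }
        }
    }
    return sum0 + sum1 + sum2 + sum3;
}
```

Extending this to four or eight pages means stepping the middle loop by 4*PAGESIZE or 8*PAGESIZE and adding the matching groups of loads. Because 32 pages per rank range is divisible by 2, 4, and 8, no remainder loop is needed for those cases.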

Multiple 4 KiB Pages Plus Explicit Software Prefetching

A number of new versions were produced by adding software prefetching to the versions above:

  • Version 008 (single page accessed with triple loops) + SW prefetch -> Version 009
  • Version 013 (two pages accessed) + SW prefetch -> Version 010
  • Version 014 (four pages accessed) + SW prefetch -> Version 011
  • Version 015 (eight pages accessed) + SW prefetch -> Version 012

In each case the prefetch was set AHEAD of the current pointer by a distance of 0 to 1024 8-Byte elements and all results were tabulated. Unlike the initial tests with SW prefetching, the results here are more variable and less easy to understand.
Starting with Version 009, the addition of SW prefetching restores the performance loss introduced by the triple-loop structure and provides a strong additional boost.

loop structure | no SW prefetch          | with SW prefetch
single loop    | Version 007: 5.392 GB/s | Version 005: 6.091 GB/s
triple loop    | Version 008: 4.917 GB/s | Version 009: 6.173 GB/s

All of these are large page results except Version 005. The slight increase from Version 005 to Version 009 is consistent with the improvements seen in adding large pages from Version 006 to Version 007, so it looks like adding the SW prefetch negates the performance loss introduced by the triple-loop structure.

Combining SW prefetching with interleaved 4 KiB page access produces some intriguing patterns.

Version 010, fetching 2 4 KiB pages concurrently:
[Figure: Version 010 bandwidth]

Version 011, fetching 4 4 KiB pages concurrently:
[Figure: Version 011 bandwidth]

Version 012, fetching 8 4 KiB pages concurrently:
[Figure: Version 012 bandwidth]

When reviewing these figures, keep in mind that the baseline performance number is the 6.173 GB/s from Version 009, so all of the results are good. What is most unusual about these results (speaking from ~25 years of experience in performance analysis) is that the last two figures show some fairly strong upward spikes in performance. It is entirely commonplace to see decreases in performance due to alignment issues, but it is quite rare to see increases in performance that depend sensitively on alignment and yet remain repeatable and usable.

Choosing a range of optimum AHEAD distances for each case allows me to summarize:

Pages Accessed in Inner Loop | no SW prefetch          | with SW prefetch
1                            | Version 007: 5.392 GB/s | Version 009: 6.173 GB/s
2                            | Version 013: 5.864 GB/s | Version 010: 6.417 GB/s
4                            | Version 014: 6.743 GB/s | Version 011: 7.063 GB/s
8                            | Version 015: 6.978 GB/s | Version 012: 7.260 GB/s

It looks like this is about as far as I can go with a single thread. Performance started at 3.401 GB/s and gradually increased to 7.260 GB/s — an increase of 113%. The resulting code is not terribly long, but it is important to remember that I only implemented the case that starts on a 128 KiB boundary (which is necessary to ensure that the accesses to multiple 4 KiB pages are all in the same rank). Extra prolog and epilog code will be required for a general-purpose sum reduction routine.

So why did I spend all of this time? Why not just start by running multiple threads to increase performance?

First, the original code was slow enough that even perfect scaling across four threads would barely provide enough concurrency to reach the 12.8 GB/s read bandwidth of the chip.
Second, the original code is not set up to allow control over which pages are being accessed. Using multiple threads that are accessing different ranks will significantly increase the number of chip-select stalls, and may not produce the desired level of performance. (More on this soon.)
Third, the system under test has only two DDR2/800 channels per chip, which is not exactly state of the art. Newer systems have two or four channels of DDR3 at 1333 or even 1600 MT/s. Given similar memory latencies, Little’s Law tells us that these systems will only deliver more sustained bandwidth if they are able to maintain more outstanding cache misses. The experiments presented here should provide some insights into how to structure code to increase the memory concurrency.

But that does not mean that I won’t look at the performance of multi-threaded implementations — that is coming up anon…


Optimizing AMD Opteron Memory Bandwidth, Part 3: single-thread, read-only

Posted by John D. McCalpin, Ph.D. on 9th November 2010

Following up on Part 1 and Part 2, it is time to look at adding explicit prefetching to try to increase read bandwidth.

About Prefetching

The AMD Opteron Family10h processors have two different “hardware” prefetch mechanisms, and also allow “software” prefetch instructions. The “core prefetcher” is (as the name implies) located in the processor core, and monitors L1 Data Cache misses. (There is a similar prefetch engine for Instruction Cache misses, but that is not today’s topic.) This “core prefetcher” monitors a large number of data access streams and prefetches along detected address streams consisting of contiguous ascending or descending cache line addresses. I believe that the core prefetcher only fetches one cache line ahead of the most recent load to each stream, waiting until a cache line is returned from memory before sending out the next request, so it will not “automagically” create enough prefetches to fill the memory pipeline.
(Note that these sorts of details are hard to tease out of vendor documentation, and are subject to frequent change.) The core prefetcher is operational in all of the results presented here, with many of the code modifications intended to make it operate more effectively.
The Opteron Family10h processors also have a “Memory Controller Prefetcher” that prefetches from DRAM to special buffers in the memory controller. This can reduce the latency for subsequent memory accesses, and therefore increase the bandwidth available with a fixed level of concurrency. This part is really important, so I will repeat it:

Bandwidth = Concurrency / Latency

The concurrency is limited by the number of buffers available for outstanding cache misses (8 per core, in this case), while the latency is determined by where the data ends up getting found. In the system under test the latency is actually bounded below by the time required to probe the other caches in the system, not by the time required to obtain the data from the DRAMs. In the first generation of AMD Opteron Family10h processors (Revisions B0 & B3), the memory controller prefetcher simply read the data from the DRAMs to a set of buffers in the memory controller. In these systems the cache coherency check was not initiated until the processor actually sent a load for a particular cache line to the memory controller. Unfortunately in this case the time required to check the other caches to see if they had a copy of the line was considerably longer than the time required to get the data from the DRAM, so getting the data from the DRAM earlier did not help at all — it just meant that the processor had to wait longer after receiving the data before it could use it. In Revision C of the AMD Opteron Family10h processors, the memory controller prefetcher was enhanced to make “coherent” prefetches — the memory controller began the coherence transaction when it performed the prefetch. In the best case the coherence transaction will be complete by the time the load request from the processor arrives, which significantly reduces the latency observed by the processor. From the formula above, it should be clear that for a fixed level of concurrency, the only way to increase sustained bandwidth is to reduce the effective latency. In Revisions D and E of the AMD Opteron Family10h processors, this “coherent prefetcher” is retained with some additional improvements.

Finally we get to “Software Prefetch”. The AMD64/Intel64 instruction set includes a set of explicit prefetch instructions. There are two versions that matter, the “PREFETCHT0” instruction and the “PREFETCHNTA” instruction. The former fetches data from memory much like a load, while the latter fetches data from memory and marks it as “non-temporal”, meaning that it is unlikely to be reused. For Family10h Opterons “non-temporal” fetches go into the L1 cache but when they are chosen to be replaced in the L1 Cache, they are simply dropped (if clean) rather than being sent to the L2 Cache as “victims”. This allows the L2 Cache to be used more effectively for data that is likely to be reused. Since the array used in this benchmark is much larger than the cache, I could use the PREFETCHNTA instruction when explicitly prefetching data. The choice of prefetch instruction does not make much difference in performance here, though the non-temporal version can sharply reduce performance if the code does end up reusing the data, so in these examples I use PREFETCHT0 just as a habit.

Different revisions of hardware treat software prefetches differently — again it is hard to obtain this information from vendor documentation. There are several issues relating to software prefetch behavior that can influence how you want to use them:

  1. Do software prefetches cause access violations if the address is out-of-bounds?
    No they do not. This makes them easier to use since you can prefetch beyond the end of an array without worrying about access violations.
  2. Do software prefetches trigger the hardware page table walker in the event of a TLB miss?
    Yes, on Family10h Opterons. This is usually a good thing, though there is only one hardware page table walker per core. The SW prefetch will only cause a hardware table walk: if the page table entry is not found by the hardware table walker, the request will be silently dropped. This is different from a load, which will cause a trap to O/S software if the page table entry is not found by the hardware table walker (for example if the page has been swapped to disk).
  3. Do the addresses used in software prefetches trigger the hardware prefetchers like loads do?
    In earlier Opterons I think that the answer was “no”, and in Family10h Opterons I think that the answer is “yes”, but it does not matter in this particular test case.
  4. Do software prefetches combine in the cache miss buffers with load misses, or do they allocate separate buffers?
    For AMD Family10h processors the Miss Address Buffers combine all accesses to the same cache line, whether due to hardware prefetches, load misses, or software prefetches. This makes it less critical that the code avoid issuing both load misses and SW prefetches to the same cache line.
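
The first point in the list above (prefetches do not fault) is what lets a loop prefetch a fixed distance ahead without any special handling near the end of the array. A minimal sketch, assuming a GCC- or Clang-compatible compiler: it uses the compiler builtin __builtin_prefetch as a stand-in for the intrinsic, and the function name and the "ahead" parameter (in 8-Byte elements) are hypothetical. The target address is computed as an integer so that forming an address past the end of the array is well defined in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum n doubles (n assumed to be a multiple of 8), issuing one
 * software prefetch per cache line, "ahead" elements in front of
 * the current position. Near the end of the array the prefetch
 * target lies beyond the array; the prefetch does not fault. */
double sum_with_prefetch(const double *a, size_t n, size_t ahead)
{
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;

    for (size_t i = 0; i < n; i += 8) {
        /* Integer arithmetic avoids out-of-bounds pointer arithmetic. */
        uintptr_t target = (uintptr_t)a + (i + ahead) * sizeof(double);
        /* rw = 0 (read), locality = 3 (keep in all cache levels). */
        __builtin_prefetch((const void *)target, 0, 3);

        sum0 += a[i+0] + a[i+4];
        sum1 += a[i+1] + a[i+5];
        sum2 += a[i+2] + a[i+6];
        sum3 += a[i+3] + a[i+7];
    }
    return sum0 + sum1 + sum2 + sum3;
}
```

Only the prefetch is allowed to run past the end; the loads themselves stay within bounds. Whether the prefetch is dropped or triggers a page table walk in that case depends on the hardware, as described in the second point.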

Prefetching Experiments

I repeated most of the previous experiments with explicit software prefetch instructions added. In each case I varied the distance between the current loop pointer and the target of the prefetch from 0 to 1024 8-Byte elements. The prefetch instructions were added using yet another compiler intrinsic function executed once per loop iteration (which explains why I prefer to unroll the inner loop to handle a cache line at a time). The inner loop of Version 003 (with 4 scalar variables as partial sums) then becomes Version 004:

            for (i=0; i<N; i+=8) {
                _mm_prefetch((char *)&a[i+AHEAD],_MM_HINT_T0);
                sum0 += a[i+0];
                sum1 += a[i+1];
                sum2 += a[i+2];
                sum3 += a[i+3];
                sum0 += a[i+4];
                sum1 += a[i+5];
                sum2 += a[i+6];
                sum3 += a[i+7];
            }

and the resulting assembly code for the inner loop is as expected:

        addsd     a(%rdx), %xmm3 
        addsd     8+a(%rdx), %xmm2
        addsd     16+a(%rdx), %xmm1
        addsd     24+a(%rdx), %xmm0
        addsd     32+a(%rdx), %xmm3
        addsd     40+a(%rdx), %xmm2 
        addsd     48+a(%rdx), %xmm1
        addsd     56+a(%rdx), %xmm0  
        prefetcht0 (%rax)
        addq      $64, %rax 
        addq      $64, %rdx 
        addq      $8, %rcx  
        cmpq      $32768000, %rcx 
        jl        ..B1.10

The performance of Version 004 is now dependent on the AHEAD distance, as shown in Figure 1.

Comparing to Version 003, the explicit software prefetching increases the bandwidth dramatically, from ~4.5 GB/s to over 6.0 GB/s. Without trying to understand the details, it is clear that it helps a great deal to prefetch at least 96 elements ahead, with 96 elements = 768 Bytes = 12 cache lines. There are nice wide ranges that show steady levels of performance, with (for example) the average of AHEAD=416 to AHEAD=448 coming to 6.083 GB/s. Given the 74 ns nominal memory latency, this corresponds to an effective concurrency of 450 Bytes, or slightly over 7.0 cache lines. Note that this is approaching the maximum value that a single thread should be able to attain unless the memory controller prefetcher is able to significantly reduce the effective memory latency.
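
The concurrency arithmetic in the paragraph above is Little's Law rearranged (Concurrency = Bandwidth × Latency). A small sketch that reproduces it; the function names are hypothetical, and the inputs are the 6.083 GB/s, 74 ns, and 8 miss buffers quoted in the text:

```c
#include <stdio.h>

#define CACHE_LINE_BYTES 64.0

/* Concurrency in Bytes = bandwidth * latency.
 * GB/s times ns gives Bytes directly (1e9 * 1e-9 = 1). */
double concurrency_bytes(double bandwidth_gbs, double latency_ns)
{
    return bandwidth_gbs * latency_ns;
}

/* Bandwidth in GB/s for a given number of cache lines in flight. */
double bandwidth_gbs(double lines_in_flight, double latency_ns)
{
    return lines_in_flight * CACHE_LINE_BYTES / latency_ns;
}

/* Print the two numbers discussed in the text. */
void print_littles_law(void)
{
    double bytes = concurrency_bytes(6.083, 74.0);
    printf("effective concurrency: %.1f Bytes = %.2f cache lines\n",
           bytes, bytes / CACHE_LINE_BYTES);
    printf("ceiling with 8 lines in flight: %.2f GB/s\n",
           bandwidth_gbs(8.0, 74.0));
}
```

With these inputs the effective concurrency comes to about 450 Bytes (7.03 cache lines), and 8 outstanding cache lines at 74 ns would sustain about 6.92 GB/s. Anything faster than that from a single thread implies an effective latency below the nominal 74 ns.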

Combining the explicit software prefetching of Version 004 with the packed double SSE of Version 003 produces Version 005. Unfortunately the performance of Version 004 and Version 005 is essentially identical — the reduction in pipeline latency provided in Version 005 overlaps with the improvement in latency due to software prefetching, producing no additional gain.

Putting Data on Large Pages

The AMD Opteron Family10h processors support a standard virtual memory page size of 4 KiB with large page sizes of 2 MiB and 1 GiB. Most versions of Linux support only one option for large page sizes, typically the 2 MiB version. I configured the system under test to reserve 512 large pages behind each chip and modified the benchmark to use these large pages. Some compilers (notably the Open64 compilers) have compile/link options to put the data on large pages, but I prefer doing it a bit more manually using shared memory segments. This has the advantage of portability across all compilers and can be switched back to the default page size by a simple change to the parameters to the shmget() call. One slightly tricky issue is that when using large pages the size requested in the shmget() call needs to be rounded up to the nearest multiple of the page size. The code to allocate the array on large pages looks like:

    i = total/(2*1024*1024);                      // how many full 2MiB pages in "total" bytes?
    sum = ceil((double)total/(2.*1024.*1024.));   // round up to next integer if needed
    j = (int) sum;                                // ceil() returns a double -- convert to integer
    SEGSIZE = j*(2*1024*1024);                    // now the SEGSIZE is the smallest multiple of 2MiB needed to hold "total" bytes

    shmida = shmget(IPC_PRIVATE,SEGSIZE,IPC_CREAT|SHM_HUGETLB|0666);   // simply eliminate the SHM_HUGETLB to get the default page size
    a = (double *) shmat (shmida, NULL, 0);                            // *real* code should check for error returns on both the shmget and shmat calls!

Taking the packed double SSE Version 006 and putting the data on large pages gives us Version 007. This version does not include software prefetching. Performance is improved slightly by the use of large pages, from 5.247 GB/s (Version 006) to 5.392 GB/s (Version 007) — a bit under 3% improvement. Not to worry — the main goal of using large pages is not to directly improve performance, but to allow control over which DRAM banks and ranks are being accessed.
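
As noted in the comments of the allocation code, real code should check the return values. A hedged sketch of what that might look like on Linux: the function names are hypothetical, the rounding uses integer arithmetic instead of ceil(), and the fallback definition of SHM_HUGETLB is an assumption for builds where the headers do not expose it. The allocation only succeeds if large pages have been reserved on the system.

```c
#define _GNU_SOURCE            /* exposes SHM_HUGETLB in glibc headers */
#include <stdio.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000      /* Linux value; assumed if headers hide it */
#endif

#define LARGE_PAGE_BYTES (2UL * 1024UL * 1024UL)

/* Smallest multiple of 2 MiB that holds "total" bytes. */
size_t round_up_to_large_page(size_t total)
{
    return ((total + LARGE_PAGE_BYTES - 1) / LARGE_PAGE_BYTES) * LARGE_PAGE_BYTES;
}

/* Allocate "total" bytes on 2 MiB pages via a shared memory segment.
 * Returns NULL (after printing the reason) if either call fails. */
double *alloc_on_large_pages(size_t total)
{
    size_t segsize = round_up_to_large_page(total);

    int shmid = shmget(IPC_PRIVATE, segsize, IPC_CREAT | SHM_HUGETLB | 0666);
    if (shmid == -1) {
        perror("shmget");
        return NULL;
    }

    void *p = shmat(shmid, NULL, 0);
    if (p == (void *) -1) {
        perror("shmat");
        shmctl(shmid, IPC_RMID, NULL);   /* do not leak the segment */
        return NULL;
    }
    return (double *) p;
}
```

Dropping SHM_HUGETLB from the flags gives the default page size, exactly as in the original snippet.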
