John McCalpin's blog

Dr. Bandwidth explains all….

Memory Bandwidth Requirements of the HPL benchmark

Posted by John D. McCalpin, Ph.D. on September 11, 2014

The High Performance LINPACK (HPL) benchmark is well known for delivering a high fraction of peak floating-point performance. The (historically) excellent scaling of performance as the number of processors is increased and as the frequency is increased suggests that memory bandwidth has not been a performance limiter.

But this does not mean that memory bandwidth will *never* be a performance limiter. The algorithms used by HPL have lots of data re-use (both in registers and from the caches), but the data still has to go to and from memory, so the bandwidth requirement is not zero, which means that at some point in scaling the number of cores or frequency or FP operations per cycle, we are going to run out of the available memory bandwidth. The question naturally arises: “Are we (almost) there yet?”

Using Intel’s optimized HPL implementation, a medium-sized (N=18000) 8-core (single socket) HPL run on a Stampede compute node (3.1 GHz, 8 cores/chip, 8 FP ops/cycle) showed about 15 GB/s sustained memory bandwidth at about 165 GFLOPS. This level of bandwidth utilization should be no trouble at all (even when running on two sockets), given the 51.2 GB/s peak memory bandwidth (~38 GB/s sustainable) on each socket.

But if we scale this to the peak performance of a new Haswell EP processor (e.g., 2.6 GHz, 12 cores/chip, 16 FP ops/cycle), it suggests that we will need about 40 GB/s of memory bandwidth for a single-socket HPL run and about 80 GB/s of memory bandwidth for a 2-socket run. A single Haswell chip can only deliver about 60 GB/s sustained memory bandwidth, so the latter value is a problem, and it means that we expect LINPACK on a 2-socket Haswell system to require attention to memory placement.

A colleague here at TACC ran into this while testing on a 2-socket Haswell EP system. Running in the default mode showed poor scaling beyond one socket. Running the same binary under “numactl —interleave=0,1” eliminated most (but not all) of the scaling issues. I will need to look at the numbers in more detail, but it looks like the required chip-to-chip bandwidth (when using interleaved memory) may be slightly higher than what the QPI interface can sustain.

Just another reminder that overheads that are “negligible” may not stay that way in the presence of exponential performance growth.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Algorithms, Performance | Tagged: , , , | Comments Off on Memory Bandwidth Requirements of the HPL benchmark

Counting Stall Cycles on the Intel Sandy Bridge Processor

Posted by John D. McCalpin, Ph.D. on June 4, 2014

Intuition might suggest that defining what a “stall cycle” is on a processor should be relatively straightforward. For some processors, this
is actually the case — particularly in-order processors with a very small number of execution units and a very small number of non-pipelined
instructions. For modern out-of-order processors, coming up with a precise and quantitative definition of “stall” involves numerous subtleties,
and deriving a methodology to measure such stalls is even more difficult.

This week I did some testing of “stalls” using the hardware performance counters in the Intel Xeon E5-2680 (“Sandy Bridge EP”) processors in
the Stampede system at TACC.

I found performance counter events that count stalls at two different places in the processor pipeline (with a third mentioned below, but not tested here):

  • Two events count cycles in which uops are not sent from the RAT (Register Alias Table — the register renaming unit) to the RS (Reservation Station —
    queues uops until the instructions defining their source operands have been dispatched, then dispatches “ready” uops to the execution ports)

    1. Event 0x0E, Umask 0x01: UOPS_ISSUED with the CMASK and INVERT flags: 0x01c3010e
      • Intel’s VTune calls this UOPS_ISSUED.STALL_CYCLES
    2. Event 0xA2, Umask 0x01: RESOURCE.STALLS.ANY
      • Consistently delivers values about 1% to 3% lower than the UOPS_ISSUED.STALL_CYCLES event in my tests.
  • Two events count cycles in which no uops are dispatched from the RS to any of the execution units (aka “ports”).
    1. Event 0xA3, Umask 0x04: CYCLE_ACTIVITY.CYCLES_NO_DISPATCH with CMASK=4: 0x044304a3
      • I got the CMASK value from VTUNE — the documentation in Vol 3 of the SW Developer’s Guide is not very helpful.
    2. Event 0xB1, Umask 0x02: UOPS_DISPATCHED.STALL_CYCLES_CORE: 0x01c302b1
      • This is very similar to an event used by VTune, but I use Umask 0x02 rather than 0x01. This will only make a difference on a system with
        HyperThreading enabled, and I don’t have any systems configured that way to test right now.
      • These two events differed by no more than a part per million in my tests.

As discussed in the Intel forum thread (link), the first two events can easily overcount stalls in codes that have a “stall-free” IPC of less than 4. For example, a code with a “stall-free” IPC of 1 could show 75% stall cycles using these events, with uops transferred from the RAT to the RS in one block of 4 uops every 4 cycles (leaving 3 cycles idle).

The second two events typically undercount stalls because they consider a cycle to be a “non-stall” cycle if any uops are dispatched from the RS to the execution units, even when those uops subsequently get rejected and retried because their input data is not in the cache. Using the STREAM benchmark as my test case, I often saw that the total number of uops dispatched to the execution ports was 20%-50% higher than the number of uops issued from the RAT to the RS. (This was based on a small number of test cases which were not intended to approach the upper bound on uop retries, so I assume that the worst case fraction of retries could be much higher. I have seen retries of floating-point instructions exceeding 12x, and that was not intended to be a worst-case upper bound either.)

Unfortunately, there is no way to count these execution retries directly, and no way to determine how many cycles had instructions dispatched that were all rejected and retried.

Note that one can also count cycles in which no instructions are retired. This was also discussed in the forum thread above, and has the same theoretical problem as counting at issue — the processor can retire at least four instructions per cycle, so if the non-stalled IPC is less than four, burstiness of instruction retirement can result in non-zero stall cycle counts even if there are some instructions executing every cycle.

None of this discussion so far has explicitly dealt with the cause of the stalls. Intel provides a very interesting performance counter event that provides some insight into this issue. Event 0xA3 CYCLE_ACTIVITY has Umasks for “CYCLES_L2_PENDING” (0x01) and “CYCLES_NO_DISPATCH” (0x04). Again, the documentation in Vol 3 of the SW developer’s guide is not adequate to understand how to program this unit, but fortunately Intel’s VTune provides an example. The VTune event CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING is created with this event by combining the two Umasks and including a CMASK value of 5, giving the encoding: 0x054305a3. (It is not at all clear why the CMASK value should be 5 in this case, but the event is clearly non-standard since the combined Umask values are treated as a logical AND rather than the logical OR typically assumed for combined Umasks.)

In experiments with the STREAM benchmark, where the actual number of stall cycles should be around 90%, the values produced by CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING varied between 30% and 93% of the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH counts (without the L2_PENDING qualifier). The lower values were seen with tests using streaming (nontemporal) stores, while the higher values were seen using ordinary (allocating) stores. This pattern makes it clear that this event counts store misses (RFO’s) in the “L2_PENDING” category, but it leaves a “hole” in the memory stall cycle identification in the case where the memory stalls are due to streaming stores.

  • For AVX codes there is an event that catches this reasonably well: Event 0xA2, Umask 0x08: RESOURCE_STALLS.SB (cycles with no issue from the RAT to the RS because the store buffers are full) shows 70%-91% of the total cycles have issue stalls due to full store buffers. So looking at the max of CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING and RESOURCE_STALLS.SB gives a good indication of stalls due to memory for codes with either allocating stores or streaming stores.
  • For SSE codes with streaming stores the RESOURCE_STALLS.SB event is only 20%-37% of the total cycles. Even if you add the percentage stalls from this number to the percentage stalls using CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING you only get 45% – 59% of the total cycles, so I don’t yet have a set of events that can identify that all of the stall cycles are actually memory stalls. (Adding stall cycles in this way is not generally a good idea, since cycles can be stalled for both reasons. I only add the two here to show that they are both much too small to account for all of the stall cycles.)

Summary: There are resources available to help identify memory-related stall cycles, but they are not as precise as one might like. In most cases these counters can identify when memory stalls are dominating execution time, and this is really what a performance analyst is looking for. Once the problem is identified, tuning work is primarily based on execution time of the code section of interest, with hardware performance counters playing (at most) an advisory role.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Counting Stall Cycles on the Intel Sandy Bridge Processor

Memory Bandwidth on Xeon Phi (Knights Corner)

Posted by John D. McCalpin, Ph.D. on December 5, 2013

A Quick Note

There are a lot of topics that could be addressed here, but this short note will focus on bandwidth from main memory (using the STREAM benchmark) as a function of the number of threads used.

Published STREAM Bandwidth Results

  • Official STREAM submission at:
  • Compiled with icc -mmic -O3 -openmp -DNTIMES=100 -DSTREAM_ARRAY_SIZE=64000000 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -opt-streaming-stores always stream_5-10.c -o stream_intelopt.100x.mic
  • Configured with an array size of 64 million elements per array and 10 iterations.
  • Run with 60 threads (bound to separate physical cores) and Transparent Huge Pages.


Function Best Rate MB/s Avg time (sec) Min time (sec) Max time (sec)
Copy 169446.8 0.0062 0.0060 0.0063
Scale 169173.1 0.0062 0.0061 0.0063
Add 174824.3 0.0090 0.0088 0.0091
Triad 174663.2 0.0089 0.0088 0.0091

Memory Controllers

The Xeon Phi SE10P has 8 memory controllers, each controlling two 32-bit channels.  Each 32-bit channel has two GDDR5 chips, each having a 16-bit-wide interface.   Each of the 32 GDDR5 DRAM chips has 16 banks.  This gives a *raw* total of 512 DRAM banks.  BUT:

  • The two GDDR5 chips on each 32-bit channel are operating in “clamshell” mode — emulating a single GDDR5 chip with a 32-bit-wide interface.  (This is done for cost reduction — two 2 Gbit chips with x16 interfaces were presumably a cheaper option than one 4 Gbit chip with a x32 interface).  This reduces the effective number of DRAM banks to 256 (but the effective bank size is doubled from 2KiB to 4 KiB).
  • The two 32-bit channels for each memory controller operate in lockstep — creating a logical 64-bit interface.  Since every cache line is spread across the two 32-bit channels, this reduces the effective number of DRAM banks to 128 (but the effective bank size is doubled again, from 4 KiB to 8 KiB).

So the Xeon Phi SE10P memory subsystem should be analyzed as a 128-bank resource.   Intel has not disclosed the details of the mapping of physical addresses onto DRAM channels and banks, but my own experiments have shown that addresses are mapped to a repeating permutation of the 8 memory controllers in blocks of 62 cache lines.  (The other 2 cache lines in each 64-cacheline block are used to hold the error-correction codes for the block.)

Bandwidth vs Number of Data Access STREAM

One “rule of thumb” that I have found on Xeon Phi is that memory-bandwidth-limited jobs run best when the number of read streams across all the threads is close to, but less than, the number of GDDR5 DRAM banks.  On the Xeon Phi SE10P coprocessors in the TACC Stampede system, this is 128 (see Note 1).    Some data from the STREAM benchmark supports this hypothesis:

Kernel Reads Writes 2/core 3/core 4/core
Copy 1 1 -0.8% -5.2% -7.3%
Scale 1 1 -1.0% -3.3% -6.7%
Add 2 1 -3.1% -12.0% -13.6%
Triad 2 1 -3.6% -11.2% -13.5%

From these results you can see that the Copy and Scale kernels have about the same performance with either 1 or 2 threads per core (61 or 122 read streams), but drop 3%-7% when generating more than 128 address streams, while the Add and Triad kernels are definitely best with one thread per core (122 read streams), and drop 3%-13% when generating more than 128 address streams.

So why am I not counting the write streams?

I found this puzzling for a long time, then I remembered that the Xeon E5-2600 series processors have a memory controller that supports multiple modes of prioritization.  The default mode is to give priority to reads while buffering stores.  Once the store buffers in the memory controller reach a “high water mark”, the mode shifts to giving priority to the stores while buffering reads.  The basic architecture is implied by the descriptions of the “major modes” in section 2.5.8 of the Xeon E5-2600 Product Family Uncore Performance Monitoring Guide (document 327043 — I use revision 001, dated March 2012).      So *if* Xeon Phi adopts a similar multi-mode strategy, the next question is whether the duration in each mode is long enough that the open page efficiency is determined primarily by the number of streams in each mode, rather than by the total number of streams.   For STREAM Triad, the observed bandwidth is ~175 GB/s.  Combining this with the observed average memory latency of about 275 ns (idle) means that at least 175*275=48125 bytes need to be in flight at any time.  This is about 768 cache lines (rounded up to a convenient number) or 96 cache lines per memory controller.  For STREAM Triad, this corresponds to an average of 64 cache line reads and 32 cache line writes in each memory controller at all times.   If the memory controller switches between “major modes” in which it does 64 cache line reads (from two read streams, and while buffering writes) and 32 cache line writes (from one write stream, and while buffering reads), the number of DRAM banks needed at any one time should be close to the number of banks needed for the read streams only….

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Architecture, Performance | Tagged: , , , , | Comments Off on Memory Bandwidth on Xeon Phi (Knights Corner)

Notes on the mystery of hardware cache performance counters

Posted by John D. McCalpin, Ph.D. on July 14, 2013

In response to a question on the PAPI mailing list, I scribbled some notes to try to help users understand the complexity of hardware performance counters for cache accesses and cache misses, and thought they might be helpful here….

For any interpretation of specific hardware performance counter events, it is absolutely essential to precisely specify the processor that you are using.

Cautionary Notes

Although it may not make a lot of sense, the meanings of “cache miss” and “cache access” are almost always quite different across different vendors’ CPUs, and can be quite different for different CPUs from the same vendor. It is actually rather *uncommon* for L1 cache misses to match L2 cache accesses, for a variety of reasons that are difficult to summarize concisely.

Some examples of behavior that could make the L1 miss counter larger than the L2 access counter:

  • If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch.
  • L1 caches (both data and instruction) typically have hardware prefetch engines. The L1 Icache miss counter may only be incremented when the instruction fetcher requests data that is not found in the L1 Icache, while the L2 cache access counter may be incremented every time the L2 receives either an L1 Icache miss or an L1 Icache prefetch.
  • The processor may attempt multiple instruction fetches of different addresses in the same cache line. The L1 Icache miss event might be incremented on each of these fetch attempts, while the L2 cache access counter might only be incremented once for the cache line request.
  • The processor may be fetching data that is not allowed to be cached in the L2 cache, such as ROM-resident code. It may not be allowed in the L1 Instruction cache either, so every instruction fetch would miss in the L1 cache (because it is not allowed to be there), then bypass access to the L2 cache (because it is not allowed to be there), then get retrieved directly from memory. (I don’t know of any specific processors that work this way, but it is certainly plausible.)

An example of behavior that could make the L1 miss counter smaller than the L2 access counter: (this is a very common scenario)

  • The L1 instruction cache miss counter might be incremented only once when an instruction fetch misses in the L1 Icache, while the L2 cache might be accessed repeatedly until the data actually arrives in the L2. This is especially common in the case of L2 cache misses — the L1 Icache miss might request data from the L2 dozens of times before it finally arrives from memory.

A Recommended Procedure

Given the many possible detailed meanings of such counters, the procedure I use to understand the counter events is:

  1. Identify the processor in detail.
    This includes vendor, family, model, and stepping.
  2. Determine the precise mapping of PAPI events to underlying hardware events.
    (This is irritatingly difficult on Linux systems that use the “perf-events” subsystem — that is a long topic in itself.)
  3. Look up the detailed descriptions of the hardware events in the vendor processor documentation.
    For AMD, this is the relevant “BIOS and Kernel Developers Guide” for the processor family.
    For Intel, this Volume 3 of the “Intel 64 and IA-32 Architecture Software Developer’s Guide”.
  4. Check the vendor’s published processor errata to see if there are known bugs associated with the counter events in question.
    For AMD these documents are titled “Revision Guide for the AMD Family [nn] Processors”.
    For Intel these documents are usually given a title including the words “Specification Update”.
  5. Using knowledge of the cache sizes and associativities, build a simple test code whose behavior should be predictable by simple paper-and-pencil analysis.
    The STREAM Benchmark is an example of a code whose data access patterns and floating point operation counts are easy to determine and easy to modify.
  6. Compare the observed performance counter results for the simple test case with the expected results and try to work out a model that bridges between the two.
    The examples of different ways to count given at the beginning of this note should be very helpful in attempting to construct a model.
  7. Decide which counters are “close enough” to be helpful, and which counters cannot be reliably mapped to performance characteristics of interest.

An example of a counter that (probably) cannot be made useful

As an example of the final case — counters that cannot be reliably mapped to performance characteristics of interest — consider the floating point instruction counters on the Intel “Sandy Bridge” processor series. These counters are incremented on instruction *issue*, not on instruction *execution* or instruction *retirement*. If the inputs to the instruction are not “ready” when the instruction is *issued*, the instruction issue will be rejected and the instruction will be re-issued later, and may be re-issued many times before it is finally able to execute. The most common cause for input arguments to not be “ready” is that they are coming from memory and have not arrived in processor registers yet (either explicit load instructions putting data in registers or implicit register loads via memory arguments to the floating-point arithmetic instruction itself).

For a workload with a very low cache miss rate (e.g., DGEMM), the “overcounting” of FP instruction issues relative to the more interesting FP instruction execution or retirement can be as low as a few percent. For a workload with a high cache miss rate (e.g., STREAM), the “overcounting” of FP instructions can be a factor of 4 to 6 (perhaps worse), depending on how many cores are in use and whether the memory accesses are fully localized (on multi-chip platforms). In the absence of detailed information about the processor’s internal algorithm for retrying operations, it seems unlikely that this large overcount can be “corrected” to get an accurate estimate of the number of floating-point operations actually executed or retired. The amount of over-counting will likely depend on at least the following factors:

  • the instruction retry rate (which may depend on how many instructions are available for attempted issue in the processor’s reorder buffer, including whether or not HyperThreading is enabled),
  • the instantaneous frequency of the processor (which can vary from 1.2 GHz to 3.5 GHz on the Xeon E5-2670 “Sandy Bridge” processors in the TACC “Stampede” system),
  • the detailed breakdown of latency for the individual loads (i.e., the average latency may not be good enough if the retry rate is not fixed),
  • the effectiveness of the hardware prefetchers at getting the data into the data before it is needed (which, in turn, is a function of the number of data streams, the locality of the streams, the contention at the memory controllers)

There are likely other applicable factors as well — for example the Intel “Sandy Bridge” processors support several mechanisms that allow the power management unit to bias behavior related to the trade-off of performance vs power consumption. One mechanism is referred to as the “performance and energy bias hint”, and is described as as a “hint to guide the hardware heuristic of power management features to favor increasing dynamic performance or conserve energy consumption” (Intel 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 14.3.4, Document 325384-047US, June 2013). Another mechanism (apparently only applicable to “Sandy Bridge” systems with integrated graphics units) is a pair of “policy” registers (MSR_PP0_POLICY and MSR_PP1_POLICY) that define the relative priority of the processor cores and the graphics unit in dividing up the chip’s power budget. The specific mechanisms by which these features work, and the detailed algorithms used to control those mechanisms, are not publicly disclosed — but it seems likely that at least some of the mechanisms involved may impact the floating-point instruction retry rate.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Hardware, Performance, Performance Counters | Tagged: , | Comments Off on Notes on the mystery of hardware cache performance counters

Coherence with Cached Memory-Mapped IO

Posted by John D. McCalpin, Ph.D. on May 30, 2013

In response to my previous blog entry, a question was asked about how to manage coherence for cached memory-mapped IO regions.   Here are some more details…

Maintaining Coherence with Cached Memory-Mapped IO

For the “read-only” range, cached copies of MMIO lines will never be invalidated by external traffic, so repeated reads of the data will always return the cached copy.   Since there are no external mechanisms to invalidate the cache line, we need a mechanism that the processor can use to invalidate the line, so the next load to that line will go to the IO device and get fresh data.

There are a number of ways that a processor should be able to invalidate a cached MMIO line.  Not all of these will work on all implementations!

  1. Cached copies of MMIO addresses can, of course, be dropped when they become LRU and are chosen as the victim to be replaced by a new line brought into the cache.
    A code could read enough conflicting cacheable addresses to ensure that the cached MMIO line would be evicted.
    The number is typically 8 for a 32 KiB data cache, but you need to be careful that the reads have not been rearranged to put the cached MMIO read in the middle of the “flushing” reads.   There are also some systems for which the pseudo-LRU algorithm has “features” that can break this approach.  (HyperThreading and shared caches can both add complexity in this dimension.)
  2. The CLFLUSH instruction operating on the virtual address of the cached MMIO line should evict it from the L1 and L2 caches.
    Whether it will evict the line from the L3 depends on the implementation, and I don’t have enough information to speculate on whether this will work on Xeon processors.   For AMD Family 10h processors, due to the limitations of the CLFLUSH implementation, cached MMIO lines are only allowed in the L1 cache.
  3. For memory mapped my the MTRRs as WP (“Write Protect”), a store to the address of the cached MMIO line should invalidate that line from the L1 & L2 data caches.  This will generate an *uncached* store, which typically stalls the processor for quite a while, so it is not a preferred solution.
  4. The WBINVD instruction (kernel mode only) will invalidate the *entire* processor data cache structure and according to the Intel Architecture Software Developer’s Guide, Volume 2 (document 325338-044), will also cause all external caches to be flushed.  Additional details are discussed in the SW Developer’s Guide, Volume 3.    Additional caution needs to be taken if running with HyperThreading enabled, as mentioned in the discussion of the CPUID instruction in the SW Developer’s Guide, Vol 2.
  5. The INVD instruction (kernel mode only) will invalidate all the processor caches, but it does this non-coherently (i.e., dirty cache lines are not written back to memory, so any modified data gets lost).   This is very likely to crash your system, and is only mentioned here for completeness.
  6. AMD processors support some extensions to the MTRR mechanism that allow read and write operations to the same physical address to be sent to different places (i.e., one to system memory and the other to MMIO).  This is *almost* useful for supporting cached MMIO, but (at least on the Family 10h processors), the specific mode that I wanted to set up (see addendum below) is disallowed for ugly microarchitectural reasons that I can’t discuss.

There are likely to be more complexities that I am not remembering right now, but the preferred answer is to bind the process doing the cached MMIO to a single core (and single thread context if using HyperThreading) and use CLFLUSH on the address you want to invalidate.   There are no guarantees, but this seems like the approach most likely to work.


Addendum: The AMD almost-solution using MTRR extensions.

The AMD64 architecture provides extensions to the MTRR mechanism called IORRs that allow the system programmer to independently specify whether reads to a certain region go to system memory or MMIO and whether writes to that region go to system memory or MMIO.   This is discussed in the “AMD64 Architecture Programmers Manual, Volume 2: System Programming” (publication number 24593).
I am using version 3.22 from September 2012, where this is described in section 7.9.

The idea was to use this to modify the behavior of the “read-only” MMIO mapping so that reads would go to MMIO while writes would go to system memory.  At first glance this seems strange — I would be creating a “write-only” region of system memory that could never be read (because reads to that address range would go to MMIO).

So why would this help?

It would help because sending the writes to system memory would cause the cache coherence mechanisms to be activated.   A streaming store (for example) to this region would be sent to the memory controller for that physical address range.  The memory controller treats streaming stores in the same way as DMA stores from IO devices to system memory, and it sends out invalidate messages to all caches in the system.  This would invalidate the cached MMIO line in all caches, which would eliminate both the need to pin the thread to a specific core and the problem of the CLFLUSH not reaching the L3 cache.

At least in the AMD Family 10h processors, this IORR function works, but due to some implementation issues in this particular use case it forces the region to the MTRR UC (uncached) type, which defeats my purpose in the exercise.   I think that the implementation issues could be either fixed or worked around, but since this is a fix to a mode that is not entirely supported, it is easy to understand that this never showed up as a high priority to “fix”.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Accelerated Computing, Computer Hardware, Linux | Tagged: , , , | Comments Off on Coherence with Cached Memory-Mapped IO

Notes on Cached Access to Memory-Mapped IO Regions

Posted by John D. McCalpin, Ph.D. on May 29, 2013

When attempting to build heterogeneous computers with “accelerators” or “coprocessors” on PCIe interfaces, one quickly runs into asymmetries between the data transfer capabilities of processors and IO devices.  These asymmetries are often surprising — the tremendously complex processor is actually less capable of generating precisely controlled high-performance IO transactions than the simpler IO device.   This leads to ugly, high-latency implementations in which the processor has to program the IO unit to perform the required DMA transfers and then interrupt the processor when the transfers are complete.

For tightly-coupled acceleration, it would be nice to have the option of having the processor directly read and write to memory locations on the IO device.  The fundamental capability exists in all modern processors through the feature called “Memory-Mapped IO” (MMIO), but for historical reasons this provides the desired functionality without the desired performance.   As discussed below, it is generally possible to set up an MMIO mapping that allows high-performance writes to IO space, but setting up mappings that allow high-performance reads from IO space is much more problematic.

Processors only support high-performance reads when executing loads to cached address ranges.   Such reads transfer data in cache-line-sized blocks (64 Bytes on x86 architectures) and can support multiple concurrent read transactions for high throughput.  When executing loads to uncached address ranges (such as MMIO ranges), each read fetches only the specific bits requested (1, 2, 4, or 8 Bytes), and all reads to uncached address ranges are completely serialized with respect to each other and with respect to any other memory references.   So even if the latency to the IO device were the same as the latency to memory, using cache-line accesses could easily be (for example) 64 times as fast as using uncached accesses — 8 concurrent transfers of 64 Bytes using cache-line accesses versus one serialized transfer of 8 Bytes.

But is it possible to get modern processors to use their cache-line access mechanisms to read data from MMIO addresses?   The answer is a resounding, “yes, but….“.    The notes below provide an introduction to some of the issues….

It is possible to map IO devices to cacheable memory on at least some processors, but the accesses have to be very carefully controlled to keep within the capabilities of the hardware — some of the transactions to cacheable memory can map to IO transactions and some cannot.
I don’t know the details for Intel processors, but I did go through all the combinations in great detail as the technology lead of the “Torrenza” project at AMD.

Speaking generically, some examples of things that should and should not work (though the details will depend on the implementation):

  • Load miss — generates a cache line read — converted to a 64 Byte IO read — works OK.
    BUT, there is no way for the IO device to invalidate that line in the processor(s) cache(s), so coherence must be maintained manually using the CLFLUSH instruction. NOTE also that the CLFLUSH instruction may or may not work as expected when applied to addresses that are mapped to MMIO, since the coherence engines are typically associated with the memory controllers, not the IO controllers. At the very least you will need to pin threads doing cached MMIO to a single core to maximize the chances that the CLFLUSH instructions will actually clear the (potentially stale) copies of the cache lines mapped to the MMIO range.
  • Streaming Store (aka Write-Combining store, aka Non-temporal store) — generates one or more uncached stores — works OK.
    This is the only mode that is “officially” supported for MMIO ranges by x86 and x86-64 processors. It was added in the olden days to allow a processor core to execute high-speed stores into a graphics frame buffer (i.e., before there was a separate graphics processor). These stores do not use the caches, but do allow you to write to the MMIO range using full cache line writes and (typically) allows multiple concurrent stores in flight.
    The Linux “ioremap_wc” maps a region so that all stores are translated to streaming stores, but because the hardware allows this, it is typically possible to explicitly generate streaming stores (MOVNTA instructions) for MMIO regions that are mapped as cached.
  • Store Miss (aka “Read For Ownership”/RFO) — generates a request for exclusive access to a cache line — probably won’t work.
    The reason that it probably won’t work is that RFO requires that the line be invalidated in all the other caches, with the requesting core not allowed to use the data until it receives acknowledgements from all the other cores that the line has been invalidated — but an IO controller is not a coherence controller, so it (typically) cannot generate the required probe/snoop transactions.
    It is possible to imagine implementations that would convert this transaction to an ordinary 64 Byte IO read, but then some component of the system would have to “remember” that this translation took place and would have to lie to the core and tell it that all the other cores had responded with invalidate acknowledgements, so that the core could place the line in “M” state and have permission to write to it.
  • Victim Writeback — writes back a dirty line from cache to memory — probably won’t work.
    Assuming that you could get past the problems with the “store miss” and get the line in “M” state in the cache, eventually the cache will need to evict the dirty line. Although this superficially resembles a 64 Byte store, from the coherence perspective it is quite a different transaction. A Victim Writeback actually has no coherence implications — all of the coherence was handled by the RFO up front, and the Victim Writeback is just the delayed completion of that operation. Again, it is possible to imagine an implementation that simply mapped the Victim Writeback to a 64 Byte IO store, but when you get into the details there are features that just don’t fit. I don’t know of any processor implementation for which a mapping of Victim Writeback operations to MMIO space is supported.

There is one set of mappings that can be made to work on at least some x86-64 processors, and it is based on mapping the MMIO space *twice*, with one mapping used only for reads and the other mapping used only for writes:

  • Map the MMIO range with a set of attributes that allow write-combining stores (but only uncached reads). This mode is supported by x86-64 processors and is provided by the Linux “ioremap_wc()” kernel function, which generates an MTRR (“Memory Type Range Register”) of “WC” (write-combining).  In this case all stores are converted to write-combining stores, but the use of explicit write-combining store instructions (MOVNTA and its relatives) makes the usage more clear.
  • Map the MMIO range a second time with a set of attributes that allow cache-line reads (but only uncached, non-write-combined stores).
    For x86 & x86-64 processors, the MTRR type(s) that allow this are “Write-Through” (WT) and “Write-Protect” (WP).
    These might be mapped to the same behavior internally, but the nominal difference is that in WT mode stores *update* the corresponding line if it happens to be in the cache, while in WP mode stores *invalidate* the corresponding line if it happens to be in the cache. In our current application it does not matter, since we will not be executing any stores to this region. On the other hand, we will need to execute CLFLUSH operations to this region, since that is the only way to ensure that (potentially) stale cache lines are removed from the cache and that the subsequent read operation to a line actually goes to the MMIO-mapped device and reads fresh data.

On the particular device that I am fiddling with now, the *device* exports two address ranges using the PCIe BAR functionality. These both map to the same memory locations on the device, but each BAR is mapped to a different *physical* address by the Linux kernel. The different *physical* addresses allow the MTRRs to be set differently (WC for the write range and WT/WP for the read range). These are also mapped to different *virtual* addresses so that the PATs can be set up with values that are consistent with the MTRRs.

Because the IO device has no way to generate transactions to invalidate copies of MMIO-mapped addresses in processor caches, it is the responsibility of the software to ensure that cache lines in the “read” region are invalidated (using the CLFLUSH instruction on x86) if the data is updated either by the IO device or by writes to the corresponding (aliased) address in the “write” region.   This software based coherence functionality can be implemented at many different levels of complexity, for example:

  • For some applications the data access patterns are based on clear “phases”, so in a “phase” you can leave the data in the cache and simply invalidate the entire block of cached MMIO addresses at the end of the “phase”.
  • If you expect only a small fraction of the MMIO addresses to actually be updated during a phase, this approach is overly conservative and will lead to excessive read traffic.  In such a case, a simple “directory-based coherence” mechanism can be used.  The IO device can keep a bit map of the cache-line-sized addresses that are modified during a “phase”.  The processor can read this bit map (presumably packed into a single cache line by the IO device) and only invalidate the specific cache lines that the directory indicates have been updated.   Lines that have not been updated are still valid, so copies that stay in the processor cache will be safe to use.

Giving the processor the capability of reading from an IO device at low latency and high throughput allows a designer to think about interacting with the device in new ways, and should open up new possibilities for fine-grained off-loading in heterogeneous systems….


Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Accelerated Computing, Computer Hardware, Linux | Tagged: , , , , | Comments Off on Notes on Cached Access to Memory-Mapped IO Regions

STREAM version 5.10 released

Posted by John D. McCalpin, Ph.D. on January 17, 2013

After much too long a delay, version 5.10 of the STREAM benchmark has been released (at least in the C language version).

Although version 5.10 of the benchmark still measures exactly the same thing as previous versions, a number of long-awaited features have finally been integrated.

  • Updated Validation Code
  • Array indexing now allows arrays with more than 2 billion elements
  • Data type used can now be overridden from the default “double” to “float” with a single compile flag
  • Many small output formatting changes to account for computers getting bigger and faster

The validation code update is the biggest change to version 5.10 of stream.c.
With previous versions, the validation code was subject to accumulated round-off error that could cause the code to report that validation failed with large array sizes — even if nothing was actually wrong. The revised code eliminates this problem and has been tested to array sizes of 10 billion elements with no problems.

Previous version of STREAM were limited to 32-bit array indices. Version 5.10 defines the array indices using a type that will map to a 64-bit integer on 64-bit machine — thus allowing arrays with more than 2 billion elements. Most compilers require an additional command-line flag like “-mcmodel=medium” to allow full 64-bit addressing. The changes to STREAM in version 5.10 are required in addition to the extra command-line flag.

Dr. Bandwidth also found eight older submissions (from 2009 through early 2012) that somehow got lost in my mailbox and never posted to the site. These are listed on the STREAM benchmark What’s New page.

Along with these older submissions, four new submissions have just been added to the site, ranging from a Raspberry Pi delivering about 200 MB/s to a Xeon Phi SE10P coprocessor delivering over 160,000 MB/s — that’s an 800 to 1 ratio of sustained memory bandwidths measured for single-chip systems!

Users of systems at the Texas Advanced Computing Center will be interested in seeing new results posted for three different components of the Stampede system:

  • Stampede Compute Nodes: Dell_DCS8000 servers with two Xeon E5-2680 (8-core, 2.7 GHz) processors
  • Stampede Coprocessors: Intel_XeonPhi_SE10P Coprocessor (61-core, 1.1 GHz)
  • Stampede Large Memory Nodes: Dell_PowerEdge_820 servers with four Xeon E5-4650 (8-core, 2.7 GHz) processors
Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Performance | Tagged: | 3 Comments »

Counting binary vs decimal powers in the STREAM benchmark

Posted by John D. McCalpin, Ph.D. on January 5, 2013

A question came up recently about my choice of definitions for “MB” used in the computation of memory bandwidth (in “MB/s”) in the STREAM benchmark.

According to this reference from NIST, the convention is:

Binary Powers Value abbreviation full name
2^10 1,024 KiB kibibyte
2^20 1,048,576 MiB mebibyte
2^30 1,073,741,824 GiB gibibyte
Decimal Powers Value abbreviation full name
10^3 1,000 kB kilobyte
10^6 1,000,000 MB megabyte
10^9 1,000,000,000 GB gigabyte

Since its inception in 1991, the STREAM benchmark has reported the amount of memory used in MiB (2^20) and (more recently) GiB (2^30), but always reports the transfer rates in MB/s (10^6).

An example may make my motivation more clear.

Suppose a computer system reads 524,288 Bytes and writes 524,288 Bytes in 1.000 seconds, for a total of 1,048,576 Bytes transferred in 1.000 seconds.
The corresponding performance could be reported in a variety of ways:

  • Option 1: report as 1,048,576 Bytes/s
  • Option 2: report as 1.000000 MiB/s
  • Option 3: report as 1.049 MB/s

From my perspective:

  • Option 1 gives inconveniently large numbers.
  • Option 2 is consistent with typical units for memory storage, but:
    • it is not consistent with typical units for counting arithmetic operations (more on that below), and
    • it would allow unscrupulous parties (or simply parties with different opinions about how to “properly” count) to change the definition of “MB” from 2^20 to 10^6, allowing them to report values that were almost 4.9% higher than the *same* performance on other systems.
  • Option 3 is what I chose. It is consistent with how FLOPS are counted and it preempts the potential “performance inflation” from abusing Option 2.

Note that if floating-point arithmetic operation counts define “MFLOPS” as 10^6 FP Ops/s (as is typical), then “balance” ratios of (MB/s)/MFLOPS require that (MB/s) also be defined using a decimal base.
These “balance” ratios are an important output of the STREAM benchmark project.
(Aside: I would not encourage anyone to consider a 5% difference in “balance” to mean very much — these are intended as relatively coarse scaling estimates.)

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Performance | Tagged: | Comments Off on Counting binary vs decimal powers in the STREAM benchmark

Some comments on the Xeon Phi coprocessor

Posted by John D. McCalpin, Ph.D. on November 17, 2012

As many of you know, the Texas Advanced Computing Center is in the midst of installing “Stampede” — a large supercomputer using both Intel Xeon E5 (“Sandy Bridge”) and Intel Xeon Phi (aka “MIC”, aka “Knights Corner”) processors.

In his blog “The Perils of Parallel”, Greg Pfister commented on the Xeon Phi announcement and raised a few questions that I thought I should address here.

I am not in a position to comment on Greg’s first question about pricing, but “Dr. Bandwidth” is happy to address Greg’s second question on memory bandwidth!
This has two pieces — local memory bandwidth and PCIe bandwidth to the host. Greg also raised some issues regarding ECC and regarding performance relative to the Xeon E5 processors that I will address below. Although Greg did not directly raise issues of comparisons with GPUs, several of the topics below seemed to call for comments on similarities and differences between Xeon Phi and GPUs as coprocessors, so I have included some thoughts there as well.

Local Memory Bandwidth

The Intel Xeon Phi 5110P is reported to have 8 GB of local memory supporting 320 GB/s of peak bandwidth. The TACC Stampede system employs a slightly different model Xeon Phi, referred to as the Xeon Phi SE10P — this is the model used in the benchmark results reported in the footnotes of the announcement of the Xeon Phi 5110P. The Xeon Phi SE10P runs its memory slightly faster than the Xeon Phi 5110P, but memory performance is primarily limited by available concurrency (more on that later), so the sustained bandwidth is expected to be essentially the same.

Background: Memory Balance

Since 1991, I have been tracking (via the STREAM benchmark) the “balance” between sustainable memory bandwidth and peak double-precision floating-point performance. This is often expressed in “Bytes/FLOP” (or more correctly “Bytes/second per FP Op/second”), but these numbers have been getting too small (<< 1), so for the STREAM benchmark I use "FLOPS/Word" instead (again, more correctly "FLOPs/second per Word/second", where "Word" is whatever size was used in the FP operation). The design target for the traditional "vector" systems was about 1 FLOP/Word, while cache-based systems have been characterized by ratios anywhere between 10 FLOPS/Word and 200 FLOPS/Word. Systems delivering the high sustained memory bandwidth of 10 FLOPS/Word are typically expensive and applications are often compute-limited, while systems delivering the low sustained memory bandwidth of 200 FLOPS/Word are typically strongly memory bandwidth-limited, with throughput scaling poorly as processors are added.

Some real-world examples from TACC's systems:

  • TACC’s Ranger system (4-socket quad-core Opteron Family 10h “Barcelona” processors) sustains about 17.5 GB/s (2.19 GW/s for 8-Byte Words) per node, and have a peak FP rate of 2.3 GHz * 4 FP Ops/Hz/core * 4 cores/socket * 4 sockets = 147.2 GFLOPS per node. The ratio is therefore about 67 FLOPS/Word.
  • TACC’s Lonestar system (2-socket 6-core Xeon 5600 “Westmere” processors) sustains about 41 GB/s (5.125 GW/s) per node, and have a peak FP rate of 3.33 GHz * 4 Ops/Hz/core * 6 cores/socket * 2 sockets = 160 GFLOPS per node. The ratio is therefore about 31 FLOPS/Word.
  • TACC’s forthcoming Stampede system (2-socket 8-core Xeon E5 “Sandy Bridge” processors) sustains about 78 GB/s (9.75 GW/s) per node, and have a peak FP rate of 2.7 GHz * 8 FP Ops/Hz * 8 cores/socket * 2 sockets = 345.6 GFLOPS per ndoe. The ratio is therefore a bit over 35 FLOPS/Word.

Again, the Xeon Phi SE10P coprocessors being installed at TACC are not identical to the announced product version, but the differences are not particularly large. According to footnote 7 of Intel’s announcement web page, the Xeon Phi SE10P has a peak performance of about 1.06 TFLOPS, while footnote 8 reports a STREAM benchmark performance of up to 175 GB/s (21.875 GW/s). The ratio is therefore about 48 FLOPS/Word — a bit less bandwidth per FLOP than the Xeon E5 nodes in the TACC Stampede system (or the TACC Lonestar system), but a bit more bandwidth per FLOP than provided by the nodes in the TACC Ranger system. (I will have a lot more to say about sustained memory bandwidth on the Xeon Phi SE10P over the next few weeks.)

The earlier GPUs had relatively high ratios of bandwidth to peak double-precision FP performance, but as the double-precision FP performance was increased, the ratios have shifted to relatively low amounts of sustainable bandwidth per peak FLOP. For the NVIDIA M2070 “Fermi” GPGPU, the peak double-precision performance is reported as 515.2 GFLOPS, while I measured sustained local bandwidth of about 105 GB/s (13.125 GW/s) using a CUDA port of the STREAM benchmark (with ECC enabled). This corresponds to about 39 FLOPS/Word. I don’t have sustained local bandwidth numbers for the new “Kepler” K20X product, but the data sheet reports that the peak memory bandwidth has been increased by 1.6x (250 GB/s vs 150 GB/s) while the peak FP rate has been increased by 2.5x (1.31 TFLOPS vs 0.515 TFLOPS), so the ratio of peak FLOPS to sustained local bandwidth must be significantly higher than the 39 for the “Fermi” M2070, and is likely in the 55-60 range — slightly higher than the value for the Xeon Phi SE10P.

Although the local memory bandwidth ratios are similar between GPUs and Xeon Phi, the Xeon Phi has a lot more cache to facilitate data reuse (thereby decreasing bandwidth demand). The architectures are quite different, but the NVIDIA Kepler K20x appears to have a total of about 2MB of registers, L1 cache, and L2 cache per chip. In contrast, the Xeon Phi has a 32kB data cache and a private 512kB L2 cache per core, giving a total of more than 30 MB of cache per chip. As the community develops experience with these products, it will be interesting to see how effective the two approaches are for supporting applications.

PCIe Interface Bandwidth

There is no doubt that the PCIe interface between the host and a Xeon Phi has a lot less sustainable bandwidth than what is available for either the Xeon Phi to its local memory or for the host processor to its local memory. This will certainly limit the classes of algorithms that can map effectively to this architecture — just as it limits the classes of algorithms that can be mapped to GPU architectures.

Although many programming models are supported for the Xeon Phi, one that looks interesting (and which is not available on current GPUs) is to run MPI tasks on the Xeon Phi card as well as on the host.

  • MPI codes are typically structured to minimize external bandwidth, so the PCIe interface is used only for MPI messages and not for additional offloading traffic between the host and coprocessor.
  • If the application allows different amounts of “work” to be allocated to each MPI task, then you can use performance measurements for your application to balance the work allocated to each processing component.
  • If the application scales well with OpenMP parallelism, then placing one MPI task on each Xeon E5 chip on the host (with 8 threads per task) and one MPI task on the Xeon Phi (with anywhere from 60-240 threads per task, depending on how your particular application scales).
  • Xeon Phi supports multiple MPI tasks concurrently (with environment variables to control which cores an MPI task’s threads can run on), so applications that do not easily allow different amounts of work to be allocated to each MPI task might run multiple MPI tasks on the Xeon Phi, with the number chosen to balance performance with the performance of the host processors. For example if the Xeon Phi delivers approximately twice the performance of a Xeon E5 host chip, then one might allocate one MPI task on each Xeon E5 (with OpenMP threading internal to the task) and two MPI tasks on the Xeon Phi (again with OpenMP threading internal to the task). If the Xeon Phi delivers three times the performance of the Xeon E5, then one would allocate three MPI tasks to the Xeon Phi, etc….

Running a full operating system on the Xeon Phi allows more flexibility in code structure than is available on (current) GPU-based coprocessors. Possibilities include:

  • Run on host and offload loops/functions to the Xeon Phi.
  • Run on Xeon Phi and offload loops/functions to the host.
  • Run on Xeon Phi and host as peers, for example with MPI.
  • Run only on the host and ignore the Xeon Phi.
  • Run only on the Xeon Phi and use the host only for launching jobs and providing external network and file system access.

Lots of things to try….


Like most (all?) GPUs that support ECC, the Xeon Phi implements ECC “inline” — using a fraction of the standard memory space to hold the ECC bits. This requires memory controller support to perform the ECC checks and to hide the “holes” in memory that contain the ECC bits, but it allows the feature to be turned on and off without incurring extra hardware expense for widening the memory interface to support the ECC bits.

Note that widening the memory interface from 64 bits to 72 bits is straightforward with x4 and x8 DRAM parts — just use 18 x4 chips instead of 16, or use 9 x8 chips instead of 8 — but is problematic with the x32 GDDR5 DRAMs used in GPUs and in Xeon Phi. A single x32 GDDR5 chip has a minimum burst of 32 Bytes so a cache line can be easily delivered with a single transfer from two “ganged” channels. If one wanted to “widen” the interface to hold the ECC bits, the minimum overhead is one extra 32-bit channel — a 50% overhead. This is certainly an unattractive option compared to the 12.5% overhead for the standard DDR3 ECC DIMMs. There are a variety of tricky approaches that might be used to reduce this overhead, but the inline approach seems quite sensible for early product generations.

Intel has not disclosed details about the implementation of ECC on Xeon Phi, but my current understanding of their implementation suggests that the performance penalty (in terms of bandwidth) is actually rather small. I don’t know enough to speculate on the latency penalty yet. All of TACC’s Xeon Phi’s have been running with ECC enabled, but any Xeon Phi owner should be able to reboot a node with ECC disabled to perform direct latency and bandwidth comparisons. (I have added this to my “To Do” list….)

Speedup relative to Xeon E5

Greg noted the surprisingly reasonable claims for speedup relative to Xeon E5. I agree that this is a good thing, and that it is much better to pay attention to application speedup than to the peak performance ratios. Computer performance history has shown that every approach used to double performance results in less than doubling of actual application performance.

Looking at some specific microarchitectural performance factors:

  1. Xeon Phi supports a 512-bit vector instruction set, which can be expected to be slightly less efficient than the 256-bit vector instruction set on Xeon E5.
  2. Xeon Phi has slightly lower L1 cache bandwidth (in terms of Bytes/Peak FP Op) than the Xeon E5, resulting in slightly lower efficiency for overlapping compute and data transfers to/from the L1 data cache.
  3. Xeon Phi has ~60 cores per chip, which can be expected to give less efficient throughput scaling than the 8 cores per Xeon E5 chip.
  4. Xeon Phi has slightly less bandwidth per peak FP Op than the Xeon E5, so the memory bandwidth will result in a higher overhead and a slightly lower percentage of peak FP utilization.
  5. Xeon Phi has no L3 cache, so the total cache per core (32kB L1 + 512kB L2) is lower than that provided by the Xeon E5 (32kB L1 + 256kB L2 + 2.5MB L3 (1/8 of the 20 MB shared L3).
  6. Xeon Phi has higher local memory latency than the Xeon E5, which has some impact on sustained bandwidth (already considered), and results in additional stall cycles in the occasional case of a non-prefetchable cache miss that cannot be overlapped with other memory transfers.

None of these are “problems” — they are intrinsic to the technology required to obtain higher peak performance per chip and higher peak performance per unit power. (That is not to say that the implementation cannot be improved, but it is saying that any implementation using comparable design and fabrication technology can be expected to show some level of efficiency loss due to each of these factors.)

The combined result of all these factors is that the Xeon Phi (or any processor obtaining its peak performance using much more parallelism with lower-power, less complex processors) will typically deliver a lower percentage of peak on real applications than a state-of-the-art Xeon E5 processor. Again, this is not a “problem” — it is intrinsic to the technology. Every application will show different sensitivity to each of these specific factors, but few applications will be insensitive to all of them.

Similar issues apply to comparisons between the “efficiency” of GPUs vs state-of-the-art processors like the Xeon E5. These comparisons are not as uniformly applicable because the fundamental architecture of GPUs is quite different than that of traditional CPUs. For example, we have all seen the claims of 50x and 100x speedups on GPUs. In these cases the algorithm is typically a poor match to the microarchitecture of the traditional CPU and a reasonable match to the microarchitecture of the GPU. We don’t expect to see similar speedups on Xeon Phi because it is based on a traditional microprocessor architecture and shows similar performance characteristics.

On the other hand, something that we don’t typically see is the list of 0x speedups for algorithms that do not map well enough to the GPU to make the porting effort worthwhile. Xeon Phi is not better than Xeon E5 on all workloads, but because it is based on general-purpose microprocessor cores it will run any general-purpose workload. The same cannot be said of GPU-based coprocessors.

Of course these are all general considerations. Performing careful direct comparisons of real application performance will take some time, but it should be a lot of fun!

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Hardware | Tagged: , , , , , | Comments Off on Some comments on the Xeon Phi coprocessor

AMD Opteron “Shanghai” and “Istanbul” Local and Remote Memory Latencies

Posted by John D. McCalpin, Ph.D. on July 27, 2012

In an earlier post, I documented the local and remote memory latencies for the SunBlade X6420 compute nodes in the TACC Ranger supercomputer, using AMD Opteron “Barcelona” (model 8356) processors running at 2.3 GHz.

Similar latency tests were run on other systems based on AMD Opteron processors in the TACC “Discovery” benchmarking cluster.  The systems reported here include

  • 2-socket node with AMD Opteron “Shanghai” processors (model 2389, quad-core, revision C2, 2.9 GHz)
  • 2-socket node with AMD Opteron “Istanbul” processors (model 2435, six-core, revision D0, 2.6 GHz)
  • 4-socket node with AMD Opteron “Istanbul” processors (model 8435, six-core, revision D0, 2.6 GHz)
  • 4-socket node with AMD Opteron “Magny-Cours” processors (model 6174, twelve-core, revision D1, 2.2 GHz)

Compared to the previous results with the AMD Opteron “Barcelona” processors on the TACC Ranger system, the “Shanghai” and “Istanbul” processors have a more advanced memory controller prefetcher, and the “Istanbul” processor also supports “HT Assist”, which allocates a portion of the L3 cache to serve as a “snoop filter” in 4-socket configurations.  The “Magny-Cours” processor uses 2 “Istanbul” die in a single package.

Note that for the “Barcelona” processors, the hardware prefetcher in the memory controller did not perform cache coherence “snoops” — it just loaded data from memory into a buffer.  When a core subsequently issued a load for that address, missing in the L3 cache initiated a coherence snoop.   In both 2-socket and 4-socket systems, this snoop took longer than obtaining the data from DRAM, so the memory prefetcher had no impact on the effective latency seen by a processor core.  “Shanghai” and later Opteron processors include a coherent prefetcher, so prefetched lines could be loaded with lower effective latency.   This difference means that latency testing on “Shanghai” and later processors needs to be slightly more sophisticated to prevent memory controller prefetching from biasing the latency measurements.  In practice, using a pointer-chasing code with a stride of 512 Bytes was sufficient to avoid hardware prefetch in “Shanghai”, “Istanbul”, and “Magny-Cours”.

Results for 2-socket systems

Processorcores/packageFrequencyFamilyRevisionCode NameLocal Latency (ns)Remote Latency (ns)
Opteron 2222230000Fh6095
Opteron 23564230010hB3Barcelona85
Opteron 23894290010hC2Shanghai73
Opteron 24356260010hD0Istanbul78
Opteron 617412220010hD1Magny-Cours

Notes for 2-socket results:

  1. The values for the Opteron 2222 are from memory, but should be fairly accurate.
  2. The local latency value for the Opteron 2356 is from memory, but should be in the right ballpark.  The latency is higher than the earlier processors because of the lower core frequency, the lower “Northbridge” frequency, the presence of an L3 cache, and the asynchronous clock boundary between the core(s) and the Northbridge.
  3. The script used for the Opteron 2389 (“Shanghai”) did not correctly bind the threads, so no remote latency data was collected.
  4. The script used for the Opteron 2435 (“Istanbul”) did not correctly bind the threads, so no remote latency data was collected.
  5. The Opteron 6174 was not tested in the 2-socket configuration.

Results for 4-socket systems

ProcessorFrequencyFamilyRevisionCode NameLocal Latency (ns)Remote Latency (ns)NOTES1-hop median Latency (ns)2-hop median Latency (ns)
Opteron 822230000Fh90
Opteron 8356230010hB3Barcelona100/133122-1461,2
Opteron 8389290010hC2Shanghai
Opteron 8435260010hD0Istanbul561183,4
Opteron 6174220010hD1Magny-Cours56121-1795124179

Notes for 4-socket results:

  1. On the 4-socket Opteron 8356 (“Barcelona”) system, 2 of the 4 sockets have a local latency of 100ns, while the other 2 have a local latency of 133ns.  This is due to the asymmetric/anisotropic HyperTransport interconnect, in which only 2 of sockets have direct connections to all other sockets, while the other 2 sockets require two hops to reach one of the remote sockets.
  2. On the 4-socket Opteron 8356 (“Barcelona”) system, the asymmetric/anisotropic HyperTransport interconnect gives rise to several different latencies for various combinations of requestor and target socket.  This is discussed in more detail at AMD “Barcelona” Memory Latency.
  3. Starting with “Istanbul”, 4-socket systems have lower local latency than 2-sockets systems because “HyperTransport Assist” (a probe filter) is activated.  Enabling this feature reduces the L3 cache size from 6MiB to 5MiB, but enables the processor to avoid sending probes to the other chips in many cases (including this one).
  4. On the 4-socket Opteron 8435 (“Istanbul”) system, the scripts I ran had an error causing them to only measure local latency and remote latency on 1 of the 3 remote sockets.  Based on other system results, it looks like the remote value was measured for a single “hop”, with no values available for the 2-hop case.
  5. On the 4-socket Opteron 6140 (“Magny-Cours”) system, each package has 2 die, each constituting a separate NUMA node.  The HyperTransport interconnect is asymmetric/anisotropic, with 2 die having direct links to 6 other die (with 1 die requiring 2 hops), and the other 6 die having direct links to 4 other die (with 3 die requiring 2 hops).  The average latency for globally uniform accesses (local and remote) is 133ns, while the average latency for uniformly distributed remote accesses is 144ns.


This disappointingly incomplete dataset still shows a few important features….

  • Local latency is not improving, and shows occasional degradations
    • Processor frequencies are not increasing.
    • DRAM latencies are essentially unchanged over this period — about 15 ns for DRAM page hits, 30 ns for DRAM pages misses, 45 ns for DRAM page conflicts, and 60 ns for DRAM bank cycle time.   The latency benchmark is configured to allow open page hits for the majority of accesses, but these results did not include instrumentation of the DRAM open page hit rate.
    • Many design changes act to increase memory latency.
      • Major factors include the increased number of cores per chip, the addition of a shared L3 cache, the increase in the number of die per package, and the addition of a separate clock frequency domain for the Northbridge.
    • There have been no architectural changes to move away from the transparent, “flat” shared-memory paradigm.
    • Instead, overcoming these latency adders required the introduction of “probe filters” – a useful feature, but one that significantly complicates the implementation, uses 1/6th of the L3 cache, and significantly complicates performance analysis.
  • Remote latency is getting slowly worse
    • This is primarily due to the addition of the L3 cache, the increase in the number of cores, and the increase in the number of die per package.

Magny-Cours Topology

The pointer-chasing latency code was run for all 64 combinations of data binding (NUMA nodes 0..7) and thread binding (1 core chosen from each of NUMA nodes 0..7).   It was not initially clear which topology was used in the system under test, but the observed latency pattern showed very clearly that 2 NUMA nodes had 6 1-hop neighbors, while the other 6 NUMA nodes had only 4 1-hop neighbors.   This property is also shown by the “Max Perf” configuration from Figure 3(c) of the 2010 IEEE Micro paper “Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor” by Conway, et al. (which I highly recommend for its discussion of the cache coherence protocol and probe filter).   The figure below corrects an error from the figure in the paper (which is missing the x16 link between the upper and lower chips in the package on the left), and renumbers the die so that they correspond to the NUMA nodes of the system I was testing.

The latency values are quite easy to understand: all the local values are the same, all the 1-hop values are almost the same, and all the 2-hop values are the same.  (The 1-hop values that are within the same package are about 3.3ns faster than the 1-hop values between packages, but this is a difference of less than 3%, so it will not impact “real-world” performance.)

The bandwidth patterns are much less pretty, but that is a much longer topic for another day….

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Hardware | Tagged: , | Comments Off on AMD Opteron “Shanghai” and “Istanbul” Local and Remote Memory Latencies