John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Computer Hardware' Category

Notes on “non-temporal” (aka “streaming”) stores

Posted by John D. McCalpin, Ph.D. on 1st January 2018

Memory systems using caches have a lot more potential flexibility than most implementations are able to exploit – you get the standard behavior all the time, even if an alternative behavior would be allowable and desirable in a specific circumstance.  One area in which many vendors have provided an alternative to the standard behavior is in “non-temporal” or “streaming” stores. These are sometimes also referred to as “cache-bypassing” or “non-allocating” stores, and even “Non-Globally-Ordered stores” in some cases.

The availability of streaming stores on various processors can often be tied to the improvements in performance that they provide on the STREAM benchmark, but there are several approaches to implementing this functionality, with some subtleties that may not be obvious.  (The underlying mechanism of write-combining was developed long before, and for other reasons, but exposing it to the user in standard cached memory space required the motivation provided by a highly visible benchmark.)

Before getting into details, it is helpful to review the standard behavior of a “typical” cached system for store operations.  Most recent systems use a “write-allocate” policy — when a store misses the cache, the cache line containing the store’s target address is read into the cache, then the parts of the line that are receiving the new data are updated.   These “write-allocate” cache policies have several advantages (mostly related to implementation correctness), but if the code overwrites all the data in the cache line, then reading the cache line from memory could be considered an unnecessary performance overhead.    It is this extra overhead of read transactions that makes the subject of streaming stores important for the STREAM benchmark.

As in many subjects, a lot of mistakes can be avoided by looking carefully at what the various terms mean — considering both similarities and differences.

  • “Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”).
    • So there is no benefit to keeping the data in the processor’s cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).
  • “Streaming store” is suggestive of stores to contiguous addresses (i.e., high “spatial locality”).
  • “Cache-bypassing store” says that at least some aspects of the transaction bypass the cache(s).
  • “Non-allocating store” says that a store that misses in a cache will not load the corresponding cache line into the cache before performing the store.
    • This implies that there is some other sort of structure to hold the data from the stores until it is sent to memory.  This may be called a “store buffer”, a “write buffer”, or a “write-combining buffer”.
  • “Non-globally-ordered store” says that the results of an NGO store might appear in a different order (relative to ordinary stores or to other NGO stores) on other processors.
    • All stores must appear in program order on the processor performing the stores, but may only need to appear in the same order on other processors in some special cases, such as stores related to interprocessor communication and synchronization.

There are at least three issues raised by these terms:

  1. Caching: Do we want the result of the store to displace something else from the cache?
  2. Allocating: Can we tolerate the extra memory traffic incurred by reading the cache line before overwriting it?
  3. Ordering: Do the results of this store need to appear in order with respect to other stores?

Different applications may have very different requirements with respect to these issues and may be best served by different implementations.   For example, if the priority is to prevent the stored data from displacing other data in the processor cache, then it may suffice to put the data in the cache, but mark it as Least-Recently-Used, so that it will displace as little useful data as possible.   In the case of the STREAM benchmark, the usual priority is the question of allocation — we want to avoid reading the data before over-writing it.

While I am not quite apologizing for that, it is clear that STREAM over-represents stores in general, and store misses in particular, relative to the high-bandwidth applications that I intended it to represent.
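
Before looking at the performance trade-offs, it helps to see what a streaming store looks like in code.  On x86 processors the non-temporal stores are exposed through the MOVNT* instructions or the corresponding compiler intrinsics.  A minimal sketch (my own function names, not from any benchmark) contrasting an ordinary store loop, which incurs write allocates on store misses, with a streaming-store loop using AVX intrinsics:

#include <immintrin.h>
#include <stddef.h>

/* Ordinary stores: each store miss reads the target cache line first ("write allocate"). */
void fill_cached(double *dst, double value, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = value;
}

/* Non-temporal stores: lines are written through write-combining buffers and are not
   installed in the caches.  Assumes dst is 32-byte aligned and n is a multiple of 4.
   The sfence makes the weakly-ordered streaming stores globally visible before returning. */
void fill_streaming(double *dst, double value, size_t n)
{
    __m256d v = _mm256_set1_pd(value);
    for (size_t i = 0; i < n; i += 4)
        _mm256_stream_pd(&dst[i], v);
    _mm_sfence();
}

Some compilers can generate the streaming form automatically under the right options, but the intrinsic form makes the choice explicit.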

To understand the benefit of non-temporal stores, you need to understand if you are operating in (A) a concurrency-limited regime or (B) a bandwidth-limited regime. (This is not a short story, but you can learn a lot about how real systems work from studying it.)

(Note: The performance numbers below are from the original private version of this post created on 2015-05-29.   Newer systems require more concurrency — numbers will be updated when I have a few minutes to dig them up….)

(A) For a single thread you are almost always working in a concurrency-limited regime. For example, my Intel Xeon E5-2680 (Sandy Bridge EP) dual-socket systems have an idle memory latency to local memory of 79 ns and a peak DRAM bandwidth of 51.2 GB/s (per socket). If we assume that some cache miss handling buffer must be occupied for approximately one latency per transaction, then queuing theory dictates that you must maintain 79 ns * 51.2 GB/s = 4045 Bytes “in flight” at all times to “fill the memory pipeline” or “tolerate the latency”. This rounds up to 64 cache lines in flight, while a single core only supports 10 concurrent L1 Data Cache misses. In the absence of other mechanisms to move data, this limits a single thread to a read bandwidth of 10 lines * 64 Bytes/line / 79 ns = 8.1 GB/s.
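
A quick back-of-the-envelope check of these numbers, using only the values quoted above (a worked example, not a measurement tool):

#include <stdio.h>

int main(void)
{
    double latency_ns    = 79.0;    /* idle local memory latency             */
    double bandwidth_GBs = 51.2;    /* peak DRAM bandwidth per socket        */
    double line_bytes    = 64.0;    /* cache line size                       */

    double bytes_in_flight = latency_ns * bandwidth_GBs;        /* ns * GB/s = bytes */
    double lines_in_flight = bytes_in_flight / line_bytes;
    double one_core_limit  = 10.0 * line_bytes / latency_ns;    /* bytes/ns = GB/s   */

    printf("bytes in flight needed:   %.0f\n", bytes_in_flight);     /* ~4045 bytes       */
    printf("cache lines in flight:    %.1f\n", lines_in_flight);     /* ~63.2 -> 64 lines */
    printf("10-miss single-core BW:   %.1f GB/s\n", one_core_limit); /* ~8.1 GB/s         */
    return 0;
}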

Store misses are essentially the same as load misses in this analysis.

L1 hardware prefetchers don’t help performance here because they share the same 10 L1 cache miss buffers. L2 hardware prefetchers do help because bringing the data closer reduces the occupancy required by each cache miss transaction. Unfortunately, L2 hardware prefetchers are not directly controllable, so experimentation is required. The best read bandwidth that I have been able to achieve on this system is about 17.8 GB/s, which corresponds to an “effective concurrency” of about 22 cache lines or an “effective latency” of about 36 ns (for each of the 10 concurrent L1 misses).

L2 hardware prefetchers on this system are also able to perform prefetches for store miss streams, thus reducing the occupancy for store misses and increasing the store miss bandwidth.

For non-temporal stores there is no concept corresponding to “prefetch”, so you are stuck with whatever buffer occupancy the hardware gives you. Note that since the non-temporal store does not bring data *to* the processor, there is no reason for its occupancy to be tied to the memory latency. One would expect a cache miss buffer holding a non-temporal store to have a significantly shorter occupancy than one holding an ordinary cache miss, since it is allocated, filled, and then transfers its contents out to the memory controller (which presumably has many more buffers than the 10 cache miss buffers that the core supports).

But, while non-temporal stores are expected to have a shorter buffer occupancy than that of a cache miss that goes all the way to memory, it is not at all clear whether they will have a shorter buffer occupancy than a store miss that activates an L2 hardware prefetch engine. It turns out that on the Xeon E5-2680 (Sandy Bridge EP), non-temporal stores have *higher* occupancy than store misses that activate the L2 hardware prefetcher, so using non-temporal stores slows down the performance of each of the four STREAM kernels. I don’t have all the detailed numbers in front of me, but IIRC, STREAM Triad runs at about 10 GB/s with one thread on a Xeon E5-2680 when using non-temporal stores and at 12-14 GB/s when *not* using non-temporal stores.

This result is not a general rule. On the Intel Xeon E3-1270 (also a Sandy Bridge core, but with the “client” L3 and memory controller), the occupancy of non-temporal stores in the L1 cache miss buffers appears to be much shorter, so there is not an occupancy-induced slowdown. On the older AMD processors (K8 and Family 10h), non-temporal stores used a set of four “write-combining registers” that were independent of the eight buffers used for L1 data cache misses. In this case the use of non-temporal stores allowed one thread to have more concurrent memory transactions, so it almost always helped performance.

(B) STREAM is typically run using as many cores as is necessary to achieve the maximum sustainable memory bandwidth. In this bandwidth-limited regime non-temporal stores play a very different role. When using all the cores there are no longer concurrency limits, but there is a big difference in bulk memory traffic. A store miss must read the target line into the L1 cache before overwriting it, while non-temporal stores avoid this (semantically unnecessary) load from memory. Thus for the STREAM Copy and Scale kernels, using cached stores results in two memory reads and one memory write, while using non-temporal stores requires only one memory read and one memory write — a 3:2 ratio. Similarly, the STREAM Add and Triad kernels transfer 4:3 as much data when using cached stores compared to non-temporal stores.

On most systems the reported performance ratios for STREAM (using all processors) with and without non-temporal stores are very close to these 3:2 and 4:3 ratios. It is also typically the case that STREAM results (again using all processors) are about the same for all four kernels when non-temporal stores are used, while (due to the differing ratios of extra read traffic) the Add and Triad kernels are typically ~1/8 faster than the Copy and Scale kernels on systems that use cached stores (with write allocates the reported bandwidths are 3/4 and 2/3 of the sustainable memory bandwidth, respectively, giving a 9:8 ratio).

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture, Computer Hardware, Reference

Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Posted by John D. McCalpin, Ph.D. on 6th December 2016

The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.

It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)

The modes that are important are:

  • “Flat” vs “Cache”
    • In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
      • The OS exposes this memory as being on “NUMA node 1”, so it can be accessed using the standard NUMA control facilities (e.g., numactl).
      • Sustained bandwidth from MCDRAM is highest in “Flat” mode.
    • In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
      • In this mode the MCDRAM is invisible and effectively uncontrollable.  I will discuss the performance characteristics of Cache mode at a later date.
  • “All-to-All” vs “Quadrant”
    • In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
      • Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
    • In “Quadrant” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but each address is assigned to one of the MCDRAM controllers in the same “quadrant” as the coherence controller.
      • This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
      • Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
        • Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
  • “Sub-NUMA-Cluster”
    • There are several of these modes, only one of which will be discussed here.
    • “Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
      • “node 0” owns the 1st quarter of contiguous physical address space.
        • The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
        • Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
        • The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
      • Ditto for nodes 1, 2, and 3.
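
For the “Flat” modes described above, memory placement can be controlled externally (e.g., “numactl --membind=1 ./a.out” forces all allocations into MCDRAM) or from within the code.  A minimal sketch using libnuma (assuming the libnuma headers and library are installed; link with -lnuma); node 1 is the MCDRAM node exposed by the OS in Flat-Quadrant or Flat-All2All mode, while in the SNC4 modes the MCDRAM typically appears as several additional NUMA nodes (one per quadrant), so the node number would differ:

#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports NUMA is not available\n");
        return 1;
    }
    size_t bytes = 1UL << 30;                       /* 1 GiB test buffer            */
    double *buf  = numa_alloc_onnode(bytes, 1);     /* node 1 = MCDRAM in Flat mode */
    if (buf == NULL) {
        fprintf(stderr, "MCDRAM allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        buf[i] = 0.0;                               /* touch the pages to commit them */
    numa_free(buf, bytes);
    return 0;
}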

The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).

My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory.  The values presented are averaged over many addresses, with the ranges showing the variation of average latency across cores.

Mode of Operation                 Flat-Quadrant   Flat-All2All   SNC4 local   SNC4 remote
MCDRAM maximum latency (ns)           156.1           158.3         153.6        164.7
MCDRAM average latency (ns)           154.0           155.9         150.5        156.8
MCDRAM minimum latency (ns)           152.3           154.4         148.3        150.3
MCDRAM standard deviation (ns)          1.0             1.0           0.9          3.1

Caveats:

  • My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
  • Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
    • E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
  • Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.

Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….

The corresponding data for addresses mapped to DDR4 memory are included in the table below:

Mode of Operation                 Flat-Quadrant   Flat-All2All   SNC4 local   SNC4 remote
DDR4 maximum latency (ns)             133.3           136.8         130.0        141.5
DDR4 average latency (ns)             130.4           131.8         128.2        133.1
DDR4 minimum latency (ns)             128.2           128.5         125.4        126.5
DDR4 standard deviation (ns)            1.2             2.4           1.1          3.1

There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.

Posted in Cache Coherence Implementations, Computer Architecture, Computer Hardware, Performance

SC16 Invited Talk: Memory Bandwidth and System Balance in HPC Systems

Posted by John D. McCalpin, Ph.D. on 22nd November 2016

Slide01

I have been involved in HPC for over 30 years:

  • 12 years as student & faculty user in ocean modeling,
  • 12 years as a performance analyst and system architect at SGI, IBM, and AMD, and
  • over 7 years as a research scientist at TACC.

Slide02

Slide03
  • This history is based on my own study of the market, with many of the specific details from my own re-analysis of the systems in the TOP500 lists.

Slide04
  • Vector systems were in decline by the time the first TOP500 list was collected in 1993, but still dominated the large systems space in the early 1990’s.
    • The large bump in Rmax in 2002 was due to the introduction of the “Earth Simulator” in Japan.
    • The last vector system (2nd gen Earth Simulator) fell off the list in 2014.
  • RISC SMPs and Clusters dominated the installed base in the second half of the 1990’s and the first few years of the 2000’s.
    • The large bump in Rmax in 2011 is the “K Machine” in Japan, based on a Fujitsu SPARC processor.
    • The “RISC era” was very dynamic, seeing the rapid rise and fall of 6-7 different architectures in about a decade.
    • In alphabetical order the major processor architectures were: Alpha, IA-64, i860, MIPS, PA-RISC, PowerPC, POWER, SPARC.
  • x86-based systems rapidly replaced RISC systems starting in around 2003.
    • The first big x86 system on the list was ASCI Red in 1996.
    • The large increase in x86 systems in 2003-2004 was due to several large systems in the top 10, rather than due to a single huge system.
    • The earliest of these systems were 32-bit Intel processors.
    • The growth of the x86 contribution was strongly enhanced by the introduction of the AMD x86-64 processors in 2004, with AMD contributing about 40% of the x86 Rmax by the end of 2006.
    • Intel 64-bit systems replaced 32-bit processors rapidly once they became available.
    • AMD’s share of the x86 Rmax dropped rapidly after 2011, and in the November 2016 list has fallen to about 1% of the Intel x86 Rmax.
  • My definition of “MPP” differs from Dongarra’s and is based on how the development of the most expensive part of the system (usually the processor) was funded.
    • Since 2005 almost all of the MPP’s in this chart have been IBM Blue Gene systems.
    • The big exception is the new #1 system, the Sunway Taihulight system in China.
  • Accelerated systems made their appearance in 2008 with the “RoadRunner” system at LANL.
    • “RoadRunner” was the only significant system using the IBM Cell processor.
    • Subsequent accelerated systems have almost all used NVIDIA GPUs or Intel Xeon Phi processors.
      • The NVIDIA GPUs took their big jump in 2010 with the introduction of the #2-ranked “Nebulae” (Dawning TC3600 Blade System) system in China (4640 GPUs), then took another boost in late 2012 with the upgrade of ORNL’s Jaguar to Titan (>18000 GPUs).
      • The Xeon Phi contribution is dominated by the immensely large Tianhe-2 system in China (48000 coprocessors), and the Stampede system at TACC (6880 coprocessors).
    • Note the rapid growth, then contraction, of the accelerated systems Rmax.
      • More on this topic below in the discussion of “clusters of clusters”.

Slide05
  • Obviously a high-level summary, but backed by a large amount of (somewhat fuzzy) data over the years.
  • With x86, we get (slowly) decreasing price per “core”, but it is not obvious that we will get another major technology replacement soon.
  • The embedded and mobile markets are larger than the x86 server/PC/laptop market, but there are important differences in the technologies and market requirements that make such a transition challenging.
  • One challenge is the increasing degree of parallelism required — think about how much parallelism an individual researcher can “own”.

Slide06
  • Systems on the TOP500 list are almost always shared, typically by scores or hundreds of users.
  • So for a system with up to a few hundred cores, you don’t need to run parallel applications – your share is less than one “processor”.
  • Beyond a few thousand cores, a user’s allocation will typically be multiple core-years per year, so parallelism is required.
  • A fraction of users can get away with single-node parallelism (perhaps with independent jobs running concurrently on multiple nodes), but the majority of users will need multi-node parallel implementations for turnaround, for memory capacity, or simply for throughput.

Slide07

Instead of building large homogeneous systems, many sites have recognized the benefit of specialization – a type of HW/SW “co-configuration”.
These configurations are easiest when the application profile is stable and well-known. It is much more challenging for a general-purpose site such as TACC.


Slide08

Slide09
  • This aside introduces the STREAM benchmark, which is what got me thinking about “balance” 25 years ago.
  • I have never visited the University of Virginia, but had colleagues there who agreed that STREAM should stay in academia when I moved to industry in 1996, and offered to host my guest account.

Slide10
  • Note that the output of each kernel is used as an input to the next.
    • The earliest versions of STREAM did not have this property and some compilers removed whole loops whose output was not used.
    • Fortunately it is easy to identify cases where this happens so that workarounds can be applied.
    • Another way to say this is that STREAM is resistant to undetected over-optimization.
  • OpenMP directives were added in 1996 or 1997.
  • STREAM in C was made fully 64-bit capable in 2013.
    • The validation code was also fixed to eliminate a problem with round-off error that used to occur for very large array sizes.
  • Output print formats keep getting wider as systems get faster.

Slide11
  • STREAM measures time, not bandwidth, so I have to make assumptions about how much data is moved to and from memory.
    • For the Copy kernel, there are actually three different conventions for how many bytes of traffic to count!
    • I count the reads that I asked for and the writes that I asked for.
    • If the processor requires “write allocates” the maximum STREAM bandwidth will be lower than the peak DRAM bandwidth.
      • The Copy and Scale kernels require 3/2 as much bandwidth as STREAM gives credit for if write allocates are included.
      • The Add and Triad kernels require 4/3 as much bandwidth as STREAM gives credit for if write allocates are included.
  • One weakness of STREAM is that all four kernels are in “store miss” form – none of the arrays are read before being written in a kernel.
    • A counter-example is the DAXPY kernel: A[i] = A[i] + scalar*B[i], for which the store hits in the cache because A[i] was already loaded as an input to the addition operation.
    • Non-allocating/non-temporal/streaming stores are typically required for best performance with the existing STREAM kernels, but these are not supported by all architectures or by all compilers.
      • For example, GCC will not generate streaming stores.
    • In my own work I typically supplement STREAM with “read-only” code (built around DDOT), a standard DAXPY (updating one of the two inputs), and sometimes add a “write-only” benchmark.
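
For reference, the four STREAM kernels, with the per-iteration byte counts that STREAM credits them with (64-bit elements); the comments also show the traffic when the write-allocate read is included, which is where the 3:2 and 4:3 ratios above come from:

void stream_kernels(double *a, double *b, double *c, double scalar, long n)
{
    long j;
    for (j = 0; j < n; j++) c[j] = a[j];                 /* Copy:  16 B counted, 24 B with write allocate */
    for (j = 0; j < n; j++) b[j] = scalar * c[j];        /* Scale: 16 B counted, 24 B with write allocate */
    for (j = 0; j < n; j++) c[j] = a[j] + b[j];          /* Add:   24 B counted, 32 B with write allocate */
    for (j = 0; j < n; j++) a[j] = b[j] + scalar * c[j]; /* Triad: 24 B counted, 32 B with write allocate */
}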

Slide12

Back to the main topic….

For performance modeling, I try to find a small number of “performance axes” that are capable of accounting for most of the execution time.


Slide13
  • Using the same performance axes as on the previous slide….
  • All balances are shifting to make data motion relatively more expensive than arithmetic.

Slide14
  • The first “Balance Ratio” to look at is (FP rate) / (Memory BW rate).
    • This is the cost per (64-bit) “word” loaded relative to the cost of a (peak) 64-bit FP operation, and applies to long streaming accesses (for which latency can be overlapped).
    • I refer to this metric as the “STREAM Balance” or “Machine Balance” or “System Balance”.
  • The data points here are from a set of real systems.
    • The systems I chose were both commercially successful and had very good memory subsystem performance relative to their competitors.
      • ~1990: IBM RISC System 6000 Model 320 (IBM POWER processor)
      • ~1993: IBM RISC System 6000 Model 590 (IBM POWER processor)
      • ~1996: SGI Origin2000 (MIPS R10000 processor)
      • ~1999: DEC AlphaServer DS20 (DEC Alpha processor)
      • ~2005: AMD Opteron 244 (single-core, DDR1 memory)
      • ~2006: AMD Opteron 275 (dual-core, DDR1 memory)
      • ~2008: AMD Opteron 2352 (quad-core, DDR2 memory)
      • ~2011: Intel Xeon X5680 (6-core Westmere EP)
      • ~2012: Intel Xeon E5 (8-core Sandy Bridge EP)
      • ~2013: Intel Xeon E5 v2 (10-core Ivy Bridge EP)
      • ~2014: Intel Xeon E5 v3 (12-core Haswell EP)
      • (future: Intel Xeon E5 v5)
  • Because memory bandwidth is understood to be an important performance limiter, the processor vendors have not let it degrade too quickly, but more and more applications become bandwidth-limited as this value increases (especially with essentially fixed cache size per core).
    • Unfortunately every other metric is much worse….

Slide15
  • ERRATA: There is an error in the equation in the title — it should be “(GFLOPS/s)*(Memory Latency)”
  • Memory latency is becoming expensive much more rapidly than memory bandwidth!
    • Memory latency is dominated by the time required for cache coherence in most recent systems.
    • Slightly decreasing clock speeds with rapidly increasing core counts leads to slowly increasing memory latency – even with heroic increases in hardware complexity.
  • Memory latency is not a dominant factor in very many applications, but it was not negligible in 7 of the 17 SPECfp2006 codes using hardware from 2006, so it is likely to be of increasing concern.
    • More on this below — slide 38.
    • The principal way to combat the negative impact of memory latency is to make hardware prefetching more aggressive, which increases complexity and costs a significant amount of power.

Slide16
  • Interconnect bandwidth (again for large messages) is falling behind faster than local memory bandwidth – primarily because interconnect widths have not increased in the last decade (while most chips have doubled the width of the DRAM interfaces over that period).
  • Interconnect *latency* (not shown) is so high that it would ruin the scaling of even a log chart.  Don’t believe me?  OK…

Slide17
  • Interconnect latency is decreasing more slowly than per-core performance, and much more slowly than per-chip performance.
  • Increasing the problem size only partly offsets the increasing relative cost of interconnect latency.
    • The details depend on the scaling characteristics of your application and are left as an exercise….

Slide18
  • Early GPUs had better STREAM Balance (FLOPS/Word) because the double-precision FLOPS rate was low.   This is no longer the case.
  • In 2012, mainstream, manycore, and GPU had fairly similar balance parameters, with both manycore and GPGPU using GDDR5 to get enough bandwidth to stay reasonably balanced.  We expect mainstream to be *more tolerant* of low bandwidth due to large caches and GPUs to be *less tolerant* of low bandwidth due to the very small caches.
  • In 2016, mainstream processors have not been able to increase bandwidth proportionately (~3x increase in peak FLOPS and 1.5x increase in BW gives 2x increase in FLOPS/Word).
  • Both manycore and GPU have required two generations of non-standard memory (GDDR5 in 2012 and MCDRAM and HBM in 2016) to prevent the balance from degrading too far.
    • These rapid changes require more design cost which results in higher product cost.

Slide19

Slide20
  • This chart is based on representative Intel processor models available at the beginning of each calendar year – starting in 2003, when x86 jumped to 36% of the TOP500 Rmax.
    • The specific values are based on the median-priced 2-socket server processor in each time frame.
    • The frequencies presented are the “maximum Turbo frequency for all cores operational” for processors through Sandy Bridge/Ivy Bridge.
    • Starting with Haswell, the frequency presented is the power-limited frequency when running LINPACK (or similar) on all cores.
      • This causes a significant (~25%) drop in frequency relative to operation with less computationally intense workloads, but even without the power limitation the frequency trend is slightly downward (and is expected to continue to drop).
  • Columns shaded with hash marks are for future products (Broadwell EP is shipping now for the 2017 column).
    • Core counts and frequencies are my personal estimates based on expected technology scaling and don’t represent Intel disclosures about those products.
  • How do Intel’s ManyCore (Xeon Phi) processors compare?

Slide21
  • Comparing the components of the per-socket GFLOPS of the Intel ManyCore processors relative to the Xeon ”mainstream” processors at their introduction.
  • The delivered performance ratio is expected to be smaller than the peak performance ratio even in the best cases, but these ratios are large enough to be quite valuable even if only a portion of the speedup can be obtained.
  • The basic physics being applied here is based on several complementary principles:
    • Simple cores are smaller (so you can fit more per chip) and use less power (so you can power & cool more per chip).
    • Adding cores brings a linear increase in peak performance (assuming that the power can be supplied and the resulting heat dissipated).
    • For each core, reducing the operating frequency brings a greater-than-proportional reduction in power consumption.
  • These principles are taken even further in GPUs (with hundreds or thousands of very simple compute elements).

Slide22
  • The DIMM architecture for DRAMs has been great for flexibility, but terrible for bandwidth.
  • Modern serial link technology runs at 25 Gb/s per lane, while the fastest DIMM-based DDR4 interfaces run at just under 1/10 of that data rate.

Slide23
  • In 1990, the original “Killer Micros” had a single level of cache.
  • Since about 2008, most x86 processors have had 3 levels of cache.
    • Every design has to consider the potential performance advantage of speculatively probing the next level of cache (before finding out if the request has “hit” in the current level of cache) against the power cost of performing all those extra lookups.
      • E.g., if the miss rate is typically 10%, then speculative probing will increase the cache tag lookup rate in the next level by 10x, since every access (rather than only the ~10% that miss) now performs a lookup there.
      • The extra lookups can actually reduce performance if they delay “real” transactions.
  • Asynchronous clock crossings are hard to implement with low latency.
    • A big, and under-appreciated, topic for another forum…
  • Intel Xeon processors evolution:
    • Monolithic L3: Nehalem/Westmere — 1 coherence protocol supported
    • Sliced L3 on one ring: Sandy Bridge/Ivy Bridge — 2/3 coherence protocols supported
    • Sliced L3 on two rings: Haswell/Broadwell — 3 coherence protocols supported

Slide24

Slide25
  • Hardware is capable of extremely low-latency, low-overhead, high-bandwidth data transfers (on-chip or between chips), but only in integrated systems.
  • Legacy IO architectures have orders of magnitude higher latency and overhead, and are only able to attain full bandwidth with very long messages.
  • Some SW requirements, such as MPI message tagging, have been introduced without adequate input from HW designers, and end up being very difficult to optimize in HW.
    • It may only take one incompatible “required” feature to prevent an efficient HW architecture from being used for communication.

Slide26

Slide27
  • Thanks to Burton Smith for the Little’s Law reference!
  • Before we jump into the numbers, I want to show an illustration that may make Little’s Law more intuitive….

Slide28
  • Simple graphical illustration of Little’s Law.
  • I had to pick an old processor with a small latency-BW product to make the picture small enough to be readable.
  • The first set of six loads is needed to keep the bus busy from 0 to 50 ns.
  • As each cache line arrives, another request needs to be sent out so that the memory will be busy 60 ns in the future.
  • The second set of six loads can re-use the same buffers, so only six buffers are needed.

Slide29
  • In the mid-1990’s, processors were just transitioning from supporting a single outstanding cache miss to supporting 4 outstanding cache misses.
  • In 2005, a single core of a first generation AMD Opteron could manage 8 cache misses directly, but only needed 5-6 to saturate the DRAM interface.
  • By mid-2016, a Xeon E5 v4 (Broadwell) processor requires about 100 cache lines “in flight” to fully tolerate the memory latency.
    • Each core can only directly support 10 outstanding L1 Data Cache misses, but the L2 hardware prefetchers can provide additional concurrency.
    • It still requires 6 cores to get within 5% of asymptotic bandwidth, and the processor energy consumed is 6x to 10x the energy consumed in the DRAMs.
  • The “Mainstream” machines are:
    • SGI Origin2000 (MIPS R10000)
    • DEC Alpha DS20 (DEC Alpha EV5)
    • AMD Opteron 244 (single-core, DDR1 memory)
    • AMD Opteron 275 (dual-core, DDR1 memory)
    • AMD Opteron 2352 (quad-core, DDR2 memory)
    • Intel Xeon X5680 (6-core Westmere EP)
    • Intel Xeon E5 (8-core Sandy Bridge EP)
    • Intel Xeon E5 v2 (10-core Ivy Bridge EP)
    • Intel Xeon E5 v3 (12-core Haswell EP)
    • Intel Xeon E5 v4 (14-core Broadwell EP)
    • (future: Intel Xeon E5 v5 — a plausible estimate for a future Intel Xeon processor, based on extrapolation of current technology trends.)
  • The GPU/Manycore machines are:
    • NVIDIA M2050
    • Intel Xeon Phi (KNC)
    • NVIDIA K20
    • NVIDIA K40
    • Intel Xeon Phi/KNL
    • NVIDIA P100 (Latency estimated).

Slide30

Slide31

Slide32

Slide33

Slide34
  • Power density matters, but systems remain so expensive that power cost is still a relatively small fraction of acquisition cost.
  • For example, a hypothetical exascale system that costs $100 million a year for power may not be out of line, because the purchase price of such a system would be in the range of $2 billion.
  • Power/socket is unlikely to increase significantly because of the difficulty of managing both bulk cooling and hot spots.
  • Electrical cost is unlikely to increase by large factors in locations attached to power grids.
  • So the only way for power costs to become dominant is for the purchase price per socket to be reduced by something like an order of magnitude.
  • Needless to say, the companies that sell processors and servers at current prices have an incentive to take actions to keep prices at current levels (or higher).
  • Even the availability of much cheaper processors is not quite enough to make power cost a first-order concern, because if this were to happen, users would deliberately pay more money for more energy-efficient processors, just to keep the ongoing costs tolerable.
  • In this environment, budgetary/organizational/bureaucratic issues would play a major role in the market response to the technology changes.

Slide35
  • Client processors could reduce node prices by using higher-volume, lower-gross-margin parts, but this does not fundamentally change the technology issues.
    • 25%/year for power might be tolerable with minor adjustments to our organizational/budgetary processes.
    • (Especially since staff costs are typically comparable to system costs, so 25% of hardware purchase price per year might only be about 12% of the annual computing center budget for that system.)
  • Very low-cost parts (”embedded” or “DSP” processors) are in a different ballpark – lifetime power cost can exceed hardware acquisition cost.
  • So if we get cheaper processors, they must be more energy-efficient as well.
  • This means that we need to understand where the energy is being used and have an architecture that allows us to control high-energy operations.
    • Not enough time for that topic today, but there are some speculations in the bonus slides.

Slide36
  • For the purposes of this talk, my speculations focus on the “most likely” scenarios.
  • Alternative approaches to maintaining the performance growth rate of systems are certainly possible, but there are many obstacles on those paths and it is difficult to have confidence that any will be commercially viable.

Slide37
  • Once an application becomes important enough to justify the allocation of millions of dollars per year of computing resources, it begins to make sense to consider configuring one or more supercomputers to be cost-effective for that application.
    • (This is fairly widespread in practice.)
  • If an application becomes important enough to justify the allocation of tens of millions of dollars per year of computing resources, it begins to make sense to consider designing one or more supercomputers to be cost-effective for that application.
    • (This has clearly failed many times in the past, but the current technology balances make the approach look attractive again.)
  • Next we will look at examples of “application balance” from various application areas.

Slide38
  • CRITICAL!  Application characterization is a huge topic, but it is easy to find applications that vary by two orders of magnitude or more in requirements for peak FP rate, for pipelined memory accesses (bandwidth), for unexpected memory access (latency), for large-message interconnect bandwidth, and for short-message interconnect latency.
  • The workload on TACC’s systems covers the full range of node-level balances shown here (including at least a half-dozen of the specific codes listed).
  • TACC’s workload includes comparable ranges in requirements for various attributes of interconnect and filesystem IO performance.
  • This chart is based on a sensitivity-based performance model with additive performance components.
    • In 2006/2007 there were enough different configurations available in the SPEC benchmark database for me to perform this analysis using public data.
    • More recent SPEC results are less suitable for this data mining approach for several reasons (notably the use of autoparallelization and HyperThreading).
    • But the modeling approach is still in active use at TACC, and was the subject of my keynote talk at the 2nd International Workshop on Performance Modeling: Methods and Applications (PMMA16).

Slide39
  • Catch-22 — the more you know about the application characteristics and the more choices you have for computing technology and configuration, the harder it is to come up with cost-effective solutions!

Slide40
  • It is very hard for vendors to back off of existing performance levels….
  • As long as purchase prices remain high enough to keep power costs at second-order, there will be incentive to continue making the fast performance axes as fast as possible.
  • Vendors will only have incentive to develop systems balanced to application-specific performance ratios if there is a large enough market that makes purchases based on optimizing cost/performance for particular application sub-sets.
  • Current processor and system offerings provide a modest degree of configurability of performance characteristics, but the small price range makes this level of configurability a relatively weak lever for overall system optimization.

Slide41
  • This is not the future that I want to see, but it is the future that seems most likely – based on technological, economic, and organizational/bureaucratic factors.
  • There is clearly a disconnect between systems that are increasingly optimized for dense, vectorized floating-point arithmetic and systems that are optimized for low-density “big data” application areas.
    • As long as system prices remain high, vendors will be able to deliver systems that have good performance in both areas.
    • If the market becomes competitive there will be incentives to build more targeted alternatives – but if the market becomes competitive there will also be less money available to design such systems.

Slide42
  • Here I hit the 45-minute mark and ended the talk.
  • I will try to upload and annotate the “Bonus Slides” discussing potential disruptive technologies sometime in the next week.


Posted in Computer Architecture, Computer Hardware

Some notes on producer/consumer communication in cached processors

Posted by John D. McCalpin, Ph.D. on 22nd November 2016

In a recent Intel Software Developer Forum discussion (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/700477), I put together a few notes on the steps required for a single-producer, single-consumer communication using separate cache lines for “data” and “flag” values.

Although this was not a carefully-considered formal analysis, I think it is worth re-posting here as a reminder of the ridiculous complexity of implementing even the simplest communication operations in shared memory on cached systems.


I usually implement a producer/consumer code using “data” and “flag” in separate cache lines.  This enables the consumer to spin on the flag while the producer updates the data.   When the data is ready, the producer writes the flag variable.   At a low level, the steps are:

  1. The producer executes a store instruction, which misses in its L1 Data Cache and L2 cache, thus generating an RFO transaction on the ring.
    • The data for the store is held in a store buffer at least until the producer core has write permission on the cache line in its own L1 Data Cache.
    • The data for the store may be held in the store buffer for longer periods, awaiting other potential stores to the same cache line.
  2. The RFO traverses the ring to find the L3 slice that owns the physical address being used, and the RFO hits in the L3.
  3. The L3 has the data, but also sees that it is marked as being in a clean state in one or more processor private caches, so it issues an invalidate on the cache line containing the flag variable.
    • The L3 may or may not have a way of tracking whether the cache line containing the flag variable is shared in another chip.
    • The L3 may have directories that track which cores may have a copy of the cache line containing the flag variable.  If there are directories, the invalidates may be targeted at the specific caches that may hold the cache line, rather than being broadcast to all cores.
    • The L3 may send the data for the cache line containing the flag to the producer’s cache at this time, but that data cannot be used until the coherence transactions are complete.
  4. The consumer receives the invalidate, invalidates its copy of the flag line, and responds to the L3 that the line has been invalidated.
    • Immediately after responding to the L3 that the line has been invalidated, the spin loop in the consumer tries to load the flag variable again.
  5. The L3 receives the invalidation acknowledgements from all the cores that may have had a shared copy of the line, and notifies the producer core that it now has write permission on the cache line containing the flag data.
    • If you are lucky, the producer core will write the flag variable from the store buffer to the L1 Data Cache immediately on receipt of permission.
    • If you are unlucky, the producing core may lose the line containing the flag variable before the store buffer is written to the L1 Data Cache.  I don’t know if Intel processors can do this, but I know some processors that can lose the line before dumping the store buffer.
  6. Very shortly after sending the write permission notification to the producer core, the L3 will receive a read request from the consumer core for the same cache line containing the flag.
    • Depending on the implementation, several different things might happen.
    • One option is for the L3 to hold the line containing the flag variable in a “transient” state while it waits for an acknowledgement from the Producer core that it has received the write permission message.  In this case the L3 will either:
      • Stall the read request from the consumer core, or
      • NACK the read request from the consumer core (i.e., tell it to try again).
    • Another option is for the L3 to immediately process the read request and send an intervention/downgrade request for the cache line containing the flag variable to the producer core’s cache.
  7. In the “lucky” case, the intervention/downgrade request generated by the read from the consumer core will get the new value of the cache line containing the flag variable and return it to the consumer core and to the L3 slice that owns that physical address.
    • Various implementations have specific ordering requirements here that determine whether the cache line must be sent to the L3 first and then to the consumer core, or whether it can be sent to both at the same time.
    • Some implementations require an extra handshaking step after the consumer core receives the data, before the L3 will give it permission to use the data.  (This is more common in the case of a store than a read.)
  8. Finally the consumer core gets the new value of the flag variable and sees that it has changed!  The data is now ready!
  9. The spin loop on the consumer core now exits, which incurs a 20-cycle mispredicted branch delay.
  10. The consumer core now executes a load instruction to get the data.  This misses in the consumer’s L1 and L2 caches and generates a read request on the ring.
  11. The read request traverses the ring to the slice that owns the physical address of the cache line containing the data (which may be a different slice than the one controlling the cache line containing the flag), and the read request hits in the L3.
  12. The data in the L3 is stale, but the L3 knows exactly which core has write permission on the cache line containing the data, so it issues an intervention/downgrade on the cache line containing the data and targeting the cache of the producer core.
  13. The cache(s) of the producer core receive the intervention/downgrade request and return the new value of the cache line containing the data variable to the L3, simultaneously downgrading the permissions on the line so that it is now “read-only” in the producer’s caches.
  14. As was the case for the cache line containing the flag variable, the cache line containing the data variable makes its way to both the L3 slice that owns the physical address and the consumer core that requested the data.
  15. The cache line containing the data arrives in the consumer core’s cache, and the load instruction is allowed to complete.
  16. Once the consumer core has gotten the data safely into a register, it typically has to re-write the flag variable to let the producer core know that the value has been consumed and that the producer core is free to write to the cache line containing the data variable again.
    • This requires the consumer to make an “upgrade” request on the cache line containing the flag, so it can write to it.   This is similar to the sequence above, but since the consumer already has the data, it does not need the L3 to send it again — it only needs to wait until all other cores have acknowledged the invalidate before it can write to the flag line.
    • Double-buffering can be used to avoid this extra transaction — if the consumer uses a different set of addresses to send data back to the producer, then the fact that the producer has received another message from the consumer means that the consumer must have finished using the original data buffer, so it is safe for the producer to use it again.
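
At the source level the pattern being analyzed above is quite small.  A minimal sketch in C11 atomics (illustrative names, not taken from any particular implementation); it shows only the software side, while the hardware traffic it generates is the sequence enumerated above:

#include <stdatomic.h>
#include <stdint.h>

#define CACHELINE 64

_Alignas(CACHELINE) static uint64_t    data;        /* "data" in its own cache line */
_Alignas(CACHELINE) static atomic_uint flag;        /* "flag" in a separate line    */

void producer(uint64_t value)
{
    data = value;                                          /* update the data line      */
    atomic_store_explicit(&flag, 1, memory_order_release); /* then publish via the flag */
}

uint64_t consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                  /* spin on the flag line     */
    return data;                                           /* then read the data line   */
}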

There are many variants and many details that can be non-intuitive in an actual implementation.  These often involve extra round trips required to ensure ordering in ugly corner cases.  A common example is maintaining global ordering across stores that are handled by different coherence controllers.  This can be different L3 slices (and/or Home Agents) in a single package, or the more difficult case of stores that alternate across independent packages.

There are fewer steps in the case where the “data” and “flag” are in the same cache line, but extra care needs to be taken in that case because it is easier for the polling activity of the consumer to take the cache line away from the producer before it has finished doing the updates to the data part of the cache line.  This can result in more performance variability and reduced total performance, especially in cases with multiple producers and multiple consumers (with locks, etc.).

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on 5th November 2016

The art and science of microprocessor architecture is a never-ending struggle to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four consecutive 32-bit values from memory.   As an example of one mode of operation, the four steps are:

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.
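
A plain-C sketch of the semantics just described may make this clearer (hypothetical function and variable names, illustrating only the data flow, not the actual encoding or an existing intrinsic):

/* acc: the single 512-bit accumulator (16 single-precision lanes)
   src: the four consecutively-numbered 512-bit input registers
   mem: four consecutive 32-bit values from memory                  */
void v4fmaddps_semantics(float acc[16], const float src[4][16], const float *mem)
{
    for (int step = 0; step < 4; step++) {          /* four consecutive FMA operations */
        float broadcast = mem[step];                /* next 32-bit value, broadcast    */
        for (int lane = 0; lane < 16; lane++)       /* across the 16 SIMD lanes        */
            acc[lane] += broadcast * src[step][lane];
    }
}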

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: http://www.agner.org/optimize/instruction_tables.pdf)
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number of accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
    • I am sure someone will be happy to tell me if I am wrong!
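
For concreteness, here is a minimal sketch of this 4×3 register blocking in C with AVX2/FMA intrinsics.  The packed-panel layouts (Apanel[4*k+i] for the broadcast values, Bpanel[12*k+j] for the contiguous values), the function name, and the row-major update of C are my illustrative assumptions — they are not part of the analysis above — and a production kernel would be fully unrolled (or written in assembly) to guarantee that the 12 accumulators stay pinned in the 16 named registers.

#include <immintrin.h>

/* Hypothetical 4x12 (4 rows x 3 SIMD vectors) DGEMM micro-kernel sketch.
 * Assumes packed operand panels: Apanel[4*k + i] holds the 4 values broadcast
 * at step k, Bpanel[12*k + j] holds the 12 contiguous values loaded at step k.
 * C is row-major with leading dimension ldc, updated as C += A*B.            */
static void dgemm_4x12_sketch(long K, const double *Apanel,
                              const double *Bpanel, double *C, long ldc)
{
    __m256d acc[4][3];                        /* 12 SIMD accumulators          */
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 3; j++)
            acc[i][j] = _mm256_setzero_pd();

    for (long k = 0; k < K; k++) {
        /* 3 contiguous vector loads of B, re-used across 4 broadcasts of A    */
        __m256d b0 = _mm256_loadu_pd(&Bpanel[12*k + 0]);
        __m256d b1 = _mm256_loadu_pd(&Bpanel[12*k + 4]);
        __m256d b2 = _mm256_loadu_pd(&Bpanel[12*k + 8]);
        for (int i = 0; i < 4; i++) {
            /* 1 scalar load of A broadcast across the SIMD lanes, used 3 times */
            __m256d a = _mm256_broadcast_sd(&Apanel[4*k + i]);
            acc[i][0] = _mm256_fmadd_pd(a, b0, acc[i][0]);
            acc[i][1] = _mm256_fmadd_pd(a, b1, acc[i][1]);
            acc[i][2] = _mm256_fmadd_pd(a, b2, acc[i][2]);
        }
    }

    for (int i = 0; i < 4; i++)               /* accumulate results into C     */
        for (int j = 0; j < 3; j++) {
            double *cp = &C[i*ldc + 4*j];
            _mm256_storeu_pd(cp, _mm256_add_pd(_mm256_loadu_pd(cp), acc[i][j]));
        }
}

Per iteration of the contraction loop this is 3 vector loads, 4 broadcast loads, and 12 FMAs — i.e., 7 loads per 6 cycles on a core that executes 2 FMAs per cycle, comfortably under 2 loads per cycle.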

So that was a lot of detail — what is the point?

The first point relates to the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand, this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)
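
As a hedged illustration of the embedded-broadcast feature (this is my sketch, not Intel’s code), one k-step of the kernel can be written so that the compiler folds the broadcast of the A element into the memory operand of the FMA, eliminating the separate broadcast instruction:

#include <immintrin.h>

/* One step of a hypothetical AVX-512 micro-kernel.  The broadcast is written
 * with _mm512_set1_pd(); compilers can fold it into the FMA's memory operand
 * using the embedded-broadcast form, e.g.
 *     vfmadd231pd zmm0, zmm1, QWORD PTR [rdi]{1to8}                          */
static inline __m512d fma_step(__m512d acc, const double *a_elem, __m512d b)
{
    return _mm512_fmadd_pd(_mm512_set1_pd(*a_elem), b, acc);
}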

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.

 

This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.

Implications?

  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g., http://dl.acm.org/citation.cfm?id=1730587), for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many neural-network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS instruction may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.
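
A simple way to check this condition in code (my own helper, assuming 64-Byte cache lines): an access crosses a line boundary exactly when its offset within the line plus its size exceeds 64.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* True if an access of 'size' bytes at 'addr' spans two 64-Byte cache lines. */
static inline bool crosses_cache_line(uintptr_t addr, size_t size)
{
    return ((addr & 63u) + size) > 64u;
}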

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?


UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But two 32-bit loads amount to only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.
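
To make the coalescing argument concrete, the sketch below (mine, not Intel’s) shows the four consecutive FMAs that V4FMADDPS is intended to replace, written with current AVX-512 intrinsics.  Each of the four FMAs broadcasts one consecutive 32-bit element of A and therefore implies its own L1 Data Cache access; V4FMADDPS takes the four elements as a single 128-bit memory operand, so all four broadcasts can be satisfied by one cache access.

#include <immintrin.h>

/* Four consecutive k-steps of an SGEMM inner loop in ordinary AVX-512 form.
 * b0..b3 are the four B vectors already held in named registers (a "block" of
 * sequential registers in the V4FMADDPS formulation); a[0..3] are consecutive
 * 32-bit elements of A.  V4FMADDPS performs these four FMAs as one instruction
 * with a single 128-bit memory operand covering a[0..3].                      */
static inline __m512 four_fma_steps(__m512 acc, const float *a,
                                    __m512 b0, __m512 b1, __m512 b2, __m512 b3)
{
    acc = _mm512_fmadd_ps(_mm512_set1_ps(a[0]), b0, acc);
    acc = _mm512_fmadd_ps(_mm512_set1_ps(a[1]), b1, acc);
    acc = _mm512_fmadd_ps(_mm512_set1_ps(a[2]), b2, acc);
    acc = _mm512_fmadd_ps(_mm512_set1_ps(a[3]), b3, acc);
    return acc;
}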

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | 2 Comments »

Invited Talk at SuperComputing 2016!

Posted by John D. McCalpin, Ph.D. on 16th October 2016

“Memory Bandwidth and System Balance in HPC Systems”

If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic).   The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.

The official announcement:

SC16 Invited Talk Spotlight: Dr. John D. McCalpin Presents “Memory Bandwidth and System Balance in HPC Systems”

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Invited Talk at SuperComputing 2016!

AMD Opteron Memory Configuration notes

Posted by John D. McCalpin, Ph.D. on 16th December 2015

(These are old notes for relatively old systems — I just found this in my “drafts” folder and decided to switch the status to “public” so I can find it again!)

Some notes on how to determine the DRAM and Memory Controller configuration for a system using AMD Opteron/Phenom or other Family 10h processors.  All of this information is available in AMD’s publication: “BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors”, document number 31116. I am using “Rev 3.48 – April 22, 2010”

Background: How to Read and Interpret the PCI Configuration Bits

The processor configuration bits are available in PCI configuration space and can be read with the “lspci” program.  Unfortunately it requires a relatively new kernel (2.6.26 or newer) to read the extended configuration bits (i.e., those with offsets greater than 256 Bytes) — I will try to mark the problematic configuration bits as I go along.   To get the configuration bits, run “lspci -xxxx” (as root) and save the text output.

The “lspci” program prints out the output by bus, device, function, and offset.  Here we are only interested in the processor configurations, so we look through the output until we get to a line that looks like:

00:18.0 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] HyperTransport Configuration

The initial characters of the line are interpreted as “bus 0”, “device 18”, “function 0”, followed by a text label for this PCI configuration space region.

The following lines will look like:

00: 22 10 00 12 00 00 10 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 00 00 00 00

These are hexadecimal dumps of the values at various “offsets” into the PCI configuration space.

The values are organized with the higher addresses and most significant bits to the right (except that within each 2-digit hexadecimal number the least significant bits are to the right!).  These PCI configuration space values are organized into 32-bit “registers”, so the first line above corresponds to

Offset         00                       04                       08                       0C
bits           7:0  15:8  23:16  31:24  7:0  15:8  23:16  31:24  7:0  15:8  23:16  31:24  7:0  15:8  23:16  31:24
value (hex)    22   10    00     12     00   00    10     00     00   00    00     06     00   00    80     00

This first line of output corresponds to four 32-bit PCI configuration space registers, shown as offsets 00, 04, 08, and 0C in the table. The BKDG describes these as:
F0x00 Device/Vendor ID Register
Reset: 1200 1022h.  ← note the “little-endian” ordering
F0x04 Status/Command Register
Reset: 0010 0000h. Bit[20] is set to indicate the existence of a PCI-defined capability block.
F0x08 Class Code/Revision ID Register
Reset: 0600 0000h.
F0x0C Header Type Register
Reset: 0080 0000h.
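
For readers who prefer to read the bytes programmatically rather than parsing lspci output, here is a minimal sketch (the path, printing, and error handling are mine) that reads the configuration space of bus 0, device 0x18, function 0 through sysfs and assembles each 32-bit register from its little-endian bytes, reproducing the reordering described above:

#include <stdio.h>
#include <stdint.h>

/* Sketch: read PCI config space for 0000:00:18.0 via sysfs and print the
 * first line's worth of 32-bit registers, assembled from little-endian bytes.
 * Reading offsets above 0xFF (extended config space) may still require
 * appropriate kernel support and root privileges.                            */
int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:00:18.0/config";
    unsigned char buf[256];
    FILE *f = fopen(path, "rb");
    if (!f) { perror(path); return 1; }
    size_t n = fread(buf, 1, sizeof(buf), f);
    fclose(f);

    for (size_t off = 0; off + 4 <= n && off < 0x10; off += 4) {
        uint32_t reg = (uint32_t)buf[off]
                     | ((uint32_t)buf[off + 1] << 8)
                     | ((uint32_t)buf[off + 2] << 16)
                     | ((uint32_t)buf[off + 3] << 24);
        printf("F0x%02zX = %08Xh\n", off, reg);   /* e.g. F0x00 = 12001022h    */
    }
    return 0;
}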

Important Memory Controller and DRAM Configuration Bits

Information About Installed Hardware

The first thing to look at is the properties of the installed hardware. The DIMM configuration information is contained in the F2x[1,0]80 DRAM Bank Address Mapping Register. The Family 10h Opterons have two memory controllers — the information for controller 0 is located at Function 2, Offset 080, while the information for controller 1 is located at Function 2, Offset 180. Note that these offsets are specified in hexadecimal, so that offsets greater than 100 (=256 decimal) are located in the extended PCI configuration area and will not be included in the output of “lspci” on systems running 2.6.25 or earlier Linux kernels. For this configuration bit the inability to read the controller 1 values is not likely to be a problem — if the DIMMs have been properly installed in matching pairs, then the two memory controllers will be configured identically by the BIOS.

For the system I am looking at today, the output of lspci for Function 2, offset 80 is

80:	55	00	00	00

Comparing the description of the bit field mappings from the BKDG (page 238) with my data for this configuration register and the device information in Table 85 (page 239) of the BKDG gives:

Bits    Field Name      Value        Meaning
31:16   Reserved        0000h        (it is nice to see that these reserved bits are actually zero)
15:12   Dimm3AddrMap    0h           ignored because this DIMM is not populated
11:8    Dimm2AddrMap    0h           ignored because this DIMM is not populated
7:4     Dimm1AddrMap    5h = 0101b   CS size = 1 GB, Device size/width = 1G, x8, Bank Address bits = 15,14,13
3:0     Dimm0AddrMap    5h = 0101b   CS size = 1 GB, Device size/width = 1G, x8, Bank Address bits = 15,14,13

These interpretations make sense. The DIMMs are composed of 1 Gbit DRAM chips, each with 8 output bits (“x8”). To create a 64-bit DIMM, 8 of these DRAM chips work together as a single “rank” (actually there are 9 chips in a rank, to provide extra bits for ECC error correction). This “rank” will then have a capacity of 1 Gbit/chip * 8 chips = 1 GiB. (Here I am using the newer notation to distinguish between binary and decimal sizes — see http://en.wikipedia.org/wiki/Gibibyte for more information.)

A few things to note:

  • The bits in this configuration register don’t tell whether or not a DIMM is installed. The value of ‘0’ for Dimm2AddrMap and Dimm3AddrMap could correspond to a 128 MiB chip select size composed of 256 Mb parts with x16 width. It is quite unlikely that anyone will run across such a DIMM in their system — 256 Mbit parts are old technology and x16 width is quite unusual outside of the embedded processor space — but there may be no guarantee that the BIOS will set the bits here to such an easily identified value in the event that no DIMM is installed in that slot, so you do need to look elsewhere to be sure of the configuration.
  • The bits in this configuration register don’t tell how many “ranks” are included on each DIMM. A DIMM can be constructed with 1, 2, or 4 ranks (though 1 and 2 are by far the most common), so you need to check elsewhere to find the number of ranks.
  • In any system using both DRAM channels (either ganged or unganged-but-interleaved — see below) the address bit numbers in the table above must be incremented by one. The BKDG includes a comment on the bottom of page 238 that the address bits only need to be incremented when running in ganged mode. I think this is incorrect — the “effective” bank size is doubled (corresponding to incrementing the bank address bits) by the use of two DRAM channels, whether the interleaving is within a cache line (as in ganged mode) or between cache lines (as in unganged-but-interleaved) mode.
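
For completeness, the field extraction can be written out explicitly.  The helper below is my own illustration — each DimmAddrMap field is 4 bits wide, so the value 0x00000055 read above decodes to 5h for DIMMs 0 and 1 and 0h for DIMMs 2 and 3:

#include <stdio.h>
#include <stdint.h>

/* Extract the 4-bit DimmAddrMap field for DIMM 'n' (0..3) from the
 * F2x[1,0]80 DRAM Bank Address Mapping register value.                       */
static unsigned dimm_addr_map(uint32_t f2x80, unsigned n)
{
    return (f2x80 >> (4 * n)) & 0xF;
}

int main(void)
{
    uint32_t f2x080 = 0x00000055;            /* value read from lspci above   */
    for (unsigned n = 0; n < 4; n++)
        printf("Dimm%uAddrMap = %Xh\n", n, dimm_addr_map(f2x080, n));
    return 0;
}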

Information About Configuration of Hardware

Memory Controller Channel Interleave or Ganging

There are a number of common configuration choices that determine how the hardware makes use of the two 64-bit DRAM channels.

  • One feature is called “DRAM controller ganging”, which sets up the two DRAM channels to work in lockstep, with each cache line being split between the two channels. This feature is typically activated when the strongest error-correction features are desired. The downside of this approach is that each memory controller is only transferring 32 Bytes per cache line, which corresponds to 4 8-Byte bursts. This is the shortest burst length supported by DDR2 memory and makes the DRAM bus overheads relatively larger. DRAM controller ganging is enabled if F2x110 DRAM Controller Select Low Register Bit 4: DctGangEn: “DRAM controller ganging enable”, is set.
  • If the memory controllers are not ganged, the BIOS will attempt to set up “DRAM channel interleaving”. In this mode, each channel transfers full cache lines, and cache lines are distributed evenly between the two channels.
    DRAM channel interleaving is enabled if F2x110 DRAM Controller Select Low Register Bit 5: DctDatIntLv: “DRAM controller data interleave enable” is set. There are a number of options controlling exactly how the cache lines are mapped to the two DRAM channels. These options are controlled by F2x110 DRAM Controller Select Low Register Bits 7:6 DctSelIntLvAddr: “DRAM controller select channel interleave address bit”. The recommended option is the value “10”, which causes the DRAM channel select bit to be computed as the Exclusive-OR of address bits 20:16 & 6. When the value is “1”, channel 1 is selected, otherwise channel 0 is selected. Bit 6 is the bit above the cache line address, so using bit 6 alone would cause cache lines to be mapped odd/even to the two DRAM channels. By computing the channel select bit using the Exclusive-OR of six bits it is much less likely that an access pattern will repeatedly access only one of the two DRAM channels.
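
A small sketch (mine) of the recommended DctSelIntLvAddr = 10b mapping makes the hashing explicit — the channel select bit is the Exclusive-OR of physical address bits 20:16 and bit 6:

#include <stdint.h>

/* Channel select for unganged, interleaved mode with DctSelIntLvAddr = 10b:
 * XOR physical address bits 20:16 together with bit 6.  Returns 1 for
 * channel 1, 0 for channel 0.                                                */
static unsigned dram_channel_select(uint64_t paddr)
{
    unsigned x = (unsigned)((paddr >> 6) & 1);   /* bit above the cache line   */
    for (int bit = 16; bit <= 20; bit++)
        x ^= (unsigned)((paddr >> bit) & 1);
    return x;
}

Consecutive cache lines (which differ in bit 6) still alternate between the two channels, but strided access patterns that would otherwise hammer a single channel get scrambled by the higher-order address bits.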

Posted in Computer Hardware | Comments Off on AMD Opteron Memory Configuration notes

Counting Stall Cycles on the Intel Sandy Bridge Processor

Posted by John D. McCalpin, Ph.D. on 4th June 2014

Intuition might suggest that defining what a “stall cycle” is on a processor should be relatively straightforward. For some processors, this is actually the case — particularly in-order processors with a very small number of execution units and a very small number of non-pipelined instructions. For modern out-of-order processors, coming up with a precise and quantitative definition of “stall” involves numerous subtleties, and deriving a methodology to measure such stalls is even more difficult.

This week I did some testing of “stalls” using the hardware performance counters in the Intel Xeon E5-2680 (“Sandy Bridge EP”) processors in the Stampede system at TACC.

I found performance counter events that count stalls at two different places in the processor pipeline (with a third mentioned below, but not tested here):

  • Two events count cycles in which uops are not sent from the RAT (Register Alias Table — the register renaming unit) to the RS (Reservation Station — queues uops until the instructions defining their source operands have been dispatched, then dispatches “ready” uops to the execution ports)

    1. Event 0x0E, Umask 0x01: UOPS_ISSUED with the CMASK and INVERT flags: 0x01c3010e
      • Intel’s VTune calls this UOPS_ISSUED.STALL_CYCLES
    2. Event 0xA2, Umask 0x01: RESOURCE_STALLS.ANY
      • Consistently delivers values about 1% to 3% lower than the UOPS_ISSUED.STALL_CYCLES event in my tests.
  • Two events count cycles in which no uops are dispatched from the RS to any of the execution units (aka “ports”).
    1. Event 0xA3, Umask 0x04: CYCLE_ACTIVITY.CYCLES_NO_DISPATCH with CMASK=4: 0x044304a3
      • I got the CMASK value from VTune — the documentation in Vol 3 of the SW Developer’s Guide is not very helpful.
    2. Event 0xB1, Umask 0x02: UOPS_DISPATCHED.STALL_CYCLES_CORE: 0x01c302b1
      • This is very similar to an event used by VTune, but I use Umask 0x02 rather than 0x01. This will only make a difference on a system with HyperThreading enabled, and I don’t have any systems configured that way to test right now.
      • These two events differed by no more than a part per million in my tests.
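
For anyone who wants to reproduce these counts without VTune, the raw encodings above can be programmed through the Linux perf_event_open interface.  The sketch below is mine and comes with the usual hedges: the helper name and structure are placeholders, the USR/OS bits of the raw encoding are controlled by the exclude_user/exclude_kernel fields rather than by the config value, and whether a particular kernel version accepts a given raw encoding (especially the CMASK/INVERT combinations) may vary.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Count one raw core event (e.g., 0x01c3010e or 0x044304a3) on the calling
 * thread while work() runs.  Returns -1 on failure.                          */
static long long count_raw_event(uint64_t raw_config, void (*work)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;          /* raw PERFEVTSEL-style encoding      */
    attr.size = sizeof(attr);
    attr.config = raw_config;
    attr.disabled = 1;                  /* enabled explicitly below           */

    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return -1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();                             /* the code section being measured    */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count = -1;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) count = -1;
    close(fd);
    return count;
}

Running the same kernel once with 0x01c3010e and once with 0x044304a3 gives the issue-side and dispatch-side stall counts compared below.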

As discussed in the Intel forum thread (link), the first two events can easily overcount stalls in codes that have a “stall-free” IPC of less than 4. For example, a code with a “stall-free” IPC of 1 could show 75% stall cycles using these events, with uops transferred from the RAT to the RS in one block of 4 uops every 4 cycles (leaving 3 cycles idle).

The second two events typically undercount stalls because they consider a cycle to be a “non-stall” cycle if any uops are dispatched from the RS to the execution units, even when those uops subsequently get rejected and retried because their input data is not in the cache. Using the STREAM benchmark as my test case, I often saw that the total number of uops dispatched to the execution ports was 20%-50% higher than the number of uops issued from the RAT to the RS. (This was based on a small number of test cases which were not intended to approach the upper bound on uop retries, so I assume that the worst case fraction of retries could be much higher. I have seen retries of floating-point instructions exceeding 12x, and that was not intended to be a worst-case upper bound either.)

Unfortunately, there is no way to count these execution retries directly, and no way to determine how many cycles had instructions dispatched that were all rejected and retried.

Note that one can also count cycles in which no instructions are retired. This was also discussed in the forum thread above, and has the same theoretical problem as counting at issue — the processor can retire at least four instructions per cycle, so if the non-stalled IPC is less than four, burstiness of instruction retirement can result in non-zero stall cycle counts even if there are some instructions executing every cycle.

None of this discussion so far has explicitly dealt with the cause of the stalls. Intel provides a very interesting performance counter event that provides some insight into this issue. Event 0xA3 CYCLE_ACTIVITY has Umasks for “CYCLES_L2_PENDING” (0x01) and “CYCLES_NO_DISPATCH” (0x04). Again, the documentation in Vol 3 of the SW developer’s guide is not adequate to understand how to program this unit, but fortunately Intel’s VTune provides an example. The VTune event CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING is created with this event by combining the two Umasks and including a CMASK value of 5, giving the encoding: 0x054305a3. (It is not at all clear why the CMASK value should be 5 in this case, but the event is clearly non-standard since the combined Umask values are treated as a logical AND rather than the logical OR typically assumed for combined Umasks.)

In experiments with the STREAM benchmark, where the actual number of stall cycles should be around 90%, the values produced by CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING varied between 30% and 93% of the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH counts (without the L2_PENDING qualifier). The lower values were seen with tests using streaming (nontemporal) stores, while the higher values were seen using ordinary (allocating) stores. This pattern makes it clear that this event counts store misses (RFO’s) in the “L2_PENDING” category, but it leaves a “hole” in the memory stall cycle identification in the case where the memory stalls are due to streaming stores.

  • For AVX codes there is an event that catches this reasonably well: Event 0xA2, Umask 0x08: RESOURCE_STALLS.SB (cycles with no issue from the RAT to the RS because the store buffers are full) shows 70%-91% of the total cycles have issue stalls due to full store buffers. So looking at the max of CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING and RESOURCE_STALLS.SB gives a good indication of stalls due to memory for codes with either allocating stores or streaming stores.
  • For SSE codes with streaming stores the RESOURCE_STALLS.SB event is only 20%-37% of the total cycles. Even if you add the percentage stalls from this number to the percentage stalls using CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING you only get 45% – 59% of the total cycles, so I don’t yet have a set of events that can identify that all of the stall cycles are actually memory stalls. (Adding stall cycles in this way is not generally a good idea, since cycles can be stalled for both reasons. I only add the two here to show that they are both much too small to account for all of the stall cycles.)

Summary: There are resources available to help identify memory-related stall cycles, but they are not as precise as one might like. In most cases these counters can identify when memory stalls are dominating execution time, and this is really what a performance analyst is looking for. Once the problem is identified, tuning work is primarily based on execution time of the code section of interest, with hardware performance counters playing (at most) an advisory role.

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Counting Stall Cycles on the Intel Sandy Bridge Processor

Notes on the mystery of hardware cache performance counters

Posted by John D. McCalpin, Ph.D. on 14th July 2013


In response to a question on the PAPI mailing list, I scribbled some notes to try to help users understand the complexity of hardware performance counters for cache accesses and cache misses, and thought they might be helpful here….


For any interpretation of specific hardware performance counter events, it is absolutely essential to precisely specify the processor that you are using.

Cautionary Notes

Although it may not make a lot of sense, the meanings of “cache miss” and “cache access” are almost always quite different across different vendors’ CPUs, and can be quite different for different CPUs from the same vendor. It is actually rather *uncommon* for L1 cache misses to match L2 cache accesses, for a variety of reasons that are difficult to summarize concisely.

Some examples of behavior that could make the L1 miss counter larger than the L2 access counter:

  • If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch.
  • L1 caches (both data and instruction) typically have hardware prefetch engines. The L1 Icache miss counter may only be incremented when the instruction fetcher requests data that is not found in the L1 Icache, while the L2 cache access counter may be incremented every time the L2 receives either an L1 Icache miss or an L1 Icache prefetch.
  • The processor may attempt multiple instruction fetches of different addresses in the same cache line. The L1 Icache miss event might be incremented on each of these fetch attempts, while the L2 cache access counter might only be incremented once for the cache line request.
  • The processor may be fetching data that is not allowed to be cached in the L2 cache, such as ROM-resident code. It may not be allowed in the L1 Instruction cache either, so every instruction fetch would miss in the L1 cache (because it is not allowed to be there), then bypass access to the L2 cache (because it is not allowed to be there), then get retrieved directly from memory. (I don’t know of any specific processors that work this way, but it is certainly plausible.)

An example of behavior that could make the L1 miss counter smaller than the L2 access counter: (this is a very common scenario)

  • The L1 instruction cache miss counter might be incremented only once when an instruction fetch misses in the L1 Icache, while the L2 cache might be accessed repeatedly until the data actually arrives in the L2. This is especially common in the case of L2 cache misses — the L1 Icache miss might request data from the L2 dozens of times before it finally arrives from memory.

A Recommended Procedure

Given the many possible detailed meanings of such counters, the procedure I use to understand the counter events is:

  1. Identify the processor in detail.
    This includes vendor, family, model, and stepping.
  2. Determine the precise mapping of PAPI events to underlying hardware events.
    (This is irritatingly difficult on Linux systems that use the “perf-events” subsystem — that is a long topic in itself.)
  3. Look up the detailed descriptions of the hardware events in the vendor processor documentation.
    For AMD, this is the relevant “BIOS and Kernel Developers Guide” for the processor family.
    For Intel, this is Volume 3 of the “Intel 64 and IA-32 Architecture Software Developer’s Guide”.
  4. Check the vendor’s published processor errata to see if there are known bugs associated with the counter events in question.
    For AMD these documents are titled “Revision Guide for the AMD Family [nn] Processors”.
    For Intel these documents are usually given a title including the words “Specification Update”.
  5. Using knowledge of the cache sizes and associativities, build a simple test code whose behavior should be predictable by simple paper-and-pencil analysis.
    The STREAM Benchmark is an example of a code whose data access patterns and floating point operation counts are easy to determine and easy to modify.
  6. Compare the observed performance counter results for the simple test case with the expected results and try to work out a model that bridges between the two.
    The examples of different ways to count given at the beginning of this note should be very helpful in attempting to construct a model.
  7. Decide which counters are “close enough” to be helpful, and which counters cannot be reliably mapped to performance characteristics of interest.
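
As an example of step 5, here is the sort of trivially predictable kernel I have in mind (my own toy, in the spirit of the STREAM benchmark): a read-only sweep over an array far larger than the last-level cache should generate very close to N*8/64 cache line reads from memory (one per 64-Byte line) and exactly N floating-point adds, so the measured counter values can be compared directly against paper-and-pencil expectations.

#include <stdio.h>
#include <stdlib.h>

#define N (64 * 1024 * 1024)   /* 512 MiB of doubles -- far larger than any cache */

int main(void)
{
    double *a = malloc((size_t)N * sizeof(double));
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1.0;

    /* Expected (paper-and-pencil): N adds, ~N*8/64 cache line reads per sweep. */
    double sum = 0.0;
    for (long i = 0; i < N; i++) sum += a[i];

    printf("sum = %f (expected %d)\n", sum, N);
    free(a);
    return 0;
}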

An example of a counter that (probably) cannot be made useful

As an example of the final case — counters that cannot be reliably mapped to performance characteristics of interest — consider the floating point instruction counters on the Intel “Sandy Bridge” processor series. These counters are incremented on instruction *issue*, not on instruction *execution* or instruction *retirement*. If the inputs to the instruction are not “ready” when the instruction is *issued*, the instruction issue will be rejected and the instruction will be re-issued later, and may be re-issued many times before it is finally able to execute. The most common cause for input arguments to not be “ready” is that they are coming from memory and have not arrived in processor registers yet (either explicit load instructions putting data in registers or implicit register loads via memory arguments to the floating-point arithmetic instruction itself).

For a workload with a very low cache miss rate (e.g., DGEMM), the “overcounting” of FP instruction issues relative to the more interesting FP instruction execution or retirement can be as low as a few percent. For a workload with a high cache miss rate (e.g., STREAM), the “overcounting” of FP instructions can be a factor of 4 to 6 (perhaps worse), depending on how many cores are in use and whether the memory accesses are fully localized (on multi-chip platforms). In the absence of detailed information about the processor’s internal algorithm for retrying operations, it seems unlikely that this large overcount can be “corrected” to get an accurate estimate of the number of floating-point operations actually executed or retired. The amount of over-counting will likely depend on at least the following factors:

  • the instruction retry rate (which may depend on how many instructions are available for attempted issue in the processor’s reorder buffer, including whether or not HyperThreading is enabled),
  • the instantaneous frequency of the processor (which can vary from 1.2 GHz to 3.5 GHz on the Xeon E5-2670 “Sandy Bridge” processors in the TACC “Stampede” system),
  • the detailed breakdown of latency for the individual loads (i.e., the average latency may not be good enough if the retry rate is not fixed),
  • the effectiveness of the hardware prefetchers at getting the data into the cache before it is needed (which, in turn, is a function of the number of data streams, the locality of the streams, and the contention at the memory controllers)

There are likely other applicable factors as well — for example the Intel “Sandy Bridge” processors support several mechanisms that allow the power management unit to bias behavior related to the trade-off of performance vs power consumption. One mechanism is referred to as the “performance and energy bias hint”, and is described as a “hint to guide the hardware heuristic of power management features to favor increasing dynamic performance or conserve energy consumption” (Intel 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 14.3.4, Document 325384-047US, June 2013). Another mechanism (apparently only applicable to “Sandy Bridge” systems with integrated graphics units) is a pair of “policy” registers (MSR_PP0_POLICY and MSR_PP1_POLICY) that define the relative priority of the processor cores and the graphics unit in dividing up the chip’s power budget. The specific mechanisms by which these features work, and the detailed algorithms used to control those mechanisms, are not publicly disclosed — but it seems likely that at least some of the mechanisms involved may impact the floating-point instruction retry rate.

Posted in Computer Hardware, Performance, Performance Counters | Comments Off on Notes on the mystery of hardware cache performance counters

Coherence with Cached Memory-Mapped IO

Posted by John D. McCalpin, Ph.D. on 30th May 2013

In response to my previous blog entry, a question was asked about how to manage coherence for cached memory-mapped IO regions.   Here are some more details…

Maintaining Coherence with Cached Memory-Mapped IO

For the “read-only” range, cached copies of MMIO lines will never be invalidated by external traffic, so repeated reads of the data will always return the cached copy.   Since there are no external mechanisms to invalidate the cache line, we need a mechanism that the processor can use to invalidate the line, so the next load to that line will go to the IO device and get fresh data.

There are a number of ways that a processor should be able to invalidate a cached MMIO line.  Not all of these will work on all implementations!

  1. Cached copies of MMIO addresses can, of course, be dropped when they become LRU and are chosen as the victim to be replaced by a new line brought into the cache.
    A code could read enough conflicting cacheable addresses to ensure that the cached MMIO line would be evicted.
    The number of conflicting reads needed is typically 8 (one per way of the 8-way-associative 32 KiB data cache), but you need to be careful that the reads have not been rearranged to put the cached MMIO read in the middle of the “flushing” reads.   There are also some systems for which the pseudo-LRU algorithm has “features” that can break this approach.  (HyperThreading and shared caches can both add complexity in this dimension.)
  2. The CLFLUSH instruction operating on the virtual address of the cached MMIO line should evict it from the L1 and L2 caches.
    Whether it will evict the line from the L3 depends on the implementation, and I don’t have enough information to speculate on whether this will work on Xeon processors.   For AMD Family 10h processors, due to the limitations of the CLFLUSH implementation, cached MMIO lines are only allowed in the L1 cache.
  3. For memory mapped by the MTRRs as WP (“Write Protect”), a store to the address of the cached MMIO line should invalidate that line from the L1 & L2 data caches.  This will generate an *uncached* store, which typically stalls the processor for quite a while, so it is not a preferred solution.
  4. The WBINVD instruction (kernel mode only) will invalidate the *entire* processor data cache structure and according to the Intel Architecture Software Developer’s Guide, Volume 2 (document 325338-044), will also cause all external caches to be flushed.  Additional details are discussed in the SW Developer’s Guide, Volume 3.    Additional caution needs to be taken if running with HyperThreading enabled, as mentioned in the discussion of the CPUID instruction in the SW Developer’s Guide, Vol 2.
  5. The INVD instruction (kernel mode only) will invalidate all the processor caches, but it does this non-coherently (i.e., dirty cache lines are not written back to memory, so any modified data gets lost).   This is very likely to crash your system, and is only mentioned here for completeness.
  6. AMD processors support some extensions to the MTRR mechanism that allow read and write operations to the same physical address to be sent to different places (i.e., one to system memory and the other to MMIO).  This is *almost* useful for supporting cached MMIO, but (at least on the Family 10h processors), the specific mode that I wanted to set up (see addendum below) is disallowed for ugly microarchitectural reasons that I can’t discuss.

There are likely to be more complexities that I am not remembering right now, but the preferred answer is to bind the process doing the cached MMIO to a single core (and single thread context if using HyperThreading) and use CLFLUSH on the address you want to invalidate.   There are no guarantees, but this seems like the approach most likely to work.
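
A minimal sketch of that preferred approach is shown below (the function name and fencing choices are mine, and — as noted above — there is no guarantee that CLFLUSH reaches every cache level on every implementation):

#include <stdint.h>
#include <emmintrin.h>     /* _mm_clflush, _mm_mfence */

/* Invalidate the cached copy of one MMIO line and re-read it from the device.
 * Assumes the calling thread is pinned to a single core (and thread context). */
static inline uint64_t reread_cached_mmio(volatile const uint64_t *mmio_addr)
{
    _mm_mfence();                            /* order prior accesses            */
    _mm_clflush((const void *)mmio_addr);    /* evict the line (L1/L2 at least) */
    _mm_mfence();                            /* make sure the flush completes   */
    return *mmio_addr;                       /* miss -> fresh data from device  */
}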

 

Addendum: The AMD almost-solution using MTRR extensions.

The AMD64 architecture provides extensions to the MTRR mechanism called IORRs that allow the system programmer to independently specify whether reads to a certain region go to system memory or MMIO and whether writes to that region go to system memory or MMIO.   This is discussed in the “AMD64 Architecture Programmers Manual, Volume 2: System Programming” (publication number 24593).
I am using version 3.22 from September 2012, where this is described in section 7.9.

The idea was to use this to modify the behavior of the “read-only” MMIO mapping so that reads would go to MMIO while writes would go to system memory.  At first glance this seems strange — I would be creating a “write-only” region of system memory that could never be read (because reads to that address range would go to MMIO).

So why would this help?

It would help because sending the writes to system memory would cause the cache coherence mechanisms to be activated.   A streaming store (for example) to this region would be sent to the memory controller for that physical address range.  The memory controller treats streaming stores in the same way as DMA stores from IO devices to system memory, and it sends out invalidate messages to all caches in the system.  This would invalidate the cached MMIO line in all caches, which would eliminate both the need to pin the thread to a specific core and the problem of the CLFLUSH not reaching the L3 cache.

At least in the AMD Family 10h processors, this IORR function works, but due to some implementation issues in this particular use case it forces the region to the MTRR UC (uncached) type, which defeats my purpose in the exercise.   I think that the implementation issues could be either fixed or worked around, but since this is a fix to a mode that is not entirely supported, it is easy to understand that this never showed up as a high priority to “fix”.

Posted in Accelerated Computing, Computer Hardware, Linux | Comments Off on Coherence with Cached Memory-Mapped IO