John McCalpin's blog

Dr. Bandwidth explains all….

“Memory directories” in Intel processors

Posted by John D. McCalpin, Ph.D. on 28th August 2023

One of the (many) minimally documented features of recent Intel processor implementations is the “memory directory”.   This is used in multi-socket systems to reduce cache coherence traffic between sockets.

I have referred to this in various presentations as:

“A Memory Directory is one or more bits per cache line in DRAM that tell the processor whether another socket might have a dirty copy of the cache line.”

When asked for references to public information about the Intel’s memory directory implementation, I had to go back and find the various tiny bits of information scattered around different places.   That was boring, so I am posting my notes here so I can find them again — maybe others will find this useful as well….


Existence:

The clearest admission that the memory directory feature exists is in this “technical overview” presentation – in the section “Directory-Based Coherency” – https://www.intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-technical-overview.html. The language is not as precise as I would prefer, but there are actually quite a few interesting details in that section.

History of Implementations:

Intel has disclosed various details in some presentation at the Hot Chips series of conferences:

(I apologize in advance if the links are broken — it is hard to keep up with web site reorganizations — but the presentations should be relatively easy to find using web search services.)

Implicit Information:

Much of the detailed understanding of the memory directory implementation comes from studying the uncore performance counter events that make reference to memory directories.

As an example, there are direct references to memory directories in several sections of the Intel Uncore Performance Monitoring Guides for their various processor families.    These documents are linked from: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

The version that I have spent the most time with is for the 1st and 2nd generations of Intel Xeon Scalable Processors (“Skylake Xeon” and “Cascade Lake Xeon”):  “Intel® Xeon® Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual”, Intel document 336274.

Specific references in this manual include:

  • CHA Events “DIR_LOOKUP” and “DIR_UPDATE”
  • IMC Event
    • IMC_WRITES_COUNT includes: “NOTE: Directory bits are stored in memory. Remote socket RFOs will result in a directory update which, in turn, will cause a write command.”
    • Aside: it would be easy to be misled by this description.  Remote socket RFOs will result in a directory update but so will ordinary reads that return data in E state (the default for data reads that hit unshared lines).
  • M2M Events “DIRECTORY_HIT”, “DIRECTORY_LOOKUP”, “DIRECTORY_MISS”, “DIRECTORY_UPDATE”
    • These include information on directory states – very helpful for understanding the implementation.
  • Section 3.1.3 “Reference for M2M Packet Matching”, Table 3-9 “SMI3 Opcodes”, mentions the directory in the description of 10 of the 18 transaction types.

It is helpful to compare and contrast the documentation from all of the recent processor generations.  The specific items disclosed in each generation are not the same, and sometimes a generation will have a more verbose explanation that fills in gaps for the other generations.

Reading the documentation is seldom enough to understand the implementation — I typically have to create customized microbenchmarks to generate known counts of transactions of various types and then compare the measured performance counter values with the values I expected to generate.   When there are significant differences more thinking is required.  Thinking does not always help — sometimes the performance counter event is just broken, sometimes there is just not enough information disclosed to derive reasonable bounds on the implementation, and sometimes the implementation has strongly dynamic/adaptive behavior based on activity metrics that are not visible or not controllable using available HW features.

OEM Vendor Disclosures:

I have run across a few other bits of information scattered around the interwebs:

Encoding the Memory Directory bits:

In various presentations I have said that the “one or more” memory directory bits were “hidden” in the ECC bits.  This is easy to do – standard 64-bit DRAM interfaces with ECC provide 72 bits of data with every “64-bit” data transfer and generate 8 contiguous transfers per transaction.   64 bits of data requires 8 bits of ECC for SECDED for every transfer.  Computing the ECC over P transfers only requires log2(P) additional bits, while the ECC gives you 8*(P-1) extra bits.  Intel used this aggregation approach very aggressively on the Knights Corner accelerator to store the SECDED ECC bits for 64 Byte cache lines in regular memory – reserving every 32ndcache line for that purpose.


Are there any other important references that I am missing?

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Comments Off on “Memory directories” in Intel processors

Mapping addresses to L3/CHA slices in Intel processors

Posted by John D. McCalpin, Ph.D. on 10th September 2021

Starting with the Xeon E5 processors “Sandy Bridge EP” in 2012, all of Intel’s mainstream multicore server processors have included a distributed L3 cache with distributed coherence processing. The L3 cache is divided into “slices”, which are distributed around the chip — typically one “slice” for each processor core.
Each core’s L1 and L2 caches are local and private, but outside the L2 cache addresses are distributed in a random-looking way across the L3 slices all over the chip.

As an easy case, for the Xeon Gold 6142 processor (1st generation Xeon Scalable Processor with 16 cores and 16 L3 slices), every aligned group of 16 cache line addresses is mapped so that one of those 16 cache lines is assigned to each of the 16 L3 slices, using an undocumented permutation generator. The total number of possible permutations of the L3 slice numbers [0…15] is 16! (almost 21 trillion), but measurements on the hardware show that only 16 unique permutations are actually used. The observed sequences are the “binary permutations” of the sequence [0,1,2,3,…,15]. The “binary permutation” operator can be described in several different ways, but the structure is simple:

  • binary permutation “0” of a sequence is just the original sequence
  • binary permutation “1” of a sequence swaps elements in each even/odd pair, e.g.
    • Binary permutation 1 of [0,1,2,3,4,5,6,7] is [1,0,3,2,5,4,7,6]
  • binary permutation “2” of a sequence swaps pairs of elements in each set of four, e.g.,
    • Binary permutation 2 of [0,1,2,3,4,5,6,7] is [2,3,0,1,6,7,4,5]
  • binary permutation “3” of a sequence performs both permutation “1” and permutation “2”
  • binary permutation “4” of a sequence swaps 4-element halves of each set of 8 element, e.g.,
    • Binary permutation 4 of [0,1,2,3,4,5,6,7] is [4,5,6,7,0,1,2,3]

Binary permutation operators are very cheap to implement in hardware, but are limited to sequence lengths that are a power of 2.  When the number of slices is not a power of 2, using binary permutations requires that you create a power-of-2-length sequence that is bigger than the number of slices, and which contains each slice number approximately an equal number of times.  As an example, the Xeon Scalable Processors (gen1 and gen2) with 24 L3 slices use a 512 element sequence that contains each of the values 0…16 21 times each and each of the values 17…23 22 times each.   This almost-uniform “base sequence” is then permuted using the 512 binary permutations that are possible for a 512-element sequence.

Intel does not publish the length of the base sequences, the values in the base sequences, or the formulas used to determine which binary permutation of the base sequence will be used for any particular address in memory.

Over the years, a few folks have investigated the properties of these mappings, and have published a small number of results — typically for smaller L3 slice counts.

Today I am happy to announce the availability of the full base sequences and the binary permutation selector equations for many Intel processors.  The set of systems includes:

  • Xeon Scalable Processors (gen1 “Skylake Xeon” and gen2 “Cascade Lake Xeon”) with 14, 16, 18, 20, 22, 24, 26, 28 L3 slices
  • Xeon Scalable Processors (gen3 “Ice Lake Xeon”) with 28 L3 slices
  • Xeon Phi x200 Processors (“Knights Landing”) with 38 Snoop Filter slices

The results for the Xeon Scalable Processors (all generations) are based on my own measurements.  The results for the Xeon Phi x200 are based on the mappings published by Kommrusch, et al. (e.g., https://arxiv.org/abs/2011.05422), but re-interpreted in terms of the “base sequence” plus “binary permutation” model used for the other processors.

The technical report and data files (base sequences and permutation selector masks for each processor) are available at https://hdl.handle.net/2152/87595

Have fun!

Next up — using these address to L3/CHA mapping results to understand observed L3 and Snoop Filter conflicts in these processors….

 

Posted in Cache Coherence Implementations, Computer Architecture | Comments Off on Mapping addresses to L3/CHA slices in Intel processors

Intel’s future “CLDEMOTE” instruction

Posted by John D. McCalpin, Ph.D. on 18th February 2019

I recently saw a reference to a future Intel “Atom” core called “Tremont” and ran across an interesting new instruction, “CLDEMOTE”, that will be supported in “Future Tremont and later” microarchitectures (ref: “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference”, document 319433-035, October 2018).

The “CLDEMOTE” instruction is a “hint” to the hardware that it might help performance to move a cache line from the cache level(s) closest to the core to a cache level that is further from the core.

What might such a hint be good for?   There are two obvious use cases:

  • Temporal Locality Control:  The cache line is expected to be re-used, but not so soon that it should remain in the closest/smallest cache.
  • Cache-to-Cache Intervention Optimization:  The cache line is expected to be accessed soon by a different core, and cache-to-cache interventions may be faster if the data is not in the closest level(s) of cache.
    • Intel’s instruction description mentions this use case explicitly.

If you are not a “cache hint instruction” enthusiast, this may not seem like a big deal, but it actually represents a relatively important shift in instruction design philosophy.

Instructions that directly pertain to caching can be grouped into three categories:

  1. Mandatory Control
    • The specified cache transaction must take place to guarantee correctness.
    • E.g., In a system with some non-volatile memory, a processor must have a way to guarantee that dirty data has been written from the (volatile) caches to the non-volatile memory.    The Intel CLWB instruction was added for this purpose — see https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction
  2. “Direct” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of specific transactions on a model architecture (with caveats that an implementation might do something different).
    • E.g., Intel’s PREFETCHW instruction requests that the cache line be loaded in anticipation of a store to the line.
      • This allows the cache line to be brought into the processor in advance of the store.
      • More importantly, it also allows the cache coherence transactions associated with obtaining exclusive access to the cache line to be started in advance of the store.
  3. “Indirect” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of the semantics of the program, not in terms of specific cache transactions (though specific transactions might be provided as an example).
    • E.g., “Push For Sharing Instruction” (U.S. Patent 8099557) is a hint to the processor that the current process is finished working on a cache line and that another processing core in the same coherence domain is expected to access the cache line next.  The hardware should move the cache line and/or modify its state to minimize the overhead that the other core will incur in accessing this line.
      • I was the lead inventor on this patent, which was filed in 2008 while I was working at AMD.
      • The patent was an attempt to generalize my earlier U.S. Patent 7194587, “Localized Cache Block Flush Instruction”, filed while I was at IBM in 2003.
    • Intel’s CLDEMOTE instruction is clearly very similar to my “Push For Sharing” instruction in both philosophy and intent.

 

Even though I have contributed to several patents on cache control instructions, I have decidedly mixed feelings about the entire approach.  There are several issues at play here:

  • The gap between processor and memory performance continues to increase, making performance more and more sensitive to the effective use of caches.
  • Cache hierarchies are continuing to increase in complexity, making it more difficult to understand what to do for optimal performance — even if precise control were available.
    • This has led to the near-extinction of “bottom-up” performance analysis in computing — first among customers, but also among vendors.
  • The cost of designing processors continues to increase, with caches & coherence playing a significant role in the increased cost of design and validation.
  • Technology trends show that the power associated with data motion (including cache coherence traffic) has come to far exceed the power required by computations, and that the ratio will continue to increase.
    • This does not currently dominate costs (as discussed in the SC16 talk cited above), but that is in large part because the processors have remained expensive!
    • Decreasing processor cost will require simpler designs — this decreases the development cost that must be recovered and simultaneously reduces the barriers to entry into the market for processors (allowing more competition and innovation).

Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power.  More of the same is not an answer — new thinking is required.

If one starts from current technology (rather than the technology of the late 1980’s), one would naturally design architectures to address the primary challenges:

  • “Vertical” movement of data (i.e., “private” data moving up and down the levels of a memory hierarchy) must be explicitly controllable.
  • “Horizontal” movement of data (e.g., “shared” data used to communicate between processing elements) must be explicitly controllable.

Continuing to apply “band-aids” to the transparent caching architecture of the 1980’s will not help move the industry toward the next disruptive innovation.

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Comments Off on Intel’s future “CLDEMOTE” instruction

SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Posted by John D. McCalpin, Ph.D. on 7th January 2019

Here are the annotated slides from my SC18 presentation on Snoop Filter Conflicts that cause performance variability in HPL and DGEMM on the Xeon Platinum 8160 processor.

This slide presentation includes data (not included in the paper) showing that Snoop Filter Conflicts occur in all Intel Scalable Processors (a.k.a., “Skylake Xeon”) with 18 cores or more, and also occurs on the Xeon Phi x200 processors (“Knights Landing”).

The published paper is available (with ACM subscription) at https://dl.acm.org/citation.cfm?id=3291680


This is less boring than it sounds!


A more exciting version of the title.


This story is very abridged — please read the paper!



Execution times only — no performance counters yet.

500 nodes tested, but only 392 nodes had the 7 runs needed for a good computation of the median performance.

Dozens of different nodes showed slowdowns of greater than 5%.


I measured memory bandwidth first simply because I had the tools to do this easily.
Read memory controller performance counters before and after each execution and compute DRAM traffic.
Write traffic was almost constant — only the read traffic showed significant variability.


It is important to decouple the sockets for several reasons.  (1) Each socket manages its frequency independently to remain within the Running Average Power Limit. (2) Cache coherence is managed differently within and between sockets.
The performance counter infrastructure is at https://github.com/jdmccalpin/periodic-performance-counters
Over 25,000 DGEMM runs in total, generating over 240 GiB of performance counter output.


I already saw that slow runs were associated with higher DRAM traffic, but needed to find out which level(s) of the cache were experience extra load misses.
The strongest correlation between execution time and cache miss counts was with L2 misses (measured here as L2 cache fills).

The variation of L2 fills for the full-speed runs is surprisingly large, but the slow runs all have L2 fill counts that are at least 1.5x the minimum value.
Some runs tolerate increased L2 fill counts up to 2x the minimum value, but all cases with >2x L2 fills are slow.

This chart looks at the sum of L2 fills for all the cores on the chip — next I will look at whether these misses are uniform across the cores.


I picked 15-20 cases in which a “good” trial (at or above median performance) was followed immediately by a “slow” trial (at least 20% below median performance).
This shows the L2 Fills by core for a “good” trial — the red dashed line corresponds to the minimum L2 fill count from the previous chart divided by 24 to get the minimum per-core value.
Different sets of cores and different numbers of cores had high counts in each run — even on the same node.


This adds the “slow” execution that immediately followed the “good” execution.
For the slow runs, most of cores had highly elevated L2 counts.  Again, different sets of cores and different numbers of cores had high counts in each run.

This data provides a critical clue:  Since L2 caches are private and 2MiB caches fully control the L2 cache index bits, the extra L2 cache misses must be caused by something *external* to the cores.


The Snoop Filter is essentially the same as the directory tags of the inclusive L3 cache of previous Intel Xeon processors, but without room to store the data for the cache lines tracked.
The key concept is “inclusivity” — lines tracked by a Snoop Filter entry must be invalidated before that Snoop Filter entry can be freed to track a different cache line address.


I initially found some poorly documented core counters that looked like they were related to Snoop Filter evictions, then later found counters in the “uncore” that count Snoop Filter evictions directly.
This allowed direct confirmation of my hypothesis, as summarized in the next slides.


About 1% of the runs are more than 10% slower than the fastest run.


Snoop Filter Evictions clearly account for the majority of the excess L2 fills.

But there is one more feature of the “slow” runs….


For all of the “slow” runs, the DRAM traffic is increased.  This means that a fraction of the data evicted from the L2 caches by the Snoop Filter evictions was also evicted from the L3 cache, and so must be retrieved from DRAM.

At high Snoop Filter conflict rates (>4e10 Snoop Filter evictions), all of the cases have elevated DRAM traffic, with something like 10%-15% of the Snoop Filter evictions missing in the L3 cache.

There are some cases in the range of 100-110 seconds that have elevated snoop filter evictions, but not elevated DRAM reads, that show minimal slowdowns.

This suggests that DGEMM can tolerate the extra latency of L2 miss/L3 hit for its data, but not the extra latency of L2 miss/L3 miss/DRAM hit.


Based on my experience in processor design groups at SGI, IBM, and AMD, I wondered if using contiguous physical addresses might avoid these snoop filter conflicts….


Baseline with 2MiB pages.


With 1GiB pages, the tail almost completely disappears in both width and depth.


Zooming in on the slowest 10% of the runs shows no evidence of systematic slowdowns when using 1GiB pages.
The performance counter data confirms that the snoop filter eviction rate is very small.

So we have a fix for single-socket DGEMM, what about HPL?


Intel provided a test version of their optimized HPL benchmark in December 2017 that supported 1GiB pages.

First, I verified that the performance variability for single-node (2-socket) HPL runs was eliminated by using 1GiB pages.

The variation across nodes is strong (due to different thermal characteristics), but the variation across runs on each node is extremely small.

The 8.6% range of average performance for this set of 31 nodes increases to >12% when considering the full 1736-node SKX partition of the Stampede2 system.

So we have a fix for single-node HPL, what about the full system?


Intel provided a full version of their optimized HPL benchmark in March 2018 and we ran it on the full system in April 2018.

The estimated breakdown of performance improvement into individual contributions is a ballpark estimate — it would be a huge project to measure the details at this scale.

The “practical peak performance” of this system is 8.77 PFLOPS on the KNLs plus 3.73 PFLOPS on the SKX nodes, for 12.5 PFLOPS “practical peak”.  The 10.68 PFLOPS obtained is about 85% of this peak performance.


During the review of the paper, I was able to simplify the test further to allow quick testing on other systems (and larger ensembles).

This is mostly new material (not in the paper).


https://github.com/jdmccalpin/SKX-SF-Conflicts

This lets me more directly address my hypothesis about conflicts with contiguous physical addresses, since each 1GiB page is much larger than the 24 MiB of aggregate L2 cache.


It turns out I was wrong — Snoop Filter Conflicts can occur with contiguous physical addresses on this processor.

The pattern repeats every 256 MiB.

If the re-used space is in the 1st 32 MiB of any 1GiB page, there will be no Snoop Filter Conflicts.

What about other processors?


I tested Skylake Xeon processors with 14, 16, 18, 20, 22, 24, 26, 28 cores, and a 68-core KNL (Xeon Phi 7250).

These four processors are the only ones that show Snoop Filter Conflicts with contiguous physical addresses.

But with random 2MiB pages, all processors with more than 16 cores show Snoop Filter conflicts for some combinations of addresses…..


These are average L2 miss rates — individual cores can have significantly higher miss rates (and the maximum miss rate may be the controlling factor in performance for multi-threaded codes).

The details are interesting, but no time in the current presentation….



Overall, the uncertainty associated with this performance variability is probably more important than the performance loss.

Using performance counter measurements to look for codes that are subject to this issue is a serious “needle in haystack” problem — it is probably easier to choose codes that might have the properties above and test them explicitly.


Cache-contained shallow water model, cache-contained FFTs.


The new DGEMM implementation uses dynamic scheduling of the block updates to decouple the memory access patterns.  There is no guarantee that this will alleviate the Snoop Filter Conflict problem, but in this case it does.


I now have a model that predicts all Snoop Filter Conflicts involving 2MiB pages on the 24-core SKX processors.
Unfortunately, the zonesort approach won’t work under Linux because pages are allocated on a “first-come, first-served” basis, so the fine control required is not possible.

An OS with support for page coloring (such as BSD) could me modified to provide this mitigation.


Again, the inability of Linux to use the virtual address as a criterion for selecting the physical page to use will prevent any sorting-based approach from working.


Intel has prepared a response.  If you are interested, you should ask your Intel representative for a copy.





 

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture, Linux, Performance Counters | Comments Off on SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Notes on “non-temporal” (aka “streaming”) stores

Posted by John D. McCalpin, Ph.D. on 1st January 2018

Memory systems using caches have a lot more potential flexibility than most implementations are able to exploit – you get the standard behavior all the time, even if an alternative behavior would be allowable and desirable in a specific circumstance.  One area in which many vendors have provided an alternative to the standard behavior is in “non-temporal” or “streaming” stores. These are sometimes also referred to as “cache-bypassing” or “non-allocating” stores, and even “Non-Globally-Ordered stores” in some cases.

The availability of streaming stores on various processors can often be tied to the improvements in performance that they provide on the STREAM benchmark, but there are several approaches to implementing this functionality, with some subtleties that may not be obvious.  (The underlying mechanism of write-combining was developed long before, and for other reasons, but exposing it to the user in standard cached memory space required the motivation provided by a highly visible benchmark.)

Before getting into details, it is helpful to review the standard behavior of a “typical” cached system for store operations.  Most recent systems use a “write-allocate” policy — when a store misses the cache, the cache line containing the store’s target address is read into the cache, then the parts of the line that are receiving the new data are updated.   These “write-allocate” cache policies have several advantages (mostly related to implementation correctness), but if the code overwrites all the data in the cache line, then reading the cache line from memory could be considered an unnecessary performance overhead.    It is this extra overhead of read transactions that makes the subject of streaming stores important for the STREAM benchmark.

As in many subjects, a lot of mistakes can be avoided by looking carefully at what the various terms mean — considering both similarities and differences.

  • “Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”).
    • So there is no benefit to keeping the data in the processor’s cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).
  • “Streaming store” is suggestive of stores to contiguous addresses (i.e., high “spatial locality”).
  • “Cache-bypassing store” says that at least some aspects of the transaction bypass the cache(s).
  • “Non-allocating store” says that a store that misses in a cache will not load the corresponding cache line into the cache before performing the store.
    • This implies that there is some other sort of structure to hold the data from the stores until it is sent to memory.  This may be called a “store buffer”, a “write buffer”, or a “write-combining buffer”.
  • “Non-globally-ordered store” says that the results of an NGO store might appear in a different order (relative to ordinary stores or to other NGO stores) on other processors.
    • All stores must appear in program order on the processor performing the stores, but may only need to appear in the same order on other processors in some special cases, such as stores related to interprocessor communication and synchronization.

There are at least three issues raised by these terms:

  1. Caching: Do we want the result of the store to displace something else from the cache?
  2. Allocating: Can we tolerate the extra memory traffic incurred by reading the cache line before overwriting it?
  3. Ordering: Do the results of this store need to appear in order with respect to other stores?

Different applications may have very different requirements with respect to these issues and may be best served by different implementations.   For example, if the priority is to prevent the stored data from displacing other data in the processor cache, then it may suffice to put the data in the cache, but mark it as Least-Recently-Used, so that it will displace as little useful data as possible.   In the case of the STREAM benchmark, the usual priority is the question of allocation — we want to avoid reading the data before over-writing it.

While I am not quite apologizing for that, it is clear that STREAM over-represents stores in general, and store misses in particular, relative to the high-bandwidth applications that I intended it to represent.

To understand the benefit of non-temporal stores, you need to understand if you are operating in (A) a concurrency-limited regime or (B) a bandwidth-limited regime. (This is not a short story, but you can learn a lot about how real systems work from studying it.)

(Note: The performance numbers below are from the original private version of this post created on 2015-05-29.   Newer systems require more concurrency — numbers will be updated when I have a few minutes to dig them up….)

(A) For a single thread you are almost always working in a concurrency-limited regime. For example, my Intel Xeon E5-2680 (Sandy Bridge EP) dual-socket systems have an idle memory latency to local memory of 79 ns and a peak DRAM bandwidth of 51.2 GB/s (per socket). If we assume that some cache miss handling buffer must be occupied for approximately one latency per transaction, then queuing theory dictates that you must maintain 79 ns * 51.2 GB/s = 4045 Bytes “in flight” at all times to “fill the memory pipeline” or “tolerate the latency”. This rounds up to 64 cache lines in flight, while a single core only supports 10 concurrent L1 Data Cache misses. In the absence of other mechanisms to move data, this limits a single thread to a read bandwidth of 10 lines * 64 Bytes/line / 79 ns = 8.1 GB/s.

Store misses are essentially the same as load misses in this analysis.

L1 hardware prefetchers don’t help performance here because they share the same 10 L1 cache miss buffers. L2 hardware prefetchers do help because bringing the data closer reduces the occupancy required by each cache miss transaction. Unfortunately, L2 hardware prefetchers are not directly controllable, so experimentation is required. The best read bandwidth that I have been able to achieve on this system is about 17.8 GB/s, which corresponds to an “effective concurrency” of about 22 cache lines or an “effective latency” of about 36 ns (for each of the 10 concurrent L1 misses).

L2 hardware prefetchers on this system are also able to perform prefetches for store miss streams, thus reducing the occupancy for store misses and increasing the store miss bandwidth.

For non-temporal stores there is no concept corresponding to “prefetch”, so you are stuck with whatever buffer occupancy the hardware gives you. Note that since the non-temporal store does not bring data *to* the processor, there is no reason for its occupancy to be tied to the memory latency. One would expect a cache miss buffer holding a non-temporal store to have a significantly lower latency than a cache miss, since it is allocated, filled, and then transfers its contents out to the memory controller (which presumably has many more buffers than the 10 cache miss buffers that the core supports).

But, while non-temporal stores are expected to have a shorter buffer occupancy than that of a cache miss that goes all the way to memory, it is not at all clear whether they will have a shorter buffer occupancy than a store misses that activates an L2 hardware prefetch engine. It turns out that on the Xeon E5-2680 (Sandy Bridge EP), non-temporal stores have *higher* occupancy than store misses that activate the L2 hardware prefetcher, so using non-temporal stores slows down the performance of each of the four STREAM kernels. I don’t have all the detailed numbers in front of me, but IIRC, STREAM Triad runs at about 10 GB/s with one thread on a Xeon E5-2680 when using non-temporal stores and at between 12-14 GB/s when *not* using non-temporal stores.

This result is not a general rule. On the Intel Xeon E3-1270 (also a Sandy Bridge core, but with the “client” L3 and memory controller), the occupancy of non-temporal stores in the L1 cache miss buffers appears to be much shorter, so there is not an occupancy-induced slowdown. On the older AMD processors (K8 and Family 10h), non-temporal stores used a set of four “write-combining registers” that were independent of the eight buffers used for L1 data cache misses. In this case the use of non-temporal stores allowed one thread to have more concurrent
memory transactions, so it almost always helped performance.

(B) STREAM is typically run using as many cores as is necessary to achieve the maximum sustainable memory bandwidth. In this bandwidth-limited regime non-temporal stores play a very different role. When using all the cores there are no longer concurrency limits, but there is a big difference in bulk memory traffic. Store misses must read the target lines into the L1 cache before overwriting it, while non-temporal stores avoid this (semantically unnecessary) load from memory. Thus for the STREAM Copy and Scale kernels, using cached stores results in two memory reads and one memory write, while using non-temporal stores requires only one memory read and one memory write — a 3:2 ratio. Similarly, the STREAM Add and Triad kernels transfer 4:3 as much data when using cached stores compared to non-temporal stores.

On most systems the reported performance ratios for STREAM (using all processors) with and without non-temporal stores are very close to these 3:2 and 4:3 ratios. It is also typically the case that STREAM results (again using all processors) are about the same for all four kernels when non-temporal stores are used, while (due to the differing ratios of extra read traffic) the Add and Triad kernels are typically ~1/3 faster on systems that use cached stores.

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture, Computer Hardware, Reference | Comments Off on Notes on “non-temporal” (aka “streaming”) stores

Invited Talk at SuperComputing 2016!

Posted by John D. McCalpin, Ph.D. on 16th October 2016

“Memory Bandwidth and System Balance in HPC Systems”

If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic).   The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.

The official announcement:

SC16 Invited Talk Spotlight: Dr. John D. McCalpin Presents “Memory Bandwidth and System Balance in HPC Systems”

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Invited Talk at SuperComputing 2016!

Memory Bandwidth Requirements of the HPL benchmark

Posted by John D. McCalpin, Ph.D. on 11th September 2014

The High Performance LINPACK (HPL) benchmark is well known for delivering a high fraction of peak floating-point performance. The (historically) excellent scaling of performance as the number of processors is increased and as the frequency is increased suggests that memory bandwidth has not been a performance limiter.

But this does not mean that memory bandwidth will *never* be a performance limiter. The algorithms used by HPL have lots of data re-use (both in registers and from the caches), but the data still has to go to and from memory, so the bandwidth requirement is not zero, which means that at some point in scaling the number of cores or frequency or FP operations per cycle, we are going to run out of the available memory bandwidth. The question naturally arises: “Are we (almost) there yet?”

Using Intel’s optimized HPL implementation, a medium-sized (N=18000) 8-core (single socket) HPL run on a Stampede compute node (3.1 GHz, 8 cores/chip, 8 FP ops/cycle) showed about 15 GB/s sustained memory bandwidth at about 165 GFLOPS. This level of bandwidth utilization should be no trouble at all (even when running on two sockets), given the 51.2 GB/s peak memory bandwidth (~38 GB/s sustainable) on each socket.

But if we scale this to the peak performance of a new Haswell EP processor (e.g., 2.6 GHz, 12 cores/chip, 16 FP ops/cycle), it suggests that we will need about 40 GB/s of memory bandwidth for a single-socket HPL run and about 80 GB/s of memory bandwidth for a 2-socket run. A single Haswell chip can only deliver about 60 GB/s sustained memory bandwidth, so the latter value is a problem, and it means that we expect LINPACK on a 2-socket Haswell system to require attention to memory placement.

A colleague here at TACC ran into this while testing on a 2-socket Haswell EP system. Running in the default mode showed poor scaling beyond one socket. Running the same binary under “numactl —interleave=0,1” eliminated most (but not all) of the scaling issues. I will need to look at the numbers in more detail, but it looks like the required chip-to-chip bandwidth (when using interleaved memory) may be slightly higher than what the QPI interface can sustain.

Just another reminder that overheads that are “negligible” may not stay that way in the presence of exponential performance growth.

Posted in Algorithms, Performance | Comments Off on Memory Bandwidth Requirements of the HPL benchmark

Notes on the mystery of hardware cache performance counters

Posted by John D. McCalpin, Ph.D. on 14th July 2013


In response to a question on the PAPI mailing list, I scribbled some notes to try to help users understand the complexity of hardware performance counters for cache accesses and cache misses, and thought they might be helpful here….


For any interpretation of specific hardware performance counter events, it is absolutely essential to precisely specify the processor that you are using.

Cautionary Notes

Although it may not make a lot of sense, the meanings of “cache miss” and “cache access” are almost always quite different across different vendors’ CPUs, and can be quite different for different CPUs from the same vendor. It is actually rather *uncommon* for L1 cache misses to match L2 cache accesses, for a variety of reasons that are difficult to summarize concisely.

Some examples of behavior that could make the L1 miss counter larger than the L2 access counter:

  • If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch.
  • L1 caches (both data and instruction) typically have hardware prefetch engines. The L1 Icache miss counter may only be incremented when the instruction fetcher requests data that is not found in the L1 Icache, while the L2 cache access counter may be incremented every time the L2 receives either an L1 Icache miss or an L1 Icache prefetch.
  • The processor may attempt multiple instruction fetches of different addresses in the same cache line. The L1 Icache miss event might be incremented on each of these fetch attempts, while the L2 cache access counter might only be incremented once for the cache line request.
  • The processor may be fetching data that is not allowed to be cached in the L2 cache, such as ROM-resident code. It may not be allowed in the L1 Instruction cache either, so every instruction fetch would miss in the L1 cache (because it is not allowed to be there), then bypass access to the L2 cache (because it is not allowed to be there), then get retrieved directly from memory. (I don’t know of any specific processors that work this way, but it is certainly plausible.)

An example of behavior that could make the L1 miss counter smaller than the L2 access counter: (this is a very common scenario)

  • The L1 instruction cache miss counter might be incremented only once when an instruction fetch misses in the L1 Icache, while the L2 cache might be accessed repeatedly until the data actually arrives in the L2. This is especially common in the case of L2 cache misses — the L1 Icache miss might request data from the L2 dozens of times before it finally arrives from memory.

A Recommended Procedure

Given the many possible detailed meanings of such counters, the procedure I use to understand the counter events is:

  1. Identify the processor in detail.
    This includes vendor, family, model, and stepping.
  2. Determine the precise mapping of PAPI events to underlying hardware events.
    (This is irritatingly difficult on Linux systems that use the “perf-events” subsystem — that is a long topic in itself.)
  3. Look up the detailed descriptions of the hardware events in the vendor processor documentation.
    For AMD, this is the relevant “BIOS and Kernel Developers Guide” for the processor family.
    For Intel, this Volume 3 of the “Intel 64 and IA-32 Architecture Software Developer’s Guide”.
  4. Check the vendor’s published processor errata to see if there are known bugs associated with the counter events in question.
    For AMD these documents are titled “Revision Guide for the AMD Family [nn] Processors”.
    For Intel these documents are usually given a title including the words “Specification Update”.
  5. Using knowledge of the cache sizes and associativities, build a simple test code whose behavior should be predictable by simple paper-and-pencil analysis.
    The STREAM Benchmark is an example of a code whose data access patterns and floating point operation counts are easy to determine and easy to modify.
  6. Compare the observed performance counter results for the simple test case with the expected results and try to work out a model that bridges between the two.
    The examples of different ways to count given at the beginning of this note should be very helpful in attempting to construct a model.
  7. Decide which counters are “close enough” to be helpful, and which counters cannot be reliably mapped to performance characteristics of interest.

An example of a counter that (probably) cannot be made useful

As an example of the final case — counters that cannot be reliably mapped to performance characteristics of interest — consider the floating point instruction counters on the Intel “Sandy Bridge” processor series. These counters are incremented on instruction *issue*, not on instruction *execution* or instruction *retirement*. If the inputs to the instruction are not “ready” when the instruction is *issued*, the instruction issue will be rejected and the instruction will be re-issued later, and may be re-issued many times before it is finally able to execute. The most common cause for input arguments to not be “ready” is that they are coming from memory and have not arrived in processor registers yet (either explicit load instructions putting data in registers or implicit register loads via memory arguments to the floating-point arithmetic instruction itself).

For a workload with a very low cache miss rate (e.g., DGEMM), the “overcounting” of FP instruction issues relative to the more interesting FP instruction execution or retirement can be as low as a few percent. For a workload with a high cache miss rate (e.g., STREAM), the “overcounting” of FP instructions can be a factor of 4 to 6 (perhaps worse), depending on how many cores are in use and whether the memory accesses are fully localized (on multi-chip platforms). In the absence of detailed information about the processor’s internal algorithm for retrying operations, it seems unlikely that this large overcount can be “corrected” to get an accurate estimate of the number of floating-point operations actually executed or retired. The amount of over-counting will likely depend on at least the following factors:

  • the instruction retry rate (which may depend on how many instructions are available for attempted issue in the processor’s reorder buffer, including whether or not HyperThreading is enabled),
  • the instantaneous frequency of the processor (which can vary from 1.2 GHz to 3.5 GHz on the Xeon E5-2670 “Sandy Bridge” processors in the TACC “Stampede” system),
  • the detailed breakdown of latency for the individual loads (i.e., the average latency may not be good enough if the retry rate is not fixed),
  • the effectiveness of the hardware prefetchers at getting the data into the data before it is needed (which, in turn, is a function of the number of data streams, the locality of the streams, the contention at the memory controllers)

There are likely other applicable factors as well — for example the Intel “Sandy Bridge” processors support several mechanisms that allow the power management unit to bias behavior related to the trade-off of performance vs power consumption. One mechanism is referred to as the “performance and energy bias hint”, and is described as as a “hint to guide the hardware heuristic of power management features to favor increasing dynamic performance or conserve energy consumption” (Intel 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 14.3.4, Document 325384-047US, June 2013). Another mechanism (apparently only applicable to “Sandy Bridge” systems with integrated graphics units) is a pair of “policy” registers (MSR_PP0_POLICY and MSR_PP1_POLICY) that define the relative priority of the processor cores and the graphics unit in dividing up the chip’s power budget. The specific mechanisms by which these features work, and the detailed algorithms used to control those mechanisms, are not publicly disclosed — but it seems likely that at least some of the mechanisms involved may impact the floating-point instruction retry rate.

Posted in Computer Hardware, Performance, Performance Counters | Comments Off on Notes on the mystery of hardware cache performance counters

Notes on Cached Access to Memory-Mapped IO Regions

Posted by John D. McCalpin, Ph.D. on 29th May 2013

When attempting to build heterogeneous computers with “accelerators” or “coprocessors” on PCIe interfaces, one quickly runs into asymmetries between the data transfer capabilities of processors and IO devices.  These asymmetries are often surprising — the tremendously complex processor is actually less capable of generating precisely controlled high-performance IO transactions than the simpler IO device.   This leads to ugly, high-latency implementations in which the processor has to program the IO unit to perform the required DMA transfers and then interrupt the processor when the transfers are complete.

For tightly-coupled acceleration, it would be nice to have the option of having the processor directly read and write to memory locations on the IO device.  The fundamental capability exists in all modern processors through the feature called “Memory-Mapped IO” (MMIO), but for historical reasons this provides the desired functionality without the desired performance.   As discussed below, it is generally possible to set up an MMIO mapping that allows high-performance writes to IO space, but setting up mappings that allow high-performance reads from IO space is much more problematic.

Processors only support high-performance reads when executing loads to cached address ranges.   Such reads transfer data in cache-line-sized blocks (64 Bytes on x86 architectures) and can support multiple concurrent read transactions for high throughput.  When executing loads to uncached address ranges (such as MMIO ranges), each read fetches only the specific bits requested (1, 2, 4, or 8 Bytes), and all reads to uncached address ranges are completely serialized with respect to each other and with respect to any other memory references.   So even if the latency to the IO device were the same as the latency to memory, using cache-line accesses could easily be (for example) 64 times as fast as using uncached accesses — 8 concurrent transfers of 64 Bytes using cache-line accesses versus one serialized transfer of 8 Bytes.

But is it possible to get modern processors to use their cache-line access mechanisms to read data from MMIO addresses?   The answer is a resounding, “yes, but….“.    The notes below provide an introduction to some of the issues….

It is possible to map IO devices to cacheable memory on at least some processors, but the accesses have to be very carefully controlled to keep within the capabilities of the hardware — some of the transactions to cacheable memory can map to IO transactions and some cannot.
I don’t know the details for Intel processors, but I did go through all the combinations in great detail as the technology lead of the “Torrenza” project at AMD.

Speaking generically, some examples of things that should and should not work (though the details will depend on the implementation):

  • Load miss — generates a cache line read — converted to a 64 Byte IO read — works OK.
    BUT, there is no way for the IO device to invalidate that line in the processor(s) cache(s), so coherence must be maintained manually using the CLFLUSH instruction. NOTE also that the CLFLUSH instruction may or may not work as expected when applied to addresses that are mapped to MMIO, since the coherence engines are typically associated with the memory controllers, not the IO controllers. At the very least you will need to pin threads doing cached MMIO to a single core to maximize the chances that the CLFLUSH instructions will actually clear the (potentially stale) copies of the cache lines mapped to the MMIO range.
  • Streaming Store (aka Write-Combining store, aka Non-temporal store) — generates one or more uncached stores — works OK.
    This is the only mode that is “officially” supported for MMIO ranges by x86 and x86-64 processors. It was added in the olden days to allow a processor core to execute high-speed stores into a graphics frame buffer (i.e., before there was a separate graphics processor). These stores do not use the caches, but do allow you to write to the MMIO range using full cache line writes and (typically) allows multiple concurrent stores in flight.
    The Linux “ioremap_wc” maps a region so that all stores are translated to streaming stores, but because the hardware allows this, it is typically possible to explicitly generate streaming stores (MOVNTA instructions) for MMIO regions that are mapped as cached.
  • Store Miss (aka “Read For Ownership”/RFO) — generates a request for exclusive access to a cache line — probably won’t work.
    The reason that it probably won’t work is that RFO requires that the line be invalidated in all the other caches, with the requesting core not allowed to use the data until it receives acknowledgements from all the other cores that the line has been invalidated — but an IO controller is not a coherence controller, so it (typically) cannot generate the required probe/snoop transactions.
    It is possible to imagine implementations that would convert this transaction to an ordinary 64 Byte IO read, but then some component of the system would have to “remember” that this translation took place and would have to lie to the core and tell it that all the other cores had responded with invalidate acknowledgements, so that the core could place the line in “M” state and have permission to write to it.
  • Victim Writeback — writes back a dirty line from cache to memory — probably won’t work.
    Assuming that you could get past the problems with the “store miss” and get the line in “M” state in the cache, eventually the cache will need to evict the dirty line. Although this superficially resembles a 64 Byte store, from the coherence perspective it is quite a different transaction. A Victim Writeback actually has no coherence implications — all of the coherence was handled by the RFO up front, and the Victim Writeback is just the delayed completion of that operation. Again, it is possible to imagine an implementation that simply mapped the Victim Writeback to a 64 Byte IO store, but when you get into the details there are features that just don’t fit. I don’t know of any processor implementation for which a mapping of Victim Writeback operations to MMIO space is supported.

There is one set of mappings that can be made to work on at least some x86-64 processors, and it is based on mapping the MMIO space *twice*, with one mapping used only for reads and the other mapping used only for writes:

  • Map the MMIO range with a set of attributes that allow write-combining stores (but only uncached reads). This mode is supported by x86-64 processors and is provided by the Linux “ioremap_wc()” kernel function, which generates an MTRR (“Memory Type Range Register”) of “WC” (write-combining).  In this case all stores are converted to write-combining stores, but the use of explicit write-combining store instructions (MOVNTA and its relatives) makes the usage more clear.
  • Map the MMIO range a second time with a set of attributes that allow cache-line reads (but only uncached, non-write-combined stores).
    For x86 & x86-64 processors, the MTRR type(s) that allow this are “Write-Through” (WT) and “Write-Protect” (WP).
    These might be mapped to the same behavior internally, but the nominal difference is that in WT mode stores *update* the corresponding line if it happens to be in the cache, while in WP mode stores *invalidate* the corresponding line if it happens to be in the cache. In our current application it does not matter, since we will not be executing any stores to this region. On the other hand, we will need to execute CLFLUSH operations to this region, since that is the only way to ensure that (potentially) stale cache lines are removed from the cache and that the subsequent read operation to a line actually goes to the MMIO-mapped device and reads fresh data.

On the particular device that I am fiddling with now, the *device* exports two address ranges using the PCIe BAR functionality. These both map to the same memory locations on the device, but each BAR is mapped to a different *physical* address by the Linux kernel. The different *physical* addresses allow the MTRRs to be set differently (WC for the write range and WT/WP for the read range). These are also mapped to different *virtual* addresses so that the PATs can be set up with values that are consistent with the MTRRs.

Because the IO device has no way to generate transactions to invalidate copies of MMIO-mapped addresses in processor caches, it is the responsibility of the software to ensure that cache lines in the “read” region are invalidated (using the CLFLUSH instruction on x86) if the data is updated either by the IO device or by writes to the corresponding (aliased) address in the “write” region.   This software based coherence functionality can be implemented at many different levels of complexity, for example:

  • For some applications the data access patterns are based on clear “phases”, so in a “phase” you can leave the data in the cache and simply invalidate the entire block of cached MMIO addresses at the end of the “phase”.
  • If you expect only a small fraction of the MMIO addresses to actually be updated during a phase, this approach is overly conservative and will lead to excessive read traffic.  In such a case, a simple “directory-based coherence” mechanism can be used.  The IO device can keep a bit map of the cache-line-sized addresses that are modified during a “phase”.  The processor can read this bit map (presumably packed into a single cache line by the IO device) and only invalidate the specific cache lines that the directory indicates have been updated.   Lines that have not been updated are still valid, so copies that stay in the processor cache will be safe to use.

Giving the processor the capability of reading from an IO device at low latency and high throughput allows a designer to think about interacting with the device in new ways, and should open up new possibilities for fine-grained off-loading in heterogeneous systems….

 

Posted in Accelerated Computing, Computer Hardware, Linux | Comments Off on Notes on Cached Access to Memory-Mapped IO Regions