John McCalpin's blog

Dr. Bandwidth explains all….

The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Posted by John D. McCalpin, Ph.D. on April 2, 2020

This was a keynote presentation at the “2nd International Workshop on Performance Modeling: Methods and Applications” (PMMA16), June 23, 2016, Frankfurt, Germany (in conjunction with ISC16).

The presentation discusses a family of simple performance models that I developed over the last 20 years — originally in support of processor and system design at SGI (1996-1999), IBM (1999-2005), and AMD (2006-2008), but more recently in support of system procurements at The Texas Advanced Computing Center (TACC) (2009-present).

More notes interspersed below….


Most of TACC’s supercomputer systems are national resources, open to (unclassified) scientific research in all areas. We have over 5,000 direct users (logging into the systems and running jobs) and tens of thousands of indirect users (who access TACC resources via web portals). With a staff of slightly over 175 full-time employees (less than 1/2 in consulting roles), we must therefore focus on highly-leveraged performance analysis projects, rather than labor-intensive ones.

An earlier presentation on this topic (including extensions of the method to incorporate cost modeling) is from 2007: “System Performance Balance, System Cost Balance, Application Balance, & the SPEC CPU2000/CPU2006 Benchmarks” (invited presentation at the SPEC Benchmarking Joint US/Europe Colloquium, June 22, 2007, Dresden, Germany.

This data is from the 2007 presentation. All of the SPECfp_rate2000 results were downloaded from, the results were sorted by processor type, and “peak floating-point operations per cycle” was manually added for each processor type. This includes all architectures, all compilers, all operating systems, and all system configurations. It is not surprising that there is a lot of scatter, but the factor of four range in Peak MFLOPS at fixed SPECfp_rate2000/core and the factor of four range in SPECfp_rate2000/core at fixed Peak MFLOPS was higher than I expected….

(Also from the 2007 presentation.) To show that I can criticize my own work as well, here I show that sustained memory bandwidth (using an approximation to the STREAM Benchmark) is also inadequate as a single figure of metric. (It is better than peak MFLOPS, but still has roughly a factor of three range when projecting in either direction.)

Here I assumed a particular analytical function for the amount of memory traffic as a function of cache size to scale the bandwidth time.
Details are not particularly important since I am trying to model something that is a geometric mean of 14 individual values and the results are across many architectures and compilers.
Doing separate models for the 14 benchmarks does not reduce the variance much further – there is about 15% that remains unexplainable in such a broad dataset.

The model can provide much better fit to the data if the HW and SW are restricted, as we will see in the next section…

Why no overlap? The model actually includes some kinds of overlap — this will be discussed in the context of specific models below — an can be extended to include overlap between components. The specific models and results that will be presented here fit the data better when it is assumed that there is no overlap between components. Bounds on overlap are discussed near the end of the presentation, in the slides titled “Analysis”.

The approach is opportunistic. When I started this work over 20 years ago, most of the parameters I was varying could only be changed in the vendor’s laboratory. Over time, the mechanisms introduced for reducing energy consumption (first in laptops) became available more broadly. In most current machines, memory frequency can be configured by the user at boot time, while CPU frequency can be varied on a live system.

The assumption that “memory accesses overlap with other memory accesses about as well as they do in the STREAM Benchmark” is based on trying lots of other formulations and getting poor consistency with observations.
Note that “compute” is not a simple metric like “instructions” or “floating-point operations”. T_cpu is best understood as the time that the core requires to execute the particular workload of interest in the absence of memory references.

Only talking about CPU2006 results today – the CPU2000 results look similar (see the 2007 presentation linked above), but the CPU2000 benchmark codes are less closely related to real applications.

Building separate models for each of the benchmarks was required to get the correct asymptotic properties. The geometric mean used to combine the individual benchmark results into a single metric is the right way to combine relative performance improvements with equal weighting for each code, but it is inconsistent with the underlying “physics” of computer performance for each of the individual benchmark results.

This system would be considered a “high-memory-bandwidth” system at the time these results were collected. In other words, this system would be expected to be CPU-limited more often than other systems (when running the same workload), because it would be memory-bandwidth limited less often. This system also had significantly lower memory latency than many contemporary systems (which were still using front-side bus architectures and separate “NorthBridge” chips).

Many of these applications (e.g., NAMD, Gamess, Gromacs, DealII, WRF, and MILC) are major consumers of cycles on TACC supercomputers (albeit newer versions and different datasets).

The published SPEC benchmarks are no longer useful to support this sensitivity-based modeling approach for two main reasons:

  1. Running N independent copies of a benchmark simultaneously on N cores has a lot of similarities with running a parallelized implementation of the benchmark when N is small (2 or 4, or maybe a little higher), but the performance characteristics diverge as N gets larger (certainly dubious by the time on reaches even half the core count of today’s high-end processors).
  2. For marketing reasons, the published results since 2007 have been limited almost exclusively to the configurations that give the very best results. This includes always running with HyperThreading enabled (and running one copy of the executable on each “logical processor”), always running with automatic parallelization enabled (making it very difficult to compare “speed” and “rate” results, since it is not clear how many cores are used in the “speed” tests), always running with the optimum memory configuration, etc.

The “CONUS 12km” benchmark is a simulation of the weather over the “CONtinental US” at 12km horizontal resolution. Although it is a relatively small benchmark, the performance characteristics have been verified to be quite similar to the “on-node” performance characteristics of higher-resolution test cases (e.g., “CONUS 2.5km”) — especially when the higher-resolution cases are parallelized across multiple nodes.

Note that the execution time varies from about 120 seconds to 210 seconds — this range is large compared to the deviations between the model and the observed execution time.

Note also that the slope of the Model 1 fit is almost 6% off of the desired value of 1.0, while the second model is within 1%.

  • In the 2007 SPECfp_rate tests, a similar phenomenon was seen, and required the addition of a third component to the model: memory latency.
  • In these tests, we did not have the same ability to vary memory latency that I had with the 2007 Opteron systems. In these “real-application” tests, IO is not negligible (while it is required to be <1% of execution time for the SPEC benchmarks), and allowing for a small invariant IO time gave much better results.

Bottom bars are model CPU time – easy to see the quantization.
Middle bars are model Memory time.
Top bars are (constant) IO time.

Drum roll, please….

Ordered differently, but the same sort of picture.
Here the quantization of memory time is visible across the four groups of varying CPU frequencies.

These NAMD results are not at all surprising — NAMD has extremely high cache re-use and therefore very low rates of main memory access — but it was important to see if this testing methodology replicated this expected result.

Big change of direction here….
At the beginning I said that I was assuming that there would be no overlap across the execution times associated with the various work components.
The extremely close agreement between the model results and observations strongly supports the effectiveness of this assumption.
On the other hand, overlap is certainly possible, so what can this methodology provide for us in the way of bounds on the degree of overlap?

On some HW it is possible (but very rare) to get timings (slightly) outside these bounds – I ignore such cases here.
Note that maximum ratio of upper bound over lower bound is equal to “N” – the degrees of freedom of the model! This is an uncomfortably large range of uncertainty – making it even more important to understand bounds on overlap.

Physically this is saying that there can’t be so much work in any of the components that processing that work would exceed the total time observed.
But the goal is to learn something about the ratios of the work components, so we need to go further.

These numbers come from plugging in synthetic performance numbers from a model with variable overlap into the bounds analysis.
Message 1: If you want tight formal bounds on overlap, you need to be able to vary the “rate” parameters over a large range — probably too large to be practical.
Message 2: If one of the estimated time components is small and you cannot vary the corresponding rate parameter over a large enough range, it may be impossible to tell whether the work component is “fuller overlapped” or is associated with a negligible amount of work (e.g., the lower bound in the “2:1 R_mem” case in this figure). (See next slide.)

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Architecture, Performance | Tagged: , , | Comments Off on The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Timing Methodology for MPI Programs

Posted by John D. McCalpin, Ph.D. on March 4, 2019

While working on the implementation of the MPI version of the STREAM benchmark, I realized that there were some subtleties in timing that could easily lead to inaccurate and/or misleading results.  This post is a transcription of my notes as I looked at the issues….

Primary requirement: I want a measure of wall clock time that is guaranteed to start before any rank does work and to end after all ranks have finished their work.

Secondary goal: I also want the start time to be as late as possible relative to the initiation of work by any rank, and for the end time to be as early as possible relative to the completion of the work by all ranks.

I am not particularly concerned about OS scheduling issues, so I can assume that timers will be very close to the execution time of the the preceding statement completion and the subsequent statement initiation.  Any deviations caused by stalls between timers, barriers, and work must be in the direction of increasing the reported time, not decreasing it.  (This is a corollary of my initial requirement.)
The discussion here will be based on a simple example, where the “t” variables are (local) wall clock times for MPI rank k and WORK() represents the parallel workload that I am testing.
Generically, I want:
      t_start(k) = time()
      t_end(k) = time()
but for an MPI job, the methodology needs to be provably correct for arbitrary (real) skew across nodes as well as for arbitrary offsets between the absolute time across nodes.  (I am deliberately ignoring the rare case in which a clock is modified on one or more nodes during a run — most time protocols try hard to avoid such shifts, and instead change the rate at which the clock is incremented to drive synchronization.)
After some thinking, I came up with this pseudo-code, which is executed independently by each MPI rank (indexed by “k”):
      t0(k) = time()
      t1(k) = time()


      t2(k) = time()
      t3(k) = time()
If the clocks are synchronized, then all I need is:
    tstart = min(t1(k)), k=1..numranks
    tstop  = max(t2(k)), k=1..numranks
If the clocks are not synchronized, then I need to make some use of the barriers — but exactly how?
In the pseudo-code above, the barriers ensure that the following two statements are true:
  • For the start time, t0(k) is guaranteed to be earlier than the initiation of any work.
  • For the end time, t3(k) is guaranteed to be later than the completion of any work.
These statements are true for each rank individually, so the tightest bound available from the collection of t0(k) and t3(k) values is:
      tstop - tstart > min(t3(k)-t0(k)),   k=1..numranks
This gives a (tstop – tstart) that is at least as large as the time required for the actual work plus the time required for the two MPI_barrier() operations.
Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Performance, Reference | Tagged: | Comments Off on Timing Methodology for MPI Programs

Intel’s future “CLDEMOTE” instruction

Posted by John D. McCalpin, Ph.D. on February 18, 2019

I recently saw a reference to a future Intel “Atom” core called “Tremont” and ran across an interesting new instruction, “CLDEMOTE”, that will be supported in “Future Tremont and later” microarchitectures (ref: “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference”, document 319433-035, October 2018).

The “CLDEMOTE” instruction is a “hint” to the hardware that it might help performance to move a cache line from the cache level(s) closest to the core to a cache level that is further from the core.

What might such a hint be good for?   There are two obvious use cases:

  • Temporal Locality Control:  The cache line is expected to be re-used, but not so soon that it should remain in the closest/smallest cache.
  • Cache-to-Cache Intervention Optimization:  The cache line is expected to be accessed soon by a different core, and cache-to-cache interventions may be faster if the data is not in the closest level(s) of cache.
    • Intel’s instruction description mentions this use case explicitly.

If you are not a “cache hint instruction” enthusiast, this may not seem like a big deal, but it actually represents a relatively important shift in instruction design philosophy.

Instructions that directly pertain to caching can be grouped into three categories:

  1. Mandatory Control
    • The specified cache transaction must take place to guarantee correctness.
    • E.g., In a system with some non-volatile memory, a processor must have a way to guarantee that dirty data has been written from the (volatile) caches to the non-volatile memory.    The Intel CLWB instruction was added for this purpose — see
  2. “Direct” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of specific transactions on a model architecture (with caveats that an implementation might do something different).
    • E.g., Intel’s PREFETCHW instruction requests that the cache line be loaded in anticipation of a store to the line.
      • This allows the cache line to be brought into the processor in advance of the store.
      • More importantly, it also allows the cache coherence transactions associated with obtaining exclusive access to the cache line to be started in advance of the store.
  3. “Indirect” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of the semantics of the program, not in terms of specific cache transactions (though specific transactions might be provided as an example).
    • E.g., “Push For Sharing Instruction” (U.S. Patent 8099557) is a hint to the processor that the current process is finished working on a cache line and that another processing core in the same coherence domain is expected to access the cache line next.  The hardware should move the cache line and/or modify its state to minimize the overhead that the other core will incur in accessing this line.
      • I was the lead inventor on this patent, which was filed in 2008 while I was working at AMD.
      • The patent was an attempt to generalize my earlier U.S. Patent 7194587, “Localized Cache Block Flush Instruction”, filed while I was at IBM in 2003.
    • Intel’s CLDEMOTE instruction is clearly very similar to my “Push For Sharing” instruction in both philosophy and intent.


Even though I have contributed to several patents on cache control instructions, I have decidedly mixed feelings about the entire approach.  There are several issues at play here:

  • The gap between processor and memory performance continues to increase, making performance more and more sensitive to the effective use of caches.
  • Cache hierarchies are continuing to increase in complexity, making it more difficult to understand what to do for optimal performance — even if precise control were available.
    • This has led to the near-extinction of “bottom-up” performance analysis in computing — first among customers, but also among vendors.
  • The cost of designing processors continues to increase, with caches & coherence playing a significant role in the increased cost of design and validation.
  • Technology trends show that the power associated with data motion (including cache coherence traffic) has come to far exceed the power required by computations, and that the ratio will continue to increase.
    • This does not currently dominate costs (as discussed in the SC16 talk cited above), but that is in large part because the processors have remained expensive!
    • Decreasing processor cost will require simpler designs — this decreases the development cost that must be recovered and simultaneously reduces the barriers to entry into the market for processors (allowing more competition and innovation).

Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power.  More of the same is not an answer — new thinking is required.

If one starts from current technology (rather than the technology of the late 1980’s), one would naturally design architectures to address the primary challenges:

  • “Vertical” movement of data (i.e., “private” data moving up and down the levels of a memory hierarchy) must be explicitly controllable.
  • “Horizontal” movement of data (e.g., “shared” data used to communicate between processing elements) must be explicitly controllable.

Continuing to apply “band-aids” to the transparent caching architecture of the 1980’s will not help move the industry toward the next disruptive innovation.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Tagged: , | Comments Off on Intel’s future “CLDEMOTE” instruction

New Year’s Updates

Posted by John D. McCalpin, Ph.D. on January 9, 2019

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Architecture, Performance | Tagged: , , | 2 Comments »

SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Posted by John D. McCalpin, Ph.D. on January 7, 2019

Here are the annotated slides from my SC18 presentation on Snoop Filter Conflicts that cause performance variability in HPL and DGEMM on the Xeon Platinum 8160 processor.

This slide presentation includes data (not included in the paper) showing that Snoop Filter Conflicts occur in all Intel Scalable Processors (a.k.a., “Skylake Xeon”) with 18 cores or more, and also occurs on the Xeon Phi x200 processors (“Knights Landing”).

The published paper is available (with ACM subscription) at

This is less boring than it sounds!

A more exciting version of the title.

This story is very abridged — please read the paper!

Execution times only — no performance counters yet.

500 nodes tested, but only 392 nodes had the 7 runs needed for a good computation of the median performance.

Dozens of different nodes showed slowdowns of greater than 5%.

I measured memory bandwidth first simply because I had the tools to do this easily.
Read memory controller performance counters before and after each execution and compute DRAM traffic.
Write traffic was almost constant — only the read traffic showed significant variability.

It is important to decouple the sockets for several reasons.  (1) Each socket manages its frequency independently to remain within the Running Average Power Limit. (2) Cache coherence is managed differently within and between sockets.
The performance counter infrastructure is at
Over 25,000 DGEMM runs in total, generating over 240 GiB of performance counter output.

I already saw that slow runs were associated with higher DRAM traffic, but needed to find out which level(s) of the cache were experience extra load misses.
The strongest correlation between execution time and cache miss counts was with L2 misses (measured here as L2 cache fills).

The variation of L2 fills for the full-speed runs is surprisingly large, but the slow runs all have L2 fill counts that are at least 1.5x the minimum value.
Some runs tolerate increased L2 fill counts up to 2x the minimum value, but all cases with >2x L2 fills are slow.

This chart looks at the sum of L2 fills for all the cores on the chip — next I will look at whether these misses are uniform across the cores.

I picked 15-20 cases in which a “good” trial (at or above median performance) was followed immediately by a “slow” trial (at least 20% below median performance).
This shows the L2 Fills by core for a “good” trial — the red dashed line corresponds to the minimum L2 fill count from the previous chart divided by 24 to get the minimum per-core value.
Different sets of cores and different numbers of cores had high counts in each run — even on the same node.

This adds the “slow” execution that immediately followed the “good” execution.
For the slow runs, most of cores had highly elevated L2 counts.  Again, different sets of cores and different numbers of cores had high counts in each run.

This data provides a critical clue:  Since L2 caches are private and 2MiB caches fully control the L2 cache index bits, the extra L2 cache misses must be caused by something *external* to the cores.

The Snoop Filter is essentially the same as the directory tags of the inclusive L3 cache of previous Intel Xeon processors, but without room to store the data for the cache lines tracked.
The key concept is “inclusivity” — lines tracked by a Snoop Filter entry must be invalidated before that Snoop Filter entry can be freed to track a different cache line address.

I initially found some poorly documented core counters that looked like they were related to Snoop Filter evictions, then later found counters in the “uncore” that count Snoop Filter evictions directly.
This allowed direct confirmation of my hypothesis, as summarized in the next slides.

About 1% of the runs are more than 10% slower than the fastest run.

Snoop Filter Evictions clearly account for the majority of the excess L2 fills.

But there is one more feature of the “slow” runs….

For all of the “slow” runs, the DRAM traffic is increased.  This means that a fraction of the data evicted from the L2 caches by the Snoop Filter evictions was also evicted from the L3 cache, and so must be retrieved from DRAM.

At high Snoop Filter conflict rates (>4e10 Snoop Filter evictions), all of the cases have elevated DRAM traffic, with something like 10%-15% of the Snoop Filter evictions missing in the L3 cache.

There are some cases in the range of 100-110 seconds that have elevated snoop filter evictions, but not elevated DRAM reads, that show minimal slowdowns.

This suggests that DGEMM can tolerate the extra latency of L2 miss/L3 hit for its data, but not the extra latency of L2 miss/L3 miss/DRAM hit.

Based on my experience in processor design groups at SGI, IBM, and AMD, I wondered if using contiguous physical addresses might avoid these snoop filter conflicts….

Baseline with 2MiB pages.

With 1GiB pages, the tail almost completely disappears in both width and depth.

Zooming in on the slowest 10% of the runs shows no evidence of systematic slowdowns when using 1GiB pages.
The performance counter data confirms that the snoop filter eviction rate is very small.

So we have a fix for single-socket DGEMM, what about HPL?

Intel provided a test version of their optimized HPL benchmark in December 2017 that supported 1GiB pages.

First, I verified that the performance variability for single-node (2-socket) HPL runs was eliminated by using 1GiB pages.

The variation across nodes is strong (due to different thermal characteristics), but the variation across runs on each node is extremely small.

The 8.6% range of average performance for this set of 31 nodes increases to >12% when considering the full 1736-node SKX partition of the Stampede2 system.

So we have a fix for single-node HPL, what about the full system?

Intel provided a full version of their optimized HPL benchmark in March 2018 and we ran it on the full system in April 2018.

The estimated breakdown of performance improvement into individual contributions is a ballpark estimate — it would be a huge project to measure the details at this scale.

The “practical peak performance” of this system is 8.77 PFLOPS on the KNLs plus 3.73 PFLOPS on the SKX nodes, for 12.5 PFLOPS “practical peak”.  The 10.68 PFLOPS obtained is about 85% of this peak performance.

During the review of the paper, I was able to simplify the test further to allow quick testing on other systems (and larger ensembles).

This is mostly new material (not in the paper).

This lets me more directly address my hypothesis about conflicts with contiguous physical addresses, since each 1GiB page is much larger than the 24 MiB of aggregate L2 cache.

It turns out I was wrong — Snoop Filter Conflicts can occur with contiguous physical addresses on this processor.

The pattern repeats every 256 MiB.

If the re-used space is in the 1st 32 MiB of any 1GiB page, there will be no Snoop Filter Conflicts.

What about other processors?

I tested Skylake Xeon processors with 14, 16, 18, 20, 22, 24, 26, 28 cores, and a 68-core KNL (Xeon Phi 7250).

These four processors are the only ones that show Snoop Filter Conflicts with contiguous physical addresses.

But with random 2MiB pages, all processors with more than 16 cores show Snoop Filter conflicts for some combinations of addresses…..

These are average L2 miss rates — individual cores can have significantly higher miss rates (and the maximum miss rate may be the controlling factor in performance for multi-threaded codes).

The details are interesting, but no time in the current presentation….

Overall, the uncertainty associated with this performance variability is probably more important than the performance loss.

Using performance counter measurements to look for codes that are subject to this issue is a serious “needle in haystack” problem — it is probably easier to choose codes that might have the properties above and test them explicitly.

Cache-contained shallow water model, cache-contained FFTs.

The new DGEMM implementation uses dynamic scheduling of the block updates to decouple the memory access patterns.  There is no guarantee that this will alleviate the Snoop Filter Conflict problem, but in this case it does.

I now have a model that predicts all Snoop Filter Conflicts involving 2MiB pages on the 24-core SKX processors.
Unfortunately, the zonesort approach won’t work under Linux because pages are allocated on a “first-come, first-served” basis, so the fine control required is not possible.

An OS with support for page coloring (such as BSD) could me modified to provide this mitigation.

Again, the inability of Linux to use the virtual address as a criterion for selecting the physical page to use will prevent any sorting-based approach from working.

Intel has prepared a response.  If you are interested, you should ask your Intel representative for a copy.


Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture, Linux, Performance Counters | Tagged: , | Comments Off on SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Posted by John D. McCalpin, Ph.D. on September 17, 2018

Most Intel microprocessors support “HyperThreading” (Intel’s trademark for their implementation of “simultaneous multithreading”) — which allows the hardware to support (typically) two “Logical Processors” for each physical core. Processes running on the two Logical Processors share most of the processor resources (particularly caches and execution units). Some workloads (particularly heterogeneous ones) benefit from assigning processes to all logical processors, while other workloads (particularly homogeneous workloads, or cache-capacity-sensitive workloads) provide the best performance when running only one process on each physical core (i.e., leaving half of the Logical Processors idle).

Last year I was trying to diagnose a mild slowdown in a code, and wanted to be able to use the hardware performance counters to divide processor activity into four categories:

  1. Neither Logical Processor active
  2. Logical Processor 0 Active, Logical Processor 1 Inactive
  3. Logical Processor 0 Inactive, Logical Processor 1 Active
  4. Both Logical Processors Active

It was not immediately obvious how to obtain this split from the available performance counters.

Every recent Intel processor has:

  • An invariant, non-stop Time Stamp Counter (TSC)
  • Three “fixed-function” performance counters per logical processor
    1. Fixed-Function Counter 0: Instructions retired (not used here)
    2. Fixed-Function Counter 1: Actual Cycles Not Halted
    3. Fixed-Function Counter 2: Reference Cycles Not Halted
  • Two or more (typically 4) programmable performance counters per logical processor
    • A few of the “events” are common across all processors, but most are model-specific.

The fixed-function “Reference Cycles Not Halted” counter increments at the same rate as the TSC, but only while the Logical Processor is not halted. So for any interval, I can divide the change in Reference Cycles Not Halted by the change in the TSC to get the “utilization” — the fraction of the time that the Logical Processor was Not Halted. This value can be computed independently for each Logical Processor, but more information is needed to assign cycles to the four categories.   There are some special cases where partial information is available — for example, if the “utilization” is close to 1.0 for both  Logical Processors for an interval, then the processor must have had “Both Logical Processors Active” (category 4) for most of that interval.    On the other hand, if the utilization on each Logical Processor was close to 0.5 for an interval, the two logical processors could have been active at the same time for 1/2 of the cycles (50% idle + 50% both active), or the two logical processors could have been active at separate times (50% logical processor 0 only + 50% logical processor 1 only), or somewhere in between.

Both the fixed-function counters and the programmable counters have a configuration bit called “AnyThread” that, when set, causes the counter to increment if the corresponding event occurs on any logical processor of the core.  This is definitely going to be helpful, but the both the algebra and the specific programming of the counters have some subtleties….

The first subtlety is related to some confusing changes in the clocks of various processors and how the performance counter values are scaled.

  • The TSC increments at a fixed rate.
    • For most Intel processors this rate is the same as the “nominal” processor frequency.
      • Starting with Skylake (client) processors, the story is complicated and I won’t go into it here.
    • It is not clear exactly how often (or how much) the TSC is incremented, since the hardware instruction to read the TSC (RDTSC) requires between ~20 and ~40 cycles to execute, depending on the processor frequency and processor generation.
  • The Fixed-Function “Unhalted Reference Cycles” counts at the same rate as the TSC, but only when the processor is not halted.
    • Unlike the TSC, the Fixed-Function “Unhalted Reference Cycles” counter increments by a fixed amount at each increment of a slower clock.
    • For Nehalem and Westmere processors, the slower clock was a 133 MHz reference clock.
    • For Sandy Bridge through Broadwell processors, the “slower clock” was the 100 MHz reference clock referred to as the “XCLK”.
      • This clock was also used in the definition of the processor frequencies.
      • For example, the Xeon E5-2680 processor had a nominal frequency of 2.7 GHz, so the TSC would increment (more-or-less continuously) at 2.7 GHz, while the Fixed-Function “Unhalted Reference Cycles” counter would increment by 27 once every 10 ns (i.e., once every tick of the 100 MHz XCLK).
    • For Skylake and newer processors, the processor frequencies are still defined in reference to a 100 MHz reference clock, but the Fixed-Function “Unhalted Reference Cycles” counter is incremented less frequently.
      • For the Xeon Platinum 8160 (nominally 2.1 GHz), the 25 MHz “core crystal clock” is used, so the counter increments by 84 once every 40 ns, rather than by 21 once every 10 ns.
  • The programmable performance counter event that most closely corresponds to the Fixed-Function “Unhalted Reference Cycles” counter has changed names and definitions several times
    • Nehalem & Westmere: “CPU_CLK_UNHALTED.REF_P” increments at the same rate as the TSC when the processor is not halted.
      • No additional scaling needed.
    • Sandy Bridge through Broadwell: “CPU_CLK_THREAD_UNHALTED.REF_XCLK” increments at the rate of the 100 MHz XCLK (not scaled!) when the processor is not halted.
      • Results must be scaled by the base CPU ratio.
    • Skylake and newer: “CPU_CLK_UNHALTED.REF_XCLK” increments at the rate of the “core crystal clock” (25 MHz on Xeon Scalable processors) when the processor is not halted.
      • Note that the name still includes “XCLK”, but the definition has changed!
      • Results must be scaled by 4 times the base CPU ratio.

Once the scaling for the programmable performance counter event is handled correctly, we get to move on to the algebra of converting the measurements from what is available to what I want.

For each interval, I assume that I have the following measurements before and after, with the measurements taken as close to simultaneously as possible on the two Logical Processors:

  • TSC (on either logical processor)
  • Fixed-Function “Unhalted Reference Cycles” (on each logical processor)
  • Programmable CPU_CLK_UNHALTED.REF_XCLK with the “AnyThread” bit set (on either Logical Processor)

So each Logical Processor makes two measurements, but they are asymmetric.

From these results, the algebra required to split the counts into the desired categories is not entirely obvious.  I eventually worked up the following sequence:

  1. Neither Logical Processor Active == Elapsed TSC – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
  2. Logical Processor 0 Active, Logical Processor 1 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”
  3. Logical Processor 1 Active, Logical Processor 0 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)”
  4. Both Logical Processors Active == CPU_CLK_UNHALTED.REF_XCLK*scaling_factor – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”

Starting with the Skylake core, there is an additional sub-event of the programmable CPU_CLK_UNHALTED event that increments only when the current Logical Processor is active and the sibling Logical Processor is inactive.  This can certainly be used to obtain the same results, but it does not appear to save any effort.   My approach uses only one programmable counter on one of the two Logical Processors — a number that cannot be reduced by using an alternate programmable counter.   Comparison of the two approaches shows that the results are the same, so in the interest of backward compatibility, I continue to use my original approach.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Performance, Performance Counters, Reference | Comments Off on Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Why I hate MPI (from a performance analysis perspective)

Posted by John D. McCalpin, Ph.D. on August 1, 2018

According to Dr. Bandwidth, performance analysis has two recurring themes:

  1. How fast should this code (or “simple” variations on this code) run on this hardware?
  2. If I am analyzing (apparent) performance shortfalls, how can I distinguish between cause and effect?

For very simple codes, it may be possible to do a high-level analysis on performance limitations, but once the code becomes complex, it is often necessary to investigate the full stack.  This can start with either a “top-down” or “bottom-up” approach, but in complex codes running on complex hardware, what is really required is both approaches — iterated until the interactions between all the components are understood.  This is an intellectually challenging and labor-intensive exercise, requiring detailed review of the published details of each of the components of the system, and usually requiring significant “detective work” (using customized microbenchmarks, hardware performance counter analysis, and creative thinking) to fill in the gaps.

MPI is a commonly used standard to implement message-passing programs on computer clusters.   The basic interfaces are simple to learn, and I suspect that many people have no idea how many levels of interacting components are involved in executing an MPI program on a cluster.

A discussion with a colleague prompted me to try to write down a list of the interacting components, along with some comments on why it is so hard to understand the interactions between components.

Interacting components in the execution of an MPI job — a brief outline (from memory):

  1. The user source code, which contains an ordered set of calls to MPI routines.
    • There are many different types of MPI calls that can be used to perform the required communication.
    • For implementations containing multiple MPI calls, there are (typically) many orders in which the routines can be called that are all functionally equivalent.
    • The MPI standard requires that many sequences of calls (appear to) be ordered, but there are exceptions.
  2. The user execution environment.
    • This will typically include environment variables that can influence the behavior of the MPI runtime, and might include environment variables that can influence the behavior of the lower-level shared-memory transport and/or network hardware interfaces.
    • The user environment defines the mapping of MPI ranks to hardware resources (cores, sockets, nodes).
  3. The MPI runtime library.
    • An MPI runtime library will typically include multiple implementations of each MPI function, and will choose implementation(s) to execute based on environment variables and run-time information (sizes, communicators, etc.), in ways that are seldom transparent.
      • The source code to the library may not be available.
      • Low-level MPI tracing may be required to determine what the runtime library decided to execute, and even that may not be enough to fully capture the “decisions” made at lower levels of the implementation.
    • An MPI library may aggregate or re-order messages in any manner that provides the appearance of conforming to the ordering rules of the MPI specification.
      • In particular, the MPI library may violate ordering rules of the MPI standard if it can prove (typically via runtime checks) that the result is the same as what would be obtained by code that explicitly followed the ordering rules. This sort of speculation means that multiple runs of the same code on the same nodes may end up performing different sequences of operations.
    • MPI runtime libraries are typically built to interact with one or more lower-level interfaces to shared-memory transport implementation(s) and networking hardware.
      • The implementations of such interfaces may be based on anywhere between a superficial understanding of the lower-level interface, and a deeply detailed understanding of the lower-level interface.
      • Different levels of understanding may lead to different choices for exactly how the MPI library chooses to implement a function. For example, the MPI library may (or may not) know specific details of the ordering rules followed by the low-level interface, and so may (or may not) be able to choose different sequences of low-level operations.
  4. The implementation(s) of the low-level interface(s) to the shared-memory transport implementation and to the networking hardware interface.
    • These vary in complexity and sophistication.
    • The explicit functionality exposed in these low-level APIs may or may not be a good map to the functionality of the MPI standard(s) and/or to the functionality of the underlying hardware.
    • Mismatches in semantics or ordering rules will require an implementation to choose overly conservative sequences of low-level operations, which typically reduces performance.
  5. The processor hardware available to support shared-memory transport.
    • Current processor architectures do not include cross-core communication or synchronization as first-class architectural concepts, so implementation of communication and synchronization must be done using ordered sequences of loads and stores.
    • Detailed analysis of the cache coherence transactions involved in shared-memory synchronization (especially across sockets) show that the lower bound on (uncontested) latency is at least an order of magnitude higher than what a hardware implementation should be able to support. For highly contested accesses, shared-memory synchronization latency is typically several orders of magnitude higher than what a hardware implementation should be able to support.  Few implementations have performance that approaches the lower bounds on shared-memory synchronization latency.
    • For almost all recent systems, a single thread of execution can only generate a fraction of the bandwidth available within or between sockets, limiting performance of the “one MPI task per socket” hybrid programming model.
      • Full bandwidth to/from local DDR4 memory typically requires 6 or more cores per socket.
      • Full bandwidth to/from high-bandwidth memory or L3 cache typically requires half or more of the cores in the socket.
      • Full bandwidth between sockets varies by processor generation and access pattern, but is also typically 6 or more cores.
  6. The networking hardware.
    • Low-level implementation details and performance characterization data is not typically available for network interfaces and switches.
    • Most networking hardware was not designed to support highly efficient implementation of the MPI standards.
    • The MPI standards were not designed to support highly efficient implementations on most networking hardware.
  7. Runtime contention
    • Although most supercomputing systems use “space-sharing” to provide exclusive access to nodes, most supercomputing systems use an interconnect that does not separate traffic between different user’s jobs.
    • Many current interconnects (InfiniBand, OmniPath) use static routing, making self-conflicts unavoidable (in practical terms) for any MPI job including nodes on more than one leaf switch.
      • It is theoretically possible to avoid self-conflicts on a dedicated (non-oversubscribed) system, but this requires a herculean effort to load and parse the routing tables and to schedule all communications to avoid mapping more than one pairwise communication to any link in the multi-level switch hierarchy.
      • Because routing tables are not required to be symmetric, it might not be possible to create a non-self-conflicting schedule of MPI rank pairs if bidirectional communication is used.
      • Scheduling communications to avoid self-conflicts requires global barriers between steps of pairwise communications (e.g., between steps in an all-to-all communication). This is not “natural” for many/most MPI jobs, and the barrier overhead may result in overall performance degradation if the messages in the original communication are short.

Understanding any one of these components is not usually too hard (unless the implementation is undocumented).

Understanding the interaction of any two interacting components is a bit more work, but is typically manageable by a single person.

Understanding enough of the details of all of the components to be able to reason about the interactions typically requires at least 2-3 people with complementary expertise. Keeping up with changes to the hardware and software, and applying the analyses to the ever-expanding ensemble of user applications would make this a large fraction of a full-time assignment for the team members.

To make it worse, a supercomputing center like TACC has many different systems with a variety of interconnect hardware (several generations of Mellanox InfiniBand, Intel OmniPath, Cray Aries) and several different MPI stacks (Intel MPI, Cray’s version of MPICH, MVAPICH, OpenMPI). Some of these are sufficiently different that it may be necessary to have independent teams specializing in “full-stack” performance analyses for different systems.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Architecture, Computer Hardware, Performance | Tagged: | Comments Off on Why I hate MPI (from a performance analysis perspective)

Comments on timing short code sections on Intel processors

Posted by John D. McCalpin, Ph.D. on July 23, 2018

(From a recent post of mine on the Intel software developer forums — some potentially useful words to go along with my new low-overhead-timers project…)

Updates on 2019-01-23 in blue.

There are lots of topics that you need to be aware of when attempting fine-grain timing.  A few of the more important ones are:

  • The RDTSC instruction increments at the rate of the “base” (or “nominal”) processor frequency, while instructions are executed at the “core frequency”.  The “core frequency” may be higher or lower than the “base” frequency, and it may change during your measurement interval.
    • If you have the ability to “pin” the processor frequency to match the “base” frequency, interpreting the results is often easier.
    • Whether you can fix the frequency or not, you will still need to measure several different things to be sure that you can unambiguously interpret the results.  More on this below.
  • With Turbo mode enabled, Intel processors will change their frequency based on how many cores are active.  When running a single user thread, you will often get the advertised single-core Turbo frequency, but if the operating system enables more cores to handle (even very short-lived) background processes, your frequency may drop unexpectedly.
  • Recent Intel processors often throttle down to a low frequency when not in use, and (depending on processor generation, BIOS settings, and OS settings) it may take longer than expected for the frequency to ramp back up to the expected values.
    • I usually precede the code that I want to test with a “warm-up” loop consisting of at least a few seconds of execution of instructions using the same SIMD width as the code that I want to test.
  • Always pin the thread you want to test to a single logical processor (if possible).
    • This allows you to use the RDPMC instruction to read the logical processor’s fixed-function performance counters.
    • It also reduces the chance of frequency changes or other stalls that may be incurred when moving a thread context to a different core.

For measurements of short duration (<< 1 second)

  • Intel processors will be halted during frequency changes, and recent Intel processors (Haswell and newer) will also be halted when activating and/or deactivating the portions of the pipeline(s) needed for 256-bit SIMD instructions and for 512-bit SIMD instructions.
    • The duration of these halts varies by product and in some cases by the amount of the frequency change.  I have seen values as low as 6 microseconds and as high as 50 microseconds for these types of transitions.

For measurements of very short duration (< 100’s of cycles)

  • The RDTSC instruction is not ordered with respect to the execution of other instructions.  Intel processors have gained increasing ability to execute instructions out of order over the past decade, allowing the execution of these instructions to be moved further away from where one might expect — in either direction.
  • The RDTSCP instruction is partially ordered — it will not execute until all prior instructions in program order have executed.
    • RDTSCP can still be executed later than expected, but not earlier.
    • This partial ordering can help expose the execution time of long-latency instructions (such as memory accesses or mispredicted branches) that occur shortly before the final value of the TSC is read using RDTSCP.
  • The LFENCE (“Load Fence”) instruction was originally intended to order memory load operations, but was later extended (architecturally) to order instruction execution.
    • The LFENCE instruction will not execute until all prior instructions have “completed locally”, and no later instructions will begin execution (even speculatively) until the LFENCE instruction completes.
      • It may not be safe to assume that “completed locally” and “retired” are equivalent.
      • “Completed locally” is definitely not the same as “globally visible” — SFENCE is still required if you need to ensure that stores are globally visible before continuing.
    • LFENCE is a fairly lightweight instruction, though the cost depends on the nature of the surrounding instructions. 
      • The repeat rate for consecutive LFENCE instructions is 4 cycles for mainstream Intel processors starting with Sandy Bridge (through at least Skylake).
    • The combination “LFENCE; RDTSC” has a slightly stronger ordering than RDTSCP.
      • RDTSCP waits until prior instructions have completed, but does not prevent later instructions from beginning execution before the RDTSCP instruction executes.  
        • If you want a lower bound on the execution time between “start” and “stop” instructions, the minimum requirement would be to add “RDTSC; LFENCE” before the “start” instruction and to add “RDTSCP” after the “stop” instruction. 
  • The Intel branch predictors are stranger than you might expect, and branch misprediction overheads are not trivial.
    • If you repeatedly execute an inner loop with a trip count of less than about 30, the branch predictor will “remember” which iteration is the final iteration of the loop, and it will correctly predict the loop exit.
    • If you increase the inner loop trip count to 35 or more, the branch predictor will not “remember” which iteration is the final iteration, so the final loop iteration will include a mispredicted branch, with an associated overhead of 15-20 cycles.
    • This can be very hard to understand if you are looking at results for loop trip counts from (for example) 16 to 64 and you see an unexpected bump of 15-20 cycles once the trip count exceeds a limit (typically in the 32-34 range).
    • This is even more confusing when you consider vectorization and loop unrolling, which the compiler may change significantly from one compilation to the next as you fiddle with your code.

Some recommendations:

  • A set of interfaces to the RDTSC and RDPMC instructions that have very low overheads are available at low-overhead-timers
  • I recommend measuring a minimum of four values:
    • Elapsed TSC cycles (using RDTSC or RDTSCP)
    • Instructions — using the RDPMC instruction with counter number (1<<30)+0
    • Core Cycles not halted — using the RDPMC instruction with counter number (1<<30)+1
    • Reference Cycles not halted — using the RDPMC instruction with counter number (1<<30)+2
  • If you have the ability to program the general-purpose core performance counters, I also recommend measuring at least two more values:
    • Instructions executed in kernel mode.
    • Core cycles not halted in kernel mode.
  • Compute these metrics:
    • Core Utilization = (Elapsed Reference Cycles not Halted) / (Elapsed TSC cycles)
      • If this is not very close to 1.000, the processor has been halted for frequency and/or pipeline activation issues, and you need to try to figure out why.
    • Average frequency while not halted = (Elapsed Core Cycles not Halted) / (Elapsed Reference Cycles not Halted) * Base_GHz
      • This should be compared to the expected frequency for your processor, given the number of cores that you think should be active.
    • Average net frequency = (Elapsed Core Cycles not Halted) / (Elapsed TSC cycles) * Base_GHz
      • This will tell you how much of your expected frequency has been lost due to processor halts.
    • Instructions Retired / Instructions Expected
      • For simple loops, you can look at the assembly code and count instructions.
      • This value will change significantly (and repeatably) if the compiler changes the vectorization of the loop.
      • This will change randomly (upward) if the OS schedules another process on the same logical processor during your measured section.
      • For measurements of 10,000 instructions or less, this will increase by a noticeable amount if an OS timer interrupt occurs during your measured section.
    • Kernel instructions / Total instructions
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
    • Core Cycles not Halted in Kernel Mode / Core Cycles not Halted
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Comments on timing short code sections on Intel processors

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Posted by John D. McCalpin, Ph.D. on January 22, 2018

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)


In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report ( to the Xeon Phi x200 (Knights Landing) processors here at TACC.   There was no deep goal — just a desire to see the maximum GFLOPS in action.     The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished.   Instead, we found puzzle after puzzle.  After almost four weeks, we have a solid characterization of the behavior — no tested code exceeds an execution rate of 12 vector pipe instructions every 7 cycles (6/7 of the nominal peak) when executed on a single core — but we are unable to propose a testable quantitative model for the source of the throughput limitation.

Dr. Damon McDougall gave a short presentation on this study at the IXPUG 2018 Fall Conference (pdf) — I originally wrote these notes to help organize my thoughts as we were preparing the IXPUG presentation, and later decided that the extra details contained here are interesting enough for me to post it.



The Xeon Phi x200 (Knights Landing) processor is Intel’s second-generation many-core product.  The Xeon Phi 7250 processors at TACC have 68 cores per processor, and each core has two 512-bit SIMD vector pipelines.   For 64-bit floating-point data, the 512-bit Fused Multiply-Add (FMA) instructions performs 16 floating-point operations (8 adds and 8 multiplies).  Each of the two vector units can issue one FMA instruction per cycle, assuming that there are enough independent accumulators to tolerate the 6-cycle dependent-operation latency.  The minimum number of independent accumulators required is: 2 VPUs times 6 cycles = 12 independent accumulators.

The Xeon Phi x200 has six execution units (two VPUs, two ALUs, and two Memory units), but is limited to two instructions per cycle by the decode, allocation, and retirement sections of the processor pipeline. (Most of the details of the Xeon Phi x200 series presented here are from the Intel-authored paper

In our initial evaluation of the Xeon Phi x200, we did not fully appreciate the two-instruction-per-cycle limitation.  Since “peak performance” for the processor is two (512-bit SIMD) FMA instructions per cycle, any instructions that are not FMA instructions subtract directly from the available peak performance.  On “mainstream” Xeon processors, there is plenty of instruction decode/allocation/retirement bandwidth to overlap extra instructions with the SIMD FMA instructions, so we don’t usually even think about them.  Pointer arithmetic, loop index increments, loop trip count comparisons, and conditional branches are all essentially “free” on mainstream Xeon processors, but have to be considered very carefully on the Xeon Phi x200.

A “best case” scenario: DGEMM

The double-precision matrix multiplication routine “DGEMM” is typically the computational kernel that achieves the highest fraction of peak performance on high performance computing systems.  Hardware performance counter results for a simple benchmark code calling Intel’s optimized DGEMM implementation for this processor (from the Intel MKL library) show that about 20% of the dynamic instruction count consists of instructions that are not packed SIMD operations (i.e., not FMAs).  These “non-FMA” instructions include the pointer manipulation and loop control required by any loop-based code, plus explicit loads from memory and a smaller number of stores back to memory. (These are in addition to the loads that can be “piggy-backed” onto FMA instructions using the single memory input operand available for most computational operations in the Intel instruction set).  DGEMM implementations also typically require software prefetches to be interspersed with the computation to minimize memory stalls when moving from one “block” of the computation to the next.

Published DGEMM benchmark results for the Xeon Phi 7250 processor ( show maximum values of about 2100 GFLOPS when using all 68 cores (a very approximate estimate from a bar chart). Tests on one TACC development node gave slightly higher results — 2148 GFLOPS to 2254 GFLOPS (average = 2235 GFLOPS), for a set of 180 trials of a DGEMM test with M=N=K=8000 and using all 68 cores.   These runs reported a stable average frequency of 1.495 GHz, so the average of 2235 GFLOPS therefore corresponds to 68.7% of the peak performance of (68 cores * 32 FP ops/cycle/core * 1.495 GHz =) 3253 GFLOPS (note1). This is an uninspiring fraction of peak performance that would normally suggest significant inefficiencies in either the hardware or software.   In this case, however, the average of 2235 GFLOPS is more appropriately interpreted as 85.9% of the “adjusted peak performance” of 2602 GFLOPS (80% of the raw peak value — as limited by the instruction mix of the DGEMM kernel).    At 85.9% of the “adjusted peak performance”, there is no longer a significant upside to performance tuning.

Notes on DGEMM:

  1. For recent processors with power-limited frequencies, compute-intensive kernels will experience an average frequency that is a function of the characteristics of the specific processor die and of the effectiveness of the cooling system at the time of the run.  Other nodes will show lower average frequencies due to power/cooling limitations, so the numerical values must be adjusted accordingly — the percentage of peak obtained should be unchanged.
  2. It is possible to get higher values by disabling the L2 Hardware Prefetchers — up to about 2329 GFLOPS (89% of “adjusted peak”) — but that is a longer story for another day….
  3. The DGEMM efficiency values are not significantly limited by the use of all cores.  Single-core testing with the same DGEMM routine showed maximum values of just under 72% of the nominal peak (about 90% of “adjusted peak”).

Please Note: The throughput limitation we observed (12 vector instructions per 7 cycles = 85.7% of nominal peak) is significantly higher than the instruction-issue-limited vector throughput of the best DGEMM measurement we have ever observed (~73% of peak, or approximately 10 vector instructions every 7 cycles).   We are unaware of any real computational kernels whose optimal implementation will contain significantly fewer than 15% non-vector-pipe instructions, so the throughput limitation we observe is unlikely to be a significant performance limiter on any real scientific codes.  This note is therefore not intended as a criticism of the Xeon Phi x200 implementation — it is intended to document our exploration of the characteristics of this performance limitation.

Initial Experiments:

In order to approach the peak performance of the processor, we started with a slightly modified version of the code from the Colfax report above.  This code is entirely synthetic — it performs repeated FMA operations on a set of registers with no memory references in the inner loop.  The only non-FMA instructions are those required for loop control, and the number of FMA operations in the loop can be easily adjusted to change the fraction of “overhead” instructions.  The throughput limitation can be observed on a single core, so the following tests and analysis will be limited to this case.

Using the minimum number of accumulator registers needed to tolerate the pipeline latency (12), the assembly code for the inner loop is:

 addl $1, %eax
 vfmadd213pd %zmm16, %zmm17, %zmm29 
 vfmadd213pd %zmm16, %zmm17, %zmm28
 vfmadd213pd %zmm16, %zmm17, %zmm27 
 vfmadd213pd %zmm16, %zmm17, %zmm26 
 vfmadd213pd %zmm16, %zmm17, %zmm25 
 vfmadd213pd %zmm16, %zmm17, %zmm24 
 vfmadd213pd %zmm16, %zmm17, %zmm23 
 vfmadd213pd %zmm16, %zmm17, %zmm22 
 vfmadd213pd %zmm16, %zmm17, %zmm21 
 vfmadd213pd %zmm16, %zmm17, %zmm20 
 vfmadd213pd %zmm16, %zmm17, %zmm19
 vfmadd213pd %zmm16, %zmm17, %zmm18 
 cmpl $1000000000, %eax 
 jb ..B1.8

This loop contains 12 independent 512-bit FMA instructions and is executed 1 billion times.   Timers and hardware performance counters are measured immediately outside the loop, where their overhead is negligible.   Vector registers zmm18-zmm29 are the accumulators, while vector registers zmm16 and zmm17 are loop-invariant.

The loop has 15 instructions, so must require a minimum of 7.5 cycles to issue.  The three loop control instructions take 2 cycles (instead of 1.5) when measured in isolation.  When combined with other instructions, the loop control instructions require 1.5 cycles when combined with an odd number of additional instructions or 2.0 cycles in combination with an even number of additional instructions — i.e., in the absence of other stalls, the conditional branch causes the loop cycle count to round up to an integer value.  Equivalent sequences of two instructions that avoid the explicit compare instruction (e.g., pre-loading %eax with 1 billion and subtracting 1 each iteration) have either 1.0-cycle or 1.5-cycle overhead depending on the number of additional instructions (again rounding up to the nearest even cycle count).   The 12 FMA instructions are expected to require 6 cycles to issue, for a total of 8 cycles per loop iteration, or 8 billion cycles in total.   Experiments showed a highly repeatable 8.05 billion cycle execution time, with the 0.6% extra cycles almost exactly accounted for by the overhead of OS scheduler interrupts (1000 per second on this CentOS 7.4 kernel).   Note that 12 FMAs in 8 cycles is only 75% of peak, but the discrepancy here can be entirely attributed to loop overhead.

Further unrolling of the loop decreases the number of “overhead” instructions, and we expected to see an asymptotic approach to peak as the loop length was increased.  We were disappointed.

The first set of experiments compared the cycle and instruction counts for the loop above with the results from unrolling loop two and four times.    The table below shows the expected and measured instruction counts and cycle counts.

KNL 12-accumulator FMA throughput

Unrolling FactorFMA instructions per unrolled loop iterationNon-FMA instructions per unrolled loop iterationTotal Instructions per unrolled loop iterationExpected instructions (B)Measured Instructions (B)Expected Cycles (B) Measured Cycles (B)Unexpected Cycles (B)Expected %Peak GFLOPSMeasured %Peak GFLOPS% Performance shortfall
Comparison of expected and observed cycle counts for loops with 12 independent accumulators updated by 512-bit VFMADD213PD instructions on an Intel Xeon Phi 7250 processor. The loop is repeated until 12 billion FMA instructions have been executed.


Notes on methodology:

  • The unrolling was controlled by a “#pragma unroll_and_jam()” directive in the source code.   In each case the assembly code was carefully examined to ensure that the loop structure matched expectations — 12,24,48 FMAs with the appropriate ordering (for dependencies) and the same 3 loop control instructions (but with the iteration count reduced proportionately for the unrolled cases).
  • The node was allocated privately, non-essential daemons were disabled, and the test thread was bound to a single logical processor.
  • Instruction counts were obtained inline using the RDPMC instruction to read Fixed-Function Counter 0 (INST_RETIRED.ANY), while cycle counts were obtained using the RDPMC instruction to read Fixed-Function Counter 1 (CPU_CLK_UNHALTED.THREAD).
  • Execution time was greater than 4 seconds in all cases, so the overhead of reading the counters was at least 7 orders of magnitude smaller than the execution time.
  • Each test was run at least three times, and the trial with the lowest cycle count was used for the analysis in the table.

Comments on results:

  • The 12-FMA loop required 0.7% more cycles than expected.
    • Later experiments show that this overhead is essentially identical to the the fraction of cycles spent servicing the 1-millisecond OS scheduler interrupt.
  • The 24-FMA loop required 1.2% more cycles than expected.
    • About half of these extra cycles can be explained by the OS overhead, leaving an unexplained overhead in the 0.5%-0.6% range (not large enough to worry about).
  • The 48-FMA loop required 8.3% more cycles than expected.
    • Cycle count variations across trials varied by no more than 1 part in 4000, making this overhead at least 300 times the run-to-run variability.
  • The two unrolled cases gave performance results that appear to be bounded above by 6/7 (85.71%) of peak.


Initial (Incorrect) Hypothesis

My immediate response to the results was that this was a simple matter of running out of rename registers.   Modern processors (almost) all have more physical registers than they have register names.  The hardware automatically renames registers to avoid false dependencies, but with deep execution pipelines (particularly for floating-point operations), it is possible to run out of rename registers before reaching full performance.

This is easily illustrated using Little’s Law from queuing theory, which can be expressed as:

Throughput = Concurrency / Occupancy

For this scenario, “Throughput” has units of register allocations per cycle, “Concurrency” is the number of registers in use in support of all of the active instructions, and “Occupancy” is the average number of cycles that a register is busy during the execution of an instruction.

An illustrative example:   

The IBM POWER4 has 72 floating-point rename registers and two floating-point arithmetic units capable of executing fused multiply-add (FMA) instructions (a = b+c*d).   Each FMA instruction requires four registers, and these registers are all held for some number of cycles (discussed below), so full performance (both FMA units starting new operations every cycle) would require eight registers to be allocated each cycle (and for these registers to remain occupied for the duration of the corresponding instruction).   We can estimate the duration by reviewing the execution pipeline diagram (Figure 2-3) in The POWER4 Processor Introduction and Tuning Guide.  The exact details of when registers are allocated and freed is not published, but if we assume that registers are allocated in the “MP” stage (“Mapping”) and held until the “CP” (“Completion”, aka “retirement”) stage, then the registers will be held for a total of 12 cycles.  The corresponding pipeline stages from Figure 2-3 are: MP, ISS, RF, F1, F2, F3, F4, F5, F6, WB, Xfer, CP.  

Restating this in terms of Little’s Law, the peak performance of 2 FMAs per cycle corresponds to a “Throughput” of 8 registers allocated per cycle.  With an “Occupancy” of 12 cycles for each of those registers, the required “Concurrency” is 8*12 = 96 registers.  But, as noted above, the POWER4 only has 72 floating-point rename registers.  If we assume a maximum “Concurrency” of 72 registers, the “Throughput” can be computed as 72/12 = 6 registers per cycle, or 75% of the target throughput of 8 registers allocated per cycle.    It is perhaps not a coincidence that the maximum performance I ever saw on DGEMM on a POWER4 system (while working for IBM in the POWER4 design team) was just under 70% of 2 FMAs/cycle, or just over 92% of the occupancy-limited throughput of 1.5 FMAs/cycle.  

For comparison, the IBM POWER5 processor (similar to POWER4, but with 120 floating-point rename registers) delivered up to 94% of 2 FMAs/cycle on DGEMM, suggesting that a DGEMM efficiency in the 90%-95% of peak range is appropriate for DGEMM on this architecture family.

Applying this model to Xeon Phi x200 is slightly more difficult for a number of reasons, but back-of-the-envelope estimates suggested that it was plausible.

The usual way of demonstrating that rename register occupancy is limiting performance is to change the instructions to reduce the number of registers used per instruction, or the number of cycles that the instructions hold the register, or both.  If this reduces the required concurrency to less than the number of available rename registers, full performance should be obtained.

Several quick tests with instructions using fewer registers (e.g., simple addition instead of FMA) or with fewer registers and shorter pipeline latency (e.g, bitwise XOR) showed no change in throughput — the processor still delivered a maximum throughput of 12 vector instructions every 7 cycles.

Our curiosity was piqued by these results, and more experiments followed.   These piqued us even more, eventually leading to a suite of several hundred experiments in which we varied everything that we could figure out how to vary.

We will spare the reader the chronological details, and instead provide a brief overview of the scope of the testing that followed.

Extended Experiments:

Additional experiments (each performed with multiple degrees of unrolling) that showed no change in the limitation of 12 vector instructions per 7 cycles included:

  1. Increasing the dependency latency from 6 cycles to 8 cycles (i.e., using 16 independent vector accumulators) and extending the unrolling to up to 128 FMAs per inner loop iteration.
  2. Increasing the dependency latency to 10 cycles (20 independent vector accumulators), with unrolling to test 20, 40, 60, 80 FMAs per inner loop iteration.
  3. Increasing the dependency latency to 12 cycles (24 independent vector accumulators).
  4. Replacing the 512-bit VFMADD213PD instructions with the scalar version VFMADD213SD.  (This is the AVX-512 EVEX-encoded instruction, not the VEX-encoded version.)
  5. Replacing the 512-bit VFMADD213PD instructions with the AVX2 (VEX-encoded) 256-bit versions.
  6. Increasing the number of loop-invariant registers used from 2 to 4 to 8 (and ensuring that consecutive instructions used different pairs of loop-invariant registers).
  7. Decreasing the number of loop-invariant registers per FMA from 2 to 1, drawing the other input from the output of an FMA instruction at least 12 instructions (6 cycles) away.
  8. Replacing the VFMADD213PD instructions with shorter-latency instructions (VPADDQ and VPXORQ were tested independently).
  9. Replacing the VFMADD213PD instructions with an instruction that has both shorter latency and fewer operands: VPABSQ (which has only 1 input and 1 output register).
  10. Replacing every other VFMADD213PD instruction with a shorter-latency instruction (VPXORQ).
  11. Replacing the three-instruction loop control (add, compare, branch) with two-instruction loop control (subtract, branch).  The three-instruction version counts up from zero, then compares to the iteration count to set the condition code for the terminal conditional branch.  The two-instruction version counts down from the target iteration count to zero, allowing us to use the condition code from the subtract (i.e., not zero) as the branch condition, so no compare instruction is required.  The instruction counts changed as expected, but the cycle counts did not.
  12. Forcing 16-Byte alignment for the branch target at the beginning of the inner loop. (The compiler did this automatically in some cases but not in others — we saw no difference in cycle counts when we forced it to occur).
  13. Many (not all) of the executable files were disassembled with “objump -d” to ensure that the encoding of the instructions did not exceed the limit of 3 prefixes or 8 Bytes per instruction.  We saw no cases where either of these rules were violated in the inner loops.

Additional experiments showed that the throughput limitation only applies to instructions that execute in the vector pipes:

  1. Replacing the Vector instructions with integer ALU instructions (ADDL) –> performance approached two instructions per cycle asymptotically, as expected.
  2. Replacing the Vector instructions with Load instructions from L1-resident data (to vector registers) –> performance approached two instructions per cycle asymptotically, as expected.

Some vector instructions can execute in only one of the two vector pipelines.  This is mentioned in the IEEE Micro paper linked above, but is discussed in more detail in Chapter 17 of the document “Intel 64 and IA-32 Architectures Optimization Reference Manual”, (Intel document 248966, revision 037, July 2017).  In addition, Agner Fog’s “Instruction Tables” document ( shows which of the two vector units is used for a large subset of the instructions that can only execute in one of the VPUs.   This allows another set of experiments that show:

  •   Each vector pipeline can sustain its full rate of 1 instruction per cycle when used in isolation.
    • VPU0 was tested with VPERMD, VPBROADCASTQ, and VPLZCNT.
    • VPU1 was tested with KORTESTW.
  • Alternating a VPU0 instruction (VPLZCNTQ) with a VPU1 instruction (KORTESTW) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
  • Alternating a VPU0 instruction with an FMA (that can be executed in either VPU) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
    • This was tested with VPERMD and VPLZCNT as the VPU0 instructions.
  • One specific combination of VPU0 and FMA instructions gave a reduced throughput of 1 vector instruction per cycle: VPBROADCASTQ alternating with FMA.
    • VPBROADCASTQ requires a read from a GPR (in the integer side of the core), then broadcasts the result across all the lanes of a vector register.
    • This operation is documented (in the Intel Optimization Reference Manual) as having a latency of 2 cycles and a maximum throughput of 1 per cycle (as we saw with VPBROADCASTQ running in isolation).
    • The GPR to VPU move is a sufficiently uncommon access pattern that it is not particularly surprising to find a case for which it inhibits parallelism across the VPUs, though it is unclear why this is the only case we found that allows the use of both vector pipelines but is still limited to 1 instruction per cycle.

Additional Performance Counter Measurements and Second-Order Effects:

After the first few dozens of experiments, the test codes were augmented with more hardware performance counters.  The full set of counters measured before and after the loop includes:

  • Time Stamp Counter (TSC)
  • Fixed-Function Counter 0 (Instructions Retired)
  • Fixed-Function Counter 1 (Core Cycles Not Halted)
  • Fixed-Function Counter 2 (Reference Cycles Not Halted)
  • Programmable PMC0
  • Programmable PMC1

The TSC was collected with the RDTSC instruction, while the other five counters were collected using the RDPMC instruction.  The total overhead for measuring these six counters is about 250 cycles, compared to a minimum of 4 billion cycles for the monitored loop.

Several programmable performance counter events were collected as “sanity checks”, with negligible counts (as expected):


Another programmable performance counter event was collected to verify that the correct number of VPU instructions were being executed:

    • Typical result:
      • Nominal expected 16,000,000,000
      • Measured in user-space: 16,000,000,016
        • This event does not count the 16 loads before the inner loop, but does count the 16 stores after the end of the inner loop.
      • Measured in kernel-space: varied from 19,626 to 21,893.
        • Not sure why the kernel is doing packed SIMD instructions, but these are spread across more than 6 seconds of execution time (>6000 scheduler interrupts).
        • These kernel instruction counts are 6 orders of magnitude smaller than the counts for tested code, so they will be ignored here.

The performance counter events with the most interesting results were:

  • NO_ALLOC_CYCLES.RAT_STALL — counts the number of core cycles in which no micro-ops were allocated and the “RAT Stall” (reservation station full) signal is asserted.
  • NO_ALLOC_CYCLES.ROB_FULL — counts the number of core cycles in which no micro-ops were allocated and the Reorder Buffer (ROB) was full.
  • RS_FULL_STALL.ALL — counts the number of core cycles in which the allocation pipeline is stalled and any of the Reservation Stations is full
    • This should be the same as NO_ALLOC_CYCLES.RAT_STALL, and in all but one case the numbers were nearly identical.
    • The RS_FULL_STALL.ALL event includes a Umask of 0x1F — five bits set.
      • This is consistent with the IEEE Micro paper (linked above) that shows 2 VPU reservation stations, 2 integer reservation stations, and one memory unit reservation station.
      • The only other Umask defined in the Intel documentation is RS_FULL_STALL.MEC (“Memory Execution Cluster”) with a value of 0x01.
      • Directed testing with VPU0 and VPU1 instructions shows that a Umask of 0x08 corresponds to the reservation station for VPU0 and a Umask of 0x10 corresponds to the reservation station for VPU1.

For the all-FMA test cases that were expected to sustain more than 12 VPU instructions per 7 cycles, the NO_ALLOC_CYCLES.RAT_STALL and RS_FULL_STALL.ALL events were a near-perfect match for the number of extra cycles taken by the loop execution.  The values were slightly larger than computation of “extra cycles”, but were always consistent with the assumption of 1.5 cycles “overhead” for the three loop control instructions (matching the instruction issue limit), rather than the 2.0 cycles that I assumed as a baseline.  This is consistent with a NO_ALLOC_CYCLES.RAT_STALL count that overlaps with cycles that are simultaneously experiencing a branch-related pipeline bubble. One or the other should be counted as a stall, but not both.   For these cases, the NO_ALLOC_CYCLES.ROB_FULL counts were negligible.

Interestingly, the individual counts for RS_FULL_STALL for the two vector pipelines were sometimes of similar magnitude and sometimes uneven, but were extremely stable for repeated runs of a particular binary.  The relative counts for the stalls in the two vector pipelines can be modified by changing the code alignment and/or instruction ordering.  In limited tests, it was possible to make either VPU report more stalls than the other, but in every case, the “effective” stall count (VPU0 stalled OR VPU1 stalled) was the amount needed to reduce the throughput to 12 VPU instructions every 7 cycles.

When interleaving vector instructions of different latencies, the total number of stall cycles remained the same (i.e., enough to limit performance to 12 VPU instructions per 7 cycles), but these were split between RAT_STALLs and ROB_STALLs in different ways for different loop lengths.   Typical results showed a pattern like:

  • 16 VPU instructions per loop iteration: approximately zero stalls, as expected
  • 32 VPU instructions per loop iteration: approximately 6.7% RAT_STALLs and negligible ROB_STALLs
  • 64 VPU instructions per loop iteration: ~1% RAT_STALLs (vs ~10% in the all-FMA case) and about 9.9% ROB_STALLs (vs ~0% in the all-FMA case).
    • Execution time increased by about 0.6% relative to the all-FMA case.
  • 128 VPU instructions per loop iteration: negligible RAT_STALLS (vs ~12% in the all-FMA case) and almost 20% ROB_STALLS (vs 0% in the all-FMA case).
    • Execution time increased by 9%, to a value that is ~2.3% slower than the 16-VPU-instruction case.

The conversion of RAT_STALLs to ROB_STALLs when interleaving instructions of different latencies does not seem surprising.  RAT_STALLs occur when instructions are backed up before execution, while ROB_STALLs occur when instructions back up before retirement.  Alternating instructions of different latencies seems guaranteed to push the shorter-latency instructions from the RAT to the ROB until the ROB backs up.  The net slowdown at 128 VPU instructions per loop iteration is not a performance concern, since asymptotic performance is available with anywhere between 24 and (almost) 64 VPU instructions in the inner loop.   These results are included because they might provide some insight into the nature of the mechanisms that limits throughput of vector instructions.


RAT_STALLs count the number of cycles in which the Allocate/Rename unit does not dispatch any micro-ops because a target Reservation Station is full.   While this does not directly equate to execution stalls (i.e., no instructions dispatched from the Vector Reservation Station to the corresponding Vector Execution Pipe), the only way the Reservation Station can become full (given an instruction stream with enough independent instructions) is the occurrence of cycles in which instructions are received (from the Allocate/Rename unit), but in which no instruction can be dispatched.    If this occurs repeatedly, the 20-entry Reservation Station will become full, and the RAT_STALL signal will be asserted to prevent the Allocate/Rename unit from sending more micro-ops.

An example code that generates RAT Stalls is a modification of the test code using too few independent accumulators to fully tolerate the pipeline latency.  For example, using 10 accumulators, the code can only tolerate 5 cycles of the 6 cycle latency of the FMA operations.  This inhibits the execution of the FMAs, which fill up the Reservation Station and back up to stall the Allocate/Rename.   Tests with 10..80 FMAs per inner loop iteration show RAT_STALL counts that match the dependency stall cycles that are not overlapped with loop control stall cycles.

We know from the single-VPU tests that the 20-entry Reservation Station for each Vector pipeline is big enough for that pipeline’s operation — no stall cycles were observed.  Therefore the stalls that prevent execution dispatch must be in the shared resources further down the pipeline.   From the IEEE Micro paper, the first execution step is to read the input values from the “rename buffer and the register file”, after which the operations proceed down their assigned vector pipeline.  The vector pipelines should be fully independent until the final execution step in which they write their output values to the rename buffer.  After this, the micro-ops will wait in the Reorder Buffer until they can be retired in program order.  If the bottleneck was in the retirement step, then I would expect the stalls to be attributed to the ROB, not the RAT.   Since the stalls in the most common cases are overwhelming RAT stalls, I conclude that the congestion is not *directly* related to instruction retirement.

As mentioned above, the predominance of RAT stalls suggests that limitations at Retirement cannot be directly responsible for the throughput limitation, but there may be an indirect mechanism at work.   The IEEE Micro paper’s section on the Allocation Unit says:

“The rename buffer stores the results of the in-flight micro-ops until they retire, at which point the results are transferred to the architectural register file.”

This comment is consistent with Figure 3 of the paper and with the comment that vector instructions read their input arguments from the “rename buffer and the register file”, implying that the rename buffer and register file are separate register arrays.  In many processor implementations there is a single “physical register” array, with the architectural registers being the subset of the physical registers that are pointed to by a mapping vector.  The mapping vector is updated every time instructions retire, but the contents of the registers do not need to be copied from one place to another.  The description of the Knights Landing implementation suggests that at retirement, results are read from the “rename buffer” and written to the “register file”.  This increases the number of ports required, since this must happen every cycle in parallel with the first step of each of the vector execution pipelines.  It seems entirely plausible that such a design could include a systematic conflict (a “structural hazard”) between the accesses needed by the execution pipes and the accesses needed by the retirement unit.  If this conflict is resolved in favor of the retirement unit, then execution would be stalled, the Reservation Stations would fill up, and the observed behavior could be generated.   If such a conflict exists, it is clearly independent of the number of input arguments (since instructions with 1, 2, and 3 input arguments have the same behavior), leaving the single output argument as the only common feature.  If such a conflict exists, it must almost certainly also be systematic — occurring independent of alignment, timing, or functional unit details — otherwise it seems likely that we would have seen at least one case in the hundreds of tests here that violates the 12/7 throughput limit.

Tests using a variant of the test code with much smaller loops (varying between 160 and 24,000 FMAs per measurement interval, repeated 100,000 times) also strongly support the 12/7 throughput limit.  In every case the minimum cycle count over the 100,000 iterations  was consistent with 12 VPU instructions every 7 cycles (plus measurement overhead).



The Intel Xeon Phi x200 (Knights Landing) appears to have a systematic throughput limit of 12 Vector Pipe instructions per 7 cycles — 6/7 of the nominal peak performance.  This throughput limitation is not displayed by the integer functional units or the memory units.  Due to the two-instruction-per-cycle limitations of allocate/rename/retire, this performance limit in the vector units is not expected to have an impact on “real” codes.   A wide variety of tests were performed to attempt to develop quantitative models that might account for this limitation, but none matched the specifics of the observed timing and performance counts.


After Damon McDougall’s presentation at the IXPUG 2018 Fall Conference, we talked to a number of Intel engineers who were familiar with this issue.  Unfortunately, we did not get a clear indication of whether their comments were covered by non-disclosure agreements, so if they gave us an explanation, I can’t repeat it….

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Hardware, Performance, Performance Counters | Tagged: , , , | Comments Off on A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Notes on “non-temporal” (aka “streaming”) stores

Posted by John D. McCalpin, Ph.D. on January 1, 2018

Memory systems using caches have a lot more potential flexibility than most implementations are able to exploit – you get the standard behavior all the time, even if an alternative behavior would be allowable and desirable in a specific circumstance.  One area in which many vendors have provided an alternative to the standard behavior is in “non-temporal” or “streaming” stores. These are sometimes also referred to as “cache-bypassing” or “non-allocating” stores, and even “Non-Globally-Ordered stores” in some cases.

The availability of streaming stores on various processors can often be tied to the improvements in performance that they provide on the STREAM benchmark, but there are several approaches to implementing this functionality, with some subtleties that may not be obvious.  (The underlying mechanism of write-combining was developed long before, and for other reasons, but exposing it to the user in standard cached memory space required the motivation provided by a highly visible benchmark.)

Before getting into details, it is helpful to review the standard behavior of a “typical” cached system for store operations.  Most recent systems use a “write-allocate” policy — when a store misses the cache, the cache line containing the store’s target address is read into the cache, then the parts of the line that are receiving the new data are updated.   These “write-allocate” cache policies have several advantages (mostly related to implementation correctness), but if the code overwrites all the data in the cache line, then reading the cache line from memory could be considered an unnecessary performance overhead.    It is this extra overhead of read transactions that makes the subject of streaming stores important for the STREAM benchmark.

As in many subjects, a lot of mistakes can be avoided by looking carefully at what the various terms mean — considering both similarities and differences.

  • “Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”).
    • So there is no benefit to keeping the data in the processor’s cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).
  • “Streaming store” is suggestive of stores to contiguous addresses (i.e., high “spatial locality”).
  • “Cache-bypassing store” says that at least some aspects of the transaction bypass the cache(s).
  • “Non-allocating store” says that a store that misses in a cache will not load the corresponding cache line into the cache before performing the store.
    • This implies that there is some other sort of structure to hold the data from the stores until it is sent to memory.  This may be called a “store buffer”, a “write buffer”, or a “write-combining buffer”.
  • “Non-globally-ordered store” says that the results of an NGO store might appear in a different order (relative to ordinary stores or to other NGO stores) on other processors.
    • All stores must appear in program order on the processor performing the stores, but may only need to appear in the same order on other processors in some special cases, such as stores related to interprocessor communication and synchronization.

There are at least three issues raised by these terms:

  1. Caching: Do we want the result of the store to displace something else from the cache?
  2. Allocating: Can we tolerate the extra memory traffic incurred by reading the cache line before overwriting it?
  3. Ordering: Do the results of this store need to appear in order with respect to other stores?

Different applications may have very different requirements with respect to these issues and may be best served by different implementations.   For example, if the priority is to prevent the stored data from displacing other data in the processor cache, then it may suffice to put the data in the cache, but mark it as Least-Recently-Used, so that it will displace as little useful data as possible.   In the case of the STREAM benchmark, the usual priority is the question of allocation — we want to avoid reading the data before over-writing it.

While I am not quite apologizing for that, it is clear that STREAM over-represents stores in general, and store misses in particular, relative to the high-bandwidth applications that I intended it to represent.

To understand the benefit of non-temporal stores, you need to understand if you are operating in (A) a concurrency-limited regime or (B) a bandwidth-limited regime. (This is not a short story, but you can learn a lot about how real systems work from studying it.)

(Note: The performance numbers below are from the original private version of this post created on 2015-05-29.   Newer systems require more concurrency — numbers will be updated when I have a few minutes to dig them up….)

(A) For a single thread you are almost always working in a concurrency-limited regime. For example, my Intel Xeon E5-2680 (Sandy Bridge EP) dual-socket systems have an idle memory latency to local memory of 79 ns and a peak DRAM bandwidth of 51.2 GB/s (per socket). If we assume that some cache miss handling buffer must be occupied for approximately one latency per transaction, then queuing theory dictates that you must maintain 79 ns * 51.2 GB/s = 4045 Bytes “in flight” at all times to “fill the memory pipeline” or “tolerate the latency”. This rounds up to 64 cache lines in flight, while a single core only supports 10 concurrent L1 Data Cache misses. In the absence of other mechanisms to move data, this limits a single thread to a read bandwidth of 10 lines * 64 Bytes/line / 79 ns = 8.1 GB/s.

Store misses are essentially the same as load misses in this analysis.

L1 hardware prefetchers don’t help performance here because they share the same 10 L1 cache miss buffers. L2 hardware prefetchers do help because bringing the data closer reduces the occupancy required by each cache miss transaction. Unfortunately, L2 hardware prefetchers are not directly controllable, so experimentation is required. The best read bandwidth that I have been able to achieve on this system is about 17.8 GB/s, which corresponds to an “effective concurrency” of about 22 cache lines or an “effective latency” of about 36 ns (for each of the 10 concurrent L1 misses).

L2 hardware prefetchers on this system are also able to perform prefetches for store miss streams, thus reducing the occupancy for store misses and increasing the store miss bandwidth.

For non-temporal stores there is no concept corresponding to “prefetch”, so you are stuck with whatever buffer occupancy the hardware gives you. Note that since the non-temporal store does not bring data *to* the processor, there is no reason for its occupancy to be tied to the memory latency. One would expect a cache miss buffer holding a non-temporal store to have a significantly lower latency than a cache miss, since it is allocated, filled, and then transfers its contents out to the memory controller (which presumably has many more buffers than the 10 cache miss buffers that the core supports).

But, while non-temporal stores are expected to have a shorter buffer occupancy than that of a cache miss that goes all the way to memory, it is not at all clear whether they will have a shorter buffer occupancy than a store misses that activates an L2 hardware prefetch engine. It turns out that on the Xeon E5-2680 (Sandy Bridge EP), non-temporal stores have *higher* occupancy than store misses that activate the L2 hardware prefetcher, so using non-temporal stores slows down the performance of each of the four STREAM kernels. I don’t have all the detailed numbers in front of me, but IIRC, STREAM Triad runs at about 10 GB/s with one thread on a Xeon E5-2680 when using non-temporal stores and at between 12-14 GB/s when *not* using non-temporal stores.

This result is not a general rule. On the Intel Xeon E3-1270 (also a Sandy Bridge core, but with the “client” L3 and memory controller), the occupancy of non-temporal stores in the L1 cache miss buffers appears to be much shorter, so there is not an occupancy-induced slowdown. On the older AMD processors (K8 and Family 10h), non-temporal stores used a set of four “write-combining registers” that were independent of the eight buffers used for L1 data cache misses. In this case the use of non-temporal stores allowed one thread to have more concurrent
memory transactions, so it almost always helped performance.

(B) STREAM is typically run using as many cores as is necessary to achieve the maximum sustainable memory bandwidth. In this bandwidth-limited regime non-temporal stores play a very different role. When using all the cores there are no longer concurrency limits, but there is a big difference in bulk memory traffic. Store misses must read the target lines into the L1 cache before overwriting it, while non-temporal stores avoid this (semantically unnecessary) load from memory. Thus for the STREAM Copy and Scale kernels, using cached stores results in two memory reads and one memory write, while using non-temporal stores requires only one memory read and one memory write — a 3:2 ratio. Similarly, the STREAM Add and Triad kernels transfer 4:3 as much data when using cached stores compared to non-temporal stores.

On most systems the reported performance ratios for STREAM (using all processors) with and without non-temporal stores are very close to these 3:2 and 4:3 ratios. It is also typically the case that STREAM results (again using all processors) are about the same for all four kernels when non-temporal stores are used, while (due to the differing ratios of extra read traffic) the Add and Triad kernels are typically ~1/3 faster on systems that use cached stores.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture, Computer Hardware, Reference | Tagged: , | Comments Off on Notes on “non-temporal” (aka “streaming”) stores