John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Performance' Category

New Year’s Updates

Posted by John D. McCalpin, Ph.D. on 9th January 2019

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:

Posted in Computer Architecture, Performance | No Comments »

Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Posted by John D. McCalpin, Ph.D. on 17th September 2018

Most Intel microprocessors support “HyperThreading” (Intel’s trademark for their implementation of “simultaneous multithreading”) — which allows the hardware to support (typically) two “Logical Processors” for each physical core. Processes running on the two Logical Processors share most of the processor resources (particularly caches and execution units). Some workloads (particularly heterogeneous ones) benefit from assigning processes to all logical processors, while other workloads (particularly homogeneous workloads, or cache-capacity-sensitive workloads) provide the best performance when running only one process on each physical core (i.e., leaving half of the Logical Processors idle).

Last year I was trying to diagnose a mild slowdown in a code, and wanted to be able to use the hardware performance counters to divide processor activity into four categories:

  1. Neither Logical Processor active
  2. Logical Processor 0 Active, Logical Processor 1 Inactive
  3. Logical Processor 0 Inactive, Logical Processor 1 Active
  4. Both Logical Processors Active

It was not immediately obvious how to obtain this split from the available performance counters.

Every recent Intel processor has:

  • An invariant, non-stop Time Stamp Counter (TSC)
  • Three “fixed-function” performance counters per logical processor
    1. Fixed-Function Counter 0: Instructions retired (not used here)
    2. Fixed-Function Counter 1: Actual Cycles Not Halted
    3. Fixed-Function Counter 2: Reference Cycles Not Halted
  • Two or more (typically 4) programmable performance counters per logical processor
    • A few of the “events” are common across all processors, but most are model-specific.

The fixed-function “Reference Cycles Not Halted” counter increments at the same rate as the TSC, but only while the Logical Processor is not halted. So for any interval, I can divide the change in Reference Cycles Not Halted by the change in the TSC to get the “utilization” — the fraction of the time that the Logical Processor was Not Halted. This value can be computed independently for each Logical Processor, but more information is needed to assign cycles to the four categories.   There are some special cases where partial information is available — for example, if the “utilization” is close to 1.0 for both  Logical Processors for an interval, then the processor must have had “Both Logical Processors Active” (category 4) for most of that interval.    On the other hand, if the utilization on each Logical Processor was close to 0.5 for an interval, the two logical processors could have been active at the same time for 1/2 of the cycles (50% idle + 50% both active), or the two logical processors could have been active at separate times (50% logical processor 0 only + 50% logical processor 1 only), or somewhere in between.

Both the fixed-function counters and the programmable counters have a configuration bit called “AnyThread” that, when set, causes the counter to increment if the corresponding event occurs on any logical processor of the core.  This is definitely going to be helpful, but the both the algebra and the specific programming of the counters have some subtleties….

The first subtlety is related to some confusing changes in the clocks of various processors and how the performance counter values are scaled.

  • The TSC increments at a fixed rate.
    • For most Intel processors this rate is the same as the “nominal” processor frequency.
      • Starting with Skylake (client) processors, the story is complicated and I won’t go into it here.
    • It is not clear exactly how often (or how much) the TSC is incremented, since the hardware instruction to read the TSC (RDTSC) requires between ~20 and ~40 cycles to execute, depending on the processor frequency and processor generation.
  • The Fixed-Function “Unhalted Reference Cycles” counts at the same rate as the TSC, but only when the processor is not halted.
    • Unlike the TSC, the Fixed-Function “Unhalted Reference Cycles” counter increments by a fixed amount at each increment of a slower clock.
    • For Nehalem and Westmere processors, the slower clock was a 133 MHz reference clock.
    • For Sandy Bridge through Broadwell processors, the “slower clock” was the 100 MHz reference clock referred to as the “XCLK”.
      • This clock was also used in the definition of the processor frequencies.
      • For example, the Xeon E5-2680 processor had a nominal frequency of 2.7 GHz, so the TSC would increment (more-or-less continuously) at 2.7 GHz, while the Fixed-Function “Unhalted Reference Cycles” counter would increment by 27 once every 10 ns (i.e., once every tick of the 100 MHz XCLK).
    • For Skylake and newer processors, the processor frequencies are still defined in reference to a 100 MHz reference clock, but the Fixed-Function “Unhalted Reference Cycles” counter is incremented less frequently.
      • For the Xeon Platinum 8160 (nominally 2.1 GHz), the 25 MHz “core crystal clock” is used, so the counter increments by 84 once every 40 ns, rather than by 21 once every 10 ns.
  • The programmable performance counter event that most closely corresponds to the Fixed-Function “Unhalted Reference Cycles” counter has changed names and definitions several times
    • Nehalem & Westmere: “CPU_CLK_UNHALTED.REF_P” increments at the same rate as the TSC when the processor is not halted.
      • No additional scaling needed.
    • Sandy Bridge through Broadwell: “CPU_CLK_THREAD_UNHALTED.REF_XCLK” increments at the rate of the 100 MHz XCLK (not scaled!) when the processor is not halted.
      • Results must be scaled by the base CPU ratio.
    • Skylake and newer: “CPU_CLK_UNHALTED.REF_XCLK” increments at the rate of the “core crystal clock” (25 MHz on Xeon Scalable processors) when the processor is not halted.
      • Note that the name still includes “XCLK”, but the definition has changed!
      • Results must be scaled by 4 times the base CPU ratio.

Once the scaling for the programmable performance counter event is handled correctly, we get to move on to the algebra of converting the measurements from what is available to what I want.

For each interval, I assume that I have the following measurements before and after, with the measurements taken as close to simultaneously as possible on the two Logical Processors:

  • TSC (on either logical processor)
  • Fixed-Function “Unhalted Reference Cycles” (on each logical processor)
  • Programmable CPU_CLK_UNHALTED.REF_XCLK with the “AnyThread” bit set (on either Logical Processor)

So each Logical Processor makes two measurements, but they are asymmetric.

From these results, the algebra required to split the counts into the desired categories is not entirely obvious.  I eventually worked up the following sequence:

  1. Neither Logical Processor Active == Elapsed TSC – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
  2. Logical Processor 0 Active, Logical Processor 1 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”
  3. Logical Processor 1 Active, Logical Processor 0 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)”
  4. Both Logical Processors Active == CPU_CLK_UNHALTED.REF_XCLK*scaling_factor – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”

Starting with the Skylake core, there is an additional sub-event of the programmable CPU_CLK_UNHALTED event that increments only when the current Logical Processor is active and the sibling Logical Processor is inactive.  This can certainly be used to obtain the same results, but it does not appear to save any effort.   My approach uses only one programmable counter on one of the two Logical Processors — a number that cannot be reduced by using an alternate programmable counter.   Comparison of the two approaches shows that the results are the same, so in the interest of backward compatibility, I continue to use my original approach.

Posted in Performance, Performance Counters, Reference | Comments Off on Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Why I hate MPI (from a performance analysis perspective)

Posted by John D. McCalpin, Ph.D. on 1st August 2018

According to Dr. Bandwidth, performance analysis has two recurring themes:

  1. How fast should this code (or “simple” variations on this code) run on this hardware?
  2. If I am analyzing (apparent) performance shortfalls, how can I distinguish between cause and effect?

For very simple codes, it may be possible to do a high-level analysis on performance limitations, but once the code becomes complex, it is often necessary to investigate the full stack.  This can start with either a “top-down” or “bottom-up” approach, but in complex codes running on complex hardware, what is really required is both approaches — iterated until the interactions between all the components are understood.  This is an intellectually challenging and labor-intensive exercise, requiring detailed review of the published details of each of the components of the system, and usually requiring significant “detective work” (using customized microbenchmarks, hardware performance counter analysis, and creative thinking) to fill in the gaps.

MPI is a commonly used standard to implement message-passing programs on computer clusters.   The basic interfaces are simple to learn, and I suspect that many people have no idea how many levels of interacting components are involved in executing an MPI program on a cluster.

A discussion with a colleague prompted me to try to write down a list of the interacting components, along with some comments on why it is so hard to understand the interactions between components.

Interacting components in the execution of an MPI job — a brief outline (from memory):

  1. The user source code, which contains an ordered set of calls to MPI routines.
    • There are many different types of MPI calls that can be used to perform the required communication.
    • For implementations containing multiple MPI calls, there are (typically) many orders in which the routines can be called that are all functionally equivalent.
    • The MPI standard requires that many sequences of calls (appear to) be ordered, but there are exceptions.
  2. The user execution environment.
    • This will typically include environment variables that can influence the behavior of the MPI runtime, and might include environment variables that can influence the behavior of the lower-level shared-memory transport and/or network hardware interfaces.
    • The user environment defines the mapping of MPI ranks to hardware resources (cores, sockets, nodes).
  3. The MPI runtime library.
    • An MPI runtime library will typically include multiple implementations of each MPI function, and will choose implementation(s) to execute based on environment variables and run-time information (sizes, communicators, etc.), in ways that are seldom transparent.
      • The source code to the library may not be available.
      • Low-level MPI tracing may be required to determine what the runtime library decided to execute, and even that may not be enough to fully capture the “decisions” made at lower levels of the implementation.
    • An MPI library may aggregate or re-order messages in any manner that provides the appearance of conforming to the ordering rules of the MPI specification.
      • In particular, the MPI library may violate ordering rules of the MPI standard if it can prove (typically via runtime checks) that the result is the same as what would be obtained by code that explicitly followed the ordering rules. This sort of speculation means that multiple runs of the same code on the same nodes may end up performing different sequences of operations.
    • MPI runtime libraries are typically built to interact with one or more lower-level interfaces to shared-memory transport implementation(s) and networking hardware.
      • The implementations of such interfaces may be based on anywhere between a superficial understanding of the lower-level interface, and a deeply detailed understanding of the lower-level interface.
      • Different levels of understanding may lead to different choices for exactly how the MPI library chooses to implement a function. For example, the MPI library may (or may not) know specific details of the ordering rules followed by the low-level interface, and so may (or may not) be able to choose different sequences of low-level operations.
  4. The implementation(s) of the low-level interface(s) to the shared-memory transport implementation and to the networking hardware interface.
    • These vary in complexity and sophistication.
    • The explicit functionality exposed in these low-level APIs may or may not be a good map to the functionality of the MPI standard(s) and/or to the functionality of the underlying hardware.
    • Mismatches in semantics or ordering rules will require an implementation to choose overly conservative sequences of low-level operations, which typically reduces performance.
  5. The processor hardware available to support shared-memory transport.
    • Current processor architectures do not include cross-core communication or synchronization as first-class architectural concepts, so implementation of communication and synchronization must be done using ordered sequences of loads and stores.
    • Detailed analysis of the cache coherence transactions involved in shared-memory synchronization (especially across sockets) show that the lower bound on (uncontested) latency is at least an order of magnitude higher than what a hardware implementation should be able to support. For highly contested accesses, shared-memory synchronization latency is typically several orders of magnitude higher than what a hardware implementation should be able to support.  Few implementations have performance that approaches the lower bounds on shared-memory synchronization latency.
    • For almost all recent systems, a single thread of execution can only generate a fraction of the bandwidth available within or between sockets, limiting performance of the “one MPI task per socket” hybrid programming model.
      • Full bandwidth to/from local DDR4 memory typically requires 6 or more cores per socket.
      • Full bandwidth to/from high-bandwidth memory or L3 cache typically requires half or more of the cores in the socket.
      • Full bandwidth between sockets varies by processor generation and access pattern, but is also typically 6 or more cores.
  6. The networking hardware.
    • Low-level implementation details and performance characterization data is not typically available for network interfaces and switches.
    • Most networking hardware was not designed to support highly efficient implementation of the MPI standards.
    • The MPI standards were not designed to support highly efficient implementations on most networking hardware.
  7. Runtime contention
    • Although most supercomputing systems use “space-sharing” to provide exclusive access to nodes, most supercomputing systems use an interconnect that does not separate traffic between different user’s jobs.
    • Many current interconnects (InfiniBand, OmniPath) use static routing, making self-conflicts unavoidable (in practical terms) for any MPI job including nodes on more than one leaf switch.
      • It is theoretically possible to avoid self-conflicts on a dedicated (non-oversubscribed) system, but this requires a herculean effort to load and parse the routing tables and to schedule all communications to avoid mapping more than one pairwise communication to any link in the multi-level switch hierarchy.
      • Because routing tables are not required to be symmetric, it might not be possible to create a non-self-conflicting schedule of MPI rank pairs if bidirectional communication is used.
      • Scheduling communications to avoid self-conflicts requires global barriers between steps of pairwise communications (e.g., between steps in an all-to-all communication). This is not “natural” for many/most MPI jobs, and the barrier overhead may result in overall performance degradation if the messages in the original communication are short.

Understanding any one of these components is not usually too hard (unless the implementation is undocumented).

Understanding the interaction of any two interacting components is a bit more work, but is typically manageable by a single person.

Understanding enough of the details of all of the components to be able to reason about the interactions typically requires at least 2-3 people with complementary expertise. Keeping up with changes to the hardware and software, and applying the analyses to the ever-expanding ensemble of user applications would make this a large fraction of a full-time assignment for the team members.

To make it worse, a supercomputing center like TACC has many different systems with a variety of interconnect hardware (several generations of Mellanox InfiniBand, Intel OmniPath, Cray Aries) and several different MPI stacks (Intel MPI, Cray’s version of MPICH, MVAPICH, OpenMPI). Some of these are sufficiently different that it may be necessary to have independent teams specializing in “full-stack” performance analyses for different systems.

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Why I hate MPI (from a performance analysis perspective)

Comments on timing short code sections on Intel processors

Posted by John D. McCalpin, Ph.D. on 23rd July 2018

(From a recent post of mine on the Intel software developer forums — some potentially useful words to go along with my new low-overhead-timers project…)

There are lots of topics that you need to be aware of when attempting fine-grain timing.  A few of the more important ones are:

  • The RDTSC instruction increments at the rate of the “base” (or “nominal”) processor frequency, while instructions are executed at the “core frequency”.  The “core frequency” may be higher or lower than the “base” frequency, and it may change during your measurement interval.
    • If you have the ability to “pin” the processor frequency to match the “base” frequency, interpreting the results is often easier.
    • Whether you can fix the frequency or not, you will still need to measure several different things to be sure that you can unambiguously interpret the results.  More on this below.
  • With Turbo mode enabled, Intel processors will change their frequency based on how many cores are active.  When running a single user thread, you will often get the advertised single-core Turbo frequency, but if the operating system enables more cores to handle (even very short-lived) background processes, your frequency may drop unexpectedly.
  • Recent Intel processors often throttle down to a low frequency when not in use, and (depending on processor generation, BIOS settings, and OS settings) it may take longer than expected for the frequency to ramp back up to the expected values.
    • I usually precede the code that I want to test with a “warm-up” loop consisting of at least a few seconds of execution of instructions using the same SIMD width as the code that I want to test.
  • Always pin the thread you want to test to a single logical processor (if possible).
    • This allows you to use the RDPMC instruction to read the logical processor’s fixed-function performance counters.
    • It also reduces the chance of frequency changes or other stalls that may be incurred when moving a thread context to a different core.

For measurements of short duration (<< 1 second)

  • Intel processors will be halted during frequency changes, and recent Intel processors (Haswell and newer) will also be halted when activating and/or deactivating the portions of the pipeline(s) needed for 256-bit SIMD instructions and for 512-bit SIMD instructions.
    • The duration of these halts varies by product and in some cases by the amount of the frequency change.  I have seen values as low as 6 microseconds and as high as 50 microseconds for these types of transitions.

For measurements of very short duration (< 100’s of cycles)

  • The RDTSC instruction is not ordered with respect to the execution of other instructions.  Intel processors have gained increasing ability to execute instructions out of order over the past decade, allowing the execution of these instructions to be moved further away from where one might expect — in either direction.
  • The RDTSCP instruction is partially ordered — it will not execute until all prior instructions in program order have executed.
    • RDTSCP can still be executed later than expected, but not earlier.
    • This partial ordering can help expose the execution time of long-latency instructions (such as memory accesses or mispredicted branches) that occur shortly before the final value of the TSC is read using RDTSCP.
  • The Intel branch predictors are stranger than you might expect, and branch misprediction overheads are not trivial.
    • If you repeatedly execute an inner loop with a trip count of less than about 30, the branch predictor will “remember” which iteration is the final iteration of the loop, and it will correctly predict the loop exit.
    • If you increase the inner loop trip count to 35 or more, the branch predictor will not “remember” which iteration is the final iteration, so the final loop iteration will include a mispredicted branch, with an associated overhead of 15-20 cycles.
    • This can be very hard to understand if you are looking at results for loop trip counts from (for example) 16 to 64 and you see an unexpected bump of 15-20 cycles once the trip count exceeds a limit (typically in the 32-34 range).
    • This is even more confusing when you consider vectorization and loop unrolling, which the compiler may change significantly from one compilation to the next as you fiddle with your code.

Some recommendations:

  • A set of interfaces to the RDTSC and RDPMC instructions that have very low overheads are available at low-overhead-timers
  • I recommend measuring a minimum of four values:
    • Elapsed TSC cycles (using RDTSC or RDTSCP)
    • Instructions — using the RDPMC instruction with counter number (1<<30)+0
    • Core Cycles not halted — using the RDPMC instruction with counter number (1<<30)+1
    • Reference Cycles not halted — using the RDPMC instruction with counter number (1<<30)+2
  • If you have the ability to program the general-purpose core performance counters, I also recommend measuring at least two more values:
    • Instructions executed in kernel mode.
    • Core cycles not halted in kernel mode.
  • Compute these metrics:
    • Core Utilization = (Elapsed Reference Cycles not Halted) / (Elapsed TSC cycles)
      • If this is not very close to 1.000, the processor has been halted for frequency and/or pipeline activation issues, and you need to try to figure out why.
    • Average frequency while not halted = (Elapsed Core Cycles not Halted) / (Elapsed Reference Cycles not Halted) * Base_GHz
      • This should be compared to the expected frequency for your processor, given the number of cores that you think should be active.
    • Average net frequency = (Elapsed Core Cycles not Halted) / (Elapsed TSC cycles) * Base_GHz
      • This will tell you how much of your expected frequency has been lost due to processor halts.
    • Instructions Retired / Instructions Expected
      • For simple loops, you can look at the assembly code and count instructions.
      • This value will change significantly (and repeatably) if the compiler changes the vectorization of the loop.
      • This will change randomly (upward) if the OS schedules another process on the same logical processor during your measured section.
      • For measurements of 10,000 instructions or less, this will increase by a noticeable amount if an OS timer interrupt occurs during your measured section.
    • Kernel instructions / Total instructions
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
    • Core Cycles not Halted in Kernel Mode / Core Cycles not Halted
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Comments on timing short code sections on Intel processors

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Posted by John D. McCalpin, Ph.D. on 22nd January 2018

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)


In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report ( to the Xeon Phi x200 (Knights Landing) processors here at TACC.   There was no deep goal — just a desire to see the maximum GFLOPS in action.     The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished.   Instead, we found puzzle after puzzle.  After almost four weeks, we have a solid characterization of the behavior — no tested code exceeds an execution rate of 12 vector pipe instructions every 7 cycles (6/7 of the nominal peak) when executed on a single core — but we are unable to propose a testable quantitative model for the source of the throughput limitation.

Dr. Damon McDougall gave a short presentation on this study at the IXPUG 2018 Fall Conference (pdf) — I originally wrote these notes to help organize my thoughts as we were preparing the IXPUG presentation, and later decided that the extra details contained here are interesting enough for me to post it.



The Xeon Phi x200 (Knights Landing) processor is Intel’s second-generation many-core product.  The Xeon Phi 7250 processors at TACC have 68 cores per processor, and each core has two 512-bit SIMD vector pipelines.   For 64-bit floating-point data, the 512-bit Fused Multiply-Add (FMA) instructions performs 16 floating-point operations (8 adds and 8 multiplies).  Each of the two vector units can issue one FMA instruction per cycle, assuming that there are enough independent accumulators to tolerate the 6-cycle dependent-operation latency.  The minimum number of independent accumulators required is: 2 VPUs times 6 cycles = 12 independent accumulators.

The Xeon Phi x200 has six execution units (two VPUs, two ALUs, and two Memory units), but is limited to two instructions per cycle by the decode, allocation, and retirement sections of the processor pipeline. (Most of the details of the Xeon Phi x200 series presented here are from the Intel-authored paper

In our initial evaluation of the Xeon Phi x200, we did not fully appreciate the two-instruction-per-cycle limitation.  Since “peak performance” for the processor is two (512-bit SIMD) FMA instructions per cycle, any instructions that are not FMA instructions subtract directly from the available peak performance.  On “mainstream” Xeon processors, there is plenty of instruction decode/allocation/retirement bandwidth to overlap extra instructions with the SIMD FMA instructions, so we don’t usually even think about them.  Pointer arithmetic, loop index increments, loop trip count comparisons, and conditional branches are all essentially “free” on mainstream Xeon processors, but have to be considered very carefully on the Xeon Phi x200.

A “best case” scenario: DGEMM

The double-precision matrix multiplication routine “DGEMM” is typically the computational kernel that achieves the highest fraction of peak performance on high performance computing systems.  Hardware performance counter results for a simple benchmark code calling Intel’s optimized DGEMM implementation for this processor (from the Intel MKL library) show that about 20% of the dynamic instruction count consists of instructions that are not packed SIMD operations (i.e., not FMAs).  These “non-FMA” instructions include the pointer manipulation and loop control required by any loop-based code, plus explicit loads from memory and a smaller number of stores back to memory. (These are in addition to the loads that can be “piggy-backed” onto FMA instructions using the single memory input operand available for most computational operations in the Intel instruction set).  DGEMM implementations also typically require software prefetches to be interspersed with the computation to minimize memory stalls when moving from one “block” of the computation to the next.

Published DGEMM benchmark results for the Xeon Phi 7250 processor ( show maximum values of about 2100 GFLOPS when using all 68 cores (a very approximate estimate from a bar chart). Tests on one TACC development node gave slightly higher results — 2148 GFLOPS to 2254 GFLOPS (average = 2235 GFLOPS), for a set of 180 trials of a DGEMM test with M=N=K=8000 and using all 68 cores.   These runs reported a stable average frequency of 1.495 GHz, so the average of 2235 GFLOPS therefore corresponds to 68.7% of the peak performance of (68 cores * 32 FP ops/cycle/core * 1.495 GHz =) 3253 GFLOPS (note1). This is an uninspiring fraction of peak performance that would normally suggest significant inefficiencies in either the hardware or software.   In this case, however, the average of 2235 GFLOPS is more appropriately interpreted as 85.9% of the “adjusted peak performance” of 2602 GFLOPS (80% of the raw peak value — as limited by the instruction mix of the DGEMM kernel).    At 85.9% of the “adjusted peak performance”, there is no longer a significant upside to performance tuning.

Notes on DGEMM:

  1. For recent processors with power-limited frequencies, compute-intensive kernels will experience an average frequency that is a function of the characteristics of the specific processor die and of the effectiveness of the cooling system at the time of the run.  Other nodes will show lower average frequencies due to power/cooling limitations, so the numerical values must be adjusted accordingly — the percentage of peak obtained should be unchanged.
  2. It is possible to get higher values by disabling the L2 Hardware Prefetchers — up to about 2329 GFLOPS (89% of “adjusted peak”) — but that is a longer story for another day….
  3. The DGEMM efficiency values are not significantly limited by the use of all cores.  Single-core testing with the same DGEMM routine showed maximum values of just under 72% of the nominal peak (about 90% of “adjusted peak”).

Please Note: The throughput limitation we observed (12 vector instructions per 7 cycles = 85.7% of nominal peak) is significantly higher than the instruction-issue-limited vector throughput of the best DGEMM measurement we have ever observed (~73% of peak, or approximately 10 vector instructions every 7 cycles).   We are unaware of any real computational kernels whose optimal implementation will contain significantly fewer than 15% non-vector-pipe instructions, so the throughput limitation we observe is unlikely to be a significant performance limiter on any real scientific codes.  This note is therefore not intended as a criticism of the Xeon Phi x200 implementation — it is intended to document our exploration of the characteristics of this performance limitation.

Initial Experiments:

In order to approach the peak performance of the processor, we started with a slightly modified version of the code from the Colfax report above.  This code is entirely synthetic — it performs repeated FMA operations on a set of registers with no memory references in the inner loop.  The only non-FMA instructions are those required for loop control, and the number of FMA operations in the loop can be easily adjusted to change the fraction of “overhead” instructions.  The throughput limitation can be observed on a single core, so the following tests and analysis will be limited to this case.

Using the minimum number of accumulator registers needed to tolerate the pipeline latency (12), the assembly code for the inner loop is:

 addl $1, %eax
 vfmadd213pd %zmm16, %zmm17, %zmm29 
 vfmadd213pd %zmm16, %zmm17, %zmm28
 vfmadd213pd %zmm16, %zmm17, %zmm27 
 vfmadd213pd %zmm16, %zmm17, %zmm26 
 vfmadd213pd %zmm16, %zmm17, %zmm25 
 vfmadd213pd %zmm16, %zmm17, %zmm24 
 vfmadd213pd %zmm16, %zmm17, %zmm23 
 vfmadd213pd %zmm16, %zmm17, %zmm22 
 vfmadd213pd %zmm16, %zmm17, %zmm21 
 vfmadd213pd %zmm16, %zmm17, %zmm20 
 vfmadd213pd %zmm16, %zmm17, %zmm19
 vfmadd213pd %zmm16, %zmm17, %zmm18 
 cmpl $1000000000, %eax 
 jb ..B1.8

This loop contains 12 independent 512-bit FMA instructions and is executed 1 billion times.   Timers and hardware performance counters are measured immediately outside the loop, where their overhead is negligible.   Vector registers zmm18-zmm29 are the accumulators, while vector registers zmm16 and zmm17 are loop-invariant.

The loop has 15 instructions, so must require a minimum of 7.5 cycles to issue.  The three loop control instructions take 2 cycles (instead of 1.5) when measured in isolation.  When combined with other instructions, the loop control instructions require 1.5 cycles when combined with an odd number of additional instructions or 2.0 cycles in combination with an even number of additional instructions — i.e., in the absence of other stalls, the conditional branch causes the loop cycle count to round up to an integer value.  Equivalent sequences of two instructions that avoid the explicit compare instruction (e.g., pre-loading %eax with 1 billion and subtracting 1 each iteration) have either 1.0-cycle or 1.5-cycle overhead depending on the number of additional instructions (again rounding up to the nearest even cycle count).   The 12 FMA instructions are expected to require 6 cycles to issue, for a total of 8 cycles per loop iteration, or 8 billion cycles in total.   Experiments showed a highly repeatable 8.05 billion cycle execution time, with the 0.6% extra cycles almost exactly accounted for by the overhead of OS scheduler interrupts (1000 per second on this CentOS 7.4 kernel).   Note that 12 FMAs in 8 cycles is only 75% of peak, but the discrepancy here can be entirely attributed to loop overhead.

Further unrolling of the loop decreases the number of “overhead” instructions, and we expected to see an asymptotic approach to peak as the loop length was increased.  We were disappointed.

The first set of experiments compared the cycle and instruction counts for the loop above with the results from unrolling loop two and four times.    The table below shows the expected and measured instruction counts and cycle counts.

KNL 12-accumulator FMA throughput

Unrolling FactorFMA instructions per unrolled loop iterationNon-FMA instructions per unrolled loop iterationTotal Instructions per unrolled loop iterationExpected instructions (B)Measured Instructions (B)Expected Cycles (B) Measured Cycles (B)Unexpected Cycles (B)Expected %Peak GFLOPSMeasured %Peak GFLOPS% Performance shortfall
Comparison of expected and observed cycle counts for loops with 12 independent accumulators updated by 512-bit VFMADD213PD instructions on an Intel Xeon Phi 7250 processor. The loop is repeated until 12 billion FMA instructions have been executed.


Notes on methodology:

  • The unrolling was controlled by a “#pragma unroll_and_jam()” directive in the source code.   In each case the assembly code was carefully examined to ensure that the loop structure matched expectations — 12,24,48 FMAs with the appropriate ordering (for dependencies) and the same 3 loop control instructions (but with the iteration count reduced proportionately for the unrolled cases).
  • The node was allocated privately, non-essential daemons were disabled, and the test thread was bound to a single logical processor.
  • Instruction counts were obtained inline using the RDPMC instruction to read Fixed-Function Counter 0 (INST_RETIRED.ANY), while cycle counts were obtained using the RDPMC instruction to read Fixed-Function Counter 1 (CPU_CLK_UNHALTED.THREAD).
  • Execution time was greater than 4 seconds in all cases, so the overhead of reading the counters was at least 7 orders of magnitude smaller than the execution time.
  • Each test was run at least three times, and the trial with the lowest cycle count was used for the analysis in the table.

Comments on results:

  • The 12-FMA loop required 0.7% more cycles than expected.
    • Later experiments show that this overhead is essentially identical to the the fraction of cycles spent servicing the 1-millisecond OS scheduler interrupt.
  • The 24-FMA loop required 1.2% more cycles than expected.
    • About half of these extra cycles can be explained by the OS overhead, leaving an unexplained overhead in the 0.5%-0.6% range (not large enough to worry about).
  • The 48-FMA loop required 8.3% more cycles than expected.
    • Cycle count variations across trials varied by no more than 1 part in 4000, making this overhead at least 300 times the run-to-run variability.
  • The two unrolled cases gave performance results that appear to be bounded above by 6/7 (85.71%) of peak.


Initial (Incorrect) Hypothesis

My immediate response to the results was that this was a simple matter of running out of rename registers.   Modern processors (almost) all have more physical registers than they have register names.  The hardware automatically renames registers to avoid false dependencies, but with deep execution pipelines (particularly for floating-point operations), it is possible to run out of rename registers before reaching full performance.

This is easily illustrated using Little’s Law from queuing theory, which can be expressed as:

Throughput = Concurrency / Occupancy

For this scenario, “Throughput” has units of register allocations per cycle, “Concurrency” is the number of registers in use in support of all of the active instructions, and “Occupancy” is the average number of cycles that a register is busy during the execution of an instruction.

An illustrative example:   

The IBM POWER4 has 72 floating-point rename registers and two floating-point arithmetic units capable of executing fused multiply-add (FMA) instructions (a = b+c*d).   Each FMA instruction requires four registers, and these registers are all held for some number of cycles (discussed below), so full performance (both FMA units starting new operations every cycle) would require eight registers to be allocated each cycle (and for these registers to remain occupied for the duration of the corresponding instruction).   We can estimate the duration by reviewing the execution pipeline diagram (Figure 2-3) in The POWER4 Processor Introduction and Tuning Guide.  The exact details of when registers are allocated and freed is not published, but if we assume that registers are allocated in the “MP” stage (“Mapping”) and held until the “CP” (“Completion”, aka “retirement”) stage, then the registers will be held for a total of 12 cycles.  The corresponding pipeline stages from Figure 2-3 are: MP, ISS, RF, F1, F2, F3, F4, F5, F6, WB, Xfer, CP.  

Restating this in terms of Little’s Law, the peak performance of 2 FMAs per cycle corresponds to a “Throughput” of 8 registers allocated per cycle.  With an “Occupancy” of 12 cycles for each of those registers, the required “Concurrency” is 8*12 = 96 registers.  But, as noted above, the POWER4 only has 72 floating-point rename registers.  If we assume a maximum “Concurrency” of 72 registers, the “Throughput” can be computed as 72/12 = 6 registers per cycle, or 75% of the target throughput of 8 registers allocated per cycle.    It is perhaps not a coincidence that the maximum performance I ever saw on DGEMM on a POWER4 system (while working for IBM in the POWER4 design team) was just under 70% of 2 FMAs/cycle, or just over 92% of the occupancy-limited throughput of 1.5 FMAs/cycle.  

For comparison, the IBM POWER5 processor (similar to POWER4, but with 120 floating-point rename registers) delivered up to 94% of 2 FMAs/cycle on DGEMM, suggesting that a DGEMM efficiency in the 90%-95% of peak range is appropriate for DGEMM on this architecture family.

Applying this model to Xeon Phi x200 is slightly more difficult for a number of reasons, but back-of-the-envelope estimates suggested that it was plausible.

The usual way of demonstrating that rename register occupancy is limiting performance is to change the instructions to reduce the number of registers used per instruction, or the number of cycles that the instructions hold the register, or both.  If this reduces the required concurrency to less than the number of available rename registers, full performance should be obtained.

Several quick tests with instructions using fewer registers (e.g., simple addition instead of FMA) or with fewer registers and shorter pipeline latency (e.g, bitwise XOR) showed no change in throughput — the processor still delivered a maximum throughput of 12 vector instructions every 7 cycles.

Our curiosity was piqued by these results, and more experiments followed.   These piqued us even more, eventually leading to a suite of several hundred experiments in which we varied everything that we could figure out how to vary.

We will spare the reader the chronological details, and instead provide a brief overview of the scope of the testing that followed.

Extended Experiments:

Additional experiments (each performed with multiple degrees of unrolling) that showed no change in the limitation of 12 vector instructions per 7 cycles included:

  1. Increasing the dependency latency from 6 cycles to 8 cycles (i.e., using 16 independent vector accumulators) and extending the unrolling to up to 128 FMAs per inner loop iteration.
  2. Increasing the dependency latency to 10 cycles (20 independent vector accumulators), with unrolling to test 20, 40, 60, 80 FMAs per inner loop iteration.
  3. Increasing the dependency latency to 12 cycles (24 independent vector accumulators).
  4. Replacing the 512-bit VFMADD213PD instructions with the scalar version VFMADD213SD.  (This is the AVX-512 EVEX-encoded instruction, not the VEX-encoded version.)
  5. Replacing the 512-bit VFMADD213PD instructions with the AVX2 (VEX-encoded) 256-bit versions.
  6. Increasing the number of loop-invariant registers used from 2 to 4 to 8 (and ensuring that consecutive instructions used different pairs of loop-invariant registers).
  7. Decreasing the number of loop-invariant registers per FMA from 2 to 1, drawing the other input from the output of an FMA instruction at least 12 instructions (6 cycles) away.
  8. Replacing the VFMADD213PD instructions with shorter-latency instructions (VPADDQ and VPXORQ were tested independently).
  9. Replacing the VFMADD213PD instructions with an instruction that has both shorter latency and fewer operands: VPABSQ (which has only 1 input and 1 output register).
  10. Replacing every other VFMADD213PD instruction with a shorter-latency instruction (VPXORQ).
  11. Replacing the three-instruction loop control (add, compare, branch) with two-instruction loop control (subtract, branch).  The three-instruction version counts up from zero, then compares to the iteration count to set the condition code for the terminal conditional branch.  The two-instruction version counts down from the target iteration count to zero, allowing us to use the condition code from the subtract (i.e., not zero) as the branch condition, so no compare instruction is required.  The instruction counts changed as expected, but the cycle counts did not.
  12. Forcing 16-Byte alignment for the branch target at the beginning of the inner loop. (The compiler did this automatically in some cases but not in others — we saw no difference in cycle counts when we forced it to occur).
  13. Many (not all) of the executable files were disassembled with “objump -d” to ensure that the encoding of the instructions did not exceed the limit of 3 prefixes or 8 Bytes per instruction.  We saw no cases where either of these rules were violated in the inner loops.

Additional experiments showed that the throughput limitation only applies to instructions that execute in the vector pipes:

  1. Replacing the Vector instructions with integer ALU instructions (ADDL) –> performance approached two instructions per cycle asymptotically, as expected.
  2. Replacing the Vector instructions with Load instructions from L1-resident data (to vector registers) –> performance approached two instructions per cycle asymptotically, as expected.

Some vector instructions can execute in only one of the two vector pipelines.  This is mentioned in the IEEE Micro paper linked above, but is discussed in more detail in Chapter 17 of the document “Intel 64 and IA-32 Architectures Optimization Reference Manual”, (Intel document 248966, revision 037, July 2017).  In addition, Agner Fog’s “Instruction Tables” document ( shows which of the two vector units is used for a large subset of the instructions that can only execute in one of the VPUs.   This allows another set of experiments that show:

  •   Each vector pipeline can sustain its full rate of 1 instruction per cycle when used in isolation.
    • VPU0 was tested with VPERMD, VPBROADCASTQ, and VPLZCNT.
    • VPU1 was tested with KORTESTW.
  • Alternating a VPU0 instruction (VPLZCNTQ) with a VPU1 instruction (KORTESTW) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
  • Alternating a VPU0 instruction with an FMA (that can be executed in either VPU) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
    • This was tested with VPERMD and VPLZCNT as the VPU0 instructions.
  • One specific combination of VPU0 and FMA instructions gave a reduced throughput of 1 vector instruction per cycle: VPBROADCASTQ alternating with FMA.
    • VPBROADCASTQ requires a read from a GPR (in the integer side of the core), then broadcasts the result across all the lanes of a vector register.
    • This operation is documented (in the Intel Optimization Reference Manual) as having a latency of 2 cycles and a maximum throughput of 1 per cycle (as we saw with VPBROADCASTQ running in isolation).
    • The GPR to VPU move is a sufficiently uncommon access pattern that it is not particularly surprising to find a case for which it inhibits parallelism across the VPUs, though it is unclear why this is the only case we found that allows the use of both vector pipelines but is still limited to 1 instruction per cycle.

Additional Performance Counter Measurements and Second-Order Effects:

After the first few dozens of experiments, the test codes were augmented with more hardware performance counters.  The full set of counters measured before and after the loop includes:

  • Time Stamp Counter (TSC)
  • Fixed-Function Counter 0 (Instructions Retired)
  • Fixed-Function Counter 1 (Core Cycles Not Halted)
  • Fixed-Function Counter 2 (Reference Cycles Not Halted)
  • Programmable PMC0
  • Programmable PMC1

The TSC was collected with the RDTSC instruction, while the other five counters were collected using the RDPMC instruction.  The total overhead for measuring these six counters is about 250 cycles, compared to a minimum of 4 billion cycles for the monitored loop.

Several programmable performance counter events were collected as “sanity checks”, with negligible counts (as expected):


Another programmable performance counter event was collected to verify that the correct number of VPU instructions were being executed:

    • Typical result:
      • Nominal expected 16,000,000,000
      • Measured in user-space: 16,000,000,016
        • This event does not count the 16 loads before the inner loop, but does count the 16 stores after the end of the inner loop.
      • Measured in kernel-space: varied from 19,626 to 21,893.
        • Not sure why the kernel is doing packed SIMD instructions, but these are spread across more than 6 seconds of execution time (>6000 scheduler interrupts).
        • These kernel instruction counts are 6 orders of magnitude smaller than the counts for tested code, so they will be ignored here.

The performance counter events with the most interesting results were:

  • NO_ALLOC_CYCLES.RAT_STALL — counts the number of core cycles in which no micro-ops were allocated and the “RAT Stall” (reservation station full) signal is asserted.
  • NO_ALLOC_CYCLES.ROB_FULL — counts the number of core cycles in which no micro-ops were allocated and the Reorder Buffer (ROB) was full.
  • RS_FULL_STALL.ALL — counts the number of core cycles in which the allocation pipeline is stalled and any of the Reservation Stations is full
    • This should be the same as NO_ALLOC_CYCLES.RAT_STALL, and in all but one case the numbers were nearly identical.
    • The RS_FULL_STALL.ALL event includes a Umask of 0x1F — five bits set.
      • This is consistent with the IEEE Micro paper (linked above) that shows 2 VPU reservation stations, 2 integer reservation stations, and one memory unit reservation station.
      • The only other Umask defined in the Intel documentation is RS_FULL_STALL.MEC (“Memory Execution Cluster”) with a value of 0x01.
      • Directed testing with VPU0 and VPU1 instructions shows that a Umask of 0x08 corresponds to the reservation station for VPU0 and a Umask of 0x10 corresponds to the reservation station for VPU1.

For the all-FMA test cases that were expected to sustain more than 12 VPU instructions per 7 cycles, the NO_ALLOC_CYCLES.RAT_STALL and RS_FULL_STALL.ALL events were a near-perfect match for the number of extra cycles taken by the loop execution.  The values were slightly larger than computation of “extra cycles”, but were always consistent with the assumption of 1.5 cycles “overhead” for the three loop control instructions (matching the instruction issue limit), rather than the 2.0 cycles that I assumed as a baseline.  This is consistent with a NO_ALLOC_CYCLES.RAT_STALL count that overlaps with cycles that are simultaneously experiencing a branch-related pipeline bubble. One or the other should be counted as a stall, but not both.   For these cases, the NO_ALLOC_CYCLES.ROB_FULL counts were negligible.

Interestingly, the individual counts for RS_FULL_STALL for the two vector pipelines were sometimes of similar magnitude and sometimes uneven, but were extremely stable for repeated runs of a particular binary.  The relative counts for the stalls in the two vector pipelines can be modified by changing the code alignment and/or instruction ordering.  In limited tests, it was possible to make either VPU report more stalls than the other, but in every case, the “effective” stall count (VPU0 stalled OR VPU1 stalled) was the amount needed to reduce the throughput to 12 VPU instructions every 7 cycles.

When interleaving vector instructions of different latencies, the total number of stall cycles remained the same (i.e., enough to limit performance to 12 VPU instructions per 7 cycles), but these were split between RAT_STALLs and ROB_STALLs in different ways for different loop lengths.   Typical results showed a pattern like:

  • 16 VPU instructions per loop iteration: approximately zero stalls, as expected
  • 32 VPU instructions per loop iteration: approximately 6.7% RAT_STALLs and negligible ROB_STALLs
  • 64 VPU instructions per loop iteration: ~1% RAT_STALLs (vs ~10% in the all-FMA case) and about 9.9% ROB_STALLs (vs ~0% in the all-FMA case).
    • Execution time increased by about 0.6% relative to the all-FMA case.
  • 128 VPU instructions per loop iteration: negligible RAT_STALLS (vs ~12% in the all-FMA case) and almost 20% ROB_STALLS (vs 0% in the all-FMA case).
    • Execution time increased by 9%, to a value that is ~2.3% slower than the 16-VPU-instruction case.

The conversion of RAT_STALLs to ROB_STALLs when interleaving instructions of different latencies does not seem surprising.  RAT_STALLs occur when instructions are backed up before execution, while ROB_STALLs occur when instructions back up before retirement.  Alternating instructions of different latencies seems guaranteed to push the shorter-latency instructions from the RAT to the ROB until the ROB backs up.  The net slowdown at 128 VPU instructions per loop iteration is not a performance concern, since asymptotic performance is available with anywhere between 24 and (almost) 64 VPU instructions in the inner loop.   These results are included because they might provide some insight into the nature of the mechanisms that limits throughput of vector instructions.


RAT_STALLs count the number of cycles in which the Allocate/Rename unit does not dispatch any micro-ops because a target Reservation Station is full.   While this does not directly equate to execution stalls (i.e., no instructions dispatched from the Vector Reservation Station to the corresponding Vector Execution Pipe), the only way the Reservation Station can become full (given an instruction stream with enough independent instructions) is the occurrence of cycles in which instructions are received (from the Allocate/Rename unit), but in which no instruction can be dispatched.    If this occurs repeatedly, the 20-entry Reservation Station will become full, and the RAT_STALL signal will be asserted to prevent the Allocate/Rename unit from sending more micro-ops.

An example code that generates RAT Stalls is a modification of the test code using too few independent accumulators to fully tolerate the pipeline latency.  For example, using 10 accumulators, the code can only tolerate 5 cycles of the 6 cycle latency of the FMA operations.  This inhibits the execution of the FMAs, which fill up the Reservation Station and back up to stall the Allocate/Rename.   Tests with 10..80 FMAs per inner loop iteration show RAT_STALL counts that match the dependency stall cycles that are not overlapped with loop control stall cycles.

We know from the single-VPU tests that the 20-entry Reservation Station for each Vector pipeline is big enough for that pipeline’s operation — no stall cycles were observed.  Therefore the stalls that prevent execution dispatch must be in the shared resources further down the pipeline.   From the IEEE Micro paper, the first execution step is to read the input values from the “rename buffer and the register file”, after which the operations proceed down their assigned vector pipeline.  The vector pipelines should be fully independent until the final execution step in which they write their output values to the rename buffer.  After this, the micro-ops will wait in the Reorder Buffer until they can be retired in program order.  If the bottleneck was in the retirement step, then I would expect the stalls to be attributed to the ROB, not the RAT.   Since the stalls in the most common cases are overwhelming RAT stalls, I conclude that the congestion is not *directly* related to instruction retirement.

As mentioned above, the predominance of RAT stalls suggests that limitations at Retirement cannot be directly responsible for the throughput limitation, but there may be an indirect mechanism at work.   The IEEE Micro paper’s section on the Allocation Unit says:

“The rename buffer stores the results of the in-flight micro-ops until they retire, at which point the results are transferred to the architectural register file.”

This comment is consistent with Figure 3 of the paper and with the comment that vector instructions read their input arguments from the “rename buffer and the register file”, implying that the rename buffer and register file are separate register arrays.  In many processor implementations there is a single “physical register” array, with the architectural registers being the subset of the physical registers that are pointed to by a mapping vector.  The mapping vector is updated every time instructions retire, but the contents of the registers do not need to be copied from one place to another.  The description of the Knights Landing implementation suggests that at retirement, results are read from the “rename buffer” and written to the “register file”.  This increases the number of ports required, since this must happen every cycle in parallel with the first step of each of the vector execution pipelines.  It seems entirely plausible that such a design could include a systematic conflict (a “structural hazard”) between the accesses needed by the execution pipes and the accesses needed by the retirement unit.  If this conflict is resolved in favor of the retirement unit, then execution would be stalled, the Reservation Stations would fill up, and the observed behavior could be generated.   If such a conflict exists, it is clearly independent of the number of input arguments (since instructions with 1, 2, and 3 input arguments have the same behavior), leaving the single output argument as the only common feature.  If such a conflict exists, it must almost certainly also be systematic — occurring independent of alignment, timing, or functional unit details — otherwise it seems likely that we would have seen at least one case in the hundreds of tests here that violates the 12/7 throughput limit.

Tests using a variant of the test code with much smaller loops (varying between 160 and 24,000 FMAs per measurement interval, repeated 100,000 times) also strongly support the 12/7 throughput limit.  In every case the minimum cycle count over the 100,000 iterations  was consistent with 12 VPU instructions every 7 cycles (plus measurement overhead).



The Intel Xeon Phi x200 (Knights Landing) appears to have a systematic throughput limit of 12 Vector Pipe instructions per 7 cycles — 6/7 of the nominal peak performance.  This throughput limitation is not displayed by the integer functional units or the memory units.  Due to the two-instruction-per-cycle limitations of allocate/rename/retire, this performance limit in the vector units is not expected to have an impact on “real” codes.   A wide variety of tests were performed to attempt to develop quantitative models that might account for this limitation, but none matched the specifics of the observed timing and performance counts.


After Damon McDougall’s presentation at the IXPUG 2018 Fall Conference, we talked to a number of Intel engineers who were familiar with this issue.  Unfortunately, we did not get a clear indication of whether their comments were covered by non-disclosure agreements, so if they gave us an explanation, I can’t repeat it….

Posted in Computer Hardware, Performance, Performance Counters | Comments Off on A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Posted by John D. McCalpin, Ph.D. on 6th December 2016

The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.

It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)

The modes that are important are:

  • “Flat” vs “Cache”
    • In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
      • The OS exposes this memory as being on “NUMA node 1”, so it can be accessed using the standard NUMA control facilities (e.g., numactl).
      • Sustained bandwidth from MCDRAM is highest in “Flat” mode.
    • In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
      • In this mode the MCDRAM is invisible and effectively uncontrollable.  I will discuss the performance characteristics of Cache mode at a later date.
  • “All-to-All” vs “Quadrant”
    • In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
      • Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
    • In “Quadrant” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but the each address is assigned to one of the MCDRAM controllers in the same “quadrant” as the coherence controller.
      • This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
      • Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
        • Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
  • “Sub-NUMA-Cluster”
    • There are several of these modes, only one of which will be discussed here.
    • “Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
      • “node 0” owns the 1st quarter of contiguous physical address space.
        • The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
        • Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
        • The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
      • Ditto for nodes 1, 2, and 3.

The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).

My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory.  The values presented are averaged over many addresses, with the ranges showing the variation of average latency across cores.

Mode of OperationFlat-QuadrantFlat-All2AllSNC4 localSNC4 remote
MCDRAM maximum latency (ns)156.1158.3153.6164.7
MCDRAM average latency (ns)154.0155.9150.5156.8
MCDRAM minimum latency (ns)152.3154.4148.3150.3
MCDRAM standard deviation (ns)


  • My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
  • Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
    • E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
  • Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.

Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….

The corresponding data for addresses mapped to DDR4 memory are included in the table below:

Mode of OperationFlat-QuadrantFlat-All2AllSNC4 localSNC4 remote
DDR4 maximum latency (ns)133.3136.8130.0141.5
DDR4 average latency (ns)130.4131.8128.2133.1
DDR4 minimum latency (ns)128.2128.5125.4126.5
DDR4 standard deviation (ns)

There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.

Posted in Cache Coherence Implementations, Computer Architecture, Computer Hardware, Performance | Comments Off on Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on 5th November 2016

The art and science of microprocessor architecture is a never-ending struggling to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four (consecutive) 32-bit memory values from memory.   As an example of one mode of operation, the four steps are:

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref:
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
    • I am sure someone will be happy to tell me if I am wrong!

So that was a lot of detail — what is the point?

The first point relates the the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.


This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.


  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g.,, for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many approaches to neural network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?

UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | 2 Comments »

Invited Talk at SuperComputing 2016!

Posted by John D. McCalpin, Ph.D. on 16th October 2016

“Memory Bandwidth and System Balance in HPC Systems”

If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic).   The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.

The official announcement:

SC16 Invited Talk Spotlight: Dr. John D. McCalpin Presents “Memory Bandwidth and System Balance in HPC Systems”

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Invited Talk at SuperComputing 2016!

Memory Bandwidth Requirements of the HPL benchmark

Posted by John D. McCalpin, Ph.D. on 11th September 2014

The High Performance LINPACK (HPL) benchmark is well known for delivering a high fraction of peak floating-point performance. The (historically) excellent scaling of performance as the number of processors is increased and as the frequency is increased suggests that memory bandwidth has not been a performance limiter.

But this does not mean that memory bandwidth will *never* be a performance limiter. The algorithms used by HPL have lots of data re-use (both in registers and from the caches), but the data still has to go to and from memory, so the bandwidth requirement is not zero, which means that at some point in scaling the number of cores or frequency or FP operations per cycle, we are going to run out of the available memory bandwidth. The question naturally arises: “Are we (almost) there yet?”

Using Intel’s optimized HPL implementation, a medium-sized (N=18000) 8-core (single socket) HPL run on a Stampede compute node (3.1 GHz, 8 cores/chip, 8 FP ops/cycle) showed about 15 GB/s sustained memory bandwidth at about 165 GFLOPS. This level of bandwidth utilization should be no trouble at all (even when running on two sockets), given the 51.2 GB/s peak memory bandwidth (~38 GB/s sustainable) on each socket.

But if we scale this to the peak performance of a new Haswell EP processor (e.g., 2.6 GHz, 12 cores/chip, 16 FP ops/cycle), it suggests that we will need about 40 GB/s of memory bandwidth for a single-socket HPL run and about 80 GB/s of memory bandwidth for a 2-socket run. A single Haswell chip can only deliver about 60 GB/s sustained memory bandwidth, so the latter value is a problem, and it means that we expect LINPACK on a 2-socket Haswell system to require attention to memory placement.

A colleague here at TACC ran into this while testing on a 2-socket Haswell EP system. Running in the default mode showed poor scaling beyond one socket. Running the same binary under “numactl —interleave=0,1” eliminated most (but not all) of the scaling issues. I will need to look at the numbers in more detail, but it looks like the required chip-to-chip bandwidth (when using interleaved memory) may be slightly higher than what the QPI interface can sustain.

Just another reminder that overheads that are “negligible” may not stay that way in the presence of exponential performance growth.

Posted in Algorithms, Performance | Comments Off on Memory Bandwidth Requirements of the HPL benchmark

Counting Stall Cycles on the Intel Sandy Bridge Processor

Posted by John D. McCalpin, Ph.D. on 4th June 2014

Intuition might suggest that defining what a “stall cycle” is on a processor should be relatively straightforward. For some processors, this
is actually the case — particularly in-order processors with a very small number of execution units and a very small number of non-pipelined
instructions. For modern out-of-order processors, coming up with a precise and quantitative definition of “stall” involves numerous subtleties,
and deriving a methodology to measure such stalls is even more difficult.

This week I did some testing of “stalls” using the hardware performance counters in the Intel Xeon E5-2680 (“Sandy Bridge EP”) processors in
the Stampede system at TACC.

I found performance counter events that count stalls at two different places in the processor pipeline (with a third mentioned below, but not tested here):

  • Two events count cycles in which uops are not sent from the RAT (Register Alias Table — the register renaming unit) to the RS (Reservation Station —
    queues uops until the instructions defining their source operands have been dispatched, then dispatches “ready” uops to the execution ports)

    1. Event 0x0E, Umask 0x01: UOPS_ISSUED with the CMASK and INVERT flags: 0x01c3010e
      • Intel’s VTune calls this UOPS_ISSUED.STALL_CYCLES
    2. Event 0xA2, Umask 0x01: RESOURCE.STALLS.ANY
      • Consistently delivers values about 1% to 3% lower than the UOPS_ISSUED.STALL_CYCLES event in my tests.
  • Two events count cycles in which no uops are dispatched from the RS to any of the execution units (aka “ports”).
    1. Event 0xA3, Umask 0x04: CYCLE_ACTIVITY.CYCLES_NO_DISPATCH with CMASK=4: 0x044304a3
      • I got the CMASK value from VTUNE — the documentation in Vol 3 of the SW Developer’s Guide is not very helpful.
    2. Event 0xB1, Umask 0x02: UOPS_DISPATCHED.STALL_CYCLES_CORE: 0x01c302b1
      • This is very similar to an event used by VTune, but I use Umask 0x02 rather than 0x01. This will only make a difference on a system with
        HyperThreading enabled, and I don’t have any systems configured that way to test right now.
      • These two events differed by no more than a part per million in my tests.

As discussed in the Intel forum thread (link), the first two events can easily overcount stalls in codes that have a “stall-free” IPC of less than 4. For example, a code with a “stall-free” IPC of 1 could show 75% stall cycles using these events, with uops transferred from the RAT to the RS in one block of 4 uops every 4 cycles (leaving 3 cycles idle).

The second two events typically undercount stalls because they consider a cycle to be a “non-stall” cycle if any uops are dispatched from the RS to the execution units, even when those uops subsequently get rejected and retried because their input data is not in the cache. Using the STREAM benchmark as my test case, I often saw that the total number of uops dispatched to the execution ports was 20%-50% higher than the number of uops issued from the RAT to the RS. (This was based on a small number of test cases which were not intended to approach the upper bound on uop retries, so I assume that the worst case fraction of retries could be much higher. I have seen retries of floating-point instructions exceeding 12x, and that was not intended to be a worst-case upper bound either.)

Unfortunately, there is no way to count these execution retries directly, and no way to determine how many cycles had instructions dispatched that were all rejected and retried.

Note that one can also count cycles in which no instructions are retired. This was also discussed in the forum thread above, and has the same theoretical problem as counting at issue — the processor can retire at least four instructions per cycle, so if the non-stalled IPC is less than four, burstiness of instruction retirement can result in non-zero stall cycle counts even if there are some instructions executing every cycle.

None of this discussion so far has explicitly dealt with the cause of the stalls. Intel provides a very interesting performance counter event that provides some insight into this issue. Event 0xA3 CYCLE_ACTIVITY has Umasks for “CYCLES_L2_PENDING” (0x01) and “CYCLES_NO_DISPATCH” (0x04). Again, the documentation in Vol 3 of the SW developer’s guide is not adequate to understand how to program this unit, but fortunately Intel’s VTune provides an example. The VTune event CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING is created with this event by combining the two Umasks and including a CMASK value of 5, giving the encoding: 0x054305a3. (It is not at all clear why the CMASK value should be 5 in this case, but the event is clearly non-standard since the combined Umask values are treated as a logical AND rather than the logical OR typically assumed for combined Umasks.)

In experiments with the STREAM benchmark, where the actual number of stall cycles should be around 90%, the values produced by CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING varied between 30% and 93% of the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH counts (without the L2_PENDING qualifier). The lower values were seen with tests using streaming (nontemporal) stores, while the higher values were seen using ordinary (allocating) stores. This pattern makes it clear that this event counts store misses (RFO’s) in the “L2_PENDING” category, but it leaves a “hole” in the memory stall cycle identification in the case where the memory stalls are due to streaming stores.

  • For AVX codes there is an event that catches this reasonably well: Event 0xA2, Umask 0x08: RESOURCE_STALLS.SB (cycles with no issue from the RAT to the RS because the store buffers are full) shows 70%-91% of the total cycles have issue stalls due to full store buffers. So looking at the max of CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING and RESOURCE_STALLS.SB gives a good indication of stalls due to memory for codes with either allocating stores or streaming stores.
  • For SSE codes with streaming stores the RESOURCE_STALLS.SB event is only 20%-37% of the total cycles. Even if you add the percentage stalls from this number to the percentage stalls using CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING you only get 45% – 59% of the total cycles, so I don’t yet have a set of events that can identify that all of the stall cycles are actually memory stalls. (Adding stall cycles in this way is not generally a good idea, since cycles can be stalled for both reasons. I only add the two here to show that they are both much too small to account for all of the stall cycles.)

Summary: There are resources available to help identify memory-related stall cycles, but they are not as precise as one might like. In most cases these counters can identify when memory stalls are dominating execution time, and this is really what a performance analyst is looking for. Once the problem is identified, tuning work is primarily based on execution time of the code section of interest, with hardware performance counters playing (at most) an advisory role.

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Counting Stall Cycles on the Intel Sandy Bridge Processor