John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Performance' Category

The evolution of single-core bandwidth in multicore processors

Posted by John D. McCalpin, Ph.D. on 25th April 2023

The primary metric for memory bandwidth in multicore processors is that maximum sustained performance when using many cores.  For most high-end processors these values have remained in the range of 75% to 85% of the peak DRAM bandwidth of the system over the past 15-20 years — an amazing accomplishment given the increase in core count (with its associated cache coherence issues), number of DRAM channels, and ever-increasing pipelining of the DRAMs themselves.

This post is about a secondary performance characteristic — sustained memory bandwidth for a single thread running on a single core.  This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., buffer copies for filesystem access) with a single thread.  In my own experience, I have found that systems with higher single-core bandwidth feel “snappier” when used for interactive work — editing, compiling, debugging, etc.

With that in mind, I decided to mine some historical data (and run a few new experiments) to see how single-thread/single-core memory bandwidth has evolved over the last 10-15 years.  Some of the results were initially surprising, but were all consistent with the fundamental “physics” of bandwidth represented by Little’s Law (lots more on that below).

Looking at sustained single-core bandwidth for a kernel composed of 100% reads, the trends for a large set of high-end AMD and Intel processors are shown in the figure below:

So from 2010 to 2023, the sustainable single-core bandwidth increased by about 2x on Intel processors and about 5x on AMD processors.

Are these “good” improvements?  The table below may provide some perspective:

2023 vs 2010 speedup 1-core BW 1-core GFLOPS all-core BW all-core GFLOPS
Intel ~2x ~5x ~10x/DDR5 ~30x/HBM >40x
AMD ~5x ~5x ~20x ~30x

 

The single-core bandwidth on the Intel systems is clearly continuing to fall behind the single-core compute capability, and a single core is able to exploit a smaller and smaller fraction of the bandwidth available to a socket.  The single-core bandwidth on AMD systems is increasing at about the same rate as the single-core compute capability, but is also able to sustain a decreasing fraction of the bandwidth of the socket.

These observations naturally give rise to a variety of questions.  Some I will address today, and some in the next blog entry.

Questions and Answers:

  1. Why use a “read-only” memory access pattern?  Why not something like STREAM?
  2. Multicore processors have huge amounts of available DRAM bandwidth – maybe it does not even make sense for a single core to try to use that much?
    • Any recent Intel processor core (Skylake Xeon or newer) has a peak cache bandwidth of (at least) two 64-Byte reads plus one 64-Byte write per cycle.  At a single-core frequency of 3.0 GHz, this is a read BW of 384 GB/s – higher than the full socket bandwidth of 307.2 GBs with 8 channels of DDR5/4800 DRAM.  I don’t expect all of that, but the core can clearly make use of more than 20 GB/s.
  3. Why is the single-core bandwidth increasing so slowly?
    • To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses.
    • That is the topic of the next blog entry.  (“Real Soon Now”)
  4. Can this problem be fixed?
    • Sure! We don’t need to violate any physical laws to increase single-core bandwidth — we just need the design support the very high levels of memory parallelism we need, and provide us with a way of generating that parallelism from application codes.
    • The NEC Vector Engine processors provide a demonstration of very high single-core bandwidth.   On a VE20B (8 cores, 1.6 GHz, 1530 GB/s peak BW from 6 HBM stacks),  I see single-thread sustained memory bandwidth of 304 GB/s on the ReadOnly benchmark used here.

 

Stay tuned!

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on The evolution of single-core bandwidth in multicore processors

The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Posted by John D. McCalpin, Ph.D. on 2nd April 2020

This was a keynote presentation at the “2nd International Workshop on Performance Modeling: Methods and Applications” (PMMA16), June 23, 2016, Frankfurt, Germany (in conjunction with ISC16).

The presentation discusses a family of simple performance models that I developed over the last 20 years — originally in support of processor and system design at SGI (1996-1999), IBM (1999-2005), and AMD (2006-2008), but more recently in support of system procurements at The Texas Advanced Computing Center (TACC) (2009-present).

More notes interspersed below….


Slide01


Most of TACC’s supercomputer systems are national resources, open to (unclassified) scientific research in all areas. We have over 5,000 direct users (logging into the systems and running jobs) and tens of thousands of indirect users (who access TACC resources via web portals). With a staff of slightly over 175 full-time employees (less than 1/2 in consulting roles), we must therefore focus on highly-leveraged performance analysis projects, rather than labor-intensive ones.


An earlier presentation on this topic (including extensions of the method to incorporate cost modeling) is from 2007: “System Performance Balance, System Cost Balance, Application Balance, & the SPEC CPU2000/CPU2006 Benchmarks” (invited presentation at the SPEC Benchmarking Joint US/Europe Colloquium, June 22, 2007, Dresden, Germany.


This data is from the 2007 presentation. All of the SPECfp_rate2000 results were downloaded from www.spec.org, the results were sorted by processor type, and “peak floating-point operations per cycle” was manually added for each processor type. This includes all architectures, all compilers, all operating systems, and all system configurations. It is not surprising that there is a lot of scatter, but the factor of four range in Peak MFLOPS at fixed SPECfp_rate2000/core and the factor of four range in SPECfp_rate2000/core at fixed Peak MFLOPS was higher than I expected….


(Also from the 2007 presentation.) To show that I can criticize my own work as well, here I show that sustained memory bandwidth (using an approximation to the STREAM Benchmark) is also inadequate as a single figure of metric. (It is better than peak MFLOPS, but still has roughly a factor of three range when projecting in either direction.)


Here I assumed a particular analytical function for the amount of memory traffic as a function of cache size to scale the bandwidth time.
Details are not particularly important since I am trying to model something that is a geometric mean of 14 individual values and the results are across many architectures and compilers.
Doing separate models for the 14 benchmarks does not reduce the variance much further – there is about 15% that remains unexplainable in such a broad dataset.

The model can provide much better fit to the data if the HW and SW are restricted, as we will see in the next section…



Why no overlap? The model actually includes some kinds of overlap — this will be discussed in the context of specific models below — an can be extended to include overlap between components. The specific models and results that will be presented here fit the data better when it is assumed that there is no overlap between components. Bounds on overlap are discussed near the end of the presentation, in the slides titled “Analysis”.


The approach is opportunistic. When I started this work over 20 years ago, most of the parameters I was varying could only be changed in the vendor’s laboratory. Over time, the mechanisms introduced for reducing energy consumption (first in laptops) became available more broadly. In most current machines, memory frequency can be configured by the user at boot time, while CPU frequency can be varied on a live system.


The assumption that “memory accesses overlap with other memory accesses about as well as they do in the STREAM Benchmark” is based on trying lots of other formulations and getting poor consistency with observations.
Note that “compute” is not a simple metric like “instructions” or “floating-point operations”. T_cpu is best understood as the time that the core requires to execute the particular workload of interest in the absence of memory references.


Only talking about CPU2006 results today – the CPU2000 results look similar (see the 2007 presentation linked above), but the CPU2000 benchmark codes are less closely related to real applications.


Building separate models for each of the benchmarks was required to get the correct asymptotic properties. The geometric mean used to combine the individual benchmark results into a single metric is the right way to combine relative performance improvements with equal weighting for each code, but it is inconsistent with the underlying “physics” of computer performance for each of the individual benchmark results.


This system would be considered a “high-memory-bandwidth” system at the time these results were collected. In other words, this system would be expected to be CPU-limited more often than other systems (when running the same workload), because it would be memory-bandwidth limited less often. This system also had significantly lower memory latency than many contemporary systems (which were still using front-side bus architectures and separate “NorthBridge” chips).


Many of these applications (e.g., NAMD, Gamess, Gromacs, DealII, WRF, and MILC) are major consumers of cycles on TACC supercomputers (albeit newer versions and different datasets).


The published SPEC benchmarks are no longer useful to support this sensitivity-based modeling approach for two main reasons:

  1. Running N independent copies of a benchmark simultaneously on N cores has a lot of similarities with running a parallelized implementation of the benchmark when N is small (2 or 4, or maybe a little higher), but the performance characteristics diverge as N gets larger (certainly dubious by the time on reaches even half the core count of today’s high-end processors).
  2. For marketing reasons, the published results since 2007 have been limited almost exclusively to the configurations that give the very best results. This includes always running with HyperThreading enabled (and running one copy of the executable on each “logical processor”), always running with automatic parallelization enabled (making it very difficult to compare “speed” and “rate” results, since it is not clear how many cores are used in the “speed” tests), always running with the optimum memory configuration, etc.

The “CONUS 12km” benchmark is a simulation of the weather over the “CONtinental US” at 12km horizontal resolution. Although it is a relatively small benchmark, the performance characteristics have been verified to be quite similar to the “on-node” performance characteristics of higher-resolution test cases (e.g., “CONUS 2.5km”) — especially when the higher-resolution cases are parallelized across multiple nodes.



Note that the execution time varies from about 120 seconds to 210 seconds — this range is large compared to the deviations between the model and the observed execution time.

Note also that the slope of the Model 1 fit is almost 6% off of the desired value of 1.0, while the second model is within 1%.

  • In the 2007 SPECfp_rate tests, a similar phenomenon was seen, and required the addition of a third component to the model: memory latency.
  • In these tests, we did not have the same ability to vary memory latency that I had with the 2007 Opteron systems. In these “real-application” tests, IO is not negligible (while it is required to be <1% of execution time for the SPEC benchmarks), and allowing for a small invariant IO time gave much better results.

Bottom bars are model CPU time – easy to see the quantization.
Middle bars are model Memory time.
Top bars are (constant) IO time.


Drum roll, please….




Ordered differently, but the same sort of picture.
Here the quantization of memory time is visible across the four groups of varying CPU frequencies.


These NAMD results are not at all surprising — NAMD has extremely high cache re-use and therefore very low rates of main memory access — but it was important to see if this testing methodology replicated this expected result.


Big change of direction here….
At the beginning I said that I was assuming that there would be no overlap across the execution times associated with the various work components.
The extremely close agreement between the model results and observations strongly supports the effectiveness of this assumption.
On the other hand, overlap is certainly possible, so what can this methodology provide for us in the way of bounds on the degree of overlap?


On some HW it is possible (but very rare) to get timings (slightly) outside these bounds – I ignore such cases here.
Note that maximum ratio of upper bound over lower bound is equal to “N” – the degrees of freedom of the model! This is an uncomfortably large range of uncertainty – making it even more important to understand bounds on overlap.


Physically this is saying that there can’t be so much work in any of the components that processing that work would exceed the total time observed.
But the goal is to learn something about the ratios of the work components, so we need to go further.



These numbers come from plugging in synthetic performance numbers from a model with variable overlap into the bounds analysis.
Message 1: If you want tight formal bounds on overlap, you need to be able to vary the “rate” parameters over a large range — probably too large to be practical.
Message 2: If one of the estimated time components is small and you cannot vary the corresponding rate parameter over a large enough range, it may be impossible to tell whether the work component is “fuller overlapped” or is associated with a negligible amount of work (e.g., the lower bound in the “2:1 R_mem” case in this figure). (See next slide.)







Posted in Computer Architecture, Performance | Comments Off on The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Timing Methodology for MPI Programs

Posted by John D. McCalpin, Ph.D. on 4th March 2019

While working on the implementation of the MPI version of the STREAM benchmark, I realized that there were some subtleties in timing that could easily lead to inaccurate and/or misleading results.  This post is a transcription of my notes as I looked at the issues….

Primary requirement: I want a measure of wall clock time that is guaranteed to start before any rank does work and to end after all ranks have finished their work.

Secondary goal: I also want the start time to be as late as possible relative to the initiation of work by any rank, and for the end time to be as early as possible relative to the completion of the work by all ranks.

I am not particularly concerned about OS scheduling issues, so I can assume that timers will be very close to the execution time of the the preceding statement completion and the subsequent statement initiation.  Any deviations caused by stalls between timers, barriers, and work must be in the direction of increasing the reported time, not decreasing it.  (This is a corollary of my initial requirement.)
The discussion here will be based on a simple example, where the “t” variables are (local) wall clock times for MPI rank k and WORK() represents the parallel workload that I am testing.
Generically, I want:
      t_start(k) = time()
      WORK()
      t_end(k) = time()
but for an MPI job, the methodology needs to be provably correct for arbitrary (real) skew across nodes as well as for arbitrary offsets between the absolute time across nodes.  (I am deliberately ignoring the rare case in which a clock is modified on one or more nodes during a run — most time protocols try hard to avoid such shifts, and instead change the rate at which the clock is incremented to drive synchronization.)
After some thinking, I came up with this pseudo-code, which is executed independently by each MPI rank (indexed by “k”):
      t0(k) = time()
      MPI_barrier()
      t1(k) = time()

      WORK()

      t2(k) = time()
      MPI_barrier()
      t3(k) = time()
If the clocks are synchronized, then all I need is:
    tstart = min(t1(k)), k=1..numranks
    tstop  = max(t2(k)), k=1..numranks
If the clocks are not synchronized, then I need to make some use of the barriers — but exactly how?
In the pseudo-code above, the barriers ensure that the following two statements are true:
  • For the start time, t0(k) is guaranteed to be earlier than the initiation of any work.
  • For the end time, t3(k) is guaranteed to be later than the completion of any work.
These statements are true for each rank individually, so the tightest bound available from the collection of t0(k) and t3(k) values is:
      tstop - tstart > min(t3(k)-t0(k)),   k=1..numranks
This gives a (tstop – tstart) that is at least as large as the time required for the actual work plus the time required for the two MPI_barrier() operations.

Posted in Performance, Reference | Comments Off on Timing Methodology for MPI Programs

New Year’s Updates

Posted by John D. McCalpin, Ph.D. on 9th January 2019

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:

Posted in Computer Architecture, Performance | 2 Comments »

Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Posted by John D. McCalpin, Ph.D. on 17th September 2018

Most Intel microprocessors support “HyperThreading” (Intel’s trademark for their implementation of “simultaneous multithreading”) — which allows the hardware to support (typically) two “Logical Processors” for each physical core. Processes running on the two Logical Processors share most of the processor resources (particularly caches and execution units). Some workloads (particularly heterogeneous ones) benefit from assigning processes to all logical processors, while other workloads (particularly homogeneous workloads, or cache-capacity-sensitive workloads) provide the best performance when running only one process on each physical core (i.e., leaving half of the Logical Processors idle).

Last year I was trying to diagnose a mild slowdown in a code, and wanted to be able to use the hardware performance counters to divide processor activity into four categories:

  1. Neither Logical Processor active
  2. Logical Processor 0 Active, Logical Processor 1 Inactive
  3. Logical Processor 0 Inactive, Logical Processor 1 Active
  4. Both Logical Processors Active

It was not immediately obvious how to obtain this split from the available performance counters.

Every recent Intel processor has:

  • An invariant, non-stop Time Stamp Counter (TSC)
  • Three “fixed-function” performance counters per logical processor
    1. Fixed-Function Counter 0: Instructions retired (not used here)
    2. Fixed-Function Counter 1: Actual Cycles Not Halted
    3. Fixed-Function Counter 2: Reference Cycles Not Halted
  • Two or more (typically 4) programmable performance counters per logical processor
    • A few of the “events” are common across all processors, but most are model-specific.

The fixed-function “Reference Cycles Not Halted” counter increments at the same rate as the TSC, but only while the Logical Processor is not halted. So for any interval, I can divide the change in Reference Cycles Not Halted by the change in the TSC to get the “utilization” — the fraction of the time that the Logical Processor was Not Halted. This value can be computed independently for each Logical Processor, but more information is needed to assign cycles to the four categories.   There are some special cases where partial information is available — for example, if the “utilization” is close to 1.0 for both  Logical Processors for an interval, then the processor must have had “Both Logical Processors Active” (category 4) for most of that interval.    On the other hand, if the utilization on each Logical Processor was close to 0.5 for an interval, the two logical processors could have been active at the same time for 1/2 of the cycles (50% idle + 50% both active), or the two logical processors could have been active at separate times (50% logical processor 0 only + 50% logical processor 1 only), or somewhere in between.

Both the fixed-function counters and the programmable counters have a configuration bit called “AnyThread” that, when set, causes the counter to increment if the corresponding event occurs on any logical processor of the core.  This is definitely going to be helpful, but the both the algebra and the specific programming of the counters have some subtleties….

The first subtlety is related to some confusing changes in the clocks of various processors and how the performance counter values are scaled.

  • The TSC increments at a fixed rate.
    • For most Intel processors this rate is the same as the “nominal” processor frequency.
      • Starting with Skylake (client) processors, the story is complicated and I won’t go into it here.
    • It is not clear exactly how often (or how much) the TSC is incremented, since the hardware instruction to read the TSC (RDTSC) requires between ~20 and ~40 cycles to execute, depending on the processor frequency and processor generation.
  • The Fixed-Function “Unhalted Reference Cycles” counts at the same rate as the TSC, but only when the processor is not halted.
    • Unlike the TSC, the Fixed-Function “Unhalted Reference Cycles” counter increments by a fixed amount at each increment of a slower clock.
    • For Nehalem and Westmere processors, the slower clock was a 133 MHz reference clock.
    • For Sandy Bridge through Broadwell processors, the “slower clock” was the 100 MHz reference clock referred to as the “XCLK”.
      • This clock was also used in the definition of the processor frequencies.
      • For example, the Xeon E5-2680 processor had a nominal frequency of 2.7 GHz, so the TSC would increment (more-or-less continuously) at 2.7 GHz, while the Fixed-Function “Unhalted Reference Cycles” counter would increment by 27 once every 10 ns (i.e., once every tick of the 100 MHz XCLK).
    • For Skylake and newer processors, the processor frequencies are still defined in reference to a 100 MHz reference clock, but the Fixed-Function “Unhalted Reference Cycles” counter is incremented less frequently.
      • For the Xeon Platinum 8160 (nominally 2.1 GHz), the 25 MHz “core crystal clock” is used, so the counter increments by 84 once every 40 ns, rather than by 21 once every 10 ns.
  • The programmable performance counter event that most closely corresponds to the Fixed-Function “Unhalted Reference Cycles” counter has changed names and definitions several times
    • Nehalem & Westmere: “CPU_CLK_UNHALTED.REF_P” increments at the same rate as the TSC when the processor is not halted.
      • No additional scaling needed.
    • Sandy Bridge through Broadwell: “CPU_CLK_THREAD_UNHALTED.REF_XCLK” increments at the rate of the 100 MHz XCLK (not scaled!) when the processor is not halted.
      • Results must be scaled by the base CPU ratio.
    • Skylake and newer: “CPU_CLK_UNHALTED.REF_XCLK” increments at the rate of the “core crystal clock” (25 MHz on Xeon Scalable processors) when the processor is not halted.
      • Note that the name still includes “XCLK”, but the definition has changed!
      • Results must be scaled by 4 times the base CPU ratio.

Once the scaling for the programmable performance counter event is handled correctly, we get to move on to the algebra of converting the measurements from what is available to what I want.

For each interval, I assume that I have the following measurements before and after, with the measurements taken as close to simultaneously as possible on the two Logical Processors:

  • TSC (on either logical processor)
  • Fixed-Function “Unhalted Reference Cycles” (on each logical processor)
  • Programmable CPU_CLK_UNHALTED.REF_XCLK with the “AnyThread” bit set (on either Logical Processor)

So each Logical Processor makes two measurements, but they are asymmetric.

From these results, the algebra required to split the counts into the desired categories is not entirely obvious.  I eventually worked up the following sequence:

  1. Neither Logical Processor Active == Elapsed TSC – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
  2. Logical Processor 0 Active, Logical Processor 1 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”
  3. Logical Processor 1 Active, Logical Processor 0 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)”
  4. Both Logical Processors Active == CPU_CLK_UNHALTED.REF_XCLK*scaling_factor – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”

Starting with the Skylake core, there is an additional sub-event of the programmable CPU_CLK_UNHALTED event that increments only when the current Logical Processor is active and the sibling Logical Processor is inactive.  This can certainly be used to obtain the same results, but it does not appear to save any effort.   My approach uses only one programmable counter on one of the two Logical Processors — a number that cannot be reduced by using an alternate programmable counter.   Comparison of the two approaches shows that the results are the same, so in the interest of backward compatibility, I continue to use my original approach.

Posted in Performance, Performance Counters, Reference | Comments Off on Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Why I hate MPI (from a performance analysis perspective)

Posted by John D. McCalpin, Ph.D. on 1st August 2018

According to Dr. Bandwidth, performance analysis has two recurring themes:

  1. How fast should this code (or “simple” variations on this code) run on this hardware?
  2. If I am analyzing (apparent) performance shortfalls, how can I distinguish between cause and effect?

For very simple codes, it may be possible to do a high-level analysis on performance limitations, but once the code becomes complex, it is often necessary to investigate the full stack.  This can start with either a “top-down” or “bottom-up” approach, but in complex codes running on complex hardware, what is really required is both approaches — iterated until the interactions between all the components are understood.  This is an intellectually challenging and labor-intensive exercise, requiring detailed review of the published details of each of the components of the system, and usually requiring significant “detective work” (using customized microbenchmarks, hardware performance counter analysis, and creative thinking) to fill in the gaps.

MPI is a commonly used standard to implement message-passing programs on computer clusters.   The basic interfaces are simple to learn, and I suspect that many people have no idea how many levels of interacting components are involved in executing an MPI program on a cluster.

A discussion with a colleague prompted me to try to write down a list of the interacting components, along with some comments on why it is so hard to understand the interactions between components.

Interacting components in the execution of an MPI job — a brief outline (from memory):

  1. The user source code, which contains an ordered set of calls to MPI routines.
    • There are many different types of MPI calls that can be used to perform the required communication.
    • For implementations containing multiple MPI calls, there are (typically) many orders in which the routines can be called that are all functionally equivalent.
    • The MPI standard requires that many sequences of calls (appear to) be ordered, but there are exceptions.
  2. The user execution environment.
    • This will typically include environment variables that can influence the behavior of the MPI runtime, and might include environment variables that can influence the behavior of the lower-level shared-memory transport and/or network hardware interfaces.
    • The user environment defines the mapping of MPI ranks to hardware resources (cores, sockets, nodes).
  3. The MPI runtime library.
    • An MPI runtime library will typically include multiple implementations of each MPI function, and will choose implementation(s) to execute based on environment variables and run-time information (sizes, communicators, etc.), in ways that are seldom transparent.
      • The source code to the library may not be available.
      • Low-level MPI tracing may be required to determine what the runtime library decided to execute, and even that may not be enough to fully capture the “decisions” made at lower levels of the implementation.
    • An MPI library may aggregate or re-order messages in any manner that provides the appearance of conforming to the ordering rules of the MPI specification.
      • In particular, the MPI library may violate ordering rules of the MPI standard if it can prove (typically via runtime checks) that the result is the same as what would be obtained by code that explicitly followed the ordering rules. This sort of speculation means that multiple runs of the same code on the same nodes may end up performing different sequences of operations.
    • MPI runtime libraries are typically built to interact with one or more lower-level interfaces to shared-memory transport implementation(s) and networking hardware.
      • The implementations of such interfaces may be based on anywhere between a superficial understanding of the lower-level interface, and a deeply detailed understanding of the lower-level interface.
      • Different levels of understanding may lead to different choices for exactly how the MPI library chooses to implement a function. For example, the MPI library may (or may not) know specific details of the ordering rules followed by the low-level interface, and so may (or may not) be able to choose different sequences of low-level operations.
  4. The implementation(s) of the low-level interface(s) to the shared-memory transport implementation and to the networking hardware interface.
    • These vary in complexity and sophistication.
    • The explicit functionality exposed in these low-level APIs may or may not be a good map to the functionality of the MPI standard(s) and/or to the functionality of the underlying hardware.
    • Mismatches in semantics or ordering rules will require an implementation to choose overly conservative sequences of low-level operations, which typically reduces performance.
  5. The processor hardware available to support shared-memory transport.
    • Current processor architectures do not include cross-core communication or synchronization as first-class architectural concepts, so implementation of communication and synchronization must be done using ordered sequences of loads and stores.
    • Detailed analysis of the cache coherence transactions involved in shared-memory synchronization (especially across sockets) show that the lower bound on (uncontested) latency is at least an order of magnitude higher than what a hardware implementation should be able to support. For highly contested accesses, shared-memory synchronization latency is typically several orders of magnitude higher than what a hardware implementation should be able to support.  Few implementations have performance that approaches the lower bounds on shared-memory synchronization latency.
    • For almost all recent systems, a single thread of execution can only generate a fraction of the bandwidth available within or between sockets, limiting performance of the “one MPI task per socket” hybrid programming model.
      • Full bandwidth to/from local DDR4 memory typically requires 6 or more cores per socket.
      • Full bandwidth to/from high-bandwidth memory or L3 cache typically requires half or more of the cores in the socket.
      • Full bandwidth between sockets varies by processor generation and access pattern, but is also typically 6 or more cores.
  6. The networking hardware.
    • Low-level implementation details and performance characterization data is not typically available for network interfaces and switches.
    • Most networking hardware was not designed to support highly efficient implementation of the MPI standards.
    • The MPI standards were not designed to support highly efficient implementations on most networking hardware.
  7. Runtime contention
    • Although most supercomputing systems use “space-sharing” to provide exclusive access to nodes, most supercomputing systems use an interconnect that does not separate traffic between different user’s jobs.
    • Many current interconnects (InfiniBand, OmniPath) use static routing, making self-conflicts unavoidable (in practical terms) for any MPI job including nodes on more than one leaf switch.
      • It is theoretically possible to avoid self-conflicts on a dedicated (non-oversubscribed) system, but this requires a herculean effort to load and parse the routing tables and to schedule all communications to avoid mapping more than one pairwise communication to any link in the multi-level switch hierarchy.
      • Because routing tables are not required to be symmetric, it might not be possible to create a non-self-conflicting schedule of MPI rank pairs if bidirectional communication is used.
      • Scheduling communications to avoid self-conflicts requires global barriers between steps of pairwise communications (e.g., between steps in an all-to-all communication). This is not “natural” for many/most MPI jobs, and the barrier overhead may result in overall performance degradation if the messages in the original communication are short.

Understanding any one of these components is not usually too hard (unless the implementation is undocumented).

Understanding the interaction of any two interacting components is a bit more work, but is typically manageable by a single person.

Understanding enough of the details of all of the components to be able to reason about the interactions typically requires at least 2-3 people with complementary expertise. Keeping up with changes to the hardware and software, and applying the analyses to the ever-expanding ensemble of user applications would make this a large fraction of a full-time assignment for the team members.

To make it worse, a supercomputing center like TACC has many different systems with a variety of interconnect hardware (several generations of Mellanox InfiniBand, Intel OmniPath, Cray Aries) and several different MPI stacks (Intel MPI, Cray’s version of MPICH, MVAPICH, OpenMPI). Some of these are sufficiently different that it may be necessary to have independent teams specializing in “full-stack” performance analyses for different systems.

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Why I hate MPI (from a performance analysis perspective)

Comments on timing short code sections on Intel processors

Posted by John D. McCalpin, Ph.D. on 23rd July 2018

(From a recent post of mine on the Intel software developer forums — some potentially useful words to go along with my new low-overhead-timers project…)

Updates on 2019-01-23 in blue.

There are lots of topics that you need to be aware of when attempting fine-grain timing.  A few of the more important ones are:

  • The RDTSC instruction increments at the rate of the “base” (or “nominal”) processor frequency, while instructions are executed at the “core frequency”.  The “core frequency” may be higher or lower than the “base” frequency, and it may change during your measurement interval.
    • If you have the ability to “pin” the processor frequency to match the “base” frequency, interpreting the results is often easier.
    • Whether you can fix the frequency or not, you will still need to measure several different things to be sure that you can unambiguously interpret the results.  More on this below.
  • With Turbo mode enabled, Intel processors will change their frequency based on how many cores are active.  When running a single user thread, you will often get the advertised single-core Turbo frequency, but if the operating system enables more cores to handle (even very short-lived) background processes, your frequency may drop unexpectedly.
  • Recent Intel processors often throttle down to a low frequency when not in use, and (depending on processor generation, BIOS settings, and OS settings) it may take longer than expected for the frequency to ramp back up to the expected values.
    • I usually precede the code that I want to test with a “warm-up” loop consisting of at least a few seconds of execution of instructions using the same SIMD width as the code that I want to test.
  • Always pin the thread you want to test to a single logical processor (if possible).
    • This allows you to use the RDPMC instruction to read the logical processor’s fixed-function performance counters.
    • It also reduces the chance of frequency changes or other stalls that may be incurred when moving a thread context to a different core.

For measurements of short duration (<< 1 second)

  • Intel processors will be halted during frequency changes, and recent Intel processors (Haswell and newer) will also be halted when activating and/or deactivating the portions of the pipeline(s) needed for 256-bit SIMD instructions and for 512-bit SIMD instructions.
    • The duration of these halts varies by product and in some cases by the amount of the frequency change.  I have seen values as low as 6 microseconds and as high as 50 microseconds for these types of transitions.

For measurements of very short duration (< 100’s of cycles)

  • The RDTSC instruction is not ordered with respect to the execution of other instructions.  Intel processors have gained increasing ability to execute instructions out of order over the past decade, allowing the execution of these instructions to be moved further away from where one might expect — in either direction.
  • The RDTSCP instruction is partially ordered — it will not execute until all prior instructions in program order have executed.
    • RDTSCP can still be executed later than expected, but not earlier.
    • This partial ordering can help expose the execution time of long-latency instructions (such as memory accesses or mispredicted branches) that occur shortly before the final value of the TSC is read using RDTSCP.
  • The LFENCE (“Load Fence”) instruction was originally intended to order memory load operations, but was later extended (architecturally) to order instruction execution.
    • The LFENCE instruction will not execute until all prior instructions have “completed locally”, and no later instructions will begin execution (even speculatively) until the LFENCE instruction completes.
      • It may not be safe to assume that “completed locally” and “retired” are equivalent.
      • “Completed locally” is definitely not the same as “globally visible” — SFENCE is still required if you need to ensure that stores are globally visible before continuing.
    • LFENCE is a fairly lightweight instruction, though the cost depends on the nature of the surrounding instructions. 
      • The repeat rate for consecutive LFENCE instructions is 4 cycles for mainstream Intel processors starting with Sandy Bridge (through at least Skylake).
    • The combination “LFENCE; RDTSC” has a slightly stronger ordering than RDTSCP.
      • RDTSCP waits until prior instructions have completed, but does not prevent later instructions from beginning execution before the RDTSCP instruction executes.  
        • If you want a lower bound on the execution time between “start” and “stop” instructions, the minimum requirement would be to add “RDTSC; LFENCE” before the “start” instruction and to add “RDTSCP” after the “stop” instruction. 
  • The Intel branch predictors are stranger than you might expect, and branch misprediction overheads are not trivial.
    • If you repeatedly execute an inner loop with a trip count of less than about 30, the branch predictor will “remember” which iteration is the final iteration of the loop, and it will correctly predict the loop exit.
    • If you increase the inner loop trip count to 35 or more, the branch predictor will not “remember” which iteration is the final iteration, so the final loop iteration will include a mispredicted branch, with an associated overhead of 15-20 cycles.
    • This can be very hard to understand if you are looking at results for loop trip counts from (for example) 16 to 64 and you see an unexpected bump of 15-20 cycles once the trip count exceeds a limit (typically in the 32-34 range).
    • This is even more confusing when you consider vectorization and loop unrolling, which the compiler may change significantly from one compilation to the next as you fiddle with your code.

Some recommendations:

  • A set of interfaces to the RDTSC and RDPMC instructions that have very low overheads are available at low-overhead-timers
  • I recommend measuring a minimum of four values:
    • Elapsed TSC cycles (using RDTSC or RDTSCP)
    • Instructions — using the RDPMC instruction with counter number (1<<30)+0
    • Core Cycles not halted — using the RDPMC instruction with counter number (1<<30)+1
    • Reference Cycles not halted — using the RDPMC instruction with counter number (1<<30)+2
  • If you have the ability to program the general-purpose core performance counters, I also recommend measuring at least two more values:
    • Instructions executed in kernel mode.
    • Core cycles not halted in kernel mode.
  • Compute these metrics:
    • Core Utilization = (Elapsed Reference Cycles not Halted) / (Elapsed TSC cycles)
      • If this is not very close to 1.000, the processor has been halted for frequency and/or pipeline activation issues, and you need to try to figure out why.
    • Average frequency while not halted = (Elapsed Core Cycles not Halted) / (Elapsed Reference Cycles not Halted) * Base_GHz
      • This should be compared to the expected frequency for your processor, given the number of cores that you think should be active.
    • Average net frequency = (Elapsed Core Cycles not Halted) / (Elapsed TSC cycles) * Base_GHz
      • This will tell you how much of your expected frequency has been lost due to processor halts.
    • Instructions Retired / Instructions Expected
      • For simple loops, you can look at the assembly code and count instructions.
      • This value will change significantly (and repeatably) if the compiler changes the vectorization of the loop.
      • This will change randomly (upward) if the OS schedules another process on the same logical processor during your measured section.
      • For measurements of 10,000 instructions or less, this will increase by a noticeable amount if an OS timer interrupt occurs during your measured section.
    • Kernel instructions / Total instructions
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
    • Core Cycles not Halted in Kernel Mode / Core Cycles not Halted
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Comments on timing short code sections on Intel processors

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Posted by John D. McCalpin, Ph.D. on 22nd January 2018

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Introduction:

In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report (https://colfaxresearch.com/skl-avx512/) to the Xeon Phi x200 (Knights Landing) processors here at TACC.   There was no deep goal — just a desire to see the maximum GFLOPS in action.     The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished.   Instead, we found puzzle after puzzle.  After almost four weeks, we have a solid characterization of the behavior — no tested code exceeds an execution rate of 12 vector pipe instructions every 7 cycles (6/7 of the nominal peak) when executed on a single core — but we are unable to propose a testable quantitative model for the source of the throughput limitation.

Dr. Damon McDougall gave a short presentation on this study at the IXPUG 2018 Fall Conference (pdf) — I originally wrote these notes to help organize my thoughts as we were preparing the IXPUG presentation, and later decided that the extra details contained here are interesting enough for me to post it.

 

Background:

The Xeon Phi x200 (Knights Landing) processor is Intel’s second-generation many-core product.  The Xeon Phi 7250 processors at TACC have 68 cores per processor, and each core has two 512-bit SIMD vector pipelines.   For 64-bit floating-point data, the 512-bit Fused Multiply-Add (FMA) instructions performs 16 floating-point operations (8 adds and 8 multiplies).  Each of the two vector units can issue one FMA instruction per cycle, assuming that there are enough independent accumulators to tolerate the 6-cycle dependent-operation latency.  The minimum number of independent accumulators required is: 2 VPUs times 6 cycles = 12 independent accumulators.

The Xeon Phi x200 has six execution units (two VPUs, two ALUs, and two Memory units), but is limited to two instructions per cycle by the decode, allocation, and retirement sections of the processor pipeline. (Most of the details of the Xeon Phi x200 series presented here are from the Intel-authored paper http://publications.computer.org/micro/2016/07/09/knights-landing-second-generation-intel-xeon-phi-product/.)

In our initial evaluation of the Xeon Phi x200, we did not fully appreciate the two-instruction-per-cycle limitation.  Since “peak performance” for the processor is two (512-bit SIMD) FMA instructions per cycle, any instructions that are not FMA instructions subtract directly from the available peak performance.  On “mainstream” Xeon processors, there is plenty of instruction decode/allocation/retirement bandwidth to overlap extra instructions with the SIMD FMA instructions, so we don’t usually even think about them.  Pointer arithmetic, loop index increments, loop trip count comparisons, and conditional branches are all essentially “free” on mainstream Xeon processors, but have to be considered very carefully on the Xeon Phi x200.

A “best case” scenario: DGEMM

The double-precision matrix multiplication routine “DGEMM” is typically the computational kernel that achieves the highest fraction of peak performance on high performance computing systems.  Hardware performance counter results for a simple benchmark code calling Intel’s optimized DGEMM implementation for this processor (from the Intel MKL library) show that about 20% of the dynamic instruction count consists of instructions that are not packed SIMD operations (i.e., not FMAs).  These “non-FMA” instructions include the pointer manipulation and loop control required by any loop-based code, plus explicit loads from memory and a smaller number of stores back to memory. (These are in addition to the loads that can be “piggy-backed” onto FMA instructions using the single memory input operand available for most computational operations in the Intel instruction set).  DGEMM implementations also typically require software prefetches to be interspersed with the computation to minimize memory stalls when moving from one “block” of the computation to the next.

Published DGEMM benchmark results for the Xeon Phi 7250 processor (https://software.intel.com/en-us/mkl/features/benchmarks) show maximum values of about 2100 GFLOPS when using all 68 cores (a very approximate estimate from a bar chart). Tests on one TACC development node gave slightly higher results — 2148 GFLOPS to 2254 GFLOPS (average = 2235 GFLOPS), for a set of 180 trials of a DGEMM test with M=N=K=8000 and using all 68 cores.   These runs reported a stable average frequency of 1.495 GHz, so the average of 2235 GFLOPS therefore corresponds to 68.7% of the peak performance of (68 cores * 32 FP ops/cycle/core * 1.495 GHz =) 3253 GFLOPS (note1). This is an uninspiring fraction of peak performance that would normally suggest significant inefficiencies in either the hardware or software.   In this case, however, the average of 2235 GFLOPS is more appropriately interpreted as 85.9% of the “adjusted peak performance” of 2602 GFLOPS (80% of the raw peak value — as limited by the instruction mix of the DGEMM kernel).    At 85.9% of the “adjusted peak performance”, there is no longer a significant upside to performance tuning.

Notes on DGEMM:

  1. For recent processors with power-limited frequencies, compute-intensive kernels will experience an average frequency that is a function of the characteristics of the specific processor die and of the effectiveness of the cooling system at the time of the run.  Other nodes will show lower average frequencies due to power/cooling limitations, so the numerical values must be adjusted accordingly — the percentage of peak obtained should be unchanged.
  2. It is possible to get higher values by disabling the L2 Hardware Prefetchers — up to about 2329 GFLOPS (89% of “adjusted peak”) — but that is a longer story for another day….
  3. The DGEMM efficiency values are not significantly limited by the use of all cores.  Single-core testing with the same DGEMM routine showed maximum values of just under 72% of the nominal peak (about 90% of “adjusted peak”).

Please Note: The throughput limitation we observed (12 vector instructions per 7 cycles = 85.7% of nominal peak) is significantly higher than the instruction-issue-limited vector throughput of the best DGEMM measurement we have ever observed (~73% of peak, or approximately 10 vector instructions every 7 cycles).   We are unaware of any real computational kernels whose optimal implementation will contain significantly fewer than 15% non-vector-pipe instructions, so the throughput limitation we observe is unlikely to be a significant performance limiter on any real scientific codes.  This note is therefore not intended as a criticism of the Xeon Phi x200 implementation — it is intended to document our exploration of the characteristics of this performance limitation.

Initial Experiments:

In order to approach the peak performance of the processor, we started with a slightly modified version of the code from the Colfax report above.  This code is entirely synthetic — it performs repeated FMA operations on a set of registers with no memory references in the inner loop.  The only non-FMA instructions are those required for loop control, and the number of FMA operations in the loop can be easily adjusted to change the fraction of “overhead” instructions.  The throughput limitation can be observed on a single core, so the following tests and analysis will be limited to this case.

Using the minimum number of accumulator registers needed to tolerate the pipeline latency (12), the assembly code for the inner loop is:

..B1.8: 
 addl $1, %eax
 vfmadd213pd %zmm16, %zmm17, %zmm29 
 vfmadd213pd %zmm16, %zmm17, %zmm28
 vfmadd213pd %zmm16, %zmm17, %zmm27 
 vfmadd213pd %zmm16, %zmm17, %zmm26 
 vfmadd213pd %zmm16, %zmm17, %zmm25 
 vfmadd213pd %zmm16, %zmm17, %zmm24 
 vfmadd213pd %zmm16, %zmm17, %zmm23 
 vfmadd213pd %zmm16, %zmm17, %zmm22 
 vfmadd213pd %zmm16, %zmm17, %zmm21 
 vfmadd213pd %zmm16, %zmm17, %zmm20 
 vfmadd213pd %zmm16, %zmm17, %zmm19
 vfmadd213pd %zmm16, %zmm17, %zmm18 
 cmpl $1000000000, %eax 
 jb ..B1.8

This loop contains 12 independent 512-bit FMA instructions and is executed 1 billion times.   Timers and hardware performance counters are measured immediately outside the loop, where their overhead is negligible.   Vector registers zmm18-zmm29 are the accumulators, while vector registers zmm16 and zmm17 are loop-invariant.

The loop has 15 instructions, so must require a minimum of 7.5 cycles to issue.  The three loop control instructions take 2 cycles (instead of 1.5) when measured in isolation.  When combined with other instructions, the loop control instructions require 1.5 cycles when combined with an odd number of additional instructions or 2.0 cycles in combination with an even number of additional instructions — i.e., in the absence of other stalls, the conditional branch causes the loop cycle count to round up to an integer value.  Equivalent sequences of two instructions that avoid the explicit compare instruction (e.g., pre-loading %eax with 1 billion and subtracting 1 each iteration) have either 1.0-cycle or 1.5-cycle overhead depending on the number of additional instructions (again rounding up to the nearest even cycle count).   The 12 FMA instructions are expected to require 6 cycles to issue, for a total of 8 cycles per loop iteration, or 8 billion cycles in total.   Experiments showed a highly repeatable 8.05 billion cycle execution time, with the 0.6% extra cycles almost exactly accounted for by the overhead of OS scheduler interrupts (1000 per second on this CentOS 7.4 kernel).   Note that 12 FMAs in 8 cycles is only 75% of peak, but the discrepancy here can be entirely attributed to loop overhead.

Further unrolling of the loop decreases the number of “overhead” instructions, and we expected to see an asymptotic approach to peak as the loop length was increased.  We were disappointed.

The first set of experiments compared the cycle and instruction counts for the loop above with the results from unrolling loop two and four times.    The table below shows the expected and measured instruction counts and cycle counts.

KNL 12-accumulator FMA throughput

Unrolling FactorFMA instructions per unrolled loop iterationNon-FMA instructions per unrolled loop iterationTotal Instructions per unrolled loop iterationExpected instructions (B)Measured Instructions (B)Expected Cycles (B) Measured Cycles (B)Unexpected Cycles (B)Expected %Peak GFLOPSMeasured %Peak GFLOPS% Performance shortfall
1123151515.01568.08.0560.05675.0%74.48%0.70%
22432713.513.51377.07.0860.08685.71%84.67%1.22%
44835112.7512.76376.57.0850.58592.31%84.69%8.26%
Comparison of expected and observed cycle counts for loops with 12 independent accumulators updated by 512-bit VFMADD213PD instructions on an Intel Xeon Phi 7250 processor. The loop is repeated until 12 billion FMA instructions have been executed.

 

Notes on methodology:

  • The unrolling was controlled by a “#pragma unroll_and_jam()” directive in the source code.   In each case the assembly code was carefully examined to ensure that the loop structure matched expectations — 12,24,48 FMAs with the appropriate ordering (for dependencies) and the same 3 loop control instructions (but with the iteration count reduced proportionately for the unrolled cases).
  • The node was allocated privately, non-essential daemons were disabled, and the test thread was bound to a single logical processor.
  • Instruction counts were obtained inline using the RDPMC instruction to read Fixed-Function Counter 0 (INST_RETIRED.ANY), while cycle counts were obtained using the RDPMC instruction to read Fixed-Function Counter 1 (CPU_CLK_UNHALTED.THREAD).
  • Execution time was greater than 4 seconds in all cases, so the overhead of reading the counters was at least 7 orders of magnitude smaller than the execution time.
  • Each test was run at least three times, and the trial with the lowest cycle count was used for the analysis in the table.

Comments on results:

  • The 12-FMA loop required 0.7% more cycles than expected.
    • Later experiments show that this overhead is essentially identical to the the fraction of cycles spent servicing the 1-millisecond OS scheduler interrupt.
  • The 24-FMA loop required 1.2% more cycles than expected.
    • About half of these extra cycles can be explained by the OS overhead, leaving an unexplained overhead in the 0.5%-0.6% range (not large enough to worry about).
  • The 48-FMA loop required 8.3% more cycles than expected.
    • Cycle count variations across trials varied by no more than 1 part in 4000, making this overhead at least 300 times the run-to-run variability.
  • The two unrolled cases gave performance results that appear to be bounded above by 6/7 (85.71%) of peak.

 

Initial (Incorrect) Hypothesis

My immediate response to the results was that this was a simple matter of running out of rename registers.   Modern processors (almost) all have more physical registers than they have register names.  The hardware automatically renames registers to avoid false dependencies, but with deep execution pipelines (particularly for floating-point operations), it is possible to run out of rename registers before reaching full performance.

This is easily illustrated using Little’s Law from queuing theory, which can be expressed as:

Throughput = Concurrency / Occupancy

For this scenario, “Throughput” has units of register allocations per cycle, “Concurrency” is the number of registers in use in support of all of the active instructions, and “Occupancy” is the average number of cycles that a register is busy during the execution of an instruction.

An illustrative example:   

The IBM POWER4 has 72 floating-point rename registers and two floating-point arithmetic units capable of executing fused multiply-add (FMA) instructions (a = b+c*d).   Each FMA instruction requires four registers, and these registers are all held for some number of cycles (discussed below), so full performance (both FMA units starting new operations every cycle) would require eight registers to be allocated each cycle (and for these registers to remain occupied for the duration of the corresponding instruction).   We can estimate the duration by reviewing the execution pipeline diagram (Figure 2-3) in The POWER4 Processor Introduction and Tuning Guide.  The exact details of when registers are allocated and freed is not published, but if we assume that registers are allocated in the “MP” stage (“Mapping”) and held until the “CP” (“Completion”, aka “retirement”) stage, then the registers will be held for a total of 12 cycles.  The corresponding pipeline stages from Figure 2-3 are: MP, ISS, RF, F1, F2, F3, F4, F5, F6, WB, Xfer, CP.  

Restating this in terms of Little’s Law, the peak performance of 2 FMAs per cycle corresponds to a “Throughput” of 8 registers allocated per cycle.  With an “Occupancy” of 12 cycles for each of those registers, the required “Concurrency” is 8*12 = 96 registers.  But, as noted above, the POWER4 only has 72 floating-point rename registers.  If we assume a maximum “Concurrency” of 72 registers, the “Throughput” can be computed as 72/12 = 6 registers per cycle, or 75% of the target throughput of 8 registers allocated per cycle.    It is perhaps not a coincidence that the maximum performance I ever saw on DGEMM on a POWER4 system (while working for IBM in the POWER4 design team) was just under 70% of 2 FMAs/cycle, or just over 92% of the occupancy-limited throughput of 1.5 FMAs/cycle.  


For comparison, the IBM POWER5 processor (similar to POWER4, but with 120 floating-point rename registers) delivered up to 94% of 2 FMAs/cycle on DGEMM, suggesting that a DGEMM efficiency in the 90%-95% of peak range is appropriate for DGEMM on this architecture family.

Applying this model to Xeon Phi x200 is slightly more difficult for a number of reasons, but back-of-the-envelope estimates suggested that it was plausible.

The usual way of demonstrating that rename register occupancy is limiting performance is to change the instructions to reduce the number of registers used per instruction, or the number of cycles that the instructions hold the register, or both.  If this reduces the required concurrency to less than the number of available rename registers, full performance should be obtained.

Several quick tests with instructions using fewer registers (e.g., simple addition instead of FMA) or with fewer registers and shorter pipeline latency (e.g, bitwise XOR) showed no change in throughput — the processor still delivered a maximum throughput of 12 vector instructions every 7 cycles.

Our curiosity was piqued by these results, and more experiments followed.   These piqued us even more, eventually leading to a suite of several hundred experiments in which we varied everything that we could figure out how to vary.

We will spare the reader the chronological details, and instead provide a brief overview of the scope of the testing that followed.

Extended Experiments:

Additional experiments (each performed with multiple degrees of unrolling) that showed no change in the limitation of 12 vector instructions per 7 cycles included:

  1. Increasing the dependency latency from 6 cycles to 8 cycles (i.e., using 16 independent vector accumulators) and extending the unrolling to up to 128 FMAs per inner loop iteration.
  2. Increasing the dependency latency to 10 cycles (20 independent vector accumulators), with unrolling to test 20, 40, 60, 80 FMAs per inner loop iteration.
  3. Increasing the dependency latency to 12 cycles (24 independent vector accumulators).
  4. Replacing the 512-bit VFMADD213PD instructions with the scalar version VFMADD213SD.  (This is the AVX-512 EVEX-encoded instruction, not the VEX-encoded version.)
  5. Replacing the 512-bit VFMADD213PD instructions with the AVX2 (VEX-encoded) 256-bit versions.
  6. Increasing the number of loop-invariant registers used from 2 to 4 to 8 (and ensuring that consecutive instructions used different pairs of loop-invariant registers).
  7. Decreasing the number of loop-invariant registers per FMA from 2 to 1, drawing the other input from the output of an FMA instruction at least 12 instructions (6 cycles) away.
  8. Replacing the VFMADD213PD instructions with shorter-latency instructions (VPADDQ and VPXORQ were tested independently).
  9. Replacing the VFMADD213PD instructions with an instruction that has both shorter latency and fewer operands: VPABSQ (which has only 1 input and 1 output register).
  10. Replacing every other VFMADD213PD instruction with a shorter-latency instruction (VPXORQ).
  11. Replacing the three-instruction loop control (add, compare, branch) with two-instruction loop control (subtract, branch).  The three-instruction version counts up from zero, then compares to the iteration count to set the condition code for the terminal conditional branch.  The two-instruction version counts down from the target iteration count to zero, allowing us to use the condition code from the subtract (i.e., not zero) as the branch condition, so no compare instruction is required.  The instruction counts changed as expected, but the cycle counts did not.
  12. Forcing 16-Byte alignment for the branch target at the beginning of the inner loop. (The compiler did this automatically in some cases but not in others — we saw no difference in cycle counts when we forced it to occur).
  13. Many (not all) of the executable files were disassembled with “objump -d” to ensure that the encoding of the instructions did not exceed the limit of 3 prefixes or 8 Bytes per instruction.  We saw no cases where either of these rules were violated in the inner loops.

Additional experiments showed that the throughput limitation only applies to instructions that execute in the vector pipes:

  1. Replacing the Vector instructions with integer ALU instructions (ADDL) –> performance approached two instructions per cycle asymptotically, as expected.
  2. Replacing the Vector instructions with Load instructions from L1-resident data (to vector registers) –> performance approached two instructions per cycle asymptotically, as expected.

Some vector instructions can execute in only one of the two vector pipelines.  This is mentioned in the IEEE Micro paper linked above, but is discussed in more detail in Chapter 17 of the document “Intel 64 and IA-32 Architectures Optimization Reference Manual”, (Intel document 248966, revision 037, July 2017).  In addition, Agner Fog’s “Instruction Tables” document (http://www.agner.org/optimize/instruction_tables.pdf) shows which of the two vector units is used for a large subset of the instructions that can only execute in one of the VPUs.   This allows another set of experiments that show:

  •   Each vector pipeline can sustain its full rate of 1 instruction per cycle when used in isolation.
    • VPU0 was tested with VPERMD, VPBROADCASTQ, and VPLZCNT.
    • VPU1 was tested with KORTESTW.
  • Alternating a VPU0 instruction (VPLZCNTQ) with a VPU1 instruction (KORTESTW) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
  • Alternating a VPU0 instruction with an FMA (that can be executed in either VPU) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
    • This was tested with VPERMD and VPLZCNT as the VPU0 instructions.
  • One specific combination of VPU0 and FMA instructions gave a reduced throughput of 1 vector instruction per cycle: VPBROADCASTQ alternating with FMA.
    • VPBROADCASTQ requires a read from a GPR (in the integer side of the core), then broadcasts the result across all the lanes of a vector register.
    • This operation is documented (in the Intel Optimization Reference Manual) as having a latency of 2 cycles and a maximum throughput of 1 per cycle (as we saw with VPBROADCASTQ running in isolation).
    • The GPR to VPU move is a sufficiently uncommon access pattern that it is not particularly surprising to find a case for which it inhibits parallelism across the VPUs, though it is unclear why this is the only case we found that allows the use of both vector pipelines but is still limited to 1 instruction per cycle.

Additional Performance Counter Measurements and Second-Order Effects:

After the first few dozens of experiments, the test codes were augmented with more hardware performance counters.  The full set of counters measured before and after the loop includes:

  • Time Stamp Counter (TSC)
  • Fixed-Function Counter 0 (Instructions Retired)
  • Fixed-Function Counter 1 (Core Cycles Not Halted)
  • Fixed-Function Counter 2 (Reference Cycles Not Halted)
  • Programmable PMC0
  • Programmable PMC1

The TSC was collected with the RDTSC instruction, while the other five counters were collected using the RDPMC instruction.  The total overhead for measuring these six counters is about 250 cycles, compared to a minimum of 4 billion cycles for the monitored loop.

Several programmable performance counter events were collected as “sanity checks”, with negligible counts (as expected):

  • FETCH_STALL.ICACHE_FILL_PENDING_CYCLES
  • MS_DECODED.MS_ENTRY
  • MACHINE_CLEARS.FP_ASSIST

Another programmable performance counter event was collected to verify that the correct number of VPU instructions were being executed:

  • UOPS_RETIRED.PACKED_SIMD
    • Typical result:
      • Nominal expected 16,000,000,000
      • Measured in user-space: 16,000,000,016
        • This event does not count the 16 loads before the inner loop, but does count the 16 stores after the end of the inner loop.
      • Measured in kernel-space: varied from 19,626 to 21,893.
        • Not sure why the kernel is doing packed SIMD instructions, but these are spread across more than 6 seconds of execution time (>6000 scheduler interrupts).
        • These kernel instruction counts are 6 orders of magnitude smaller than the counts for tested code, so they will be ignored here.

The performance counter events with the most interesting results were:

  • NO_ALLOC_CYCLES.RAT_STALL — counts the number of core cycles in which no micro-ops were allocated and the “RAT Stall” (reservation station full) signal is asserted.
  • NO_ALLOC_CYCLES.ROB_FULL — counts the number of core cycles in which no micro-ops were allocated and the Reorder Buffer (ROB) was full.
  • RS_FULL_STALL.ALL — counts the number of core cycles in which the allocation pipeline is stalled and any of the Reservation Stations is full
    • This should be the same as NO_ALLOC_CYCLES.RAT_STALL, and in all but one case the numbers were nearly identical.
    • The RS_FULL_STALL.ALL event includes a Umask of 0x1F — five bits set.
      • This is consistent with the IEEE Micro paper (linked above) that shows 2 VPU reservation stations, 2 integer reservation stations, and one memory unit reservation station.
      • The only other Umask defined in the Intel documentation is RS_FULL_STALL.MEC (“Memory Execution Cluster”) with a value of 0x01.
      • Directed testing with VPU0 and VPU1 instructions shows that a Umask of 0x08 corresponds to the reservation station for VPU0 and a Umask of 0x10 corresponds to the reservation station for VPU1.

For the all-FMA test cases that were expected to sustain more than 12 VPU instructions per 7 cycles, the NO_ALLOC_CYCLES.RAT_STALL and RS_FULL_STALL.ALL events were a near-perfect match for the number of extra cycles taken by the loop execution.  The values were slightly larger than computation of “extra cycles”, but were always consistent with the assumption of 1.5 cycles “overhead” for the three loop control instructions (matching the instruction issue limit), rather than the 2.0 cycles that I assumed as a baseline.  This is consistent with a NO_ALLOC_CYCLES.RAT_STALL count that overlaps with cycles that are simultaneously experiencing a branch-related pipeline bubble. One or the other should be counted as a stall, but not both.   For these cases, the NO_ALLOC_CYCLES.ROB_FULL counts were negligible.

Interestingly, the individual counts for RS_FULL_STALL for the two vector pipelines were sometimes of similar magnitude and sometimes uneven, but were extremely stable for repeated runs of a particular binary.  The relative counts for the stalls in the two vector pipelines can be modified by changing the code alignment and/or instruction ordering.  In limited tests, it was possible to make either VPU report more stalls than the other, but in every case, the “effective” stall count (VPU0 stalled OR VPU1 stalled) was the amount needed to reduce the throughput to 12 VPU instructions every 7 cycles.

When interleaving vector instructions of different latencies, the total number of stall cycles remained the same (i.e., enough to limit performance to 12 VPU instructions per 7 cycles), but these were split between RAT_STALLs and ROB_STALLs in different ways for different loop lengths.   Typical results showed a pattern like:

  • 16 VPU instructions per loop iteration: approximately zero stalls, as expected
  • 32 VPU instructions per loop iteration: approximately 6.7% RAT_STALLs and negligible ROB_STALLs
  • 64 VPU instructions per loop iteration: ~1% RAT_STALLs (vs ~10% in the all-FMA case) and about 9.9% ROB_STALLs (vs ~0% in the all-FMA case).
    • Execution time increased by about 0.6% relative to the all-FMA case.
  • 128 VPU instructions per loop iteration: negligible RAT_STALLS (vs ~12% in the all-FMA case) and almost 20% ROB_STALLS (vs 0% in the all-FMA case).
    • Execution time increased by 9%, to a value that is ~2.3% slower than the 16-VPU-instruction case.

The conversion of RAT_STALLs to ROB_STALLs when interleaving instructions of different latencies does not seem surprising.  RAT_STALLs occur when instructions are backed up before execution, while ROB_STALLs occur when instructions back up before retirement.  Alternating instructions of different latencies seems guaranteed to push the shorter-latency instructions from the RAT to the ROB until the ROB backs up.  The net slowdown at 128 VPU instructions per loop iteration is not a performance concern, since asymptotic performance is available with anywhere between 24 and (almost) 64 VPU instructions in the inner loop.   These results are included because they might provide some insight into the nature of the mechanisms that limits throughput of vector instructions.

Mechanisms:

RAT_STALLs count the number of cycles in which the Allocate/Rename unit does not dispatch any micro-ops because a target Reservation Station is full.   While this does not directly equate to execution stalls (i.e., no instructions dispatched from the Vector Reservation Station to the corresponding Vector Execution Pipe), the only way the Reservation Station can become full (given an instruction stream with enough independent instructions) is the occurrence of cycles in which instructions are received (from the Allocate/Rename unit), but in which no instruction can be dispatched.    If this occurs repeatedly, the 20-entry Reservation Station will become full, and the RAT_STALL signal will be asserted to prevent the Allocate/Rename unit from sending more micro-ops.

An example code that generates RAT Stalls is a modification of the test code using too few independent accumulators to fully tolerate the pipeline latency.  For example, using 10 accumulators, the code can only tolerate 5 cycles of the 6 cycle latency of the FMA operations.  This inhibits the execution of the FMAs, which fill up the Reservation Station and back up to stall the Allocate/Rename.   Tests with 10..80 FMAs per inner loop iteration show RAT_STALL counts that match the dependency stall cycles that are not overlapped with loop control stall cycles.

We know from the single-VPU tests that the 20-entry Reservation Station for each Vector pipeline is big enough for that pipeline’s operation — no stall cycles were observed.  Therefore the stalls that prevent execution dispatch must be in the shared resources further down the pipeline.   From the IEEE Micro paper, the first execution step is to read the input values from the “rename buffer and the register file”, after which the operations proceed down their assigned vector pipeline.  The vector pipelines should be fully independent until the final execution step in which they write their output values to the rename buffer.  After this, the micro-ops will wait in the Reorder Buffer until they can be retired in program order.  If the bottleneck was in the retirement step, then I would expect the stalls to be attributed to the ROB, not the RAT.   Since the stalls in the most common cases are overwhelming RAT stalls, I conclude that the congestion is not *directly* related to instruction retirement.

As mentioned above, the predominance of RAT stalls suggests that limitations at Retirement cannot be directly responsible for the throughput limitation, but there may be an indirect mechanism at work.   The IEEE Micro paper’s section on the Allocation Unit says:

“The rename buffer stores the results of the in-flight micro-ops until they retire, at which point the results are transferred to the architectural register file.”

This comment is consistent with Figure 3 of the paper and with the comment that vector instructions read their input arguments from the “rename buffer and the register file”, implying that the rename buffer and register file are separate register arrays.  In many processor implementations there is a single “physical register” array, with the architectural registers being the subset of the physical registers that are pointed to by a mapping vector.  The mapping vector is updated every time instructions retire, but the contents of the registers do not need to be copied from one place to another.  The description of the Knights Landing implementation suggests that at retirement, results are read from the “rename buffer” and written to the “register file”.  This increases the number of ports required, since this must happen every cycle in parallel with the first step of each of the vector execution pipelines.  It seems entirely plausible that such a design could include a systematic conflict (a “structural hazard”) between the accesses needed by the execution pipes and the accesses needed by the retirement unit.  If this conflict is resolved in favor of the retirement unit, then execution would be stalled, the Reservation Stations would fill up, and the observed behavior could be generated.   If such a conflict exists, it is clearly independent of the number of input arguments (since instructions with 1, 2, and 3 input arguments have the same behavior), leaving the single output argument as the only common feature.  If such a conflict exists, it must almost certainly also be systematic — occurring independent of alignment, timing, or functional unit details — otherwise it seems likely that we would have seen at least one case in the hundreds of tests here that violates the 12/7 throughput limit.

Tests using a variant of the test code with much smaller loops (varying between 160 and 24,000 FMAs per measurement interval, repeated 100,000 times) also strongly support the 12/7 throughput limit.  In every case the minimum cycle count over the 100,000 iterations  was consistent with 12 VPU instructions every 7 cycles (plus measurement overhead).

 

Summary:

The Intel Xeon Phi x200 (Knights Landing) appears to have a systematic throughput limit of 12 Vector Pipe instructions per 7 cycles — 6/7 of the nominal peak performance.  This throughput limitation is not displayed by the integer functional units or the memory units.  Due to the two-instruction-per-cycle limitations of allocate/rename/retire, this performance limit in the vector units is not expected to have an impact on “real” codes.   A wide variety of tests were performed to attempt to develop quantitative models that might account for this limitation, but none matched the specifics of the observed timing and performance counts.

Postscript:

After Damon McDougall’s presentation at the IXPUG 2018 Fall Conference, we talked to a number of Intel engineers who were familiar with this issue.  Unfortunately, we did not get a clear indication of whether their comments were covered by non-disclosure agreements, so if they gave us an explanation, I can’t repeat it….

Posted in Computer Hardware, Performance, Performance Counters | Comments Off on A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Posted by John D. McCalpin, Ph.D. on 6th December 2016

The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.

It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)

The modes that are important are:

  • “Flat” vs “Cache”
    • In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
      • The OS exposes this memory as being on “NUMA node 1”, so it can be accessed using the standard NUMA control facilities (e.g., numactl).
      • Sustained bandwidth from MCDRAM is highest in “Flat” mode.
    • In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
      • In this mode the MCDRAM is invisible and effectively uncontrollable.  I will discuss the performance characteristics of Cache mode at a later date.
  • “All-to-All” vs “Quadrant”
    • In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
      • Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
    • In “Quadrant” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but the each address is assigned to one of the MCDRAM controllers in the same “quadrant” as the coherence controller.
      • This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
      • Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
        • Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
  • “Sub-NUMA-Cluster”
    • There are several of these modes, only one of which will be discussed here.
    • “Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
      • “node 0” owns the 1st quarter of contiguous physical address space.
        • The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
        • Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
        • The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
      • Ditto for nodes 1, 2, and 3.

The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).

My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory.  The values presented are averaged over many addresses, with the ranges showing the variation of average latency across cores.

Mode of OperationFlat-QuadrantFlat-All2AllSNC4 localSNC4 remote
MCDRAM maximum latency (ns)156.1158.3153.6164.7
MCDRAM average latency (ns)154.0155.9150.5156.8
MCDRAM minimum latency (ns)152.3154.4148.3150.3
MCDRAM standard deviation (ns)1.01.00.93.1

Caveats:

  • My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
  • Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
    • E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
  • Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.

Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….

The corresponding data for addresses mapped to DDR4 memory are included in the table below:

Mode of OperationFlat-QuadrantFlat-All2AllSNC4 localSNC4 remote
DDR4 maximum latency (ns)133.3136.8130.0141.5
DDR4 average latency (ns)130.4131.8128.2133.1
DDR4 minimum latency (ns)128.2128.5125.4126.5
DDR4 standard deviation (ns)1.22.41.13.1

There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.

Posted in Cache Coherence Implementations, Computer Architecture, Computer Hardware, Performance | Comments Off on Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on 5th November 2016

The art and science of microprocessor architecture is a never-ending struggling to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four (consecutive) 32-bit memory values from memory.   As an example of one mode of operation, the four steps are:

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: http://www.agner.org/optimize/instruction_tables.pdf)
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
    • I am sure someone will be happy to tell me if I am wrong!

So that was a lot of detail — what is the point?

The first point relates the the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.

 

This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.

Implications?

  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g., http://dl.acm.org/citation.cfm?id=1730587), for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many approaches to neural network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?


UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | 2 Comments »