John McCalpin's blog

Dr. Bandwidth explains all….

Why I hate MPI (from a performance analysis perspective)

Posted by John D. McCalpin, Ph.D. on 1st August 2018

According to Dr. Bandwidth, performance analysis has two recurring themes:

  1. How fast should this code (or “simple” variations on this code) run on this hardware?
  2. If I am analyzing (apparent) performance shortfalls, how can I distinguish between cause and effect?

For very simple codes, it may be possible to do a high-level analysis on performance limitations, but once the code becomes complex, it is often necessary to investigate the full stack.  This can start with either a “top-down” or “bottom-up” approach, but in complex codes running on complex hardware, what is really required is both approaches — iterated until the interactions between all the components are understood.  This is an intellectually challenging and labor-intensive exercise, requiring detailed review of the published details of each of the components of the system, and usually requiring significant “detective work” (using customized microbenchmarks, hardware performance counter analysis, and creative thinking) to fill in the gaps.

MPI is a commonly used standard to implement message-passing programs on computer clusters.   The basic interfaces are simple to learn, and I suspect that many people have no idea how many levels of interacting components are involved in executing an MPI program on a cluster.

A discussion with a colleague prompted me to try to write down a list of the interacting components, along with some comments on why it is so hard to understand the interactions between components.

Interacting components in the execution of an MPI job — a brief outline (from memory):

  1. The user source code, which contains an ordered set of calls to MPI routines.
    • There are many different types of MPI calls that can be used to perform the required communication.
    • For implementations containing multiple MPI calls, there are (typically) many orders in which the routines can be called that are all functionally equivalent (see the first code sketch after this outline).
    • The MPI standard requires that many sequences of calls (appear to) be ordered, but there are exceptions.
  2. The user execution environment.
    • This will typically include environment variables that can influence the behavior of the MPI runtime, and might include environment variables that can influence the behavior of the lower-level shared-memory transport and/or network hardware interfaces.
    • The user environment defines the mapping of MPI ranks to hardware resources (cores, sockets, nodes).
  3. The MPI runtime library.
    • An MPI runtime library will typically include multiple implementations of each MPI function, and will choose implementation(s) to execute based on environment variables and run-time information (sizes, communicators, etc.), in ways that are seldom transparent.
      • The source code to the library may not be available.
      • Low-level MPI tracing may be required to determine what the runtime library decided to execute, and even that may not be enough to fully capture the “decisions” made at lower levels of the implementation.
    • An MPI library may aggregate or re-order messages in any manner that provides the appearance of conforming to the ordering rules of the MPI specification.
      • In particular, the MPI library may violate ordering rules of the MPI standard if it can prove (typically via runtime checks) that the result is the same as what would be obtained by code that explicitly followed the ordering rules. This sort of speculation means that multiple runs of the same code on the same nodes may end up performing different sequences of operations.
    • MPI runtime libraries are typically built to interact with one or more lower-level interfaces to shared-memory transport implementation(s) and networking hardware.
      • The implementations of such interfaces may be based on anywhere between a superficial understanding of the lower-level interface, and a deeply detailed understanding of the lower-level interface.
      • Different levels of understanding may lead to different choices for exactly how the MPI library chooses to implement a function. For example, the MPI library may (or may not) know specific details of the ordering rules followed by the low-level interface, and so may (or may not) be able to choose different sequences of low-level operations.
  4. The implementation(s) of the low-level interface(s) to the shared-memory transport implementation and to the networking hardware interface.
    • These vary in complexity and sophistication.
    • The explicit functionality exposed in these low-level APIs may or may not be a good map to the functionality of the MPI standard(s) and/or to the functionality of the underlying hardware.
    • Mismatches in semantics or ordering rules will require an implementation to choose overly conservative sequences of low-level operations, which typically reduces performance.
  5. The processor hardware available to support shared-memory transport.
    • Current processor architectures do not include cross-core communication or synchronization as first-class architectural concepts, so implementation of communication and synchronization must be done using ordered sequences of loads and stores (a minimal sketch appears after this outline).
    • Detailed analysis of the cache coherence transactions involved in shared-memory synchronization (especially across sockets) shows that the lower bound on (uncontested) latency is at least an order of magnitude higher than what a hardware implementation should be able to support. For highly contested accesses, shared-memory synchronization latency is typically several orders of magnitude higher than what a hardware implementation should be able to support.  Few implementations have performance that approaches the lower bounds on shared-memory synchronization latency.
    • For almost all recent systems, a single thread of execution can only generate a fraction of the bandwidth available within or between sockets, limiting performance of the “one MPI task per socket” hybrid programming model.
      • Full bandwidth to/from local DDR4 memory typically requires 6 or more cores per socket.
      • Full bandwidth to/from high-bandwidth memory or L3 cache typically requires half or more of the cores in the socket.
      • Full bandwidth between sockets varies by processor generation and access pattern, but also typically requires 6 or more cores.
  6. The networking hardware.
    • Low-level implementation details and performance characterization data is not typically available for network interfaces and switches.
    • Most networking hardware was not designed to support highly efficient implementation of the MPI standards.
    • The MPI standards were not designed to support highly efficient implementations on most networking hardware.
  7. Runtime contention
    • Although most supercomputing systems use “space-sharing” to provide exclusive access to nodes, most use an interconnect that does not separate traffic between different users’ jobs.
    • Many current interconnects (InfiniBand, OmniPath) use static routing, making self-conflicts unavoidable (in practical terms) for any MPI job including nodes on more than one leaf switch.
      • It is theoretically possible to avoid self-conflicts on a dedicated (non-oversubscribed) system, but this requires a herculean effort to load and parse the routing tables and to schedule all communications to avoid mapping more than one pairwise communication to any link in the multi-level switch hierarchy.
      • Because routing tables are not required to be symmetric, it might not be possible to create a non-self-conflicting schedule of MPI rank pairs if bidirectional communication is used.
      • Scheduling communications to avoid self-conflicts requires global barriers between steps of pairwise communications (e.g., between steps in an all-to-all communication). This is not “natural” for many/most MPI jobs, and the barrier overhead may result in overall performance degradation if the messages in the original communication are short.
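
As a concrete illustration of item 1 above, here is a minimal sketch (in C, with hypothetical function and buffer names, not taken from any real application) of two functionally equivalent ways to exchange data with a neighboring rank. The MPI library is free to implement these with very different sequences of lower-level operations, which is part of what makes performance attribution difficult.

#include <mpi.h>

void exchange_blocking(double *sendbuf, double *recvbuf, int n,
                       int neighbor, int rank, MPI_Comm comm)
{
    /* Explicitly ordered blocking calls: even ranks send first and
     * odd ranks receive first, to avoid deadlock. */
    if (rank % 2 == 0) {
        MPI_Send(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm);
        MPI_Recv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm);
    }
}

void exchange_nonblocking(double *sendbuf, double *recvbuf, int n,
                          int neighbor, MPI_Comm comm)
{
    /* The same data movement expressed with nonblocking calls, which
     * gives the runtime more freedom to reorder and aggregate. */
    MPI_Request req[2];
    MPI_Irecv(recvbuf, n, MPI_DOUBLE, neighbor, 0, comm, &req[0]);
    MPI_Isend(sendbuf, n, MPI_DOUBLE, neighbor, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}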
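
Item 5's point that shared-memory communication must be synthesized from ordered loads and stores can be illustrated with a minimal single-slot producer/consumer handoff using C11 atomics. This is only a sketch of the general technique (not the transport used by any particular MPI library), but it shows why each transfer costs at least a round trip of cache coherence transactions on the flag (and payload) cache lines.

#include <stdatomic.h>

typedef struct {
    double payload;
    _Atomic int flag;     /* 0 = empty, 1 = full */
} mailbox_t;

void send_one(mailbox_t *m, double value)
{
    while (atomic_load_explicit(&m->flag, memory_order_acquire) != 0)
        ;                                  /* spin until the consumer drains it */
    m->payload = value;                    /* ordinary store of the data */
    atomic_store_explicit(&m->flag, 1, memory_order_release);  /* publish */
}

double recv_one(mailbox_t *m)
{
    while (atomic_load_explicit(&m->flag, memory_order_acquire) != 1)
        ;                                  /* spin until the producer fills it */
    double value = m->payload;
    atomic_store_explicit(&m->flag, 0, memory_order_release);  /* mark empty */
    return value;
}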

Understanding any one of these components is not usually too hard (unless the implementation is undocumented).

Understanding the interaction of any two interacting components is a bit more work, but is typically manageable by a single person.

Understanding enough of the details of all of the components to be able to reason about the interactions typically requires at least 2-3 people with complementary expertise. Keeping up with changes to the hardware and software, and applying the analyses to the ever-expanding ensemble of user applications would make this a large fraction of a full-time assignment for the team members.

To make it worse, a supercomputing center like TACC has many different systems with a variety of interconnect hardware (several generations of Mellanox InfiniBand, Intel OmniPath, Cray Aries) and several different MPI stacks (Intel MPI, Cray’s version of MPICH, MVAPICH, OpenMPI). Some of these are sufficiently different that it may be necessary to have independent teams specializing in “full-stack” performance analyses for different systems.

Posted in Computer Architecture, Computer Hardware, Performance

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Posted by John D. McCalpin, Ph.D. on 22nd January 2018


Introduction:

In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report (https://colfaxresearch.com/skl-avx512/) to the Xeon Phi x200 (Knights Landing) processors here at TACC.   There was no deep goal — just a desire to see the maximum GFLOPS in action.     The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished.   Instead, we found puzzle after puzzle.  After almost four weeks, we have a solid characterization of the behavior — no tested code exceeds an execution rate of 12 vector pipe instructions every 7 cycles (6/7 of the nominal peak) when executed on a single core — but we are unable to propose a testable quantitative model for the source of the throughput limitation.

Dr. Damon McDougall gave a short presentation on this study at the IXPUG 2018 Fall Conference (pdf) — I originally wrote these notes to help organize my thoughts as we were preparing the IXPUG presentation, and later decided that the extra details contained here are interesting enough to post here.

 

Background:

The Xeon Phi x200 (Knights Landing) processor is Intel’s second-generation many-core product.  The Xeon Phi 7250 processors at TACC have 68 cores per processor, and each core has two 512-bit SIMD vector pipelines.   For 64-bit floating-point data, the 512-bit Fused Multiply-Add (FMA) instruction performs 16 floating-point operations (8 adds and 8 multiplies).  Each of the two vector units can issue one FMA instruction per cycle, assuming that there are enough independent accumulators to tolerate the 6-cycle dependent-operation latency.  The minimum number of independent accumulators required is: 2 VPUs times 6 cycles = 12 independent accumulators.

The Xeon Phi x200 has six execution units (two VPUs, two ALUs, and two Memory units), but is limited to two instructions per cycle by the decode, allocation, and retirement sections of the processor pipeline. (Most of the details of the Xeon Phi x200 series presented here are from the Intel-authored paper http://publications.computer.org/micro/2016/07/09/knights-landing-second-generation-intel-xeon-phi-product/.)

In our initial evaluation of the Xeon Phi x200, we did not fully appreciate the two-instruction-per-cycle limitation.  Since “peak performance” for the processor is two (512-bit SIMD) FMA instructions per cycle, any instructions that are not FMA instructions subtract directly from the available peak performance.  On “mainstream” Xeon processors, there is plenty of instruction decode/allocation/retirement bandwidth to overlap extra instructions with the SIMD FMA instructions, so we don’t usually even think about them.  Pointer arithmetic, loop index increments, loop trip count comparisons, and conditional branches are all essentially “free” on mainstream Xeon processors, but have to be considered very carefully on the Xeon Phi x200.

A “best case” scenario: DGEMM

The double-precision matrix multiplication routine “DGEMM” is typically the computational kernel that achieves the highest fraction of peak performance on high performance computing systems.  Hardware performance counter results for a simple benchmark code calling Intel’s optimized DGEMM implementation for this processor (from the Intel MKL library) show that about 20% of the dynamic instruction count consists of instructions that are not packed SIMD operations (i.e., not FMAs).  These “non-FMA” instructions include the pointer manipulation and loop control required by any loop-based code, plus explicit loads from memory and a smaller number of stores back to memory. (These are in addition to the loads that can be “piggy-backed” onto FMA instructions using the single memory input operand available for most computational operations in the Intel instruction set).  DGEMM implementations also typically require software prefetches to be interspersed with the computation to minimize memory stalls when moving from one “block” of the computation to the next.

Published DGEMM benchmark results for the Xeon Phi 7250 processor (https://software.intel.com/en-us/mkl/features/benchmarks) show maximum values of about 2100 GFLOPS when using all 68 cores (a very approximate estimate from a bar chart). Tests on one TACC development node gave slightly higher results — 2148 GFLOPS to 2254 GFLOPS (average = 2235 GFLOPS), for a set of 180 trials of a DGEMM test with M=N=K=8000 and using all 68 cores.   These runs reported a stable average frequency of 1.495 GHz, so the average of 2235 GFLOPS therefore corresponds to 68.7% of the peak performance of (68 cores * 32 FP ops/cycle/core * 1.495 GHz =) 3253 GFLOPS (note1). This is an uninspiring fraction of peak performance that would normally suggest significant inefficiencies in either the hardware or software.   In this case, however, the average of 2235 GFLOPS is more appropriately interpreted as 85.9% of the “adjusted peak performance” of 2602 GFLOPS (80% of the raw peak value — as limited by the instruction mix of the DGEMM kernel).    At 85.9% of the “adjusted peak performance”, there is no longer a significant upside to performance tuning.
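
For reference, a sketch of the kind of timing test used to obtain numbers like those above is shown below. It assumes Intel MKL's standard CBLAS interface, mkl_malloc for aligned allocation, and OpenMP timing; thread binding, error checking, and the repeated trials described in the text are simplified or omitted.

#include <stdio.h>
#include <mkl.h>
#include <omp.h>

int main(void)
{
    const MKL_INT n = 8000;    /* M = N = K = 8000, as in the runs described above */
    double *A = mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = mkl_malloc((size_t)n * n * sizeof(double), 64);
    for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    /* Warm-up call so that library initialization and page faults are not timed. */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    double t0 = omp_get_wtime();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    double t1 = omp_get_wtime();

    double gflops = 2.0 * n * n * n / (t1 - t0) / 1e9;   /* 2*M*N*K FP operations */
    printf("DGEMM: %.1f GFLOPS\n", gflops);

    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}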

Notes on DGEMM:

  1. For recent processors with power-limited frequencies, compute-intensive kernels will experience an average frequency that is a function of the characteristics of the specific processor die and of the effectiveness of the cooling system at the time of the run.  Other nodes will show lower average frequencies due to power/cooling limitations, so the numerical values must be adjusted accordingly — the percentage of peak obtained should be unchanged.
  2. It is possible to get higher values by disabling the L2 Hardware Prefetchers — up to about 2329 GFLOPS (89% of “adjusted peak”) — but that is a longer story for another day….
  3. The DGEMM efficiency values are not significantly limited by the use of all cores.  Single-core testing with the same DGEMM routine showed maximum values of just under 72% of the nominal peak (about 90% of “adjusted peak”).

Please Note: The throughput limitation we observed (12 vector instructions per 7 cycles = 85.7% of nominal peak) is significantly higher than the instruction-issue-limited vector throughput of the best DGEMM measurement we have ever observed (~73% of peak, or approximately 10 vector instructions every 7 cycles).   We are unaware of any real computational kernels whose optimal implementation will contain significantly fewer than 15% non-vector-pipe instructions, so the throughput limitation we observe is unlikely to be a significant performance limiter on any real scientific codes.  This note is therefore not intended as a criticism of the Xeon Phi x200 implementation — it is intended to document our exploration of the characteristics of this performance limitation.

Initial Experiments:

In order to approach the peak performance of the processor, we started with a slightly modified version of the code from the Colfax report above.  This code is entirely synthetic — it performs repeated FMA operations on a set of registers with no memory references in the inner loop.  The only non-FMA instructions are those required for loop control, and the number of FMA operations in the loop can be easily adjusted to change the fraction of “overhead” instructions.  The throughput limitation can be observed on a single core, so the following tests and analysis will be limited to this case.
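
A minimal sketch of such a kernel written with AVX-512 intrinsics is shown below. This is an illustration of the structure (12 independent accumulators, two loop-invariant operands, no memory references in the inner loop), not the actual test code, which was derived from the Colfax report; the compiler is expected to keep the accumulators in registers and fully unroll the short inner loop.

#include <immintrin.h>

/* Sketch only: compile with AVX-512 enabled (e.g., -mavx512f or -xMIC-AVX512). */
double fma_kernel(long iters)
{
    __m512d acc[12];                               /* 12 independent accumulators */
    const __m512d a = _mm512_set1_pd(1.0000001);   /* loop-invariant operands     */
    const __m512d b = _mm512_set1_pd(0.9999999);
    for (int i = 0; i < 12; i++)
        acc[i] = _mm512_set1_pd((double)i);

    for (long k = 0; k < iters; k++)
        for (int i = 0; i < 12; i++)               /* expected to be fully unrolled */
            acc[i] = _mm512_fmadd_pd(a, b, acc[i]);  /* dependence chain per accumulator */

    /* Reduce the accumulators so the compiler cannot discard the loop. */
    double tmp[8] __attribute__((aligned(64)));
    double sum = 0.0;
    for (int i = 0; i < 12; i++) {
        _mm512_store_pd(tmp, acc[i]);
        for (int j = 0; j < 8; j++)
            sum += tmp[j];
    }
    return sum;
}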

Using the minimum number of accumulator registers needed to tolerate the pipeline latency (12), the assembly code for the inner loop is:

..B1.8: 
 addl $1, %eax
 vfmadd213pd %zmm16, %zmm17, %zmm29 
 vfmadd213pd %zmm16, %zmm17, %zmm28
 vfmadd213pd %zmm16, %zmm17, %zmm27 
 vfmadd213pd %zmm16, %zmm17, %zmm26 
 vfmadd213pd %zmm16, %zmm17, %zmm25 
 vfmadd213pd %zmm16, %zmm17, %zmm24 
 vfmadd213pd %zmm16, %zmm17, %zmm23 
 vfmadd213pd %zmm16, %zmm17, %zmm22 
 vfmadd213pd %zmm16, %zmm17, %zmm21 
 vfmadd213pd %zmm16, %zmm17, %zmm20 
 vfmadd213pd %zmm16, %zmm17, %zmm19
 vfmadd213pd %zmm16, %zmm17, %zmm18 
 cmpl $1000000000, %eax 
 jb ..B1.8

This loop contains 12 independent 512-bit FMA instructions and is executed 1 billion times.   Timers and hardware performance counters are measured immediately outside the loop, where their overhead is negligible.   Vector registers zmm18-zmm29 are the accumulators, while vector registers zmm16 and zmm17 are loop-invariant.

The loop has 15 instructions, so must require a minimum of 7.5 cycles to issue.  The three loop control instructions take 2 cycles (instead of 1.5) when measured in isolation.  When combined with other instructions, the loop control instructions require 1.5 cycles when combined with an odd number of additional instructions or 2.0 cycles in combination with an even number of additional instructions — i.e., in the absence of other stalls, the conditional branch causes the loop cycle count to round up to an integer value.  Equivalent sequences of two instructions that avoid the explicit compare instruction (e.g., pre-loading %eax with 1 billion and subtracting 1 each iteration) have either 1.0-cycle or 1.5-cycle overhead depending on the number of additional instructions (again rounding the loop cycle count up to the nearest integer).   The 12 FMA instructions are expected to require 6 cycles to issue, for a total of 8 cycles per loop iteration, or 8 billion cycles in total.   Experiments showed a highly repeatable 8.05 billion cycle execution time, with the 0.6% extra cycles almost exactly accounted for by the overhead of OS scheduler interrupts (1000 per second on this CentOS 7.4 kernel).   Note that 12 FMAs in 8 cycles is only 75% of peak, but the discrepancy here can be entirely attributed to loop overhead.

Further unrolling of the loop decreases the number of “overhead” instructions, and we expected to see an asymptotic approach to peak as the loop length was increased.  We were disappointed.

The first set of experiments compared the cycle and instruction counts for the loop above with the results from unrolling the loop two and four times.    The table below shows the expected and measured instruction counts and cycle counts.

KNL 12-accumulator FMA throughput

Unrolling Factor | FMA instructions per unrolled loop iteration | Non-FMA instructions per unrolled loop iteration | Total instructions per unrolled loop iteration | Expected instructions (B) | Measured instructions (B) | Expected cycles (B) | Measured cycles (B) | Unexpected cycles (B) | Expected %Peak GFLOPS | Measured %Peak GFLOPS | % Performance shortfall
1 | 12 | 3 | 15 | 15 | 15.0156 | 8.0 | 8.056 | 0.056 | 75.0% | 74.48% | 0.70%
2 | 24 | 3 | 27 | 13.5 | 13.5137 | 7.0 | 7.086 | 0.086 | 85.71% | 84.67% | 1.22%
4 | 48 | 3 | 51 | 12.75 | 12.7637 | 6.5 | 7.085 | 0.585 | 92.31% | 84.69% | 8.26%
Comparison of expected and observed cycle counts for loops with 12 independent accumulators updated by 512-bit VFMADD213PD instructions on an Intel Xeon Phi 7250 processor. The loop is repeated until 12 billion FMA instructions have been executed.
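
The “expected” columns in the table follow directly from the two-instruction-per-cycle issue limit and the rounding behavior described above. A small model of that arithmetic (a sketch, not part of the original test code) reproduces them:

#include <math.h>

/* Expected cycles (in billions) for a loop that executes 'total_fma_B' billion
 * FMAs in groups of 'fma_per_iter' per iteration, with 3 loop-control
 * instructions and a 2-instruction-per-cycle issue limit; the per-iteration
 * cycle count rounds up to an integer because the branch ends the issue group. */
double expected_cycles_B(int fma_per_iter, double total_fma_B)
{
    int    insts_per_iter  = fma_per_iter + 3;           /* add, compare, branch */
    int    cycles_per_iter = (int)ceil(insts_per_iter / 2.0);
    double iterations_B    = total_fma_B / fma_per_iter;
    return cycles_per_iter * iterations_B;
}

/* Expected fraction of peak: FMAs per iteration vs. the 2-FMA-per-cycle peak. */
double expected_peak_fraction(int fma_per_iter)
{
    int cycles_per_iter = (int)ceil((fma_per_iter + 3) / 2.0);
    return (double)fma_per_iter / (2.0 * cycles_per_iter);
}

/* expected_cycles_B(12, 12.0) = 8.0,  expected_peak_fraction(12) = 0.7500
 * expected_cycles_B(24, 12.0) = 7.0,  expected_peak_fraction(24) = 0.8571
 * expected_cycles_B(48, 12.0) = 6.5,  expected_peak_fraction(48) = 0.9231 */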

 

Notes on methodology:

  • The unrolling was controlled by a “#pragma unroll_and_jam()” directive in the source code.   In each case the assembly code was carefully examined to ensure that the loop structure matched expectations — 12,24,48 FMAs with the appropriate ordering (for dependencies) and the same 3 loop control instructions (but with the iteration count reduced proportionately for the unrolled cases).
  • The node was allocated privately, non-essential daemons were disabled, and the test thread was bound to a single logical processor.
  • Instruction counts were obtained inline using the RDPMC instruction to read Fixed-Function Counter 0 (INST_RETIRED.ANY), while cycle counts were obtained using the RDPMC instruction to read Fixed-Function Counter 1 (CPU_CLK_UNHALTED.THREAD).  (A sketch of this counter access appears after these notes.)
  • Execution time was greater than 4 seconds in all cases, so the overhead of reading the counters was at least 7 orders of magnitude smaller than the execution time.
  • Each test was run at least three times, and the trial with the lowest cycle count was used for the analysis in the table.
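
The inline counter reads mentioned in the third bullet can be sketched as follows. This is an illustrative user-space reader, not the actual test harness, and it assumes that the kernel allows user-mode RDPMC (typically controlled by /sys/devices/cpu/rdpmc on Linux) and that the fixed-function counters are enabled.

#include <stdint.h>

/* Read a fixed-function performance counter with RDPMC.  Bit 30 of ECX selects
 * the fixed-function counters; the low bits select which one:
 *   0 = INST_RETIRED.ANY, 1 = CPU_CLK_UNHALTED.THREAD, 2 = CPU_CLK_UNHALTED.REF_TSC */
static inline uint64_t rdpmc_fixed(uint32_t counter)
{
    uint32_t lo, hi;
    uint32_t ecx = (1u << 30) | counter;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ecx));
    return ((uint64_t)hi << 32) | lo;
}

/* Usage sketch:
 *   uint64_t inst0 = rdpmc_fixed(0), cyc0 = rdpmc_fixed(1);
 *   run_kernel();
 *   uint64_t inst1 = rdpmc_fixed(0), cyc1 = rdpmc_fixed(1);
 *   instructions = inst1 - inst0;  core_cycles = cyc1 - cyc0;                 */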

Comments on results:

  • The 12-FMA loop required 0.7% more cycles than expected.
    • Later experiments show that this overhead is essentially identical to the fraction of cycles spent servicing the 1-millisecond OS scheduler interrupt.
  • The 24-FMA loop required 1.2% more cycles than expected.
    • About half of these extra cycles can be explained by the OS overhead, leaving an unexplained overhead in the 0.5%-0.6% range (not large enough to worry about).
  • The 48-FMA loop required 8.3% more cycles than expected.
    • Cycle counts across trials varied by no more than 1 part in 4000, making this overhead at least 300 times larger than the run-to-run variability.
  • The two unrolled cases gave performance results that appear to be bounded above by 6/7 (85.71%) of peak.

 

Initial (Incorrect) Hypothesis

My immediate response to the results was that this was a simple matter of running out of rename registers.   Modern processors (almost) all have more physical registers than they have register names.  The hardware automatically renames registers to avoid false dependencies, but with deep execution pipelines (particularly for floating-point operations), it is possible to run out of rename registers before reaching full performance.

This is easily illustrated using Little’s Law from queuing theory, which can be expressed as:

Throughput = Concurrency / Occupancy

For this scenario, “Throughput” has units of register allocations per cycle, “Concurrency” is the number of registers in use in support of all of the active instructions, and “Occupancy” is the average number of cycles that a register is busy during the execution of an instruction.

An illustrative example:   

The IBM POWER4 has 72 floating-point rename registers and two floating-point arithmetic units capable of executing fused multiply-add (FMA) instructions (a = b+c*d).   Each FMA instruction requires four registers, and these registers are all held for some number of cycles (discussed below), so full performance (both FMA units starting new operations every cycle) would require eight registers to be allocated each cycle (and for these registers to remain occupied for the duration of the corresponding instruction).   We can estimate the duration by reviewing the execution pipeline diagram (Figure 2-3) in The POWER4 Processor Introduction and Tuning Guide.  The exact details of when registers are allocated and freed are not published, but if we assume that registers are allocated in the “MP” stage (“Mapping”) and held until the “CP” (“Completion”, aka “retirement”) stage, then the registers will be held for a total of 12 cycles.  The corresponding pipeline stages from Figure 2-3 are: MP, ISS, RF, F1, F2, F3, F4, F5, F6, WB, Xfer, CP.

Restating this in terms of Little’s Law, the peak performance of 2 FMAs per cycle corresponds to a “Throughput” of 8 registers allocated per cycle.  With an “Occupancy” of 12 cycles for each of those registers, the required “Concurrency” is 8*12 = 96 registers.  But, as noted above, the POWER4 only has 72 floating-point rename registers.  If we assume a maximum “Concurrency” of 72 registers, the “Throughput” can be computed as 72/12 = 6 registers per cycle, or 75% of the target throughput of 8 registers allocated per cycle.    It is perhaps not a coincidence that the maximum performance I ever saw on DGEMM on a POWER4 system (while working for IBM in the POWER4 design team) was just under 70% of 2 FMAs/cycle, or just over 92% of the occupancy-limited throughput of 1.5 FMAs/cycle.  
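
Restated as a small function (a sketch of the arithmetic above, with the POWER4 numbers shown as an example, not a model of any particular implementation):

/* Fraction of peak throughput achievable when a finite pool of rename
 * registers limits the concurrency required by Little's Law. */
double occupancy_limited_fraction(double rename_regs,        /* e.g., 72 for POWER4 */
                                  double regs_per_inst,      /* e.g.,  4 per FMA    */
                                  double peak_insts_per_cyc, /* e.g.,  2 FMAs/cycle */
                                  double occupancy_cycles)   /* e.g., 12 cycles     */
{
    double required_concurrency = regs_per_inst * peak_insts_per_cyc * occupancy_cycles;
    if (required_concurrency <= rename_regs)
        return 1.0;                              /* registers are not the limiter */
    return rename_regs / required_concurrency;   /* 72 / 96 = 0.75 for POWER4     */
}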


For comparison, the IBM POWER5 processor (similar to POWER4, but with 120 floating-point rename registers) delivered up to 94% of 2 FMAs/cycle on DGEMM, suggesting that a DGEMM efficiency in the 90%-95% of peak range is appropriate for DGEMM on this architecture family.

Applying this model to Xeon Phi x200 is slightly more difficult for a number of reasons, but back-of-the-envelope estimates suggested that it was plausible.

The usual way of demonstrating that rename register occupancy is limiting performance is to change the instructions to reduce the number of registers used per instruction, or the number of cycles that the instructions hold the register, or both.  If this reduces the required concurrency to less than the number of available rename registers, full performance should be obtained.

Several quick tests with instructions using fewer registers (e.g., simple addition instead of FMA) or with fewer registers and shorter pipeline latency (e.g., bitwise XOR) showed no change in throughput — the processor still delivered a maximum throughput of 12 vector instructions every 7 cycles.

Our curiosity was piqued by these results, and more experiments followed.  Those experiments piqued our curiosity even more, eventually leading to a suite of several hundred experiments in which we varied everything that we could figure out how to vary.

We will spare the reader the chronological details, and instead provide a brief overview of the scope of the testing that followed.

Extended Experiments:

Additional experiments (each performed with multiple degrees of unrolling) that showed no change in the limitation of 12 vector instructions per 7 cycles included:

  1. Increasing the dependency latency from 6 cycles to 8 cycles (i.e., using 16 independent vector accumulators) and extending the unrolling to up to 128 FMAs per inner loop iteration.
  2. Increasing the dependency latency to 10 cycles (20 independent vector accumulators), with unrolling to test 20, 40, 60, 80 FMAs per inner loop iteration.
  3. Increasing the dependency latency to 12 cycles (24 independent vector accumulators).
  4. Replacing the 512-bit VFMADD213PD instructions with the scalar version VFMADD213SD.  (This is the AVX-512 EVEX-encoded instruction, not the VEX-encoded version.)
  5. Replacing the 512-bit VFMADD213PD instructions with the AVX2 (VEX-encoded) 256-bit versions.
  6. Increasing the number of loop-invariant registers used from 2 to 4 to 8 (and ensuring that consecutive instructions used different pairs of loop-invariant registers).
  7. Decreasing the number of loop-invariant registers per FMA from 2 to 1, drawing the other input from the output of an FMA instruction at least 12 instructions (6 cycles) away.
  8. Replacing the VFMADD213PD instructions with shorter-latency instructions (VPADDQ and VPXORQ were tested independently).
  9. Replacing the VFMADD213PD instructions with an instruction that has both shorter latency and fewer operands: VPABSQ (which has only 1 input and 1 output register).
  10. Replacing every other VFMADD213PD instruction with a shorter-latency instruction (VPXORQ).
  11. Replacing the three-instruction loop control (add, compare, branch) with two-instruction loop control (subtract, branch).  The three-instruction version counts up from zero, then compares to the iteration count to set the condition code for the terminal conditional branch.  The two-instruction version counts down from the target iteration count to zero, allowing us to use the condition code from the subtract (i.e., not zero) as the branch condition, so no compare instruction is required.  The instruction counts changed as expected, but the cycle counts did not.
  12. Forcing 16-Byte alignment for the branch target at the beginning of the inner loop. (The compiler did this automatically in some cases but not in others — we saw no difference in cycle counts when we forced it to occur).
  13. Many (not all) of the executable files were disassembled with “objdump -d” to ensure that the encoding of the instructions did not exceed the limit of 3 prefixes or 8 Bytes per instruction.  We saw no cases where either of these rules was violated in the inner loops.

Additional experiments showed that the throughput limitation only applies to instructions that execute in the vector pipes:

  1. Replacing the Vector instructions with integer ALU instructions (ADDL) → performance approached two instructions per cycle asymptotically, as expected.
  2. Replacing the Vector instructions with Load instructions from L1-resident data (to vector registers) → performance approached two instructions per cycle asymptotically, as expected.

Some vector instructions can execute in only one of the two vector pipelines.  This is mentioned in the IEEE Micro paper linked above, but is discussed in more detail in Chapter 17 of the document “Intel 64 and IA-32 Architectures Optimization Reference Manual”, (Intel document 248966, revision 037, July 2017).  In addition, Agner Fog’s “Instruction Tables” document (http://www.agner.org/optimize/instruction_tables.pdf) shows which of the two vector units is used for a large subset of the instructions that can only execute in one of the VPUs.   This allows another set of experiments that show:

  •   Each vector pipeline can sustain its full rate of 1 instruction per cycle when used in isolation.
    • VPU0 was tested with VPERMD, VPBROADCASTQ, and VPLZCNT.
    • VPU1 was tested with KORTESTW.
  • Alternating a VPU0 instruction (VPLZCNTQ) with a VPU1 instruction (KORTESTW) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
  • Alternating a VPU0 instruction with an FMA (that can be executed in either VPU) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
    • This was tested with VPERMD and VPLZCNT as the VPU0 instructions.
  • One specific combination of VPU0 and FMA instructions gave a reduced throughput of 1 vector instruction per cycle: VPBROADCASTQ alternating with FMA.
    • VPBROADCASTQ requires a read from a GPR (in the integer side of the core), then broadcasts the result across all the lanes of a vector register.
    • This operation is documented (in the Intel Optimization Reference Manual) as having a latency of 2 cycles and a maximum throughput of 1 per cycle (as we saw with VPBROADCASTQ running in isolation).
    • The GPR to VPU move is a sufficiently uncommon access pattern that it is not particularly surprising to find a case for which it inhibits parallelism across the VPUs, though it is unclear why this is the only case we found that allows the use of both vector pipelines but is still limited to 1 instruction per cycle.

Additional Performance Counter Measurements and Second-Order Effects:

After the first few dozens of experiments, the test codes were augmented with more hardware performance counters.  The full set of counters measured before and after the loop includes:

  • Time Stamp Counter (TSC)
  • Fixed-Function Counter 0 (Instructions Retired)
  • Fixed-Function Counter 1 (Core Cycles Not Halted)
  • Fixed-Function Counter 2 (Reference Cycles Not Halted)
  • Programmable PMC0
  • Programmable PMC1

The TSC was collected with the RDTSC instruction, while the other five counters were collected using the RDPMC instruction.  The total overhead for measuring these six counters is about 250 cycles, compared to a minimum of 4 billion cycles for the monitored loop.

Several programmable performance counter events were collected as “sanity checks”, with negligible counts (as expected):

  • FETCH_STALL.ICACHE_FILL_PENDING_CYCLES
  • MS_DECODED.MS_ENTRY
  • MACHINE_CLEARS.FP_ASSIST

Another programmable performance counter event was collected to verify that the correct number of VPU instructions were being executed:

  • UOPS_RETIRED.PACKED_SIMD
    • Typical result:
      • Nominal expected 16,000,000,000
      • Measured in user-space: 16,000,000,016
        • This event does not count the 16 loads before the inner loop, but does count the 16 stores after the end of the inner loop.
      • Measured in kernel-space: varied from 19,626 to 21,893.
        • Not sure why the kernel is doing packed SIMD instructions, but these are spread across more than 6 seconds of execution time (>6000 scheduler interrupts).
        • These kernel instruction counts are 6 orders of magnitude smaller than the counts for tested code, so they will be ignored here.

The performance counter events with the most interesting results were:

  • NO_ALLOC_CYCLES.RAT_STALL — counts the number of core cycles in which no micro-ops were allocated and the “RAT Stall” (reservation station full) signal is asserted.
  • NO_ALLOC_CYCLES.ROB_FULL — counts the number of core cycles in which no micro-ops were allocated and the Reorder Buffer (ROB) was full.
  • RS_FULL_STALL.ALL — counts the number of core cycles in which the allocation pipeline is stalled and any of the Reservation Stations is full.
    • This should be the same as NO_ALLOC_CYCLES.RAT_STALL, and in all but one case the numbers were nearly identical.
    • The RS_FULL_STALL.ALL event includes a Umask of 0x1F — five bits set.
      • This is consistent with the IEEE Micro paper (linked above) that shows 2 VPU reservation stations, 2 integer reservation stations, and one memory unit reservation station.
      • The only other Umask defined in the Intel documentation is RS_FULL_STALL.MEC (“Memory Execution Cluster”) with a value of 0x01.
      • Directed testing with VPU0 and VPU1 instructions shows that a Umask of 0x08 corresponds to the reservation station for VPU0 and a Umask of 0x10 corresponds to the reservation station for VPU1.

For the all-FMA test cases that were expected to sustain more than 12 VPU instructions per 7 cycles, the NO_ALLOC_CYCLES.RAT_STALL and RS_FULL_STALL.ALL events were a near-perfect match for the number of extra cycles taken by the loop execution.  The values were slightly larger than the computed “extra cycles”, but were always consistent with the assumption of 1.5 cycles “overhead” for the three loop control instructions (matching the instruction issue limit), rather than the 2.0 cycles that I assumed as a baseline.  This is consistent with a NO_ALLOC_CYCLES.RAT_STALL count that overlaps with cycles that are simultaneously experiencing a branch-related pipeline bubble. One or the other should be counted as a stall, but not both.   For these cases, the NO_ALLOC_CYCLES.ROB_FULL counts were negligible.

Interestingly, the individual counts for RS_FULL_STALL for the two vector pipelines were sometimes of similar magnitude and sometimes uneven, but were extremely stable for repeated runs of a particular binary.  The relative counts for the stalls in the two vector pipelines can be modified by changing the code alignment and/or instruction ordering.  In limited tests, it was possible to make either VPU report more stalls than the other, but in every case, the “effective” stall count (VPU0 stalled OR VPU1 stalled) was the amount needed to reduce the throughput to 12 VPU instructions every 7 cycles.

When interleaving vector instructions of different latencies, the total number of stall cycles remained the same (i.e., enough to limit performance to 12 VPU instructions per 7 cycles), but these were split between RAT_STALLs and ROB_STALLs in different ways for different loop lengths.   Typical results showed a pattern like:

  • 16 VPU instructions per loop iteration: approximately zero stalls, as expected
  • 32 VPU instructions per loop iteration: approximately 6.7% RAT_STALLs and negligible ROB_STALLs
  • 64 VPU instructions per loop iteration: ~1% RAT_STALLs (vs ~10% in the all-FMA case) and about 9.9% ROB_STALLs (vs ~0% in the all-FMA case).
    • Execution time increased by about 0.6% relative to the all-FMA case.
  • 128 VPU instructions per loop iteration: negligible RAT_STALLS (vs ~12% in the all-FMA case) and almost 20% ROB_STALLS (vs 0% in the all-FMA case).
    • Execution time increased by 9%, to a value that is ~2.3% slower than the 16-VPU-instruction case.

The conversion of RAT_STALLs to ROB_STALLs when interleaving instructions of different latencies does not seem surprising.  RAT_STALLs occur when instructions are backed up before execution, while ROB_STALLs occur when instructions back up before retirement.  Alternating instructions of different latencies seems guaranteed to push the shorter-latency instructions from the RAT to the ROB until the ROB backs up.  The net slowdown at 128 VPU instructions per loop iteration is not a performance concern, since asymptotic performance is available with anywhere between 24 and (almost) 64 VPU instructions in the inner loop.   These results are included because they might provide some insight into the nature of the mechanisms that limits throughput of vector instructions.

Mechanisms:

RAT_STALLs count the number of cycles in which the Allocate/Rename unit does not dispatch any micro-ops because a target Reservation Station is full.   While this does not directly equate to execution stalls (i.e., no instructions dispatched from the Vector Reservation Station to the corresponding Vector Execution Pipe), the only way the Reservation Station can become full (given an instruction stream with enough independent instructions) is the occurrence of cycles in which instructions are received (from the Allocate/Rename unit), but in which no instruction can be dispatched.    If this occurs repeatedly, the 20-entry Reservation Station will become full, and the RAT_STALL signal will be asserted to prevent the Allocate/Rename unit from sending more micro-ops.

An example code that generates RAT Stalls is a modification of the test code using too few independent accumulators to fully tolerate the pipeline latency.  For example, using 10 accumulators, the code can only tolerate 5 cycles of the 6 cycle latency of the FMA operations.  This inhibits the execution of the FMAs, which fill up the Reservation Station and back up to stall the Allocate/Rename.   Tests with 10..80 FMAs per inner loop iteration show RAT_STALL counts that match the dependency stall cycles that are not overlapped with loop control stall cycles.

We know from the single-VPU tests that the 20-entry Reservation Station for each Vector pipeline is big enough for that pipeline’s operation — no stall cycles were observed.  Therefore the stalls that prevent execution dispatch must be in the shared resources further down the pipeline.   From the IEEE Micro paper, the first execution step is to read the input values from the “rename buffer and the register file”, after which the operations proceed down their assigned vector pipeline.  The vector pipelines should be fully independent until the final execution step in which they write their output values to the rename buffer.  After this, the micro-ops will wait in the Reorder Buffer until they can be retired in program order.  If the bottleneck were in the retirement step, then I would expect the stalls to be attributed to the ROB, not the RAT.   Since the stalls in the most common cases are overwhelmingly RAT stalls, I conclude that the congestion is not *directly* related to instruction retirement.

As mentioned above, the predominance of RAT stalls suggests that limitations at Retirement cannot be directly responsible for the throughput limitation, but there may be an indirect mechanism at work.   The IEEE Micro paper’s section on the Allocation Unit says:

“The rename buffer stores the results of the in-flight micro-ops until they retire, at which point the results are transferred to the architectural register file.”

This comment is consistent with Figure 3 of the paper and with the comment that vector instructions read their input arguments from the “rename buffer and the register file”, implying that the rename buffer and register file are separate register arrays.  In many processor implementations there is a single “physical register” array, with the architectural registers being the subset of the physical registers that are pointed to by a mapping vector.  The mapping vector is updated every time instructions retire, but the contents of the registers do not need to be copied from one place to another.  The description of the Knights Landing implementation suggests that at retirement, results are read from the “rename buffer” and written to the “register file”.  This increases the number of ports required, since this must happen every cycle in parallel with the first step of each of the vector execution pipelines.  It seems entirely plausible that such a design could include a systematic conflict (a “structural hazard”) between the accesses needed by the execution pipes and the accesses needed by the retirement unit.  If this conflict is resolved in favor of the retirement unit, then execution would be stalled, the Reservation Stations would fill up, and the observed behavior could be generated.   If such a conflict exists, it is clearly independent of the number of input arguments (since instructions with 1, 2, and 3 input arguments have the same behavior), leaving the single output argument as the only common feature.  If such a conflict exists, it must almost certainly also be systematic — occurring independent of alignment, timing, or functional unit details — otherwise it seems likely that we would have seen at least one case in the hundreds of tests here that violates the 12/7 throughput limit.

Tests using a variant of the test code with much smaller loops (varying between 160 and 24,000 FMAs per measurement interval, repeated 100,000 times) also strongly support the 12/7 throughput limit.  In every case the minimum cycle count over the 100,000 iterations  was consistent with 12 VPU instructions every 7 cycles (plus measurement overhead).

 

Summary:

The Intel Xeon Phi x200 (Knights Landing) appears to have a systematic throughput limit of 12 Vector Pipe instructions per 7 cycles — 6/7 of the nominal peak performance.  This throughput limitation is not displayed by the integer functional units or the memory units.  Due to the two-instruction-per-cycle limitations of allocate/rename/retire, this performance limit in the vector units is not expected to have an impact on “real” codes.   A wide variety of tests were performed to attempt to develop quantitative models that might account for this limitation, but none matched the specifics of the observed timing and performance counts.

Postscript:

After Damon McDougall’s presentation at the IXPUG 2018 Fall Conference, we talked to a number of Intel engineers who were familiar with this issue.  Unfortunately, we did not get a clear indication of whether their comments were covered by non-disclosure agreements, so if they gave us an explanation, I can’t repeat it….

Posted in Computer Hardware, Performance, Performance Counters

SC16 Invited Talk: Memory Bandwidth and System Balance in HPC Systems

Posted by John D. McCalpin, Ph.D. on 22nd November 2016

Slide01

I have been involved in HPC for over 30 years:

  • 12 years as student & faculty user in ocean modeling,
  • 12 years as a performance analyst and system architect at SGI, IBM, and AMD, and
  • over 7 years as a research scientist at TACC.

Slide02

Slide03
  • This history is based on my own study of the market, with many of the specific details from my own re-analysis of the systems in the TOP500 lists.

Slide04
  • Vector systems were in decline by the time the first TOP500 list was collected in 1993, but still dominated the large systems space in the early 1990’s.
    • The large bump in Rmax in 2002 was due to the introduction of the “Earth Simulator” in Japan.
    • The last vector system (2nd gen Earth Simulator) fell off the list in 2014.
  • RISC SMPs and Clusters dominated the installed base in the second half of the 1990’s and the first few years of the 2000’s.
    • The large bump in Rmax in 2011 is the “K Machine” in Japan, based on a Fujitsu SPARC processor.
    • The “RISC era” was very dynamic, seeing the rapid rise and fall of 6-7 different architectures in about a decade.
    • In alphabetical order the major processor architectures were: Alpha, IA-64, i860, MIPS, PA-RISC, PowerPC, POWER, SPARC.
  • x86-based systems rapidly replaced RISC systems starting in around 2003.
    • The first big x86 system on the list was ASCI Red in 1996.
    • The large increase in x86 systems in 2003-2004 was due to several large systems in the top 10, rather than due to a single huge system.
    • The earliest of these systems were 32-bit Intel processors.
    • The growth of the x86 contribution was strongly enhanced by the introduction of the AMD x86-64 processors in 2004, with AMD contributing about 40% of the x86 Rmax by the end of 2006.
    • Intel 64-bit systems replaced 32-bit processors rapidly once they became available.
    • AMD’s share of the x86 Rmax dropped rapidly after 2011, and in the November 2016 list has fallen to about 1% of the Intel x86 Rmax.
  • My definition of “MPP” differs from Dongarra’s and is based on how the development of the most expensive part of the system (usually the processor) was funded.
    • Since 2005 almost all of the MPP’s in this chart have been IBM Blue Gene systems.
    • The big exception is the new #1 system, the Sunway Taihulight system in China.
  • Accelerated systems made their appearance in 2008 with the “RoadRunner” system at LANL.
    • “RoadRunner” was the only significant system using the IBM Cell processor.
    • Subsequent accelerated systems have almost all used NVIDIA GPUs or Intel Xeon Phi processors.
      • The NVIDIA GPUs took their big jump in 2010 with the introduction of the #2-ranked “Nebulae” (Dawning TC3600 Blade System) system in China (4640 GPUs), then took another boost in late 2012 with the upgrade of ORNL’s Jaguar to Titan (>18000 GPUs).
      • The Xeon Phi contribution is dominated by the immensely large Tianhe-2 system in China (48000 coprocessors), and the Stampede system at TACC (6880 coprocessors).
    • Note the rapid growth, then contraction, of the accelerated systems Rmax.
      • More on this topic below in the discussion of “clusters of clusters”.

Slide05
  • Obviously a high-level summary, but backed by a large amount of (somewhat fuzzy) data over the years.
  • With x86, we get (slowly) decreasing price per “core”, but it is not obvious that we will get another major technology replacement soon.
  • The embedded and mobile markets are larger than the x86 server/PC/laptop market, but there are important differences in the technologies and market requirements that make such a transition challenging.
  • One challenge is the increasing degree of parallelism required — think about how much parallelism an individual researcher can “own”.

Slide06
  • Systems on the TOP500 list are almost always shared, typically by scores or hundreds of users.
  • So with up to a few hundred users, you don’t need to run parallel applications – your share is less than one “processor”.
  • Beyond a few thousand cores, a user’s allocation will typically be multiple core-years per year, so parallelism is required.
  • A fraction of users can get away with single-node parallelism (perhaps with independent jobs running concurrently on multiple nodes), but the majority of users will need multi-node parallel implementations for turnaround, for memory capacity, or simply for throughput.

Slide07

Instead of building large homogeneous systems, many sites have recognized the benefit of specialization – a type of HW/SW “co-configuration”.
These configurations are easiest when the application profile is stable and well-known. It is much more challenging for a general-purpose site such as TACC.


Slide08

Slide09
  • This aside introduces the STREAM benchmark, which is what got me thinking about “balance” 25 years ago.
  • I have never visited the University of Virginia, but had colleagues there who agreed that STREAM should stay in academia when I moved to industry in 1996, and offered to host my guest account.

Slide10
  • Note that the output of each kernel is used as an input to the next.
    • The earliest versions of STREAM did not have this property and some compilers removed whole loops whose output was not used.
    • Fortunately it is easy to identify cases where this happens so that workarounds can be applied.
    • Another way to say this is that STREAM is resistant to undetected over-optimization.
  • OpenMP directives were added in 1996 or 1997.
  • STREAM in C was made fully 64-bit capable in 2013.
    • The validation code was also fixed to eliminate a problem with round-off error that used to occur for very large array sizes.
  • Output print formats keep getting wider as systems get faster.

Slide11
  • STREAM measures time, not bandwidth, so I have to make assumptions about how much data is moved to and from memory.
    • For the Copy kernel, there are actually three different conventions for how many bytes of traffic to count!
    • I count the reads that I asked for and the writes that I asked for.
    • If the processor requires “write allocates”, the maximum STREAM bandwidth will be lower than the peak DRAM bandwidth.
      • The Copy and Scale kernels require 3/2 as much bandwidth as STREAM gives credit for if write allocates are included.
      • The Add and Triad kernels require 4/3 as much bandwidth as STREAM gives credit for if write allocates are included.
  • One weakness of STREAM is that all four kernels are in “store miss” form – none of the arrays are read before being written in a kernel.
    • A counter-example is the DAXPY kernel: A[i] = A[i] + scalar*B[i], for which the store hits in the cache because A[i] was already loaded as an input to the addition operation.
    • Non-allocating/non-temporal/streaming stores are typically required for best performance with the existing STREAM kernels, but these are not supported by all architectures or by all compilers.
      • For example, GCC will not generate streaming stores.
    • In my own work I typically supplement STREAM with “read-only” code (built around DDOT), a standard DAXPY (updating one of the two inputs), and sometimes add a “write-only” benchmark (see the sketches after this list).
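
Sketches of the kernels referred to in these notes are shown below (the STREAM Triad for reference, plus the DAXPY, read-only, and write-only supplements). These are illustrative OpenMP versions, not the official STREAM source code.

#include <stddef.h>

/* STREAM Triad: 2 reads + 1 write per element; the store misses in the cache. */
void triad(double *a, const double *b, const double *c, double scalar, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* DAXPY: the store hits in the cache because a[i] is read first. */
void daxpy(double *a, const double *b, double scalar, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + scalar * b[i];
}

/* Read-only kernel (DDOT). */
double ddot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Write-only kernel (fill). */
void fill(double *a, double scalar, size_t n)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] = scalar;
}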

Slide12

Back to the main topic….

For performance modeling, I try to find a small number of “performance axes” that are capable of accounting for most of the execution time.


Slide13
  • Using the same performance axes as on the previous slide….
  • All balances are shifting to make data motion relatively more expensive than arithmetic.

Slide14
  • The first “Balance Ratio” to look at is (FP rate) / (Memory BW rate).
    • This is the cost per (64-bit) “word” loaded relative to the cost of a (peak) 64-bit FP operation, and applies to long streaming accesses (for which latency can be overlapped).
    • I refer to this metric as the “STREAM Balance” or “Machine Balance” or “System Balance”.
  • The data points here are from a set of real systems.
    • The systems I chose were both commercially successful and had very good memory subsystem performance relative to their competitors.
      • ~1990: IBM RISC System 6000 Model 320 (IBM POWER processor)
      • ~1993: IBM RISC System 6000 Model 590 (IBM POWER processor)
      • ~1996: SGI Origin2000 (MIPS R10000 processor)
      • ~1999: DEC AlphaServer DS20 (DEC Alpha processor)
      • ~2005: AMD Opteron 244 (single-core, DDR1 memory)
      • ~2006: AMD Opteron 275 (dual-core, DDR1 memory)
      • ~2008: AMD Opteron 2352 (dual-core, DDR2 memory)
      • ~2011: Intel Xeon X5680 (6-core Westmere EP)
      • ~2012: Intel Xeon E5 (8-core Sandy Bridge EP)
      • ~2013: Intel Xeon E5 v2 (10-core Ivy Bridge EP)
      • ~2014: Intel Xeon E5 v3 (12-core Haswell EP)
      • (future: Intel Xeon E5 v5)
  • Because memory bandwidth is understood to be an important performance limiter, the processor vendors have not let it degrade too quickly, but more and more applications become bandwidth-limited as this value increases (especially with essentially fixed cache size per core).
    • Unfortunately every other metric is much worse….

Slide15
  • ERRATA: There is an error in the equation in the title — it should be “(GFLOPS/s)*(Memory Latency)”
  • Memory latency is becoming expensive much more rapidly than memory bandwidth!
    • Memory latency is dominated by the time required for cache coherence in most recent systems.
    • Slightly decreasing clock speeds combined with rapidly increasing core counts lead to slowly increasing memory latency – even with heroic increases in hardware complexity.
  • Memory latency is not a dominant factor in very many applications, but it was not negligible in 7 of the 17 SPECfp2006 codes using hardware from 2006, so it is likely to be of increasing concern.
    • More on this below — slide 38.
    • The principal way to combat the negative impact of memory latency is to make hardware prefetching more aggressive, which increases complexity and costs a significant amount of power.

Slide16
  • Interconnect bandwidth (again for large messages) is falling behind faster than local memory bandwidth – primarily because interconnect widths have not increased in the last decade (while most chips have doubled the width of the DRAM interfaces over that period).
  • Interconnect *latency* (not shown) is so high that it would ruin the scaling of even a log chart.  Don’t believe me?  OK…

Slide17
  • Interconnect latency is decreasing more slowly than per-core performance, and much more slowly than per-chip performance.
  • Increasing the problem size only partly offsets the increasing relative cost of interconnect latency.
    • The details depend on the scaling characteristics of your application and are left as an exercise….

Slide18
  • Early GPUs had better STREAM Balance (FLOPS/Word) because the double-precision FLOPS rate was low.   This is no longer the case.
  • In 2012, mainstream, manycore, and GPU had fairly similar balance parameters, with both manycore and GPGPU using GDDR5 to get enough bandwidth to stay reasonably balanced.  We expect mainstream to be *more tolerant* of low bandwidth due to large caches and GPUs to be *less tolerant* of low bandwidth due to the very small caches.
  • In 2016, mainstream processors have not been able to increase bandwidth proportionately (~3x increase in peak FLOPS and 1.5x increase in BW gives 2x increase in FLOPS/Word).
  • Both manycore and GPU have required two generations of non-standard memory (GDDR5 in 2012 and MCDRAM and HBM in 2016) to prevent the balance from degrading too far.
    • These rapid changes require more design cost which results in higher product cost.
Slide19

Slide20
  • This chart is based on representative Intel processor models available at the beginning of each calendar year – starting in 2003, when x86 jumped to 36% of the TOP500 Rmax.
    • The specific values are based on the median-priced 2-socket server processor in each time frame.
    • The frequencies presented are the “maximum Turbo frequency for all cores operational” for processors through Sandy Bridge/Ivy Bridge.
    • Starting with Haswell, the frequency presented is the power-limited frequency when running LINPACK (or similar) on all cores.
      • This causes a significant (~25%) drop in frequency relative to operation with less computationally intense workloads, but even without the power limitation the frequency trend is slightly downward (and is expected to continue to drop).
  • Columns shaded with hash marks are for future products (Broadwell EP is shipping now for the 2017 column).
    • Core counts and frequencies are my personal estimates based on expected technology scaling and don’t represent Intel disclosures about those products.
  • How do Intel’s ManyCore (Xeon Phi) processors compare?

Slide21
  • This slide compares the components of the per-socket GFLOPS of the Intel ManyCore processors relative to the Xeon “mainstream” processors at their introduction.
  • The delivered performance ratio is expected to be smaller than the peak performance ratio even in the best cases, but these ratios are large enough to be quite valuable even if only a portion of the speedup can be obtained.
  • The basic physics being applied here is based on several complementary principles:
    • Simple cores are smaller (so you can fit more per chip) and use less power (so you can power & cool more per chip).
    • Adding cores brings a linear increase in peak performance (assuming that the power can be supplied and the resulting heat dissipated).
    • For each core, reducing the operating frequency brings a greater-than-proportional reduction in power consumption.
  • These principles are taken even further in GPUs (with hundreds or thousands of very simple compute elements).

Slide22
  • The DIMM architecture for DRAMs has been great for flexibility, but terrible for bandwidth.
  • Modern serial link technology runs at 25 Gb/s per lane, while the fastest DIMM-based DDR4 interfaces run at just under 1/10 of that data rate.

Slide23
  • In 1990, the original “Killer Micros” had a single level of cache.
  • Since about 2008, most x86 processors have had 3 levels of cache.
    • Every design has to weigh the potential performance advantage of speculatively probing the next level of cache (before finding out whether the request has “hit” in the current level of cache) against the power cost of performing all those extra lookups.
      • E.g., if the miss rate is typically 10%, then speculative probing increases the tag lookup rate at the next level of cache by 10x.
      • The extra lookups can actually reduce performance if they delay “real” transactions.
  • Asynchronous clock crossings are hard to implement with low latency.
    • A big, and under-appreciated, topic for another forum…
  • Intel Xeon processor evolution:
    • Monolithic L3: Nehalem/Westmere — 1 coherence protocol supported
    • Sliced L3 on one ring: Sandy Bridge/Ivy Bridge — 2/3 coherence protocols supported
    • Sliced L3 on two rings: Haswell/Broadwell — 3 coherence protocols supported

Slide24

Slide25
  • Hardware is capable of extremely low-latency, low-overhead, high-bandwidth data transfers (on-chip or between chips), but only in integrated systems.
  • Legacy IO architectures have orders of magnitude higher latency and overhead, and are only able to attain full bandwidth with very long messages.
  • Some SW requirements, such as MPI message tagging, have been introduced without adequate input from HW designers, and end up being very difficult to optimize in HW.
    • It may only take one incompatible “required” feature to prevent an efficient HW architecture from being used for communication.

Slide26

Slide27
  • Thanks to Burton Smith for the Little’s Law reference!
  • Before we jump into the numbers, I want to show an illustration that may make Little’s Law more intuitive….

Slide28
  • Simple graphical illustration of Little’s Law.
  • I had to pick an old processor with a small latency-BW product to make the picture small enough to be readable.
  • The first set of six loads is needed to keep the bus busy from 0 to 50 ns.
  • As each cache line arrives, another request needs to be sent out so that the memory will be busy 60 ns in the future.
    • The second set of six loads can re-use the same buffers, so only six buffers are needed.

Slide29
  • In the mid-1990’s, processors were just transitioning from supporting a single outstanding cache miss to supporting 4 outstanding cache misses.
  • In 2005, a single core of a first generation AMD Opteron could manage 8 cache misses directly, but only needed 5-6 to saturate the DRAM interface.
  • By mid-2016, a Xeon E5 v4 (Broadwell) processor requires about 100 cache lines “in flight” to fully tolerate the memory latency.  (A quick Little’s Law check of this number follows the machine lists below.)
    • Each core can only directly support 10 outstanding L1 Data Cache misses, but the L2 hardware prefetchers can provide additional concurrency.
    • It still requires 6 cores to get within 5% of asymptotic bandwidth, and the processor energy consumed is 6x to 10x the energy consumed in the DRAMs.
  • The “Mainstream” machines are:
    • SGI Origin2000 (MIPS R10000)
    • DEC Alpha DS20 (DEC Alpha EV5)
    • AMD Opteron 244 (single-core, DDR1 memory)
    • AMD Opteron 275 (dual-core, DDR1 memory)
    • AMD Opteron 2352 (quad-core, DDR2 memory)
    • Intel Xeon X5680 (6-core Westmere EP)
    • Intel Xeon E5 (8-core Sandy Bridge EP)
    • Intel Xeon E5 v2 (10-core Ivy Bridge EP)
    • Intel Xeon E5 v3 (12-core Haswell EP)
    • Intel Xeon E5 v4 (14-core Broadwell EP)
    • (future: Intel Xeon E5 v5 — a plausible estimate for a future Intel Xeon processor, based on extrapolation of current technology trends.)
  • The GPU/Manycore machines are:
    • NVIDIA M2050
    • Intel Xeon Phi (KNC)
    • NVIDIA K20
    • NVIDIA K40
    • Intel Xeon Phi/KNL
    • NVIDIA P100 (Latency estimated).
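
As a sanity check on the “about 100 cache lines in flight” number above, Little’s Law (concurrency = latency * bandwidth) can be evaluated directly. The latency and bandwidth values below are round-number assumptions for a Broadwell-class socket (they are not taken from the chart), so treat the result as an order-of-magnitude check rather than a measurement.

#include <stdio.h>

int main(void)
{
    /* Assumed round numbers for a Broadwell EP socket; substitute your own measurements */
    double latency_ns   = 85.0;   /* idle load-to-use memory latency               */
    double bandwidth_GB = 76.8;   /* peak DRAM bandwidth: 4 channels of DDR4-2400  */
    double line_bytes   = 64.0;   /* cache line size                               */

    /* Little's Law: concurrency (Bytes) = latency (seconds) * bandwidth (Bytes/second) */
    double concurrency = (latency_ns * 1.0e-9) * (bandwidth_GB * 1.0e9);
    printf("Required concurrency: %.0f Bytes = %.0f cache lines in flight\n",
           concurrency, concurrency / line_bytes);
    return 0;
}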

Slide30

Slide31

Slide32

Slide33

Slide34
  • Power density matters, but systems remain so expensive that power cost is still a relatively small fraction of acquisition cost.
  • For example, a hypothetical exascale system that costs $100 million a year for power may not be out of line, because the purchase price of such a system would be in the range of $2 billion.
  • Power/socket is unlikely to increase significantly because of the difficulty of managing both bulk cooling and hot spots.
  • Electrical cost is unlikely to increase by large factors in locations attached to power grids.
  • So the only way for power costs to become dominant is for the purchase price per socket to be reduced by something like an order of magnitude.
  • Needless to say, the companies that sell processors and servers at current prices have an incentive to take actions to keep prices at current levels (or higher).
  • Even the availability of much cheaper processors is not quite enough to make power cost a first-order concern, because if this were to happen, users would deliberately pay more money for more energy-efficient processors, just to keep the ongoing costs tolerable.
  • In this environment, budgetary/organizational/bureaucratic issues would play a major role in the market response to the technology changes.

Slide35
  • Client processors could reduce node prices by using higher-volume, lower-gross-margin parts, but this does not fundamentally change the technology issues.
    • 25%/year for power might be tolerable with minor adjustments to our organizational/budgetary processes.
    • (Especially since staff costs are typically comparable to system costs, so 25% of hardware purchase price per year might only be about 12% of the annual computing center budget for that system.)
  • Very low-cost parts (“embedded” or “DSP” processors) are in a different ballpark – lifetime power cost can exceed hardware acquisition cost.
  • So if we get cheaper processors, they must be more energy-efficient as well.
  • This means that we need to understand where the energy is being used and have an architecture that allows us to control high-energy operations.
    • Not enough time for that topic today, but there are some speculations in the bonus slides.

Slide36
  • For the purposes of this talk, my speculations focus on the “most likely” scenarios.
  • Alternative approaches to maintaining the performance growth rate of systems are certainly possible, but there are many obstacles on those paths and it is difficult to have confidence that any will be commercially viable.

Slide37
  • Once an application becomes important enough to justify the allocation of millions of dollars per year of computing resources, it begins to make sense to consider configuring one or more supercomputers to be cost-effective for that application.
    • (This is fairly widespread in practice.)
  • If an application becomes important enough to justify the allocation of tens of millions of dollars per year of computing resources, it begins to make sense to consider designing one or more supercomputers to be cost-effective for that application.
    • (This has clearly failed many times in the past, but the current technology balances make the approach look attractive again.)
  • Next we will look at examples of “application balance” from various application areas.

Slide38
  • CRITICAL!  Application characterization is a huge topic, but it is easy to find applications that vary by two orders of magnitude or more in requirements for peak FP rate, for pipelined memory accesses (bandwidth), for unexpected memory access (latency), for large-message interconnect bandwidth, and for short-message interconnect latency.
  • The workload on TACC’s systems covers the full range of node-level balances shown here (including at least a half-dozen of the specific codes listed).
  • TACC’s workload includes comparable ranges in requirements for various attributes of interconnect and filesystem IO performance.
  • This chart is based on a sensitivity-based performance model with additive performance components.  (A toy sketch of this style of model appears after this list.)
    • In 2006/2007 there were enough different configurations available in the SPEC benchmark database for me to perform this analysis using public data.
    • More recent SPEC results are less suitable for this data mining approach for several reasons (notably the use of autoparallelization and HyperThreading).
    • But the modeling approach is still in active use at TACC, and was the subject of my keynote talk at the 2nd International Workshop on Performance Modeling: Methods and Applications (PMMA16).
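
For readers who have not seen this style of model, here is a toy sketch of what “additive performance components” means in practice. The component names, coefficients, and rates are purely illustrative assumptions; they are not the actual model or values used at TACC.

#include <stdio.h>

int main(void)
{
    /* Illustrative application coefficients (amount of each kind of "work") */
    double fp_ops     = 1.0e12;   /* floating-point operations              */
    double dram_bytes = 2.0e11;   /* pipelined (streaming) memory traffic   */
    double messages   = 1.0e6;    /* short interconnect messages            */
    double msg_bytes  = 1.0e9;    /* large-message interconnect traffic     */

    /* Illustrative hardware rates for one candidate system configuration */
    double peak_fp  = 5.0e11;     /* FLOP/s                        */
    double dram_bw  = 6.0e10;     /* Bytes/s                       */
    double net_lat  = 1.5e-6;     /* seconds per short message     */
    double net_bw   = 5.0e9;      /* Bytes/s                       */

    /* Additive model: total time is the sum of the per-component times.
     * A sensitivity analysis fits the coefficients to measured runtimes
     * across system configurations with different hardware rates.       */
    double t = fp_ops / peak_fp + dram_bytes / dram_bw
             + messages * net_lat + msg_bytes / net_bw;
    printf("modeled execution time: %.3f seconds\n", t);
    return 0;
}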

Slide39
  • Catch-22 — the more you know about the application characteristics and the more choices you have for computing technology and configuration, the harder it is to come up with cost-effective solutions!

Slide40
  • It is very hard for vendors to back off of existing performance levels….
  • As long as purchase prices remain high enough to keep power costs at second-order, there will be incentive to continue making the fast performance axes as fast as possible.
  • Vendors will only have incentive to develop systems balanced to application-specific performance ratios if there is a large enough market that makes purchases based on optimizing cost/performance for particular application sub-sets.
  • Current processor and system offerings provide a modest degree of configurability of performance characteristics, but the small price range makes this level of configurability a relatively weak lever for overall system optimization.

Slide41
  • This is not the future that I want to see, but it is the future that seems most likely – based on technological, economic, and organizational/bureaucratic factors.
  • There is clearly a disconnect between systems that are increasingly optimized for dense, vectorized floating-point arithmetic and systems that are optimized for low-density “big data” application areas.
    • As long as system prices remain high, vendors will be able to deliver systems that have good performance in both areas.
    • If the market becomes competitive there will be incentives to build more targeted alternatives – but if the market becomes competitive there will also be less money available to design such systems.

Slide42
  • Here I hit the 45-minute mark and ended the talk.
  • I will try to upload and annotate the “Bonus Slides” discussing potential disruptive technologies sometime in the next week.

 

Posted in Computer Architecture, Computer Hardware | 1 Comment »

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on 5th November 2016

The art and science of microprocessor architecture is a never-ending struggle to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four (consecutive) 32-bit values loaded from memory.   As an example of one mode of operation, the four steps are (a small C model of these semantics appears after the numbered list):

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
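
My reading of the four steps above can be summarized by the following small C model of the instruction’s arithmetic (ignoring masking, rounding mode, and exception behavior). This is just my interpretation of the disclosure, not Intel reference code.

/* acc:  the single 512-bit accumulator, viewed as 16 floats
 * src:  the four consecutively-numbered 512-bit input registers
 * mem:  pointer to the four consecutive 32-bit values in memory          */
void v4fmaddps_model(float acc[16], const float src[4][16], const float *mem)
{
    for (int step = 0; step < 4; step++) {       /* four consecutive "vector" steps     */
        float a = mem[step];                     /* scalar load, broadcast to all lanes */
        for (int lane = 0; lane < 16; lane++)    /* sixteen simultaneous SIMD lanes     */
            acc[lane] += a * src[step][lane];
    }
}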

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: http://www.agner.org/optimize/instruction_tables.pdf)
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number of accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires fewer than 2 loads per cycle.  (A sketch of the resulting 4×3 micro-kernel follows this list.)
    • I am sure someone will be happy to tell me if I am wrong!
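
Here is a minimal sketch of the 4×3 register-blocked inner kernel described above, written with AVX2/FMA intrinsics for double precision. The function name, the row-major argument layout, and the leading-dimension parameters are my own illustrative choices (not taken from any particular BLAS library), and the sketch assumes the compiler fully unrolls the short loops and keeps the 12 accumulators in registers (compile with AVX2 and FMA enabled).

#include <immintrin.h>

/* Update a 4-row by 12-column block of C (4 rows x 3 vectors of 4 doubles).
 * A is 4 x K (row-major, leading dimension lda), B is K x 12 (row-major,
 * leading dimension ldb), C is 4 x 12 (row-major, leading dimension ldc). */
static void dgemm_4x12_kernel(int K, const double *A, int lda,
                              const double *B, int ldb,
                              double *C, int ldc)
{
    __m256d c[4][3];                              /* the 12 SIMD accumulators */
    for (int ii = 0; ii < 4; ii++)
        for (int jj = 0; jj < 3; jj++)
            c[ii][jj] = _mm256_loadu_pd(&C[ii*ldc + 4*jj]);

    for (int k = 0; k < K; k++) {
        /* 3 named registers hold contiguous vectors of B, re-used for all 4 rows */
        __m256d b0 = _mm256_loadu_pd(&B[k*ldb + 0]);
        __m256d b1 = _mm256_loadu_pd(&B[k*ldb + 4]);
        __m256d b2 = _mm256_loadu_pd(&B[k*ldb + 8]);
        for (int ii = 0; ii < 4; ii++) {
            /* 1 register holds the element of A, broadcast across the SIMD lanes */
            __m256d a = _mm256_broadcast_sd(&A[ii*lda + k]);
            c[ii][0] = _mm256_fmadd_pd(a, b0, c[ii][0]);
            c[ii][1] = _mm256_fmadd_pd(a, b1, c[ii][1]);
            c[ii][2] = _mm256_fmadd_pd(a, b2, c[ii][2]);
        }
    }

    for (int ii = 0; ii < 4; ii++)
        for (int jj = 0; jj < 3; jj++)
            _mm256_storeu_pd(&C[ii*ldc + 4*jj], c[ii][jj]);
}

Counting registers in the inner loop gives exactly the 12 + 3 + 1 = 16 described above, and each unrolled iteration performs 12 FMAs against 4 loads (3 vector loads of B plus 1 broadcast of A), which is well under 2 loads per cycle on a machine that executes 2 FMAs per cycle.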

So that was a lot of detail — what is the point?

The first point relates to the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.

 

This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.

Implications?

  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g., http://dl.acm.org/citation.cfm?id=1730587), for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto’s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many neural-network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?


UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | 2 Comments »

Invited Talk at SuperComputing 2016!

Posted by John D. McCalpin, Ph.D. on 16th October 2016

“Memory Bandwidth and System Balance in HPC Systems”

If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic).   The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.

The official announcement:

SC16 Invited Talk Spotlight: Dr. John D. McCalpin Presents “Memory Bandwidth and System Balance in HPC Systems”

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Invited Talk at SuperComputing 2016!

Memory Bandwidth Requirements of the HPL benchmark

Posted by John D. McCalpin, Ph.D. on 11th September 2014

The High Performance LINPACK (HPL) benchmark is well known for delivering a high fraction of peak floating-point performance. The (historically) excellent scaling of performance as the number of processors is increased and as the frequency is increased suggests that memory bandwidth has not been a performance limiter.

But this does not mean that memory bandwidth will *never* be a performance limiter. The algorithms used by HPL have lots of data re-use (both in registers and from the caches), but the data still has to go to and from memory, so the bandwidth requirement is not zero, which means that at some point in scaling the number of cores or frequency or FP operations per cycle, we are going to run out of the available memory bandwidth. The question naturally arises: “Are we (almost) there yet?”

Using Intel’s optimized HPL implementation, a medium-sized (N=18000) 8-core (single socket) HPL run on a Stampede compute node (3.1 GHz, 8 cores/chip, 8 FP ops/cycle) showed about 15 GB/s sustained memory bandwidth at about 165 GFLOPS. This level of bandwidth utilization should be no trouble at all (even when running on two sockets), given the 51.2 GB/s peak memory bandwidth (~38 GB/s sustainable) on each socket.

But if we scale this to the peak performance of a new Haswell EP processor (e.g., 2.6 GHz, 12 cores/chip, 16 FP ops/cycle), it suggests that we will need about 40 GB/s of memory bandwidth for a single-socket HPL run and about 80 GB/s of memory bandwidth for a 2-socket run. A single Haswell chip can only deliver about 60 GB/s sustained memory bandwidth, so the latter value is a problem, and it means that we expect LINPACK on a 2-socket Haswell system to require attention to memory placement.
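
The scaling arithmetic above is simple enough to check directly. The little program below uses only the numbers quoted in this post, plus my assumption that the HPL fraction of peak and the Bytes/FLOP ratio stay roughly constant when moving from Sandy Bridge to Haswell.

#include <stdio.h>

int main(void)
{
    /* Measured on one Sandy Bridge socket of a Stampede node (quoted above) */
    double meas_gflops = 165.0;                    /* GFLOPS            */
    double meas_bw     = 15.0;                     /* GB/s              */
    double snb_peak    = 3.1 * 8 * 8;              /* ~198 GFLOPS peak  */

    /* Haswell EP example configuration: 2.6 GHz * 12 cores * 16 FP ops/cycle */
    double hsw_peak = 2.6 * 12 * 16;               /* ~499 GFLOPS peak  */

    /* Assume the same HPL efficiency and the same Bytes/FLOP requirement */
    double bytes_per_flop = meas_bw / meas_gflops;
    double hsw_bw = bytes_per_flop * hsw_peak * (meas_gflops / snb_peak);

    printf("Projected 1-socket Haswell HPL bandwidth: %.0f GB/s\n", hsw_bw);
    printf("Projected 2-socket Haswell HPL bandwidth: %.0f GB/s\n", 2.0 * hsw_bw);
    return 0;
}

This prints roughly 38 GB/s and 76 GB/s, consistent with the "about 40 GB/s" and "about 80 GB/s" estimates above.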

A colleague here at TACC ran into this while testing on a 2-socket Haswell EP system. Running in the default mode showed poor scaling beyond one socket. Running the same binary under "numactl --interleave=0,1" eliminated most (but not all) of the scaling issues. I will need to look at the numbers in more detail, but it looks like the required chip-to-chip bandwidth (when using interleaved memory) may be slightly higher than what the QPI interface can sustain.

Just another reminder that overheads that are “negligible” may not stay that way in the presence of exponential performance growth.

Posted in Algorithms, Performance | Comments Off on Memory Bandwidth Requirements of the HPL benchmark

Some comments on the Xeon Phi coprocessor

Posted by John D. McCalpin, Ph.D. on 17th November 2012

As many of you know, the Texas Advanced Computing Center is in the midst of installing “Stampede” — a large supercomputer using both Intel Xeon E5 (“Sandy Bridge”) and Intel Xeon Phi (aka “MIC”, aka “Knights Corner”) processors.

In his blog “The Perils of Parallel”, Greg Pfister commented on the Xeon Phi announcement and raised a few questions that I thought I should address here.

I am not in a position to comment on Greg’s first question about pricing, but “Dr. Bandwidth” is happy to address Greg’s second question on memory bandwidth!
This has two pieces — local memory bandwidth and PCIe bandwidth to the host. Greg also raised some issues regarding ECC and regarding performance relative to the Xeon E5 processors that I will address below. Although Greg did not directly raise issues of comparisons with GPUs, several of the topics below seemed to call for comments on similarities and differences between Xeon Phi and GPUs as coprocessors, so I have included some thoughts there as well.

Local Memory Bandwidth

The Intel Xeon Phi 5110P is reported to have 8 GB of local memory supporting 320 GB/s of peak bandwidth. The TACC Stampede system employs a slightly different model Xeon Phi, referred to as the Xeon Phi SE10P — this is the model used in the benchmark results reported in the footnotes of the announcement of the Xeon Phi 5110P. The Xeon Phi SE10P runs its memory slightly faster than the Xeon Phi 5110P, but memory performance is primarily limited by available concurrency (more on that later), so the sustained bandwidth is expected to be essentially the same.

Background: Memory Balance

Since 1991, I have been tracking (via the STREAM benchmark) the “balance” between sustainable memory bandwidth and peak double-precision floating-point performance. This is often expressed in “Bytes/FLOP” (or more correctly “Bytes/second per FP Op/second”), but these numbers have been getting too small (<< 1), so for the STREAM benchmark I use "FLOPS/Word" instead (again, more correctly "FLOPs/second per Word/second", where "Word" is whatever size was used in the FP operation). The design target for the traditional "vector" systems was about 1 FLOP/Word, while cache-based systems have been characterized by ratios anywhere between 10 FLOPS/Word and 200 FLOPS/Word. Systems delivering the high sustained memory bandwidth of 10 FLOPS/Word are typically expensive and applications are often compute-limited, while systems delivering the low sustained memory bandwidth of 200 FLOPS/Word are typically strongly memory bandwidth-limited, with throughput scaling poorly as processors are added.

Some real-world examples from TACC's systems (the small helper program after this list reproduces these ratios):

  • TACC’s Ranger system (4-socket quad-core Opteron Family 10h “Barcelona” processors) sustains about 17.5 GB/s (2.19 GW/s for 8-Byte Words) per node, and has a peak FP rate of 2.3 GHz * 4 FP Ops/Hz/core * 4 cores/socket * 4 sockets = 147.2 GFLOPS per node. The ratio is therefore about 67 FLOPS/Word.
  • TACC’s Lonestar system (2-socket 6-core Xeon 5600 “Westmere” processors) sustains about 41 GB/s (5.125 GW/s) per node, and has a peak FP rate of 3.33 GHz * 4 Ops/Hz/core * 6 cores/socket * 2 sockets = 160 GFLOPS per node. The ratio is therefore about 31 FLOPS/Word.
  • TACC’s forthcoming Stampede system (2-socket 8-core Xeon E5 “Sandy Bridge” processors) sustains about 78 GB/s (9.75 GW/s) per node, and has a peak FP rate of 2.7 GHz * 8 FP Ops/Hz * 8 cores/socket * 2 sockets = 345.6 GFLOPS per node. The ratio is therefore a bit over 35 FLOPS/Word.
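
A tiny helper makes the arithmetic behind these ratios explicit; the inputs are simply the numbers quoted in the list above.

#include <stdio.h>

/* FLOPS/Word = peak GFLOPS divided by sustained GWords/s (8-Byte Words) */
static double flops_per_word(double ghz, double fp_ops_per_hz,
                             double cores, double sockets, double gb_per_s)
{
    double gflops = ghz * fp_ops_per_hz * cores * sockets;
    double gwords = gb_per_s / 8.0;
    return gflops / gwords;
}

int main(void)
{
    printf("Ranger:   %4.0f FLOPS/Word\n", flops_per_word(2.3,  4, 4, 4, 17.5));
    printf("Lonestar: %4.0f FLOPS/Word\n", flops_per_word(3.33, 4, 6, 2, 41.0));
    printf("Stampede: %4.0f FLOPS/Word\n", flops_per_word(2.7,  8, 8, 2, 78.0));
    return 0;
}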

Again, the Xeon Phi SE10P coprocessors being installed at TACC are not identical to the announced product version, but the differences are not particularly large. According to footnote 7 of Intel’s announcement web page, the Xeon Phi SE10P has a peak performance of about 1.06 TFLOPS, while footnote 8 reports a STREAM benchmark performance of up to 175 GB/s (21.875 GW/s). The ratio is therefore about 48 FLOPS/Word — a bit less bandwidth per FLOP than the Xeon E5 nodes in the TACC Stampede system (or the TACC Lonestar system), but a bit more bandwidth per FLOP than provided by the nodes in the TACC Ranger system. (I will have a lot more to say about sustained memory bandwidth on the Xeon Phi SE10P over the next few weeks.)

The earlier GPUs had relatively high ratios of bandwidth to peak double-precision FP performance, but as the double-precision FP performance was increased, the ratios have shifted to relatively low amounts of sustainable bandwidth per peak FLOP. For the NVIDIA M2070 “Fermi” GPGPU, the peak double-precision performance is reported as 515.2 GFLOPS, while I measured sustained local bandwidth of about 105 GB/s (13.125 GW/s) using a CUDA port of the STREAM benchmark (with ECC enabled). This corresponds to about 39 FLOPS/Word. I don’t have sustained local bandwidth numbers for the new “Kepler” K20X product, but the data sheet reports that the peak memory bandwidth has been increased by 1.6x (250 GB/s vs 150 GB/s) while the peak FP rate has been increased by 2.5x (1.31 TFLOPS vs 0.515 TFLOPS), so the ratio of peak FLOPS to sustained local bandwidth must be significantly higher than the 39 for the “Fermi” M2070, and is likely in the 55-60 range — slightly higher than the value for the Xeon Phi SE10P.

Although the local memory bandwidth ratios are similar between GPUs and Xeon Phi, the Xeon Phi has a lot more cache to facilitate data reuse (thereby decreasing bandwidth demand). The architectures are quite different, but the NVIDIA Kepler K20x appears to have a total of about 2MB of registers, L1 cache, and L2 cache per chip. In contrast, the Xeon Phi has a 32kB data cache and a private 512kB L2 cache per core, giving a total of more than 30 MB of cache per chip. As the community develops experience with these products, it will be interesting to see how effective the two approaches are for supporting applications.

PCIe Interface Bandwidth

There is no doubt that the PCIe interface between the host and a Xeon Phi has a lot less sustainable bandwidth than what is available for either the Xeon Phi to its local memory or for the host processor to its local memory. This will certainly limit the classes of algorithms that can map effectively to this architecture — just as it limits the classes of algorithms that can be mapped to GPU architectures.

Although many programming models are supported for the Xeon Phi, one that looks interesting (and which is not available on current GPUs) is to run MPI tasks on the Xeon Phi card as well as on the host.

  • MPI codes are typically structured to minimize external bandwidth, so the PCIe interface is used only for MPI messages and not for additional offloading traffic between the host and coprocessor.
  • If the application allows different amounts of “work” to be allocated to each MPI task, then you can use performance measurements for your application to balance the work allocated to each processing component.
  • If the application scales well with OpenMP parallelism, then one option is to place one MPI task on each Xeon E5 chip on the host (with 8 threads per task) and one MPI task on the Xeon Phi (with anywhere from 60 to 240 threads per task, depending on how your particular application scales).
  • Xeon Phi supports multiple MPI tasks concurrently (with environment variables to control which cores an MPI task’s threads can run on), so applications that do not easily allow different amounts of work to be allocated to each MPI task might run multiple MPI tasks on the Xeon Phi, with the number chosen to balance performance with the performance of the host processors. For example if the Xeon Phi delivers approximately twice the performance of a Xeon E5 host chip, then one might allocate one MPI task on each Xeon E5 (with OpenMP threading internal to the task) and two MPI tasks on the Xeon Phi (again with OpenMP threading internal to the task). If the Xeon Phi delivers three times the performance of the Xeon E5, then one would allocate three MPI tasks to the Xeon Phi, etc….

Running a full operating system on the Xeon Phi allows more flexibility in code structure than is available on (current) GPU-based coprocessors. Possibilities include:

  • Run on host and offload loops/functions to the Xeon Phi.
  • Run on Xeon Phi and offload loops/functions to the host.
  • Run on Xeon Phi and host as peers, for example with MPI.
  • Run only on the host and ignore the Xeon Phi.
  • Run only on the Xeon Phi and use the host only for launching jobs and providing external network and file system access.

Lots of things to try….

ECC

Like most (all?) GPUs that support ECC, the Xeon Phi implements ECC “inline” — using a fraction of the standard memory space to hold the ECC bits. This requires memory controller support to perform the ECC checks and to hide the “holes” in memory that contain the ECC bits, but it allows the feature to be turned on and off without incurring extra hardware expense for widening the memory interface to support the ECC bits.

Note that widening the memory interface from 64 bits to 72 bits is straightforward with x4 and x8 DRAM parts — just use 18 x4 chips instead of 16, or use 9 x8 chips instead of 8 — but is problematic with the x32 GDDR5 DRAMs used in GPUs and in Xeon Phi. A single x32 GDDR5 chip has a minimum burst of 32 Bytes so a cache line can be easily delivered with a single transfer from two “ganged” channels. If one wanted to “widen” the interface to hold the ECC bits, the minimum overhead is one extra 32-bit channel — a 50% overhead. This is certainly an unattractive option compared to the 12.5% overhead for the standard DDR3 ECC DIMMs. There are a variety of tricky approaches that might be used to reduce this overhead, but the inline approach seems quite sensible for early product generations.

Intel has not disclosed details about the implementation of ECC on Xeon Phi, but my current understanding of their implementation suggests that the performance penalty (in terms of bandwidth) is actually rather small. I don’t know enough to speculate on the latency penalty yet. All of TACC’s Xeon Phi’s have been running with ECC enabled, but any Xeon Phi owner should be able to reboot a node with ECC disabled to perform direct latency and bandwidth comparisons. (I have added this to my “To Do” list….)

Speedup relative to Xeon E5

Greg noted the surprisingly reasonable claims for speedup relative to Xeon E5. I agree that this is a good thing, and that it is much better to pay attention to application speedup than to the peak performance ratios. Computer performance history has shown that every approach used to double performance results in less than doubling of actual application performance.

Looking at some specific microarchitectural performance factors:

  1. Xeon Phi supports a 512-bit vector instruction set, which can be expected to be slightly less efficient than the 256-bit vector instruction set on Xeon E5.
  2. Xeon Phi has slightly lower L1 cache bandwidth (in terms of Bytes/Peak FP Op) than the Xeon E5, resulting in slightly lower efficiency for overlapping compute and data transfers to/from the L1 data cache.
  3. Xeon Phi has ~60 cores per chip, which can be expected to give less efficient throughput scaling than the 8 cores per Xeon E5 chip.
  4. Xeon Phi has slightly less bandwidth per peak FP Op than the Xeon E5, so the memory bandwidth will result in a higher overhead and a slightly lower percentage of peak FP utilization.
  5. Xeon Phi has no L3 cache, so the total cache per core (32kB L1 + 512kB L2) is lower than that provided by the Xeon E5 (32kB L1 + 256kB L2 + 2.5MB L3, where the 2.5MB is 1/8 of the 20 MB shared L3).
  6. Xeon Phi has higher local memory latency than the Xeon E5, which has some impact on sustained bandwidth (already considered), and results in additional stall cycles in the occasional case of a non-prefetchable cache miss that cannot be overlapped with other memory transfers.

None of these are “problems” — they are intrinsic to the technology required to obtain higher peak performance per chip and higher peak performance per unit power. (That is not to say that the implementation cannot be improved, but it is saying that any implementation using comparable design and fabrication technology can be expected to show some level of efficiency loss due to each of these factors.)

The combined result of all these factors is that the Xeon Phi (or any processor obtaining its peak performance using much more parallelism with lower-power, less complex processors) will typically deliver a lower percentage of peak on real applications than a state-of-the-art Xeon E5 processor. Again, this is not a “problem” — it is intrinsic to the technology. Every application will show different sensitivity to each of these specific factors, but few applications will be insensitive to all of them.

Similar issues apply to comparisons between the “efficiency” of GPUs vs state-of-the-art processors like the Xeon E5. These comparisons are not as uniformly applicable because the fundamental architecture of GPUs is quite different than that of traditional CPUs. For example, we have all seen the claims of 50x and 100x speedups on GPUs. In these cases the algorithm is typically a poor match to the microarchitecture of the traditional CPU and a reasonable match to the microarchitecture of the GPU. We don’t expect to see similar speedups on Xeon Phi because it is based on a traditional microprocessor architecture and shows similar performance characteristics.

On the other hand, something that we don’t typically see is the list of 0x speedups for algorithms that do not map well enough to the GPU to make the porting effort worthwhile. Xeon Phi is not better than Xeon E5 on all workloads, but because it is based on general-purpose microprocessor cores it will run any general-purpose workload. The same cannot be said of GPU-based coprocessors.

Of course these are all general considerations. Performing careful direct comparisons of real application performance will take some time, but it should be a lot of fun!

Posted in Computer Hardware | Comments Off on Some comments on the Xeon Phi coprocessor

Is “ordered summation” a hard problem to speed up?

Posted by John D. McCalpin, Ph.D. on 15th February 2012

Sometimes things that seem incredibly difficult aren’t really that bad….

I have been reviewing technology challenges for “exascale” computing and ran across an interesting comment in the 2008 “Technology Challenges in Achieving Exascale Systems” report.

In Section 5.8 “Application Assessments”, Figure 5.16 on page 82 places “Ordered Summation” in the upper right hand corner (serial and non-local) with the annotation “Just plain hard to speed up”.
The most obvious use for ordered summation is in computing sums or dot products in such a way that the result does not depend on the order of the computations, or on the number of partial sums used in intermediate stages.

Interestingly, “ordered summation” is not necessary to obtain sums or dot products that are “exact” independent of ordering or grouping. For the very important case of “exact” computation of inner products, the groundwork was laid out 30 years ago by Kulisch (e.g., Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice, Academic Press 1981, and US Patents 4,622,650 and 4,866,653). Kulisch proposes using a very long fixed point accumulator that can handle the full range of products of 64-bit IEEE values — from minimum denorm times minimum denorm to maximum value times maximum value. Working out the details and allowing extra bits to prevent overflow in the case of adding lots of maximum values, Kulisch proposed an accumulator of 4288 bits to handle the accumulation of products of 64-bit IEEE floating-point values. (ref and ref).

For a long time this proposal of a humongously long accumulator (4288 bits = 536 Bytes = 67 64-bit words) was considered completely impractical, but as technology has changed, I think the approach makes a fair amount of sense now — trading computation of these exact inner products for the potentially much more expensive communication required to re-order the summation.

I have not looked at the current software implementations of this exact accumulator in detail, but it appears that on a current Intel microprocessor you can add two exact accumulators in ~133 cycles — one cycle for the first 64-bit addition, and 2 cycles for each of the next 66 64-bit add with carry operations. (AMD processors provide similar capability, with slightly different latency and throughput details.) Although the initial bit-twiddling to convert the product of two IEEE 64-bit numbers to a 106-bit fixed-point value is ugly, the operations should not take very long in the common case, so presumably a software implementation of the exact accumulator would spend most of its time updating the exact accumulator from some “base” point (where the low-order bits of the product sit) up to the top of the accumulator. (It is certainly possible to employ various tricks to know when you can stop carrying bits “upstream”, but I am trying to be conservative in the performance estimates here.)
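
To make the cost estimate concrete, here is a minimal portable-C sketch of adding one 4288-bit accumulator into another. A production implementation would use the hardware add-with-carry instruction directly (and would exploit the "base point" optimization described above); the explicit carry computation here is only to show the structure of the ~67-word update.

#include <stdint.h>

#define ACC_WORDS 67                        /* 4288 bits / 64 bits per word */

/* Add accumulator b into accumulator a; both stored least-significant word first. */
static void acc_add(uint64_t a[ACC_WORDS], const uint64_t b[ACC_WORDS])
{
    unsigned carry = 0;
    for (int i = 0; i < ACC_WORDS; i++) {
        uint64_t sum = a[i] + b[i];
        unsigned c1  = (sum < a[i]);        /* carry out of a[i] + b[i]          */
        a[i]  = sum + carry;
        carry = c1 | (a[i] < sum);          /* carry out of adding the old carry */
    }
}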

Since these exact accumulations are order-independent you can use all of the cores on the chip to run multiple accumulators in parallel. You can also get a bit of speedup by pipelining two accumulations on one core (1.5 cycles per Add with Carry throughput versus 2 cycles per Add with Carry Latency). To keep the control flow separate, this is probably done most easily via HyperThreading. Assuming each pair of 64-bit IEEE inputs generates outputs that average ~1/3 of the way up the exponent range, a naïve implementation would require ~44 Add With Carry operations per accumulate, or about 30 cycles per update in a pipelined implementation. Adding another ~25 cycles per element for bit twiddling and control overhead gives ~55 cycles per element on one core, or ~7.5 cycles per element on an 8-core processor. Assume a 2.5GHz clock to get ~3ns per update. Note that the update is associated with 16 Bytes of memory traffic to read the two input arrays, and that the resulting 5.3 GB/s of DRAM bandwidth is well within the chip’s capability. Interestingly, the chip’s sustained bandwidth limitation of 10-15 GB/s means that accumulating into a 64-bit IEEE value is only going to be 2-3 times as fast as this exact technique.

Sending exact accumulators between nodes for a tree-based summation is easy — with current interconnect fabrics the time required to send 536 Bytes is almost the same as the time required to send the 8 Byte IEEE partial sums currently in use. I.e., with QDR Infiniband, the time required to send a message via MPI is something like 1 microsecond plus (message length / 3.2 GB/s). This works out to 1.0025 microseconds with an 8 Byte partial sum and 1.167 microseconds for a 536 Byte partial sum, with the difference expected to decrease as FDR Infiniband is introduced.
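
The message-time estimates above come from the usual latency-plus-bandwidth model; the two-line function below reproduces them, using the 1 microsecond startup time and 3.2 GB/s bandwidth for QDR InfiniBand quoted in the text.

#include <stdio.h>

/* time (microseconds) = startup latency + message size / bandwidth */
static double mpi_time_us(double bytes)
{
    return 1.0 + bytes / 3200.0;     /* 3.2 GB/s = 3200 Bytes per microsecond */
}

int main(void)
{
    printf("8-Byte IEEE partial sum:    %.4f us\n", mpi_time_us(8.0));
    printf("536-Byte exact accumulator: %.4f us\n", mpi_time_us(536.0));
    return 0;
}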

I don’t know of anyone using these techniques in production, but it looks like we are getting close to the point where we pay a slight performance penalty (on what was almost certainly a small part of the code’s overall execution time) and never again need to worry about ordering or grouping leading to slightly different answers in sum reductions or dot products. This sounds like a step in the right direction….

Posted in Algorithms, Computer Hardware | Comments Off on Is “ordered summation” a hard problem to speed up?

Opteron/PhenomII STREAM Bandwidth vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In a recent post (link) I showed that local memory latency is weakly dependent on processor and DRAM frequency in a single-socket Phenom II system. Here are some preliminary results on memory bandwidth as a function of CPU frequency and DRAM frequency in the same system.

This table does not include the lowest CPU frequency (0.8 GHz) or the lowest DRAM frequency (DDR3/800) since these results tend to be relatively poor and I prefer to look at big numbers. 🙂

Single-Thread STREAM Triad Bandwidth (MB/s)

Processor Frequency | DDR3/1600   | DDR3/1333   | DDR3/1066
3.2 GHz             | 10,101 MB/s |  9,813 MB/s |  9,081 MB/s
2.5 GHz             | 10,077 MB/s |  9,784 MB/s |  9,053 MB/s
2.1 GHz             | 10,075 MB/s |  9,783 MB/s |  9,051 MB/s

Two-Thread STREAM Triad Bandwidth (MB/s)

Processor Frequency | DDR3/1600   | DDR3/1333   | DDR3/1066
3.2 GHz             | 13,683 MB/s | 12,851 MB/s | 11,490 MB/s
2.5 GHz             | 13,337 MB/s | 12,546 MB/s | 11,252 MB/s
2.1 GHz             | 13,132 MB/s | 12,367 MB/s | 11,112 MB/s

So from these results it is clear that the sustainable memory bandwidth as measured by the STREAM benchmark is very weakly dependent on the CPU frequency, but moderately strongly dependent on the DRAM frequency. It is clear that a single thread is not enough to drive the memory system to maximum performance, but the modest speedup from 1 thread to 2 threads suggests that more threads will not be very helpful. More details to follow, along with some attempts to optimize the STREAM benchmark for better sustained performance.

Posted in Computer Hardware, Reference | 4 Comments »

Opteron Memory Latency vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In an earlier post (link), I made the off-hand comment that local memory latency was only weakly dependent on CPU and DRAM frequency in Opteron systems. Being of the scientific mindset, I went back and gathered a clean set of data to show the extent to which this assertion is correct.

The following table presents my most reliable & accurate measurements of open page local memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller. These results were obtained using a pointer-chasing stride of 320 Bytes, modulo 32768 Bytes. This scheme results in five passes through each 32 kB region of memory, resulting in loading every cache line but without activating the core prefetcher (which requires the stride to be no more than one cache line) or the DRAM prefetcher (which requires that the stride be no more than four cache lines on the system under test). Large pages were used to ensure that each 32kB region was mapped to contiguous DRAM locations to maximize DRAM page hits.
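
For readers who want to reproduce this kind of measurement, here is a minimal sketch of the pointer-chain setup described above (stride of 320 Bytes, modulo 32768 Bytes). The function names and the block-to-block linking order are my own simplifications; the actual code also needs large-page allocation, careful timing, and repetition, which are omitted here.

#include <stddef.h>

#define STRIDE   320                  /* 5 cache lines: defeats both prefetchers */
#define MODULUS  32768                /* wrap within each 32 kB region           */

/* Build a circular pointer chain through 'bytes' of memory in 'buf'.
 * Within each 32 kB block the chain advances by 320 Bytes modulo 32768,
 * which touches every 64-Byte line in the block in 5 "passes".          */
void **build_chain(char *buf, size_t bytes)
{
    void **first = NULL, **prev = NULL;
    for (size_t block = 0; block + MODULUS <= bytes; block += MODULUS) {
        size_t off = 0;
        for (int i = 0; i < MODULUS / 64; i++) {      /* 512 lines per block */
            void **p = (void **)(buf + block + off);
            if (prev) *prev = (void *)p; else first = p;
            prev = p;
            off = (off + STRIDE) % MODULUS;
        }
    }
    *prev = (void *)first;                            /* close the loop */
    return first;
}

/* Dependent-load loop: average time per iteration is the open-page latency. */
void *chase(void **p, long iterations)
{
    while (iterations-- > 0)
        p = (void **)*p;
    return (void *)p;                                 /* defeat dead-code elimination */
}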

The system used for these tests is a home-built system:

  • AMD Phenom II 555 Processor
    • Two cores
    • Supported CPU Frequencies of 3.2, 2.5, 2.1, and 0.8 GHz
    • Silicon Revision C2 (equivalent to “Shanghai” Opteron server parts)
  • 4 GB of DDR3/1600 DRAM
    • Two 2GB unregistered, unbuffered DIMMs
    • Dual-rank DIMMs
    • Each rank is composed of eight 1 Gbit parts (128Mbit x8)

The array sizes used were 256 MB, 512 MB, and 1024 MB, with each test being run three times. Of these nine results, the value chosen for the table below was an “eye-ball” average of the 2nd-best through 5th-best results. (There was very little scatter in the results — typically less than 0.1 ns variation across the majority of the nine results for each configuration.)

Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066 | DDR3/800
3.2 GHz             | 51.58 ns  | 54.18 ns  | 57.46 ns  | 66.18 ns
2.5 GHz             | 52.77 ns  | 54.97 ns  | 58.30 ns  | 65.74 ns
2.1 GHz             | 53.94 ns  | 55.37 ns  | 58.72 ns  | 65.79 ns
0.8 GHz             | 82.86 ns  | 87.42 ns  | 94.96 ns  | 94.51 ns

Notes:

  1. The default frequencies for the system are 3.2 GHz for the CPU cores and 667 MHz for the DRAM (DDR3/1333)

Summary:
The results above show that the variations in (local) memory latency are relatively small for this single-socket system when the CPU frequency is varied between 2.1 and 3.2 GHz and the DRAM frequency is varied between 533 MHz (DDR3/1066) and 800 MHz (DDR3/1600). Relative to the default frequencies, the lowest latency (obtained by overclocking the DRAM by 20%) is almost 5% better. The highest latency (obtained by dropping the CPU core frequency by ~1/3 and the DRAM frequency by about 20%) is only ~8% higher than the default value.

Posted in Computer Hardware, Reference | 2 Comments »