John McCalpin's blog

Dr. Bandwidth explains all….

New Year’s Updates

Posted by John D. McCalpin, Ph.D. on 9th January 2019

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:

Posted in Computer Architecture, Performance | 2 Comments »

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Posted by John D. McCalpin, Ph.D. on 22nd January 2018

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Introduction:

In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report (https://colfaxresearch.com/skl-avx512/) to the Xeon Phi x200 (Knights Landing) processors here at TACC.   There was no deep goal — just a desire to see the maximum GFLOPS in action.     The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished.   Instead, we found puzzle after puzzle.  After almost four weeks, we have a solid characterization of the behavior — no tested code exceeds an execution rate of 12 vector pipe instructions every 7 cycles (6/7 of the nominal peak) when executed on a single core — but we are unable to propose a testable quantitative model for the source of the throughput limitation.

Dr. Damon McDougall gave a short presentation on this study at the IXPUG 2018 Fall Conference (pdf) — I originally wrote these notes to help organize my thoughts as we were preparing the IXPUG presentation, and later decided that the extra details contained here are interesting enough to post.

 

Background:

The Xeon Phi x200 (Knights Landing) processor is Intel’s second-generation many-core product.  The Xeon Phi 7250 processors at TACC have 68 cores per processor, and each core has two 512-bit SIMD vector pipelines.   For 64-bit floating-point data, each 512-bit Fused Multiply-Add (FMA) instruction performs 16 floating-point operations (8 multiplies and 8 adds).  Each of the two vector units can issue one FMA instruction per cycle, assuming that there are enough independent accumulators to tolerate the 6-cycle dependent-operation latency.  The minimum number of independent accumulators required is: 2 VPUs times 6 cycles = 12 independent accumulators.

The Xeon Phi x200 has six execution units (two VPUs, two ALUs, and two Memory units), but is limited to two instructions per cycle by the decode, allocation, and retirement sections of the processor pipeline. (Most of the details of the Xeon Phi x200 series presented here are from the Intel-authored paper http://publications.computer.org/micro/2016/07/09/knights-landing-second-generation-intel-xeon-phi-product/.)

In our initial evaluation of the Xeon Phi x200, we did not fully appreciate the two-instruction-per-cycle limitation.  Since “peak performance” for the processor is two (512-bit SIMD) FMA instructions per cycle, any instructions that are not FMA instructions subtract directly from the available peak performance.  On “mainstream” Xeon processors, there is plenty of instruction decode/allocation/retirement bandwidth to overlap extra instructions with the SIMD FMA instructions, so we don’t usually even think about them.  Pointer arithmetic, loop index increments, loop trip count comparisons, and conditional branches are all essentially “free” on mainstream Xeon processors, but have to be considered very carefully on the Xeon Phi x200.

A “best case” scenario: DGEMM

The double-precision matrix multiplication routine “DGEMM” is typically the computational kernel that achieves the highest fraction of peak performance on high performance computing systems.  Hardware performance counter results for a simple benchmark code calling Intel’s optimized DGEMM implementation for this processor (from the Intel MKL library) show that about 20% of the dynamic instruction count consists of instructions that are not packed SIMD operations (i.e., not FMAs).  These “non-FMA” instructions include the pointer manipulation and loop control required by any loop-based code, plus explicit loads from memory and a smaller number of stores back to memory. (These are in addition to the loads that can be “piggy-backed” onto FMA instructions using the single memory input operand available for most computational operations in the Intel instruction set).  DGEMM implementations also typically require software prefetches to be interspersed with the computation to minimize memory stalls when moving from one “block” of the computation to the next.

Published DGEMM benchmark results for the Xeon Phi 7250 processor (https://software.intel.com/en-us/mkl/features/benchmarks) show maximum values of about 2100 GFLOPS when using all 68 cores (a very approximate estimate from a bar chart). Tests on one TACC development node gave slightly higher results — 2148 GFLOPS to 2254 GFLOPS (average = 2235 GFLOPS), for a set of 180 trials of a DGEMM test with M=N=K=8000 and using all 68 cores.   These runs reported a stable average frequency of 1.495 GHz, so the average of 2235 GFLOPS therefore corresponds to 68.7% of the peak performance of (68 cores * 32 FP ops/cycle/core * 1.495 GHz =) 3253 GFLOPS (note1). This is an uninspiring fraction of peak performance that would normally suggest significant inefficiencies in either the hardware or software.   In this case, however, the average of 2235 GFLOPS is more appropriately interpreted as 85.9% of the “adjusted peak performance” of 2602 GFLOPS (80% of the raw peak value — as limited by the instruction mix of the DGEMM kernel).    At 85.9% of the “adjusted peak performance”, there is no longer a significant upside to performance tuning.

Notes on DGEMM:

  1. For recent processors with power-limited frequencies, compute-intensive kernels will experience an average frequency that is a function of the characteristics of the specific processor die and of the effectiveness of the cooling system at the time of the run.  Other nodes will show lower average frequencies due to power/cooling limitations, so the numerical values must be adjusted accordingly — the percentage of peak obtained should be unchanged.
  2. It is possible to get higher values by disabling the L2 Hardware Prefetchers — up to about 2329 GFLOPS (89% of “adjusted peak”) — but that is a longer story for another day….
  3. The DGEMM efficiency values are not significantly limited by the use of all cores.  Single-core testing with the same DGEMM routine showed maximum values of just under 72% of the nominal peak (about 90% of “adjusted peak”).

Please Note: The throughput limitation we observed (12 vector instructions per 7 cycles = 85.7% of nominal peak) is significantly higher than the instruction-issue-limited vector throughput of the best DGEMM measurement we have ever observed (~73% of peak, or approximately 10 vector instructions every 7 cycles).   We are unaware of any real computational kernels whose optimal implementation will contain significantly fewer than 15% non-vector-pipe instructions, so the throughput limitation we observe is unlikely to be a significant performance limiter on any real scientific codes.  This note is therefore not intended as a criticism of the Xeon Phi x200 implementation — it is intended to document our exploration of the characteristics of this performance limitation.

Initial Experiments:

In order to approach the peak performance of the processor, we started with a slightly modified version of the code from the Colfax report above.  This code is entirely synthetic — it performs repeated FMA operations on a set of registers with no memory references in the inner loop.  The only non-FMA instructions are those required for loop control, and the number of FMA operations in the loop can be easily adjusted to change the fraction of “overhead” instructions.  The throughput limitation can be observed on a single core, so the following tests and analysis will be limited to this case.
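For readers who want a concrete picture of the test structure, here is a minimal C-with-intrinsics sketch of this kind of kernel.  It is an illustration only (the function and variable names are mine, and it is not the actual Colfax or TACC source, which should be compiled with AVX-512 support, e.g., icc -xMIC-AVX512); the essential features are the 12 independent accumulators and the two loop-invariant inputs.

#include <immintrin.h>

/* Illustrative sketch only: 12 independent 512-bit accumulators, enough to
 * cover 2 VPUs x 6-cycle FMA latency.  The two inputs (a, b) are loop-invariant. */
double fma_kernel(long iterations)
{
    __m512d acc[12];
    const __m512d a = _mm512_set1_pd(1.0000001);
    const __m512d b = _mm512_set1_pd(0.9999999);
    for (int i = 0; i < 12; i++) acc[i] = _mm512_set1_pd(1.0);

    for (long k = 0; k < iterations; k++) {
        for (int i = 0; i < 12; i++) {          /* expected to be fully unrolled */
            acc[i] = _mm512_fmadd_pd(a, b, acc[i]);
        }
    }

    double sum = 0.0;                            /* keep the results "live" */
    for (int i = 0; i < 12; i++) sum += _mm512_reduce_add_pd(acc[i]);
    return sum;
}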

Using the minimum number of accumulator registers needed to tolerate the pipeline latency (12), the assembly code for the inner loop is:

..B1.8: 
 addl $1, %eax
 vfmadd213pd %zmm16, %zmm17, %zmm29 
 vfmadd213pd %zmm16, %zmm17, %zmm28
 vfmadd213pd %zmm16, %zmm17, %zmm27 
 vfmadd213pd %zmm16, %zmm17, %zmm26 
 vfmadd213pd %zmm16, %zmm17, %zmm25 
 vfmadd213pd %zmm16, %zmm17, %zmm24 
 vfmadd213pd %zmm16, %zmm17, %zmm23 
 vfmadd213pd %zmm16, %zmm17, %zmm22 
 vfmadd213pd %zmm16, %zmm17, %zmm21 
 vfmadd213pd %zmm16, %zmm17, %zmm20 
 vfmadd213pd %zmm16, %zmm17, %zmm19
 vfmadd213pd %zmm16, %zmm17, %zmm18 
 cmpl $1000000000, %eax 
 jb ..B1.8

This loop contains 12 independent 512-bit FMA instructions and is executed 1 billion times.   Timers and hardware performance counters are measured immediately outside the loop, where their overhead is negligible.   Vector registers zmm18-zmm29 are the accumulators, while vector registers zmm16 and zmm17 are loop-invariant.

The loop has 15 instructions, so it must require a minimum of 7.5 cycles to issue.  The three loop control instructions take 2 cycles (instead of 1.5) when measured in isolation.  When combined with other instructions, the loop control instructions require 1.5 cycles when combined with an odd number of additional instructions or 2.0 cycles in combination with an even number of additional instructions — i.e., in the absence of other stalls, the conditional branch causes the loop cycle count to round up to an integer value.  Equivalent sequences of two instructions that avoid the explicit compare instruction (e.g., pre-loading %eax with 1 billion and subtracting 1 each iteration) have either 1.0-cycle or 1.5-cycle overhead depending on the number of additional instructions (again because the conditional branch causes the total cycle count to round up to an integer value).   The 12 FMA instructions are expected to require 6 cycles to issue, for a total of 8 cycles per loop iteration, or 8 billion cycles in total.   Experiments showed a highly repeatable 8.05 billion cycle execution time, with the 0.6% extra cycles almost exactly accounted for by the overhead of OS scheduler interrupts (1000 per second on this CentOS 7.4 kernel).   Note that 12 FMAs in 8 cycles is only 75% of peak, but the discrepancy here can be entirely attributed to loop overhead.
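In compact form, the expected cost of the 12-FMA loop is:

\[
\left\lceil \frac{15\ \text{instructions}}{2\ \text{instructions/cycle}} \right\rceil = 8\ \frac{\text{cycles}}{\text{iteration}},
\qquad
\frac{12\ \text{FMAs}}{8\ \text{cycles}} = 1.5\ \frac{\text{FMAs}}{\text{cycle}} = 75\%\ \text{of the 2-FMA/cycle peak}
\]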

Further unrolling of the loop decreases the number of “overhead” instructions, and we expected to see an asymptotic approach to peak as the loop length was increased.  We were disappointed.

The first set of experiments compared the cycle and instruction counts for the loop above with the results from unrolling the loop two and four times.    The table below shows the expected and measured instruction counts and cycle counts.

KNL 12-accumulator FMA throughput

Unrolling factor                                     1         2         4
FMA instructions per unrolled loop iteration        12        24        48
Non-FMA instructions per unrolled loop iteration     3         3         3
Total instructions per unrolled loop iteration      15        27        51
Expected instructions (B)                           15        13.5      12.75
Measured instructions (B)                           15.0156   13.5137   12.7637
Expected cycles (B)                                  8.0       7.0       6.5
Measured cycles (B)                                  8.056     7.086     7.085
Unexpected cycles (B)                                0.056     0.086     0.585
Expected % Peak GFLOPS                              75.0%     85.71%    92.31%
Measured % Peak GFLOPS                              74.48%    84.67%    84.69%
% Performance shortfall                              0.70%     1.22%     8.26%
Comparison of expected and observed cycle counts for loops with 12 independent accumulators updated by 512-bit VFMADD213PD instructions on an Intel Xeon Phi 7250 processor. The loop is repeated until 12 billion FMA instructions have been executed.

 

Notes on methodology:

  • The unrolling was controlled by a “#pragma unroll_and_jam()” directive in the source code.   In each case the assembly code was carefully examined to ensure that the loop structure matched expectations — 12,24,48 FMAs with the appropriate ordering (for dependencies) and the same 3 loop control instructions (but with the iteration count reduced proportionately for the unrolled cases).
  • The node was allocated privately, non-essential daemons were disabled, and the test thread was bound to a single logical processor.
  • Instruction counts were obtained inline using the RDPMC instruction to read Fixed-Function Counter 0 (INST_RETIRED.ANY), while cycle counts were obtained using the RDPMC instruction to read Fixed-Function Counter 1 (CPU_CLK_UNHALTED.THREAD).  (A sketch of this style of counter access is shown just after this list.)
  • Execution time was greater than 4 seconds in all cases, so the overhead of reading the counters was at least 7 orders of magnitude smaller than the execution time.
  • Each test was run at least three times, and the trial with the lowest cycle count was used for the analysis in the table.
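The inline counter reads mentioned above follow the pattern shown in the following sketch (a minimal sketch, not the exact harness used for these tests).  Setting bit 30 of the RDPMC counter index selects the fixed-function counters; user-space RDPMC access must be enabled on the system for this to work.

#include <stdint.h>

/* Read one hardware performance counter with RDPMC.  Indices with bit 30 set
 * select the fixed-function counters: 0x40000000 = INST_RETIRED.ANY,
 * 0x40000001 = CPU_CLK_UNHALTED.THREAD.  (User-space RDPMC access must be
 * enabled, e.g., via /sys/bus/event_source/devices/cpu/rdpmc on Linux.) */
static inline uint64_t rdpmc(uint32_t counter)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(counter));
    return ((uint64_t)hi << 32) | lo;
}

/* Typical use: read immediately before and after the timed loop, e.g.
 *   uint64_t i0 = rdpmc(0x40000000), c0 = rdpmc(0x40000001);
 *   run_kernel();
 *   uint64_t i1 = rdpmc(0x40000000), c1 = rdpmc(0x40000001);           */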

Comments on results:

  • The 12-FMA loop required 0.7% more cycles than expected.
    • Later experiments show that this overhead is essentially identical to the fraction of cycles spent servicing the 1-millisecond OS scheduler interrupt.
  • The 24-FMA loop required 1.2% more cycles than expected.
    • About half of these extra cycles can be explained by the OS overhead, leaving an unexplained overhead in the 0.5%-0.6% range (not large enough to worry about).
  • The 48-FMA loop required 8.3% more cycles than expected.
    • Cycle counts across trials varied by no more than 1 part in 4000, making this overhead at least 300 times the run-to-run variability.
  • The two unrolled cases gave performance results that appear to be bounded above by 6/7 (85.71%) of peak.

 

Initial (Incorrect) Hypothesis

My immediate response to the results was that this was a simple matter of running out of rename registers.   Modern processors (almost) all have more physical registers than they have register names.  The hardware automatically renames registers to avoid false dependencies, but with deep execution pipelines (particularly for floating-point operations), it is possible to run out of rename registers before reaching full performance.

This is easily illustrated using Little’s Law from queuing theory, which can be expressed as:

Throughput = Concurrency / Occupancy

For this scenario, “Throughput” has units of register allocations per cycle, “Concurrency” is the number of registers in use in support of all of the active instructions, and “Occupancy” is the average number of cycles that a register is busy during the execution of an instruction.

An illustrative example:   

The IBM POWER4 has 72 floating-point rename registers and two floating-point arithmetic units capable of executing fused multiply-add (FMA) instructions (a = b+c*d).   Each FMA instruction requires four registers, and these registers are all held for some number of cycles (discussed below), so full performance (both FMA units starting new operations every cycle) would require eight registers to be allocated each cycle (and for these registers to remain occupied for the duration of the corresponding instruction).   We can estimate the duration by reviewing the execution pipeline diagram (Figure 2-3) in The POWER4 Processor Introduction and Tuning Guide.  The exact details of when registers are allocated and freed are not published, but if we assume that registers are allocated in the “MP” stage (“Mapping”) and held until the “CP” (“Completion”, aka “retirement”) stage, then the registers will be held for a total of 12 cycles.  The corresponding pipeline stages from Figure 2-3 are: MP, ISS, RF, F1, F2, F3, F4, F5, F6, WB, Xfer, CP.

Restating this in terms of Little’s Law, the peak performance of 2 FMAs per cycle corresponds to a “Throughput” of 8 registers allocated per cycle.  With an “Occupancy” of 12 cycles for each of those registers, the required “Concurrency” is 8*12 = 96 registers.  But, as noted above, the POWER4 only has 72 floating-point rename registers.  If we assume a maximum “Concurrency” of 72 registers, the “Throughput” can be computed as 72/12 = 6 registers per cycle, or 75% of the target throughput of 8 registers allocated per cycle.    It is perhaps not a coincidence that the maximum performance I ever saw on DGEMM on a POWER4 system (while working for IBM in the POWER4 design team) was just under 70% of 2 FMAs/cycle, or just over 92% of the occupancy-limited throughput of 1.5 FMAs/cycle.  
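Written out compactly in the terms of Little's Law:

\[
\text{Throughput} = \frac{\text{Concurrency}}{\text{Occupancy}} = \frac{72\ \text{rename registers}}{12\ \text{cycles}} = 6\ \frac{\text{registers}}{\text{cycle}} = \frac{6}{8}\times 2\ \frac{\text{FMAs}}{\text{cycle}} = 1.5\ \frac{\text{FMAs}}{\text{cycle}}
\]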


For comparison, the IBM POWER5 processor (similar to POWER4, but with 120 floating-point rename registers) delivered up to 94% of 2 FMAs/cycle on DGEMM, suggesting that a DGEMM efficiency in the 90%-95% of peak range is appropriate for DGEMM on this architecture family.

Applying this model to Xeon Phi x200 is slightly more difficult for a number of reasons, but back-of-the-envelope estimates suggested that it was plausible.

The usual way of demonstrating that rename register occupancy is limiting performance is to change the instructions to reduce the number of registers used per instruction, or the number of cycles that the instructions hold the register, or both.  If this reduces the required concurrency to less than the number of available rename registers, full performance should be obtained.

Several quick tests with instructions using fewer registers (e.g., simple addition instead of FMA) or with fewer registers and shorter pipeline latency (e.g., bitwise XOR) showed no change in throughput — the processor still delivered a maximum throughput of 12 vector instructions every 7 cycles.

Our curiosity was piqued by these results, and more experiments followed.   Those results piqued our curiosity even further, eventually leading to a suite of several hundred experiments in which we varied everything that we could figure out how to vary.

We will spare the reader the chronological details, and instead provide a brief overview of the scope of the testing that followed.

Extended Experiments:

Additional experiments (each performed with multiple degrees of unrolling) that showed no change in the limitation of 12 vector instructions per 7 cycles included:

  1. Increasing the dependency latency from 6 cycles to 8 cycles (i.e., using 16 independent vector accumulators) and extending the unrolling to up to 128 FMAs per inner loop iteration.
  2. Increasing the dependency latency to 10 cycles (20 independent vector accumulators), with unrolling to test 20, 40, 60, 80 FMAs per inner loop iteration.
  3. Increasing the dependency latency to 12 cycles (24 independent vector accumulators).
  4. Replacing the 512-bit VFMADD213PD instructions with the scalar version VFMADD213SD.  (This is the AVX-512 EVEX-encoded instruction, not the VEX-encoded version.)
  5. Replacing the 512-bit VFMADD213PD instructions with the AVX2 (VEX-encoded) 256-bit versions.
  6. Increasing the number of loop-invariant registers used from 2 to 4 to 8 (and ensuring that consecutive instructions used different pairs of loop-invariant registers).
  7. Decreasing the number of loop-invariant registers per FMA from 2 to 1, drawing the other input from the output of an FMA instruction at least 12 instructions (6 cycles) away.
  8. Replacing the VFMADD213PD instructions with shorter-latency instructions (VPADDQ and VPXORQ were tested independently).
  9. Replacing the VFMADD213PD instructions with an instruction that has both shorter latency and fewer operands: VPABSQ (which has only 1 input and 1 output register).
  10. Replacing every other VFMADD213PD instruction with a shorter-latency instruction (VPXORQ).
  11. Replacing the three-instruction loop control (add, compare, branch) with two-instruction loop control (subtract, branch).  The three-instruction version counts up from zero, then compares to the iteration count to set the condition code for the terminal conditional branch.  The two-instruction version counts down from the target iteration count to zero, allowing us to use the condition code from the subtract (i.e., not zero) as the branch condition, so no compare instruction is required.  The instruction counts changed as expected, but the cycle counts did not.
  12. Forcing 16-Byte alignment for the branch target at the beginning of the inner loop. (The compiler did this automatically in some cases but not in others — we saw no difference in cycle counts when we forced it to occur).
  13. Many (not all) of the executable files were disassembled with “objdump -d” to ensure that the encoding of the instructions did not exceed the limit of 3 prefixes or 8 Bytes per instruction.  We saw no cases where either of these rules was violated in the inner loops.

Additional experiments showed that the throughput limitation only applies to instructions that execute in the vector pipes:

  1. Replacing the Vector instructions with integer ALU instructions (ADDL) –> performance approached two instructions per cycle asymptotically, as expected.
  2. Replacing the Vector instructions with Load instructions from L1-resident data (to vector registers) –> performance approached two instructions per cycle asymptotically, as expected.

Some vector instructions can execute in only one of the two vector pipelines.  This is mentioned in the IEEE Micro paper linked above, but is discussed in more detail in Chapter 17 of the document “Intel 64 and IA-32 Architectures Optimization Reference Manual”, (Intel document 248966, revision 037, July 2017).  In addition, Agner Fog’s “Instruction Tables” document (http://www.agner.org/optimize/instruction_tables.pdf) shows which of the two vector units is used for a large subset of the instructions that can only execute in one of the VPUs.   This allows another set of experiments that show:

  •   Each vector pipeline can sustain its full rate of 1 instruction per cycle when used in isolation.
    • VPU0 was tested with VPERMD, VPBROADCASTQ, and VPLZCNT.
    • VPU1 was tested with KORTESTW.
  • Alternating a VPU0 instruction (VPLZCNTQ) with a VPU1 instruction (KORTESTW) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
  • Alternating a VPU0 instruction with an FMA (that can be executed in either VPU) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
    • This was tested with VPERMD and VPLZCNT as the VPU0 instructions.
  • One specific combination of VPU0 and FMA instructions gave a reduced throughput of 1 vector instruction per cycle: VPBROADCASTQ alternating with FMA.
    • VPBROADCASTQ requires a read from a GPR (in the integer side of the core), then broadcasts the result across all the lanes of a vector register.
    • This operation is documented (in the Intel Optimization Reference Manual) as having a latency of 2 cycles and a maximum throughput of 1 per cycle (as we saw with VPBROADCASTQ running in isolation).
    • The GPR to VPU move is a sufficiently uncommon access pattern that it is not particularly surprising to find a case for which it inhibits parallelism across the VPUs, though it is unclear why this is the only case we found that allows the use of both vector pipelines but is still limited to 1 instruction per cycle.

Additional Performance Counter Measurements and Second-Order Effects:

After the first few dozen experiments, the test codes were augmented with more hardware performance counters.  The full set of counters measured before and after the loop includes:

  • Time Stamp Counter (TSC)
  • Fixed-Function Counter 0 (Instructions Retired)
  • Fixed-Function Counter 1 (Core Cycles Not Halted)
  • Fixed-Function Counter 2 (Reference Cycles Not Halted)
  • Programmable PMC0
  • Programmable PMC1

The TSC was collected with the RDTSC instruction, while the other five counters were collected using the RDPMC instruction.  The total overhead for measuring these six counters is about 250 cycles, compared to a minimum of 4 billion cycles for the monitored loop.

Several programmable performance counter events were collected as “sanity checks”, with negligible counts (as expected):

  • FETCH_STALL.ICACHE_FILL_PENDING_CYCLES
  • MS_DECODED.MS_ENTRY
  • MACHINE_CLEARS.FP_ASSIST

Another programmable performance counter event was collected to verify that the correct number of VPU instructions were being executed:

  • UOPS_RETIRED.PACKED_SIMD
    • Typical result:
      • Nominal expected 16,000,000,000
      • Measured in user-space: 16,000,000,016
        • This event does not count the 16 loads before the inner loop, but does count the 16 stores after the end of the inner loop.
      • Measured in kernel-space: varied from 19,626 to 21,893.
        • Not sure why the kernel is doing packed SIMD instructions, but these are spread across more than 6 seconds of execution time (>6000 scheduler interrupts).
        • These kernel instruction counts are 6 orders of magnitude smaller than the counts for tested code, so they will be ignored here.

The performance counter events with the most interesting results were:

  • NO_ALLOC_CYCLES.RAT_STALL — counts the number of core cycles in which no micro-ops were allocated and the “RAT Stall” (reservation station full) signal is asserted.
  • NO_ALLOC_CYCLES.ROB_FULL — counts the number of core cycles in which no micro-ops were allocated and the Reorder Buffer (ROB) was full.
  • RS_FULL_STALL.ALL — counts the number of core cycles in which the allocation pipeline is stalled and any of the Reservation Stations is full
    • This should be the same as NO_ALLOC_CYCLES.RAT_STALL, and in all but one case the numbers were nearly identical.
    • The RS_FULL_STALL.ALL event includes a Umask of 0x1F — five bits set.
      • This is consistent with the IEEE Micro paper (linked above) that shows 2 VPU reservation stations, 2 integer reservation stations, and one memory unit reservation station.
      • The only other Umask defined in the Intel documentation is RS_FULL_STALL.MEC (“Memory Execution Cluster”) with a value of 0x01.
      • Directed testing with VPU0 and VPU1 instructions shows that a Umask of 0x08 corresponds to the reservation station for VPU0 and a Umask of 0x10 corresponds to the reservation station for VPU1.

For the all-FMA test cases that were expected to sustain more than 12 VPU instructions per 7 cycles, the NO_ALLOC_CYCLES.RAT_STALL and RS_FULL_STALL.ALL events were a near-perfect match for the number of extra cycles taken by the loop execution.  The values were slightly larger than the computed number of “extra cycles”, but were always consistent with the assumption of 1.5 cycles “overhead” for the three loop control instructions (matching the instruction issue limit), rather than the 2.0 cycles that I assumed as a baseline.  This is consistent with a NO_ALLOC_CYCLES.RAT_STALL count that overlaps with cycles that are simultaneously experiencing a branch-related pipeline bubble. One or the other should be counted as a stall, but not both.   For these cases, the NO_ALLOC_CYCLES.ROB_FULL counts were negligible.

Interestingly, the individual counts for RS_FULL_STALL for the two vector pipelines were sometimes of similar magnitude and sometimes uneven, but were extremely stable for repeated runs of a particular binary.  The relative counts for the stalls in the two vector pipelines can be modified by changing the code alignment and/or instruction ordering.  In limited tests, it was possible to make either VPU report more stalls than the other, but in every case, the “effective” stall count (VPU0 stalled OR VPU1 stalled) was the amount needed to reduce the throughput to 12 VPU instructions every 7 cycles.

When interleaving vector instructions of different latencies, the total number of stall cycles remained the same (i.e., enough to limit performance to 12 VPU instructions per 7 cycles), but these were split between RAT_STALLs and ROB_STALLs in different ways for different loop lengths.   Typical results showed a pattern like:

  • 16 VPU instructions per loop iteration: approximately zero stalls, as expected
  • 32 VPU instructions per loop iteration: approximately 6.7% RAT_STALLs and negligible ROB_STALLs
  • 64 VPU instructions per loop iteration: ~1% RAT_STALLs (vs ~10% in the all-FMA case) and about 9.9% ROB_STALLs (vs ~0% in the all-FMA case).
    • Execution time increased by about 0.6% relative to the all-FMA case.
  • 128 VPU instructions per loop iteration: negligible RAT_STALLS (vs ~12% in the all-FMA case) and almost 20% ROB_STALLS (vs 0% in the all-FMA case).
    • Execution time increased by 9%, to a value that is ~2.3% slower than the 16-VPU-instruction case.

The conversion of RAT_STALLs to ROB_STALLs when interleaving instructions of different latencies does not seem surprising.  RAT_STALLs occur when instructions are backed up before execution, while ROB_STALLs occur when instructions back up before retirement.  Alternating instructions of different latencies seems guaranteed to push the shorter-latency instructions from the RAT to the ROB until the ROB backs up.  The net slowdown at 128 VPU instructions per loop iteration is not a performance concern, since asymptotic performance is available with anywhere between 24 and (almost) 64 VPU instructions in the inner loop.   These results are included because they might provide some insight into the nature of the mechanism that limits the throughput of vector instructions.

Mechanisms:

RAT_STALLs count the number of cycles in which the Allocate/Rename unit does not dispatch any micro-ops because a target Reservation Station is full.   While this does not directly equate to execution stalls (i.e., no instructions dispatched from the Vector Reservation Station to the corresponding Vector Execution Pipe), the only way the Reservation Station can become full (given an instruction stream with enough independent instructions) is the occurrence of cycles in which instructions are received (from the Allocate/Rename unit), but in which no instruction can be dispatched.    If this occurs repeatedly, the 20-entry Reservation Station will become full, and the RAT_STALL signal will be asserted to prevent the Allocate/Rename unit from sending more micro-ops.

An example code that generates RAT Stalls is a modification of the test code using too few independent accumulators to fully tolerate the pipeline latency.  For example, using 10 accumulators, the code can only tolerate 5 cycles of the 6 cycle latency of the FMA operations.  This inhibits the execution of the FMAs, which fill up the Reservation Station and back up to stall the Allocate/Rename.   Tests with 10..80 FMAs per inner loop iteration show RAT_STALL counts that match the dependency stall cycles that are not overlapped with loop control stall cycles.

We know from the single-VPU tests that the 20-entry Reservation Station for each Vector pipeline is big enough for that pipeline’s operation — no stall cycles were observed.  Therefore the stalls that prevent execution dispatch must be in the shared resources further down the pipeline.   From the IEEE Micro paper, the first execution step is to read the input values from the “rename buffer and the register file”, after which the operations proceed down their assigned vector pipeline.  The vector pipelines should be fully independent until the final execution step in which they write their output values to the rename buffer.  After this, the micro-ops will wait in the Reorder Buffer until they can be retired in program order.  If the bottleneck was in the retirement step, then I would expect the stalls to be attributed to the ROB, not the RAT.   Since the stalls in the most common cases are overwhelmingly RAT stalls, I conclude that the congestion is not *directly* related to instruction retirement.

As mentioned above, the predominance of RAT stalls suggests that limitations at Retirement cannot be directly responsible for the throughput limitation, but there may be an indirect mechanism at work.   The IEEE Micro paper’s section on the Allocation Unit says:

“The rename buffer stores the results of the in-flight micro-ops until they retire, at which point the results are transferred to the architectural register file.”

This comment is consistent with Figure 3 of the paper and with the comment that vector instructions read their input arguments from the “rename buffer and the register file”, implying that the rename buffer and register file are separate register arrays.  In many processor implementations there is a single “physical register” array, with the architectural registers being the subset of the physical registers that are pointed to by a mapping vector.  The mapping vector is updated every time instructions retire, but the contents of the registers do not need to be copied from one place to another.  The description of the Knights Landing implementation suggests that at retirement, results are read from the “rename buffer” and written to the “register file”.  This increases the number of ports required, since this must happen every cycle in parallel with the first step of each of the vector execution pipelines.  It seems entirely plausible that such a design could include a systematic conflict (a “structural hazard”) between the accesses needed by the execution pipes and the accesses needed by the retirement unit.  If this conflict is resolved in favor of the retirement unit, then execution would be stalled, the Reservation Stations would fill up, and the observed behavior could be generated.   If such a conflict exists, it is clearly independent of the number of input arguments (since instructions with 1, 2, and 3 input arguments have the same behavior), leaving the single output argument as the only common feature.  If such a conflict exists, it must almost certainly also be systematic — occurring independent of alignment, timing, or functional unit details — otherwise it seems likely that we would have seen at least one case in the hundreds of tests here that violates the 12/7 throughput limit.

Tests using a variant of the test code with much smaller loops (varying between 160 and 24,000 FMAs per measurement interval, repeated 100,000 times) also strongly support the 12/7 throughput limit.  In every case the minimum cycle count over the 100,000 iterations  was consistent with 12 VPU instructions every 7 cycles (plus measurement overhead).

 

Summary:

The Intel Xeon Phi x200 (Knights Landing) appears to have a systematic throughput limit of 12 Vector Pipe instructions per 7 cycles — 6/7 of the nominal peak performance.  This throughput limitation is not displayed by the integer functional units or the memory units.  Due to the two-instruction-per-cycle limitations of allocate/rename/retire, this performance limit in the vector units is not expected to have an impact on “real” codes.   A wide variety of tests were performed to attempt to develop quantitative models that might account for this limitation, but none matched the specifics of the observed timing and performance counts.

Postscript:

After Damon McDougall’s presentation at the IXPUG 2018 Fall Conference, we talked to a number of Intel engineers who were familiar with this issue.  Unfortunately, we did not get a clear indication of whether their comments were covered by non-disclosure agreements, so if they gave us an explanation, I can’t repeat it….

Posted in Computer Hardware, Performance, Performance Counters | Comments Off on A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Posted by John D. McCalpin, Ph.D. on 6th December 2016

The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.

It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)

The modes that are important are:

  • “Flat” vs “Cache”
    • In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
      • The OS exposes this memory as being on “NUMA node 1”, so it can be accessed using the standard NUMA control facilities (e.g., numactl).  (An example numactl command is given just after this list.)
      • Sustained bandwidth from MCDRAM is highest in “Flat” mode.
    • In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
      • In this mode the MCDRAM is invisible and effectively uncontrollable.  I will discuss the performance characteristics of Cache mode at a later date.
  • “All-to-All” vs “Quadrant”
    • In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
      • Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
    • In “Quadrant” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but each address is assigned to one of the MCDRAM controllers in the same “quadrant” as the coherence controller.
      • This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
      • Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
        • Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
  • “Sub-NUMA-Cluster”
    • There are several of these modes, only one of which will be discussed here.
    • “Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
      • “node 0” owns the 1st quarter of contiguous physical address space.
        • The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
        • Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
        • The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
      • Ditto for nodes 1, 2, and 3.
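As an aside: in Flat mode, with the MCDRAM exposed as NUMA node 1 (as described above), data placement needs nothing beyond the standard NUMA tools.  For example, "numactl --membind=1 ./a.out" forces all allocations to come from MCDRAM, while "numactl --preferred=1 ./a.out" prefers MCDRAM but falls back to DDR4 if the 16 GiB fills up.  (The program name here is just a placeholder, and in the SNC4 modes the node numbering is different.)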

The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).

My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory.  The values presented are averaged over many addresses, with the ranges showing the variation of average latency across cores.

Mode of Operation                 Flat-Quadrant   Flat-All2All   SNC4 local   SNC4 remote
MCDRAM maximum latency (ns)       156.1           158.3          153.6        164.7
MCDRAM average latency (ns)       154.0           155.9          150.5        156.8
MCDRAM minimum latency (ns)       152.3           154.4          148.3        150.3
MCDRAM standard deviation (ns)    1.0             1.0            0.9          3.1

Caveats:

  • My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
  • Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
    • E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
  • Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.

Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….

The corresponding data for addresses mapped to DDR4 memory are included in the table below:

Mode of Operation                 Flat-Quadrant   Flat-All2All   SNC4 local   SNC4 remote
DDR4 maximum latency (ns)         133.3           136.8          130.0        141.5
DDR4 average latency (ns)         130.4           131.8          128.2        133.1
DDR4 minimum latency (ns)         128.2           128.5          125.4        126.5
DDR4 standard deviation (ns)      1.2             2.4            1.1          3.1

There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.

Posted in Cache Coherence Implementations, Computer Architecture, Computer Hardware, Performance | Comments Off on Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Memory Bandwidth on Xeon Phi (Knights Corner)

Posted by John D. McCalpin, Ph.D. on 5th December 2013

A Quick Note

There are a lot of topics that could be addressed here, but this short note will focus on bandwidth from main memory (using the STREAM benchmark) as a function of the number of threads used.

Published STREAM Bandwidth Results

  • Official STREAM submission at: http://www.cs.virginia.edu/stream/stream_mail/2013/0015.html
  • Compiled with icc -mmic -O3 -openmp -DNTIMES=100 -DSTREAM_ARRAY_SIZE=64000000 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -opt-streaming-stores always stream_5-10.c -o stream_intelopt.100x.mic
  • Configured with an array size of 64 million elements per array and 10 iterations.
  • Run with 60 threads (bound to separate physical cores) and Transparent Huge Pages.

 

Function Best Rate MB/s Avg time (sec) Min time (sec) Max time (sec)
Copy 169446.8 0.0062 0.0060 0.0063
Scale 169173.1 0.0062 0.0061 0.0063
Add 174824.3 0.0090 0.0088 0.0091
Triad 174663.2 0.0089 0.0088 0.0091

Memory Controllers

The Xeon Phi SE10P has 8 memory controllers, each controlling two 32-bit channels.  Each 32-bit channel has two GDDR5 chips, each having a 16-bit-wide interface.   Each of the 32 GDDR5 DRAM chips has 16 banks.  This gives a *raw* total of 512 DRAM banks.  BUT:

  • The two GDDR5 chips on each 32-bit channel are operating in “clamshell” mode — emulating a single GDDR5 chip with a 32-bit-wide interface.  (This is done for cost reduction — two 2 Gbit chips with x16 interfaces were presumably a cheaper option than one 4 Gbit chip with a x32 interface).  This reduces the effective number of DRAM banks to 256 (but the effective bank size is doubled from 2KiB to 4 KiB).
  • The two 32-bit channels for each memory controller operate in lockstep — creating a logical 64-bit interface.  Since every cache line is spread across the two 32-bit channels, this reduces the effective number of DRAM banks to 128 (but the effective bank size is doubled again, from 4 KiB to 8 KiB).

So the Xeon Phi SE10P memory subsystem should be analyzed as a 128-bank resource.   Intel has not disclosed the details of the mapping of physical addresses onto DRAM channels and banks, but my own experiments have shown that addresses are mapped to a repeating permutation of the 8 memory controllers in blocks of 62 cache lines.  (The other 2 cache lines in each 64-cacheline block are used to hold the error-correction codes for the block.)
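A rough sketch of this mapping is given below.  It is an illustration only: the actual permutation of controllers has not been disclosed, so a simple in-order rotation stands in for it, and the function name is mine.

#include <stdint.h>

/* Illustrative only: user-visible cache lines advance through the 8 memory
 * controllers in blocks of 62 lines (the remaining 2 lines of each 64-line
 * DRAM block hold the ECC bits and are not visible in the address space).
 * The real permutation of controllers is undisclosed; identity order is
 * used here as a placeholder. */
static int memory_controller_for(uint64_t physical_address)
{
    uint64_t cache_line = physical_address >> 6;   /* 64-Byte cache lines  */
    uint64_t block      = cache_line / 62;         /* 62 data lines/block  */
    return (int)(block % 8);                       /* stand-in permutation */
}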

Bandwidth vs Number of Data Access Streams

One “rule of thumb” that I have found on Xeon Phi is that memory-bandwidth-limited jobs run best when the number of read streams across all the threads is close to, but less than, the number of GDDR5 DRAM banks.  On the Xeon Phi SE10P coprocessors in the TACC Stampede system, this is 128 (see Note 1).    Some data from the STREAM benchmark supports this hypothesis:

Kernel Reads Writes 2/core 3/core 4/core
Copy 1 1 -0.8% -5.2% -7.3%
Scale 1 1 -1.0% -3.3% -6.7%
Add 2 1 -3.1% -12.0% -13.6%
Triad 2 1 -3.6% -11.2% -13.5%

From these results you can see that the Copy and Scale kernels have about the same performance with either 1 or 2 threads per core (61 or 122 read streams), but drop 3%-7% when generating more than 128 address streams, while the Add and Triad kernels are definitely best with one thread per core (122 read streams), and drop 3%-13% when generating more than 128 address streams.

So why am I not counting the write streams?

I found this puzzling for a long time, then I remembered that the Xeon E5-2600 series processors have a memory controller that supports multiple modes of prioritization.  The default mode is to give priority to reads while buffering stores.  Once the store buffers in the memory controller reach a “high water mark”, the mode shifts to giving priority to the stores while buffering reads.  The basic architecture is implied by the descriptions of the “major modes” in section 2.5.8 of the Xeon E5-2600 Product Family Uncore Performance Monitoring Guide (document 327043 — I use revision 001, dated March 2012).      So *if* Xeon Phi adopts a similar multi-mode strategy, the next question is whether the duration in each mode is long enough that the open page efficiency is determined primarily by the number of streams in each mode, rather than by the total number of streams.   For STREAM Triad, the observed bandwidth is ~175 GB/s.  Combining this with the observed average memory latency of about 275 ns (idle) means that at least 175*275=48125 bytes need to be in flight at any time.  This is about 768 cache lines (rounded up to a convenient number) or 96 cache lines per memory controller.  For STREAM Triad, this corresponds to an average of 64 cache line reads and 32 cache line writes in each memory controller at all times.   If the memory controller switches between “major modes” in which it does 64 cache line reads (from two read streams, and while buffering writes) and 32 cache line writes (from one write stream, and while buffering reads), the number of DRAM banks needed at any one time should be close to the number of banks needed for the read streams only….

Posted in Computer Architecture, Performance | Comments Off on Memory Bandwidth on Xeon Phi (Knights Corner)

Some comments on the Xeon Phi coprocessor

Posted by John D. McCalpin, Ph.D. on 17th November 2012

As many of you know, the Texas Advanced Computing Center is in the midst of installing “Stampede” — a large supercomputer using both Intel Xeon E5 (“Sandy Bridge”) and Intel Xeon Phi (aka “MIC”, aka “Knights Corner”) processors.

In his blog “The Perils of Parallel”, Greg Pfister commented on the Xeon Phi announcement and raised a few questions that I thought I should address here.

I am not in a position to comment on Greg’s first question about pricing, but “Dr. Bandwidth” is happy to address Greg’s second question on memory bandwidth!
This has two pieces — local memory bandwidth and PCIe bandwidth to the host. Greg also raised some issues regarding ECC and regarding performance relative to the Xeon E5 processors that I will address below. Although Greg did not directly raise issues of comparisons with GPUs, several of the topics below seemed to call for comments on similarities and differences between Xeon Phi and GPUs as coprocessors, so I have included some thoughts there as well.

Local Memory Bandwidth

The Intel Xeon Phi 5110P is reported to have 8 GB of local memory supporting 320 GB/s of peak bandwidth. The TACC Stampede system employs a slightly different model Xeon Phi, referred to as the Xeon Phi SE10P — this is the model used in the benchmark results reported in the footnotes of the announcement of the Xeon Phi 5110P. The Xeon Phi SE10P runs its memory slightly faster than the Xeon Phi 5110P, but memory performance is primarily limited by available concurrency (more on that later), so the sustained bandwidth is expected to be essentially the same.

Background: Memory Balance

Since 1991, I have been tracking (via the STREAM benchmark) the “balance” between sustainable memory bandwidth and peak double-precision floating-point performance. This is often expressed in “Bytes/FLOP” (or more correctly “Bytes/second per FP Op/second”), but these numbers have been getting too small (<< 1), so for the STREAM benchmark I use "FLOPS/Word" instead (again, more correctly "FLOPs/second per Word/second", where "Word" is whatever size was used in the FP operation). The design target for the traditional "vector" systems was about 1 FLOP/Word, while cache-based systems have been characterized by ratios anywhere between 10 FLOPS/Word and 200 FLOPS/Word. Systems delivering the high sustained memory bandwidth of 10 FLOPS/Word are typically expensive and applications are often compute-limited, while systems delivering the low sustained memory bandwidth of 200 FLOPS/Word are typically strongly memory bandwidth-limited, with throughput scaling poorly as processors are added.

Some real-world examples from TACC's systems:

  • TACC’s Ranger system (4-socket quad-core Opteron Family 10h “Barcelona” processors) sustains about 17.5 GB/s (2.19 GW/s for 8-Byte Words) per node, and has a peak FP rate of 2.3 GHz * 4 FP Ops/Hz/core * 4 cores/socket * 4 sockets = 147.2 GFLOPS per node. The ratio is therefore about 67 FLOPS/Word.
  • TACC’s Lonestar system (2-socket 6-core Xeon 5600 “Westmere” processors) sustains about 41 GB/s (5.125 GW/s) per node, and has a peak FP rate of 3.33 GHz * 4 FP Ops/Hz/core * 6 cores/socket * 2 sockets = 160 GFLOPS per node. The ratio is therefore about 31 FLOPS/Word.
  • TACC’s forthcoming Stampede system (2-socket 8-core Xeon E5 “Sandy Bridge” processors) sustains about 78 GB/s (9.75 GW/s) per node, and has a peak FP rate of 2.7 GHz * 8 FP Ops/Hz/core * 8 cores/socket * 2 sockets = 345.6 GFLOPS per node. The ratio is therefore a bit over 35 FLOPS/Word.
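In each case the ratio is simply the peak FLOP rate divided by the sustained word rate (8-Byte words for double precision).  Using the Stampede numbers above as a worked example:

\[
\frac{\text{FLOPS}}{\text{Word}} \;=\; \frac{\text{peak GFLOPS}}{\text{sustained GB/s}\,/\,8\ \text{Bytes/Word}}
\;=\; \frac{345.6}{78/8} \;\approx\; 35
\]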

Again, the Xeon Phi SE10P coprocessors being installed at TACC are not identical to the announced product version, but the differences are not particularly large. According to footnote 7 of Intel’s announcement web page, the Xeon Phi SE10P has a peak performance of about 1.06 TFLOPS, while footnote 8 reports a STREAM benchmark performance of up to 175 GB/s (21.875 GW/s). The ratio is therefore about 48 FLOPS/Word — a bit less bandwidth per FLOP than the Xeon E5 nodes in the TACC Stampede system (or the TACC Lonestar system), but a bit more bandwidth per FLOP than provided by the nodes in the TACC Ranger system. (I will have a lot more to say about sustained memory bandwidth on the Xeon Phi SE10P over the next few weeks.)

The earlier GPUs had relatively high ratios of bandwidth to peak double-precision FP performance, but as the double-precision FP performance was increased, the ratios have shifted to relatively low amounts of sustainable bandwidth per peak FLOP. For the NVIDIA M2070 “Fermi” GPGPU, the peak double-precision performance is reported as 515.2 GFLOPS, while I measured sustained local bandwidth of about 105 GB/s (13.125 GW/s) using a CUDA port of the STREAM benchmark (with ECC enabled). This corresponds to about 39 FLOPS/Word. I don’t have sustained local bandwidth numbers for the new “Kepler” K20X product, but the data sheet reports that the peak memory bandwidth has been increased by 1.6x (250 GB/s vs 150 GB/s) while the peak FP rate has been increased by 2.5x (1.31 TFLOPS vs 0.515 TFLOPS), so the ratio of peak FLOPS to sustained local bandwidth must be significantly higher than the 39 for the “Fermi” M2070, and is likely in the 55-60 range — slightly higher than the value for the Xeon Phi SE10P.

Although the local memory bandwidth ratios are similar between GPUs and Xeon Phi, the Xeon Phi has a lot more cache to facilitate data reuse (thereby decreasing bandwidth demand). The architectures are quite different, but the NVIDIA Kepler K20x appears to have a total of about 2MB of registers, L1 cache, and L2 cache per chip. In contrast, the Xeon Phi has a 32kB data cache and a private 512kB L2 cache per core, giving a total of more than 30 MB of cache per chip. As the community develops experience with these products, it will be interesting to see how effective the two approaches are for supporting applications.

PCIe Interface Bandwidth

There is no doubt that the PCIe interface between the host and a Xeon Phi has a lot less sustainable bandwidth than what is available for either the Xeon Phi to its local memory or for the host processor to its local memory. This will certainly limit the classes of algorithms that can map effectively to this architecture — just as it limits the classes of algorithms that can be mapped to GPU architectures.

Although many programming models are supported for the Xeon Phi, one that looks interesting (and which is not available on current GPUs) is to run MPI tasks on the Xeon Phi card as well as on the host.

  • MPI codes are typically structured to minimize external bandwidth, so the PCIe interface is used only for MPI messages and not for additional offloading traffic between the host and coprocessor.
  • If the application allows different amounts of “work” to be allocated to each MPI task, then you can use performance measurements for your application to balance the work allocated to each processing component.
  • If the application scales well with OpenMP parallelism, one obvious approach is to place one MPI task on each Xeon E5 chip on the host (with 8 threads per task) and one MPI task on the Xeon Phi (with anywhere from 60 to 240 threads per task, depending on how your particular application scales).
  • Xeon Phi supports multiple MPI tasks concurrently (with environment variables to control which cores an MPI task’s threads can run on), so applications that do not easily allow different amounts of work to be allocated to each MPI task might run multiple MPI tasks on the Xeon Phi, with the number chosen to balance performance with the performance of the host processors. For example if the Xeon Phi delivers approximately twice the performance of a Xeon E5 host chip, then one might allocate one MPI task on each Xeon E5 (with OpenMP threading internal to the task) and two MPI tasks on the Xeon Phi (again with OpenMP threading internal to the task). If the Xeon Phi delivers three times the performance of the Xeon E5, then one would allocate three MPI tasks to the Xeon Phi, etc….

Running a full operating system on the Xeon Phi allows more flexibility in code structure than is available on (current) GPU-based coprocessors. Possibilities include:

  • Run on host and offload loops/functions to the Xeon Phi.
  • Run on Xeon Phi and offload loops/functions to the host.
  • Run on Xeon Phi and host as peers, for example with MPI.
  • Run only on the host and ignore the Xeon Phi.
  • Run only on the Xeon Phi and use the host only for launching jobs and providing external network and file system access.

Lots of things to try….

ECC

Like most (all?) GPUs that support ECC, the Xeon Phi implements ECC “inline” — using a fraction of the standard memory space to hold the ECC bits. This requires memory controller support to perform the ECC checks and to hide the “holes” in memory that contain the ECC bits, but it allows the feature to be turned on and off without incurring extra hardware expense for widening the memory interface to support the ECC bits.

Note that widening the memory interface from 64 bits to 72 bits is straightforward with x4 and x8 DRAM parts — just use 18 x4 chips instead of 16, or use 9 x8 chips instead of 8 — but is problematic with the x32 GDDR5 DRAMs used in GPUs and in Xeon Phi. A single x32 GDDR5 chip has a minimum burst of 32 Bytes so a cache line can be easily delivered with a single transfer from two “ganged” channels. If one wanted to “widen” the interface to hold the ECC bits, the minimum overhead is one extra 32-bit channel — a 50% overhead. This is certainly an unattractive option compared to the 12.5% overhead for the standard DDR3 ECC DIMMs. There are a variety of tricky approaches that might be used to reduce this overhead, but the inline approach seems quite sensible for early product generations.

Intel has not disclosed details about the implementation of ECC on Xeon Phi, but my current understanding of their implementation suggests that the performance penalty (in terms of bandwidth) is actually rather small. I don’t know enough to speculate on the latency penalty yet. All of TACC’s Xeon Phi’s have been running with ECC enabled, but any Xeon Phi owner should be able to reboot a node with ECC disabled to perform direct latency and bandwidth comparisons. (I have added this to my “To Do” list….)

Speedup relative to Xeon E5

Greg noted the surprisingly reasonable claims for speedup relative to Xeon E5. I agree that this is a good thing, and that it is much better to pay attention to application speedup than to the peak performance ratios. Computer performance history has shown that every approach used to double performance results in less than doubling of actual application performance.

Looking at some specific microarchitectural performance factors:

  1. Xeon Phi supports a 512-bit vector instruction set, which can be expected to be slightly less efficient than the 256-bit vector instruction set on Xeon E5.
  2. Xeon Phi has slightly lower L1 cache bandwidth (in terms of Bytes/Peak FP Op) than the Xeon E5, resulting in slightly lower efficiency for overlapping compute and data transfers to/from the L1 data cache.
  3. Xeon Phi has ~60 cores per chip, which can be expected to give less efficient throughput scaling than the 8 cores per Xeon E5 chip.
  4. Xeon Phi has slightly less bandwidth per peak FP Op than the Xeon E5, so the memory bandwidth will result in a higher overhead and a slightly lower percentage of peak FP utilization.
  5. Xeon Phi has no L3 cache, so the total cache per core (32kB L1 + 512kB L2) is lower than that provided by the Xeon E5 (32kB L1 + 256kB L2 + 2.5MB L3, i.e., 1/8 of the 20 MB shared L3).
  6. Xeon Phi has higher local memory latency than the Xeon E5, which has some impact on sustained bandwidth (already considered), and results in additional stall cycles in the occasional case of a non-prefetchable cache miss that cannot be overlapped with other memory transfers.

None of these are “problems” — they are intrinsic to the technology required to obtain higher peak performance per chip and higher peak performance per unit power. (That is not to say that the implementation cannot be improved, but it is saying that any implementation using comparable design and fabrication technology can be expected to show some level of efficiency loss due to each of these factors.)

The combined result of all these factors is that the Xeon Phi (or any processor obtaining its peak performance using much more parallelism with lower-power, less complex processors) will typically deliver a lower percentage of peak on real applications than a state-of-the-art Xeon E5 processor. Again, this is not a “problem” — it is intrinsic to the technology. Every application will show different sensitivity to each of these specific factors, but few applications will be insensitive to all of them.

Similar issues apply to comparisons between the “efficiency” of GPUs vs state-of-the-art processors like the Xeon E5. These comparisons are not as uniformly applicable because the fundamental architecture of GPUs is quite different than that of traditional CPUs. For example, we have all seen the claims of 50x and 100x speedups on GPUs. In these cases the algorithm is typically a poor match to the microarchitecture of the traditional CPU and a reasonable match to the microarchitecture of the GPU. We don’t expect to see similar speedups on Xeon Phi because it is based on a traditional microprocessor architecture and shows similar performance characteristics.

On the other hand, something that we don’t typically see is the list of 0x speedups for algorithms that do not map well enough to the GPU to make the porting effort worthwhile. Xeon Phi is not better than Xeon E5 on all workloads, but because it is based on general-purpose microprocessor cores it will run any general-purpose workload. The same cannot be said of GPU-based coprocessors.

Of course these are all general considerations. Performing careful direct comparisons of real application performance will take some time, but it should be a lot of fun!

Posted in Computer Hardware | Comments Off on Some comments on the Xeon Phi coprocessor