John McCalpin's blog

Dr. Bandwidth explains all….

“Memory directories” in Intel processors

Posted by John D. McCalpin, Ph.D. on 28th August 2023

One of the (many) minimally documented features of recent Intel processor implementations is the “memory directory”.   This is used in multi-socket systems to reduce cache coherence traffic between sockets.

I have referred to this in various presentations as:

“A Memory Directory is one or more bits per cache line in DRAM that tell the processor whether another socket might have a dirty copy of the cache line.”

When asked for references to public information about the Intel’s memory directory implementation, I had to go back and find the various tiny bits of information scattered around different places.   That was boring, so I am posting my notes here so I can find them again — maybe others will find this useful as well….


Existence:

The clearest admission that the memory directory feature exists is in this “technical overview” presentation – in the section “Directory-Based Coherency” – https://www.intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-technical-overview.html. The language is not as precise as I would prefer, but there are actually quite a few interesting details in that section.

History of Implementations:

Intel has disclosed various details in some presentation at the Hot Chips series of conferences:

(I apologize in advance if the links are broken — it is hard to keep up with web site reorganizations — but the presentations should be relatively easy to find using web search services.)

Implicit Information:

Much of the detailed understanding of the memory directory implementation comes from studying the uncore performance counter events that make reference to memory directories.

As an example, there are direct references to memory directories in several sections of the Intel Uncore Performance Monitoring Guides for their various processor families.    These documents are linked from: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

The version that I have spent the most time with is for the 1st and 2nd generations of Intel Xeon Scalable Processors (“Skylake Xeon” and “Cascade Lake Xeon”):  “Intel® Xeon® Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual”, Intel document 336274.

Specific references in this manual include:

  • CHA Events “DIR_LOOKUP” and “DIR_UPDATE”
  • IMC Event
    • IMC_WRITES_COUNT includes: “NOTE: Directory bits are stored in memory. Remote socket RFOs will result in a directory update which, in turn, will cause a write command.”
    • Aside: it would be easy to be misled by this description.  Remote socket RFOs will result in a directory update but so will ordinary reads that return data in E state (the default for data reads that hit unshared lines).
  • M2M Events “DIRECTORY_HIT”, “DIRECTORY_LOOKUP”, “DIRECTORY_MISS”, “DIRECTORY_UPDATE”
    • These include information on directory states – very helpful for understanding the implementation.
  • Section 3.1.3 “Reference for M2M Packet Matching”, Table 3-9 “SMI3 Opcodes”, mentions the directory in the description of 10 of the 18 transaction types.

It is helpful to compare and contrast the documentation from all of the recent processor generations.  The specific items disclosed in each generation are not the same, and sometimes a generation will have a more verbose explanation that fills in gaps for the other generations.

Reading the documentation is seldom enough to understand the implementation — I typically have to create customized microbenchmarks to generate known counts of transactions of various types and then compare the measured performance counter values with the values I expected to generate.   When there are significant differences more thinking is required.  Thinking does not always help — sometimes the performance counter event is just broken, sometimes there is just not enough information disclosed to derive reasonable bounds on the implementation, and sometimes the implementation has strongly dynamic/adaptive behavior based on activity metrics that are not visible or not controllable using available HW features.

OEM Vendor Disclosures:

I have run across a few other bits of information scattered around the interwebs:

Encoding the Memory Directory bits:

In various presentations I have said that the “one or more” memory directory bits were “hidden” in the ECC bits.  This is easy to do – standard 64-bit DRAM interfaces with ECC provide 72 bits of data with every “64-bit” data transfer and generate 8 contiguous transfers per transaction.   64 bits of data requires 8 bits of ECC for SECDED for every transfer.  Computing the ECC over P transfers only requires log2(P) additional bits, while the ECC gives you 8*(P-1) extra bits.  Intel used this aggregation approach very aggressively on the Knights Corner accelerator to store the SECDED ECC bits for 64 Byte cache lines in regular memory – reserving every 32ndcache line for that purpose.


Are there any other important references that I am missing?

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Comments Off on “Memory directories” in Intel processors

Mapping addresses to L3/CHA slices in Intel processors

Posted by John D. McCalpin, Ph.D. on 10th September 2021

Starting with the Xeon E5 processors “Sandy Bridge EP” in 2012, all of Intel’s mainstream multicore server processors have included a distributed L3 cache with distributed coherence processing. The L3 cache is divided into “slices”, which are distributed around the chip — typically one “slice” for each processor core.
Each core’s L1 and L2 caches are local and private, but outside the L2 cache addresses are distributed in a random-looking way across the L3 slices all over the chip.

As an easy case, for the Xeon Gold 6142 processor (1st generation Xeon Scalable Processor with 16 cores and 16 L3 slices), every aligned group of 16 cache line addresses is mapped so that one of those 16 cache lines is assigned to each of the 16 L3 slices, using an undocumented permutation generator. The total number of possible permutations of the L3 slice numbers [0…15] is 16! (almost 21 trillion), but measurements on the hardware show that only 16 unique permutations are actually used. The observed sequences are the “binary permutations” of the sequence [0,1,2,3,…,15]. The “binary permutation” operator can be described in several different ways, but the structure is simple:

  • binary permutation “0” of a sequence is just the original sequence
  • binary permutation “1” of a sequence swaps elements in each even/odd pair, e.g.
    • Binary permutation 1 of [0,1,2,3,4,5,6,7] is [1,0,3,2,5,4,7,6]
  • binary permutation “2” of a sequence swaps pairs of elements in each set of four, e.g.,
    • Binary permutation 2 of [0,1,2,3,4,5,6,7] is [2,3,0,1,6,7,4,5]
  • binary permutation “3” of a sequence performs both permutation “1” and permutation “2”
  • binary permutation “4” of a sequence swaps 4-element halves of each set of 8 element, e.g.,
    • Binary permutation 4 of [0,1,2,3,4,5,6,7] is [4,5,6,7,0,1,2,3]

Binary permutation operators are very cheap to implement in hardware, but are limited to sequence lengths that are a power of 2.  When the number of slices is not a power of 2, using binary permutations requires that you create a power-of-2-length sequence that is bigger than the number of slices, and which contains each slice number approximately an equal number of times.  As an example, the Xeon Scalable Processors (gen1 and gen2) with 24 L3 slices use a 512 element sequence that contains each of the values 0…16 21 times each and each of the values 17…23 22 times each.   This almost-uniform “base sequence” is then permuted using the 512 binary permutations that are possible for a 512-element sequence.

Intel does not publish the length of the base sequences, the values in the base sequences, or the formulas used to determine which binary permutation of the base sequence will be used for any particular address in memory.

Over the years, a few folks have investigated the properties of these mappings, and have published a small number of results — typically for smaller L3 slice counts.

Today I am happy to announce the availability of the full base sequences and the binary permutation selector equations for many Intel processors.  The set of systems includes:

  • Xeon Scalable Processors (gen1 “Skylake Xeon” and gen2 “Cascade Lake Xeon”) with 14, 16, 18, 20, 22, 24, 26, 28 L3 slices
  • Xeon Scalable Processors (gen3 “Ice Lake Xeon”) with 28 L3 slices
  • Xeon Phi x200 Processors (“Knights Landing”) with 38 Snoop Filter slices

The results for the Xeon Scalable Processors (all generations) are based on my own measurements.  The results for the Xeon Phi x200 are based on the mappings published by Kommrusch, et al. (e.g., https://arxiv.org/abs/2011.05422), but re-interpreted in terms of the “base sequence” plus “binary permutation” model used for the other processors.

The technical report and data files (base sequences and permutation selector masks for each processor) are available at https://hdl.handle.net/2152/87595

Have fun!

Next up — using these address to L3/CHA mapping results to understand observed L3 and Snoop Filter conflicts in these processors….

 

Posted in Cache Coherence Implementations, Computer Architecture | Comments Off on Mapping addresses to L3/CHA slices in Intel processors

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Posted by John D. McCalpin, Ph.D. on 22nd January 2018

A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

Introduction:

In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report (https://colfaxresearch.com/skl-avx512/) to the Xeon Phi x200 (Knights Landing) processors here at TACC.   There was no deep goal — just a desire to see the maximum GFLOPS in action.     The exercise seemed simple enough — just fix one item in the Colfax code and we should be finished.   Instead, we found puzzle after puzzle.  After almost four weeks, we have a solid characterization of the behavior — no tested code exceeds an execution rate of 12 vector pipe instructions every 7 cycles (6/7 of the nominal peak) when executed on a single core — but we are unable to propose a testable quantitative model for the source of the throughput limitation.

Dr. Damon McDougall gave a short presentation on this study at the IXPUG 2018 Fall Conference (pdf) — I originally wrote these notes to help organize my thoughts as we were preparing the IXPUG presentation, and later decided that the extra details contained here are interesting enough for me to post it.

 

Background:

The Xeon Phi x200 (Knights Landing) processor is Intel’s second-generation many-core product.  The Xeon Phi 7250 processors at TACC have 68 cores per processor, and each core has two 512-bit SIMD vector pipelines.   For 64-bit floating-point data, the 512-bit Fused Multiply-Add (FMA) instructions performs 16 floating-point operations (8 adds and 8 multiplies).  Each of the two vector units can issue one FMA instruction per cycle, assuming that there are enough independent accumulators to tolerate the 6-cycle dependent-operation latency.  The minimum number of independent accumulators required is: 2 VPUs times 6 cycles = 12 independent accumulators.

The Xeon Phi x200 has six execution units (two VPUs, two ALUs, and two Memory units), but is limited to two instructions per cycle by the decode, allocation, and retirement sections of the processor pipeline. (Most of the details of the Xeon Phi x200 series presented here are from the Intel-authored paper http://publications.computer.org/micro/2016/07/09/knights-landing-second-generation-intel-xeon-phi-product/.)

In our initial evaluation of the Xeon Phi x200, we did not fully appreciate the two-instruction-per-cycle limitation.  Since “peak performance” for the processor is two (512-bit SIMD) FMA instructions per cycle, any instructions that are not FMA instructions subtract directly from the available peak performance.  On “mainstream” Xeon processors, there is plenty of instruction decode/allocation/retirement bandwidth to overlap extra instructions with the SIMD FMA instructions, so we don’t usually even think about them.  Pointer arithmetic, loop index increments, loop trip count comparisons, and conditional branches are all essentially “free” on mainstream Xeon processors, but have to be considered very carefully on the Xeon Phi x200.

A “best case” scenario: DGEMM

The double-precision matrix multiplication routine “DGEMM” is typically the computational kernel that achieves the highest fraction of peak performance on high performance computing systems.  Hardware performance counter results for a simple benchmark code calling Intel’s optimized DGEMM implementation for this processor (from the Intel MKL library) show that about 20% of the dynamic instruction count consists of instructions that are not packed SIMD operations (i.e., not FMAs).  These “non-FMA” instructions include the pointer manipulation and loop control required by any loop-based code, plus explicit loads from memory and a smaller number of stores back to memory. (These are in addition to the loads that can be “piggy-backed” onto FMA instructions using the single memory input operand available for most computational operations in the Intel instruction set).  DGEMM implementations also typically require software prefetches to be interspersed with the computation to minimize memory stalls when moving from one “block” of the computation to the next.

Published DGEMM benchmark results for the Xeon Phi 7250 processor (https://software.intel.com/en-us/mkl/features/benchmarks) show maximum values of about 2100 GFLOPS when using all 68 cores (a very approximate estimate from a bar chart). Tests on one TACC development node gave slightly higher results — 2148 GFLOPS to 2254 GFLOPS (average = 2235 GFLOPS), for a set of 180 trials of a DGEMM test with M=N=K=8000 and using all 68 cores.   These runs reported a stable average frequency of 1.495 GHz, so the average of 2235 GFLOPS therefore corresponds to 68.7% of the peak performance of (68 cores * 32 FP ops/cycle/core * 1.495 GHz =) 3253 GFLOPS (note1). This is an uninspiring fraction of peak performance that would normally suggest significant inefficiencies in either the hardware or software.   In this case, however, the average of 2235 GFLOPS is more appropriately interpreted as 85.9% of the “adjusted peak performance” of 2602 GFLOPS (80% of the raw peak value — as limited by the instruction mix of the DGEMM kernel).    At 85.9% of the “adjusted peak performance”, there is no longer a significant upside to performance tuning.

Notes on DGEMM:

  1. For recent processors with power-limited frequencies, compute-intensive kernels will experience an average frequency that is a function of the characteristics of the specific processor die and of the effectiveness of the cooling system at the time of the run.  Other nodes will show lower average frequencies due to power/cooling limitations, so the numerical values must be adjusted accordingly — the percentage of peak obtained should be unchanged.
  2. It is possible to get higher values by disabling the L2 Hardware Prefetchers — up to about 2329 GFLOPS (89% of “adjusted peak”) — but that is a longer story for another day….
  3. The DGEMM efficiency values are not significantly limited by the use of all cores.  Single-core testing with the same DGEMM routine showed maximum values of just under 72% of the nominal peak (about 90% of “adjusted peak”).

Please Note: The throughput limitation we observed (12 vector instructions per 7 cycles = 85.7% of nominal peak) is significantly higher than the instruction-issue-limited vector throughput of the best DGEMM measurement we have ever observed (~73% of peak, or approximately 10 vector instructions every 7 cycles).   We are unaware of any real computational kernels whose optimal implementation will contain significantly fewer than 15% non-vector-pipe instructions, so the throughput limitation we observe is unlikely to be a significant performance limiter on any real scientific codes.  This note is therefore not intended as a criticism of the Xeon Phi x200 implementation — it is intended to document our exploration of the characteristics of this performance limitation.

Initial Experiments:

In order to approach the peak performance of the processor, we started with a slightly modified version of the code from the Colfax report above.  This code is entirely synthetic — it performs repeated FMA operations on a set of registers with no memory references in the inner loop.  The only non-FMA instructions are those required for loop control, and the number of FMA operations in the loop can be easily adjusted to change the fraction of “overhead” instructions.  The throughput limitation can be observed on a single core, so the following tests and analysis will be limited to this case.

Using the minimum number of accumulator registers needed to tolerate the pipeline latency (12), the assembly code for the inner loop is:

..B1.8: 
 addl $1, %eax
 vfmadd213pd %zmm16, %zmm17, %zmm29 
 vfmadd213pd %zmm16, %zmm17, %zmm28
 vfmadd213pd %zmm16, %zmm17, %zmm27 
 vfmadd213pd %zmm16, %zmm17, %zmm26 
 vfmadd213pd %zmm16, %zmm17, %zmm25 
 vfmadd213pd %zmm16, %zmm17, %zmm24 
 vfmadd213pd %zmm16, %zmm17, %zmm23 
 vfmadd213pd %zmm16, %zmm17, %zmm22 
 vfmadd213pd %zmm16, %zmm17, %zmm21 
 vfmadd213pd %zmm16, %zmm17, %zmm20 
 vfmadd213pd %zmm16, %zmm17, %zmm19
 vfmadd213pd %zmm16, %zmm17, %zmm18 
 cmpl $1000000000, %eax 
 jb ..B1.8

This loop contains 12 independent 512-bit FMA instructions and is executed 1 billion times.   Timers and hardware performance counters are measured immediately outside the loop, where their overhead is negligible.   Vector registers zmm18-zmm29 are the accumulators, while vector registers zmm16 and zmm17 are loop-invariant.

The loop has 15 instructions, so must require a minimum of 7.5 cycles to issue.  The three loop control instructions take 2 cycles (instead of 1.5) when measured in isolation.  When combined with other instructions, the loop control instructions require 1.5 cycles when combined with an odd number of additional instructions or 2.0 cycles in combination with an even number of additional instructions — i.e., in the absence of other stalls, the conditional branch causes the loop cycle count to round up to an integer value.  Equivalent sequences of two instructions that avoid the explicit compare instruction (e.g., pre-loading %eax with 1 billion and subtracting 1 each iteration) have either 1.0-cycle or 1.5-cycle overhead depending on the number of additional instructions (again rounding up to the nearest even cycle count).   The 12 FMA instructions are expected to require 6 cycles to issue, for a total of 8 cycles per loop iteration, or 8 billion cycles in total.   Experiments showed a highly repeatable 8.05 billion cycle execution time, with the 0.6% extra cycles almost exactly accounted for by the overhead of OS scheduler interrupts (1000 per second on this CentOS 7.4 kernel).   Note that 12 FMAs in 8 cycles is only 75% of peak, but the discrepancy here can be entirely attributed to loop overhead.

Further unrolling of the loop decreases the number of “overhead” instructions, and we expected to see an asymptotic approach to peak as the loop length was increased.  We were disappointed.

The first set of experiments compared the cycle and instruction counts for the loop above with the results from unrolling loop two and four times.    The table below shows the expected and measured instruction counts and cycle counts.

KNL 12-accumulator FMA throughput

Unrolling FactorFMA instructions per unrolled loop iterationNon-FMA instructions per unrolled loop iterationTotal Instructions per unrolled loop iterationExpected instructions (B)Measured Instructions (B)Expected Cycles (B) Measured Cycles (B)Unexpected Cycles (B)Expected %Peak GFLOPSMeasured %Peak GFLOPS% Performance shortfall
1123151515.01568.08.0560.05675.0%74.48%0.70%
22432713.513.51377.07.0860.08685.71%84.67%1.22%
44835112.7512.76376.57.0850.58592.31%84.69%8.26%
Comparison of expected and observed cycle counts for loops with 12 independent accumulators updated by 512-bit VFMADD213PD instructions on an Intel Xeon Phi 7250 processor. The loop is repeated until 12 billion FMA instructions have been executed.

 

Notes on methodology:

  • The unrolling was controlled by a “#pragma unroll_and_jam()” directive in the source code.   In each case the assembly code was carefully examined to ensure that the loop structure matched expectations — 12,24,48 FMAs with the appropriate ordering (for dependencies) and the same 3 loop control instructions (but with the iteration count reduced proportionately for the unrolled cases).
  • The node was allocated privately, non-essential daemons were disabled, and the test thread was bound to a single logical processor.
  • Instruction counts were obtained inline using the RDPMC instruction to read Fixed-Function Counter 0 (INST_RETIRED.ANY), while cycle counts were obtained using the RDPMC instruction to read Fixed-Function Counter 1 (CPU_CLK_UNHALTED.THREAD).
  • Execution time was greater than 4 seconds in all cases, so the overhead of reading the counters was at least 7 orders of magnitude smaller than the execution time.
  • Each test was run at least three times, and the trial with the lowest cycle count was used for the analysis in the table.

Comments on results:

  • The 12-FMA loop required 0.7% more cycles than expected.
    • Later experiments show that this overhead is essentially identical to the the fraction of cycles spent servicing the 1-millisecond OS scheduler interrupt.
  • The 24-FMA loop required 1.2% more cycles than expected.
    • About half of these extra cycles can be explained by the OS overhead, leaving an unexplained overhead in the 0.5%-0.6% range (not large enough to worry about).
  • The 48-FMA loop required 8.3% more cycles than expected.
    • Cycle count variations across trials varied by no more than 1 part in 4000, making this overhead at least 300 times the run-to-run variability.
  • The two unrolled cases gave performance results that appear to be bounded above by 6/7 (85.71%) of peak.

 

Initial (Incorrect) Hypothesis

My immediate response to the results was that this was a simple matter of running out of rename registers.   Modern processors (almost) all have more physical registers than they have register names.  The hardware automatically renames registers to avoid false dependencies, but with deep execution pipelines (particularly for floating-point operations), it is possible to run out of rename registers before reaching full performance.

This is easily illustrated using Little’s Law from queuing theory, which can be expressed as:

Throughput = Concurrency / Occupancy

For this scenario, “Throughput” has units of register allocations per cycle, “Concurrency” is the number of registers in use in support of all of the active instructions, and “Occupancy” is the average number of cycles that a register is busy during the execution of an instruction.

An illustrative example:   

The IBM POWER4 has 72 floating-point rename registers and two floating-point arithmetic units capable of executing fused multiply-add (FMA) instructions (a = b+c*d).   Each FMA instruction requires four registers, and these registers are all held for some number of cycles (discussed below), so full performance (both FMA units starting new operations every cycle) would require eight registers to be allocated each cycle (and for these registers to remain occupied for the duration of the corresponding instruction).   We can estimate the duration by reviewing the execution pipeline diagram (Figure 2-3) in The POWER4 Processor Introduction and Tuning Guide.  The exact details of when registers are allocated and freed is not published, but if we assume that registers are allocated in the “MP” stage (“Mapping”) and held until the “CP” (“Completion”, aka “retirement”) stage, then the registers will be held for a total of 12 cycles.  The corresponding pipeline stages from Figure 2-3 are: MP, ISS, RF, F1, F2, F3, F4, F5, F6, WB, Xfer, CP.  

Restating this in terms of Little’s Law, the peak performance of 2 FMAs per cycle corresponds to a “Throughput” of 8 registers allocated per cycle.  With an “Occupancy” of 12 cycles for each of those registers, the required “Concurrency” is 8*12 = 96 registers.  But, as noted above, the POWER4 only has 72 floating-point rename registers.  If we assume a maximum “Concurrency” of 72 registers, the “Throughput” can be computed as 72/12 = 6 registers per cycle, or 75% of the target throughput of 8 registers allocated per cycle.    It is perhaps not a coincidence that the maximum performance I ever saw on DGEMM on a POWER4 system (while working for IBM in the POWER4 design team) was just under 70% of 2 FMAs/cycle, or just over 92% of the occupancy-limited throughput of 1.5 FMAs/cycle.  


For comparison, the IBM POWER5 processor (similar to POWER4, but with 120 floating-point rename registers) delivered up to 94% of 2 FMAs/cycle on DGEMM, suggesting that a DGEMM efficiency in the 90%-95% of peak range is appropriate for DGEMM on this architecture family.

Applying this model to Xeon Phi x200 is slightly more difficult for a number of reasons, but back-of-the-envelope estimates suggested that it was plausible.

The usual way of demonstrating that rename register occupancy is limiting performance is to change the instructions to reduce the number of registers used per instruction, or the number of cycles that the instructions hold the register, or both.  If this reduces the required concurrency to less than the number of available rename registers, full performance should be obtained.

Several quick tests with instructions using fewer registers (e.g., simple addition instead of FMA) or with fewer registers and shorter pipeline latency (e.g, bitwise XOR) showed no change in throughput — the processor still delivered a maximum throughput of 12 vector instructions every 7 cycles.

Our curiosity was piqued by these results, and more experiments followed.   These piqued us even more, eventually leading to a suite of several hundred experiments in which we varied everything that we could figure out how to vary.

We will spare the reader the chronological details, and instead provide a brief overview of the scope of the testing that followed.

Extended Experiments:

Additional experiments (each performed with multiple degrees of unrolling) that showed no change in the limitation of 12 vector instructions per 7 cycles included:

  1. Increasing the dependency latency from 6 cycles to 8 cycles (i.e., using 16 independent vector accumulators) and extending the unrolling to up to 128 FMAs per inner loop iteration.
  2. Increasing the dependency latency to 10 cycles (20 independent vector accumulators), with unrolling to test 20, 40, 60, 80 FMAs per inner loop iteration.
  3. Increasing the dependency latency to 12 cycles (24 independent vector accumulators).
  4. Replacing the 512-bit VFMADD213PD instructions with the scalar version VFMADD213SD.  (This is the AVX-512 EVEX-encoded instruction, not the VEX-encoded version.)
  5. Replacing the 512-bit VFMADD213PD instructions with the AVX2 (VEX-encoded) 256-bit versions.
  6. Increasing the number of loop-invariant registers used from 2 to 4 to 8 (and ensuring that consecutive instructions used different pairs of loop-invariant registers).
  7. Decreasing the number of loop-invariant registers per FMA from 2 to 1, drawing the other input from the output of an FMA instruction at least 12 instructions (6 cycles) away.
  8. Replacing the VFMADD213PD instructions with shorter-latency instructions (VPADDQ and VPXORQ were tested independently).
  9. Replacing the VFMADD213PD instructions with an instruction that has both shorter latency and fewer operands: VPABSQ (which has only 1 input and 1 output register).
  10. Replacing every other VFMADD213PD instruction with a shorter-latency instruction (VPXORQ).
  11. Replacing the three-instruction loop control (add, compare, branch) with two-instruction loop control (subtract, branch).  The three-instruction version counts up from zero, then compares to the iteration count to set the condition code for the terminal conditional branch.  The two-instruction version counts down from the target iteration count to zero, allowing us to use the condition code from the subtract (i.e., not zero) as the branch condition, so no compare instruction is required.  The instruction counts changed as expected, but the cycle counts did not.
  12. Forcing 16-Byte alignment for the branch target at the beginning of the inner loop. (The compiler did this automatically in some cases but not in others — we saw no difference in cycle counts when we forced it to occur).
  13. Many (not all) of the executable files were disassembled with “objump -d” to ensure that the encoding of the instructions did not exceed the limit of 3 prefixes or 8 Bytes per instruction.  We saw no cases where either of these rules were violated in the inner loops.

Additional experiments showed that the throughput limitation only applies to instructions that execute in the vector pipes:

  1. Replacing the Vector instructions with integer ALU instructions (ADDL) –> performance approached two instructions per cycle asymptotically, as expected.
  2. Replacing the Vector instructions with Load instructions from L1-resident data (to vector registers) –> performance approached two instructions per cycle asymptotically, as expected.

Some vector instructions can execute in only one of the two vector pipelines.  This is mentioned in the IEEE Micro paper linked above, but is discussed in more detail in Chapter 17 of the document “Intel 64 and IA-32 Architectures Optimization Reference Manual”, (Intel document 248966, revision 037, July 2017).  In addition, Agner Fog’s “Instruction Tables” document (http://www.agner.org/optimize/instruction_tables.pdf) shows which of the two vector units is used for a large subset of the instructions that can only execute in one of the VPUs.   This allows another set of experiments that show:

  •   Each vector pipeline can sustain its full rate of 1 instruction per cycle when used in isolation.
    • VPU0 was tested with VPERMD, VPBROADCASTQ, and VPLZCNT.
    • VPU1 was tested with KORTESTW.
  • Alternating a VPU0 instruction (VPLZCNTQ) with a VPU1 instruction (KORTESTW) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
  • Alternating a VPU0 instruction with an FMA (that can be executed in either VPU) showed the same 12 instruction per 7 cycle throughput limitation as the original FMA case.
    • This was tested with VPERMD and VPLZCNT as the VPU0 instructions.
  • One specific combination of VPU0 and FMA instructions gave a reduced throughput of 1 vector instruction per cycle: VPBROADCASTQ alternating with FMA.
    • VPBROADCASTQ requires a read from a GPR (in the integer side of the core), then broadcasts the result across all the lanes of a vector register.
    • This operation is documented (in the Intel Optimization Reference Manual) as having a latency of 2 cycles and a maximum throughput of 1 per cycle (as we saw with VPBROADCASTQ running in isolation).
    • The GPR to VPU move is a sufficiently uncommon access pattern that it is not particularly surprising to find a case for which it inhibits parallelism across the VPUs, though it is unclear why this is the only case we found that allows the use of both vector pipelines but is still limited to 1 instruction per cycle.

Additional Performance Counter Measurements and Second-Order Effects:

After the first few dozens of experiments, the test codes were augmented with more hardware performance counters.  The full set of counters measured before and after the loop includes:

  • Time Stamp Counter (TSC)
  • Fixed-Function Counter 0 (Instructions Retired)
  • Fixed-Function Counter 1 (Core Cycles Not Halted)
  • Fixed-Function Counter 2 (Reference Cycles Not Halted)
  • Programmable PMC0
  • Programmable PMC1

The TSC was collected with the RDTSC instruction, while the other five counters were collected using the RDPMC instruction.  The total overhead for measuring these six counters is about 250 cycles, compared to a minimum of 4 billion cycles for the monitored loop.

Several programmable performance counter events were collected as “sanity checks”, with negligible counts (as expected):

  • FETCH_STALL.ICACHE_FILL_PENDING_CYCLES
  • MS_DECODED.MS_ENTRY
  • MACHINE_CLEARS.FP_ASSIST

Another programmable performance counter event was collected to verify that the correct number of VPU instructions were being executed:

  • UOPS_RETIRED.PACKED_SIMD
    • Typical result:
      • Nominal expected 16,000,000,000
      • Measured in user-space: 16,000,000,016
        • This event does not count the 16 loads before the inner loop, but does count the 16 stores after the end of the inner loop.
      • Measured in kernel-space: varied from 19,626 to 21,893.
        • Not sure why the kernel is doing packed SIMD instructions, but these are spread across more than 6 seconds of execution time (>6000 scheduler interrupts).
        • These kernel instruction counts are 6 orders of magnitude smaller than the counts for tested code, so they will be ignored here.

The performance counter events with the most interesting results were:

  • NO_ALLOC_CYCLES.RAT_STALL — counts the number of core cycles in which no micro-ops were allocated and the “RAT Stall” (reservation station full) signal is asserted.
  • NO_ALLOC_CYCLES.ROB_FULL — counts the number of core cycles in which no micro-ops were allocated and the Reorder Buffer (ROB) was full.
  • RS_FULL_STALL.ALL — counts the number of core cycles in which the allocation pipeline is stalled and any of the Reservation Stations is full
    • This should be the same as NO_ALLOC_CYCLES.RAT_STALL, and in all but one case the numbers were nearly identical.
    • The RS_FULL_STALL.ALL event includes a Umask of 0x1F — five bits set.
      • This is consistent with the IEEE Micro paper (linked above) that shows 2 VPU reservation stations, 2 integer reservation stations, and one memory unit reservation station.
      • The only other Umask defined in the Intel documentation is RS_FULL_STALL.MEC (“Memory Execution Cluster”) with a value of 0x01.
      • Directed testing with VPU0 and VPU1 instructions shows that a Umask of 0x08 corresponds to the reservation station for VPU0 and a Umask of 0x10 corresponds to the reservation station for VPU1.

For the all-FMA test cases that were expected to sustain more than 12 VPU instructions per 7 cycles, the NO_ALLOC_CYCLES.RAT_STALL and RS_FULL_STALL.ALL events were a near-perfect match for the number of extra cycles taken by the loop execution.  The values were slightly larger than computation of “extra cycles”, but were always consistent with the assumption of 1.5 cycles “overhead” for the three loop control instructions (matching the instruction issue limit), rather than the 2.0 cycles that I assumed as a baseline.  This is consistent with a NO_ALLOC_CYCLES.RAT_STALL count that overlaps with cycles that are simultaneously experiencing a branch-related pipeline bubble. One or the other should be counted as a stall, but not both.   For these cases, the NO_ALLOC_CYCLES.ROB_FULL counts were negligible.

Interestingly, the individual counts for RS_FULL_STALL for the two vector pipelines were sometimes of similar magnitude and sometimes uneven, but were extremely stable for repeated runs of a particular binary.  The relative counts for the stalls in the two vector pipelines can be modified by changing the code alignment and/or instruction ordering.  In limited tests, it was possible to make either VPU report more stalls than the other, but in every case, the “effective” stall count (VPU0 stalled OR VPU1 stalled) was the amount needed to reduce the throughput to 12 VPU instructions every 7 cycles.

When interleaving vector instructions of different latencies, the total number of stall cycles remained the same (i.e., enough to limit performance to 12 VPU instructions per 7 cycles), but these were split between RAT_STALLs and ROB_STALLs in different ways for different loop lengths.   Typical results showed a pattern like:

  • 16 VPU instructions per loop iteration: approximately zero stalls, as expected
  • 32 VPU instructions per loop iteration: approximately 6.7% RAT_STALLs and negligible ROB_STALLs
  • 64 VPU instructions per loop iteration: ~1% RAT_STALLs (vs ~10% in the all-FMA case) and about 9.9% ROB_STALLs (vs ~0% in the all-FMA case).
    • Execution time increased by about 0.6% relative to the all-FMA case.
  • 128 VPU instructions per loop iteration: negligible RAT_STALLS (vs ~12% in the all-FMA case) and almost 20% ROB_STALLS (vs 0% in the all-FMA case).
    • Execution time increased by 9%, to a value that is ~2.3% slower than the 16-VPU-instruction case.

The conversion of RAT_STALLs to ROB_STALLs when interleaving instructions of different latencies does not seem surprising.  RAT_STALLs occur when instructions are backed up before execution, while ROB_STALLs occur when instructions back up before retirement.  Alternating instructions of different latencies seems guaranteed to push the shorter-latency instructions from the RAT to the ROB until the ROB backs up.  The net slowdown at 128 VPU instructions per loop iteration is not a performance concern, since asymptotic performance is available with anywhere between 24 and (almost) 64 VPU instructions in the inner loop.   These results are included because they might provide some insight into the nature of the mechanisms that limits throughput of vector instructions.

Mechanisms:

RAT_STALLs count the number of cycles in which the Allocate/Rename unit does not dispatch any micro-ops because a target Reservation Station is full.   While this does not directly equate to execution stalls (i.e., no instructions dispatched from the Vector Reservation Station to the corresponding Vector Execution Pipe), the only way the Reservation Station can become full (given an instruction stream with enough independent instructions) is the occurrence of cycles in which instructions are received (from the Allocate/Rename unit), but in which no instruction can be dispatched.    If this occurs repeatedly, the 20-entry Reservation Station will become full, and the RAT_STALL signal will be asserted to prevent the Allocate/Rename unit from sending more micro-ops.

An example code that generates RAT Stalls is a modification of the test code using too few independent accumulators to fully tolerate the pipeline latency.  For example, using 10 accumulators, the code can only tolerate 5 cycles of the 6 cycle latency of the FMA operations.  This inhibits the execution of the FMAs, which fill up the Reservation Station and back up to stall the Allocate/Rename.   Tests with 10..80 FMAs per inner loop iteration show RAT_STALL counts that match the dependency stall cycles that are not overlapped with loop control stall cycles.

We know from the single-VPU tests that the 20-entry Reservation Station for each Vector pipeline is big enough for that pipeline’s operation — no stall cycles were observed.  Therefore the stalls that prevent execution dispatch must be in the shared resources further down the pipeline.   From the IEEE Micro paper, the first execution step is to read the input values from the “rename buffer and the register file”, after which the operations proceed down their assigned vector pipeline.  The vector pipelines should be fully independent until the final execution step in which they write their output values to the rename buffer.  After this, the micro-ops will wait in the Reorder Buffer until they can be retired in program order.  If the bottleneck was in the retirement step, then I would expect the stalls to be attributed to the ROB, not the RAT.   Since the stalls in the most common cases are overwhelming RAT stalls, I conclude that the congestion is not *directly* related to instruction retirement.

As mentioned above, the predominance of RAT stalls suggests that limitations at Retirement cannot be directly responsible for the throughput limitation, but there may be an indirect mechanism at work.   The IEEE Micro paper’s section on the Allocation Unit says:

“The rename buffer stores the results of the in-flight micro-ops until they retire, at which point the results are transferred to the architectural register file.”

This comment is consistent with Figure 3 of the paper and with the comment that vector instructions read their input arguments from the “rename buffer and the register file”, implying that the rename buffer and register file are separate register arrays.  In many processor implementations there is a single “physical register” array, with the architectural registers being the subset of the physical registers that are pointed to by a mapping vector.  The mapping vector is updated every time instructions retire, but the contents of the registers do not need to be copied from one place to another.  The description of the Knights Landing implementation suggests that at retirement, results are read from the “rename buffer” and written to the “register file”.  This increases the number of ports required, since this must happen every cycle in parallel with the first step of each of the vector execution pipelines.  It seems entirely plausible that such a design could include a systematic conflict (a “structural hazard”) between the accesses needed by the execution pipes and the accesses needed by the retirement unit.  If this conflict is resolved in favor of the retirement unit, then execution would be stalled, the Reservation Stations would fill up, and the observed behavior could be generated.   If such a conflict exists, it is clearly independent of the number of input arguments (since instructions with 1, 2, and 3 input arguments have the same behavior), leaving the single output argument as the only common feature.  If such a conflict exists, it must almost certainly also be systematic — occurring independent of alignment, timing, or functional unit details — otherwise it seems likely that we would have seen at least one case in the hundreds of tests here that violates the 12/7 throughput limit.

Tests using a variant of the test code with much smaller loops (varying between 160 and 24,000 FMAs per measurement interval, repeated 100,000 times) also strongly support the 12/7 throughput limit.  In every case the minimum cycle count over the 100,000 iterations  was consistent with 12 VPU instructions every 7 cycles (plus measurement overhead).

 

Summary:

The Intel Xeon Phi x200 (Knights Landing) appears to have a systematic throughput limit of 12 Vector Pipe instructions per 7 cycles — 6/7 of the nominal peak performance.  This throughput limitation is not displayed by the integer functional units or the memory units.  Due to the two-instruction-per-cycle limitations of allocate/rename/retire, this performance limit in the vector units is not expected to have an impact on “real” codes.   A wide variety of tests were performed to attempt to develop quantitative models that might account for this limitation, but none matched the specifics of the observed timing and performance counts.

Postscript:

After Damon McDougall’s presentation at the IXPUG 2018 Fall Conference, we talked to a number of Intel engineers who were familiar with this issue.  Unfortunately, we did not get a clear indication of whether their comments were covered by non-disclosure agreements, so if they gave us an explanation, I can’t repeat it….

Posted in Computer Hardware, Performance, Performance Counters | Comments Off on A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)

SC16 Invited Talk: Memory Bandwidth and System Balance in HPC Systems

Posted by John D. McCalpin, Ph.D. on 22nd November 2016

Slide01

I have been involved in HPC for over 30 years:

  • 12 years as student & faculty user in ocean modeling,
  • 12 years as a performance analyst and system architect at SGI, IBM, and AMD, and
  • over 7 years as a research scientist at TACC.

Slide02

Slide03
  • This history is based on my own study of the market, with many of the specific details from my own re-analysis of the systems in the TOP500 lists.

Slide04
  • Vector systems were in decline by the time the first TOP500 list was collected in 1993, but still dominated the large systems space in the early 1990’s.
    • The large bump in Rmax in 2002 was due to the introduction of the “Earth Simulator” in Japan.
    • The last vector system (2nd gen Earth Simulator) fell off the list in 2014.
  • RISC SMPs and Clusters dominated the installed base in the second half of the 1990’s and the first few years of the 2000’s.
    • The large bump in Rmax in 2011 is the “K Machine” in Japan, based on a Fujitsu SPARC processor.
    • The “RISC era” was very dynamic, seeing the rapid rise and fall of 6-7 different architectures in about a decade.
    • In alphabetical order the major processor architectures were: Alpha, IA-64, i860, MIPS, PA-RISC, PowerPC, POWER, SPARC.
  • x86-based systems rapidly replaced RISC systems starting in around 2003.
    • The first big x86 system on the list was ASCI Red in 1996.
    • The large increase in x86 systems in 2003-2004 was due to several large systems in the top 10, rather than due to a single huge system.
    • The earliest of these systems were 32-bit Intel processors.
    • The growth of the x86 contribution was strongly enhanced by the introduction of the AMD x86-64 processors in 2004, with AMD contributing about 40% of the x86 Rmax by the end of 2006.
    • Intel 64-bit systems replaced 32-bit processors rapidly once they become available.
    • AMD’s share of the x86 Rmax dropped rapidly after 2011, and in the November 2016 list has fallen to about 1% of the Intel x86 Rmax.
  • My definition of “MPP” differs from Dongarra’s and is based on how the development of the most expensive part of the system (usually the processor) was funded.
    • Since 2005 almost all of the MPP’s in this chart have been IBM Blue Gene systems.
    • The big exception is the new #1 system, the Sunway Taihulight system in China.
  • Accelerated systems made their appearance in 2008 with the “RoadRunner” system at LANL.
    • “RoadRunner” was the only significant system using the IBM Cell processor.
    • Subsequent accelerated systems have almost all used NVIDIA GPUs or Intel Xeon Phi processors.
      • The NVIDIA GPUs took their big jump in 2010 with the introduction of the #2-ranked “Nebulae” (Dawning TC3600 Blade System) system in China (4640 GPUS), then took another boost in late 2012 with the upgrade of ORNL’s Jaguar to Titan (>18000 GPUs).
      • The Xeon Phi contribution is dominated by the immensely large Tianhe-2 system in China (48000 coprocessors), and the Stampede system at TACC (6880 coprocessors).
    • Note the rapid growth, then contraction, of the accelerated systems Rmax.
      • More on this topic below in the discussion of “clusters of clusters”.

Slide05
  • Obviously a high-level summary, but backed by a large amount of (somewhat fuzzy) data over the years.
  • With x86, we get (slowly) decreasing price per “core”, but it is not obvious that we will get another major technology replacement soon.
  • The embedded and mobile markets are larger than the x86 server/PC/laptop market, but there are important differences in the technologies and market requirements that make such a transition challenging.
  • One challenge is the increasing degree of parallelism required — think about how much parallelism an individual researcher can “own”.

Slide06
  • Systems on the TOP500 list are almost always shared, typically by scores or hundreds of users.
  • So up to a few hundred users, you don’t need to run parallel applications – your share is less than 1 ”processor”.
  • Beyond a few thousand cores, a user’s allocation will typically be multiple core-years per year, so parallelism is required.
  • A fraction of users can get away with single-node parallelism (perhaps with independent jobs running concurrently on multiple nodes), but the majority of users will need multi-node parallel implementations for turnaround, for memory capacity, or simply for throughput.

Slide07

Instead of building large homogeneous systems, many sites have recognized the benefit of specialization – a type of HW/SW “co-configuration”.
These configurations are easiest when the application profile is stable and well-known. It is much more challenging for a general-purpose site such as TACC.


Slide08

Slide09
  • This aside introduces the STREAM benchmark, which is what got me thinking about “balance” 25 years ago.
  • I have never visited the University of Virginia, but had colleagues there who agreed that STREAM should stay in academia when I moved to industry in 1996, and offered to host my guest account.

Slide10
  • Note that the output of each kernel is used as an input to the next.
    • The earliest versions of STREAM did not have this property and some compilers removed whole loops whose output was not used.
    • Fortunately it is easy to identify cases where this happens so that workarounds can be applied.
    • Another way to say this is that STREAM is resistant to undetected over-optimization.
  • OpenMP directives were added in 1996 or 1997.
  • STREAM in C was made fully 64-bit capable in 2013.
    • The validation code was also fixed to eliminate a problem with round-off error that used to occur for very large array sizes.
  • Output print formats keep getting wider as systems get faster.

Slide11
  • STREAM measures time, not bandwidth, so I have to make assumptions about how much data is moved to and from memory.
    • For the Copy kernel, there are actually three different conventions for how many bytes of traffic to count!
    • I count the reads that I asked for and the writes that I asked for.
    • If the processor requires “write allocates” the maximum STREAM bandwidth will be lower than the peak DRAM bandwidth.
      • The Copy and Scale kernels require 3/2 as much bandwidth as STREAM gives credit for if write allocates are included.
      • The Add and Triad kernels require 4/3 as much bandwidth as STREAM gives credit for if write allocates are included.
  • One weakness of STREAM is that all four kernels are in “store miss” form – none of the arrays are read before being written in a kernel.
    • A counter-example is the DAXPY kernel: A[i] = A[i] + scalar*B[i], for which the store hits in the cache because A[i] was already loaded as an input to the addition operation.
    • Non-allocating/non-temporal/streaming stores are typically required for best performance with the existing STREAM kernels, but these are not supported by all architectures or by all compilers.
      • For example, GCC will not generate streaming stores.
    • In my own work I typically supplement STREAM with “read-only” code (built around DDOT), a standard DAXPY (updating one of the two inputs), and sometimes add a “write-only” benchmark.

Slide12

Back to the main topic….

For performance modeling, I try to find a small number of “performance axes” that are capable of accounting for most of the execution time.


Slide13
  • Using the same performance axes as on the previous slide….
  • All balances are shifting to make data motion relatively more expensive than arithmetic.

Slide14
  • The first “Balance Ratio” to look at is (FP rate) / (Memory BW rate).
    • This is the cost per (64-bit) “word” loaded relative to the cost of a (peak) 64-bit FP operation, and applies to long streaming accesses (for which latency can be overlapped).
    • I refer to this metric as the “STREAM Balance” or “Machine Balance” or “System Balance”.
  • The data points here are from a set of real systems.
    • The systems I chose were both commercially successful and had very good memory subsystem performance relative to their competitors.
      • ~1990: IBM RISC System 6000 Model 320 (IBM POWER processor)
      • ~1993: IBM RISC System 6000 Model 590 (IBM POWER processor)
      • ~1996: SGI Origin2000 (MIPS R10000 processor)
      • ~1999: DEC AlphaServer DS20 (DEC Alpha processor)
      • ~2005: AMD Opteron 244 (single-core, DDR1 memory)
      • ~2006: AMD Opteron 275 (dual-core, DDR1 memory)
      • ~2008: AMD Opteron 2352 (dual-core, DDR2 memory)
      • ~2011: Intel Xeon X5680 (6-core Westmere EP)
      • ~2012: Intel Xeon E5 (8-core Sandy Bridge EP)
      • ~2013: Intel Xeon E5 v2 (10-core Ivy Bridge EP)
      • ~2014: Intel Xeon E5 v3 (12-core Haswell EP)
      • (future: Intel Xeon E5 v5)
  • Because memory bandwidth is understood to be an important performance limiter, the processor vendors have not let it degrade too quickly, but more and more applications become bandwidth-limited as this value increases (especially with essentially fixed cache size per core).
    • Unfortunately every other metric is much worse….

Slide15
  • ERRATA: There is an error in the equation in the title — it should be “(GFLOPS/s)*(Memory Latency)”
  • Memory latency is becoming expensive much more rapidly than memory bandwidth!
    • Memory latency is dominated by the time required for cache coherence in most recent systems.
    • Slightly decreasing clock speeds with rapidly increasing core counts leads to slowly increasing memory latency – even with heroic increases in hardware complexity.
  • Memory latency is not a dominant factor in very many applications, but it was not negligible in 7 of the 17 SPECfp2006 codes using hardware from 2006, so it is likely to be of increasing concern.
    • More on this below — slide 38.
    • The principal way to combat the negative impact of memory latency is to make hardware prefetching more aggressive, which increases complexity and costs a significant amount of power.

Slide16
  • Interconnect bandwidth (again for large messages) is falling behind faster than local memory bandwidth – primarily because interconnect widths have not increased in the last decade (while most chips have doubled the width of the DRAM interfaces over that period).
  • Interconnect *latency* (not shown) is so high that it would ruin the scaling of even a log chart.  Don’t believe me?  OK…

Slide17
  • Interconnect latency is decreasing more slowly than per-core performance, and much more slowly than per-chip performance.
  • Increasing the problem size only partly offsets the increasing relative cost of interconnect latency.
    • The details depend on the scaling characteristics of your application and are left as an exercise….

Slide18
  • Early GPUs had better STREAM Balance (FLOPS/Word) because the double-precision FLOPS rate was low.   This is no longer the case.
  • In 2012, mainstream, manycore, and GPU had fairly similar balance parameters, with both manycore and GPGPU using GDDR5 to get enough bandwidth to stay reasonably balanced.  We expect mainstream to be *more tolerant* of low bandwidth due to large caches and GPUs to be *less tolerant* of low bandwidth due to the very small caches.
  • In 2016, mainstream processors have not been able to increase bandwidth proportionately (~3x increase in peak FLOPS and 1.5x increase in BW gives 2x increase in FLOPS/Word).
  • Both manycore and GPU have required two generations of non-standard memory (GDDR5 in 2012 and MCDRAM and HBM in 2016) to prevent the balance from degrading too far.
    • These rapid changes require more design cost which results in higher product cost.
Slide19

Slide20
  • This chart is based on representative Intel processor models available at the beginning of each calendar year – starting in 2003, when x86 jumped to 36% of the TOP500 Rmax.
    • The specific values are based on the median-priced 2-socket server processor in each time frame.
    • The frequencies presented are the “maximum Turbo frequency for all cores operational” for processors through Sandy Bridge/Ivy Bridge.
    • Starting with Haswell, the frequency presented is the power-limited frequency when running LINPACK (or similar) on all cores.
      • This causes a significant (~25%) drop in frequency relative to operation with less computationally intense workloads, but even without the power limitation the frequency trend is slightly downward (and is expected to continue to drop.
  • Columns shaded with hash marks are for future products (Broadwell EP is shipping now for the 2017 column).
    • Core counts and frequencies are my personal estimates based on expected technology scaling and don’t represent Intel disclosures about those products.
  • How do Intel’s ManyCore (Xeon Phi) processors compare?

Slide21
  • Comparing the components of the per-socket GFLOPS of the Intel ManyCore processors relative to the Xeon ”mainstream” processors at their introduction.
  • The delivered performance ratio is expected to be smaller than the peak performance ratio even in the best cases, but these ratios are large enough to be quite valuable even if only a portion of the speedup can be obtained.
  • The basic physics being applied here is based on several complementary principles:
    • Simple cores are smaller (so you can fit more per chip) and use less power (so you can power & cool more per chip).
    • Adding cores brings a linear increase in peak performance (assuming that the power can be supplied and the resulting heat dissipated).
    • For each core, reducing the operating frequency brings a greater-than-proportional reduction in power consumption.
  • These principles are taken even further in GPUs (with hundreds or thousands of very simple compute elements).

Slide22
  • The DIMM architecture for DRAMs has been great for flexibility, but terrible for bandwidth.
  • Modern serial link technology runs at 25 Gbs per lane, while the fastest DIMM-based DDR4 interfaces run at just under 1/10 of that data rate.

Slide23
  • In 1990, the original “Killer Micros” had a single level of cache.
  • Since about 2008, most x86 processors have had 3 levels of cache.
    • Every design has to consider the potential performance advantage of speculatively probing the next level of cache (before finding out if the request has “hit” in the current level of cache) against the power cost of performing all those extra lookups.
      • E.g., if the miss rate is typically 10%, then speculative probing will increase the cache tag lookup rate by 10x.
      • The extra lookups can actually reduce performance if they delay “real” transactions.
  • Asynchronous clock crossings are hard to implement with low latency.
    • A big, and under-appreciated, topic for another forum…
  • Intel Xeon processors evolution:
    • Monolithic L3: Nehalem/Westmere — 1 coherence protocol supported
    • Sliced L3 on one ring: Sandy Bridge/Ivy Bridge — 2/3 coherence protocols supported
    • Sliced L3 on two rings: Haswell/Broadwell — 3 coherence protocols supported

Slide24

Slide25
  • Hardware is capable of extremely low-latency, low-overhead, high-bandwidth data transfers (on-chip or between chips), but only in integrated systems.
  • Legacy IO architectures have orders of magnitude higher latency and overhead, and are only able to attain full bandwidth with very long messages.
  • Some SW requirements, such as MPI message tagging, have been introduced without adequate input from HW designers, and end up being very difficult to optimize in HW.
    • It may only take one incompatible “required” feature to prevent an efficient HW architecture from being used for communication.

Slide26

Slide27
  • Thanks to Burton Smith for the Little’s Law reference!
  • Before we jump into the numbers, I want to show an illustration that may make Little’s Law more intuitive….

Slide28
  • Simple graphical illustration of Little’s Law.
  • I had to pick an old processor with a small latency-BW product to make the picture small enough to be readable.
  • The first set of six loads is needed to keep the bus busy from 0 to 50 ns.
  • As each cache line arrives, another request needs to be sent out so that the memory will be busy 60 ns in the future.
  • The second set of six loads can re-use the same buffers, so only six buffers are needed

Slide29
  • In the mid-1990’s, processors were just transitioning from supporting a single outstanding cache miss to supporting 4 outstanding cache misses.
  • In 2005, a single core of a first generation AMD Opteron could manage 8 cache misses directly, but only needed 5-6 to saturate the DRAM interface.
  • By mid-2016, a Xeon E5 v4 (Broadwell) processor requires about 100 cache lines “in flight” to fully tolerate the memory latency.
    • Each core can only directly support 10 outstanding L1 Data Cache misses, but the L2 hardware prefetchers can provide additional concurrency.
    • It still requires 6 cores to get within 5% of asymptotic bandwidth, and the processor energy consumed is 6x to 10x the energy consumed in the DRAMs.
  • The “Mainstream” machines are
    • SGI Origin2000 (MIPS R10000)
    • DEC Alpha DS20 (DEC Alpha EV5)
    • AMD Opteron 244 (single-core, DDR1 memory)
    • AMD Opteron 275 (dual-core, DDR1 memory)
    • AMD Opteron 2352 (dual-core, DDR2 memory)
    • Intel Xeon X5680 (6-core Westmere EP)
    • Intel Xeon E5 (8-core Sandy Bridge EP)
    • Intel Xeon E5 v2 (10-core Ivy Bridge EP)
    • Intel Xeon E5 v3 (12-core Haswell EP)
    • Intel Xeon E5 v4 (14-core Broadwell EP)
    • (future: Intel Xeon E5 v5 — a plausible estimate for a future Intel Xeon processor, based on extrapolation of current technology trends.)
  • The GPU/Manycore machines are:
    • NVIDIA M2050
    • Intel Xeon Phi (KNC)
    • NVIDIA K20
    • NVIDIA K40
    • Intel Xeon Phi/KNL
    • NVIDIA P100 (Latency estimated).

Slide30

Slide31

Slide32

Slide33

Slide34
  • Power density matters, but systems remain so expensive that power cost is still a relatively small fraction of acquisition cost.
  • For example, a hypothetical exascale system that costs $100 million a year for power may not be not out of line because the purchase price of such a system would be the range of $2 billion.
  • Power/socket is unlikely to increase significantly because of the difficulty of managing both bulk cooling and hot spots.
  • Electrical cost is unlikely to increase by large factors in locations attached to power grids.
  • So the only way for power costs to become dominant is for the purchase price per socket to be reduced by something like an order of magnitude.
  • Needless to say, the companies that sell processors and servers at current prices have an incentive to take actions to keep prices at current levels (or higher).
  • Even the availability of much cheaper processors is not quite enough to make power cost a first-order concern, because if this were to happen, users would deliberately pay more money for more energy-efficient processors, just to keep the ongoing costs tolerable.
  • In this environment, budgetary/organizational/bureaucratic issues would play a major role in the market response to the technology changes.

Slide35
  • Client processors could reduce node prices by using higher-volume, lower-gross-margin parts, but this does not fundamentally change the technology issues.
    • 25%/year for power might be tolerable with minor adjustments to our organizational/budgetary processes.
    • (Especially since staff costs are typically comparable to system costs, so 25% of hardware purchase price per year might only be about 12% of the annual computing center budget for that system.)
  • Very low-cost parts (”embedded” or “DSP” processors) are in a different ballpark – lifetime power cost can exceed hardware acquisition cost.
  • So if we get cheaper processors, they must be more energy-efficient as well.
  • This means that we need to understand where the energy is being used and have an architecture that allows us to control high-energy operations.
    • Not enough time for that topic today, but there are some speculations in the bonus slides.

Slide36
  • For the purposes of this talk, my speculations focus on the “most likely” scenarios.
  • Alternatives approaches to maintaining the performance growth rate of systems are certainly possible, but there are many obstacles on those paths and it is difficult to have confidence that any will be commercially viable.

Slide37
  • Once an application becomes important enough to justify the allocation of millions of dollars per year of computing resources, it begins to make sense to consider configuring one or more supercomputers to be cost-effective for that application.
    • (This is fairly widespread in practice.)
  • If an application becomes important enough to justify the allocation of tens of millions of dollars per year of computing resources, it begins to make sense to consider designing one or more supercomputers to be cost-effective for that application.
    • (This has clearly failed many times in the past, but the current technology balances makes the approach look attractive again.)
  • Next we will look at examples of “application balance” from various application areas.

Slide38
  • CRITICAL!  Application characterization is a huge topic, but it is easy to find applications that vary by two orders of magnitude or more in requirements for peak FP rate, for pipelined memory accesses (bandwidth), for unexpected memory access (latency), for large-message interconnect bandwidth, and for short-message interconnect latency.
  • The workload on TACC’s systems covers the full range of node-level balances shown here (including at least a half-dozen of the specific codes listed).
  • TACC’s workload includes comparable ranges in requirements for various attributes of interconnect and filesystem IO performance.
  • This chart is based on a sensitivity-based performance model with additive performance components.
    • In 2006/2007 there were enough different configurations available in the SPEC benchmark database for me to perform this analysis using public data.
    • More recent SPEC results are less suitable for this data mining approach for several reasons (notably the use of autoparallelization and HyperThreading).
    • But the modeling approach is still in active use at TACC, and was the subject of my keynote talk at the
      2nd International Workshop on Performance Modeling: Methods and Applications (PMMA16)

Slide39
  • Catch-22 — the more you know about the application characteristics and the more choices you have for computing technology and configuration, the harder it is to come up with cost-effective solutions!

Slide40
  • It is very hard for vendors to back off of existing performance levels….
  • As long as purchase prices remain high enough to keep power costs at second-order, there will be incentive to continue making the fast performance axes as fast as possible.
  • Vendors will only have incentive to develop systems balanced to application-specific performance ratios if there is a large enough market that makes purchases based on optimizing cost/performance for particular application sub-sets.
  • Current processor and system offerings provide a modest degree of configurability of performance characteristics, but the small price range makes this level of configurability a relatively weak lever for overall system optimization.

Slide41
  • This is not the future that I want to see, but it is the future that seems most likely – based on technological, economic, and organizational/bureaucratic factors.
  • There is clearly a disconnect between systems that are increasingly optimized for dense, vectorized floating-point arithmetic and systems that are optimized for low-density “big data” application areas.
    • As long as system prices remain high, vendors will be able to deliver systems that have good performance in both areas.
    • If the market becomes competitive there will be incentives to build more targeted alternatives – but if the market becomes competitive there will also be less money available to design such systems.

Slide42
  • Here I hit the 45-minute mark and ended the talk.
  • I will try to upload and annotate the “Bonus Slides” discussing potential disruptive technologies sometime in the next week.

 

Posted in Computer Architecture, Computer Hardware | 1 Comment »

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on 5th November 2016

The art and science of microprocessor architecture is a never-ending struggling to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four (consecutive) 32-bit memory values from memory.   As an example of one mode of operation, the four steps are:

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: http://www.agner.org/optimize/instruction_tables.pdf)
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
    • I am sure someone will be happy to tell me if I am wrong!

So that was a lot of detail — what is the point?

The first point relates the the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.

 

This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.

Implications?

  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g., http://dl.acm.org/citation.cfm?id=1730587), for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many approaches to neural network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?


UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | 2 Comments »

Notes on the mystery of hardware cache performance counters

Posted by John D. McCalpin, Ph.D. on 14th July 2013


In response to a question on the PAPI mailing list, I scribbled some notes to try to help users understand the complexity of hardware performance counters for cache accesses and cache misses, and thought they might be helpful here….


For any interpretation of specific hardware performance counter events, it is absolutely essential to precisely specify the processor that you are using.

Cautionary Notes

Although it may not make a lot of sense, the meanings of “cache miss” and “cache access” are almost always quite different across different vendors’ CPUs, and can be quite different for different CPUs from the same vendor. It is actually rather *uncommon* for L1 cache misses to match L2 cache accesses, for a variety of reasons that are difficult to summarize concisely.

Some examples of behavior that could make the L1 miss counter larger than the L2 access counter:

  • If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch.
  • L1 caches (both data and instruction) typically have hardware prefetch engines. The L1 Icache miss counter may only be incremented when the instruction fetcher requests data that is not found in the L1 Icache, while the L2 cache access counter may be incremented every time the L2 receives either an L1 Icache miss or an L1 Icache prefetch.
  • The processor may attempt multiple instruction fetches of different addresses in the same cache line. The L1 Icache miss event might be incremented on each of these fetch attempts, while the L2 cache access counter might only be incremented once for the cache line request.
  • The processor may be fetching data that is not allowed to be cached in the L2 cache, such as ROM-resident code. It may not be allowed in the L1 Instruction cache either, so every instruction fetch would miss in the L1 cache (because it is not allowed to be there), then bypass access to the L2 cache (because it is not allowed to be there), then get retrieved directly from memory. (I don’t know of any specific processors that work this way, but it is certainly plausible.)

An example of behavior that could make the L1 miss counter smaller than the L2 access counter: (this is a very common scenario)

  • The L1 instruction cache miss counter might be incremented only once when an instruction fetch misses in the L1 Icache, while the L2 cache might be accessed repeatedly until the data actually arrives in the L2. This is especially common in the case of L2 cache misses — the L1 Icache miss might request data from the L2 dozens of times before it finally arrives from memory.

A Recommended Procedure

Given the many possible detailed meanings of such counters, the procedure I use to understand the counter events is:

  1. Identify the processor in detail.
    This includes vendor, family, model, and stepping.
  2. Determine the precise mapping of PAPI events to underlying hardware events.
    (This is irritatingly difficult on Linux systems that use the “perf-events” subsystem — that is a long topic in itself.)
  3. Look up the detailed descriptions of the hardware events in the vendor processor documentation.
    For AMD, this is the relevant “BIOS and Kernel Developers Guide” for the processor family.
    For Intel, this Volume 3 of the “Intel 64 and IA-32 Architecture Software Developer’s Guide”.
  4. Check the vendor’s published processor errata to see if there are known bugs associated with the counter events in question.
    For AMD these documents are titled “Revision Guide for the AMD Family [nn] Processors”.
    For Intel these documents are usually given a title including the words “Specification Update”.
  5. Using knowledge of the cache sizes and associativities, build a simple test code whose behavior should be predictable by simple paper-and-pencil analysis.
    The STREAM Benchmark is an example of a code whose data access patterns and floating point operation counts are easy to determine and easy to modify.
  6. Compare the observed performance counter results for the simple test case with the expected results and try to work out a model that bridges between the two.
    The examples of different ways to count given at the beginning of this note should be very helpful in attempting to construct a model.
  7. Decide which counters are “close enough” to be helpful, and which counters cannot be reliably mapped to performance characteristics of interest.

An example of a counter that (probably) cannot be made useful

As an example of the final case — counters that cannot be reliably mapped to performance characteristics of interest — consider the floating point instruction counters on the Intel “Sandy Bridge” processor series. These counters are incremented on instruction *issue*, not on instruction *execution* or instruction *retirement*. If the inputs to the instruction are not “ready” when the instruction is *issued*, the instruction issue will be rejected and the instruction will be re-issued later, and may be re-issued many times before it is finally able to execute. The most common cause for input arguments to not be “ready” is that they are coming from memory and have not arrived in processor registers yet (either explicit load instructions putting data in registers or implicit register loads via memory arguments to the floating-point arithmetic instruction itself).

For a workload with a very low cache miss rate (e.g., DGEMM), the “overcounting” of FP instruction issues relative to the more interesting FP instruction execution or retirement can be as low as a few percent. For a workload with a high cache miss rate (e.g., STREAM), the “overcounting” of FP instructions can be a factor of 4 to 6 (perhaps worse), depending on how many cores are in use and whether the memory accesses are fully localized (on multi-chip platforms). In the absence of detailed information about the processor’s internal algorithm for retrying operations, it seems unlikely that this large overcount can be “corrected” to get an accurate estimate of the number of floating-point operations actually executed or retired. The amount of over-counting will likely depend on at least the following factors:

  • the instruction retry rate (which may depend on how many instructions are available for attempted issue in the processor’s reorder buffer, including whether or not HyperThreading is enabled),
  • the instantaneous frequency of the processor (which can vary from 1.2 GHz to 3.5 GHz on the Xeon E5-2670 “Sandy Bridge” processors in the TACC “Stampede” system),
  • the detailed breakdown of latency for the individual loads (i.e., the average latency may not be good enough if the retry rate is not fixed),
  • the effectiveness of the hardware prefetchers at getting the data into the data before it is needed (which, in turn, is a function of the number of data streams, the locality of the streams, the contention at the memory controllers)

There are likely other applicable factors as well — for example the Intel “Sandy Bridge” processors support several mechanisms that allow the power management unit to bias behavior related to the trade-off of performance vs power consumption. One mechanism is referred to as the “performance and energy bias hint”, and is described as as a “hint to guide the hardware heuristic of power management features to favor increasing dynamic performance or conserve energy consumption” (Intel 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 14.3.4, Document 325384-047US, June 2013). Another mechanism (apparently only applicable to “Sandy Bridge” systems with integrated graphics units) is a pair of “policy” registers (MSR_PP0_POLICY and MSR_PP1_POLICY) that define the relative priority of the processor cores and the graphics unit in dividing up the chip’s power budget. The specific mechanisms by which these features work, and the detailed algorithms used to control those mechanisms, are not publicly disclosed — but it seems likely that at least some of the mechanisms involved may impact the floating-point instruction retry rate.

Posted in Computer Hardware, Performance, Performance Counters | Comments Off on Notes on the mystery of hardware cache performance counters

What good are “Large Pages” ?

Posted by John D. McCalpin, Ph.D. on 12th March 2012

I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.

  • The size of the page controls how many bits are translated between virtual and physical addresses, and so represent a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
  • A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks.

The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).

To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.

Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):

AMD Opteron Family 10h Revision D (“Istanbul”):

  • L1 DTLB:
    • 4kB pages: 48 entries;
    • 2MB pages: 48 entries;
    • 1GB pages: 48 entries
  • L2 TLB:
    • 4kB pages: 512 entries;
    • 2MB pages: 128 entries;
    • 1GB pages: 16 entries

AMD Opteron Family 15h Model 6220 (“Interlagos”):

  • L1 DTLB
    • 4KiB, 32 entry, fully associative
    • 2MiB, 32 entry, fully associative
    • 1GiB, 32 entry, fully associative
  • L2 DTLB: (none)
  • Unified L2 TLB:
    • Data entries: 4KiB/2MiB/4MiB/1GiB, 1024 entries, 8-way associative
    • “An entry allocated by one core is not visible to the other core of a compute unit.”

Intel Xeon 56xx (“Westmere”):

  • L1 DTLB:
    • 4KiB pages: 64 entries;
    • 2MiB pages: 32 entries
  • L2 TLB:
    • 4kiB pages: 512 entries;
    • 2MB pages: none

Intel Xeon E5 26xx (“Sandy Bridge EP”):

  • L1 DTLB
    • 4KiB, 64 entries
    • 2MiB/4MiB, 32 entries
    • 1GiB, 4 entries
  • STLB (second-level TLB)
    • 4KiB, 512 entries
    • (There are no entries for 2MiB pages or 1GiB pages in the STLB)

Xeon Phi Coprocessor SE10P: (Note 1)

  • L1 DTLB
    • 4KiB, 64 entries, 4-way associative
    • 2MiB, 8 entries, 4-way associative
  • L2 TLB
    • 4KiB, 64 Page Directory Entries, 4-way associative (Note 2)
    • 2MiB, 64 entries, 4-way associative

Most of these cores can map at least 2MiB (512*4kB) using small pages before suffering level 2 TLB misses, and at least 64 MiB (32*2MiB) using large pages.  All of these systems should see a performance increase when performing random accesses over memory ranges that are much larger than 2MB and less than 64MB.

What you are trying to avoid in all these cases is the worst case (Note 3) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:

  • 5 trips to memory to load data mapped on a 4KiB page,
  • 4 trips to memory to load data mapped on a 2MiB page, and
  • 3 trips to memory to load data mapped on a 1GiB page.

In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593).  Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)

A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite.  Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.


Note 1:

The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency.  The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth.  The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker.   The third unusual feature is unusual enough to get a separate discussion in Note 2.

Note 2:

Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB.  Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation — responsible for translating virtual address bits 29:21.  The benefit here is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB=256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup for the Page Table Walk for an address range of 64*2MiB=128MiB.  In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB.  Combining this with the caching for the first two levels of the hierarchical address translation (see Note 4) and a high probability of finding the Page Table Entry in the L1 or L2 caches this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.

Note 3:

The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.

Note 4:

Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.

Posted in Computer Architecture, Computer Hardware, Performance, Reference | Comments Off on What good are “Large Pages” ?

AMD Opteron Processor models, families, and revisions

Posted by John D. McCalpin, Ph.D. on 2nd April 2011

Opteron Processor models, families, and revisions/steppings

Opteron naming is not that confusing, but AMD seems intent on making it difficult by rearranging their web site in mysterious ways….

I am creating this blog entry to make it easier for me to find my own notes on the topic!

The Wikipedia page is has a pretty good listing:
List of AMD Opteron microprocessors

AMD has useful product comparison reference pages at:
AMD Opteron Processor Solutions
AMD Desktop Processor Solutions
AMD Opteron First Generation Reference (pdf)

Borrowing from those pages, a simple summary is:

First Generation Opteron: models 1xx, 2xx, 8xx.

  • These are all Family K8, and are described in AMD pub 26094.
  • They are usually referred to as “Rev E” or “K8, Rev E” processors.
    This is usually OK since most of the 130 nm parts are gone, but there is a new Family 10h rev E (below).
  • They are characterized by having DDR DRAM interfaces, supporting DDR 266, 333, and (Revision E) 400 MHz.
  • This also includes Athlon 64 and Athlon 64 X2 in sockets 754 and 939.
  • Versions:
    • Single core, 130 nm process: K8 revisions B3, C0, CG
    • Single core, 90 nm process: K8 revisions E4, E6
    • Dual core, 90 nm process: K8 revisions E1, E6

Second Generation Opteron: models 12xx, 22xx, 82xx

  • These are upgraded Family K8 cores, with a DDR2 memory controller.
  • They are usually referred to as “Revision F”, or “K8, Rev F”, and are described in AMD pub 32559 (where they are referred to as “Family NPT 0Fh”, with NPT meaning “New Platform Technology” and referring to the infrastructure related to socket F (aka socket 1207), and socket AM2 )
  • This also includes socket AM2 models of Athlon and most Athlon X2 processors (some are Family 11h, described below).
  • There is only one server version, with two steppings:
    • Dual core, 90 nm process: K8 revisions F2, F3

Upgraded Second Generation Opteron: Athlon X2, Sempron, Turion, Turion X2

  • These are very similar to Family 0Fh, revision G (not used in server parts), and are described in AMD document 41256.
  • The memory controller has less functionality.
  • The HyperTransport interface is upgraded to support HyperTransport generation 3.
    This allows a higher frequency connection between the processor chip and the external PCIe controller, so that PCIe gen2 speeds can be supported.

Third Generation Opteron: models 13xx, 23xx, 83xx

  • These are Family 10h cores with an enhanced DDR2 memory controller and are described in AMD publication 41322.
  • All server and most desktop versions have a shared L3 cache.
  • This also includes Phenom X2, X3, and X4 (Rev B3) and Phenom II X2, X3, X4 (Rev C)
  • Versions:
    • Barcelona: Dual core & Quad core, 65 nm process: Family 10h revisions B0, B2, B3, BA
    • Shanghai: Dual core & Quad core, 45 nm process: Family 10h revision C2
    • Istanbul: Up to 6-core, 45 nm process: Family 10h, revision D0
  • Revision D (“Istanbul”) introduced the “HT Assist” probe filter feature to improve scalability in 4-socket and 8-socket systems.

Upgraded Third Generation Opteron: models 41xx & 61xx

  • These are Family 10h cores with an enhanced DDR3-capable memory controller and are also described in AMD publication 41322.
  • All server and most desktop versions have a shared L3 cache.
  • It does not appear that any of the desktop parts use this same stepping as the server parts (D1).
  • There are two versions — both manufactured using a 45nm process:
    • Lisbon: 41xx series have one Family10h revision D1 die per package (socket C32).
    • Magny-Cours: 61xx series have two Family10h revision D1 dice per package (socket G34).
  • Family 10h, Revision E0 is used in the Phenom II X6 products.
    • This revision is the first to offer the “Core Performance Boost” feature.
    • It is also the first to generate confusion about the label “Rev E”.
    • It should be referred to as “Family 10h, Revision E” to avoid ambiguity.

Fourth Generation Opteron: server processor models 42xx & 62xx, and “AMD FX” desktop processors

  • These are socket-compatible with the 41xx and 61xx series, but with the “Bulldozer” core rather than the Family 10h core.
  • The Bulldozer core adds support for:
    • AVX — the extension of SSE from 128 bits wide to 256 bits wide, plus many other improvements. (First introduced in Intel “Sandy Bridge” processors.)
    • AES — additional instructions to dramatically improve performance of AES encryption/descryption. (First introduced in Intel “Westmere” processors.)
    • FMA4 — AMD’s 4-operand multiply-accumulate instructions. (32-bit & 64-bit arithmetic, with 64b, 128b, or 256b vectors.)
    • XOP — AMD’s set of extra integer instructions that were not included in AVX: multiply/accumulate, shift/rotate/permute, etc.
  • All current parts are produced in a 32 nm semiconductor process.
  • Valencia: 42xx series have one Bulldozer revision B2 die per package (socket C32)
  • Interlagos: 62xx series have two Bulldozer revision B2 dice per package (socket G34)
  • “AMD FX”: desktop processors have one Bulldozer revision B2 die per package (socket AM3+)
  • Counting cores and chips is getting more confusing…
    • Each die has 1, 2, 3, or 4 “Bulldozer modules”.
    • Each “Bulldozer module” has two processor cores.
    • The two processor cores in a module share the instruction cache (64kB), some of the instruction fetch logic, the pair of floating-point units, and the 2MB L2 cache.
    • The two processor cores in a module each have a private data cache (16kB), private fixed point functional and address generation units, and schedulers.
    • All modules on a die share an 8 MB L3 cache and the dual-channel DDR3 memory controller.
  • Bulldozer-based systems are characterized by a much larger “turbo” boost frequency increase than previous processors, with almost models supporting an automatic frequency boost of over 20% when not using all the cores, and some models supporting frequency boosts of more than 30%.

Posted in Computer Hardware, Reference | 4 Comments »