John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Algorithms' Category

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on 5th November 2016

The art and science of microprocessor architecture is a never-ending struggling to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four (consecutive) 32-bit memory values from memory.   As an example of one mode of operation, the four steps are:

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref:
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
    • I am sure someone will be happy to tell me if I am wrong!

So that was a lot of detail — what is the point?

The first point relates the the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.


This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.


  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g.,, for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many approaches to neural network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?

UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | 2 Comments »

Memory Bandwidth Requirements of the HPL benchmark

Posted by John D. McCalpin, Ph.D. on 11th September 2014

The High Performance LINPACK (HPL) benchmark is well known for delivering a high fraction of peak floating-point performance. The (historically) excellent scaling of performance as the number of processors is increased and as the frequency is increased suggests that memory bandwidth has not been a performance limiter.

But this does not mean that memory bandwidth will *never* be a performance limiter. The algorithms used by HPL have lots of data re-use (both in registers and from the caches), but the data still has to go to and from memory, so the bandwidth requirement is not zero, which means that at some point in scaling the number of cores or frequency or FP operations per cycle, we are going to run out of the available memory bandwidth. The question naturally arises: “Are we (almost) there yet?”

Using Intel’s optimized HPL implementation, a medium-sized (N=18000) 8-core (single socket) HPL run on a Stampede compute node (3.1 GHz, 8 cores/chip, 8 FP ops/cycle) showed about 15 GB/s sustained memory bandwidth at about 165 GFLOPS. This level of bandwidth utilization should be no trouble at all (even when running on two sockets), given the 51.2 GB/s peak memory bandwidth (~38 GB/s sustainable) on each socket.

But if we scale this to the peak performance of a new Haswell EP processor (e.g., 2.6 GHz, 12 cores/chip, 16 FP ops/cycle), it suggests that we will need about 40 GB/s of memory bandwidth for a single-socket HPL run and about 80 GB/s of memory bandwidth for a 2-socket run. A single Haswell chip can only deliver about 60 GB/s sustained memory bandwidth, so the latter value is a problem, and it means that we expect LINPACK on a 2-socket Haswell system to require attention to memory placement.

A colleague here at TACC ran into this while testing on a 2-socket Haswell EP system. Running in the default mode showed poor scaling beyond one socket. Running the same binary under “numactl —interleave=0,1” eliminated most (but not all) of the scaling issues. I will need to look at the numbers in more detail, but it looks like the required chip-to-chip bandwidth (when using interleaved memory) may be slightly higher than what the QPI interface can sustain.

Just another reminder that overheads that are “negligible” may not stay that way in the presence of exponential performance growth.

Posted in Algorithms, Performance | Comments Off on Memory Bandwidth Requirements of the HPL benchmark

Is “ordered summation” a hard problem to speed up?

Posted by John D. McCalpin, Ph.D. on 15th February 2012

Sometimes things that seem incredibly difficult aren’t really that bad….

I have been reviewing technology challenges for “exascale” computing and ran across an interesting comment in the 2008 “Technology Challenges in Achieving Exascale Systems” report.

In Section 5.8 “Application Assessments”, Figure 5.16 on page 82 places “Ordered Summation” in the upper right hand corner (serial and non-local) with the annotation “Just plain hard to speed up”.
The most obvious use for ordered summation is in computing sums or dot products in such a way that the result does not depend on the order of the computations, or on the number of partial sums used in intermediate stages.

Interestingly, “ordered summation” is not necessary to obtain sums or dot products that are “exact” independent of ordering or grouping. For the very important case of “exact” computation of inner products, the groundwork was laid out 30 years ago by Kulisch (e.g., Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice, Academic Press 1981, and US Patents 4,622,650 and 4,866,653). Kulisch proposes using a very long fixed point accumulator that can handle the full range of products of 64-bit IEEE values — from minimum denorm times minimum denorm to maximum value times maximum value. Working out the details and allowing extra bits to prevent overflow in the case of adding lots of maximum values, Kulisch proposed an accumulator of 4288 bits to handle the accumulation of products of 64-bit IEEE floating-point values. (ref and ref).

For a long time this proposal of a humongously long accumulator (4288 bits = 536 Bytes = 67 64-bit words) was considered completely impractical, but as technology has changed, I think the approach makes a fair amount of sense now — trading computation of these exact inner products for the potentially much more expensive communication required to re-order the summation.

I have not looked at the current software implementations of this exact accumulator in detail, but it appears that on a current Intel microprocessor you can add two exact accumulators in ~133 cycles — one cycle for the first 64-bit addition, and 2 cycles for each of the next 66 64-bit add with carry operations. (AMD processors provide similar capability, with slightly different latency and throughput details.) Although the initial bit-twiddling to convert from two IEEE 64-bit numbers to a 106-bit fixed point value is ugly, the operations should not take very long in the common case, so presumably a software implementation of the exact accumulator would spend most of its time updating exact accumulator from some “base” point (where the low-order bits of the product sit) up to the top of the accumulator. (It is certainly possible to employ various tricks to know when you can stop carrying bits “upstream”, but I am trying to be conservative in the performance estimates here.)

Since these exact accumulations are order-independent you use all of the cores on the chip to run multiple accumulators in parallel. You can also get a bit of speedup by pipelining two accumulations on one core (1.5 cycles per Add with Carry throughput versus 2 cycles per Add with Carry Latency). To keep the control flow separate, this is probably done most easily via HyperThreading. Assuming each pair of 64-bit IEEE inputs generates outputs that average ~1/3 of the way up the exponent range, a naïve implementation would require ~44 Add With Carry operations per accumulate, or about 30 cycles per update in a pipelined implementation. Add another ~25 cycles per element for bit twiddling and control overhead gives ~55 cycles per element on one core, or ~7.5 cycles per element on an 8-core processor. Assume a 2.5GHz clock to get ~3ns per update. Note that the update is associated with 16 Bytes of memory traffic to read the two input arrays, and that the resulting 5.3 GB/s of DRAM bandwidth is well within the chip’s capability. Interestingly, the chip’s sustained bandwidth limitation of 10-15 GB/s means that accumulating into a 64-bit IEEE value is only going to be 2-3 times as fast as this exact technique.

Sending exact accumulators between nodes for a tree-based summation is easy — with current interconnect fabrics the time required to send 536 Bytes is almost the same as the time required to send the 8 Byte IEEE partial sums currently in use. I.e., with QDR Infiniband, the time required to send a message via MPI is something like 1 microsecond plus (message length / 3.2 GB/s). This works out to 1.0025 microseconds with an 8 Byte partial sum and 1.167 microseconds for a 536 Byte partial sum, with the difference expected to decrease as FDR Infiniband is introduced.

I don’t know of anyone using these techniques in production, but it looks like we are getting close to the point where we pay a slight performance penalty (on what was almost certainly a small part of the code’s overall execution time) and never again need to worry about ordering or grouping leading to slightly different answers in sum reductions or dot products. This sounds like a step in the right direction….

Posted in Algorithms, Computer Hardware | Comments Off on Is “ordered summation” a hard problem to speed up?