The art and science of microprocessor architecture is a never-ending struggling to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors. In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches. Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple *consecutive* operations) and “SIMD” instructions (in the sense of performing multiple *simultaneous* operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 *consecutive* multiply-accumulate operations with a single 512-bit accumulator register, four *different (consecutively-numbered)* 512-bit input registers, and four (consecutive) 32-bit memory values from memory. As an example of one mode of operation, the four steps are:

- Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
- Repeat step 1, but the first input value comes from the
*next consecutive* 32-bit memory location and the second input value comes from the *next consecutive* register number. The results are added to the same accumulator.
- Repeat step 2, but the first input value comes from the
*next* *consecutive* 32-bit memory location and the second input value comes from the *next consecutive* register number. The results are added to the same accumulator.
- Repeat step 3, but the first input value comes from the
*next* *consecutive* 32-bit memory location and the second input value comes from the *next consecutive* register number. The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
for (j=0; j<N; j++) {
for (k=0; k<N; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}

Notes on notation:

- C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
- In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
- In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
- On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
- The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.

- I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

- Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set. This includes Haswell and newer cores.
- These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: http://www.agner.org/optimize/instruction_tables.pdf)
- Optimization requires vectorization and data re-use.
- The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.

- With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
- Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
- Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.

- The number of independent accumulators is equal to the product of three terms:
- the unrolling factor for the “i” loop,
- the unrolling factor for the “j” loop,
- the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
- So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.

- The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
- Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
- Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
- Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.

- The number accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
- Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
- 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
- 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
- 12 is the only number of accumulators that makes sense.
- Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
- The optimum register blocking is therefore based on
- Unrolling the “i” loop 4 times
- Unrolling the “j” loop 3 times
- Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)

- The resulting code requires all 16 registers:
- 12 registers to hold the 12 SIMD accumulators,
- 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
- 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.

- I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
- I am sure someone will be happy to tell me if I am wrong!

So that was a lot of detail — what is the point?

The first point relates the the new Xeon Phi x200 (“Knights Landing”) processor. In the code description above, the broadcast of A requires a separate load with broadcast into a named register. This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads. On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core. The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit. Shouldn’t this fix the problem? Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once. The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only. These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times). The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes. This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop). (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.

This brings us back to the newly disclosed “vector+SIMD” instructions. A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle. This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.

Implications?

- The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
- Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
- Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
- E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
- For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.

- It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
- The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
- Convolutions are important in many approaches to neural network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
- If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
- For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
- E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary. Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
- As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
- It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
- On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?

UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access. The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units. It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles. The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.