John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Computer Architecture' Category

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on 5th November 2016

The art and science of microprocessor architecture is a never-ending struggling to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four (consecutive) 32-bit memory values from memory.   As an example of one mode of operation, the four steps are:

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: http://www.agner.org/optimize/instruction_tables.pdf)
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
    • I am sure someone will be happy to tell me if I am wrong!

So that was a lot of detail — what is the point?

The first point relates the the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.

 

This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.

Implications?

  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g., http://dl.acm.org/citation.cfm?id=1730587), for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many approaches to neural network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?


UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | 2 Comments »

Invited Talk at SuperComputing 2016!

Posted by John D. McCalpin, Ph.D. on 16th October 2016

“Memory Bandwidth and System Balance in HPC Systems”

If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic).   The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.

The official announcement:

SC16 Invited Talk Spotlight: Dr. John D. McCalpin Presents “Memory Bandwidth and System Balance in HPC Systems”

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Invited Talk at SuperComputing 2016!

Counting Stall Cycles on the Intel Sandy Bridge Processor

Posted by John D. McCalpin, Ph.D. on 4th June 2014

Intuition might suggest that defining what a “stall cycle” is on a processor should be relatively straightforward. For some processors, this
is actually the case — particularly in-order processors with a very small number of execution units and a very small number of non-pipelined
instructions. For modern out-of-order processors, coming up with a precise and quantitative definition of “stall” involves numerous subtleties,
and deriving a methodology to measure such stalls is even more difficult.

This week I did some testing of “stalls” using the hardware performance counters in the Intel Xeon E5-2680 (“Sandy Bridge EP”) processors in
the Stampede system at TACC.

I found performance counter events that count stalls at two different places in the processor pipeline (with a third mentioned below, but not tested here):

  • Two events count cycles in which uops are not sent from the RAT (Register Alias Table — the register renaming unit) to the RS (Reservation Station —
    queues uops until the instructions defining their source operands have been dispatched, then dispatches “ready” uops to the execution ports)

    1. Event 0x0E, Umask 0x01: UOPS_ISSUED with the CMASK and INVERT flags: 0x01c3010e
      • Intel’s VTune calls this UOPS_ISSUED.STALL_CYCLES
    2. Event 0xA2, Umask 0x01: RESOURCE.STALLS.ANY
      • Consistently delivers values about 1% to 3% lower than the UOPS_ISSUED.STALL_CYCLES event in my tests.
  • Two events count cycles in which no uops are dispatched from the RS to any of the execution units (aka “ports”).
    1. Event 0xA3, Umask 0x04: CYCLE_ACTIVITY.CYCLES_NO_DISPATCH with CMASK=4: 0x044304a3
      • I got the CMASK value from VTUNE — the documentation in Vol 3 of the SW Developer’s Guide is not very helpful.
    2. Event 0xB1, Umask 0x02: UOPS_DISPATCHED.STALL_CYCLES_CORE: 0x01c302b1
      • This is very similar to an event used by VTune, but I use Umask 0x02 rather than 0x01. This will only make a difference on a system with
        HyperThreading enabled, and I don’t have any systems configured that way to test right now.
      • These two events differed by no more than a part per million in my tests.

As discussed in the Intel forum thread (link), the first two events can easily overcount stalls in codes that have a “stall-free” IPC of less than 4. For example, a code with a “stall-free” IPC of 1 could show 75% stall cycles using these events, with uops transferred from the RAT to the RS in one block of 4 uops every 4 cycles (leaving 3 cycles idle).

The second two events typically undercount stalls because they consider a cycle to be a “non-stall” cycle if any uops are dispatched from the RS to the execution units, even when those uops subsequently get rejected and retried because their input data is not in the cache. Using the STREAM benchmark as my test case, I often saw that the total number of uops dispatched to the execution ports was 20%-50% higher than the number of uops issued from the RAT to the RS. (This was based on a small number of test cases which were not intended to approach the upper bound on uop retries, so I assume that the worst case fraction of retries could be much higher. I have seen retries of floating-point instructions exceeding 12x, and that was not intended to be a worst-case upper bound either.)

Unfortunately, there is no way to count these execution retries directly, and no way to determine how many cycles had instructions dispatched that were all rejected and retried.

Note that one can also count cycles in which no instructions are retired. This was also discussed in the forum thread above, and has the same theoretical problem as counting at issue — the processor can retire at least four instructions per cycle, so if the non-stalled IPC is less than four, burstiness of instruction retirement can result in non-zero stall cycle counts even if there are some instructions executing every cycle.

None of this discussion so far has explicitly dealt with the cause of the stalls. Intel provides a very interesting performance counter event that provides some insight into this issue. Event 0xA3 CYCLE_ACTIVITY has Umasks for “CYCLES_L2_PENDING” (0x01) and “CYCLES_NO_DISPATCH” (0x04). Again, the documentation in Vol 3 of the SW developer’s guide is not adequate to understand how to program this unit, but fortunately Intel’s VTune provides an example. The VTune event CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING is created with this event by combining the two Umasks and including a CMASK value of 5, giving the encoding: 0x054305a3. (It is not at all clear why the CMASK value should be 5 in this case, but the event is clearly non-standard since the combined Umask values are treated as a logical AND rather than the logical OR typically assumed for combined Umasks.)

In experiments with the STREAM benchmark, where the actual number of stall cycles should be around 90%, the values produced by CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING varied between 30% and 93% of the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH counts (without the L2_PENDING qualifier). The lower values were seen with tests using streaming (nontemporal) stores, while the higher values were seen using ordinary (allocating) stores. This pattern makes it clear that this event counts store misses (RFO’s) in the “L2_PENDING” category, but it leaves a “hole” in the memory stall cycle identification in the case where the memory stalls are due to streaming stores.

  • For AVX codes there is an event that catches this reasonably well: Event 0xA2, Umask 0x08: RESOURCE_STALLS.SB (cycles with no issue from the RAT to the RS because the store buffers are full) shows 70%-91% of the total cycles have issue stalls due to full store buffers. So looking at the max of CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING and RESOURCE_STALLS.SB gives a good indication of stalls due to memory for codes with either allocating stores or streaming stores.
  • For SSE codes with streaming stores the RESOURCE_STALLS.SB event is only 20%-37% of the total cycles. Even if you add the percentage stalls from this number to the percentage stalls using CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING you only get 45% – 59% of the total cycles, so I don’t yet have a set of events that can identify that all of the stall cycles are actually memory stalls. (Adding stall cycles in this way is not generally a good idea, since cycles can be stalled for both reasons. I only add the two here to show that they are both much too small to account for all of the stall cycles.)

Summary: There are resources available to help identify memory-related stall cycles, but they are not as precise as one might like. In most cases these counters can identify when memory stalls are dominating execution time, and this is really what a performance analyst is looking for. Once the problem is identified, tuning work is primarily based on execution time of the code section of interest, with hardware performance counters playing (at most) an advisory role.

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Counting Stall Cycles on the Intel Sandy Bridge Processor

Memory Bandwidth on Xeon Phi (Knights Corner)

Posted by John D. McCalpin, Ph.D. on 5th December 2013

A Quick Note

There are a lot of topics that could be addressed here, but this short note will focus on bandwidth from main memory (using the STREAM benchmark) as a function of the number of threads used.

Published STREAM Bandwidth Results

  • Official STREAM submission at: http://www.cs.virginia.edu/stream/stream_mail/2013/0015.html
  • Compiled with icc -mmic -O3 -openmp -DNTIMES=100 -DSTREAM_ARRAY_SIZE=64000000 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -opt-streaming-stores always stream_5-10.c -o stream_intelopt.100x.mic
  • Configured with an array size of 64 million elements per array and 10 iterations.
  • Run with 60 threads (bound to separate physical cores) and Transparent Huge Pages.

 

Function Best Rate MB/s Avg time (sec) Min time (sec) Max time (sec)
Copy 169446.8 0.0062 0.0060 0.0063
Scale 169173.1 0.0062 0.0061 0.0063
Add 174824.3 0.0090 0.0088 0.0091
Triad 174663.2 0.0089 0.0088 0.0091

Memory Controllers

The Xeon Phi SE10P has 8 memory controllers, each controlling two 32-bit channels.  Each 32-bit channel has two GDDR5 chips, each having a 16-bit-wide interface.   Each of the 32 GDDR5 DRAM chips has 16 banks.  This gives a *raw* total of 512 DRAM banks.  BUT:

  • The two GDDR5 chips on each 32-bit channel are operating in “clamshell” mode — emulating a single GDDR5 chip with a 32-bit-wide interface.  (This is done for cost reduction — two 2 Gbit chips with x16 interfaces were presumably a cheaper option than one 4 Gbit chip with a x32 interface).  This reduces the effective number of DRAM banks to 256 (but the effective bank size is doubled from 2KiB to 4 KiB).
  • The two 32-bit channels for each memory controller operate in lockstep — creating a logical 64-bit interface.  Since every cache line is spread across the two 32-bit channels, this reduces the effective number of DRAM banks to 128 (but the effective bank size is doubled again, from 4 KiB to 8 KiB).

So the Xeon Phi SE10P memory subsystem should be analyzed as a 128-bank resource.   Intel has not disclosed the details of the mapping of physical addresses onto DRAM channels and banks, but my own experiments have shown that addresses are mapped to a repeating permutation of the 8 memory controllers in blocks of 62 cache lines.  (The other 2 cache lines in each 64-cacheline block are used to hold the error-correction codes for the block.)

Bandwidth vs Number of Data Access STREAM

One “rule of thumb” that I have found on Xeon Phi is that memory-bandwidth-limited jobs run best when the number of read streams across all the threads is close to, but less than, the number of GDDR5 DRAM banks.  On the Xeon Phi SE10P coprocessors in the TACC Stampede system, this is 128 (see Note 1).    Some data from the STREAM benchmark supports this hypothesis:

Kernel Reads Writes 2/core 3/core 4/core
Copy 1 1 -0.8% -5.2% -7.3%
Scale 1 1 -1.0% -3.3% -6.7%
Add 2 1 -3.1% -12.0% -13.6%
Triad 2 1 -3.6% -11.2% -13.5%

From these results you can see that the Copy and Scale kernels have about the same performance with either 1 or 2 threads per core (61 or 122 read streams), but drop 3%-7% when generating more than 128 address streams, while the Add and Triad kernels are definitely best with one thread per core (122 read streams), and drop 3%-13% when generating more than 128 address streams.

So why am I not counting the write streams?

I found this puzzling for a long time, then I remembered that the Xeon E5-2600 series processors have a memory controller that supports multiple modes of prioritization.  The default mode is to give priority to reads while buffering stores.  Once the store buffers in the memory controller reach a “high water mark”, the mode shifts to giving priority to the stores while buffering reads.  The basic architecture is implied by the descriptions of the “major modes” in section 2.5.8 of the Xeon E5-2600 Product Family Uncore Performance Monitoring Guide (document 327043 — I use revision 001, dated March 2012).      So *if* Xeon Phi adopts a similar multi-mode strategy, the next question is whether the duration in each mode is long enough that the open page efficiency is determined primarily by the number of streams in each mode, rather than by the total number of streams.   For STREAM Triad, the observed bandwidth is ~175 GB/s.  Combining this with the observed average memory latency of about 275 ns (idle) means that at least 175*275=48125 bytes need to be in flight at any time.  This is about 768 cache lines (rounded up to a convenient number) or 96 cache lines per memory controller.  For STREAM Triad, this corresponds to an average of 64 cache line reads and 32 cache line writes in each memory controller at all times.   If the memory controller switches between “major modes” in which it does 64 cache line reads (from two read streams, and while buffering writes) and 32 cache line writes (from one write stream, and while buffering reads), the number of DRAM banks needed at any one time should be close to the number of banks needed for the read streams only….

Posted in Computer Architecture, Performance | Comments Off on Memory Bandwidth on Xeon Phi (Knights Corner)

What good are “Large Pages” ?

Posted by John D. McCalpin, Ph.D. on 12th March 2012

I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.

  • The size of the page controls how many bits are translated between virtual and physical addresses, and so represent a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
  • A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks.

The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).

To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.

Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):

AMD Opteron Family 10h Revision D (“Istanbul”):

  • L1 DTLB:
    • 4kB pages: 48 entries;
    • 2MB pages: 48 entries;
    • 1GB pages: 48 entries
  • L2 TLB:
    • 4kB pages: 512 entries;
    • 2MB pages: 128 entries;
    • 1GB pages: 16 entries

AMD Opteron Family 15h Model 6220 (“Interlagos”):

  • L1 DTLB
    • 4KiB, 32 entry, fully associative
    • 2MiB, 32 entry, fully associative
    • 1GiB, 32 entry, fully associative
  • L2 DTLB: (none)
  • Unified L2 TLB:
    • Data entries: 4KiB/2MiB/4MiB/1GiB, 1024 entries, 8-way associative
    • “An entry allocated by one core is not visible to the other core of a compute unit.”

Intel Xeon 56xx (“Westmere”):

  • L1 DTLB:
    • 4KiB pages: 64 entries;
    • 2MiB pages: 32 entries
  • L2 TLB:
    • 4kiB pages: 512 entries;
    • 2MB pages: none

Intel Xeon E5 26xx (“Sandy Bridge EP”):

  • L1 DTLB
    • 4KiB, 64 entries
    • 2MiB/4MiB, 32 entries
    • 1GiB, 4 entries
  • STLB (second-level TLB)
    • 4KiB, 512 entries
    • (There are no entries for 2MiB pages or 1GiB pages in the STLB)

Xeon Phi Coprocessor SE10P: (Note 1)

  • L1 DTLB
    • 4KiB, 64 entries, 4-way associative
    • 2MiB, 8 entries, 4-way associative
  • L2 TLB
    • 4KiB, 64 Page Directory Entries, 4-way associative (Note 2)
    • 2MiB, 64 entries, 4-way associative

Most of these cores can map at least 2MiB (512*4kB) using small pages before suffering level 2 TLB misses, and at least 64 MiB (32*2MiB) using large pages.  All of these systems should see a performance increase when performing random accesses over memory ranges that are much larger than 2MB and less than 64MB.

What you are trying to avoid in all these cases is the worst case (Note 3) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:

  • 5 trips to memory to load data mapped on a 4KiB page,
  • 4 trips to memory to load data mapped on a 2MiB page, and
  • 3 trips to memory to load data mapped on a 1GiB page.

In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593).  Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)

A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite.  Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.


Note 1:

The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency.  The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth.  The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker.   The third unusual feature is unusual enough to get a separate discussion in Note 2.

Note 2:

Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB.  Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation — responsible for translating virtual address bits 29:21.  The benefit here is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB=256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup for the Page Table Walk for an address range of 64*2MiB=128MiB.  In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB.  Combining this with the caching for the first two levels of the hierarchical address translation (see Note 4) and a high probability of finding the Page Table Entry in the L1 or L2 caches this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.

Note 3:

The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.

Note 4:

Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.

Posted in Computer Architecture, Computer Hardware, Performance, Reference | Comments Off on What good are “Large Pages” ?