John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Performance' Category

Memory Bandwidth Requirements of the HPL benchmark

Posted by John D. McCalpin, Ph.D. on 11th September 2014

The High Performance LINPACK (HPL) benchmark is well known for delivering a high fraction of peak floating-point performance. The (historically) excellent scaling of performance as the number of processors is increased and as the frequency is increased suggests that memory bandwidth has not been a performance limiter.

But this does not mean that memory bandwidth will *never* be a performance limiter. The algorithms used by HPL have lots of data re-use (both in registers and from the caches), but the data still has to go to and from memory, so the bandwidth requirement is not zero, which means that at some point in scaling the number of cores or frequency or FP operations per cycle, we are going to run out of the available memory bandwidth. The question naturally arises: “Are we (almost) there yet?”

Using Intel’s optimized HPL implementation, a medium-sized (N=18000) 8-core (single socket) HPL run on a Stampede compute node (3.1 GHz, 8 cores/chip, 8 FP ops/cycle) showed about 15 GB/s sustained memory bandwidth at about 165 GFLOPS. This level of bandwidth utilization should be no trouble at all (even when running on two sockets), given the 51.2 GB/s peak memory bandwidth (~38 GB/s sustainable) on each socket.

But if we scale this to the peak performance of a new Haswell EP processor (e.g., 2.6 GHz, 12 cores/chip, 16 FP ops/cycle), it suggests that we will need about 40 GB/s of memory bandwidth for a single-socket HPL run and about 80 GB/s of memory bandwidth for a 2-socket run. A single Haswell chip can only deliver about 60 GB/s sustained memory bandwidth, so the latter value is a problem, and it means that we expect LINPACK on a 2-socket Haswell system to require attention to memory placement.

A colleague here at TACC ran into this while testing on a 2-socket Haswell EP system. Running in the default mode showed poor scaling beyond one socket. Running the same binary under “numactl —interleave=0,1” eliminated most (but not all) of the scaling issues. I will need to look at the numbers in more detail, but it looks like the required chip-to-chip bandwidth (when using interleaved memory) may be slightly higher than what the QPI interface can sustain.

Just another reminder that overheads that are “negligible” may not stay that way in the presence of exponential performance growth.

Posted in Algorithms, Performance | Comments Off on Memory Bandwidth Requirements of the HPL benchmark

Counting Stall Cycles on the Intel Sandy Bridge Processor

Posted by John D. McCalpin, Ph.D. on 4th June 2014

Intuition might suggest that defining what a “stall cycle” is on a processor should be relatively straightforward. For some processors, this
is actually the case — particularly in-order processors with a very small number of execution units and a very small number of non-pipelined
instructions. For modern out-of-order processors, coming up with a precise and quantitative definition of “stall” involves numerous subtleties,
and deriving a methodology to measure such stalls is even more difficult.

This week I did some testing of “stalls” using the hardware performance counters in the Intel Xeon E5-2680 (“Sandy Bridge EP”) processors in
the Stampede system at TACC.

I found performance counter events that count stalls at two different places in the processor pipeline (with a third mentioned below, but not tested here):

  • Two events count cycles in which uops are not sent from the RAT (Register Alias Table — the register renaming unit) to the RS (Reservation Station —
    queues uops until the instructions defining their source operands have been dispatched, then dispatches “ready” uops to the execution ports)

    1. Event 0x0E, Umask 0x01: UOPS_ISSUED with the CMASK and INVERT flags: 0x01c3010e
      • Intel’s VTune calls this UOPS_ISSUED.STALL_CYCLES
    2. Event 0xA2, Umask 0x01: RESOURCE.STALLS.ANY
      • Consistently delivers values about 1% to 3% lower than the UOPS_ISSUED.STALL_CYCLES event in my tests.
  • Two events count cycles in which no uops are dispatched from the RS to any of the execution units (aka “ports”).
    1. Event 0xA3, Umask 0x04: CYCLE_ACTIVITY.CYCLES_NO_DISPATCH with CMASK=4: 0x044304a3
      • I got the CMASK value from VTUNE — the documentation in Vol 3 of the SW Developer’s Guide is not very helpful.
    2. Event 0xB1, Umask 0x02: UOPS_DISPATCHED.STALL_CYCLES_CORE: 0x01c302b1
      • This is very similar to an event used by VTune, but I use Umask 0x02 rather than 0x01. This will only make a difference on a system with
        HyperThreading enabled, and I don’t have any systems configured that way to test right now.
      • These two events differed by no more than a part per million in my tests.

As discussed in the Intel forum thread (link), the first two events can easily overcount stalls in codes that have a “stall-free” IPC of less than 4. For example, a code with a “stall-free” IPC of 1 could show 75% stall cycles using these events, with uops transferred from the RAT to the RS in one block of 4 uops every 4 cycles (leaving 3 cycles idle).

The second two events typically undercount stalls because they consider a cycle to be a “non-stall” cycle if any uops are dispatched from the RS to the execution units, even when those uops subsequently get rejected and retried because their input data is not in the cache. Using the STREAM benchmark as my test case, I often saw that the total number of uops dispatched to the execution ports was 20%-50% higher than the number of uops issued from the RAT to the RS. (This was based on a small number of test cases which were not intended to approach the upper bound on uop retries, so I assume that the worst case fraction of retries could be much higher. I have seen retries of floating-point instructions exceeding 12x, and that was not intended to be a worst-case upper bound either.)

Unfortunately, there is no way to count these execution retries directly, and no way to determine how many cycles had instructions dispatched that were all rejected and retried.

Note that one can also count cycles in which no instructions are retired. This was also discussed in the forum thread above, and has the same theoretical problem as counting at issue — the processor can retire at least four instructions per cycle, so if the non-stalled IPC is less than four, burstiness of instruction retirement can result in non-zero stall cycle counts even if there are some instructions executing every cycle.

None of this discussion so far has explicitly dealt with the cause of the stalls. Intel provides a very interesting performance counter event that provides some insight into this issue. Event 0xA3 CYCLE_ACTIVITY has Umasks for “CYCLES_L2_PENDING” (0x01) and “CYCLES_NO_DISPATCH” (0x04). Again, the documentation in Vol 3 of the SW developer’s guide is not adequate to understand how to program this unit, but fortunately Intel’s VTune provides an example. The VTune event CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING is created with this event by combining the two Umasks and including a CMASK value of 5, giving the encoding: 0x054305a3. (It is not at all clear why the CMASK value should be 5 in this case, but the event is clearly non-standard since the combined Umask values are treated as a logical AND rather than the logical OR typically assumed for combined Umasks.)

In experiments with the STREAM benchmark, where the actual number of stall cycles should be around 90%, the values produced by CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING varied between 30% and 93% of the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH counts (without the L2_PENDING qualifier). The lower values were seen with tests using streaming (nontemporal) stores, while the higher values were seen using ordinary (allocating) stores. This pattern makes it clear that this event counts store misses (RFO’s) in the “L2_PENDING” category, but it leaves a “hole” in the memory stall cycle identification in the case where the memory stalls are due to streaming stores.

  • For AVX codes there is an event that catches this reasonably well: Event 0xA2, Umask 0x08: RESOURCE_STALLS.SB (cycles with no issue from the RAT to the RS because the store buffers are full) shows 70%-91% of the total cycles have issue stalls due to full store buffers. So looking at the max of CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING and RESOURCE_STALLS.SB gives a good indication of stalls due to memory for codes with either allocating stores or streaming stores.
  • For SSE codes with streaming stores the RESOURCE_STALLS.SB event is only 20%-37% of the total cycles. Even if you add the percentage stalls from this number to the percentage stalls using CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING you only get 45% – 59% of the total cycles, so I don’t yet have a set of events that can identify that all of the stall cycles are actually memory stalls. (Adding stall cycles in this way is not generally a good idea, since cycles can be stalled for both reasons. I only add the two here to show that they are both much too small to account for all of the stall cycles.)

Summary: There are resources available to help identify memory-related stall cycles, but they are not as precise as one might like. In most cases these counters can identify when memory stalls are dominating execution time, and this is really what a performance analyst is looking for. Once the problem is identified, tuning work is primarily based on execution time of the code section of interest, with hardware performance counters playing (at most) an advisory role.

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Counting Stall Cycles on the Intel Sandy Bridge Processor

Memory Bandwidth on Xeon Phi (Knights Corner)

Posted by John D. McCalpin, Ph.D. on 5th December 2013

A Quick Note

There are a lot of topics that could be addressed here, but this short note will focus on bandwidth from main memory (using the STREAM benchmark) as a function of the number of threads used.

Published STREAM Bandwidth Results

  • Official STREAM submission at:
  • Compiled with icc -mmic -O3 -openmp -DNTIMES=100 -DSTREAM_ARRAY_SIZE=64000000 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -opt-streaming-stores always stream_5-10.c -o stream_intelopt.100x.mic
  • Configured with an array size of 64 million elements per array and 10 iterations.
  • Run with 60 threads (bound to separate physical cores) and Transparent Huge Pages.


Function Best Rate MB/s Avg time (sec) Min time (sec) Max time (sec)
Copy 169446.8 0.0062 0.0060 0.0063
Scale 169173.1 0.0062 0.0061 0.0063
Add 174824.3 0.0090 0.0088 0.0091
Triad 174663.2 0.0089 0.0088 0.0091

Memory Controllers

The Xeon Phi SE10P has 8 memory controllers, each controlling two 32-bit channels.  Each 32-bit channel has two GDDR5 chips, each having a 16-bit-wide interface.   Each of the 32 GDDR5 DRAM chips has 16 banks.  This gives a *raw* total of 512 DRAM banks.  BUT:

  • The two GDDR5 chips on each 32-bit channel are operating in “clamshell” mode — emulating a single GDDR5 chip with a 32-bit-wide interface.  (This is done for cost reduction — two 2 Gbit chips with x16 interfaces were presumably a cheaper option than one 4 Gbit chip with a x32 interface).  This reduces the effective number of DRAM banks to 256 (but the effective bank size is doubled from 2KiB to 4 KiB).
  • The two 32-bit channels for each memory controller operate in lockstep — creating a logical 64-bit interface.  Since every cache line is spread across the two 32-bit channels, this reduces the effective number of DRAM banks to 128 (but the effective bank size is doubled again, from 4 KiB to 8 KiB).

So the Xeon Phi SE10P memory subsystem should be analyzed as a 128-bank resource.   Intel has not disclosed the details of the mapping of physical addresses onto DRAM channels and banks, but my own experiments have shown that addresses are mapped to a repeating permutation of the 8 memory controllers in blocks of 62 cache lines.  (The other 2 cache lines in each 64-cacheline block are used to hold the error-correction codes for the block.)

Bandwidth vs Number of Data Access STREAM

One “rule of thumb” that I have found on Xeon Phi is that memory-bandwidth-limited jobs run best when the number of read streams across all the threads is close to, but less than, the number of GDDR5 DRAM banks.  On the Xeon Phi SE10P coprocessors in the TACC Stampede system, this is 128 (see Note 1).    Some data from the STREAM benchmark supports this hypothesis:

Kernel Reads Writes 2/core 3/core 4/core
Copy 1 1 -0.8% -5.2% -7.3%
Scale 1 1 -1.0% -3.3% -6.7%
Add 2 1 -3.1% -12.0% -13.6%
Triad 2 1 -3.6% -11.2% -13.5%

From these results you can see that the Copy and Scale kernels have about the same performance with either 1 or 2 threads per core (61 or 122 read streams), but drop 3%-7% when generating more than 128 address streams, while the Add and Triad kernels are definitely best with one thread per core (122 read streams), and drop 3%-13% when generating more than 128 address streams.

So why am I not counting the write streams?

I found this puzzling for a long time, then I remembered that the Xeon E5-2600 series processors have a memory controller that supports multiple modes of prioritization.  The default mode is to give priority to reads while buffering stores.  Once the store buffers in the memory controller reach a “high water mark”, the mode shifts to giving priority to the stores while buffering reads.  The basic architecture is implied by the descriptions of the “major modes” in section 2.5.8 of the Xeon E5-2600 Product Family Uncore Performance Monitoring Guide (document 327043 — I use revision 001, dated March 2012).      So *if* Xeon Phi adopts a similar multi-mode strategy, the next question is whether the duration in each mode is long enough that the open page efficiency is determined primarily by the number of streams in each mode, rather than by the total number of streams.   For STREAM Triad, the observed bandwidth is ~175 GB/s.  Combining this with the observed average memory latency of about 275 ns (idle) means that at least 175*275=48125 bytes need to be in flight at any time.  This is about 768 cache lines (rounded up to a convenient number) or 96 cache lines per memory controller.  For STREAM Triad, this corresponds to an average of 64 cache line reads and 32 cache line writes in each memory controller at all times.   If the memory controller switches between “major modes” in which it does 64 cache line reads (from two read streams, and while buffering writes) and 32 cache line writes (from one write stream, and while buffering reads), the number of DRAM banks needed at any one time should be close to the number of banks needed for the read streams only….

Posted in Computer Architecture, Performance | Comments Off on Memory Bandwidth on Xeon Phi (Knights Corner)

Notes on the mystery of hardware cache performance counters

Posted by John D. McCalpin, Ph.D. on 14th July 2013

In response to a question on the PAPI mailing list, I scribbled some notes to try to help users understand the complexity of hardware performance counters for cache accesses and cache misses, and thought they might be helpful here….

For any interpretation of specific hardware performance counter events, it is absolutely essential to precisely specify the processor that you are using.

Cautionary Notes

Although it may not make a lot of sense, the meanings of “cache miss” and “cache access” are almost always quite different across different vendors’ CPUs, and can be quite different for different CPUs from the same vendor. It is actually rather *uncommon* for L1 cache misses to match L2 cache accesses, for a variety of reasons that are difficult to summarize concisely.

Some examples of behavior that could make the L1 miss counter larger than the L2 access counter:

  • If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch.
  • L1 caches (both data and instruction) typically have hardware prefetch engines. The L1 Icache miss counter may only be incremented when the instruction fetcher requests data that is not found in the L1 Icache, while the L2 cache access counter may be incremented every time the L2 receives either an L1 Icache miss or an L1 Icache prefetch.
  • The processor may attempt multiple instruction fetches of different addresses in the same cache line. The L1 Icache miss event might be incremented on each of these fetch attempts, while the L2 cache access counter might only be incremented once for the cache line request.
  • The processor may be fetching data that is not allowed to be cached in the L2 cache, such as ROM-resident code. It may not be allowed in the L1 Instruction cache either, so every instruction fetch would miss in the L1 cache (because it is not allowed to be there), then bypass access to the L2 cache (because it is not allowed to be there), then get retrieved directly from memory. (I don’t know of any specific processors that work this way, but it is certainly plausible.)

An example of behavior that could make the L1 miss counter smaller than the L2 access counter: (this is a very common scenario)

  • The L1 instruction cache miss counter might be incremented only once when an instruction fetch misses in the L1 Icache, while the L2 cache might be accessed repeatedly until the data actually arrives in the L2. This is especially common in the case of L2 cache misses — the L1 Icache miss might request data from the L2 dozens of times before it finally arrives from memory.

A Recommended Procedure

Given the many possible detailed meanings of such counters, the procedure I use to understand the counter events is:

  1. Identify the processor in detail.
    This includes vendor, family, model, and stepping.
  2. Determine the precise mapping of PAPI events to underlying hardware events.
    (This is irritatingly difficult on Linux systems that use the “perf-events” subsystem — that is a long topic in itself.)
  3. Look up the detailed descriptions of the hardware events in the vendor processor documentation.
    For AMD, this is the relevant “BIOS and Kernel Developers Guide” for the processor family.
    For Intel, this Volume 3 of the “Intel 64 and IA-32 Architecture Software Developer’s Guide”.
  4. Check the vendor’s published processor errata to see if there are known bugs associated with the counter events in question.
    For AMD these documents are titled “Revision Guide for the AMD Family [nn] Processors”.
    For Intel these documents are usually given a title including the words “Specification Update”.
  5. Using knowledge of the cache sizes and associativities, build a simple test code whose behavior should be predictable by simple paper-and-pencil analysis.
    The STREAM Benchmark is an example of a code whose data access patterns and floating point operation counts are easy to determine and easy to modify.
  6. Compare the observed performance counter results for the simple test case with the expected results and try to work out a model that bridges between the two.
    The examples of different ways to count given at the beginning of this note should be very helpful in attempting to construct a model.
  7. Decide which counters are “close enough” to be helpful, and which counters cannot be reliably mapped to performance characteristics of interest.

An example of a counter that (probably) cannot be made useful

As an example of the final case — counters that cannot be reliably mapped to performance characteristics of interest — consider the floating point instruction counters on the Intel “Sandy Bridge” processor series. These counters are incremented on instruction *issue*, not on instruction *execution* or instruction *retirement*. If the inputs to the instruction are not “ready” when the instruction is *issued*, the instruction issue will be rejected and the instruction will be re-issued later, and may be re-issued many times before it is finally able to execute. The most common cause for input arguments to not be “ready” is that they are coming from memory and have not arrived in processor registers yet (either explicit load instructions putting data in registers or implicit register loads via memory arguments to the floating-point arithmetic instruction itself).

For a workload with a very low cache miss rate (e.g., DGEMM), the “overcounting” of FP instruction issues relative to the more interesting FP instruction execution or retirement can be as low as a few percent. For a workload with a high cache miss rate (e.g., STREAM), the “overcounting” of FP instructions can be a factor of 4 to 6 (perhaps worse), depending on how many cores are in use and whether the memory accesses are fully localized (on multi-chip platforms). In the absence of detailed information about the processor’s internal algorithm for retrying operations, it seems unlikely that this large overcount can be “corrected” to get an accurate estimate of the number of floating-point operations actually executed or retired. The amount of over-counting will likely depend on at least the following factors:

  • the instruction retry rate (which may depend on how many instructions are available for attempted issue in the processor’s reorder buffer, including whether or not HyperThreading is enabled),
  • the instantaneous frequency of the processor (which can vary from 1.2 GHz to 3.5 GHz on the Xeon E5-2670 “Sandy Bridge” processors in the TACC “Stampede” system),
  • the detailed breakdown of latency for the individual loads (i.e., the average latency may not be good enough if the retry rate is not fixed),
  • the effectiveness of the hardware prefetchers at getting the data into the data before it is needed (which, in turn, is a function of the number of data streams, the locality of the streams, the contention at the memory controllers)

There are likely other applicable factors as well — for example the Intel “Sandy Bridge” processors support several mechanisms that allow the power management unit to bias behavior related to the trade-off of performance vs power consumption. One mechanism is referred to as the “performance and energy bias hint”, and is described as as a “hint to guide the hardware heuristic of power management features to favor increasing dynamic performance or conserve energy consumption” (Intel 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 14.3.4, Document 325384-047US, June 2013). Another mechanism (apparently only applicable to “Sandy Bridge” systems with integrated graphics units) is a pair of “policy” registers (MSR_PP0_POLICY and MSR_PP1_POLICY) that define the relative priority of the processor cores and the graphics unit in dividing up the chip’s power budget. The specific mechanisms by which these features work, and the detailed algorithms used to control those mechanisms, are not publicly disclosed — but it seems likely that at least some of the mechanisms involved may impact the floating-point instruction retry rate.

Posted in Computer Hardware, Performance, Performance Counters | Comments Off on Notes on the mystery of hardware cache performance counters

STREAM version 5.10 released

Posted by John D. McCalpin, Ph.D. on 17th January 2013

After much too long a delay, version 5.10 of the STREAM benchmark has been released (at least in the C language version).

Although version 5.10 of the benchmark still measures exactly the same thing as previous versions, a number of long-awaited features have finally been integrated.

  • Updated Validation Code
  • Array indexing now allows arrays with more than 2 billion elements
  • Data type used can now be overridden from the default “double” to “float” with a single compile flag
  • Many small output formatting changes to account for computers getting bigger and faster

The validation code update is the biggest change to version 5.10 of stream.c.
With previous versions, the validation code was subject to accumulated round-off error that could cause the code to report that validation failed with large array sizes — even if nothing was actually wrong. The revised code eliminates this problem and has been tested to array sizes of 10 billion elements with no problems.

Previous version of STREAM were limited to 32-bit array indices. Version 5.10 defines the array indices using a type that will map to a 64-bit integer on 64-bit machine — thus allowing arrays with more than 2 billion elements. Most compilers require an additional command-line flag like “-mcmodel=medium” to allow full 64-bit addressing. The changes to STREAM in version 5.10 are required in addition to the extra command-line flag.

Dr. Bandwidth also found eight older submissions (from 2009 through early 2012) that somehow got lost in my mailbox and never posted to the site. These are listed on the STREAM benchmark What’s New page.

Along with these older submissions, four new submissions have just been added to the site, ranging from a Raspberry Pi delivering about 200 MB/s to a Xeon Phi SE10P coprocessor delivering over 160,000 MB/s — that’s an 800 to 1 ratio of sustained memory bandwidths measured for single-chip systems!

Users of systems at the Texas Advanced Computing Center will be interested in seeing new results posted for three different components of the Stampede system:

  • Stampede Compute Nodes: Dell_DCS8000 servers with two Xeon E5-2680 (8-core, 2.7 GHz) processors
  • Stampede Coprocessors: Intel_XeonPhi_SE10P Coprocessor (61-core, 1.1 GHz)
  • Stampede Large Memory Nodes: Dell_PowerEdge_820 servers with four Xeon E5-4650 (8-core, 2.7 GHz) processors

Posted in Performance | 3 Comments »

Counting binary vs decimal powers in the STREAM benchmark

Posted by John D. McCalpin, Ph.D. on 5th January 2013

A question came up recently about my choice of definitions for “MB” used in the computation of memory bandwidth (in “MB/s”) in the STREAM benchmark.

According to this reference from NIST, the convention is:

Binary Powers Value abbreviation full name
2^10 1,024 KiB kibibyte
2^20 1,048,576 MiB mebibyte
2^30 1,073,741,824 GiB gibibyte
Decimal Powers Value abbreviation full name
10^3 1,000 kB kilobyte
10^6 1,000,000 MB megabyte
10^9 1,000,000,000 GB gigabyte

Since its inception in 1991, the STREAM benchmark has reported the amount of memory used in MiB (2^20) and (more recently) GiB (2^30), but always reports the transfer rates in MB/s (10^6).

An example may make my motivation more clear.

Suppose a computer system reads 524,288 Bytes and writes 524,288 Bytes in 1.000 seconds, for a total of 1,048,576 Bytes transferred in 1.000 seconds.
The corresponding performance could be reported in a variety of ways:

  • Option 1: report as 1,048,576 Bytes/s
  • Option 2: report as 1.000000 MiB/s
  • Option 3: report as 1.049 MB/s

From my perspective:

  • Option 1 gives inconveniently large numbers.
  • Option 2 is consistent with typical units for memory storage, but:
    • it is not consistent with typical units for counting arithmetic operations (more on that below), and
    • it would allow unscrupulous parties (or simply parties with different opinions about how to “properly” count) to change the definition of “MB” from 2^20 to 10^6, allowing them to report values that were almost 4.9% higher than the *same* performance on other systems.
  • Option 3 is what I chose. It is consistent with how FLOPS are counted and it preempts the potential “performance inflation” from abusing Option 2.

Note that if floating-point arithmetic operation counts define “MFLOPS” as 10^6 FP Ops/s (as is typical), then “balance” ratios of (MB/s)/MFLOPS require that (MB/s) also be defined using a decimal base.
These “balance” ratios are an important output of the STREAM benchmark project.
(Aside: I would not encourage anyone to consider a 5% difference in “balance” to mean very much — these are intended as relatively coarse scaling estimates.)

Posted in Performance | Comments Off on Counting binary vs decimal powers in the STREAM benchmark

What good are “Large Pages” ?

Posted by John D. McCalpin, Ph.D. on 12th March 2012

I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.

  • The size of the page controls how many bits are translated between virtual and physical addresses, and so represent a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
  • A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks.

The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).

To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.

Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):

AMD Opteron Family 10h Revision D (“Istanbul”):

  • L1 DTLB:
    • 4kB pages: 48 entries;
    • 2MB pages: 48 entries;
    • 1GB pages: 48 entries
  • L2 TLB:
    • 4kB pages: 512 entries;
    • 2MB pages: 128 entries;
    • 1GB pages: 16 entries

AMD Opteron Family 15h Model 6220 (“Interlagos”):

  • L1 DTLB
    • 4KiB, 32 entry, fully associative
    • 2MiB, 32 entry, fully associative
    • 1GiB, 32 entry, fully associative
  • L2 DTLB: (none)
  • Unified L2 TLB:
    • Data entries: 4KiB/2MiB/4MiB/1GiB, 1024 entries, 8-way associative
    • “An entry allocated by one core is not visible to the other core of a compute unit.”

Intel Xeon 56xx (“Westmere”):

  • L1 DTLB:
    • 4KiB pages: 64 entries;
    • 2MiB pages: 32 entries
  • L2 TLB:
    • 4kiB pages: 512 entries;
    • 2MB pages: none

Intel Xeon E5 26xx (“Sandy Bridge EP”):

  • L1 DTLB
    • 4KiB, 64 entries
    • 2MiB/4MiB, 32 entries
    • 1GiB, 4 entries
  • STLB (second-level TLB)
    • 4KiB, 512 entries
    • (There are no entries for 2MiB pages or 1GiB pages in the STLB)

Xeon Phi Coprocessor SE10P: (Note 1)

  • L1 DTLB
    • 4KiB, 64 entries, 4-way associative
    • 2MiB, 8 entries, 4-way associative
  • L2 TLB
    • 4KiB, 64 Page Directory Entries, 4-way associative (Note 2)
    • 2MiB, 64 entries, 4-way associative

Most of these cores can map at least 2MiB (512*4kB) using small pages before suffering level 2 TLB misses, and at least 64 MiB (32*2MiB) using large pages.  All of these systems should see a performance increase when performing random accesses over memory ranges that are much larger than 2MB and less than 64MB.

What you are trying to avoid in all these cases is the worst case (Note 3) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:

  • 5 trips to memory to load data mapped on a 4KiB page,
  • 4 trips to memory to load data mapped on a 2MiB page, and
  • 3 trips to memory to load data mapped on a 1GiB page.

In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593).  Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)

A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite.  Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.

Note 1:

The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency.  The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth.  The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker.   The third unusual feature is unusual enough to get a separate discussion in Note 2.

Note 2:

Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB.  Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation — responsible for translating virtual address bits 29:21.  The benefit here is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB=256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup for the Page Table Walk for an address range of 64*2MiB=128MiB.  In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB.  Combining this with the caching for the first two levels of the hierarchical address translation (see Note 4) and a high probability of finding the Page Table Entry in the L1 or L2 caches this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.

Note 3:

The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.

Note 4:

Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.

Posted in Computer Architecture, Computer Hardware, Performance, Reference | Comments Off on What good are “Large Pages” ?

Is “ordered summation” a hard problem to speed up?

Posted by John D. McCalpin, Ph.D. on 15th February 2012

Sometimes things that seem incredibly difficult aren’t really that bad….

I have been reviewing technology challenges for “exascale” computing and ran across an interesting comment in the 2008 “Technology Challenges in Achieving Exascale Systems” report.

In Section 5.8 “Application Assessments”, Figure 5.16 on page 82 places “Ordered Summation” in the upper right hand corner (serial and non-local) with the annotation “Just plain hard to speed up”.
The most obvious use for ordered summation is in computing sums or dot products in such a way that the result does not depend on the order of the computations, or on the number of partial sums used in intermediate stages.

Interestingly, “ordered summation” is not necessary to obtain sums or dot products that are “exact” independent of ordering or grouping. For the very important case of “exact” computation of inner products, the groundwork was laid out 30 years ago by Kulisch (e.g., Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice, Academic Press 1981, and US Patents 4,622,650 and 4,866,653). Kulisch proposes using a very long fixed point accumulator that can handle the full range of products of 64-bit IEEE values — from minimum denorm times minimum denorm to maximum value times maximum value. Working out the details and allowing extra bits to prevent overflow in the case of adding lots of maximum values, Kulisch proposed an accumulator of 4288 bits to handle the accumulation of products of 64-bit IEEE floating-point values. (ref and ref).

For a long time this proposal of a humongously long accumulator (4288 bits = 536 Bytes = 67 64-bit words) was considered completely impractical, but as technology has changed, I think the approach makes a fair amount of sense now — trading computation of these exact inner products for the potentially much more expensive communication required to re-order the summation.

I have not looked at the current software implementations of this exact accumulator in detail, but it appears that on a current Intel microprocessor you can add two exact accumulators in ~133 cycles — one cycle for the first 64-bit addition, and 2 cycles for each of the next 66 64-bit add with carry operations. (AMD processors provide similar capability, with slightly different latency and throughput details.) Although the initial bit-twiddling to convert from two IEEE 64-bit numbers to a 106-bit fixed point value is ugly, the operations should not take very long in the common case, so presumably a software implementation of the exact accumulator would spend most of its time updating exact accumulator from some “base” point (where the low-order bits of the product sit) up to the top of the accumulator. (It is certainly possible to employ various tricks to know when you can stop carrying bits “upstream”, but I am trying to be conservative in the performance estimates here.)

Since these exact accumulations are order-independent you use all of the cores on the chip to run multiple accumulators in parallel. You can also get a bit of speedup by pipelining two accumulations on one core (1.5 cycles per Add with Carry throughput versus 2 cycles per Add with Carry Latency). To keep the control flow separate, this is probably done most easily via HyperThreading. Assuming each pair of 64-bit IEEE inputs generates outputs that average ~1/3 of the way up the exponent range, a naïve implementation would require ~44 Add With Carry operations per accumulate, or about 30 cycles per update in a pipelined implementation. Add another ~25 cycles per element for bit twiddling and control overhead gives ~55 cycles per element on one core, or ~7.5 cycles per element on an 8-core processor. Assume a 2.5GHz clock to get ~3ns per update. Note that the update is associated with 16 Bytes of memory traffic to read the two input arrays, and that the resulting 5.3 GB/s of DRAM bandwidth is well within the chip’s capability. Interestingly, the chip’s sustained bandwidth limitation of 10-15 GB/s means that accumulating into a 64-bit IEEE value is only going to be 2-3 times as fast as this exact technique.

Sending exact accumulators between nodes for a tree-based summation is easy — with current interconnect fabrics the time required to send 536 Bytes is almost the same as the time required to send the 8 Byte IEEE partial sums currently in use. I.e., with QDR Infiniband, the time required to send a message via MPI is something like 1 microsecond plus (message length / 3.2 GB/s). This works out to 1.0025 microseconds with an 8 Byte partial sum and 1.167 microseconds for a 536 Byte partial sum, with the difference expected to decrease as FDR Infiniband is introduced.

I don’t know of anyone using these techniques in production, but it looks like we are getting close to the point where we pay a slight performance penalty (on what was almost certainly a small part of the code’s overall execution time) and never again need to worry about ordering or grouping leading to slightly different answers in sum reductions or dot products. This sounds like a step in the right direction….

Posted in Algorithms, Computer Hardware | Comments Off on Is “ordered summation” a hard problem to speed up?