John McCalpin's blog

Dr. Bandwidth explains all….

The evolution of single-core bandwidth in multicore processors

Posted by John D. McCalpin, Ph.D. on 25th April 2023

The primary metric for memory bandwidth in multicore processors is the maximum sustained performance when using many cores.  For most high-end processors these values have remained in the range of 75% to 85% of the peak DRAM bandwidth of the system over the past 15-20 years — an amazing accomplishment given the increase in core count (with its associated cache coherence issues), number of DRAM channels, and ever-increasing pipelining of the DRAMs themselves.

This post is about a secondary performance characteristic — sustained memory bandwidth for a single thread running on a single core.  This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., buffer copies for filesystem access) with a single thread.  In my own experience, I have found that systems with higher single-core bandwidth feel “snappier” when used for interactive work — editing, compiling, debugging, etc.

With that in mind, I decided to mine some historical data (and run a few new experiments) to see how single-thread/single-core memory bandwidth has evolved over the last 10-15 years.  Some of the results were initially surprising, but were all consistent with the fundamental “physics” of bandwidth represented by Little’s Law (lots more on that below).
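
For reference, the simplest form of Little's Law for memory systems is

\[ \text{Concurrency} = \text{Latency} \times \text{Bandwidth} \]

To pick illustrative round numbers (not measurements from any particular system discussed here): sustaining 20 GB/s of read bandwidth at an average memory latency of 80 ns requires 20 GB/s × 80 ns = 1600 Bytes of data in flight, or 25 outstanding 64-Byte cache-line transfers, at all times.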

Looking at sustained single-core bandwidth for a kernel composed of 100% reads, the trends for a large set of high-end AMD and Intel processors are shown in the figure below:

So from 2010 to 2023, the sustainable single-core bandwidth increased by about 2x on Intel processors and about 5x on AMD processors.

Are these “good” improvements?  The table below may provide some perspective:

2023 vs 2010 speedup   1-core BW   1-core GFLOPS   all-core BW                all-core GFLOPS
Intel                  ~2x         ~5x             ~10x (DDR5) / ~30x (HBM)   >40x
AMD                    ~5x         ~5x             ~20x                       ~30x


The single-core bandwidth on the Intel systems is clearly continuing to fall behind the single-core compute capability, and a single core is able to exploit a smaller and smaller fraction of the bandwidth available to a socket.  The single-core bandwidth on AMD systems is increasing at about the same rate as the single-core compute capability, but is also able to sustain a decreasing fraction of the bandwidth of the socket.

These observations naturally give rise to a variety of questions.  Some I will address today, and some in the next blog entry.

Questions and Answers:

  1. Why use a “read-only” memory access pattern?  Why not something like STREAM?
  2. Multicore processors have huge amounts of available DRAM bandwidth – maybe it does not even make sense for a single core to try to use that much?
    • Any recent Intel processor core (Skylake Xeon or newer) has a peak cache bandwidth of (at least) two 64-Byte reads plus one 64-Byte write per cycle.  At a single-core frequency of 3.0 GHz, this is a read BW of 384 GB/s – higher than the full socket bandwidth of 307.2 GB/s with 8 channels of DDR5/4800 DRAM.  I don’t expect a single core to sustain all of that, but the core can clearly make use of more than 20 GB/s.  (A minimal code sketch of a read-only kernel is included after this list.)
  3. Why is the single-core bandwidth increasing so slowly?
    • To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses.
    • That is the topic of the next blog entry.  (“Real Soon Now”)
  4. Can this problem be fixed?
    • Sure! We don’t need to violate any physical laws to increase single-core bandwidth — we just need the design to support the very high levels of memory parallelism required, and to provide us with a way of generating that parallelism from application codes.
    • The NEC Vector Engine processors provide a demonstration of very high single-core bandwidth.   On a VE20B (8 cores, 1.6 GHz, 1530 GB/s peak BW from 6 HBM stacks),  I see single-thread sustained memory bandwidth of 304 GB/s on the ReadOnly benchmark used here.
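
For concreteness, here is a minimal sketch of the kind of 100%-read kernel referred to above. This is not the actual ReadOnly benchmark code (the array size, trial count, timing interface, and use of partial sums are illustrative assumptions), but it shows the access pattern, and the comments repeat the peak-bandwidth arithmetic from item 2.

```c
/* Minimal sketch of a read-only (sum-reduction) bandwidth kernel.
 * Not the actual ReadOnly benchmark: array size, trial count, and timing
 * details are illustrative assumptions.
 * Compile with, e.g.:  gcc -O3 -march=native readonly.c -o readonly
 *
 * For comparison (numbers from the Q&A above):
 *   8 channels * 4.8 GT/s * 8 Bytes = 307.2 GB/s peak DRAM BW (DDR5/4800)
 *   2 loads * 64 Bytes * 3.0 GHz    = 384   GB/s peak L1 read BW per core
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N       (1UL << 27)     /* 128 Mi doubles = 1 GiB, far larger than any cache */
#define NTRIALS 10

static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1.0e-9 * ts.tv_nsec;
}

int main(void)
{
    double *a = malloc(N * sizeof(double));
    for (size_t i = 0; i < N; i++) a[i] = 1.0;          /* instantiate the pages */

    double best = 1.0e30, sum = 0.0;
    for (int t = 0; t < NTRIALS; t++) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        double t0 = now();
        for (size_t i = 0; i < N; i += 4) {             /* 100% reads; four partial  */
            s0 += a[i];                                 /* sums break the FP add     */
            s1 += a[i + 1];                             /* dependency chain so the   */
            s2 += a[i + 2];                             /* loop is limited by memory */
            s3 += a[i + 3];                             /* rather than FP latency    */
        }
        double dt = now() - t0;
        if (dt < best) best = dt;
        sum += s0 + s1 + s2 + s3;                       /* keep the result live      */
    }
    printf("sum = %g, best read bandwidth = %.1f GB/s\n",
           sum, (double)(N * sizeof(double)) / best / 1.0e9);
    free(a);
    return 0;
}
```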


Stay tuned!

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on The evolution of single-core bandwidth in multicore processors

The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Posted by John D. McCalpin, Ph.D. on 2nd April 2020

This was a keynote presentation at the “2nd International Workshop on Performance Modeling: Methods and Applications” (PMMA16), June 23, 2016, Frankfurt, Germany (in conjunction with ISC16).

The presentation discusses a family of simple performance models that I developed over the last 20 years — originally in support of processor and system design at SGI (1996-1999), IBM (1999-2005), and AMD (2006-2008), but more recently in support of system procurements at The Texas Advanced Computing Center (TACC) (2009-present).

More notes interspersed below….


Most of TACC’s supercomputer systems are national resources, open to (unclassified) scientific research in all areas. We have over 5,000 direct users (logging into the systems and running jobs) and tens of thousands of indirect users (who access TACC resources via web portals). With a staff of slightly over 175 full-time employees (less than 1/2 in consulting roles), we must therefore focus on highly-leveraged performance analysis projects, rather than labor-intensive ones.


An earlier presentation on this topic (including extensions of the method to incorporate cost modeling) is from 2007: “System Performance Balance, System Cost Balance, Application Balance, & the SPEC CPU2000/CPU2006 Benchmarks” (invited presentation at the SPEC Benchmarking Joint US/Europe Colloquium, June 22, 2007, Dresden, Germany).


This data is from the 2007 presentation. All of the SPECfp_rate2000 results were downloaded from www.spec.org, the results were sorted by processor type, and “peak floating-point operations per cycle” was manually added for each processor type. This includes all architectures, all compilers, all operating systems, and all system configurations. It is not surprising that there is a lot of scatter, but the factor of four range in Peak MFLOPS at fixed SPECfp_rate2000/core and the factor of four range in SPECfp_rate2000/core at fixed Peak MFLOPS were higher than I expected….


(Also from the 2007 presentation.) To show that I can criticize my own work as well, here I show that sustained memory bandwidth (using an approximation to the STREAM Benchmark) is also inadequate as a single figure of merit. (It is better than peak MFLOPS, but still has roughly a factor of three range when projecting in either direction.)


Here I assumed a particular analytical function for the amount of memory traffic as a function of cache size to scale the bandwidth time.
Details are not particularly important since I am trying to model something that is a geometric mean of 14 individual values and the results are across many architectures and compilers.
Doing separate models for the 14 benchmarks does not reduce the variance much further – there is about 15% that remains unexplainable in such a broad dataset.

The model can provide much better fit to the data if the HW and SW are restricted, as we will see in the next section…



Why no overlap? The model actually includes some kinds of overlap — this will be discussed in the context of specific models below — and can be extended to include overlap between components. The specific models and results that will be presented here fit the data better when it is assumed that there is no overlap between components. Bounds on overlap are discussed near the end of the presentation, in the slides titled “Analysis”.


The approach is opportunistic. When I started this work over 20 years ago, most of the parameters I was varying could only be changed in the vendor’s laboratory. Over time, the mechanisms introduced for reducing energy consumption (first in laptops) became available more broadly. In most current machines, memory frequency can be configured by the user at boot time, while CPU frequency can be varied on a live system.


The assumption that “memory accesses overlap with other memory accesses about as well as they do in the STREAM Benchmark” is based on trying lots of other formulations and getting poor consistency with observations.
Note that “compute” is not a simple metric like “instructions” or “floating-point operations”. T_cpu is best understood as the time that the core requires to execute the particular workload of interest in the absence of memory references.
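
A simplified sketch of the two-component, non-overlapping model (the notation here is illustrative, not copied from the slides):

\[ T_{\mathrm{total}} \approx T_{\mathrm{cpu}} + T_{\mathrm{mem}}, \qquad T_{\mathrm{cpu}} \propto \frac{1}{f_{\mathrm{core}}}, \qquad T_{\mathrm{mem}} \propto \frac{1}{\mathrm{BW}_{\mathrm{sustained}}} \]

Measuring execution time at several core frequencies (with the memory configuration fixed) and at several memory frequencies (with the core frequency fixed) provides the sensitivities needed to estimate the two components separately.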


Only talking about CPU2006 results today – the CPU2000 results look similar (see the 2007 presentation linked above), but the CPU2000 benchmark codes are less closely related to real applications.


Building separate models for each of the benchmarks was required to get the correct asymptotic properties. The geometric mean used to combine the individual benchmark results into a single metric is the right way to combine relative performance improvements with equal weighting for each code, but it is inconsistent with the underlying “physics” of computer performance for each of the individual benchmark results.


This system would be considered a “high-memory-bandwidth” system at the time these results were collected. In other words, this system would be expected to be CPU-limited more often than other systems (when running the same workload), because it would be memory-bandwidth limited less often. This system also had significantly lower memory latency than many contemporary systems (which were still using front-side bus architectures and separate “NorthBridge” chips).


Many of these applications (e.g., NAMD, Gamess, Gromacs, DealII, WRF, and MILC) are major consumers of cycles on TACC supercomputers (albeit newer versions and different datasets).


The published SPEC benchmarks are no longer useful to support this sensitivity-based modeling approach for two main reasons:

  1. Running N independent copies of a benchmark simultaneously on N cores has a lot of similarities with running a parallelized implementation of the benchmark when N is small (2 or 4, or maybe a little higher), but the performance characteristics diverge as N gets larger (certainly dubious by the time one reaches even half the core count of today’s high-end processors).
  2. For marketing reasons, the published results since 2007 have been limited almost exclusively to the configurations that give the very best results. This includes always running with HyperThreading enabled (and running one copy of the executable on each “logical processor”), always running with automatic parallelization enabled (making it very difficult to compare “speed” and “rate” results, since it is not clear how many cores are used in the “speed” tests), always running with the optimum memory configuration, etc.

The “CONUS 12km” benchmark is a simulation of the weather over the “CONtinental US” at 12km horizontal resolution. Although it is a relatively small benchmark, the performance characteristics have been verified to be quite similar to the “on-node” performance characteristics of higher-resolution test cases (e.g., “CONUS 2.5km”) — especially when the higher-resolution cases are parallelized across multiple nodes.



Note that the execution time varies from about 120 seconds to 210 seconds — this range is large compared to the deviations between the model and the observed execution time.

Note also that the slope of the Model 1 fit is almost 6% off of the desired value of 1.0, while the second model is within 1%.

  • In the 2007 SPECfp_rate tests, a similar phenomenon was seen, and required the addition of a third component to the model: memory latency.
  • In these tests, we did not have the same ability to vary memory latency that I had with the 2007 Opteron systems. In these “real-application” tests, IO is not negligible (while it is required to be <1% of execution time for the SPEC benchmarks), and allowing for a small invariant IO time gave much better results.

Bottom bars are model CPU time – easy to see the quantization.
Middle bars are model Memory time.
Top bars are (constant) IO time.


Drum roll, please….

Ordered differently, but the same sort of picture.
Here the quantization of memory time is visible across the four groups of varying CPU frequencies.


These NAMD results are not at all surprising — NAMD has extremely high cache re-use and therefore very low rates of main memory access — but it was important to see if this testing methodology replicated this expected result.


Big change of direction here….
At the beginning I said that I was assuming that there would be no overlap across the execution times associated with the various work components.
The extremely close agreement between the model results and observations strongly supports the effectiveness of this assumption.
On the other hand, overlap is certainly possible, so what can this methodology provide for us in the way of bounds on the degree of overlap?


On some HW it is possible (but very rare) to get timings (slightly) outside these bounds – I ignore such cases here.
Note that maximum ratio of upper bound over lower bound is equal to “N” – the degrees of freedom of the model! This is an uncomfortably large range of uncertainty – making it even more important to understand bounds on overlap.


Physically this is saying that there can’t be so much work in any of the components that processing that work would exceed the total time observed.
But the goal is to learn something about the ratios of the work components, so we need to go further.
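
To make the basic constraint concrete (in the same illustrative notation as above): if the model decomposes the work into N components with busy times T_1, …, T_N, then

\[ \max_i T_i \;\le\; T_{\mathrm{observed}} \;\le\; \sum_{i=1}^{N} T_i \]

where the upper limit corresponds to no overlap and the lower limit to complete overlap. The ratio of the two limits can be as large as N (when all the components are equal), which is the factor-of-N uncertainty noted above.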



These numbers come from plugging in synthetic performance numbers from a model with variable overlap into the bounds analysis.
Message 1: If you want tight formal bounds on overlap, you need to be able to vary the “rate” parameters over a large range — probably too large to be practical.
Message 2: If one of the estimated time components is small and you cannot vary the corresponding rate parameter over a large enough range, it may be impossible to tell whether the work component is “fully overlapped” or is associated with a negligible amount of work (e.g., the lower bound in the “2:1 R_mem” case in this figure). (See next slide.)

Posted in Computer Architecture, Performance | Comments Off on The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

New Year’s Updates

Posted by John D. McCalpin, Ph.D. on 9th January 2019

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:

Posted in Computer Architecture, Performance | 2 Comments »

Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Posted by John D. McCalpin, Ph.D. on 6th December 2016

The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.

It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)

The modes that are important are:

  • “Flat” vs “Cache”
    • In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
      • The OS exposes this memory as being on “NUMA node 1”, so it can be accessed using the standard NUMA control facilities (e.g., numactl).  (A minimal code sketch is included after this list.)
      • Sustained bandwidth from MCDRAM is highest in “Flat” mode.
    • In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
      • In this mode the MCDRAM is invisible and effectively uncontrollable.  I will discuss the performance characteristics of Cache mode at a later date.
  • “All-to-All” vs “Quadrant”
    • In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
      • Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
    • In “Quadrant” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but each address is assigned to one of the MCDRAM controllers in the same “quadrant” as the coherence controller.
      • This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
      • Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
        • Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
  • “Sub-NUMA-Cluster”
    • There are several of these modes, only one of which will be discussed here.
    • “Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
      • “node 0” owns the 1st quarter of contiguous physical address space.
        • The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
        • Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
        • The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
      • Ditto for nodes 1, 2, and 3.
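
As a concrete illustration of the “Flat” mode usage noted above: a minimal sketch of allocating a buffer directly from MCDRAM (“NUMA node 1”) using libnuma. The buffer size and error handling are illustrative assumptions, not part of any benchmark used for the measurements below.

```c
/* Minimal sketch: explicit MCDRAM allocation in "Flat" mode via libnuma.
 * Sizes and error handling are illustrative.
 * Compile with:  gcc -O2 flat_mcdram.c -lnuma
 * (Alternatively, an unmodified program can be bound to MCDRAM with
 *  "numactl --membind=1 ./a.out".)
 */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports no NUMA support\n");
        return 1;
    }
    size_t bytes = 1UL << 30;                    /* 1 GiB, must fit in the 16 GiB of MCDRAM */
    double *buf = numa_alloc_onnode(bytes, 1);   /* node 1 = MCDRAM in Flat mode            */
    if (buf == NULL) {
        fprintf(stderr, "MCDRAM allocation failed\n");
        return 1;
    }
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        buf[i] = 0.0;                            /* first touch instantiates the pages      */
    printf("allocated %zu bytes on NUMA node 1\n", bytes);
    numa_free(buf, bytes);
    return 0;
}
```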

The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).

My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory.  The values presented are averaged over many addresses, with the ranges showing the variation of average latency across cores.

Mode of Operation                Flat-Quadrant   Flat-All2All   SNC4 local   SNC4 remote
MCDRAM maximum latency (ns)      156.1           158.3          153.6        164.7
MCDRAM average latency (ns)      154.0           155.9          150.5        156.8
MCDRAM minimum latency (ns)      152.3           154.4          148.3        150.3
MCDRAM standard deviation (ns)   1.0             1.0            0.9          3.1

Caveats:

  • My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
  • Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
    • E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
  • Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.

Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….

The corresponding data for addresses mapped to DDR4 memory are included in the table below:

Mode of Operation                Flat-Quadrant   Flat-All2All   SNC4 local   SNC4 remote
DDR4 maximum latency (ns)        133.3           136.8          130.0        141.5
DDR4 average latency (ns)        130.4           131.8          128.2        133.1
DDR4 minimum latency (ns)        128.2           128.5          125.4        126.5
DDR4 standard deviation (ns)     1.2             2.4            1.1          3.1

There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.

Posted in Cache Coherence Implementations, Computer Architecture, Computer Hardware, Performance | Comments Off on Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Invited Talk at SuperComputing 2016!

Posted by John D. McCalpin, Ph.D. on 16th October 2016

“Memory Bandwidth and System Balance in HPC Systems”

If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic).   The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.

The official announcement:

SC16 Invited Talk Spotlight: Dr. John D. McCalpin Presents “Memory Bandwidth and System Balance in HPC Systems”

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Invited Talk at SuperComputing 2016!

AMD Opteron “Shanghai” and “Istanbul” Local and Remote Memory Latencies

Posted by John D. McCalpin, Ph.D. on 27th July 2012

In an earlier post, I documented the local and remote memory latencies for the SunBlade X6420 compute nodes in the TACC Ranger supercomputer, using AMD Opteron “Barcelona” (model 8356) processors running at 2.3 GHz.

Similar latency tests were run on other systems based on AMD Opteron processors in the TACC “Discovery” benchmarking cluster.  The systems reported here include

  • 2-socket node with AMD Opteron “Shanghai” processors (model 2389, quad-core, revision C2, 2.9 GHz)
  • 2-socket node with AMD Opteron “Istanbul” processors (model 2435, six-core, revision D0, 2.6 GHz)
  • 4-socket node with AMD Opteron “Istanbul” processors (model 8435, six-core, revision D0, 2.6 GHz)
  • 4-socket node with AMD Opteron “Magny-Cours” processors (model 6174, twelve-core, revision D1, 2.2 GHz)

Compared to the previous results with the AMD Opteron “Barcelona” processors on the TACC Ranger system, the “Shanghai” and “Istanbul” processors have a more advanced memory controller prefetcher, and the “Istanbul” processor also supports “HT Assist”, which allocates a portion of the L3 cache to serve as a “snoop filter” in 4-socket configurations.  The “Magny-Cours” processor uses 2 “Istanbul” die in a single package.

Note that for the “Barcelona” processors, the hardware prefetcher in the memory controller did not perform cache coherence “snoops” — it just loaded data from memory into a buffer.  When a core subsequently issued a load for that address, the miss in the L3 cache initiated a coherence snoop.   In both 2-socket and 4-socket systems, this snoop took longer than obtaining the data from DRAM, so the memory prefetcher had no impact on the effective latency seen by a processor core.  “Shanghai” and later Opteron processors include a coherent prefetcher, so prefetched lines could be loaded with lower effective latency.   This difference means that latency testing on “Shanghai” and later processors needs to be slightly more sophisticated to prevent memory controller prefetching from biasing the latency measurements.  In practice, using a pointer-chasing code with a stride of 512 Bytes was sufficient to avoid hardware prefetch in “Shanghai”, “Istanbul”, and “Magny-Cours”.

Results for 2-socket systems

Processor      cores/package   Frequency   Family   Revision   Code Name     Local Latency (ns)   Remote Latency (ns)
Opteron 2222   2               3.0 GHz     0Fh      --         --            60                   95
Opteron 2356   4               2.3 GHz     10h      B3         Barcelona     85                   --
Opteron 2389   4               2.9 GHz     10h      C2         Shanghai      73                   --
Opteron 2435   6               2.6 GHz     10h      D0         Istanbul      78                   --
Opteron 6174   12              2.2 GHz     10h      D1         Magny-Cours   --                   --

Notes for 2-socket results:

  1. The values for the Opteron 2222 are from memory, but should be fairly accurate.
  2. The local latency value for the Opteron 2356 is from memory, but should be in the right ballpark.  The latency is higher than the earlier processors because of the lower core frequency, the lower “Northbridge” frequency, the presence of an L3 cache, and the asynchronous clock boundary between the core(s) and the Northbridge.
  3. The script used for the Opteron 2389 (“Shanghai”) did not correctly bind the threads, so no remote latency data was collected.
  4. The script used for the Opteron 2435 (“Istanbul”) did not correctly bind the threads, so no remote latency data was collected.
  5. The Opteron 6174 was not tested in the 2-socket configuration.

Results for 4-socket systems

Processor      Frequency   Family   Revision   Code Name     Local Latency (ns)   Remote Latency (ns)   NOTES   1-hop median Latency (ns)   2-hop median Latency (ns)
Opteron 8222   3.0 GHz     0Fh      --         --            90                   --                    --      --                          --
Opteron 8356   2.3 GHz     10h      B3         Barcelona     100/133              122-146               1,2     --                          --
Opteron 8389   2.9 GHz     10h      C2         Shanghai      --                   --                    --      --                          --
Opteron 8435   2.6 GHz     10h      D0         Istanbul      56                   118                   3,4     --                          --
Opteron 6174   2.2 GHz     10h      D1         Magny-Cours   56                   121-179               5       124                         179

Notes for 4-socket results:

  1. On the 4-socket Opteron 8356 (“Barcelona”) system, 2 of the 4 sockets have a local latency of 100ns, while the other 2 have a local latency of 133ns.  This is due to the asymmetric/anisotropic HyperTransport interconnect, in which only 2 of the sockets have direct connections to all other sockets, while the other 2 sockets require two hops to reach one of the remote sockets.
  2. On the 4-socket Opteron 8356 (“Barcelona”) system, the asymmetric/anisotropic HyperTransport interconnect gives rise to several different latencies for various combinations of requestor and target socket.  This is discussed in more detail at AMD “Barcelona” Memory Latency.
  3. Starting with “Istanbul”, 4-socket systems have lower local latency than 2-socket systems because “HyperTransport Assist” (a probe filter) is activated.  Enabling this feature reduces the L3 cache size from 6MiB to 5MiB, but enables the processor to avoid sending probes to the other chips in many cases (including this one).
  4. On the 4-socket Opteron 8435 (“Istanbul”) system, the scripts I ran had an error causing them to only measure local latency and remote latency on 1 of the 3 remote sockets.  Based on other system results, it looks like the remote value was measured for a single “hop”, with no values available for the 2-hop case.
  5. On the 4-socket Opteron 6174 (“Magny-Cours”) system, each package has 2 die, each constituting a separate NUMA node.  The HyperTransport interconnect is asymmetric/anisotropic, with 2 die having direct links to 6 other die (with 1 die requiring 2 hops), and the other 6 die having direct links to 4 other die (with 3 die requiring 2 hops).  The average latency for globally uniform accesses (local and remote) is 133ns, while the average latency for uniformly distributed remote accesses is 144ns.

Comments

This disappointingly incomplete dataset still shows a few important features….

  • Local latency is not improving, and shows occasional degradations
    • Processor frequencies are not increasing.
    • DRAM latencies are essentially unchanged over this period — about 15 ns for DRAM page hits, 30 ns for DRAM page misses, 45 ns for DRAM page conflicts, and 60 ns for DRAM bank cycle time.   The latency benchmark is configured to allow open page hits for the majority of accesses, but these results did not include instrumentation of the DRAM open page hit rate.
    • Many design changes act to increase memory latency.
      • Major factors include the increased number of cores per chip, the addition of a shared L3 cache, the increase in the number of die per package, and the addition of a separate clock frequency domain for the Northbridge.
    • There have been no architectural changes to move away from the transparent, “flat” shared-memory paradigm.
    • Instead, overcoming these latency adders required the introduction of “probe filters” – a useful feature, but one that significantly complicates the implementation, uses 1/6th of the L3 cache, and significantly complicates performance analysis.
  • Remote latency is getting slowly worse
    • This is primarily due to the addition of the L3 cache, the increase in the number of cores, and the increase in the number of die per package.

Magny-Cours Topology

The pointer-chasing latency code was run for all 64 combinations of data binding (NUMA nodes 0..7) and thread binding (1 core chosen from each of NUMA nodes 0..7).   It was not initially clear which topology was used in the system under test, but the observed latency pattern showed very clearly that 2 NUMA nodes had 6 1-hop neighbors, while the other 6 NUMA nodes had only 4 1-hop neighbors.   This property is also shown by the “Max Perf” configuration from Figure 3(c) of the 2010 IEEE Micro paper “Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor” by Conway, et al. (which I highly recommend for its discussion of the cache coherence protocol and probe filter).   The figure below corrects an error from the figure in the paper (which is missing the x16 link between the upper and lower chips in the package on the left), and renumbers the die so that they correspond to the NUMA nodes of the system I was testing.

The latency values are quite easy to understand: all the local values are the same, all the 1-hop values are almost the same, and all the 2-hop values are the same.  (The 1-hop values that are within the same package are about 3.3ns faster than the 1-hop values between packages, but this is a difference of less than 3%, so it will not impact “real-world” performance.)

The bandwidth patterns are much less pretty, but that is a much longer topic for another day….

Posted in Computer Hardware | Comments Off on AMD Opteron “Shanghai” and “Istanbul” Local and Remote Memory Latencies

TACC Ranger Node Local and Remote Memory Latency Tables

Posted by John D. McCalpin, Ph.D. on 26th July 2012

In the previous post, I published my best set of numbers for local memory latency on a variety of AMD Opteron system configurations. Here I expand that to include remote memory latency on some of the systems that I have available for testing.

Ranger is the oldest system still operational here at TACC.  It was brought on-line in February 2008 and is currently scheduled to be decommissioned in early 2013.  Each of the 3936 SunBlade X6420 nodes contains four AMD “Barcelona” quad-core Opteron processors (model 8356), running at a core frequency of 2.3 GHz and a NorthBridge frequency of 1.6 GHz.  (The Opteron 8356 processor supports a higher NorthBridge frequency, but this requires a different motherboard with  “split-plane” power supply support — not available on the SunBlade X6420.)

The on-node interconnect topology of the SunBlade X6420 is asymmetric, making maximum use of the three HyperTransport links on each Opteron processor while still allowing 2 HyperTransport links to be used for I/O.

As seen in the figure below, chips 1 & 2 on each node are directly connected to each of the other three chips, while chips 0 & 3 are only connected to two other chips — requiring two “hops” on the HyperTransport network to access the third remote chip.  Memory latency on this system is bounded below by the time required to “snoop” the caches on the other chips.  Chips 1 & 2 are directly connected to the other chips, so they get their snoop responses back more quickly and therefore have lower memory latency.

Ranger compute node processor interconnect topology.

A variant of the “lat_mem_rd.c” program from “lmbench” (version 2) was used to measure the memory access latency.  The benchmark chases a chain of pointers that have been set up with a fixed stride of 128 Bytes (so that the core hardware prefetchers are not activated) and with a total size that significantly exceeds the size of the 2MiB L3 cache.  For the table below, array sizes of 32MiB to 1024MiB were used, with negligible variations in observed latency.    For this particular system, the memory controller prefetchers were active with the stride of 128 used, but since the effective latency is limited by the snoop response time, there is no change to the effective latency even when the memory controller prefetchers fetch the data early.  (I.e., the processors might get the data earlier due to memory controller prefetch, but they cannot use the data until all the snoop responses have been received.)
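
For readers who have not seen this kind of test, here is a minimal sketch of a pointer-chasing latency measurement with a fixed 128-Byte stride. This is not the actual modified lat_mem_rd code used for the results below; the array size, load count, and timing interface are illustrative assumptions.

```c
/* Minimal sketch of a pointer-chasing latency test with a fixed 128-Byte
 * stride.  Not the actual lat_mem_rd variant used for these measurements:
 * array size, load count, and timing are illustrative assumptions.
 * Compile with:  gcc -O2 chase.c -o chase
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE 128                      /* Bytes between successive loads            */
#define ARRAY  (64UL * 1024 * 1024)     /* 64 MiB, far larger than the 2 MiB L3      */
#define NLOADS (16UL * 1024 * 1024)     /* number of dependent loads to time         */

int main(void)
{
    char  *buf   = malloc(ARRAY);
    size_t nptrs = ARRAY / STRIDE;

    /* Build a circular chain of pointers, one every STRIDE Bytes. */
    for (size_t i = 0; i < nptrs; i++)
        *(void **)(buf + i * STRIDE) = buf + ((i + 1) % nptrs) * STRIDE;

    void *p = buf;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < NLOADS; i++)
        p = *(void **)p;                /* each load depends on the previous one     */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1.0e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("average load-to-use latency: %.1f ns  (final pointer %p)\n",
           ns / (double)NLOADS, p);
    free(buf);
    return 0;
}
```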

Memory latency for all combinations of (chip making request) and (chip holding data) are shown in the table below:

Memory Latency (ns) Data on Chip 0 Data on Chip 1 Data on Chip 2 Data on Chip 3
Request from Chip 0 133.2 136.9 136.4 145.4
Request from Chip 1 140.3 100.3 122.8 139.3
Request from Chip 2 140.4 122.2 100.4 139.3
Request from Chip 3 146.4 137.4 137.4 134.9
Cache latency and local and remote memory latency for Ranger compute nodes.


Posted in Computer Hardware | Comments Off on TACC Ranger Node Local and Remote Memory Latency Tables

What good are “Large Pages” ?

Posted by John D. McCalpin, Ph.D. on 12th March 2012

I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.

  • The size of the page controls how many bits are translated between virtual and physical addresses, and so represent a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
  • A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks.

The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).

To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.

Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):

AMD Opteron Family 10h Revision D (“Istanbul”):

  • L1 DTLB:
    • 4kB pages: 48 entries;
    • 2MB pages: 48 entries;
    • 1GB pages: 48 entries
  • L2 TLB:
    • 4kB pages: 512 entries;
    • 2MB pages: 128 entries;
    • 1GB pages: 16 entries

AMD Opteron Family 15h Model 6220 (“Interlagos”):

  • L1 DTLB
    • 4KiB, 32 entry, fully associative
    • 2MiB, 32 entry, fully associative
    • 1GiB, 32 entry, fully associative
  • L2 DTLB: (none)
  • Unified L2 TLB:
    • Data entries: 4KiB/2MiB/4MiB/1GiB, 1024 entries, 8-way associative
    • “An entry allocated by one core is not visible to the other core of a compute unit.”

Intel Xeon 56xx (“Westmere”):

  • L1 DTLB:
    • 4KiB pages: 64 entries;
    • 2MiB pages: 32 entries
  • L2 TLB:
    • 4kiB pages: 512 entries;
    • 2MB pages: none

Intel Xeon E5 26xx (“Sandy Bridge EP”):

  • L1 DTLB
    • 4KiB, 64 entries
    • 2MiB/4MiB, 32 entries
    • 1GiB, 4 entries
  • STLB (second-level TLB)
    • 4KiB, 512 entries
    • (There are no entries for 2MiB pages or 1GiB pages in the STLB)

Xeon Phi Coprocessor SE10P: (Note 1)

  • L1 DTLB
    • 4KiB, 64 entries, 4-way associative
    • 2MiB, 8 entries, 4-way associative
  • L2 TLB
    • 4KiB, 64 Page Directory Entries, 4-way associative (Note 2)
    • 2MiB, 64 entries, 4-way associative

Most of these cores can map at least 2MiB (512*4kB) using small pages before suffering level 2 TLB misses, and at least 64 MiB (32*2MiB) using large pages.  All of these systems should see a performance increase when performing random accesses over memory ranges that are much larger than 2MB and less than 64MB.

What you are trying to avoid in all these cases is the worst case (Note 3) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:

  • 5 trips to memory to load data mapped on a 4KiB page,
  • 4 trips to memory to load data mapped on a 2MiB page, and
  • 3 trips to memory to load data mapped on a 1GiB page.

In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593).  Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)

A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite.  Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.
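
A minimal sketch of how one might combine explicit 2MiB pages with widely spaced random accesses on Linux is shown below. This is not the HPC Challenge RandomAccess code; the region size, the update count, the xorshift generator, and the use of MAP_HUGETLB (which requires huge pages to be reserved in advance, e.g., via /proc/sys/vm/nr_hugepages) are all illustrative assumptions.

```c
/* Minimal sketch: random updates over a large region backed by 2 MiB pages.
 * Not the HPC Challenge "RandomAccess" benchmark: sizes, the RNG, and the
 * use of MAP_HUGETLB are illustrative assumptions.
 * Compile with:  gcc -O2 hugerand.c -o hugerand
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>

#define BYTES   (1UL << 32)             /* 4 GiB, far larger than the TLB reach      */
#define UPDATES (1UL << 26)

int main(void)
{
    uint64_t *table = mmap(NULL, BYTES, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (table == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB) failed (are huge pages reserved?)");
        return 1;
    }
    size_t nwords = BYTES / sizeof(uint64_t);

    uint64_t x = 88172645463325252UL;   /* xorshift64 generator for "random" indices */
    for (uint64_t i = 0; i < UPDATES; i++) {
        x ^= x << 13;  x ^= x >> 7;  x ^= x << 17;
        table[x % nwords] ^= x;         /* each update lands on an unpredictable page */
    }
    printf("completed %lu updates over %lu bytes\n",
           (unsigned long)UPDATES, (unsigned long)BYTES);
    munmap(table, BYTES);
    return 0;
}
```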


Note 1:

The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency.  The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth.  The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker.   The third unusual feature is unusual enough to get a separate discussion in Note 2.

Note 2:

Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB.  Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation — responsible for translating virtual address bits 29:21.  The benefit here is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB=256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup for the Page Table Walk for an address range of 64*2MiB=128MiB.  In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB.  Combining this with the caching for the first two levels of the hierarchical address translation (see Note 4) and a high probability of finding the Page Table Entry in the L1 or L2 caches, this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.

Note 3:

The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.

Note 4:

Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.

Posted in Computer Architecture, Computer Hardware, Performance, Reference | Comments Off on What good are “Large Pages” ?

Memory Latency Components

Posted by John D. McCalpin, Ph.D. on 10th March 2011

A reader of this site asked me if I had a detailed breakdown of the components of memory latency for a modern microprocessor-based system. Since the only real data I have is confidential/proprietary and obsolete, I decided to try to build up a latency equation from memory….

Preliminary Comments:

It is possible to estimate pieces of the latency equation on various systems if you combine carefully controlled microbenchmarks with a detailed understanding of the cache hierarchy, the coherence protocol, and the hardware performance monitors. Being able to control the CPU, DRAM, and memory controller frequencies independently is a big help.

On the other hand, if you have not worked in the design team of a modern microprocessor it is unlikely that you will be able to anticipate all the steps that are required in making a “simple” memory access. I spent most of 12 years in design teams at SGI, IBM, and AMD, and I am pretty sure that I cannot think of all the required steps.

Memory Latency Components: Abridged

Here is a sketch of some of the components for a simple, single-chip system (my AMD Phenom II model 555), for which I quoted a pointer-chasing memory latency of 51.58 ns at 3.2 GHz with DDR3/1600 memory. I will start counting when the load instruction is issued (ignoring instruction fetch, decode, and queuing).

  1. The load instruction queries the (virtually addressed) L1 cache tags — this probably occurs one cycle after the load instruction executes.
    Simultaneously, the virtual address is looked up in the TLB. Assuming an L1 Data TLB hit, the corresponding physical address is available ~1 cycle later and is used to check for aliasing in the L1 Data Cache (this is rare). Via sneakiness, the Opteron manages to perform both queries with only a single access to the L1 Data Cache tags.
  2. Once the physical address is available and it has been determined that the virtual address missed in the L1, the hardware initiates a query of the (private) L2 cache tags and the core’s Miss Address Buffers. In parallel with this, the Least Recently Used entry in the corresponding congruence class of the L1 Data Cache is selected as the “victim” and migrated to the L2 cache (unless the chosen victim entry in the L1 is in the “invalid” state or was originally loaded into the L1 Data Cache using the PrefetchNTA instruction).
  3. While the L2 tags are being queried, a Miss Address Buffer is allocated and a speculative query is sent to the L3 cache directory.
  4. Since the L3 is both larger than the L2 and shared, its response time will constitute the critical path. I did not measure L3 latency on the Phenom II system, but other AMD Family 10h Revision C processors have an average L3 hit latency of 48.4 CPU clock cycles. (The non-integer average is no surprise at the L3 level, since the 6 MB L3 is composed of several different blocks that almost certainly have slightly different latencies.)
    I can’t think of a way to precisely determine the time required to identify an L3 miss, but estimating it as 1/2 of the L3 hit latency is probably in the right ballpark. So 24.2 clock cycles at 3.2 GHz contributes the first 7.56 ns to the latency.
  5. Once the L3 miss is confirmed, the processor can begin to set up a memory access. The core sends the load request to the “System Request Interface”, where the address is compared against various tables to determine where to send the request (local chip, remote chip, or I/O), so that the message can be prepended with the correct crossbar output address. This probably takes another few cycles, so we are up to about 9.0 ns.
  6. The load request must cross an asynchronous clock boundary on the way from the core to the memory controller, since they run at different clock frequencies. Depending on the implementation, this can add a latency of several cycles on each side of the clock boundary. An aggressive implementation might take as few as 3 cycles in the CPU clock domain plus 5 cycles in the memory controller clock domain, for a total of ~3.5 ns in the outbound direction (assuming a 3.2 GHz core clock and a 2.0 GHz NorthBridge clock).
  7. At this point the memory controller begins to do two things in parallel.  (Either of these could constitute the critical path in the latency equation, depending on the details of the chip implementation and the system configuration.)
    • probe the other caches on the chip, and
    • begin to set up the DRAM access.
  8. For the probes, it looks like four asynchronous crossings are required (requesting core to memory controller, memory controller to other core(s), other cores to memory controller, memory controller to requesting core). (Probe responses from the various cores on each chip are gathered by the chip’s memory controller and then forwarded to the requesting core as a single message per memory controller.) Again assuming 3 cycles on the source side of the interface and 5 cycles on the destination side of the interface, these four crossings take 3.5+3.1+3.5+3.1 = 13.2 ns. Each of the other cores on the chip will take a few cycles to probe its L1 and L2 caches — I will assume that this takes about 1/2 of the 15.4 cycle average L2 hit latency, so about 2.4 ns. If there is no overhead in collecting the probe response(s) from the other core(s) on the chip, this adds up to 15.6 ns from the time the System Request Interface is ready to send the request until the probe response is returned to the requesting core. Obviously the core won’t be able to process the probe response instantaneously — it will have to match the probe response with the corresponding load buffer, decide what the probe response means, and send the appropriate signal to any functional units waiting for the register that was loaded to become valid. This is probably pretty fast, especially at core frequencies, but probably kicks the overall probe response latency up to ~17 ns.
  9. For the memory access path, there are also four asynchronous crossings required — requesting core to memory controller, memory controller to DRAM, DRAM to memory controller, and memory controller to core. I will assume 3.5 and 3.1 ns for the core to memory controller boundaries. If I assume the same 3+5 cycle latency for the asynchronous boundary at the DRAMs the numbers are quite high — 7.75 ns for the outbound path and 6.25 ns for the inbound path (assuming 2 GHz for the memory controller and 0.8 GHz for the DRAM channel).
  10. There is additional latency associated with the time-of-flight of the commands from the memory controller to the DRAM and of the data from the DRAM back to the memory controller on the DRAM bus. These vary with the physical details of the implementation, but typically add on the order of 1 ns in each direction.
  11. I did not record the CAS latency settings for my system, but CAS 9 is typical for DDR3/1600. This contributes 11.25 ns.
  12. On the inbound trip, the data has to cross two asynchronous boundaries, as discussed above.
  13. Most systems are set up to perform “critical word first” memory accesses, so the memory controller returns the 8 to 128 bits requested in the first DRAM transfer cycle (independent of where they are located in the cache line). Once this first burst of data is returned to the core clock domain, it must be matched with the corresponding load request and sent to the corresponding processor register (which then has its “valid” bit set, allowing the out-of-order instruction scheduler to pick any dependent instructions for execution in the next cycle.) In parallel with this, the critical burst and the remainder of the cache line are transferred to the previous chosen “victim” location in the L1 Data Cache and the L1 Data Cache tags are updated to mark the line as Most Recently Used. Again, it is hard to know exactly how many cycles will be required to get the data from the “edge” of the core clock domain into a valid register, but 3-5 cycles gives another 1.0-1.5 ns.

The preceding steps add up all the outbound and inbound latency components that I can think of off the top of my head.

Let’s see what they add up to:

  • Core + System Request Interface: outbound: ~9 ns
  • Cache Coherence Probes: (~17 ns) — smaller than the memory access path, so probably completely overlapped
  • Memory Access Asynchronous interface crossings: ~21 ns
  • DRAM CAS latency: 11.25 ns
  • Core data forwarding: ~1.5 ns

This gives:

  • Total non-overlapped: ~43 ns
  • Measured latency: 51.6 ns
  • Unaccounted: ~9 ns = 18 memory controller clock cycles (assuming 2.0 GHz)
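
For anyone who wants to check the arithmetic, the short program below reproduces the component estimates listed above. All of the cycle counts and frequencies are the assumptions stated in the numbered list, not measurements.

```c
/* Back-of-the-envelope check of the latency components listed above.
 * All cycle counts and frequencies are the assumptions from the numbered
 * list, not measurements.  Compile with:  gcc -O2 latency_sum.c
 */
#include <stdio.h>

int main(void)
{
    const double f_core = 3.2, f_nb = 2.0, f_dram = 0.8;   /* GHz */

    /* Each asynchronous crossing: 3 cycles on the source side plus
     * 5 cycles on the destination side (the assumption from step 6). */
    double core_to_nb = 3.0 / f_core + 5.0 / f_nb;          /* ~3.4 ns  */
    double nb_to_dram = 3.0 / f_nb   + 5.0 / f_dram;        /* ~7.75 ns */
    double dram_to_nb = 3.0 / f_dram + 5.0 / f_nb;          /* ~6.25 ns */
    double nb_to_core = 3.0 / f_nb   + 5.0 / f_core;        /* ~3.1 ns  */
    double crossings  = core_to_nb + nb_to_dram + dram_to_nb + nb_to_core;

    double cas = 9.0 / f_dram;      /* CAS 9 at the 800 MHz DDR3/1600 command clock */

    /* Sum with the rounded "~9 ns" front-end and "~1.5 ns" forwarding terms. */
    double total       = 9.0 + crossings + cas + 1.5;
    double unaccounted = 51.6 - total;

    printf("crossings = %.1f ns, CAS = %.2f ns\n", crossings, cas);
    printf("total = %.1f ns, unaccounted = %.1f ns = %.1f NorthBridge cycles\n",
           total, unaccounted, unaccounted * f_nb);
    /* Agrees with the ~43 ns / ~9 ns / ~18-cycle figures above to within rounding. */
    return 0;
}
```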

Final Comments:

  • I don’t know how much of the above is correct, but the match to observed latency is closer than I expected when I started….
  • The inference of 18 memory controller clock cycles seems quite reasonable given all the queues that need to be checked & such.
  • I have a feeling that my estimates of the asynchronous interface delays on the DRAM channels are too high, but I can’t find any good references on this topic at the moment.

Comments and corrections are always welcome.  In my career I have found that a good way to learn is to try to explain something badly and have knowledgeable people correct me!   🙂

Posted in Computer Hardware | 4 Comments »

Opteron Memory Latency vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In an earlier post (link), I made the off-hand comment that local memory latency was only weakly dependent on CPU and DRAM frequency in Opteron systems. Being of the scientific mindset, I went back and gathered a clean set of data to show the extent to which this assertion is correct.

The following table presents my most reliable & accurate measurements of open page local memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller. These results were obtained using a pointer-chasing stride of 320 Bytes, modulo 32768 Bytes. This scheme results in five passes through each 32 kB region of memory, resulting in loading every cache line but without activating the core prefetcher (which requires the stride to be no more than one cache line) or the DRAM prefetcher (which requires that the stride be no more than four cache lines on the system under test). Large pages were used to ensure that each 32kB region was mapped to contiguous DRAM locations to maximize DRAM page hits.
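
The address pattern described above can be made concrete with a short sketch. This is not the actual benchmark code (the real runs used large pages so that each 32 KiB region mapped to contiguous DRAM locations; the array size and the block-to-block linking here are illustrative assumptions), but it shows how a 320-Byte stride taken modulo 32 KiB touches every 64-Byte line of each 32 KiB block in five passes without triggering either prefetcher.

```c
/* Sketch of the pointer-chain construction described above: a 320-Byte
 * stride modulo 32 KiB, which visits every 64-Byte line of each 32 KiB
 * block (in five passes) with a stride too large for either the core
 * prefetcher (more than 1 line) or the DRAM prefetcher (more than 4 lines).
 * Not the actual benchmark code; sizes and linking are illustrative.
 * Compile with:  gcc -O2 chain320.c -o chain320
 */
#include <stdio.h>
#include <stdlib.h>

#define STRIDE 320
#define BLOCK  32768                     /* 32 KiB                              */
#define LINES  (BLOCK / 64)              /* 512 distinct offsets per block      */

/* Build the chain in 'buf' (size 'bytes', a multiple of BLOCK). */
static void *build_chain(char *buf, size_t bytes)
{
    size_t nblocks = bytes / BLOCK;
    for (size_t b = 0; b < nblocks; b++) {
        char  *blk = buf + b * BLOCK;
        size_t off = 0;
        for (size_t i = 0; i < LINES; i++) {
            size_t next = (off + STRIDE) % BLOCK;
            if (i == LINES - 1)          /* last line links to the next block,  */
                *(void **)(blk + off) =  /* or wraps to the start of the array  */
                    (b + 1 < nblocks) ? (void *)(buf + (b + 1) * BLOCK)
                                      : (void *)buf;
            else
                *(void **)(blk + off) = blk + next;
            off = next;
        }
    }
    return buf;
}

int main(void)
{
    size_t bytes = 256UL * 1024 * 1024;  /* one of the array sizes quoted below */
    char  *buf   = malloc(bytes);
    void  *p     = build_chain(buf, bytes);
    for (size_t i = 0; i < 10 * 1000 * 1000; i++)
        p = *(void **)p;                 /* the (untimed) pointer chase         */
    printf("final pointer: %p\n", p);
    free(buf);
    return 0;
}
```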

The system used for these tests is a home-built system:

  • AMD Phenom II 555 Processor
    • Two cores
    • Supported CPU Frequencies of 3.2, 2.5, 2.1, and 0.8 GHz
    • Silicon Revision C2 (equivalent to “Shanghai” Opteron server parts)
  • 4 GB of DDR3/1600 DRAM
    • Two 2GB unregistered, unbuffered DIMMs
    • Dual-rank DIMMs
    • Each rank is composed of eight 1 Gbit parts (128Mbit x8)

The array sizes used were 256 MB, 512 MB, and 1024 MB, with each test being run three times. Of these nine results, the value chosen for the table below was an “eye-ball” average of the 2nd-best through 5th-best results. (There was very little scatter in the results — typically less than 0.1 ns variation across the majority of the nine results for each configuration.)

Processor Frequency DDR3/1600 DDR3/1333 DDR3/1066 DDR3/800
3.2 GHz 51.58 ns 54.18 ns 57.46 ns 66.18 ns
2.5 GHz 52.77 ns 54.97 ns 58.30 ns 65.74 ns
2.1 GHz 53.94 ns 55.37 ns 58.72 ns 65.79 ns
0.8 GHz 82.86 ns 87.42 ns 94.96 ns 94.51 ns

Notes:

  1. The default frequencies for the system are 3.2 GHz for the CPU cores and 667 MHz for the DRAM (DDR3/1333)

Summary:
The results above show that the variations in (local) memory latency are relatively small for this single-socket system when the CPU frequency is varied between 2.1 and 3.2 GHz and the DRAM frequency is varied between 533 MHz (DDR3/1066) and 800 MHz (DDR3/1600). Relative to the default frequencies, the lowest latency (obtained by overclocking the DRAM by 20%) is almost 5% better. The highest latency (obtained by dropping the CPU core frequency by ~1/3 and the DRAM frequency by about 20%) is only ~8% higher than the default value.

Posted in Computer Hardware, Reference | 2 Comments »