The primary metric for memory bandwidth in multicore processors is the maximum sustained bandwidth when using many cores. For most high-end processors these values have remained in the range of 75% to 85% of the peak DRAM bandwidth of the system over the past 15-20 years — an amazing accomplishment given the increase in core count (with its associated cache coherence issues), the number of DRAM channels, and the ever-increasing pipelining of the DRAMs themselves.
This post is about a secondary performance characteristic — sustained memory bandwidth for a single thread running on a single core. This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., buffer copies for filesystem access) with a single thread. In my own experience, I have found that systems with higher single-core bandwidth feel “snappier” when used for interactive work — editing, compiling, debugging, etc.
With that in mind, I decided to mine some historical data (and run a few new experiments) to see how single-thread/single-core memory bandwidth has evolved over the last 10-15 years. Some of the results were initially surprising, but were all consistent with the fundamental “physics” of bandwidth represented by Little’s Law (lots more on that below).
Looking at sustained single-core bandwidth for a kernel composed of 100% reads, the trends for a large set of high-end AMD and Intel processors are shown in the figure below:
So from 2010 to 2023, the sustainable single-core bandwidth increased by about 2x on Intel processors and about 5x on AMD processors.
Are these “good” improvements? The table below may provide some perspective:
| 2023 vs 2010 speedup | 1-core BW | 1-core GFLOPS | all-core BW | all-core GFLOPS |
|---|---|---|---|---|
| Intel | ~2x | ~5x | ~10x (DDR5), ~30x (HBM) | >40x |
| AMD | ~5x | ~5x | ~20x | ~30x |
The single-core bandwidth on the Intel systems is clearly continuing to fall behind the single-core compute capability, and a single core is able to exploit a smaller and smaller fraction of the bandwidth available to a socket. The single-core bandwidth on AMD systems is increasing at about the same rate as the single-core compute capability, but is also able to sustain a decreasing fraction of the bandwidth of the socket.
These observations naturally give rise to a variety of questions. Some I will address today, and some in the next blog entry.
Questions and Answers:
Why use a “read-only” memory access pattern? Why not something like STREAM?
For the single-core case, the bandwidth reported by the STREAM benchmark kernels is very close to the bandwidth of the all-read tests reported here. (Details in the next blog entry.)
Multicore processors have huge amounts of available DRAM bandwidth – maybe it does not even make sense for a single core to try to use that much?
Any recent Intel processor core (Skylake Xeon or newer) has a peak cache bandwidth of (at least) two 64-Byte reads plus one 64-Byte write per cycle. At a single-core frequency of 3.0 GHz, this is a read BW of 384 GB/s – higher than the full socket bandwidth of 307.2 GB/s with 8 channels of DDR5/4800 DRAM. I don’t expect a single core to sustain all of that, but the core can clearly make use of more than 20 GB/s.
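For reference, the arithmetic behind those two numbers:

$$ 2 \times 64\,\text{B} \times 3.0\,\text{GHz} = 384\,\text{GB/s} \qquad\qquad 8\ \text{channels} \times 4800\,\text{MT/s} \times 8\,\text{B/transfer} = 307.2\,\text{GB/s} $$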
Why is the single-core bandwidth increasing so slowly?
To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses.
That is the topic of the next blog entry. (“Real Soon Now”)
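As a preview: Little’s Law says that the concurrency (data in flight) required to sustain a given bandwidth is the product of bandwidth and latency. A rough worked example, using illustrative round numbers rather than measurements from any particular system:

$$ \text{Concurrency} = \text{Bandwidth} \times \text{Latency} = 32\,\text{GB/s} \times 80\,\text{ns} = 2560\,\text{Bytes} = 40\ \text{cache lines in flight} $$

A core that can only track on the order of ten outstanding cache-line misses cannot reach that bandwidth at that latency, no matter how much DRAM bandwidth the socket provides.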
Can this problem be fixed?
Sure! We don’t need to violate any physical laws to increase single-core bandwidth — we just need the design to support the very high levels of memory parallelism required, and to provide a way of generating that parallelism from application codes.
The NEC Vector Engine processors provide a demonstration of very high single-core bandwidth. On a VE20B (8 cores, 1.6 GHz, 1530 GB/s peak BW from 6 HBM stacks), I see single-thread sustained memory bandwidth of 304 GB/s on the ReadOnly benchmark used here.
The presentation discusses a family of simple performance models that I developed over the last 20 years — originally in support of processor and system design at SGI (1996-1999), IBM (1999-2005), and AMD (2006-2008), but more recently in support of system procurements at The Texas Advanced Computing Center (TACC) (2009-present).
More notes interspersed below….
Most of TACC’s supercomputer systems are national resources, open to (unclassified) scientific research in all areas. We have over 5,000 direct users (logging into the systems and running jobs) and tens of thousands of indirect users (who access TACC resources via web portals). With a staff of slightly over 175 full-time employees (less than 1/2 in consulting roles), we must therefore focus on highly-leveraged performance analysis projects, rather than labor-intensive ones.
This data is from the 2007 presentation. All of the SPECfp_rate2000 results were downloaded from www.spec.org, the results were sorted by processor type, and “peak floating-point operations per cycle” was manually added for each processor type. This includes all architectures, all compilers, all operating systems, and all system configurations. It is not surprising that there is a lot of scatter, but the factor of four range in Peak MFLOPS at fixed SPECfp_rate2000/core and the factor of four range in SPECfp_rate2000/core at fixed Peak MFLOPS was higher than I expected….
(Also from the 2007 presentation.) To show that I can criticize my own work as well, here I show that sustained memory bandwidth (using an approximation to the STREAM Benchmark) is also inadequate as a single figure of merit. (It is better than peak MFLOPS, but still has roughly a factor of three range when projecting in either direction.)
Here I assumed a particular analytical function for the amount of memory traffic as a function of cache size in order to scale the memory (bandwidth) time component.
Details are not particularly important since I am trying to model something that is a geometric mean of 14 individual values and the results are across many architectures and compilers.
Doing separate models for the 14 benchmarks does not reduce the variance much further – there is about 15% that remains unexplainable in such a broad dataset.
The model can provide much better fit to the data if the HW and SW are restricted, as we will see in the next section…
Why no overlap? The model actually includes some kinds of overlap — this will be discussed in the context of specific models below — and can be extended to include overlap between components. The specific models and results that will be presented here fit the data better when it is assumed that there is no overlap between components. Bounds on overlap are discussed near the end of the presentation, in the slides titled “Analysis”.
The approach is opportunistic. When I started this work over 20 years ago, most of the parameters I was varying could only be changed in the vendor’s laboratory. Over time, the mechanisms introduced for reducing energy consumption (first in laptops) became available more broadly. In most current machines, memory frequency can be configured by the user at boot time, while CPU frequency can be varied on a live system.
The assumption that “memory accesses overlap with other memory accesses about as well as they do in the STREAM Benchmark” is based on trying lots of other formulations and getting poor consistency with observations.
Note that “compute” is not a simple metric like “instructions” or “floating-point operations”. T_cpu is best understood as the time that the core requires to execute the particular workload of interest in the absence of memory references.
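As a hedged sketch (my notation here, not taken from the slides) of the kind of non-overlapping, sensitivity-based model being described: total execution time is written as a sum of component times, each inversely proportional to the corresponding adjustable rate, and the work terms are obtained by fitting measurements taken at several frequency settings:

$$ T(f_{core}, f_{mem}) \;\approx\; \frac{W_{cpu}}{f_{core}} \;+\; \frac{W_{mem}}{f_{mem}} \;\left(+\; T_{IO}\right) $$

where $W_{cpu}$ and $W_{mem}$ are workload-dependent constants and $T_{IO}$ is an optional invariant term (used in the application tests described later).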
Only talking about CPU2006 results today – the CPU2000 results look similar (see the 2007 presentation linked above), but the CPU2000 benchmark codes are less closely related to real applications.
Building separate models for each of the benchmarks was required to get the correct asymptotic properties. The geometric mean used to combine the individual benchmark results into a single metric is the right way to combine relative performance improvements with equal weighting for each code, but it is inconsistent with the underlying “physics” of computer performance for each of the individual benchmark results.
This system would be considered a “high-memory-bandwidth” system at the time these results were collected. In other words, this system would be expected to be CPU-limited more often than other systems (when running the same workload), because it would be memory-bandwidth limited less often. This system also had significantly lower memory latency than many contemporary systems (which were still using front-side bus architectures and separate “NorthBridge” chips).
Many of these applications (e.g., NAMD, Gamess, Gromacs, DealII, WRF, and MILC) are major consumers of cycles on TACC supercomputers (albeit newer versions and different datasets).
The published SPEC benchmarks are no longer useful to support this sensitivity-based modeling approach for two main reasons:
Running N independent copies of a benchmark simultaneously on N cores has a lot of similarities with running a parallelized implementation of the benchmark when N is small (2 or 4, or maybe a little higher), but the performance characteristics diverge as N gets larger (certainly dubious by the time one reaches even half the core count of today’s high-end processors).
For marketing reasons, the published results since 2007 have been limited almost exclusively to the configurations that give the very best results. This includes always running with HyperThreading enabled (and running one copy of the executable on each “logical processor”), always running with automatic parallelization enabled (making it very difficult to compare “speed” and “rate” results, since it is not clear how many cores are used in the “speed” tests), always running with the optimum memory configuration, etc.
The “CONUS 12km” benchmark is a simulation of the weather over the “CONtinental US” at 12km horizontal resolution. Although it is a relatively small benchmark, the performance characteristics have been verified to be quite similar to the “on-node” performance characteristics of higher-resolution test cases (e.g., “CONUS 2.5km”) — especially when the higher-resolution cases are parallelized across multiple nodes.
Note that the execution time varies from about 120 seconds to 210 seconds — this range is large compared to the deviations between the model and the observed execution time.
Note also that the slope of the Model 1 fit is almost 6% off of the desired value of 1.0, while the second model is within 1%.
In the 2007 SPECfp_rate tests, a similar phenomenon was seen, and required the addition of a third component to the model: memory latency.
In these tests, we did not have the same ability to vary memory latency that I had with the 2007 Opteron systems. In these “real-application” tests, IO is not negligible (while it is required to be <1% of execution time for the SPEC benchmarks), and allowing for a small invariant IO time gave much better results.
Bottom bars are model CPU time – easy to see the quantization.
Middle bars are model Memory time.
Top bars are (constant) IO time.
Drum roll, please….
Ordered differently, but the same sort of picture.
Here the quantization of memory time is visible across the four groups of varying CPU frequencies.
These NAMD results are not at all surprising — NAMD has extremely high cache re-use and therefore very low rates of main memory access — but it was important to see if this testing methodology replicated this expected result.
Big change of direction here….
At the beginning I said that I was assuming that there would be no overlap across the execution times associated with the various work components.
The extremely close agreement between the model results and observations strongly supports the effectiveness of this assumption.
On the other hand, overlap is certainly possible, so what can this methodology provide for us in the way of bounds on the degree of overlap?
On some HW it is possible (but very rare) to get timings (slightly) outside these bounds – I ignore such cases here.
Note that the maximum ratio of the upper bound to the lower bound is equal to “N” – the number of degrees of freedom of the model! This is an uncomfortably large range of uncertainty – making it even more important to understand bounds on overlap.
Physically this is saying that there can’t be so much work in any of the components that processing that work would exceed the total time observed.
But the goal is to learn something about the ratios of the work components, so we need to go further.
These numbers come from plugging in synthetic performance numbers from a model with variable overlap into the bounds analysis.
Message 1: If you want tight formal bounds on overlap, you need to be able to vary the “rate” parameters over a large range — probably too large to be practical.
Message 2: If one of the estimated time components is small and you cannot vary the corresponding rate parameter over a large enough range, it may be impossible to tell whether the work component is fully overlapped or is associated with a negligible amount of work (e.g., the lower bound in the “2:1 R_mem” case in this figure). (See next slide.)
Posted in Computer Architecture, Performance | Comments Off on The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models
The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.
It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)
The modes that are important are:
“Flat” vs “Cache”
In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
The OS exposes this memory as being on “NUMA node 1”, so it can be accessed using the standard NUMA control facilities (e.g., numactl); a minimal allocation sketch appears after this list of modes.
Sustained bandwidth from MCDRAM is highest in “Flat” mode.
In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
In this mode the MCDRAM is invisible and effectively uncontrollable. I will discuss the performance characteristics of Cache mode at a later date.
“All-to-All” vs “Quadrant”
In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
In “Quadrant” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but the each address is assigned to one of the MCDRAM controllers in the same “quadrant” as the coherence controller.
This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
“Sub-NUMA-Cluster”
There are several of these modes, only one of which will be discussed here.
“Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
“node 0” owns the 1st quarter of contiguous physical address space.
The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
Ditto for nodes 1, 2, and 3.
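To make the Flat-mode usage concrete, here is a minimal sketch (my own illustration, not from the original text) of binding a buffer to the MCDRAM NUMA node with libnuma. It assumes Flat mode with MCDRAM exposed as node 1 (DDR4 as node 0), and it must be linked with -lnuma.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numa.h>               /* libnuma; link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA support not available\n");
        return 1;
    }

    /* In Flat mode the 16 GiB of MCDRAM appears as NUMA node 1
       (DDR4 is node 0), so standard NUMA placement applies.    */
    size_t bytes = 1UL << 30;                 /* 1 GiB test array        */
    double *a = numa_alloc_onnode(bytes, 1);  /* request MCDRAM (node 1) */
    if (a == NULL) {
        fprintf(stderr, "MCDRAM allocation failed\n");
        return 1;
    }

    memset(a, 0, bytes);   /* touch the pages so they are actually placed */
    /* ... run bandwidth/latency kernels against 'a' here ... */

    numa_free(a, bytes);
    return 0;
}
```

Unmodified executables can get the same placement with “numactl --membind=1 ./a.out” (or “--preferred=1” to fall back to DDR4 when MCDRAM is exhausted).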
The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).
My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory. The values presented are averaged over many addresses, with the ranges showing the variation of average latency across cores.
| Mode of Operation | Flat-Quadrant | Flat-All2All | SNC4 local | SNC4 remote |
|---|---|---|---|---|
| MCDRAM maximum latency (ns) | 156.1 | 158.3 | 153.6 | 164.7 |
| MCDRAM average latency (ns) | 154.0 | 155.9 | 150.5 | 156.8 |
| MCDRAM minimum latency (ns) | 152.3 | 154.4 | 148.3 | 150.3 |
| MCDRAM standard deviation (ns) | 1.0 | 1.0 | 0.9 | 3.1 |
Caveats:
My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.
Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….
The corresponding data for addresses mapped to DDR4 memory are included in the table below:
| Mode of Operation | Flat-Quadrant | Flat-All2All | SNC4 local | SNC4 remote |
|---|---|---|---|---|
| DDR4 maximum latency (ns) | 133.3 | 136.8 | 130.0 | 141.5 |
| DDR4 average latency (ns) | 130.4 | 131.8 | 128.2 | 133.1 |
| DDR4 minimum latency (ns) | 128.2 | 128.5 | 125.4 | 126.5 |
| DDR4 standard deviation (ns) | 1.2 | 2.4 | 1.1 | 3.1 |
There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.
“Memory Bandwidth and System Balance in HPC Systems”
If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).
I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic). The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.
In an earlier post, I documented the local and remote memory latencies for the SunBlade X6420 compute nodes in the TACC Ranger supercomputer, using AMD Opteron “Barcelona” (model 8356) processors running at 2.3 GHz.
Similar latency tests were run on other systems based on AMD Opteron processors in the TACC “Discovery” benchmarking cluster. The systems reported here include the 2-socket and 4-socket configurations listed in the tables below.
Compared to the previous results with the AMD Opteron “Barcelona” processors on the TACC Ranger system, the “Shanghai” and “Istanbul” processors have a more advanced memory controller prefetcher, and the “Istanbul” processor also supports “HT Assist”, which allocates a portion of the L3 cache to serve as a “snoop filter” in 4-socket configurations. The “Magny-Cours” processor uses 2 “Istanbul” die in a single package.
Note that for the “Barcelona” processors, the hardware prefetcher in the memory controller did not perform cache coherence “snoops” — it just loaded data from memory into a buffer. When a core subsequently issued a load for that address, missing in the L3 cache initiated a coherence snoop. In both 2-socket and 4-socket systems, this snoop took longer than obtaining the data from DRAM, so the memory prefetcher had no impact on the effective latency seen by a processor core. “Shanghai” and later Opteron processors include a coherent prefetcher, so prefetched lines could be loaded with lower effective latency. This difference means that latency testing on “Shanghai” and later processors needs to be slightly more sophisticated to prevent memory controller prefetching from biasing the latency measurements. In practice, using a pointer-chasing code with a stride of 512 Bytes was sufficient to avoid hardware prefetch in “Shanghai”, “Istanbul”, and “Magny-Cours”.
Results for 2-socket systems
| Processor | cores/package | Frequency (MHz) | Family | Revision | Code Name | Local Latency (ns) | Remote Latency (ns) |
|---|---|---|---|---|---|---|---|
| Opteron 2222 | 2 | 3000 | 0Fh | | | 60 | 95 |
| Opteron 2356 | 4 | 2300 | 10h | B3 | Barcelona | 85 | |
| Opteron 2389 | 4 | 2900 | 10h | C2 | Shanghai | 73 | |
| Opteron 2435 | 6 | 2600 | 10h | D0 | Istanbul | 78 | |
| Opteron 6174 | 12 | 2200 | 10h | D1 | Magny-Cours | | |
Notes for 2-socket results:
The values for the Opteron 2222 are from memory, but should be fairly accurate.
The local latency value for the Opteron 2356 is from memory, but should be in the right ballpark. The latency is higher than the earlier processors because of the lower core frequency, the lower “Northbridge” frequency, the presence of an L3 cache, and the asynchronous clock boundary between the core(s) and the Northbridge.
The script used for the Opteron 2389 (“Shanghai”) did not correctly bind the threads, so no remote latency data was collected.
The script used for the Opteron 2435 (“Istanbul”) did not correctly bind the threads, so no remote latency data was collected.
The Opteron 6174 was not tested in the 2-socket configuration.
Results for 4-socket systems
| Processor | Frequency (MHz) | Family | Revision | Code Name | Local Latency (ns) | Remote Latency (ns) | NOTES | 1-hop median Latency (ns) | 2-hop median Latency (ns) |
|---|---|---|---|---|---|---|---|---|---|
| Opteron 8222 | 3000 | 0Fh | | | 90 | | | | |
| Opteron 8356 | 2300 | 10h | B3 | Barcelona | 100/133 | 122-146 | 1,2 | | |
| Opteron 8389 | 2900 | 10h | C2 | Shanghai | | | | | |
| Opteron 8435 | 2600 | 10h | D0 | Istanbul | 56 | 118 | 3,4 | | |
| Opteron 6174 | 2200 | 10h | D1 | Magny-Cours | 56 | 121-179 | 5 | 124 | 179 |
Notes for 4-socket results:
On the 4-socket Opteron 8356 (“Barcelona”) system, 2 of the 4 sockets have a local latency of 100ns, while the other 2 have a local latency of 133ns. This is due to the asymmetric/anisotropic HyperTransport interconnect, in which only 2 of the sockets have direct connections to all other sockets, while the other 2 sockets require two hops to reach one of the remote sockets.
On the 4-socket Opteron 8356 (“Barcelona”) system, the asymmetric/anisotropic HyperTransport interconnect gives rise to several different latencies for various combinations of requestor and target socket. This is discussed in more detail at AMD “Barcelona” Memory Latency.
Starting with “Istanbul”, 4-socket systems have lower local latency than 2-sockets systems because “HyperTransport Assist” (a probe filter) is activated. Enabling this feature reduces the L3 cache size from 6MiB to 5MiB, but enables the processor to avoid sending probes to the other chips in many cases (including this one).
On the 4-socket Opteron 8435 (“Istanbul”) system, the scripts I ran had an error causing them to only measure local latency and remote latency on 1 of the 3 remote sockets. Based on other system results, it looks like the remote value was measured for a single “hop”, with no values available for the 2-hop case.
On the 4-socket Opteron 6174 (“Magny-Cours”) system, each package has 2 die, each constituting a separate NUMA node. The HyperTransport interconnect is asymmetric/anisotropic, with 2 die having direct links to 6 other die (with 1 die requiring 2 hops), and the other 6 die having direct links to 4 other die (with 3 die requiring 2 hops). The average latency for globally uniform accesses (local and remote) is 133ns, while the average latency for uniformly distributed remote accesses is 144ns.
Comments
This disappointingly incomplete dataset still shows a few important features….
Local latency is not improving, and shows occasional degradations
Processor frequencies are not increasing.
DRAM latencies are essentially unchanged over this period — about 15 ns for DRAM page hits, 30 ns for DRAM page misses, 45 ns for DRAM page conflicts, and 60 ns for DRAM bank cycle time. The latency benchmark is configured to allow open page hits for the majority of accesses, but these results did not include instrumentation of the DRAM open page hit rate.
Many design changes act to increase memory latency.
Major factors include the increased number of cores per chip, the addition of a shared L3 cache, the increase in the number of die per package, and the addition of a separate clock frequency domain for the Northbridge.
There have been no architectural changes to move away from the transparent, “flat” shared-memory paradigm.
Instead, overcoming these latency adders required the introduction of “probe filters” – a useful feature, but one that significantly complicates the implementation, uses 1/6th of the L3 cache, and significantly complicates performance analysis.
Remote latency is getting slowly worse
This is primarily due to the addition of the L3 cache, the increase in the number of cores, and the increase in the number of die per package.
Magny-Cours Topology
The pointer-chasing latency code was run for all 64 combinations of data binding (NUMA nodes 0..7) and thread binding (1 core chosen from each of NUMA nodes 0..7). It was not initially clear which topology was used in the system under test, but the observed latency pattern showed very clearly that 2 NUMA nodes had 6 1-hop neighbors, while the other 6 NUMA nodes had only 4 1-hop neighbors. This property is also shown by the “Max Perf” configuration from Figure 3(c) of the 2010 IEEE Micro paper “Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor” by Conway, et al. (which I highly recommend for its discussion of the cache coherence protocol and probe filter). The figure below corrects an error from the figure in the paper (which is missing the x16 link between the upper and lower chips in the package on the left), and renumbers the die so that they correspond to the NUMA nodes of the system I was testing.
The latency values are quite easy to understand: all the local values are the same, all the 1-hop values are almost the same, and all the 2-hop values are the same. (The 1-hop values that are within the same package are about 3.3ns faster than the 1-hop values between packages, but this is a difference of less than 3%, so it will not impact “real-world” performance.)
The bandwidth patterns are much less pretty, but that is a much longer topic for another day….
Posted in Computer Hardware | Comments Off on AMD Opteron “Shanghai” and “Istanbul” Local and Remote Memory Latencies
In the previous post, I published my best set of numbers for local memory latency on a variety of AMD Opteron system configurations. Here I expand that to include remote memory latency on some of the systems that I have available for testing.
Ranger is the oldest system still operational here at TACC. It was brought on-line in February 2008 and is currently scheduled to be decommissioned in early 2013. Each of the 3936 SunBlade X6420 nodes contains four AMD “Barcelona” quad-core Opteron processors (model 8356), running at a core frequency of 2.3 GHz and a NorthBridge frequency of 1.6 GHz. (The Opteron 8356 processor supports a higher NorthBridge frequency, but this requires a different motherboard with “split-plane” power supply support — not available on the SunBlade X6420.)
The on-node interconnect topology of the SunBlade X6420 is asymmetric, making maximum use of the three HyperTransport links on each Opteron processor while still allowing 2 HyperTransport links to be used for I/O.
As seen in the figure below, chips 1 & 2 on each node are directly connected to each of the other three chips, while chips 0 & 3 are only connected to two other chips — requiring two “hops” on the HyperTransport network to access the third remote chip. Memory latency on this system is bounded below by the time required to “snoop” the caches on the other chips. Chips 1 & 2 are directly connected to the other chips, so they get their snoop responses back more quickly and therefore have lower memory latency.
Ranger Compute node processor interconnect.
A variant of the “lat_mem_rd.c” program from “lmbench” (version 2) was used to measure the memory access latency. The benchmark chases a chain of pointers that have been set up with a fixed stride of 128 Bytes (so that the core hardware prefetchers are not activated) and with a total size that significantly exceeds the size of the 2MiB L3 cache. For the table below, array sizes of 32MiB to 1024MiB were used, with negligible variations in observed latency. For this particular system, the memory controller prefetchers were active with the stride of 128 used, but since the effective latency is limited by the snoop response time, there is no change to the effective latency even when the memory controller prefetchers fetch the data early. (I.e., the processors might get the data earlier due to memory controller prefetch, but they cannot use the data until all the snoop responses have been received.)
Memory latency for all combinations of (chip making request) and (chip holding data) is shown in the table below:
| Memory Latency (ns) | Data on Chip 0 | Data on Chip 1 | Data on Chip 2 | Data on Chip 3 |
|---|---|---|---|---|
| Request from Chip 0 | 133.2 | 136.9 | 136.4 | 145.4 |
| Request from Chip 1 | 140.3 | 100.3 | 122.8 | 139.3 |
| Request from Chip 2 | 140.4 | 122.2 | 100.4 | 139.3 |
| Request from Chip 3 | 146.4 | 137.4 | 137.4 | 134.9 |
Cache latency and local and remote memory latency for Ranger compute nodes.
Posted in Computer Hardware | Comments Off on TACC Ranger Node Local and Remote Memory Latency Tables
I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.
The size of the page controls how many bits are translated between virtual and physical addresses, and so represent a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks.
The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).
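As a minimal Linux-specific sketch of that scenario (my own illustration, not from the original post): allocate a region backed by 2MiB pages and perform widely spaced pseudo-random reads over it. The explicit MAP_HUGETLB path assumes huge pages have been reserved (e.g., via /proc/sys/vm/nr_hugepages); otherwise the code falls back to 4KiB pages with a transparent-huge-page hint.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (1UL << 25)   /* 32 MiB: beyond 4 KiB-page TLB reach, within 2 MiB-page reach */

int main(void)
{
    /* Try explicit 2 MiB huge pages first; fall back to small pages
       plus a transparent-huge-page hint if none are reserved.        */
    void *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(buf, LEN, MADV_HUGEPAGE);   /* hint for transparent huge pages */
    }
    memset(buf, 1, LEN);                    /* touch all pages */

    /* Widely spaced random reads: with 4 KiB pages most of these miss the
       TLBs; with 2 MiB pages the region fits in a handful of TLB entries. */
    volatile unsigned char sink = 0;
    uint64_t x = 88172645463325252ULL;      /* xorshift64 state */
    for (long i = 0; i < 10000000; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        sink += *((unsigned char *)buf + (size_t)(x & (LEN - 1)));
    }
    printf("done (%u)\n", (unsigned)sink);
    munmap(buf, LEN);
    return 0;
}
```

Timing the access loop with small pages and then with large pages (or comparing TLB-miss performance counters) shows the effect directly.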
To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.
Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):
AMD Opteron Family 10h Revision D (“Istanbul”):
L1 DTLB:
4KiB pages: 48 entries;
2MiB pages: 48 entries;
1GiB pages: 48 entries
L2 TLB:
4KiB pages: 512 entries;
2MiB pages: 128 entries;
1GiB pages: 16 entries
AMD Opteron Family 15h Model 6220 (“Interlagos”):
L1 DTLB
4KiB, 32 entry, fully associative
2MiB, 32 entry, fully associative
1GiB, 32 entry, fully associative
L2 DTLB: (none)
Unified L2 TLB:
Data entries: 4KiB/2MiB/4MiB/1GiB, 1024 entries, 8-way associative
“An entry allocated by one core is not visible to the other core of a compute unit.”
Intel Xeon 56xx (“Westmere”):
L1 DTLB:
4KiB pages: 64 entries;
2MiB pages: 32 entries
L2 TLB:
4KiB pages: 512 entries;
2MiB pages: none
Intel Xeon E5 26xx (“Sandy Bridge EP”):
L1 DTLB
4KiB, 64 entries
2MiB/4MiB, 32 entries
1GiB, 4 entries
STLB (second-level TLB)
4KiB, 512 entries
(There are no entries for 2MiB pages or 1GiB pages in the STLB)
Most of these cores can map at least 2MiB (512*4KiB) using small pages before suffering level 2 TLB misses, and at least 64 MiB (32*2MiB) using large pages. All of these systems should see a performance increase when performing random accesses over memory ranges that are much larger than 2MiB and less than 64MiB.
What you are trying to avoid in all these cases is the worst case (Note 3) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:
5 trips to memory to load data mapped on a 4KiB page,
4 trips to memory to load data mapped on a 2MiB page, and
3 trips to memory to load data mapped on a 1GiB page.
In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593). Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)
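For reference, these trip counts follow from the bit ranges translated by each level of the standard 4-level x86_64 (48-bit virtual address) walk:

$$
\begin{aligned}
\text{4 KiB pages:}\quad & VA[47{:}39]\to\text{PML4E},\; VA[38{:}30]\to\text{PDPTE},\; VA[29{:}21]\to\text{PDE},\; VA[20{:}12]\to\text{PTE},\; VA[11{:}0]=\text{offset} && (4+1=5\ \text{trips})\\
\text{2 MiB pages:}\quad & \text{walk stops at the PDE;}\; VA[20{:}0]=\text{offset} && (3+1=4\ \text{trips})\\
\text{1 GiB pages:}\quad & \text{walk stops at the PDPTE;}\; VA[29{:}0]=\text{offset} && (2+1=3\ \text{trips})
\end{aligned}
$$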
A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite. Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.
Note 1:
The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency. The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth. The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker. The third unusual feature is unusual enough to get a separate discussion in Note 2.
Note 2:
Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB. Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation — responsible for translating virtual address bits 29:21. The benefit here is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB=256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup for the Page Table Walk for an address range of 64*2MiB=128MiB. In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB. Combining this with the caching of the first two levels of the hierarchical address translation (see Note 4), and with a high probability of finding the Page Table Entry in the L1 or L2 caches, this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.
Note 3:
The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.
Note 4:
Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.
A reader of this site asked me if I had a detailed breakdown of the components of memory latency for a modern microprocessor-based system. Since the only real data I have is confidential/proprietary and obsolete, I decided to try to build up a latency equation from memory….
Preliminary Comments:
It is possible to estimate pieces of the latency equation on various systems if you combine carefully controlled microbenchmarks with a detailed understanding of the cache hierarchy, the coherence protocol, and the hardware performance monitors. Being able to control the CPU, DRAM, and memory controller frequencies independently is a big help.
On the other hand, if you have not worked in the design team of a modern microprocessor it is unlikely that you will be able to anticipate all the steps that are required in making a “simple” memory access. I spent most of 12 years in design teams at SGI, IBM, and AMD, and I am pretty sure that I cannot think of all the required steps.
Memory Latency Components: Abridged
Here is a sketch of some of the components for a simple, single-chip system (my AMD Phenom II model 555), for which I quoted a pointer-chasing memory latency of 51.58 ns at 3.2 GHz with DDR3/1600 memory. I will start counting when the load instruction is issued (ignoring instruction fetch, decode, and queuing).
The load instruction queries the (virtually addressed) L1 cache tags — this probably occurs one cycle after the load instruction executes.
Simultaneously, the virtual address is looked up in the TLB. Assuming an L1 Data TLB hit, the corresponding physical address is available ~1 cycle later and is used to check for aliasing in the L1 Data Cache (this is rare). Via sneakiness, the Opteron manages to perform both queries with only a single access to the L1 Data Cache tags.
Once the physical address is available and it has been determined that the virtual address missed in the L1, the hardware initiates a query of the (private) L2 cache tags and the core’s Miss Address Buffers. In parallel with this, the Least Recently Used entry in the corresponding congruence class of the L1 Data Cache is selected as the “victim” and migrated to the L2 cache (unless the chosen victim entry in the L1 is in the “invalid” state or was originally loaded into the L1 Data Cache using the PrefetchNTA instruction).
While the L2 tags are being queried, a Miss Address Buffer is allocated and a speculative query is sent to the L3 cache directory.
Since the L3 is both larger than the L2 and shared, its response time will constitute the critical path. I did not measure L3 latency on the Phenom II system, but other AMD Family 10h Revision C processors have an average L3 hit latency of 48.4 CPU clock cycles. (The non-integer average is no surprise at the L3 level, since the 6 MB L3 is composed of several different blocks that almost certainly have slightly different latencies.)
I can’t think of a way to precisely determine the time required to identify an L3 miss, but estimating it as 1/2 of the L3 hit latency is probably in the right ballpark. So 24.2 clock cycles at 3.2 GHz contributes the first 7.56 ns to the latency.
Once the L3 miss is confirmed, the processor can begin to set up a memory access. The core sends the load request to the “System Request Interface”, where the address is compared against various tables to determine where to send the request (local chip, remote chip, or I/O), so that the message can be prepended with the correct crossbar output address. This probably takes another few cycles, so we are up to about 9.0 ns.
The load request must cross an asynchronous clock boundary on the way from the core to the memory controller, since they run at different clock frequencies. Depending on the implementation, this can add a latency of several cycles on each side of the clock boundary. An aggressive implementation might take as few as 3 cycles in the CPU clock domain plus 5 cycles in the memory controller clock domain, for a total of ~3.5 ns in the outbound direction (assuming a 3.2 GHz core clock and a 2.0 GHz NorthBridge clock).
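The arithmetic for the assumed 3-cycle + 5-cycle crossings (an estimate for illustration, not a measured value):

$$ t_{\text{core}\to\text{NB}} \approx \frac{3}{3.2\,\text{GHz}} + \frac{5}{2.0\,\text{GHz}} \approx 3.4\,\text{ns} \qquad\qquad t_{\text{NB}\to\text{core}} \approx \frac{3}{2.0\,\text{GHz}} + \frac{5}{3.2\,\text{GHz}} \approx 3.1\,\text{ns} $$

These are (roughly) the ~3.5 ns and ~3.1 ns values used in the probe and memory-access paths below.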
At this point the memory controller begins to do two things in parallel. (Either of these could constitute the critical path in the latency equation, depending on the details of the chip implementation and the system configuration.)
probe the other caches on the chip, and
begin to set up the DRAM access.
For the probes, it looks like four asynchronous crossings are required (requesting core to memory controller, memory controller to other core(s), other cores to memory controller, memory controller to requesting core). (Probe responses from the various cores on each chip are gathered by the chip’s memory controller and then forwarded to the requesting core as a single message per memory controller.) Again assuming 3 cycles on the source side of the interface and 5 cycles on the destination side of the interface, these four crossings take 3.5+3.1+3.5+3.1 = 13.2 ns. Each of the other cores on the chip will take a few cycles to probe its L1 and L2 caches — I will assume that this takes about 1/2 of the 15.4 cycle average L2 hit latency, so about 2.4 ns. If there is no overhead in collecting the probe response(s) from the other core(s) on the chip, this adds up to 15.6 ns from the time the System Request Interface is ready to send the request until the probe response is returned to the requesting core. Obviously the core won’t be able to process the probe response instantaneously — it will have to match the probe response with the corresponding load buffer, decide what the probe response means, and send the appropriate signal to any functional units waiting for the register that was loaded to become valid. This is probably pretty fast, especially at core frequencies, but probably kicks the overall probe response latency up to ~17ns.
For the memory access path, there are also four asynchronous crossings required — requesting core to memory controller, memory controller to DRAM, DRAM to memory controller, and memory controller to core. I will assume 3.5 and 3.1 ns for the core to memory controller boundaries. If I assume the same 3+5 cycle latency for the asynchronous boundary at the DRAMs the numbers are quite high — 7.75 ns for the outbound path and 6.25 ns for the inbound path (assuming 2 GHz for the memory controller and 0.8 GHz for the DRAM channel).
There is additional latency associated with the time-of-flight of the commands from the memory controller to the DRAM and of the data from the DRAM back to the memory controller on the DRAM bus. These vary with the physical details of the implementation, but typically add on the order of 1 ns in each direction.
I did not record the CAS latency settings for my system, but CAS 9 is typical for DDR3/1600. This contributes 11.25 ns.
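The CAS contribution follows directly from the 800 MHz command clock of DDR3/1600:

$$ t_{CAS} = \frac{9\ \text{cycles}}{0.8\,\text{GHz}} = 11.25\,\text{ns} $$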
On the inbound trip, the data has to cross two asynchronous boundaries, as discussed above.
Most systems are set up to perform “critical word first” memory accesses, so the memory controller returns the 8 to 128 bits requested in the first DRAM transfer cycle (independent of where they are located in the cache line). Once this first burst of data is returned to the core clock domain, it must be matched with the corresponding load request and sent to the corresponding processor register (which then has its “valid” bit set, allowing the out-of-order instruction scheduler to pick any dependent instructions for execution in the next cycle.) In parallel with this, the critical burst and the remainder of the cache line are transferred to the previous chosen “victim” location in the L1 Data Cache and the L1 Data Cache tags are updated to mark the line as Most Recently Used. Again, it is hard to know exactly how many cycles will be required to get the data from the “edge” of the core clock domain into a valid register, but 3-5 cycles gives another 1.0-1.5 ns.
The preceding steps add up all the outbound and inbound latency components that I can think of off the top of my head.
Let’s see what they add up to:
Core + System Request Interface: outbound: ~9 ns
Cache Coherence Probes: (~17 ns) — smaller than the memory access path, so probably completely overlapped
I don’t know how much of the above is correct, but the match to observed latency is closer than I expected when I started….
The inference of 18 memory controller clock cycles seems quite reasonable given all the queues that need to be checked & such.
I have a feeling that my estimates of the asynchronous interface delays on the DRAM channels are too high, but I can’t find any good references on this topic at the moment.
Comments and corrections are always welcome. In my career I have found that a good way to learn is to try to explain something badly and have knowledgeable people correct me! 🙂
In an earlier post (link), I made the off-hand comment that local memory latency was only weakly dependent on CPU and DRAM frequency in Opteron systems. Being of the scientific mindset, I went back and gathered a clean set of data to show the extent to which this assertion is correct.
The following table presents my most reliable & accurate measurements of open page local memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller. These results were obtained using a pointer-chasing stride of 320 Bytes, modulo 32768 Bytes. This scheme results in five passes through each 32 kB region of memory, resulting in loading every cache line but without activating the core prefetcher (which requires the stride to be no more than one cache line) or the DRAM prefetcher (which requires that the stride be no more than four cache lines on the system under test). Large pages were used to ensure that each 32kB region was mapped to contiguous DRAM locations to maximize DRAM page hits.
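For readers who want to reproduce this kind of measurement, below is a minimal sketch of such a pointer-chasing kernel (my own illustrative code, not the program used for these tests). It builds a chain with a 320-Byte stride applied modulo 32KiB and reports the average load-to-use latency; the large-page allocation and core binding used for the real measurements are omitted for brevity, and the 256 MiB array size is just an example.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE   64
#define STRIDE 320      /* 5 cache lines: defeats core and DRAM prefetchers */
#define BLOCK  32768    /* stride is applied modulo this 32 KiB block       */

int main(void)
{
    size_t bytes = 256UL << 20;             /* 256 MiB working set (example) */
    size_t nblk  = bytes / BLOCK;
    size_t per   = BLOCK / LINE;            /* 512 lines visited per block   */
    char  *buf   = malloc(bytes);           /* ideally large-page backed     */
    if (buf == NULL) return 1;

    /* Build the chain: within each block step by STRIDE mod BLOCK (five
       passes touch every cache line exactly once), then hop to the next
       block; the last block wraps back to the start of the buffer.        */
    for (size_t b = 0; b < nblk; b++) {
        for (size_t k = 0; k < per; k++) {
            size_t here = b * BLOCK + (k * STRIDE) % BLOCK;
            size_t next = (k + 1 < per)
                        ? b * BLOCK + ((k + 1) * STRIDE) % BLOCK
                        : ((b + 1) % nblk) * BLOCK;
            *(void **)(buf + here) = buf + next;
        }
    }

    size_t iters = nblk * per;              /* one full traversal            */
    void  *p = buf;
    for (size_t i = 0; i < iters; i++)      /* warm-up: fault in all pages   */
        p = *(void **)p;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)      /* timed serial pointer chase    */
        p = *(void **)p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average load-to-use latency: %.2f ns (%p)\n", ns / iters, p);
    return 0;
}
```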
The system used for these tests is a home-built system:
AMD Phenom II 555 Processor
Two cores
Supported CPU Frequencies of 3.2, 2.5, 2.1, and 0.8 GHz
Silicon Revision C2 (equivalent to “Shanghai” Opteron server parts)
4 GB of DDR3/1600 DRAM
Two 2GB unregistered, unbuffered DIMMs
Dual-rank DIMMs
Each rank is composed of eight 1 Gbit parts (128Mbit x8)
The array sizes used were 256 MB, 512 MB, and 1024 MB, with each test being run three times. Of these nine results, the value chosen for the table below was an “eye-ball” average of the 2nd-best through 5th-best results. (There was very little scatter in the results — typically less than 0.1 ns variation across the majority of the nine results for each configuration.)
| Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066 | DDR3/800 |
|---|---|---|---|---|
| 3.2 GHz | 51.58 ns | 54.18 ns | 57.46 ns | 66.18 ns |
| 2.5 GHz | 52.77 ns | 54.97 ns | 58.30 ns | 65.74 ns |
| 2.1 GHz | 53.94 ns | 55.37 ns | 58.72 ns | 65.79 ns |
| 0.8 GHz | 82.86 ns | 87.42 ns | 94.96 ns | 94.51 ns |
Notes:
The default frequencies for the system are 3.2 GHz for the CPU cores and 667 MHz for the DRAM (DDR3/1333)
Summary:
The results above show that the variations in (local) memory latency are relatively small for this single-socket system when the CPU frequency is varied between 2.1 and 3.2 GHz and the DRAM frequency is varied between 533 MHz (DDR3/1066) and 800 MHz (DDR3/1600). Relative to the default frequencies, the lowest latency (obtained by overclocking the DRAM by 20%) is almost 5% better. The highest latency (obtained by dropping the CPU core frequency by ~1/3 and the DRAM frequency by about 20%) is only ~8% higher than the default value.