The primary metric for memory bandwidth in multicore processors is the maximum sustained performance when using many cores. For most high-end processors these values have remained in the range of 75% to 85% of the peak DRAM bandwidth of the system over the past 15-20 years — an amazing accomplishment given the increase in core count (with its associated cache coherence issues), number of DRAM channels, and ever-increasing pipelining of the DRAMs themselves.
This post is about a secondary performance characteristic — sustained memory bandwidth for a single thread running on a single core. This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., buffer copies for filesystem access) with a single thread. In my own experience, I have found that systems with higher single-core bandwidth feel “snappier” when used for interactive work — editing, compiling, debugging, etc.
With that in mind, I decided to mine some historical data (and run a few new experiments) to see how single-thread/single-core memory bandwidth has evolved over the last 10-15 years. Some of the results were initially surprising, but were all consistent with the fundamental “physics” of bandwidth represented by Little’s Law (lots more on that below).
Looking at sustained single-core bandwidth for a kernel composed of 100% reads, the trends for a large set of high-end AMD and Intel processors are shown in the figure below:
So from 2010 to 2023, the sustainable single-core bandwidth increased by about 2x on Intel processors and about 5x on AMD processors.
Are these “good” improvements? The table below may provide some perspective:
| 2023 vs 2010 speedup | 1-core BW | 1-core GFLOPS | all-core BW | all-core GFLOPS |
|---|---|---|---|---|
| Intel | ~2x | ~5x | ~10x (DDR5) / ~30x (HBM) | >40x |
| AMD | ~5x | ~5x | ~20x | ~30x |
The single-core bandwidth on the Intel systems is clearly continuing to fall behind the single-core compute capability, and a single core is able to exploit a smaller and smaller fraction of the bandwidth available to a socket. The single-core bandwidth on AMD systems is increasing at about the same rate as the single-core compute capability, but is also able to sustain a decreasing fraction of the bandwidth of the socket.
These observations naturally give rise to a variety of questions. Some I will address today, and some in the next blog entry.
Questions and Answers:
Why use a “read-only” memory access pattern? Why not something like STREAM?
For the single-core case, the bandwidth reported by the STREAM benchmark kernels is very close to the bandwidth of the all-read tests reported here. (Details in the next blog entry.)
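For reference, the read-only kernel is essentially a sum reduction over a large array. Below is a minimal single-threaded sketch; the array size, the timing mechanism, and the use of four partial sums are illustrative assumptions, not the exact benchmark code:

```c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>                     /* omp_get_wtime(); compile with -fopenmp */

#define N (80UL * 1000 * 1000)       /* ~640 MB: far larger than any cache (assumed size) */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double sum0 = 0.0, sum1 = 0.0, sum2 = 0.0, sum3 = 0.0;
    if (a == NULL) return 1;

    for (size_t i = 0; i < N; i++) a[i] = 1.0;      /* instantiate the pages */

    double t0 = omp_get_wtime();
    /* Four partial sums break the loop-carried dependence so the core can
     * keep many loads in flight (this matters for Little's Law, below). */
    for (size_t i = 0; i < N; i += 4) {
        sum0 += a[i];  sum1 += a[i+1];  sum2 += a[i+2];  sum3 += a[i+3];
    }
    double t1 = omp_get_wtime();

    /* Print the result so the compiler cannot eliminate the loop. */
    printf("sum = %g, read BW = %.1f GB/s\n",
           sum0 + sum1 + sum2 + sum3, N * sizeof(double) / (t1 - t0) / 1e9);
    free(a);
    return 0;
}
```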
Multicore processors have huge amounts of available DRAM bandwidth – maybe it does not even make sense for a single core to try to use that much?
Any recent Intel processor core (Skylake Xeon or newer) has a peak cache bandwidth of (at least) two 64-Byte reads plus one 64-Byte write per cycle. At a single-core frequency of 3.0 GHz, this is a read BW of 384 GB/s – higher than the full socket bandwidth of 307.2 GB/s with 8 channels of DDR5/4800 DRAM. I don't expect all of that, but the core can clearly make use of more than 20 GB/s.
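The arithmetic behind those two numbers:

Core read bandwidth: 2 loads/cycle * 64 Bytes/load * 3.0 GHz = 384 GB/s
Socket DRAM bandwidth: 8 channels * 4.8 GT/s * 8 Bytes/transfer = 307.2 GB/s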
Why is the single-core bandwidth increasing so slowly?
To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses.
That is the topic of the next blog entry. (“Real Soon Now”)
Can this problem be fixed?
Sure! We don't need to violate any physical laws to increase single-core bandwidth — we just need the design to support the very high levels of memory parallelism we need, and to provide us with a way of generating that parallelism from application codes.
The NEC Vector Engine processors provide a demonstration of very high single-core bandwidth. On a VE20B (8 cores, 1.6 GHz, 1530 GB/s peak BW from 6 HBM stacks), I see single-thread sustained memory bandwidth of 304 GB/s on the ReadOnly benchmark used here.
The presentation discusses a family of simple performance models that I developed over the last 20 years — originally in support of processor and system design at SGI (1996-1999), IBM (1999-2005), and AMD (2006-2008), but more recently in support of system procurements at The Texas Advanced Computing Center (TACC) (2009-present).
More notes interspersed below….
Most of TACC’s supercomputer systems are national resources, open to (unclassified) scientific research in all areas. We have over 5,000 direct users (logging into the systems and running jobs) and tens of thousands of indirect users (who access TACC resources via web portals). With a staff of slightly over 175 full-time employees (less than 1/2 in consulting roles), we must therefore focus on highly-leveraged performance analysis projects, rather than labor-intensive ones.
This data is from the 2007 presentation. All of the SPECfp_rate2000 results were downloaded from www.spec.org, the results were sorted by processor type, and "peak floating-point operations per cycle" was manually added for each processor type. This includes all architectures, all compilers, all operating systems, and all system configurations. It is not surprising that there is a lot of scatter, but the factor of four range in Peak MFLOPS at fixed SPECfp_rate2000/core and the factor of four range in SPECfp_rate2000/core at fixed Peak MFLOPS were higher than I expected….
(Also from the 2007 presentation.) To show that I can criticize my own work as well, here I show that sustained memory bandwidth (using an approximation to the STREAM Benchmark) is also inadequate as a single figure of merit. (It is better than peak MFLOPS, but still has roughly a factor of three range when projecting in either direction.)
Here I assumed a particular analytical function for the amount of memory traffic as a function of cache size to scale the bandwidth time.
Details are not particularly important since I am trying to model something that is a geometric mean of 14 individual values and the results are across many architectures and compilers.
Doing separate models for the 14 benchmarks does not reduce the variance much further – there is about 15% that remains unexplainable in such a broad dataset.
The model can provide much better fit to the data if the HW and SW are restricted, as we will see in the next section…
Why no overlap? The model actually includes some kinds of overlap — this will be discussed in the context of specific models below — and can be extended to include overlap between components. The specific models and results that will be presented here fit the data better when it is assumed that there is no overlap between components. Bounds on overlap are discussed near the end of the presentation, in the slides titled "Analysis".
The approach is opportunistic. When I started this work over 20 years ago, most of the parameters I was varying could only be changed in the vendor’s laboratory. Over time, the mechanisms introduced for reducing energy consumption (first in laptops) became available more broadly. In most current machines, memory frequency can be configured by the user at boot time, while CPU frequency can be varied on a live system.
The assumption that “memory accesses overlap with other memory accesses about as well as they do in the STREAM Benchmark” is based on trying lots of other formulations and getting poor consistency with observations.
Note that “compute” is not a simple metric like “instructions” or “floating-point operations”. T_cpu is best understood as the time that the core requires to execute the particular workload of interest in the absence of memory references.
Only talking about CPU2006 results today – the CPU2000 results look similar (see the 2007 presentation linked above), but the CPU2000 benchmark codes are less closely related to real applications.
Building separate models for each of the benchmarks was required to get the correct asymptotic properties. The geometric mean used to combine the individual benchmark results into a single metric is the right way to combine relative performance improvements with equal weighting for each code, but it is inconsistent with the underlying “physics” of computer performance for each of the individual benchmark results.
This system would be considered a “high-memory-bandwidth” system at the time these results were collected. In other words, this system would be expected to be CPU-limited more often than other systems (when running the same workload), because it would be memory-bandwidth limited less often. This system also had significantly lower memory latency than many contemporary systems (which were still using front-side bus architectures and separate “NorthBridge” chips).
Many of these applications (e.g., NAMD, Gamess, Gromacs, DealII, WRF, and MILC) are major consumers of cycles on TACC supercomputers (albeit newer versions and different datasets).
The published SPEC benchmarks are no longer useful to support this sensitivity-based modeling approach for two main reasons:
Running N independent copies of a benchmark simultaneously on N cores has a lot of similarities with running a parallelized implementation of the benchmark when N is small (2 or 4, or maybe a little higher), but the performance characteristics diverge as N gets larger (certainly dubious by the time one reaches even half the core count of today's high-end processors).
For marketing reasons, the published results since 2007 have been limited almost exclusively to the configurations that give the very best results. This includes always running with HyperThreading enabled (and running one copy of the executable on each “logical processor”), always running with automatic parallelization enabled (making it very difficult to compare “speed” and “rate” results, since it is not clear how many cores are used in the “speed” tests), always running with the optimum memory configuration, etc.
The “CONUS 12km” benchmark is a simulation of the weather over the “CONtinental US” at 12km horizontal resolution. Although it is a relatively small benchmark, the performance characteristics have been verified to be quite similar to the “on-node” performance characteristics of higher-resolution test cases (e.g., “CONUS 2.5km”) — especially when the higher-resolution cases are parallelized across multiple nodes.
Note that the execution time varies from about 120 seconds to 210 seconds — this range is large compared to the deviations between the model and the observed execution time.
Note also that the slope of the Model 1 fit is almost 6% off of the desired value of 1.0, while the second model is within 1%.
In the 2007 SPECfp_rate tests, a similar phenomenon was seen, and required the addition of a third component to the model: memory latency.
In these tests, we did not have the same ability to vary memory latency that I had with the 2007 Opteron systems. In these “real-application” tests, IO is not negligible (while it is required to be <1% of execution time for the SPEC benchmarks), and allowing for a small invariant IO time gave much better results.
Bottom bars are model CPU time – easy to see the quantization.
Middle bars are model Memory time.
Top bars are (constant) IO time.
Drum roll, please….
Ordered differently, but the same sort of picture.
Here the quantization of memory time is visible across the four groups of varying CPU frequencies.
These NAMD results are not at all surprising — NAMD has extremely high cache re-use and therefore very low rates of main memory access — but it was important to see if this testing methodology replicated this expected result.
Big change of direction here….
At the beginning I said that I was assuming that there would be no overlap across the execution times associated with the various work components.
The extremely close agreement between the model results and observations strongly supports the effectiveness of this assumption.
On the other hand, overlap is certainly possible, so what can this methodology provide for us in the way of bounds on the degree of overlap?
On some HW it is possible (but very rare) to get timings (slightly) outside these bounds – I ignore such cases here.
Note that the maximum ratio of the upper bound to the lower bound is equal to "N" – the degrees of freedom of the model! This is an uncomfortably large range of uncertainty – making it even more important to understand bounds on overlap.
Physically this is saying that there can’t be so much work in any of the components that processing that work would exceed the total time observed.
But the goal is to learn something about the ratios of the work components, so we need to go further.
These numbers come from plugging in synthetic performance numbers from a model with variable overlap into the bounds analysis.
Message 1: If you want tight formal bounds on overlap, you need to be able to vary the “rate” parameters over a large range — probably too large to be practical.
Message 2: If one of the estimated time components is small and you cannot vary the corresponding rate parameter over a large enough range, it may be impossible to tell whether the work component is "fully overlapped" or is associated with a negligible amount of work (e.g., the lower bound in the "2:1 R_mem" case in this figure). (See next slide.)
The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.
It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)
The modes that are important are:
“Flat” vs “Cache”
In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
The OS exposes this memory as being on "NUMA node 1", so it can be accessed using the standard NUMA control facilities (e.g., numactl or libnuma; see the sketch at the end of this subsection).
Sustained bandwidth from MCDRAM is highest in “Flat” mode.
In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
In this mode the MCDRAM is invisible and effectively uncontrollable. I will discuss the performance characteristics of Cache mode at a later date.
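A minimal sketch of allocating a buffer directly in MCDRAM from C using libnuma, assuming the Flat-mode configuration described above (MCDRAM exposed as NUMA node 1); the buffer size and the minimal error handling are illustrative simplifications:

```c
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>                    /* libnuma; link with -lnuma */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma reports no NUMA support\n");
        return 1;
    }

    size_t bytes = 1UL << 30;                    /* 1 GiB test buffer */
    /* In Flat mode the MCDRAM is exposed as NUMA node 1 (see above). */
    double *buf = numa_alloc_onnode(bytes, 1);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* Touch the pages so they are actually instantiated in MCDRAM. */
    for (size_t i = 0; i < bytes / sizeof(double); i++) buf[i] = 0.0;

    /* ... run latency/bandwidth kernels against buf here ... */

    numa_free(buf, bytes);
    return 0;
}
```

An unmodified binary can get the same placement with "numactl --membind=1 ./a.out".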
“All-to-All” vs “Quadrant”
In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
In "Quadrant" mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but each address is assigned to one of the MCDRAM controllers in the same "quadrant" as the coherence controller.
This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
“Sub-NUMA-Cluster”
There are several of these modes, only one of which will be discussed here.
“Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
“node 0” owns the 1st quarter of contiguous physical address space.
The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
Ditto for nodes 1, 2, and 3.
The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).
My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory. The values presented are averaged over many addresses, with the ranges showing the variation of average latency across cores.
| Mode of Operation | Flat-Quadrant | Flat-All2All | SNC4 local | SNC4 remote |
|---|---|---|---|---|
| MCDRAM maximum latency (ns) | 156.1 | 158.3 | 153.6 | 164.7 |
| MCDRAM average latency (ns) | 154.0 | 155.9 | 150.5 | 156.8 |
| MCDRAM minimum latency (ns) | 152.3 | 154.4 | 148.3 | 150.3 |
| MCDRAM standard deviation (ns) | 1.0 | 1.0 | 0.9 | 3.1 |
Caveats:
My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.
Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….
The corresponding data for addresses mapped to DDR4 memory are included in the table below:
| Mode of Operation | Flat-Quadrant | Flat-All2All | SNC4 local | SNC4 remote |
|---|---|---|---|---|
| DDR4 maximum latency (ns) | 133.3 | 136.8 | 130.0 | 141.5 |
| DDR4 average latency (ns) | 130.4 | 131.8 | 128.2 | 133.1 |
| DDR4 minimum latency (ns) | 128.2 | 128.5 | 125.4 | 126.5 |
| DDR4 standard deviation (ns) | 1.2 | 2.4 | 1.1 | 3.1 |
There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.
12 years as student & faculty user in ocean modeling,
12 years as a performance analyst and system architect at SGI, IBM, and AMD, and
over 7 years as a research scientist at TACC.
This history is based on my own study of the market, with many of the specific details from my own re-analysis of the systems in the TOP500 lists.
Vector systems were in decline by the time the first TOP500 list was collected in 1993, but still dominated the large systems space in the early 1990’s.
The large bump in Rmax in 2002 was due to the introduction of the “Earth Simulator” in Japan.
The last vector system (2nd gen Earth Simulator) fell off the list in 2014.
RISC SMPs and Clusters dominated the installed base in the second half of the 1990’s and the first few years of the 2000’s.
The large bump in Rmax in 2011 is the “K Machine” in Japan, based on a Fujitsu SPARC processor.
The “RISC era” was very dynamic, seeing the rapid rise and fall of 6-7 different architectures in about a decade.
In alphabetical order the major processor architectures were: Alpha, IA-64, i860, MIPS, PA-RISC, PowerPC, POWER, SPARC.
x86-based systems rapidly replaced RISC systems starting in around 2003.
The first big x86 system on the list was ASCI Red in 1996.
The large increase in x86 systems in 2003-2004 was due to several large systems in the top 10, rather than due to a single huge system.
The earliest of these systems were 32-bit Intel processors.
The growth of the x86 contribution was strongly enhanced by the introduction of the AMD x86-64 processors in 2004, with AMD contributing about 40% of the x86 Rmax by the end of 2006.
Intel 64-bit systems replaced 32-bit processors rapidly once they became available.
AMD’s share of the x86 Rmax dropped rapidly after 2011, and in the November 2016 list has fallen to about 1% of the Intel x86 Rmax.
My definition of “MPP” differs from Dongarra’s and is based on how the development of the most expensive part of the system (usually the processor) was funded.
Since 2005 almost all of the MPP’s in this chart have been IBM Blue Gene systems.
The big exception is the new #1 system, the Sunway Taihulight system in China.
Accelerated systems made their appearance in 2008 with the “RoadRunner” system at LANL.
“RoadRunner” was the only significant system using the IBM Cell processor.
Subsequent accelerated systems have almost all used NVIDIA GPUs or Intel Xeon Phi processors.
The NVIDIA GPUs took their big jump in 2010 with the introduction of the #2-ranked "Nebulae" (Dawning TC3600 Blade System) system in China (4640 GPUs), then took another boost in late 2012 with the upgrade of ORNL's Jaguar to Titan (>18000 GPUs).
The Xeon Phi contribution is dominated by the immensely large Tianhe-2 system in China (48000 coprocessors), and the Stampede system at TACC (6880 coprocessors).
Note the rapid growth, then contraction, of the accelerated systems Rmax.
More on this topic below in the discussion of “clusters of clusters”.
Obviously a high-level summary, but backed by a large amount of (somewhat fuzzy) data over the years.
With x86, we get (slowly) decreasing price per “core”, but it is not obvious that we will get another major technology replacement soon.
The embedded and mobile markets are larger than the x86 server/PC/laptop market, but there are important differences in the technologies and market requirements that make such a transition challenging.
One challenge is the increasing degree of parallelism required — think about how much parallelism an individual researcher can “own”.
Systems on the TOP500 list are almost always shared, typically by scores or hundreds of users.
So with up to a few hundred users, you don't need to run parallel applications – your share is less than one "processor".
Beyond a few thousand cores, a user’s allocation will typically be multiple core-years per year, so parallelism is required.
A fraction of users can get away with single-node parallelism (perhaps with independent jobs running concurrently on multiple nodes), but the majority of users will need multi-node parallel implementations for turnaround, for memory capacity, or simply for throughput.
Instead of building large homogeneous systems, many sites have recognized the benefit of specialization – a type of HW/SW “co-configuration”.
These configurations are easiest when the application profile is stable and well-known. It is much more challenging for a general-purpose site such as TACC.
This aside introduces the STREAM benchmark, which is what got me thinking about “balance” 25 years ago.
I have never visited the University of Virginia, but had colleagues there who agreed that STREAM should stay in academia when I moved to industry in 1996, and offered to host my guest account.
Note that the output of each kernel is used as an input to the next.
The earliest versions of STREAM did not have this property and some compilers removed whole loops whose output was not used.
Fortunately it is easy to identify cases where this happens so that workarounds can be applied.
Another way to say this is that STREAM is resistant to undetected over-optimization.
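To make the chaining explicit, here are the four kernels in simplified (untimed, single-threaded) form; each kernel's output array is an input to a later kernel, so none of the results can be silently discarded. The array size is illustrative, and this sketch omits the timing, OpenMP, and validation code of the full benchmark:

```c
#include <stddef.h>

#define N 10000000                /* illustrative array size */
static double a[N], b[N], c[N];

void stream_kernels(double scalar)
{
    for (size_t j = 0; j < N; j++) c[j] = a[j];                  /* Copy:  reads a, writes c */
    for (size_t j = 0; j < N; j++) b[j] = scalar * c[j];         /* Scale: reads c, writes b */
    for (size_t j = 0; j < N; j++) c[j] = a[j] + b[j];           /* Add:   reads a,b, writes c */
    for (size_t j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];  /* Triad: reads b,c, writes a */
}
```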
OpenMP directives were added in 1996 or 1997.
STREAM in C was made fully 64-bit capable in 2013.
The validation code was also fixed to eliminate a problem with round-off error that used to occur for very large array sizes.
Output print formats keep getting wider as systems get faster.
STREAM measures time, not bandwidth, so I have to make assumptions about how much data is moved to and from memory.
For the Copy kernel, there are actually three different conventions for how many bytes of traffic to count!
I count the reads that I asked for and the writes that I asked for.
If the processor requires “write allocates” the maximum STREAM bandwidth will be lower than the peak DRAM bandwidth.
The Copy and Scale kernels require 3/2 as much bandwidth as STREAM gives credit for if write allocates are included.
The Add and Triad kernels require 4/3 as much bandwidth as STREAM gives credit for if write allocates are included.
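Working through the counting for one kernel makes the ratios clear. For Copy (c[j] = a[j]):

STREAM credits: 8 Bytes read (a) + 8 Bytes written (c) = 16 Bytes per iteration.
With a write allocate: 8 Bytes read (a) + 8 Bytes read to allocate the line holding c + 8 Bytes written (c) = 24 Bytes per iteration.
Ratio: 24/16 = 3/2. For Add and Triad the counted traffic is 24 Bytes per iteration and the actual traffic is 32 Bytes, giving 32/24 = 4/3.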
One weakness of STREAM is that all four kernels are in “store miss” form – none of the arrays are read before being written in a kernel.
A counter-example is the DAXPY kernel: A[i] = A[i] + scalar*B[i], for which the store hits in the cache because A[i] was already loaded as an input to the addition operation.
Non-allocating/non-temporal/streaming stores are typically required for best performance with the existing STREAM kernels, but these are not supported by all architectures or by all compilers.
For example, GCC will not generate streaming stores.
In my own work I typically supplement STREAM with “read-only” code (built around DDOT), a standard DAXPY (updating one of the two inputs), and sometimes add a “write-only” benchmark.
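Simplified sketches of those supplemental kernels are shown below; the function names and signatures are my illustration, not the actual benchmark source:

```c
#include <stddef.h>

/* ReadOnly (DDOT-style): two reads, no writes per iteration. */
double ddot(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
}

/* DAXPY: two reads and one write per iteration; the store to a[i]
 * hits in the cache because a[i] was just loaded as an input.    */
void daxpy(double *a, const double *b, double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++) a[i] = a[i] + scalar * b[i];
}
```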
Back to the main topic….
For performance modeling, I try to find a small number of “performance axes” that are capable of accounting for most of the execution time.
Using the same performance axes as on the previous slide….
All balances are shifting to make data motion relatively more expensive than arithmetic.
The first “Balance Ratio” to look at is (FP rate) / (Memory BW rate).
This is the cost per (64-bit) “word” loaded relative to the cost of a (peak) 64-bit FP operation, and applies to long streaming accesses (for which latency can be overlapped).
I refer to this metric as the “STREAM Balance” or “Machine Balance” or “System Balance”.
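As a worked example, using node-level numbers that appear later on this page for a 2-socket Xeon E5 ("Sandy Bridge") node (peak FP rate of 345.6 GFLOPS, sustained bandwidth of about 78 GB/s = 9.75 GW/s):

Machine Balance = 345.6 GFLOPS / 9.75 GW/s ≈ 35 FLOPS/Word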
The data points here are from a set of real systems.
The systems I chose were both commercially successful and had very good memory subsystem performance relative to their competitors.
~1990: IBM RISC System 6000 Model 320 (IBM POWER processor)
~1993: IBM RISC System 6000 Model 590 (IBM POWER processor)
~1996: SGI Origin2000 (MIPS R10000 processor)
~1999: DEC AlphaServer DS20 (DEC Alpha processor)
~2005: AMD Opteron 244 (single-core, DDR1 memory)
~2006: AMD Opteron 275 (dual-core, DDR1 memory)
~2008: AMD Opteron 2352 (quad-core, DDR2 memory)
~2011: Intel Xeon X5680 (6-core Westmere EP)
~2012: Intel Xeon E5 (8-core Sandy Bridge EP)
~2013: Intel Xeon E5 v2 (10-core Ivy Bridge EP)
~2014: Intel Xeon E5 v3 (12-core Haswell EP)
(future: Intel Xeon E5 v5)
Because memory bandwidth is understood to be an important performance limiter, the processor vendors have not let it degrade too quickly, but more and more applications become bandwidth-limited as this value increases (especially with essentially fixed cache size per core).
Unfortunately every other metric is much worse….
ERRATA: There is an error in the equation in the title — it should be “(GFLOPS/s)*(Memory Latency)”
Memory latency is becoming expensive much more rapidly than memory bandwidth!
Memory latency is dominated by the time required for cache coherence in most recent systems.
The combination of slightly decreasing clock speeds and rapidly increasing core counts leads to slowly increasing memory latency – even with heroic increases in hardware complexity.
Memory latency is not a dominant factor in very many applications, but it was not negligible in 7 of the 17 SPECfp2006 codes using hardware from 2006, so it is likely to be of increasing concern.
More on this below — slide 38.
The principal way to combat the negative impact of memory latency is to make hardware prefetching more aggressive, which increases complexity and costs a significant amount of power.
Interconnect bandwidth (again for large messages) is falling behind faster than local memory bandwidth – primarily because interconnect widths have not increased in the last decade (while most chips have doubled the width of the DRAM interfaces over that period).
Interconnect *latency* (not shown) is so high that it would ruin the scaling of even a log chart. Don’t believe me? OK…
Interconnect latency is decreasing more slowly than per-core performance, and much more slowly than per-chip performance.
Increasing the problem size only partly offsets the increasing relative cost of interconnect latency.
The details depend on the scaling characteristics of your application and are left as an exercise….
Early GPUs had better STREAM Balance (FLOPS/Word) because the double-precision FLOPS rate was low. This is no longer the case.
In 2012, mainstream, manycore, and GPU had fairly similar balance parameters, with both manycore and GPGPU using GDDR5 to get enough bandwidth to stay reasonably balanced. We expect mainstream to be *more tolerant* of low bandwidth due to large caches and GPUs to be *less tolerant* of low bandwidth due to the very small caches.
In 2016, mainstream processors have not been able to increase bandwidth proportionately (~3x increase in peak FLOPS and 1.5x increase in BW gives 2x increase in FLOPS/Word).
Both manycore and GPU have required two generations of non-standard memory (GDDR5 in 2012 and MCDRAM and HBM in 2016) to prevent the balance from degrading too far.
These rapid changes require more design cost which results in higher product cost.
This chart is based on representative Intel processor models available at the beginning of each calendar year – starting in 2003, when x86 jumped to 36% of the TOP500 Rmax.
The specific values are based on the median-priced 2-socket server processor in each time frame.
The frequencies presented are the “maximum Turbo frequency for all cores operational” for processors through Sandy Bridge/Ivy Bridge.
Starting with Haswell, the frequency presented is the power-limited frequency when running LINPACK (or similar) on all cores.
This causes a significant (~25%) drop in frequency relative to operation with less computationally intense workloads, but even without the power limitation the frequency trend is slightly downward (and is expected to continue to drop).
Columns shaded with hash marks are for future products (Broadwell EP is shipping now for the 2017 column).
Core counts and frequencies are my personal estimates based on expected technology scaling and don’t represent Intel disclosures about those products.
How do Intel’s ManyCore (Xeon Phi) processors compare?
Comparing the components of the per-socket GFLOPS of the Intel ManyCore processors relative to the Xeon ”mainstream” processors at their introduction.
The delivered performance ratio is expected to be smaller than the peak performance ratio even in the best cases, but these ratios are large enough to be quite valuable even if only a portion of the speedup can be obtained.
The basic physics being applied here is based on several complementary principles:
Simple cores are smaller (so you can fit more per chip) and use less power (so you can power & cool more per chip).
Adding cores brings a linear increase in peak performance (assuming that the power can be supplied and the resulting heat dissipated).
For each core, reducing the operating frequency brings a greater-than-proportional reduction in power consumption.
These principles are taken even further in GPUs (with hundreds or thousands of very simple compute elements).
The DIMM architecture for DRAMs has been great for flexibility, but terrible for bandwidth.
Modern serial link technology runs at 25 Gb/s per lane, while the fastest DIMM-based DDR4 interfaces run at just under 1/10 of that data rate.
In 1990, the original “Killer Micros” had a single level of cache.
Since about 2008, most x86 processors have had 3 levels of cache.
Every design has to consider the potential performance advantage of speculatively probing the next level of cache (before finding out if the request has “hit” in the current level of cache) against the power cost of performing all those extra lookups.
E.g., if the miss rate is typically 10%, then speculative probing will increase the cache tag lookup rate by 10x.
The extra lookups can actually reduce performance if they delay “real” transactions.
Asynchronous clock crossings are hard to implement with low latency.
A big, and under-appreciated, topic for another forum…
Sliced L3 on one ring: Sandy Bridge/Ivy Bridge — 2/3 coherence protocols supported
Sliced L3 on two rings: Haswell/Broadwell — 3 coherence protocols supported
Hardware is capable of extremely low-latency, low-overhead, high-bandwidth data transfers (on-chip or between chips), but only in integrated systems.
Legacy IO architectures have orders of magnitude higher latency and overhead, and are only able to attain full bandwidth with very long messages.
Some SW requirements, such as MPI message tagging, have been introduced without adequate input from HW designers, and end up being very difficult to optimize in HW.
It may only take one incompatible “required” feature to prevent an efficient HW architecture from being used for communication.
Thanks to Burton Smith for the Little’s Law reference!
Before we jump into the numbers, I want to show an illustration that may make Little’s Law more intuitive….
Simple graphical illustration of Little’s Law.
I had to pick an old processor with a small latency-BW product to make the picture small enough to be readable.
The first set of six loads is needed to keep the bus busy from 0 to 50 ns.
As each cache line arrives, another request needs to be sent out so that the memory will be busy 60 ns in the future.
The second set of six loads can re-use the same buffers, so only six buffers are needed.
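In Little's Law terms: Concurrency = Bandwidth * Latency. Reading the illustration as six 64-Byte cache lines kept in flight over a roughly 50 ns memory latency gives 6 * 64 Bytes / 50 ns ≈ 7.7 GB/s of sustained bandwidth. The specific numbers here are illustrative, but the proportionality is exact.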
In the mid-1990’s, processors were just transitioning from supporting a single outstanding cache miss to supporting 4 outstanding cache misses.
In 2005, a single core of a first generation AMD Opteron could manage 8 cache misses directly, but only needed 5-6 to saturate the DRAM interface.
By mid-2016, a Xeon E5 v4 (Broadwell) processor requires about 100 cache lines “in flight” to fully tolerate the memory latency.
Each core can only directly support 10 outstanding L1 Data Cache misses, but the L2 hardware prefetchers can provide additional concurrency.
It still requires 6 cores to get within 5% of asymptotic bandwidth, and the processor energy consumed is 6x to 10x the energy consumed in the DRAMs.
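A back-of-the-envelope helper for these estimates is sketched below; the 65 GB/s and 90 ns plugged in are illustrative assumptions chosen only to land near the ~100-line figure above, not measurements from this post:

```c
#include <stdio.h>

/* Little's Law: bytes in flight = bandwidth * latency.
 * Returns the number of 64-Byte cache lines that must be outstanding. */
static double lines_in_flight(double bw_GB_per_s, double latency_ns)
{
    return bw_GB_per_s * latency_ns / 64.0;   /* GB/s * ns = Bytes */
}

int main(void)
{
    /* Illustrative values: a ~65 GB/s target at ~90 ns latency requires
     * on the order of 100 cache lines in flight.                        */
    printf("%.0f cache lines in flight\n", lines_in_flight(65.0, 90.0));
    return 0;
}
```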
The “Mainstream” machines are
SGI Origin2000 (MIPS R10000)
DEC Alpha DS20 (DEC Alpha EV5)
AMD Opteron 244 (single-core, DDR1 memory)
AMD Opteron 275 (dual-core, DDR1 memory)
AMD Opteron 2352 (quad-core, DDR2 memory)
Intel Xeon X5680 (6-core Westmere EP)
Intel Xeon E5 (8-core Sandy Bridge EP)
Intel Xeon E5 v2 (10-core Ivy Bridge EP)
Intel Xeon E5 v3 (12-core Haswell EP)
Intel Xeon E5 v4 (14-core Broadwell EP)
(future: Intel Xeon E5 v5 — a plausible estimate for a future Intel Xeon processor, based on extrapolation of current technology trends.)
The GPU/Manycore machines are:
NVIDIA M2050
Intel Xeon Phi (KNC)
NVIDIA K20
NVIDIA K40
Intel Xeon Phi/KNL
NVIDIA P100 (Latency estimated).
Power density matters, but systems remain so expensive that power cost is still a relatively small fraction of acquisition cost.
For example, a hypothetical exascale system that costs $100 million a year for power may not be out of line, because the purchase price of such a system would be in the range of $2 billion.
Power/socket is unlikely to increase significantly because of the difficulty of managing both bulk cooling and hot spots.
Electrical cost is unlikely to increase by large factors in locations attached to power grids.
So the only way for power costs to become dominant is for the purchase price per socket to be reduced by something like an order of magnitude.
Needless to say, the companies that sell processors and servers at current prices have an incentive to take actions to keep prices at current levels (or higher).
Even the availability of much cheaper processors is not quite enough to make power cost a first-order concern, because if this were to happen, users would deliberately pay more money for more energy-efficient processors, just to keep the ongoing costs tolerable.
In this environment, budgetary/organizational/bureaucratic issues would play a major role in the market response to the technology changes.
Client processors could reduce node prices by using higher-volume, lower-gross-margin parts, but this does not fundamentally change the technology issues.
25%/year for power might be tolerable with minor adjustments to our organizational/budgetary processes.
(Especially since staff costs are typically comparable to system costs, so 25% of hardware purchase price per year might only be about 12% of the annual computing center budget for that system.)
Very low-cost parts ("embedded" or "DSP" processors) are in a different ballpark – lifetime power cost can exceed hardware acquisition cost.
So if we get cheaper processors, they must be more energy-efficient as well.
This means that we need to understand where the energy is being used and have an architecture that allows us to control high-energy operations.
Not enough time for that topic today, but there are some speculations in the bonus slides.
For the purposes of this talk, my speculations focus on the “most likely” scenarios.
Alternative approaches to maintaining the performance growth rate of systems are certainly possible, but there are many obstacles on those paths and it is difficult to have confidence that any will be commercially viable.
Once an application becomes important enough to justify the allocation of millions of dollars per year of computing resources, it begins to make sense to consider configuring one or more supercomputers to be cost-effective for that application.
(This is fairly widespread in practice.)
If an application becomes important enough to justify the allocation of tens of millions of dollars per year of computing resources, it begins to make sense to consider designing one or more supercomputers to be cost-effective for that application.
(This has clearly failed many times in the past, but the current technology balances make the approach look attractive again.)
Next we will look at examples of “application balance” from various application areas.
CRITICAL! Application characterization is a huge topic, but it is easy to find applications that vary by two orders of magnitude or more in requirements for peak FP rate, for pipelined memory accesses (bandwidth), for unexpected memory access (latency), for large-message interconnect bandwidth, and for short-message interconnect latency.
The workload on TACC’s systems covers the full range of node-level balances shown here (including at least a half-dozen of the specific codes listed).
TACC’s workload includes comparable ranges in requirements for various attributes of interconnect and filesystem IO performance.
This chart is based on a sensitivity-based performance model with additive performance components.
In 2006/2007 there were enough different configurations available in the SPEC benchmark database for me to perform this analysis using public data.
More recent SPEC results are less suitable for this data mining approach for several reasons (notably the use of autoparallelization and HyperThreading).
Catch-22 — the more you know about the application characteristics and the more choices you have for computing technology and configuration, the harder it is to come up with cost-effective solutions!
It is very hard for vendors to back off of existing performance levels….
As long as purchase prices remain high enough to keep power costs at second-order, there will be incentive to continue making the fast performance axes as fast as possible.
Vendors will only have incentive to develop systems balanced to application-specific performance ratios if there is a large enough market that makes purchases based on optimizing cost/performance for particular application sub-sets.
Current processor and system offerings provide a modest degree of configurability of performance characteristics, but the small price range makes this level of configurability a relatively weak lever for overall system optimization.
This is not the future that I want to see, but it is the future that seems most likely – based on technological, economic, and organizational/bureaucratic factors.
There is clearly a disconnect between systems that are increasingly optimized for dense, vectorized floating-point arithmetic and systems that are optimized for low-density “big data” application areas.
As long as system prices remain high, vendors will be able to deliver systems that have good performance in both areas.
If the market becomes competitive there will be incentives to build more targeted alternatives – but if the market becomes competitive there will also be less money available to design such systems.
Here I hit the 45-minute mark and ended the talk.
I will try to upload and annotate the “Bonus Slides” discussing potential disruptive technologies sometime in the next week.
“Memory Bandwidth and System Balance in HPC Systems”
If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).
I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic). The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.
The High Performance LINPACK (HPL) benchmark is well known for delivering a high fraction of peak floating-point performance. The (historically) excellent scaling of performance as the number of processors is increased and as the frequency is increased suggests that memory bandwidth has not been a performance limiter.
But this does not mean that memory bandwidth will *never* be a performance limiter. The algorithms used by HPL have lots of data re-use (both in registers and from the caches), but the data still has to go to and from memory, so the bandwidth requirement is not zero, which means that at some point in scaling the number of cores or frequency or FP operations per cycle, we are going to run out of the available memory bandwidth. The question naturally arises: “Are we (almost) there yet?”
Using Intel’s optimized HPL implementation, a medium-sized (N=18000) 8-core (single socket) HPL run on a Stampede compute node (3.1 GHz, 8 cores/chip, 8 FP ops/cycle) showed about 15 GB/s sustained memory bandwidth at about 165 GFLOPS. This level of bandwidth utilization should be no trouble at all (even when running on two sockets), given the 51.2 GB/s peak memory bandwidth (~38 GB/s sustainable) on each socket.
But if we scale this to the peak performance of a new Haswell EP processor (e.g., 2.6 GHz, 12 cores/chip, 16 FP ops/cycle), it suggests that we will need about 40 GB/s of memory bandwidth for a single-socket HPL run and about 80 GB/s of memory bandwidth for a 2-socket run. A single Haswell chip can only deliver about 60 GB/s sustained memory bandwidth, so the latter value is a problem, and it means that we expect LINPACK on a 2-socket Haswell system to require attention to memory placement.
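The scaling arithmetic behind those estimates:

Stampede socket peak: 3.1 GHz * 8 cores * 8 FP ops/cycle = 198.4 GFLOPS, observed with ~15 GB/s of memory traffic.
Haswell socket peak: 2.6 GHz * 12 cores * 16 FP ops/cycle = 499.2 GFLOPS, about 2.5x higher.
Scaling the 15 GB/s by the same factor gives roughly 38-40 GB/s per socket, or close to 80 GB/s for a 2-socket run.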
A colleague here at TACC ran into this while testing on a 2-socket Haswell EP system. Running in the default mode showed poor scaling beyond one socket. Running the same binary under "numactl --interleave=0,1" eliminated most (but not all) of the scaling issues. I will need to look at the numbers in more detail, but it looks like the required chip-to-chip bandwidth (when using interleaved memory) may be slightly higher than what the QPI interface can sustain.
Just another reminder that overheads that are “negligible” may not stay that way in the presence of exponential performance growth.
There are a lot of topics that could be addressed here, but this short note will focus on bandwidth from main memory (using the STREAM benchmark) as a function of the number of threads used.
Configured with an array size of 64 million elements per array and 10 iterations.
Run with 60 threads (bound to separate physical cores) and Transparent Huge Pages.
| Function | Best Rate MB/s | Avg time (sec) | Min time (sec) | Max time (sec) |
|---|---|---|---|---|
| Copy | 169446.8 | 0.0062 | 0.0060 | 0.0063 |
| Scale | 169173.1 | 0.0062 | 0.0061 | 0.0063 |
| Add | 174824.3 | 0.0090 | 0.0088 | 0.0091 |
| Triad | 174663.2 | 0.0089 | 0.0088 | 0.0091 |
Memory Controllers
The Xeon Phi SE10P has 8 memory controllers, each controlling two 32-bit channels. Each 32-bit channel has two GDDR5 chips, each having a 16-bit-wide interface. Each of the 32 GDDR5 DRAM chips has 16 banks. This gives a *raw* total of 512 DRAM banks. BUT:
The two GDDR5 chips on each 32-bit channel are operating in “clamshell” mode — emulating a single GDDR5 chip with a 32-bit-wide interface. (This is done for cost reduction — two 2 Gbit chips with x16 interfaces were presumably a cheaper option than one 4 Gbit chip with a x32 interface). This reduces the effective number of DRAM banks to 256 (but the effective bank size is doubled from 2KiB to 4 KiB).
The two 32-bit channels for each memory controller operate in lockstep — creating a logical 64-bit interface. Since every cache line is spread across the two 32-bit channels, this reduces the effective number of DRAM banks to 128 (but the effective bank size is doubled again, from 4 KiB to 8 KiB).
So the Xeon Phi SE10P memory subsystem should be analyzed as a 128-bank resource. Intel has not disclosed the details of the mapping of physical addresses onto DRAM channels and banks, but my own experiments have shown that addresses are mapped to a repeating permutation of the 8 memory controllers in blocks of 62 cache lines. (The other 2 cache lines in each 64-cacheline block are used to hold the error-correction codes for the block.)
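Summarizing the bank arithmetic: 8 controllers * 2 channels/controller * 2 chips/channel * 16 banks/chip = 512 raw banks; clamshell operation halves this to 256 effective banks (4 KiB each), and lockstep channel pairing halves it again to 128 effective banks (8 KiB each).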
Bandwidth vs Number of Data Access Streams
One “rule of thumb” that I have found on Xeon Phi is that memory-bandwidth-limited jobs run best when the number of read streams across all the threads is close to, but less than, the number of GDDR5 DRAM banks. On the Xeon Phi SE10P coprocessors in the TACC Stampede system, this is 128 (see Note 1). Some data from the STREAM benchmark supports this hypothesis:
| Kernel | Reads | Writes | 2 threads/core | 3 threads/core | 4 threads/core |
|---|---|---|---|---|---|
| Copy | 1 | 1 | -0.8% | -5.2% | -7.3% |
| Scale | 1 | 1 | -1.0% | -3.3% | -6.7% |
| Add | 2 | 1 | -3.1% | -12.0% | -13.6% |
| Triad | 2 | 1 | -3.6% | -11.2% | -13.5% |
From these results you can see that the Copy and Scale kernels have about the same performance with either 1 or 2 threads per core (61 or 122 read streams), but drop 3%-7% when generating more than 128 address streams, while the Add and Triad kernels are definitely best with one thread per core (122 read streams), and drop 3%-13% when generating more than 128 address streams.
So why am I not counting the write streams?
I found this puzzling for a long time, then I remembered that the Xeon E5-2600 series processors have a memory controller that supports multiple modes of prioritization. The default mode is to give priority to reads while buffering stores. Once the store buffers in the memory controller reach a “high water mark”, the mode shifts to giving priority to the stores while buffering reads. The basic architecture is implied by the descriptions of the “major modes” in section 2.5.8 of the Xeon E5-2600 Product Family Uncore Performance Monitoring Guide (document 327043 — I use revision 001, dated March 2012). So *if* Xeon Phi adopts a similar multi-mode strategy, the next question is whether the duration in each mode is long enough that the open page efficiency is determined primarily by the number of streams in each mode, rather than by the total number of streams. For STREAM Triad, the observed bandwidth is ~175 GB/s. Combining this with the observed average memory latency of about 275 ns (idle) means that at least 175*275=48125 bytes need to be in flight at any time. This is about 768 cache lines (rounded up to a convenient number) or 96 cache lines per memory controller. For STREAM Triad, this corresponds to an average of 64 cache line reads and 32 cache line writes in each memory controller at all times. If the memory controller switches between “major modes” in which it does 64 cache line reads (from two read streams, and while buffering writes) and 32 cache line writes (from one write stream, and while buffering reads), the number of DRAM banks needed at any one time should be close to the number of banks needed for the read streams only….
As many of you know, the Texas Advanced Computing Center is in the midst of installing “Stampede” — a large supercomputer using both Intel Xeon E5 (“Sandy Bridge”) and Intel Xeon Phi (aka “MIC”, aka “Knights Corner”) processors.
In his blog “The Perils of Parallel”, Greg Pfister commented on the Xeon Phi announcement and raised a few questions that I thought I should address here.
I am not in a position to comment on Greg’s first question about pricing, but “Dr. Bandwidth” is happy to address Greg’s second question on memory bandwidth!
This has two pieces — local memory bandwidth and PCIe bandwidth to the host. Greg also raised some issues regarding ECC and regarding performance relative to the Xeon E5 processors that I will address below. Although Greg did not directly raise issues of comparisons with GPUs, several of the topics below seemed to call for comments on similarities and differences between Xeon Phi and GPUs as coprocessors, so I have included some thoughts there as well.
Local Memory Bandwidth
The Intel Xeon Phi 5110P is reported to have 8 GB of local memory supporting 320 GB/s of peak bandwidth. The TACC Stampede system employs a slightly different model Xeon Phi, referred to as the Xeon Phi SE10P — this is the model used in the benchmark results reported in the footnotes of the announcement of the Xeon Phi 5110P. The Xeon Phi SE10P runs its memory slightly faster than the Xeon Phi 5110P, but memory performance is primarily limited by available concurrency (more on that later), so the sustained bandwidth is expected to be essentially the same.
Background: Memory Balance
Since 1991, I have been tracking (via the STREAM benchmark) the “balance” between sustainable memory bandwidth and peak double-precision floating-point performance. This is often expressed in “Bytes/FLOP” (or more correctly “Bytes/second per FP Op/second”), but these numbers have been getting too small (<< 1), so for the STREAM benchmark I use "FLOPS/Word" instead (again, more correctly "FLOPs/second per Word/second", where "Word" is whatever size was used in the FP operation). The design target for the traditional "vector" systems was about 1 FLOP/Word, while cache-based systems have been characterized by ratios anywhere between 10 FLOPS/Word and 200 FLOPS/Word. Systems delivering the high sustained memory bandwidth of 10 FLOPS/Word are typically expensive and applications are often compute-limited, while systems delivering the low sustained memory bandwidth of 200 FLOPS/Word are typically strongly memory bandwidth-limited, with throughput scaling poorly as processors are added.
Some real-world examples from TACC's systems:
TACC's Ranger system (4-socket quad-core Opteron Family 10h "Barcelona" processors) sustains about 17.5 GB/s (2.19 GW/s for 8-Byte Words) per node, and has a peak FP rate of 2.3 GHz * 4 FP Ops/Hz/core * 4 cores/socket * 4 sockets = 147.2 GFLOPS per node. The ratio is therefore about 67 FLOPS/Word.
TACC's Lonestar system (2-socket 6-core Xeon 5600 "Westmere" processors) sustains about 41 GB/s (5.125 GW/s) per node, and has a peak FP rate of 3.33 GHz * 4 Ops/Hz/core * 6 cores/socket * 2 sockets = 160 GFLOPS per node. The ratio is therefore about 31 FLOPS/Word.
TACC's forthcoming Stampede system (2-socket 8-core Xeon E5 "Sandy Bridge" processors) sustains about 78 GB/s (9.75 GW/s) per node, and has a peak FP rate of 2.7 GHz * 8 FP Ops/Hz * 8 cores/socket * 2 sockets = 345.6 GFLOPS per node. The ratio is therefore a bit over 35 FLOPS/Word.
Again, the Xeon Phi SE10P coprocessors being installed at TACC are not identical to the announced product version, but the differences are not particularly large. According to footnote 7 of Intel’s announcement web page, the Xeon Phi SE10P has a peak performance of about 1.06 TFLOPS, while footnote 8 reports a STREAM benchmark performance of up to 175 GB/s (21.875 GW/s). The ratio is therefore about 48 FLOPS/Word — a bit less bandwidth per FLOP than the Xeon E5 nodes in the TACC Stampede system (or the TACC Lonestar system), but a bit more bandwidth per FLOP than provided by the nodes in the TACC Ranger system. (I will have a lot more to say about sustained memory bandwidth on the Xeon Phi SE10P over the next few weeks.)
The earlier GPUs had relatively high ratios of bandwidth to peak double-precision FP performance, but as the double-precision FP performance was increased, the ratios have shifted to relatively low amounts of sustainable bandwidth per peak FLOP. For the NVIDIA M2070 “Fermi” GPGPU, the peak double-precision performance is reported as 515.2 GFLOPS, while I measured sustained local bandwidth of about 105 GB/s (13.125 GW/s) using a CUDA port of the STREAM benchmark (with ECC enabled). This corresponds to about 39 FLOPS/Word. I don’t have sustained local bandwidth numbers for the new “Kepler” K20X product, but the data sheet reports that the peak memory bandwidth has been increased by 1.6x (250 GB/s vs 150 GB/s) while the peak FP rate has been increased by 2.5x (1.31 TFLOPS vs 0.515 TFLOPS), so the ratio of peak FLOPS to sustained local bandwidth must be significantly higher than the 39 for the “Fermi” M2070, and is likely in the 55-60 range — slightly higher than the value for the Xeon Phi SE10P.
Although the local memory bandwidth ratios are similar between GPUs and Xeon Phi, the Xeon Phi has a lot more cache to facilitate data reuse (thereby decreasing bandwidth demand). The architectures are quite different, but the NVIDIA Kepler K20x appears to have a total of about 2MB of registers, L1 cache, and L2 cache per chip. In contrast, the Xeon Phi has a 32kB data cache and a private 512kB L2 cache per core, giving a total of more than 30 MB of cache per chip. As the community develops experience with these products, it will be interesting to see how effective the two approaches are for supporting applications.
PCIe Interface Bandwidth
There is no doubt that the PCIe interface between the host and a Xeon Phi has a lot less sustainable bandwidth than what is available for either the Xeon Phi to its local memory or for the host processor to its local memory. This will certainly limit the classes of algorithms that can map effectively to this architecture — just as it limits the classes of algorithms that can be mapped to GPU architectures.
Although many programming models are supported for the Xeon Phi, one that looks interesting (and which is not available on current GPUs) is to run MPI tasks on the Xeon Phi card as well as on the host.
MPI codes are typically structured to minimize external bandwidth, so the PCIe interface is used only for MPI messages and not for additional offloading traffic between the host and coprocessor.
If the application allows different amounts of “work” to be allocated to each MPI task, then you can use performance measurements for your application to balance the work allocated to each processing component.
If the application scales well with OpenMP parallelism, then one option is to place one MPI task on each Xeon E5 chip on the host (with 8 threads per task) and one MPI task on the Xeon Phi (with anywhere from 60 to 240 threads per task, depending on how your particular application scales).
Xeon Phi supports multiple MPI tasks running concurrently (with environment variables to control which cores each MPI task’s threads can run on), so applications that do not easily allow different amounts of work to be allocated to each MPI task might instead run multiple MPI tasks on the Xeon Phi, with the number chosen to balance the Xeon Phi’s contribution against that of the host processors. For example, if the Xeon Phi delivers approximately twice the performance of a Xeon E5 host chip, then one might allocate one MPI task to each Xeon E5 (with OpenMP threading internal to the task) and two MPI tasks to the Xeon Phi (again with OpenMP threading internal to the task). If the Xeon Phi delivers three times the performance of the Xeon E5, then one would allocate three MPI tasks to the Xeon Phi, and so on.
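As a purely illustrative sketch of this kind of weighted work allocation, the C code below uses MPI plus OpenMP and scales the number of work items given to each rank by a per-rank performance weight. The rank placement, the weight of 2.0 for the coprocessor rank, and the dummy work loop are all assumptions; in practice the weights would come from measurements of your own application on the host and on the Xeon Phi.

/* Hypothetical sketch: weighting work per MPI rank in a heterogeneous
 * host + coprocessor job.  The rank-to-hardware mapping and the weights
 * are illustrative assumptions, not measured values. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Example: ranks 0..1 on the two host Xeon E5 chips (weight 1.0 each),
     * rank 2 on the Xeon Phi (weight 2.0 if it delivers ~2x per-chip
     * performance for this application -- an assumption to be measured). */
    double my_weight = (rank == 2) ? 2.0 : 1.0;
    double total_weight;
    MPI_Allreduce(&my_weight, &total_weight, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    long global_n = 100000000L;                     /* total work items */
    long my_n = (long)(global_n * my_weight / total_weight);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < my_n; i++)
        sum += (double)i;                           /* stand-in for real work */

    printf("rank %d of %d: weight %.1f, %ld items, sum %g\n",
           rank, nranks, my_weight, my_n, sum);
    MPI_Finalize();
    return 0;
}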
Running a full operating system on the Xeon Phi allows more flexibility in code structure than is available on (current) GPU-based coprocessors. Possibilities include:
Run on host and offload loops/functions to the Xeon Phi.
Run on Xeon Phi and offload loops/functions to the host.
Run on Xeon Phi and host as peers, for example with MPI.
Run only on the host and ignore the Xeon Phi.
Run only on the Xeon Phi and use the host only for launching jobs and providing external network and file system access.
Lots of things to try….
ECC
Like most (all?) GPUs that support ECC, the Xeon Phi implements ECC “inline” — using a fraction of the standard memory space to hold the ECC bits. This requires memory controller support to perform the ECC checks and to hide the “holes” in memory that contain the ECC bits, but it allows the feature to be turned on and off without incurring extra hardware expense for widening the memory interface to support the ECC bits.
Note that widening the memory interface from 64 bits to 72 bits is straightforward with x4 and x8 DRAM parts — just use 18 x4 chips instead of 16, or use 9 x8 chips instead of 8 — but is problematic with the x32 GDDR5 DRAMs used in GPUs and in Xeon Phi. A single x32 GDDR5 chip has a minimum burst of 32 Bytes so a cache line can be easily delivered with a single transfer from two “ganged” channels. If one wanted to “widen” the interface to hold the ECC bits, the minimum overhead is one extra 32-bit channel — a 50% overhead. This is certainly an unattractive option compared to the 12.5% overhead for the standard DDR3 ECC DIMMs. There are a variety of tricky approaches that might be used to reduce this overhead, but the inline approach seems quite sensible for early product generations.
Intel has not disclosed details about the implementation of ECC on Xeon Phi, but my current understanding of their implementation suggests that the performance penalty (in terms of bandwidth) is actually rather small. I don’t know enough to speculate on the latency penalty yet. All of TACC’s Xeon Phi’s have been running with ECC enabled, but any Xeon Phi owner should be able to reboot a node with ECC disabled to perform direct latency and bandwidth comparisons. (I have added this to my “To Do” list….)
Speedup relative to Xeon E5
Greg noted the surprisingly reasonable claims for speedup relative to Xeon E5. I agree that this is a good thing, and that it is much better to pay attention to application speedup than to the peak performance ratios. Computer performance history has shown that every approach used to double performance results in less than doubling of actual application performance.
Looking at some specific microarchitectural performance factors:
Xeon Phi supports a 512-bit vector instruction set, which can be expected to be slightly less efficient than the 256-bit vector instruction set on Xeon E5.
Xeon Phi has slightly lower L1 cache bandwidth (in terms of Bytes/Peak FP Op) than the Xeon E5, resulting in slightly lower efficiency for overlapping compute and data transfers to/from the L1 data cache.
Xeon Phi has ~60 cores per chip, which can be expected to give less efficient throughput scaling than the 8 cores per Xeon E5 chip.
Xeon Phi has slightly less bandwidth per peak FP Op than the Xeon E5, so the memory bandwidth will result in a higher overhead and a slightly lower percentage of peak FP utilization.
Xeon Phi has no L3 cache, so the total cache per core (32kB L1 + 512kB L2) is lower than that provided by the Xeon E5 (32kB L1 + 256kB L2 + 2.5MB L3, i.e., 1/8 of the 20MB shared L3).
Xeon Phi has higher local memory latency than the Xeon E5, which has some impact on sustained bandwidth (already considered), and results in additional stall cycles in the occasional case of a non-prefetchable cache miss that cannot be overlapped with other memory transfers.
None of these are “problems” — they are intrinsic to the technology required to obtain higher peak performance per chip and higher peak performance per unit power. (That is not to say that the implementation cannot be improved, but it is saying that any implementation using comparable design and fabrication technology can be expected to show some level of efficiency loss due to each of these factors.)
The combined result of all these factors is that the Xeon Phi (or any processor obtaining its peak performance using much more parallelism with lower-power, less complex processors) will typically deliver a lower percentage of peak on real applications than a state-of-the-art Xeon E5 processor. Again, this is not a “problem” — it is intrinsic to the technology. Every application will show different sensitivity to each of these specific factors, but few applications will be insensitive to all of them.
Similar issues apply to comparisons between the “efficiency” of GPUs and state-of-the-art processors like the Xeon E5. These comparisons are not as uniformly applicable because the fundamental architecture of GPUs is quite different from that of traditional CPUs. For example, we have all seen the claims of 50x and 100x speedups on GPUs. In these cases the algorithm is typically a poor match to the microarchitecture of the traditional CPU and a reasonable match to the microarchitecture of the GPU. We don’t expect to see similar speedups on Xeon Phi because it is based on a traditional microprocessor architecture and shows similar performance characteristics.
On the other hand, something that we don’t typically see is the list of 0x speedups for algorithms that do not map well enough to the GPU to make the porting effort worthwhile. Xeon Phi is not better than Xeon E5 on all workloads, but because it is based on general-purpose microprocessor cores it will run any general-purpose workload. The same cannot be said of GPU-based coprocessors.
Of course these are all general considerations. Performing careful direct comparisons of real application performance will take some time, but it should be a lot of fun!
I am often asked what “Large Pages” in computer systems are good for. For commodity (x86_64) processors, “small pages” are 4KiB, while “large pages” are (typically) 2MiB.
The size of the page controls how many address bits are translated between virtual and physical addresses, and so represents a trade-off between what the user is able to control (bits that are not translated) and what the operating system is able to control (bits that are translated).
A very knowledgeable user can use address bits that are not translated to control how data is mapped into the caches and how data is mapped to DRAM banks.
The biggest performance benefit of “Large Pages” will come when you are doing widely spaced random accesses to a large region of memory — where “large” means much bigger than the range that can be mapped by all of the small page entries in the TLBs (which typically have multiple levels in modern processors).
To make things more complex, the number of TLB entries for 4KiB pages is often larger than the number of entries for 2MiB pages, but this varies a lot by processor. There is also a lot of variation in how many “large page” entries are available in the Level 2 TLB, and it is often unclear whether the TLB stores entries for 4KiB pages and for 2MiB pages in separate locations or whether they compete for the same underlying buffers.
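As a minimal sketch (assuming Linux with huge page support configured, e.g., via /proc/sys/vm/nr_hugepages or transparent huge pages), here are two common ways to request 2MiB pages from user code: explicit allocation with mmap() and MAP_HUGETLB, and advising the kernel to use transparent huge pages with madvise() and MADV_HUGEPAGE. The buffer size and error handling are illustrative only.

/* Minimal sketch of requesting 2MiB pages on Linux.  Illustrative only:
 * assumes huge pages have been configured, or that transparent huge
 * pages are enabled. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 64UL << 20;   /* 64 MiB -- larger than typical 2MiB-page TLB reach */

    /* Option 1: explicitly allocated huge pages. */
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");

        /* Option 2: normal allocation, then ask for transparent huge pages. */
        buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        if (madvise(buf, size, MADV_HUGEPAGE) != 0)
            perror("madvise(MADV_HUGEPAGE)");  /* not fatal: kernel may still use 4KiB pages */
    }

    /* Touch the memory so pages are actually instantiated. */
    for (size_t i = 0; i < size; i += 4096)
        ((char *)buf)[i] = 1;

    munmap(buf, size);
    return 0;
}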
Examples of the differences between processors (using Todd Allen’s very helpful “cpuid” program):
AMD Opteron Family 10h Revision D (“Istanbul”):
L1 DTLB:
4KiB pages: 48 entries;
2MiB pages: 48 entries;
1GiB pages: 48 entries
L2 TLB:
4KiB pages: 512 entries;
2MiB pages: 128 entries;
1GiB pages: 16 entries
AMD Opteron Family 15h Model 6220 (“Interlagos”):
L1 DTLB:
4KiB pages: 32 entries, fully associative
2MiB pages: 32 entries, fully associative
1GiB pages: 32 entries, fully associative
L2 DTLB: (none)
Unified L2 TLB:
Data entries: 4KiB/2MiB/4MiB/1GiB pages, 1024 entries, 8-way associative
“An entry allocated by one core is not visible to the other core of a compute unit.”
Intel Xeon 56xx (“Westmere”):
L1 DTLB:
4KiB pages: 64 entries;
2MiB pages: 32 entries
L2 TLB:
4KiB pages: 512 entries;
2MiB pages: none
Intel Xeon E5 26xx (“Sandy Bridge EP”):
L1 DTLB:
4KiB pages: 64 entries
2MiB/4MiB pages: 32 entries
1GiB pages: 4 entries
STLB (second-level TLB):
4KiB pages: 512 entries
(There are no entries for 2MiB pages or 1GiB pages in the STLB)
Most of these cores can map at least 2MiB (512 * 4KiB) using small pages before suffering level 2 TLB misses, and at least 64MiB (32 * 2MiB) using large pages. All of these systems should see a performance increase when performing random accesses over memory ranges that are much larger than 2MiB and less than 64MiB.
What you are trying to avoid in all these cases is the worst case (Note 3) scenario of traversing all four levels of the x86_64 hierarchical address translation.
If none of the address translation caching mechanisms (Note 4) work, it requires:
5 trips to memory to load data mapped on a 4KiB page,
4 trips to memory to load data mapped on a 2MiB page, and
3 trips to memory to load data mapped on a 1GiB page.
In each case the last trip to memory is to get the requested data, while the other trips are required to obtain the various parts of the page translation information. The best description I have seen is in Section 5.3 of AMD’s “AMD64 Architecture Programmer’s Manual Volume 2: System Programming” (publication 24593). Intel’s documentation is also good once you understand the nomenclature — for 64-bit operation the paging mode is referred to as “IA-32E Paging”, and is described in Section 4.5 of Volume 3 of the “Intel 64 and IA-32 Architectures Software Developer’s Manual” (Intel document 325384 — I use revision 059 from June 2016.)
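To make the “trips to memory” counts above concrete, the small stand-alone example below decodes a 48-bit x86_64 virtual address into the four 9-bit table indices used by the hierarchical translation. With 4KiB pages all four levels are used; with 2MiB pages the walk stops at the Page Directory level (bits 20:0 become the page offset); with 1GiB pages it stops at the Page-Directory-Pointer level (bits 29:0 become the offset). The example address is arbitrary.

/* Decode a 48-bit x86_64 virtual address into its 4-level paging indices.
 * Illustrates why a completely uncached translation needs 4 table lookups
 * for 4KiB pages, 3 for 2MiB pages, and 2 for 1GiB pages, plus one final
 * access for the data itself. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t va = 0x00007f3a12345678ULL;          /* arbitrary example virtual address */

    unsigned pml4 = (va >> 39) & 0x1FF;           /* bits 47:39 -> PML4 index */
    unsigned pdpt = (va >> 30) & 0x1FF;           /* bits 38:30 -> PDPT index */
    unsigned pd   = (va >> 21) & 0x1FF;           /* bits 29:21 -> Page Directory index */
    unsigned pt   = (va >> 12) & 0x1FF;           /* bits 20:12 -> Page Table index (4KiB pages only) */

    printf("PML4=%u PDPT=%u PD=%u PT=%u\n", pml4, pdpt, pd, pt);
    printf("4KiB page offset: 0x%llx\n", (unsigned long long)(va & 0xFFF));
    printf("2MiB page offset: 0x%llx\n", (unsigned long long)(va & 0x1FFFFF));
    printf("1GiB page offset: 0x%llx\n", (unsigned long long)(va & 0x3FFFFFFF));
    return 0;
}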
A benchmark designed to test computer performance for random updates to a very large region of memory is the “RandomAccess” benchmark from the HPC Challenge Benchmark suite. Although the HPC Challenge Benchmark configuration is typically used to measure performance when performing updates across the aggregate memory of a cluster, the test can certainly be run on a single node.
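A much-simplified single-node stand-in for this kind of test is sketched below. It is not the actual HPC Challenge RandomAccess code (which uses a specific update rule and a verification step); it just performs pseudo-random 64-bit updates over a 1GiB table, which is far larger than the 4KiB-page TLB reach of any of the processors listed above, so the effect of large pages should be visible in the update rate.

/* Simplified single-node random-update microbenchmark (illustration only). */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

#define TABLE_WORDS (1UL << 27)     /* 2^27 * 8 Bytes = 1 GiB table */
#define NUPDATES    (1UL << 25)

int main(void)
{
    uint64_t *table = malloc(TABLE_WORDS * sizeof(uint64_t));
    if (!table) return 1;
    for (uint64_t i = 0; i < TABLE_WORDS; i++)
        table[i] = i;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    uint64_t x = 1;
    for (uint64_t i = 0; i < NUPDATES; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;  /* 64-bit LCG */
        table[x & (TABLE_WORDS - 1)] ^= x;        /* widely spaced random update */
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f million updates/sec\n", NUPDATES / sec / 1e6);

    free(table);
    return 0;
}

Comparing the reported update rate with the table allocated on 4KiB pages versus 2MiB pages (for example, using the mmap(MAP_HUGETLB) sketch above instead of malloc) gives a direct measure of the TLB-miss penalty for this access pattern.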
Note 1:
The first generation Intel Xeon Phi (a.k.a., “Knights Corner” or “KNC”) has several unusual features that combine to make large pages very important for sustained bandwidth as well as random memory latency. The first unusual feature is that the hardware prefetchers in the KNC processor are not very aggressive, so software prefetches are required to obtain the highest levels of sustained bandwidth. The second unusual feature is that, unlike most recent Intel processors, the KNC processor will “drop” software prefetches if the address is not mapped in the Level-1 or Level-2 TLB — i.e., a software prefetch will never trigger the Page Table Walker. The third unusual feature is unusual enough to get a separate discussion in Note 2.
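As a generic illustration of software prefetching for a streaming kernel (using the standard _mm_prefetch intrinsic rather than anything KNC-specific, and with a prefetch distance that is purely an assumption to be tuned):

/* Generic software-prefetch sketch for a streaming read kernel.
 * PF_DIST is an assumed prefetch distance that must be tuned per platform;
 * on KNC, compiler-generated vprefetch instructions would normally fill
 * this role. */
#include <xmmintrin.h>   /* _mm_prefetch */
#include <stdio.h>
#include <stdlib.h>

#define PF_DIST 64       /* prefetch ~64 doubles (8 cache lines) ahead */

static double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            _mm_prefetch((const char *)&a[i + PF_DIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}

int main(void)
{
    size_t n = 1UL << 24;                    /* 128 MiB of doubles */
    double *a = malloc(n * sizeof(double));
    if (!a) return 1;
    for (size_t i = 0; i < n; i++) a[i] = 1.0;
    printf("sum = %f\n", sum_with_prefetch(a, n));
    free(a);
    return 0;
}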
Note 2:
Unlike every other recent processor that I know of, the first generation Intel Xeon Phi does not store 4KiB Page Table Entries in the Level-2 TLB. Instead, it stores “Page Directory Entries”, which are the next level “up” in the page translation — responsible for translating virtual address bits 29:21. The benefit here is that storing 64 Page Table Entries would only provide the ability to access another 64*4KiB = 256KiB of virtual addresses, while storing 64 Page Directory Entries eliminates one memory lookup of the Page Table Walk for an address range of 64*2MiB = 128MiB. In this case, a miss to the Level-1 DTLB for an address mapped to 4KiB pages will cause a Page Table Walk, but there is an extremely high chance that the Page Directory Entry will be in the Level-2 TLB. Combining this with the caching of the first two levels of the hierarchical address translation (see Note 4), and with a high probability of finding the Page Table Entry in the L1 or L2 caches, this approach trades a small increase in latency for a large increase in the address range that can be covered with 4KiB pages.
Note 3:
The values above are not really the worst case. Running under a virtual machine makes these numbers worse. Running in an environment that causes the memory holding the various levels of the page tables to get swapped to disk makes performance much worse.
Note 4:
Unfortunately, even knowing this level of detail is not enough, because all modern processors have additional caches for the upper levels of the page translation hierarchy. As far as I can tell these are very poorly documented in public.