John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Computer Hardware' Category

The evolution of single-core bandwidth in multicore processors

Posted by John D. McCalpin, Ph.D. on 25th April 2023

The primary metric for memory bandwidth in multicore processors is the maximum sustained performance when using many cores.  For most high-end processors these values have remained in the range of 75% to 85% of the peak DRAM bandwidth of the system over the past 15-20 years — an amazing accomplishment given the increase in core count (with its associated cache coherence issues), number of DRAM channels, and ever-increasing pipelining of the DRAMs themselves.

This post is about a secondary performance characteristic — sustained memory bandwidth for a single thread running on a single core.  This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., buffer copies for filesystem access) with a single thread.  In my own experience, I have found that systems with higher single-core bandwidth feel “snappier” when used for interactive work — editing, compiling, debugging, etc.

With that in mind, I decided to mine some historical data (and run a few new experiments) to see how single-thread/single-core memory bandwidth has evolved over the last 10-15 years.  Some of the results were initially surprising, but were all consistent with the fundamental “physics” of bandwidth represented by Little’s Law (lots more on that below).

Looking at sustained single-core bandwidth for a kernel composed of 100% reads, the trends for a large set of high-end AMD and Intel processors are shown in the figure below:

So from 2010 to 2023, the sustainable single-core bandwidth increased by about 2x on Intel processors and about 5x on AMD processors.

Are these “good” improvements?  The table below may provide some perspective:

2023 vs 2010 speedup    1-core BW    1-core GFLOPS    all-core BW                 all-core GFLOPS
Intel                   ~2x          ~5x              ~10x (DDR5) / ~30x (HBM)    >40x
AMD                     ~5x          ~5x              ~20x                        ~30x

 

The single-core bandwidth on the Intel systems is clearly continuing to fall behind the single-core compute capability, and a single core is able to exploit a smaller and smaller fraction of the bandwidth available to a socket.  The single-core bandwidth on AMD systems is increasing at about the same rate as the single-core compute capability, but is also able to sustain a decreasing fraction of the bandwidth of the socket.

These observations naturally give rise to a variety of questions.  Some I will address today, and some in the next blog entry.

Questions and Answers:

  1. Why use a “read-only” memory access pattern?  Why not something like STREAM?
  2. Multicore processors have huge amounts of available DRAM bandwidth – maybe it does not even make sense for a single core to try to use that much?
    • Any recent Intel processor core (Skylake Xeon or newer) has a peak cache bandwidth of (at least) two 64-Byte reads plus one 64-Byte write per cycle.  At a single-core frequency of 3.0 GHz, this is a read BW of 384 GB/s – higher than the full socket bandwidth of 307.2 GB/s with 8 channels of DDR5/4800 DRAM.  I don’t expect a single core to sustain all of that, but the core can clearly make use of more than 20 GB/s.  (A quick arithmetic check of these peak rates follows this list.)
  3. Why is the single-core bandwidth increasing so slowly?
    • To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses.
    • That is the topic of the next blog entry.  (“Real Soon Now”)
  4. Can this problem be fixed?
    • Sure! We don’t need to violate any physical laws to increase single-core bandwidth — we just need the design to support the very high levels of memory parallelism we need, and provide us with a way of generating that parallelism from application codes.
    • The NEC Vector Engine processors provide a demonstration of very high single-core bandwidth.   On a VE20B (8 cores, 1.6 GHz, 1530 GB/s peak BW from 6 HBM stacks),  I see single-thread sustained memory bandwidth of 304 GB/s on the ReadOnly benchmark used here.
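
As promised in the second answer above, here is the arithmetic behind those peak rates. This sketch (in C) just multiplies out the numbers quoted in that answer: two 64-Byte reads per cycle at 3.0 GHz for the core, and 8 channels of DDR5/4800 (8 Bytes per transfer) for the socket.

#include <stdio.h>

int main(void)
{
    /* Per-core L1 load bandwidth: (at least) two 64-Byte reads per cycle at 3.0 GHz */
    double core_freq_GHz   = 3.0;
    double core_read_GBs   = 2.0 * 64.0 * core_freq_GHz;        /* 384 GB/s */

    /* Full-socket DRAM bandwidth: 8 channels of DDR5/4800, 8 Bytes per transfer */
    double socket_dram_GBs = 8.0 * 4800e6 * 8.0 / 1e9;          /* 307.2 GB/s */

    printf("single-core peak L1 read BW: %.1f GB/s\n", core_read_GBs);
    printf("full-socket peak DRAM BW:    %.1f GB/s\n", socket_dram_GBs);
    /* The core's cache datapath could absorb more than the entire socket's DRAM
       bandwidth, so ~20 GB/s sustained per core is not a limit of the core's
       load pipeline width. */
    return 0;
}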

 

Stay tuned!

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on The evolution of single-core bandwidth in multicore processors

Why don’t we talk about bisection bandwidth any more?

Posted by John D. McCalpin, Ph.D. on 11th April 2022


Is it unimportant just because we are not talking about it?

I was recently asked for comments about the value of increased bisection bandwidth in computer clusters for high performance computing. That got me thinking that the architecture of most internet news/comment infrastructures is built around “engagement” — effectively amplifying signals that are already strong and damping those that do not meet an interest threshold. While that approach certainly has a role, as a scientist it is also important for me to take note of topics that might be important, but which are currently generating little content. Bisection bandwidth of clustered computers seems to be one such topic….


Some notes relevant to the history of the discussion of bisection bandwidth in High Performance Computing

There has not been a lot of recent discussion of high-bisection-bandwidth computers, but this was a topic that came up frequently in the late 1990’s and early 2000’s. A good reference is the 2002 Report on High End Computing for the National Security Community, which I participated in as the representative of IBM (ref 1). The section starting on page 35 (“Systems, Architecture, Programmability, and Components Working Group”) discusses two approaches to “supercomputing” – one focusing on aggregating cost-effective peak performance (the “type T” (Transistor) systems), and the other focusing on providing the tightest integration and interconnect performance (“type C” (Communication) systems). A major influence on this distinction was Burton Smith, founder of Tera Computer Company and developer of the Tera MTA system. The Tera MTA was a unique architecture with no caches, with memory distributed/interleaved/hashed across the entire system, and with processors designed to tolerate the memory latency and to effectively synchronize on memory accesses (using “full/empty” metadata bits in the memory).

The 2002 report led fairly directly to the DARPA High Productivity Computing Systems (HPCS) project (2002-2010), which provided direct funding to several companies to develop hardware and software technologies to make supercomputers significantly easier to use. Phase 1 was just some seed money to write proposals, and included about 7 companies. Phase 2 was a much larger ($50M over 3 years to each recipient, if I recall correctly) set of grants for the companies to do significant high-level design work. Phase 3 grants ($200M-$250M over 5 years to each recipient) were awarded to Cray and IBM (ref 2). I was (briefly) the large-scale systems architecture team lead on the IBM Phase 2 project in 2004.

Both the Cray and IBM projects were characterized by a desire to improve effective bisection bandwidth, and both used hierarchical all-to-all interconnect topologies.

(If I recall correctly), the Cray project funded the development of the “Cascade” interconnect which eventually led to the Cray XC30 series of supercomputers (ref 3). Note that the Cray project funded only the interconnect development, while standard AMD and/or Intel processors were used for compute. The inability to influence the processor design limited what Cray was able to do with the interconnect. The IBM grant paid for the development of the “Torrent” bridge/switch chip for an alternate version of an IBM Power7-based server. Because IBM was developing both the processor and the interconnect chip, there was more opportunity to innovate.

I left IBM at the end of 2005 (near the end of Phase 2). IBM did eventually complete the implementation funded by DARPA, but backed out of the “Blue Waters” system at NCSA (ref 4) – an NSF-funded supercomputer that was very deliberately pitched as a “High Productivity Computing System” to benefit from the DARPA HPCS development projects. I have no inside information from the period that IBM backed out, but it is easy to suspect that IBM wanted/needed more money than was available in the budget to deliver the full-scale system. The “Blue Waters” system was replaced by a Cray – but since the DARPA-funded “Cascade” interconnect was not ready, Blue Waters used an older implementation (and older AMD processors). IBM sold only a handful of small Power7 systems with the “Torrent” interconnect, mostly to weather forecasting centers. As far as I can tell, none of the systems were large enough for the Torrent interconnect to show off its (potential) usefulness.

So after about 10 years of effort and a DARPA HPCS budget of over $600M, the high performance computing community got the interconnect for the Cray XC30. I think it was a reasonably good interconnect, but I never used it. TACC just retired our Cray XC40 and its interconnect performance was fine, but not revolutionary. After this unimpressive return on investment, it is perhaps not surprising that there has been a lack of interest in funding for high-bisection-bandwidth systems.

That is not to say that there is an unambiguous market for high-bisection-bandwidth systems! It is easy enough to identify application areas and user communities who will claim to “need” increased bandwidth, but historically they have failed to follow up with actual purchases of the few high-bandwidth systems that have been offered over the years. A common factor in the history of the HPC market has been that given enough time, most users who “demand” special characteristics will figure out how to work around the lack with more software optimization efforts, or with changes to their strategy for computing. A modest fraction will give up on scaling their computations to larger sizes and will just make do with the more modest ongoing improvements in single-node performance to advance their work.

Architectural Considerations

When considering either local memory bandwidth or global bandwidth, a key principle to keep in mind is concurrency. Almost all processor designs have very limited available concurrency per core, so lots of cores are needed just to generate enough cache misses to fill the local memory pipeline. As an example, TACC has recently deployed a few hundred nodes using the Intel Xeon Platinum 8380 processor (3rd generation Xeon Scalable Processor, aka “Ice Lake Xeon”). This has 40 cores and 8 channels of DDR4/3200 DRAM (8 channels * 3200 MT/s * 8 Bytes = 204.8 GB/s peak). The latency for memory access is about 80 ns. So the latency-bandwidth product is

204.8 GB/s * 80 nanoseconds = 16384 bytes, or 256 cache lines

Each core can directly generate about 10-12 cache misses, and can indirectly generate a few more via the L2 hardware prefetchers – call it 16 concurrent cache line requests per core, so 16 of the (up to) 40 cores are required just to fill the pipeline to DRAM.

For IO-based interconnects, latencies are much higher, but bandwidths are lower. The current state of the art for a node-level interconnect is 200 Gbit/s, such as the InfiniBand HDR fabric. The high latency favors a “put” model of communication, with one-way RDMA put latency at about 800 ns (unloaded) through a couple of switch hops. The latency-bandwidth product is (200/8) GB/s * 800 ns = 20000 Bytes. This is only slightly higher than the case for local memory, but two factors should be kept in mind: (1) the bandwidth is fairly low, and (2) the 800 ns latency applies only to “put” operations – it must be doubled for “get” operations.
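
The same two latency-bandwidth products can be computed with a tiny Little’s Law helper. This sketch uses only the numbers quoted above (204.8 GB/s and 80 ns for local DRAM; 25 GB/s and 800 ns for a one-way RDMA put), plus the estimate of ~16 concurrent cache-line requests per core from the preceding paragraphs.

#include <stdio.h>

/* Little's Law: bytes in flight = bandwidth * latency.
   With GB/s and ns, the product is conveniently in Bytes. */
static double lines_in_flight(double bw_GBs, double latency_ns)
{
    return bw_GBs * latency_ns / 64.0;   /* 64-Byte cache lines */
}

int main(void)
{
    double dram_lines = lines_in_flight(204.8,  80.0);  /* local DDR4/3200      */
    double ib_lines   = lines_in_flight( 25.0, 800.0);  /* HDR InfiniBand "put" */

    printf("local DRAM:  %6.0f cache lines in flight (~%.0f cores at 16 lines/core)\n",
           dram_lines, dram_lines / 16.0);
    printf("HDR IB put:  %6.0f cache lines in flight\n", ib_lines);
    return 0;
}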

A couple of notes:

  • The local latency of 80 ns is dominated by a serialized check of the three levels of the cache (with the 3rd level hashed/interleaved around the chip in Intel processors), along with the need to traverse many asynchronous boundaries. (An old discussion on this blog Memory Latency Components 2011-03-10 is a good start, but newer systems have a 3rd level of cache and an on-chip 2D mesh interconnect adding to the latency components.) The effective local latency could be reduced significantly by using a “scratchpad” memory architecture and a more general block-oriented interface (to more effectively use the open page features of the DRAM), but it would be very challenging to get below about 30 ns read latency.
  • The IO latency of ~800 ns for a put operation is dominated by the incredibly archaic architecture of IO systems. It is hard to get good data, but it typically requires 300-400 ns for a core’s uncached store operation to reach its target PCIe-connected device. This is absolutely not necessary from a physics perspective, but it cannot easily be fixed without a holistic approach to re-architecting the system to support communication and synchronization as first-class features. One could certainly design processors that could put IO requests on external pins in 20 ns or less – then the speed-of-light term becomes important in the latency equation (as it should be).

It has been a while since I reviewed the capabilities of optical systems, but the considerations above usually make extremely high bisection bandwidth effectively un-exploitable with current processor architectures.

Future technology developments (e.g., AttoJoule Optoelectronics for Low-Energy Information Processing and Communication) may dramatically reduce the cost, but a new architecture will be required to reduce the latency enough to make the bandwidth useful.

References:

  1. DOD High Performance Computing for the National Security Community Final Report_2003-04-09
  2. HPCWire 2006-11-24 DARPA selects Cray and IBM for final phase of HPCS-1
  3. HPCWire 2012-11-08 Cray launches Cascade: embraces Intel-based supercomputing
  4. Wikipedia: Blue Waters

Posted in Computer Architecture, Computer Hardware, Computer Interconnects | Comments Off on Why don’t we talk about bisection bandwidth any more?

Disabled Core Patterns and Core Defect Rates in Intel Xeon Phi x200 (Knights Landing) Processors

Posted by John D. McCalpin, Ph.D. on 27th October 2021

Defect rates and chip yields in the fabrication of complex semiconductor chips (like processors) are typically very tightly held secrets.  In the current era of multicore processors even the definition of “yield” requires careful thinking — companies have adapted their designs to tolerate defects in single processor cores, allowing them to sell “partial good” die at lower prices.  This has been good for everyone — the vendors get to sell more of the “stuff” that comes off the manufacturing line, and the customers have a wider range of products to choose from.

Knowing that many processor chips use “partial good” die does not usually help customers to infer anything about the yield of a multicore chip at various core counts, even when purchasing thousands of chips.  It is possible that the Xeon Phi x200 (“Knights Landing”, “KNL”) processors are in a different category — one that allows statistically interesting inferences to be drawn about core defect rates.

Why is the Xeon Phi x200 different?

  • It was developed for (and is of interest to) customers almost exclusively in the High Performance Computing (HPC) market.
  • The chip has 76 cores (in 38 pairs), and the only three product offerings had 64, 68, or 72 cores enabled.
    •  No place to sell chips with many defects.
  • The processor core is slower than most mainstream Xeon processors in both frequency and instruction level parallelism.
    • No place to sell chips that don’t meet the frequency requirements.

The Texas Advanced Computing Center currently runs 4200 compute servers using the Xeon Phi 7250 (68-core) processor.   The first 504 were installed in June 2016 and the remaining 3696 were installed in April 2017.  Unlike the mainstream Xeon processors, the Xeon Phi x200 enables any user to determine which physical cores on the die are disabled, simply by running the CPUID instruction on each active logical processor to obtain that core’s X2APIC ID (used by the interrupt controller).  There is a 1:1 correspondence between the X2APIC IDs and the physical core locations on the die, so any cores that are disabled will result in missing X2APIC values in the list.  More details on the X2APIC IDs are in the technical report “Observations on Core Numbering and “Core ID’s” in Intel Processors” and more details on the mapping of X2APIC IDs to locations on the die are in the technical report Mapping Core, CHA, and Memory Controller Numbers to Die Locations in Intel Xeon Phi x200 (“Knights Landing”, “KNL”) Processors.
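
For readers who want to reproduce this, the sketch below reads the x2APIC ID of whichever logical processor the thread happens to be running on, using CPUID leaf 0x0B via the __get_cpuid_count() helper from GCC/Clang’s cpuid.h. A full survey would pin the thread to each active logical processor in turn (e.g., with sched_setaffinity) and collect the IDs; that bookkeeping is omitted here, and the mapping from x2APIC IDs to die locations is described in the technical reports linked above.

#include <stdio.h>
#include <cpuid.h>    /* GCC/Clang helper for the CPUID instruction */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 0x0B (Extended Topology Enumeration), sub-leaf 0:
       EDX returns the 32-bit x2APIC ID of the calling logical processor. */
    if (!__get_cpuid_count(0x0B, 0, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 0x0B not supported\n");
        return 1;
    }
    printf("x2APIC ID of this logical processor: %u\n", edx);

    /* Collecting this value from every active logical processor and looking
       for gaps in the sorted list reveals which physical cores are disabled. */
    return 0;
}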

The lists of disabled cores were collected at various points over the last 4.5 years and at some point during the COVID-19 pandemic I decided to look at them.  The first result was completely expected — cores are always enabled/disabled in pairs.  This matches the way they are placed on the die: each of the 38 “tiles” contains 2 cores, a 1 MiB shared L2 cache, and a coherence agent.   The second result was unexpected — although every tile had disabled cores in at least some processors, there were four tile positions where the cores were disabled 15x-20x more than average.   In “Figure 5” below, these “preferred” tiles were the ones immediately above and below the memory controllers IMC0 and IMC1 on the left and right sides of the chip — numbers 2, 8, 27, 37.

Numbering and locations of CHAs and memory controllers in Xeon Phi x200 processors.

After reviewing the patterns in more detail, it seemed that these four “preferred” locations could be considered “spares”.  The cores at the other 34 tiles would be enabled if they were functional, and if any of those tiles had a defect, a “spare” would be enabled to compensate.  If true, this would be a very exciting result because it means that even though every one of the 4200 chips has exactly 4 tiles with disabled cores, the presence of disabled cores anywhere other than the “preferred” locations indicates a defect.  If there were no defects on the chip (or only defects in the spare tiles themselves), then the only four tiles with disabled cores would be 2, 8, 27, 37.  This was actually the case for about 1290 of the 4200 chips.

The number of chips with disabled cores at each of the 34 “standard” (non-preferred) locations varied rather widely, but looked random.    Was there any way to evaluate whether the results were consistent with a model of a small number of random defects, with those cores being replaced by activating cores in the spare tiles?  Yes, there is, and for the statistically minded you can read all about it in the technical report Disabled Core Patterns and Core Defect Rates in Xeon Phi x200 (“Knights Landing”) Processors. The report contains all sorts of mind-numbing discussions of “truncated binomial distributions”, corrections for visibility of defects, and statistical significance tests for several different views of the data — but it does have brightly colored charts and graphs to attempt to offset those soporific effects.

For the less statistically minded, the short description is:

  • For the 504 processors deployed in June 2016, the average number of “defects” was 1.38 per chip.
  • For the 3696 processors deployed in April 2017, the average number of “defects” was 1.19 per chip.
  • The difference in these counts was very strongly statistically significant (3.7 standard deviations).
  • Although some of the observed values are slightly outside the ranges expected for a purely random process, the overall pattern is strongly consistent with a model of random, independent defects.

These are very good numbers — for the full cluster the average number of defects is projected to be 1.36 per chip (including an estimate of defects in the unused “spare” tiles).  With these defect rates, only about 1% of the chips would be expected to have more than 4 defects — and almost all of these would still suffice for the 64-core model.
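
As a quick sanity check on the “about 1%” figure, a simple Poisson model with a mean of 1.36 defects per chip (a deliberate simplification of the truncated-binomial analysis in the technical report) gives a tail probability of about 1.3% (compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
    double mean = 1.36;                /* projected average defects per chip     */
    double p    = exp(-mean);          /* Poisson P(k=0)                         */
    double cdf  = 0.0;

    for (int k = 0; k <= 4; k++) {     /* accumulate P(0) through P(4)           */
        cdf += p;
        p   *= mean / (double)(k + 1); /* recurrence: P(k+1) = P(k) * mean/(k+1) */
    }
    printf("P(more than 4 defects per chip) = %.4f\n", 1.0 - cdf);  /* ~0.013 */
    return 0;
}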

So does this have anything to do with “yield”?  Probably not a whole lot — all of these chips require that all 8 Embedded DRAM Controllers (EDCs) are fully functional, all 38 Coherence Agents are fully functional, both DDR4 memory controllers are fully functional, and the IO blocks are fully functional.  There is no way to infer how many chips might be lost due to failures in any of those parts because there were no product offerings that allowed any of those blocks to be disabled.  But from the subset of chips that had all the “non-core” parts working, these results paint an encouraging picture with regard to defect rates for the cores.

Posted in Computer Architecture, Computer Hardware | Comments Off on Disabled Core Patterns and Core Defect Rates in Intel Xeon Phi x200 (Knights Landing) Processors

Mapping addresses to L3/CHA slices in Intel processors

Posted by John D. McCalpin, Ph.D. on 10th September 2021

Starting with the Xeon E5 processors “Sandy Bridge EP” in 2012, all of Intel’s mainstream multicore server processors have included a distributed L3 cache with distributed coherence processing. The L3 cache is divided into “slices”, which are distributed around the chip — typically one “slice” for each processor core.
Each core’s L1 and L2 caches are local and private, but outside the L2 cache, addresses are distributed in a random-looking way across the L3 slices all over the chip.

As an easy case, for the Xeon Gold 6142 processor (1st generation Xeon Scalable Processor with 16 cores and 16 L3 slices), every aligned group of 16 cache line addresses is mapped so that one of those 16 cache lines is assigned to each of the 16 L3 slices, using an undocumented permutation generator. The total number of possible permutations of the L3 slice numbers [0…15] is 16! (almost 21 trillion), but measurements on the hardware show that only 16 unique permutations are actually used. The observed sequences are the “binary permutations” of the sequence [0,1,2,3,…,15]. The “binary permutation” operator can be described in several different ways, but the structure is simple:

  • binary permutation “0” of a sequence is just the original sequence
  • binary permutation “1” of a sequence swaps elements in each even/odd pair, e.g.
    • Binary permutation 1 of [0,1,2,3,4,5,6,7] is [1,0,3,2,5,4,7,6]
  • binary permutation “2” of a sequence swaps pairs of elements in each set of four, e.g.,
    • Binary permutation 2 of [0,1,2,3,4,5,6,7] is [2,3,0,1,6,7,4,5]
  • binary permutation “3” of a sequence performs both permutation “1” and permutation “2”
  • binary permutation “4” of a sequence swaps the 4-element halves of each set of 8 elements, e.g.,
    • Binary permutation 4 of [0,1,2,3,4,5,6,7] is [4,5,6,7,0,1,2,3]

Binary permutation operators are very cheap to implement in hardware, but are limited to sequence lengths that are a power of 2.  When the number of slices is not a power of 2, using binary permutations requires that you create a power-of-2-length sequence that is bigger than the number of slices, and which contains each slice number approximately an equal number of times.  As an example, the Xeon Scalable Processors (gen1 and gen2) with 24 L3 slices use a 512-element sequence in which each of the 24 slice numbers appears either 21 or 22 times (sixteen values appear 21 times and eight values appear 22 times, for a total of 512).   This almost-uniform “base sequence” is then permuted using the 512 binary permutations that are possible for a 512-element sequence.
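
The examples above are consistent with a very simple rule: binary permutation “k” moves the element at index i to index (i XOR k). The sketch below is my reading of the observed patterns (Intel does not document the operator), and it reproduces the examples in the list:

#include <stdio.h>

/* Apply "binary permutation" k to a sequence of length n (n must be a power of 2):
   output[i] = input[i ^ k].  Permutation 1 swaps even/odd pairs, permutation 2 swaps
   pairs-of-pairs, permutation 4 swaps the 4-element halves of each set of 8, etc. */
static void binary_permutation(const int *in, int *out, int n, int k)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i ^ k];
}

int main(void)
{
    int base[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    int perm[8];

    for (int k = 1; k <= 4; k++) {
        binary_permutation(base, perm, 8, k);
        printf("binary permutation %d: ", k);
        for (int i = 0; i < 8; i++)
            printf("%d ", perm[i]);
        printf("\n");
    }
    return 0;
}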

Intel does not publish the length of the base sequences, the values in the base sequences, or the formulas used to determine which binary permutation of the base sequence will be used for any particular address in memory.

Over the years, a few folks have investigated the properties of these mappings, and have published a small number of results — typically for smaller L3 slice counts.

Today I am happy to announce the availability of the full base sequences and the binary permutation selector equations for many Intel processors.  The set of systems includes:

  • Xeon Scalable Processors (gen1 “Skylake Xeon” and gen2 “Cascade Lake Xeon”) with 14, 16, 18, 20, 22, 24, 26, 28 L3 slices
  • Xeon Scalable Processors (gen3 “Ice Lake Xeon”) with 28 L3 slices
  • Xeon Phi x200 Processors (“Knights Landing”) with 38 Snoop Filter slices

The results for the Xeon Scalable Processors (all generations) are based on my own measurements.  The results for the Xeon Phi x200 are based on the mappings published by Kommrusch, et al. (e.g., https://arxiv.org/abs/2011.05422), but re-interpreted in terms of the “base sequence” plus “binary permutation” model used for the other processors.

The technical report and data files (base sequences and permutation selector masks for each processor) are available at https://hdl.handle.net/2152/87595

Have fun!

Next up — using these address-to-L3/CHA-slice mapping results to understand observed L3 and Snoop Filter conflicts in these processors….

 

Posted in Cache Coherence Implementations, Computer Architecture | Comments Off on Mapping addresses to L3/CHA slices in Intel processors

Die Locations of Cores and L3 Slices for Intel Xeon Processors

Posted by John D. McCalpin, Ph.D. on 27th May 2021

Intel provides nice schematic diagrams of the layouts of their processor chips, but provides no guidance on how the user-visible core numbers and L3 slice numbers map to the locations on the die.
Most of the time there is no “need” to know the locations of the units, but there are many performance analyses that do require it, and it is often extremely helpful to be able to visualize the flow of data (and commands and acknowledgements) on the two-dimensional mesh.

In 2018 I spent a fair amount of time developing methodologies to determine the locations of the user-visible core and CHA/SF/LLC numbers on the Xeon Phi 7250 and Xeon Platinum 8160 processors. It required a fair amount of time because some tricks that Intel used to make it easier to design the photolithographic masks had the side effect of modifying the directions of up/down/left/right in different parts of the chip! When combined with the unknown locations of the disabled cores and CHAs, this was quite perplexing….

The Xeon Scalable Processors (Skylake, Cascade Lake, and the new Ice Lake Xeon) “mirror” the photolithographic mask for the “Tile” (Core + CHA/SF/LLC) in alternating columns, causing the meanings of “left” and “right” in the mesh traffic performance counters to alternate as well. This is vaguely hinted at by some of the block diagrams of the chip (such as XeonScalableFamilyTechnicalOverviewFigure5), but is clearer in the die photo:

Intel Skylake Xeon die photo

Here I have added light blue boxes around the 14 (of 28) Tile locations on the die that have the normal meanings of “left” and “right” in the mesh data traffic counters. The Tiles that don’t have blue boxes around them are left-right mirror images of the “normal” cores, and at these locations the mesh data traffic counters report mesh traffic with “left” and “right” reversed. NOTE that the 3rd Generation Intel Xeon Scalable Processors (Ice Lake Xeon) show the same column-by-column reversal as the Skylake Xeon, leading to the same behavior in the mesh data traffic counters.


TACC Frontera System

For the Xeon Platinum 8280 processors in the TACC Frontera system, all 28 of the Tiles are fully enabled, so there are no disabled units at unknown locations to cause the layout and numbering to differ from socket to socket. In each socket, the CHA/SF/LLC blocks are numbered top-to-bottom, left-to-right, skipping over the memory controllers:

The pattern of Logical Processor numbers will vary depending on whether the numbers alternate between sockets (even in 0, odd in 1) or are block-distributed (first half in 0, second half in 1). For the TACC Frontera system, all of the nodes are configured with logical processors alternating between sockets, so all even-numbered logical processors are in socket 0 and all odd-numbered logical processors are in socket 1. For this configuration, the locations of the Logical Processor numbers in socket 0 are:

In socket 1 the layout is the same, but with each Logical Processor number incremented by 1.

More details are in TR-2021-01b (link below in the references).

TACC Stampede2 System

“Skylake Xeon” partitions

For the Xeon Platinum 8160 processors in the TACC Stampede2 system, 24 of the Tiles are fully enabled and the remaining 4 Tiles have disabled Cores and disabled CHA/SF/LLCs. For these processors, administrative privileges are required to read the configuration registers that allow one to determine the locations of the CHA/SF/LLC units and the Logical Processors. There are approximately 120 different patterns of disabled tiles across the 3472 Xeon Platinum 8160 processors (1736 2-socket nodes) in the Stampede2 “SKX” partitions. The pattern of disabled cores generally has negligible impact on performance, but one needs to know the locations of the cores and CHA/SF/LLC blocks to make any sense of the traffic on the 2D mesh. Fortunately only one piece of information is needed on these systems — the CAPID6 register tells which CHA locations on the die are enabled, and these systems have a fixed mapping of Logical Processor numbers to co-located CHA numbers — so it would not be hard to make this information available to interested users (if any exist).

More details are in TR-2021-01b (link below in the references).

“Knights Landing” (“KNL”) partitions

For the 4200 Stampede2 nodes with Xeon Phi 7250 processors, all 38 CHA/SF units are active in each chip, and 34 of the 38 tiles have an active pair of cores. Since all 38 CHAs are active, their locations are the same from node to node:


For these processors the information required to determine the locations of the cores is available from user space (i.e., without special privileges). The easiest way to do this is to simply use the “/proc/cpuinfo” device to get the “core id” field for each “processor” field. Since each core supports four threads, each of the “core id” fields should appear four times. Each tile has two cores, so we take the even-numbered “core id” fields and divide them by two to get the tile number where each of the active cores is located. A specific example showing the Logical Processor number, the “core id”, and the corresponding “tile” location:

c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:39:28 $ grep ^processor /proc/cpuinfo  | head -68 | awk '{print $NF}' > tbl.procs
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:39:55 $ grep "^core id" /proc/cpuinfo  | head -68 | awk '{print $NF}' > tbl.coreids
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:40:22 $ grep "^core id" /proc/cpuinfo  | head -68 | awk '{print int($NF/2)}' > tbl.tiles
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:40:32 $ paste tbl.procs tbl.coreids tbl.tiles 
0	0	0
1	1	0
2	2	1
3	3	1
4	4	2
5	5	2
6	6	3
7	7	3
8	8	4
9	9	4
10	10	5
11	11	5
12	12	6
13	13	6
14	14	7
15	15	7
16	16	8
17	17	8
18	18	9
19	19	9
20	22	11
21	23	11
22	24	12
23	25	12
24	26	13
25	27	13
26	28	14
27	29	14
28	30	15
29	31	15
30	32	16
31	33	16
32	34	17
33	35	17
34	36	18
35	37	18
36	38	19
37	39	19
38	40	20
39	41	20
40	42	21
41	43	21
42	44	22
43	45	22
44	46	23
45	47	23
46	48	24
47	49	24
48	50	25
49	51	25
50	56	28
51	57	28
52	58	29
53	59	29
54	60	30
55	61	30
56	62	31
57	63	31
58	64	32
59	65	32
60	66	33
61	67	33
62	68	34
63	69	34
64	70	35
65	71	35
66	72	36
67	73	36

For each Logical Processor (column 1), the tile number is in column 3, and the location of the tile is in the figure above.

Since the tile numbers are [0..37], from this list we see that 10, 26, 27, and 37 are missing, so these are the tiles with disabled cores.

More details are in TR-2020-01 and in TR-2021-02 (links below in the references).


Presentations:

Detailed References:


What is a CHA/SF/LLC? This is a portion of each Tile containing a “Coherence and Home Agent” slice, a “Snoop Filter” slice, and a “Last Level Cache” slice. Each physical address in the system is mapped to exactly one of the CHA/SF/LLC blocks for cache coherence and last-level caching, so that (1) any core in the system will automatically make use of all of the LLC slices, and (2) each CHA/SF/LLC has to handle approximately equal amounts of work when all the cores are active.


Posted in Computer Architecture, Computer Hardware, Performance Counters | Comments Off on Die Locations of Cores and L3 Slices for Intel Xeon Processors

Intel’s future “CLDEMOTE” instruction

Posted by John D. McCalpin, Ph.D. on 18th February 2019

I recently saw a reference to a future Intel “Atom” core called “Tremont” and ran across an interesting new instruction, “CLDEMOTE”, that will be supported in “Future Tremont and later” microarchitectures (ref: “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference”, document 319433-035, October 2018).

The “CLDEMOTE” instruction is a “hint” to the hardware that it might help performance to move a cache line from the cache level(s) closest to the core to a cache level that is further from the core.
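
For concreteness, recent compilers expose the instruction as an intrinsic. The fragment below is a minimal sketch of a producer core finishing a buffer and hinting that the lines be pushed out of its private caches so that a reader on another core can get them with a faster cache-to-cache transfer; it assumes a compiler and CPU with CLDEMOTE support (e.g., building with -mcldemote), and since the instruction is only a hint, leaving it out changes performance but not correctness.

#include <immintrin.h>
#include <stddef.h>

/* Producer side of a producer/consumer pattern: after filling a buffer, hint
   that its cache lines be demoted out of this core's private caches so that a
   consumer running on another core can obtain them with a cheaper
   cache-to-cache transfer. */
static void produce_and_demote(double *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = (double)i;                      /* ... produce the data ... */

    for (size_t byte = 0; byte < n * sizeof(double); byte += 64)
        _mm_cldemote((const char *)buf + byte);  /* one hint per 64-Byte line */
}

int main(void)
{
    static double buffer[1024];
    produce_and_demote(buffer, 1024);
    return 0;
}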

What might such a hint be good for?   There are two obvious use cases:

  • Temporal Locality Control:  The cache line is expected to be re-used, but not so soon that it should remain in the closest/smallest cache.
  • Cache-to-Cache Intervention Optimization:  The cache line is expected to be accessed soon by a different core, and cache-to-cache interventions may be faster if the data is not in the closest level(s) of cache.
    • Intel’s instruction description mentions this use case explicitly.

If you are not a “cache hint instruction” enthusiast, this may not seem like a big deal, but it actually represents a relatively important shift in instruction design philosophy.

Instructions that directly pertain to caching can be grouped into three categories:

  1. Mandatory Control
    • The specified cache transaction must take place to guarantee correctness.
    • E.g., In a system with some non-volatile memory, a processor must have a way to guarantee that dirty data has been written from the (volatile) caches to the non-volatile memory.    The Intel CLWB instruction was added for this purpose — see https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction
  2. “Direct” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of specific transactions on a model architecture (with caveats that an implementation might do something different).
    • E.g., Intel’s PREFETCHW instruction requests that the cache line be loaded in anticipation of a store to the line.
      • This allows the cache line to be brought into the processor in advance of the store.
      • More importantly, it also allows the cache coherence transactions associated with obtaining exclusive access to the cache line to be started in advance of the store.
  3. “Indirect” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of the semantics of the program, not in terms of specific cache transactions (though specific transactions might be provided as an example).
    • E.g., “Push For Sharing Instruction” (U.S. Patent 8099557) is a hint to the processor that the current process is finished working on a cache line and that another processing core in the same coherence domain is expected to access the cache line next.  The hardware should move the cache line and/or modify its state to minimize the overhead that the other core will incur in accessing this line.
      • I was the lead inventor on this patent, which was filed in 2008 while I was working at AMD.
      • The patent was an attempt to generalize my earlier U.S. Patent 7194587, “Localized Cache Block Flush Instruction”, filed while I was at IBM in 2003.
    • Intel’s CLDEMOTE instruction is clearly very similar to my “Push For Sharing” instruction in both philosophy and intent.

 

Even though I have contributed to several patents on cache control instructions, I have decidedly mixed feelings about the entire approach.  There are several issues at play here:

  • The gap between processor and memory performance continues to increase, making performance more and more sensitive to the effective use of caches.
  • Cache hierarchies are continuing to increase in complexity, making it more difficult to understand what to do for optimal performance — even if precise control were available.
    • This has led to the near-extinction of “bottom-up” performance analysis in computing — first among customers, but also among vendors.
  • The cost of designing processors continues to increase, with caches & coherence playing a significant role in the increased cost of design and validation.
  • Technology trends show that the power associated with data motion (including cache coherence traffic) has come to far exceed the power required by computations, and that the ratio will continue to increase.
    • This does not currently dominate costs (as discussed in the SC16 talk cited above), but that is in large part because the processors have remained expensive!
    • Decreasing processor cost will require simpler designs — this decreases the development cost that must be recovered and simultaneously reduces the barriers to entry into the market for processors (allowing more competition and innovation).

Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power.  More of the same is not an answer — new thinking is required.

If one starts from current technology (rather than the technology of the late 1980’s), one would naturally design architectures to address the primary challenges:

  • “Vertical” movement of data (i.e., “private” data moving up and down the levels of a memory hierarchy) must be explicitly controllable.
  • “Horizontal” movement of data (e.g., “shared” data used to communicate between processing elements) must be explicitly controllable.

Continuing to apply “band-aids” to the transparent caching architecture of the 1980’s will not help move the industry toward the next disruptive innovation.

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Comments Off on Intel’s future “CLDEMOTE” instruction

SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Posted by John D. McCalpin, Ph.D. on 7th January 2019

Here are the annotated slides from my SC18 presentation on Snoop Filter Conflicts that cause performance variability in HPL and DGEMM on the Xeon Platinum 8160 processor.

This slide presentation includes data (not included in the paper) showing that Snoop Filter Conflicts occur in all Intel Scalable Processors (a.k.a., “Skylake Xeon”) with 18 cores or more, and also occur on the Xeon Phi x200 processors (“Knights Landing”).

The published paper is available (with ACM subscription) at https://dl.acm.org/citation.cfm?id=3291680


This is less boring than it sounds!


A more exciting version of the title.


This story is very abridged — please read the paper!



Execution times only — no performance counters yet.

500 nodes tested, but only 392 nodes had the 7 runs needed for a good computation of the median performance.

Dozens of different nodes showed slowdowns of greater than 5%.


I measured memory bandwidth first simply because I had the tools to do this easily.
Read memory controller performance counters before and after each execution and compute DRAM traffic.
Write traffic was almost constant — only the read traffic showed significant variability.


It is important to decouple the sockets for several reasons.  (1) Each socket manages its frequency independently to remain within the Running Average Power Limit. (2) Cache coherence is managed differently within and between sockets.
The performance counter infrastructure is at https://github.com/jdmccalpin/periodic-performance-counters
Over 25,000 DGEMM runs in total, generating over 240 GiB of performance counter output.


I already saw that slow runs were associated with higher DRAM traffic, but needed to find out which level(s) of the cache were experiencing extra load misses.
The strongest correlation between execution time and cache miss counts was with L2 misses (measured here as L2 cache fills).

The variation of L2 fills for the full-speed runs is surprisingly large, but the slow runs all have L2 fill counts that are at least 1.5x the minimum value.
Some runs tolerate increased L2 fill counts up to 2x the minimum value, but all cases with >2x L2 fills are slow.

This chart looks at the sum of L2 fills for all the cores on the chip — next I will look at whether these misses are uniform across the cores.


I picked 15-20 cases in which a “good” trial (at or above median performance) was followed immediately by a “slow” trial (at least 20% below median performance).
This shows the L2 Fills by core for a “good” trial — the red dashed line corresponds to the minimum L2 fill count from the previous chart divided by 24 to get the minimum per-core value.
Different sets of cores and different numbers of cores had high counts in each run — even on the same node.


This adds the “slow” execution that immediately followed the “good” execution.
For the slow runs, most of the cores had highly elevated L2 counts.  Again, different sets of cores and different numbers of cores had high counts in each run.

This data provides a critical clue:  Since L2 caches are private and 2MiB pages fully control the L2 cache index bits, the extra L2 cache misses must be caused by something *external* to the cores.


The Snoop Filter is essentially the same as the directory tags of the inclusive L3 cache of previous Intel Xeon processors, but without room to store the data for the cache lines tracked.
The key concept is “inclusivity” — lines tracked by a Snoop Filter entry must be invalidated before that Snoop Filter entry can be freed to track a different cache line address.


I initially found some poorly documented core counters that looked like they were related to Snoop Filter evictions, then later found counters in the “uncore” that count Snoop Filter evictions directly.
This allowed direct confirmation of my hypothesis, as summarized in the next slides.


About 1% of the runs are more than 10% slower than the fastest run.


Snoop Filter Evictions clearly account for the majority of the excess L2 fills.

But there is one more feature of the “slow” runs….


For all of the “slow” runs, the DRAM traffic is increased.  This means that a fraction of the data evicted from the L2 caches by the Snoop Filter evictions was also evicted from the L3 cache, and so must be retrieved from DRAM.

At high Snoop Filter conflict rates (>4e10 Snoop Filter evictions), all of the cases have elevated DRAM traffic, with something like 10%-15% of the Snoop Filter evictions missing in the L3 cache.

There are some cases in the range of 100-110 seconds that have elevated snoop filter evictions but not elevated DRAM reads, and these show minimal slowdowns.

This suggests that DGEMM can tolerate the extra latency of L2 miss/L3 hit for its data, but not the extra latency of L2 miss/L3 miss/DRAM hit.


Based on my experience in processor design groups at SGI, IBM, and AMD, I wondered if using contiguous physical addresses might avoid these snoop filter conflicts….


Baseline with 2MiB pages.


With 1GiB pages, the tail almost completely disappears in both width and depth.


Zooming in on the slowest 10% of the runs shows no evidence of systematic slowdowns when using 1GiB pages.
The performance counter data confirms that the snoop filter eviction rate is very small.

So we have a fix for single-socket DGEMM, what about HPL?


Intel provided a test version of their optimized HPL benchmark in December 2017 that supported 1GiB pages.

First, I verified that the performance variability for single-node (2-socket) HPL runs was eliminated by using 1GiB pages.

The variation across nodes is strong (due to different thermal characteristics), but the variation across runs on each node is extremely small.

The 8.6% range of average performance for this set of 31 nodes increases to >12% when considering the full 1736-node SKX partition of the Stampede2 system.

So we have a fix for single-node HPL, what about the full system?


Intel provided a full version of their optimized HPL benchmark in March 2018 and we ran it on the full system in April 2018.

The estimated breakdown of performance improvement into individual contributions is a ballpark estimate — it would be a huge project to measure the details at this scale.

The “practical peak performance” of this system is 8.77 PFLOPS on the KNLs plus 3.73 PFLOPS on the SKX nodes, for 12.5 PFLOPS “practical peak”.  The 10.68 PFLOPS obtained is about 85% of this peak performance.


During the review of the paper, I was able to simplify the test further to allow quick testing on other systems (and larger ensembles).

This is mostly new material (not in the paper).


https://github.com/jdmccalpin/SKX-SF-Conflicts

This lets me more directly address my hypothesis about conflicts with contiguous physical addresses, since each 1GiB page is much larger than the 24 MiB of aggregate L2 cache.


It turns out I was wrong — Snoop Filter Conflicts can occur with contiguous physical addresses on this processor.

The pattern repeats every 256 MiB.

If the re-used space is in the 1st 32 MiB of any 1GiB page, there will be no Snoop Filter Conflicts.

What about other processors?


I tested Skylake Xeon processors with 14, 16, 18, 20, 22, 24, 26, 28 cores, and a 68-core KNL (Xeon Phi 7250).

These four processors are the only ones that show Snoop Filter Conflicts with contiguous physical addresses.

But with random 2MiB pages, all processors with more than 16 cores show Snoop Filter conflicts for some combinations of addresses…..


These are average L2 miss rates — individual cores can have significantly higher miss rates (and the maximum miss rate may be the controlling factor in performance for multi-threaded codes).

The details are interesting, but no time in the current presentation….



Overall, the uncertainty associated with this performance variability is probably more important than the performance loss.

Using performance counter measurements to look for codes that are subject to this issue is a serious “needle in haystack” problem — it is probably easier to choose codes that might have the properties above and test them explicitly.


Cache-contained shallow water model, cache-contained FFTs.


The new DGEMM implementation uses dynamic scheduling of the block updates to decouple the memory access patterns.  There is no guarantee that this will alleviate the Snoop Filter Conflict problem, but in this case it does.


I now have a model that predicts all Snoop Filter Conflicts involving 2MiB pages on the 24-core SKX processors.
Unfortunately, the zonesort approach won’t work under Linux because pages are allocated on a “first-come, first-served” basis, so the fine control required is not possible.

An OS with support for page coloring (such as BSD) could be modified to provide this mitigation.


Again, the inability of Linux to use the virtual address as a criterion for selecting the physical page to use will prevent any sorting-based approach from working.


Intel has prepared a response.  If you are interested, you should ask your Intel representative for a copy.





 

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture, Linux, Performance Counters | Comments Off on SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Posted by John D. McCalpin, Ph.D. on 17th September 2018

Most Intel microprocessors support “HyperThreading” (Intel’s trademark for their implementation of “simultaneous multithreading”) — which allows the hardware to support (typically) two “Logical Processors” for each physical core. Processes running on the two Logical Processors share most of the processor resources (particularly caches and execution units). Some workloads (particularly heterogeneous ones) benefit from assigning processes to all logical processors, while other workloads (particularly homogeneous workloads, or cache-capacity-sensitive workloads) provide the best performance when running only one process on each physical core (i.e., leaving half of the Logical Processors idle).

Last year I was trying to diagnose a mild slowdown in a code, and wanted to be able to use the hardware performance counters to divide processor activity into four categories:

  1. Neither Logical Processor active
  2. Logical Processor 0 Active, Logical Processor 1 Inactive
  3. Logical Processor 0 Inactive, Logical Processor 1 Active
  4. Both Logical Processors Active

It was not immediately obvious how to obtain this split from the available performance counters.

Every recent Intel processor has:

  • An invariant, non-stop Time Stamp Counter (TSC)
  • Three “fixed-function” performance counters per logical processor
    1. Fixed-Function Counter 0: Instructions retired (not used here)
    2. Fixed-Function Counter 1: Actual Cycles Not Halted
    3. Fixed-Function Counter 2: Reference Cycles Not Halted
  • Two or more (typically 4) programmable performance counters per logical processor
    • A few of the “events” are common across all processors, but most are model-specific.

The fixed-function “Reference Cycles Not Halted” counter increments at the same rate as the TSC, but only while the Logical Processor is not halted. So for any interval, I can divide the change in Reference Cycles Not Halted by the change in the TSC to get the “utilization” — the fraction of the time that the Logical Processor was Not Halted. This value can be computed independently for each Logical Processor, but more information is needed to assign cycles to the four categories.   There are some special cases where partial information is available — for example, if the “utilization” is close to 1.0 for both  Logical Processors for an interval, then the processor must have had “Both Logical Processors Active” (category 4) for most of that interval.    On the other hand, if the utilization on each Logical Processor was close to 0.5 for an interval, the two logical processors could have been active at the same time for 1/2 of the cycles (50% idle + 50% both active), or the two logical processors could have been active at separate times (50% logical processor 0 only + 50% logical processor 1 only), or somewhere in between.

Both the fixed-function counters and the programmable counters have a configuration bit called “AnyThread” that, when set, causes the counter to increment if the corresponding event occurs on any logical processor of the core.  This is definitely going to be helpful, but both the algebra and the specific programming of the counters have some subtleties….

The first subtlety is related to some confusing changes in the clocks of various processors and how the performance counter values are scaled.

  • The TSC increments at a fixed rate.
    • For most Intel processors this rate is the same as the “nominal” processor frequency.
      • Starting with Skylake (client) processors, the story is complicated and I won’t go into it here.
    • It is not clear exactly how often (or how much) the TSC is incremented, since the hardware instruction to read the TSC (RDTSC) requires between ~20 and ~40 cycles to execute, depending on the processor frequency and processor generation.
  • The Fixed-Function “Unhalted Reference Cycles” counts at the same rate as the TSC, but only when the processor is not halted.
    • Unlike the TSC, the Fixed-Function “Unhalted Reference Cycles” counter increments by a fixed amount at each increment of a slower clock.
    • For Nehalem and Westmere processors, the slower clock was a 133 MHz reference clock.
    • For Sandy Bridge through Broadwell processors, the “slower clock” was the 100 MHz reference clock referred to as the “XCLK”.
      • This clock was also used in the definition of the processor frequencies.
      • For example, the Xeon E5-2680 processor had a nominal frequency of 2.7 GHz, so the TSC would increment (more-or-less continuously) at 2.7 GHz, while the Fixed-Function “Unhalted Reference Cycles” counter would increment by 27 once every 10 ns (i.e., once every tick of the 100 MHz XCLK).
    • For Skylake and newer processors, the processor frequencies are still defined in reference to a 100 MHz reference clock, but the Fixed-Function “Unhalted Reference Cycles” counter is incremented less frequently.
      • For the Xeon Platinum 8160 (nominally 2.1 GHz), the 25 MHz “core crystal clock” is used, so the counter increments by 84 once every 40 ns, rather than by 21 once every 10 ns.
  • The programmable performance counter event that most closely corresponds to the Fixed-Function “Unhalted Reference Cycles” counter has changed names and definitions several times
    • Nehalem & Westmere: “CPU_CLK_UNHALTED.REF_P” increments at the same rate as the TSC when the processor is not halted.
      • No additional scaling needed.
    • Sandy Bridge through Broadwell: “CPU_CLK_THREAD_UNHALTED.REF_XCLK” increments at the rate of the 100 MHz XCLK (not scaled!) when the processor is not halted.
      • Results must be scaled by the base CPU ratio.
    • Skylake and newer: “CPU_CLK_UNHALTED.REF_XCLK” increments at the rate of the “core crystal clock” (25 MHz on Xeon Scalable processors) when the processor is not halted.
      • Note that the name still includes “XCLK”, but the definition has changed!
      • Results must be scaled by 4 times the base CPU ratio.

Once the scaling for the programmable performance counter event is handled correctly, we get to move on to the algebra of converting the measurements from what is available to what I want.

For each interval, I assume that I have the following measurements before and after, with the measurements taken as close to simultaneously as possible on the two Logical Processors:

  • TSC (on either logical processor)
  • Fixed-Function “Unhalted Reference Cycles” (on each logical processor)
  • Programmable CPU_CLK_UNHALTED.REF_XCLK with the “AnyThread” bit set (on either Logical Processor)

So each Logical Processor makes two measurements, but they are asymmetric.

From these results, the algebra required to split the counts into the desired categories is not entirely obvious.  I eventually worked up the following sequence (a short code sketch follows the list):

  1. Neither Logical Processor Active == Elapsed TSC – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
  2. Logical Processor 0 Active, Logical Processor 1 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”
  3. Logical Processor 1 Active, Logical Processor 0 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)”
  4. Both Logical Processors Active == “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)” + “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)” – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
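
Here is a minimal sketch of that sequence in C. The counter-reading machinery is omitted; the inputs are assumed to be the deltas over the measurement interval, with the AnyThread reference-cycle count already multiplied by the scaling factor discussed above, and the example values at the bottom are purely hypothetical.

#include <stdio.h>

/* Inputs are deltas over one measurement interval:
     tsc      - elapsed Time Stamp Counter ticks
     ref_any  - CPU_CLK_UNHALTED.REF_XCLK (AnyThread=1) * scaling_factor,
                i.e., reference cycles with at least one Logical Processor not halted
     ref_lp0,
     ref_lp1  - Fixed-Function "Reference Cycles Not Halted" on each Logical Processor */
static void split_cycles(double tsc, double ref_any, double ref_lp0, double ref_lp1)
{
    double neither  = tsc - ref_any;                /* 1. both Logical Processors halted */
    double lp0_only = tsc - neither - ref_lp1;      /* 2. LP0 active, LP1 halted         */
    double lp1_only = tsc - neither - ref_lp0;      /* 3. LP1 active, LP0 halted         */
    double both     = ref_lp0 + ref_lp1 - ref_any;  /* 4. both Logical Processors active */

    /* Sanity check: the four categories partition the elapsed TSC interval. */
    printf("neither=%.0f  lp0_only=%.0f  lp1_only=%.0f  both=%.0f  (sum=%.0f, TSC=%.0f)\n",
           neither, lp0_only, lp1_only, both,
           neither + lp0_only + lp1_only + both, tsc);
}

int main(void)
{
    /* Hypothetical example: over 1e9 TSC ticks, LP0 is unhalted 80% of the time,
       LP1 is unhalted 50% of the time, and they overlap for 40% of the interval. */
    split_cycles(1.0e9, 0.9e9, 0.8e9, 0.5e9);
    return 0;
}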

Starting with the Skylake core, there is an additional sub-event of the programmable CPU_CLK_UNHALTED event that increments only when the current Logical Processor is active and the sibling Logical Processor is inactive.  This can certainly be used to obtain the same results, but it does not appear to save any effort.   My approach uses only one programmable counter on one of the two Logical Processors — a number that cannot be reduced by using an alternate programmable counter.   Comparison of the two approaches shows that the results are the same, so in the interest of backward compatibility, I continue to use my original approach.

Posted in Performance, Performance Counters, Reference | Comments Off on Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Why I hate MPI (from a performance analysis perspective)

Posted by John D. McCalpin, Ph.D. on 1st August 2018

According to Dr. Bandwidth, performance analysis has two recurring themes:

  1. How fast should this code (or “simple” variations on this code) run on this hardware?
  2. If I am analyzing (apparent) performance shortfalls, how can I distinguish between cause and effect?

For very simple codes, it may be possible to do a high-level analysis on performance limitations, but once the code becomes complex, it is often necessary to investigate the full stack.  This can start with either a “top-down” or “bottom-up” approach, but in complex codes running on complex hardware, what is really required is both approaches — iterated until the interactions between all the components are understood.  This is an intellectually challenging and labor-intensive exercise, requiring detailed review of the published details of each of the components of the system, and usually requiring significant “detective work” (using customized microbenchmarks, hardware performance counter analysis, and creative thinking) to fill in the gaps.

MPI is a commonly used standard to implement message-passing programs on computer clusters.   The basic interfaces are simple to learn, and I suspect that many people have no idea how many levels of interacting components are involved in executing an MPI program on a cluster.

A discussion with a colleague prompted me to try to write down a list of the interacting components, along with some comments on why it is so hard to understand the interactions between components.

Interacting components in the execution of an MPI job — a brief outline (from memory):

  1. The user source code, which contains an ordered set of calls to MPI routines.
    • There are many different types of MPI calls that can be used to perform the required communication.
    • For codes that contain multiple MPI calls, there are (typically) many orders in which the routines can be called that are all functionally equivalent.
    • The MPI standard requires that many sequences of calls (appear to) be ordered, but there are exceptions.
  2. The user execution environment.
    • This will typically include environment variables that can influence the behavior of the MPI runtime, and might include environment variables that can influence the behavior of the lower-level shared-memory transport and/or network hardware interfaces.
    • The user environment defines the mapping of MPI ranks to hardware resources (cores, sockets, nodes).
  3. The MPI runtime library.
    • An MPI runtime library will typically include multiple implementations of each MPI function, and will choose implementation(s) to execute based on environment variables and run-time information (sizes, communicators, etc.), in ways that are seldom transparent.
      • The source code to the library may not be available.
      • Low-level MPI tracing may be required to determine what the runtime library decided to execute, and even that may not be enough to fully capture the “decisions” made at lower levels of the implementation.
    • An MPI library may aggregate or re-order messages in any manner that provides the appearance of conforming to the ordering rules of the MPI specification.
      • In particular, the MPI library may violate ordering rules of the MPI standard if it can prove (typically via runtime checks) that the result is the same as what would be obtained by code that explicitly followed the ordering rules. This sort of speculation means that multiple runs of the same code on the same nodes may end up performing different sequences of operations.
    • MPI runtime libraries are typically built to interact with one or more lower-level interfaces to shared-memory transport implementation(s) and networking hardware.
      • The implementations of such interfaces may be based on anywhere between a superficial understanding of the lower-level interface, and a deeply detailed understanding of the lower-level interface.
      • Different levels of understanding may lead to different choices for exactly how the MPI library chooses to implement a function. For example, the MPI library may (or may not) know specific details of the ordering rules followed by the low-level interface, and so may (or may not) be able to choose different sequences of low-level operations.
  4. The implementation(s) of the low-level interface(s) to the shared-memory transport implementation and to the networking hardware interface.
    • These vary in complexity and sophistication.
    • The explicit functionality exposed in these low-level APIs may or may not be a good map to the functionality of the MPI standard(s) and/or to the functionality of the underlying hardware.
    • Mismatches in semantics or ordering rules will require an implementation to choose overly conservative sequences of low-level operations, which typically reduces performance.
  5. The processor hardware available to support shared-memory transport.
    • Current processor architectures do not include cross-core communication or synchronization as first-class architectural concepts, so implementation of communication and synchronization must be done using ordered sequences of loads and stores (a minimal example appears after this outline).
    • Detailed analysis of the cache coherence transactions involved in shared-memory synchronization (especially across sockets) shows that the lower bound on (uncontested) latency is at least an order of magnitude higher than what a hardware implementation should be able to support. For highly contested accesses, shared-memory synchronization latency is typically several orders of magnitude higher than what a hardware implementation should be able to support.  Few implementations have performance that approaches the lower bounds on shared-memory synchronization latency.
    • For almost all recent systems, a single thread of execution can only generate a fraction of the bandwidth available within or between sockets, limiting performance of the “one MPI task per socket” hybrid programming model.
      • Full bandwidth to/from local DDR4 memory typically requires 6 or more cores per socket.
      • Full bandwidth to/from high-bandwidth memory or L3 cache typically requires half or more of the cores in the socket.
      • Full bandwidth between sockets varies by processor generation and access pattern, but typically also requires 6 or more cores.
  6. The networking hardware.
    • Low-level implementation details and performance characterization data are not typically available for network interfaces and switches.
    • Most networking hardware was not designed to support highly efficient implementation of the MPI standards.
    • The MPI standards were not designed to support highly efficient implementations on most networking hardware.
  7. Runtime contention
    • Although most supercomputing systems use “space-sharing” to provide exclusive access to nodes, most of those systems use an interconnect that does not separate traffic between different users’ jobs.
    • Many current interconnects (InfiniBand, OmniPath) use static routing, making self-conflicts unavoidable (in practical terms) for any MPI job including nodes on more than one leaf switch.
      • It is theoretically possible to avoid self-conflicts on a dedicated (non-oversubscribed) system, but this requires a herculean effort to load and parse the routing tables and to schedule all communications to avoid mapping more than one pairwise communication to any link in the multi-level switch hierarchy.
      • Because routing tables are not required to be symmetric, it might not be possible to create a non-self-conflicting schedule of MPI rank pairs if bidirectional communication is used.
      • Scheduling communications to avoid self-conflicts requires global barriers between steps of pairwise communications (e.g., between steps in an all-to-all communication). This is not “natural” for many/most MPI jobs, and the barrier overhead may result in overall performance degradation if the messages in the original communication are short.
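
As a concrete illustration of item 5 above: because current processors provide no architectural send/receive primitive between cores, even the simplest cross-core notification must be synthesized from ordered loads and stores.  The sketch below is a minimal C11 flag-and-payload handshake of the kind that underlies most shared-memory transports (the names and structure are mine, not taken from any MPI implementation); each release/acquire pair turns into cache coherence traffic, and the latency of that traffic is the subject of item 5.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* A one-word "message" passed between two threads using only loads and stores.
     * The release store to the flag orders the payload store before it; the
     * acquire load orders the payload load after it. */
    static _Atomic int flag = 0;
    static int payload = 0;

    static void *producer(void *arg)
    {
        (void)arg;
        payload = 42;                                           /* write the message body */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* publish it */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                                                   /* spin until the flag is set */
        printf("received %d\n", payload);                       /* guaranteed to see 42 */
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }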

Understanding any one of these components is not usually too hard (unless the implementation is undocumented).

Understanding the interaction of any two interacting components is a bit more work, but is typically manageable by a single person.

Understanding enough of the details of all of the components to be able to reason about the interactions typically requires at least 2-3 people with complementary expertise. Keeping up with changes to the hardware and software, and applying the analyses to the ever-expanding ensemble of user applications would make this a large fraction of a full-time assignment for the team members.

To make it worse, a supercomputing center like TACC has many different systems with a variety of interconnect hardware (several generations of Mellanox InfiniBand, Intel OmniPath, Cray Aries) and several different MPI stacks (Intel MPI, Cray’s version of MPICH, MVAPICH, OpenMPI). Some of these are sufficiently different that it may be necessary to have independent teams specializing in “full-stack” performance analyses for different systems.

Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on Why I hate MPI (from a performance analysis perspective)

Comments on timing short code sections on Intel processors

Posted by John D. McCalpin, Ph.D. on 23rd July 2018

(From a recent post of mine on the Intel software developer forums — some potentially useful words to go along with my new low-overhead-timers project…)

Updated on 2019-01-23.

There are lots of topics that you need to be aware of when attempting fine-grain timing.  A few of the more important ones are:

  • The RDTSC instruction increments at the rate of the “base” (or “nominal”) processor frequency, while instructions are executed at the “core frequency”.  The “core frequency” may be higher or lower than the “base” frequency, and it may change during your measurement interval.
    • If you have the ability to “pin” the processor frequency to match the “base” frequency, interpreting the results is often easier.
    • Whether you can fix the frequency or not, you will still need to measure several different things to be sure that you can unambiguously interpret the results.  More on this below.
  • With Turbo mode enabled, Intel processors will change their frequency based on how many cores are active.  When running a single user thread, you will often get the advertised single-core Turbo frequency, but if the operating system enables more cores to handle (even very short-lived) background processes, your frequency may drop unexpectedly.
  • Recent Intel processors often throttle down to a low frequency when not in use, and (depending on processor generation, BIOS settings, and OS settings) it may take longer than expected for the frequency to ramp back up to the expected values.
    • I usually precede the code that I want to test with a “warm-up” loop consisting of at least a few seconds of execution of instructions using the same SIMD width as the code that I want to test.
  • Always pin the thread you want to test to a single logical processor (if possible).
    • This allows you to use the RDPMC instruction to read the logical processor’s fixed-function performance counters.
    • It also reduces the chance of frequency changes or other stalls that may be incurred when moving a thread context to a different core.
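
For the pinning step on Linux, a minimal sketch is shown below.  The CPU number is an arbitrary example, and (as an additional requirement not shown here) the kernel must allow user-space RDPMC access before the fixed-function counters can be read directly.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <x86intrin.h>

    int main(void)
    {
        /* Pin this thread to one logical processor (CPU 3 is an arbitrary choice)
         * so that all later RDTSC/RDPMC readings come from the same core. */
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(3, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        unsigned long long t0 = __rdtsc();
        /* the code under test goes here */
        unsigned long long t1 = __rdtsc();
        printf("elapsed TSC cycles: %llu\n", t1 - t0);
        return 0;
    }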

For measurements of short duration (<< 1 second)

  • Intel processors will be halted during frequency changes, and recent Intel processors (Haswell and newer) will also be halted when activating and/or deactivating the portions of the pipeline(s) needed for 256-bit SIMD instructions and for 512-bit SIMD instructions.
    • The duration of these halts varies by product and in some cases by the amount of the frequency change.  I have seen values as low as 6 microseconds and as high as 50 microseconds for these types of transitions.

For measurements of very short duration (< 100’s of cycles)

  • The RDTSC instruction is not ordered with respect to the execution of other instructions.  Intel processors have gained increasing ability to execute instructions out of order over the past decade, allowing the execution of these instructions to be moved further away from where one might expect — in either direction.
  • The RDTSCP instruction is partially ordered — it will not execute until all prior instructions in program order have executed.
    • RDTSCP can still be executed later than expected, but not earlier.
    • This partial ordering can help expose the execution time of long-latency instructions (such as memory accesses or mispredicted branches) that occur shortly before the final value of the TSC is read using RDTSCP.
  • The LFENCE (“Load Fence”) instruction was originally intended to order memory load operations, but was later extended (architecturally) to order instruction execution.
    • The LFENCE instruction will not execute until all prior instructions have “completed locally”, and no later instructions will begin execution (even speculatively) until the LFENCE instruction completes.
      • It may not be safe to assume that “completed locally” and “retired” are equivalent.
      • “Completed locally” is definitely not the same as “globally visible” — SFENCE is still required if you need to ensure that stores are globally visible before continuing.
    • LFENCE is a fairly lightweight instruction, though the cost depends on the nature of the surrounding instructions. 
      • The repeat rate for consecutive LFENCE instructions is 4 cycles for mainstream Intel processors starting with Sandy Bridge (through at least Skylake).
    • The combination “LFENCE; RDTSC” has a slightly stronger ordering than RDTSCP.
      • RDTSCP waits until prior instructions have completed, but does not prevent later instructions from beginning execution before the RDTSCP instruction executes.  
        • If you want a lower bound on the execution time between “start” and “stop” instructions, the minimum requirement would be to add “RDTSC; LFENCE” before the “start” instruction and to add “RDTSCP” after the “stop” instruction (a minimal intrinsics sketch of this bracketing appears after this list).
  • The Intel branch predictors are stranger than you might expect, and branch misprediction overheads are not trivial.
    • If you repeatedly execute an inner loop with a trip count of less than about 30, the branch predictor will “remember” which iteration is the final iteration of the loop, and it will correctly predict the loop exit.
    • If you increase the inner loop trip count to 35 or more, the branch predictor will not “remember” which iteration is the final iteration, so the final loop iteration will include a mispredicted branch, with an associated overhead of 15-20 cycles.
    • This can be very hard to understand if you are looking at results for loop trip counts from (for example) 16 to 64 and you see an unexpected bump of 15-20 cycles once the trip count exceeds a limit (typically in the 32-34 range).
    • This is even more confusing when you consider vectorization and loop unrolling, which the compiler may change significantly from one compilation to the next as you fiddle with your code.
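
A minimal sketch of the “RDTSC; LFENCE” / “RDTSCP” bracketing described above, using GCC/Clang intrinsics; the function being timed is just a placeholder.

    #include <stdio.h>
    #include <x86intrin.h>

    /* Placeholder for the code being measured. */
    static void code_under_test(void)
    {
        volatile double x = 1.0;
        for (int i = 0; i < 1000; i++)
            x = x * 1.000001 + 0.5;
    }

    int main(void)
    {
        unsigned int aux;   /* receives IA32_TSC_AUX from RDTSCP (ignored here) */

        unsigned long long start = __rdtsc();      /* read the start timestamp (RDTSC) */
        _mm_lfence();                              /* LFENCE: the timed code cannot begin until RDTSC completes */
        code_under_test();
        unsigned long long stop = __rdtscp(&aux);  /* RDTSCP: waits until the timed code has executed */

        printf("TSC cycles: %llu\n", stop - start);
        return 0;
    }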

Some recommendations:

  • A set of interfaces to the RDTSC and RDPMC instructions with very low overhead is available at low-overhead-timers
  • I recommend measuring a minimum of four values (combined into a short sketch after this list):
    • Elapsed TSC cycles (using RDTSC or RDTSCP)
    • Instructions — using the RDPMC instruction with counter number (1<<30)+0
    • Core Cycles not halted — using the RDPMC instruction with counter number (1<<30)+1
    • Reference Cycles not halted — using the RDPMC instruction with counter number (1<<30)+2
  • If you have the ability to program the general-purpose core performance counters, I also recommend measuring at least two more values:
    • Instructions executed in kernel mode.
    • Core cycles not halted in kernel mode.
  • Compute these metrics:
    • Core Utilization = (Elapsed Reference Cycles not Halted) / (Elapsed TSC cycles)
      • If this is not very close to 1.000, the processor has been halted for frequency and/or pipeline activation issues, and you need to try to figure out why.
    • Average frequency while not halted = (Elapsed Core Cycles not Halted) / (Elapsed Reference Cycles not Halted) * Base_GHz
      • This should be compared to the expected frequency for your processor, given the number of cores that you think should be active.
    • Average net frequency = (Elapsed Core Cycles not Halted) / (Elapsed TSC cycles) * Base_GHz
      • This will tell you how much of your expected frequency has been lost due to processor halts.
    • Instructions Retired / Instructions Expected
      • For simple loops, you can look at the assembly code and count instructions.
      • This value will change significantly (and repeatably) if the compiler changes the vectorization of the loop.
      • This will change randomly (upward) if the OS schedules another process on the same logical processor during your measured section.
      • For measurements of 10,000 instructions or less, this will increase by a noticeable amount if an OS timer interrupt occurs during your measured section.
    • Kernel instructions / Total instructions
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
    • Core Cycles not Halted in Kernel Mode / Core Cycles not Halted
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
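
Putting the measurement and metric recommendations together, the sketch below uses the GCC/Clang __rdtsc and __rdpmc intrinsics with the fixed-function counter numbers listed above.  The base frequency value, the placeholder workload, and all of the names are mine; the thread must already be pinned and user-space RDPMC access must be enabled (as discussed earlier), or the counter reads will fault or be meaningless.

    #include <stdio.h>
    #include <x86intrin.h>

    #define BASE_GHZ 2.1   /* example base frequency; substitute your processor's value */

    /* Placeholder for the code being measured. */
    static void code_under_test(void)
    {
        volatile double x = 1.0;
        for (int i = 0; i < 10000; i++)
            x = x * 1.000001 + 0.5;
    }

    int main(void)
    {
        /* Fixed-function counters via RDPMC: setting bit 30 selects the fixed counters. */
        const int INSTR = (1 << 30) + 0;   /* instructions retired        */
        const int CORE  = (1 << 30) + 1;   /* core cycles not halted      */
        const int REF   = (1 << 30) + 2;   /* reference cycles not halted */

        unsigned long long tsc0  = __rdtsc();
        unsigned long long ins0  = __rdpmc(INSTR);
        unsigned long long core0 = __rdpmc(CORE);
        unsigned long long ref0  = __rdpmc(REF);

        code_under_test();

        unsigned long long ref1  = __rdpmc(REF);
        unsigned long long core1 = __rdpmc(CORE);
        unsigned long long ins1  = __rdpmc(INSTR);
        unsigned long long tsc1  = __rdtsc();

        double tsc  = (double)(tsc1 - tsc0);
        double ins  = (double)(ins1 - ins0);
        double core = (double)(core1 - core0);
        double ref  = (double)(ref1 - ref0);

        printf("Instructions retired:           %.0f\n", ins);
        printf("Core utilization:               %.3f\n", ref / tsc);
        printf("Avg frequency while not halted: %.3f GHz\n", core / ref * BASE_GHZ);
        printf("Average net frequency:          %.3f GHz\n", core / tsc * BASE_GHZ);
        return 0;
    }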

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Comments on timing short code sections on Intel processors