John McCalpin's blog

Dr. Bandwidth explains all….

Archive for the 'Computer Architecture' Category

“Memory directories” in Intel processors

Posted by John D. McCalpin, Ph.D. on 28th August 2023

One of the (many) minimally documented features of recent Intel processor implementations is the “memory directory”.   This is used in multi-socket systems to reduce cache coherence traffic between sockets.

I have referred to this in various presentations as:

“A Memory Directory is one or more bits per cache line in DRAM that tell the processor whether another socket might have a dirty copy of the cache line.”

When asked for references to public information about Intel’s memory directory implementation, I had to go back and find the various tiny bits of information scattered around different places.   That was boring, so I am posting my notes here so I can find them again; maybe others will find this useful as well….


Existence:

The clearest admission that the memory directory feature exists is in this “technical overview” presentation – in the section “Directory-Based Coherency” – https://www.intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-technical-overview.html. The language is not as precise as I would prefer, but there are actually quite a few interesting details in that section.

History of Implementations:

Intel has disclosed various details in presentations at the Hot Chips series of conferences:

(I apologize in advance if the links are broken — it is hard to keep up with web site reorganizations — but the presentations should be relatively easy to find using web search services.)

Implicit Information:

Much of the detailed understanding of the memory directory implementation comes from studying the uncore performance counter events that make reference to memory directories.

As an example, there are direct references to memory directories in several sections of the Intel Uncore Performance Monitoring Guides for their various processor families.    These documents are linked from: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

The version that I have spent the most time with is for the 1st and 2nd generations of Intel Xeon Scalable Processors (“Skylake Xeon” and “Cascade Lake Xeon”):  “Intel® Xeon® Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual”, Intel document 336274.

Specific references in this manual include:

  • CHA Events “DIR_LOOKUP” and “DIR_UPDATE”
  • IMC Event
    • IMC_WRITES_COUNT includes: “NOTE: Directory bits are stored in memory. Remote socket RFOs will result in a directory update which, in turn, will cause a write command.”
    • Aside: it would be easy to be misled by this description.  Remote socket RFOs will result in a directory update but so will ordinary reads that return data in E state (the default for data reads that hit unshared lines).
  • M2M Events “DIRECTORY_HIT”, “DIRECTORY_LOOKUP”, “DIRECTORY_MISS”, “DIRECTORY_UPDATE”
    • These include information on directory states – very helpful for understanding the implementation.
  • Section 3.1.3 “Reference for M2M Packet Matching”, Table 3-9 “SMI3 Opcodes”, mentions the directory in the description of 10 of the 18 transaction types.

It is helpful to compare and contrast the documentation from all of the recent processor generations.  The specific items disclosed in each generation are not the same, and sometimes a generation will have a more verbose explanation that fills in gaps for the other generations.

Reading the documentation is seldom enough to understand the implementation — I typically have to create customized microbenchmarks to generate known counts of transactions of various types and then compare the measured performance counter values with the values I expected to generate.   When there are significant differences more thinking is required.  Thinking does not always help — sometimes the performance counter event is just broken, sometimes there is just not enough information disclosed to derive reasonable bounds on the implementation, and sometimes the implementation has strongly dynamic/adaptive behavior based on activity metrics that are not visible or not controllable using available HW features.

OEM Vendor Disclosures:

I have run across a few other bits of information scattered around the interwebs:

Encoding the Memory Directory bits:

In various presentations I have said that the “one or more” memory directory bits were “hidden” in the ECC bits.  This is easy to do – standard 64-bit DRAM interfaces with ECC provide 72 bits of data with every “64-bit” data transfer and generate 8 contiguous transfers per transaction.   64 bits of data requires 8 bits of ECC for SECDED for every transfer.  Computing the ECC over P transfers only requires log2(P) additional bits, while the P transfers together carry 8*P ECC bits, which leaves 8*(P-1) - log2(P) bits free for other uses.  Intel used this aggregation approach very aggressively on the Knights Corner accelerator to store the SECDED ECC bits for 64 Byte cache lines in regular memory – reserving every 32nd cache line for that purpose.
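The bit counting in the paragraph above can be checked with a few lines of Python. This is only a sketch of the textbook SECDED (extended Hamming) bound, not a description of Intel's actual code; the function names `secded_check_bits` and `spare_bits` are made up for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a textbook SECDED (extended Hamming) code."""
    r = 0
    # Single-error correction needs the smallest r with 2**r >= data_bits + r + 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall-parity bit adds double-error detection


def spare_bits(transfers: int, data_per_transfer: int = 64,
               ecc_per_transfer: int = 8) -> int:
    """ECC-lane bits left over when one SECDED code covers `transfers` transfers."""
    available = ecc_per_transfer * transfers
    needed = secded_check_bits(data_per_transfer * transfers)
    return available - needed


for p in (1, 2, 4, 8):
    print(f"P={p}: check bits={secded_check_bits(64 * p)}, spare bits={spare_bits(p)}")
```

Under these assumptions, protecting a full 64-Byte line (P=8) with a single SECDED code leaves several dozen of the 64 ECC-lane bits unused, which is plenty of room for “one or more” directory bits.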


Are there any other important references that I am missing?

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture

The evolution of single-core bandwidth in multicore processors

Posted by John D. McCalpin, Ph.D. on 25th April 2023

The primary metric for memory bandwidth in multicore processors is the maximum sustained performance when using many cores.  For most high-end processors these values have remained in the range of 75% to 85% of the peak DRAM bandwidth of the system over the past 15-20 years, an amazing accomplishment given the increase in core count (with its associated cache coherence issues), number of DRAM channels, and ever-increasing pipelining of the DRAMs themselves.

This post is about a secondary performance characteristic — sustained memory bandwidth for a single thread running on a single core.  This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., buffer copies for filesystem access) with a single thread.  In my own experience, I have found that systems with higher single-core bandwidth feel “snappier” when used for interactive work — editing, compiling, debugging, etc.

With that in mind, I decided to mine some historical data (and run a few new experiments) to see how single-thread/single-core memory bandwidth has evolved over the last 10-15 years.  Some of the results were initially surprising, but were all consistent with the fundamental “physics” of bandwidth represented by Little’s Law (lots more on that below).

Looking at sustained single-core bandwidth for a kernel composed of 100% reads, the trends for a large set of high-end AMD and Intel processors are shown in the figure below:

So from 2010 to 2023, the sustainable single-core bandwidth increased by about 2x on Intel processors and about 5x on AMD processors.

Are these “good” improvements?  The table below may provide some perspective:

2023 vs 2010 speedup | 1-core BW | 1-core GFLOPS | all-core BW        | all-core GFLOPS
Intel                | ~2x       | ~5x           | ~10x/DDR5 ~30x/HBM | >40x
AMD                  | ~5x       | ~5x           | ~20x               | ~30x

 

The single-core bandwidth on the Intel systems is clearly continuing to fall behind the single-core compute capability, and a single core is able to exploit a smaller and smaller fraction of the bandwidth available to a socket.  The single-core bandwidth on AMD systems is increasing at about the same rate as the single-core compute capability, but there, too, a single core is able to sustain a decreasing fraction of the bandwidth of the socket.

These observations naturally give rise to a variety of questions.  Some I will address today, and some in the next blog entry.

Questions and Answers:

  1. Why use a “read-only” memory access pattern?  Why not something like STREAM?
  2. Multicore processors have huge amounts of available DRAM bandwidth – maybe it does not even make sense for a single core to try to use that much?
    • Any recent Intel processor core (Skylake Xeon or newer) has a peak cache bandwidth of (at least) two 64-Byte reads plus one 64-Byte write per cycle.  At a single-core frequency of 3.0 GHz, this is a read BW of 384 GB/s – higher than the full socket bandwidth of 307.2 GB/s with 8 channels of DDR5/4800 DRAM.  I don’t expect all of that, but the core can clearly make use of more than 20 GB/s.
  3. Why is the single-core bandwidth increasing so slowly?
    • To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses.
    • That is the topic of the next blog entry.  (“Real Soon Now”)
  4. Can this problem be fixed?
    • Sure! We don’t need to violate any physical laws to increase single-core bandwidth; we just need the design to support the very high levels of memory parallelism we need, and to provide us with a way of generating that parallelism from application codes.
    • The NEC Vector Engine processors provide a demonstration of very high single-core bandwidth.   On a VE20B (8 cores, 1.6 GHz, 1530 GB/s peak BW from 6 HBM stacks),  I see single-thread sustained memory bandwidth of 304 GB/s on the ReadOnly benchmark used here.
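The peak numbers quoted in the answers above are simple products, and a short Python sketch reproduces them. The helper names are made up for illustration, and the 8 bytes per DRAM transfer is an assumption (a standard 64-bit-wide channel) that is consistent with the 307.2 GB/s figure.

```python
def core_read_bw_gb_s(reads_per_cycle: int, bytes_per_read: int, freq_ghz: float) -> float:
    """Peak core-side read bandwidth: reads/cycle x bytes/read x Gcycles/s."""
    return reads_per_cycle * bytes_per_read * freq_ghz


def dram_peak_bw_gb_s(channels: int, mt_per_s: int, bytes_per_transfer: int = 8) -> float:
    """Peak DRAM bandwidth; 8 bytes per transfer is an assumed 64-bit channel."""
    return channels * mt_per_s * bytes_per_transfer / 1000.0


core_bw = core_read_bw_gb_s(2, 64, 3.0)   # two 64-Byte reads per cycle at 3.0 GHz
socket_bw = dram_peak_bw_gb_s(8, 4800)    # 8 channels of DDR5/4800
ve_fraction = 304.0 / 1530.0              # VE20B single-thread BW over peak BW

print(f"core peak read BW : {core_bw:.1f} GB/s")
print(f"socket peak DRAM  : {socket_bw:.1f} GB/s")
print(f"VE20B 1-thread    : {100 * ve_fraction:.0f}% of peak")
```

The last line puts the NEC result in perspective: a single thread on the VE20B sustains roughly a fifth of the peak bandwidth of the whole device.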

 

Stay tuned!

Posted in Computer Architecture, Computer Hardware, Performance

Why don’t we talk about bisection bandwidth any more?

Posted by John D. McCalpin, Ph.D. on 11th April 2022


Is it unimportant just because we are not talking about it?

I was recently asked for comments about the value of increased bisection bandwidth in computer clusters for high performance computing. That got me thinking that the architecture of most internet news/comment infrastructures is built around “engagement” — effectively amplifying signals that are already strong and damping those that do not meet an interest threshold. While that approach certainly has a role, as a scientist it is also important for me to take note of topics that might be important, but which are currently generating little content. Bisection bandwidth of clustered computers seems to be one such topic….


Some notes relevant to the history of the discussion of bisection bandwidth in High Performance Computing

There has not been a lot of recent discussion of high-bisection-bandwidth computers, but this was a topic that came up frequently in the late 1990’s and early 2000’s. A good reference is the 2002 Report on High End Computing for the National Security Community, which I participated in as the representative of IBM (ref 1). The section starting on page 35 (“Systems, Architecture, Programmability, and Components Working Group”) discusses two approaches to “supercomputing” – one focusing on aggregating cost-effective peak performance (the “type T” (Transistor) systems), and the other focusing on providing the tightest integration and interconnect performance (“type C” (Communication) systems). A major influence on this distinction was Burton Smith, founder of Tera Computer Company and developer of the Tera MTA system. The Tera MTA was a unique architecture with no caches, with memory distributed/interleaved/hashed across the entire system, and with processors designed to tolerate the memory latency and to effectively synchronize on memory accesses (using “full/empty” metadata bits in the memory).

The 2002 report led fairly directly to the DARPA High Productivity Computing Systems (HPCS) project (2002-2010), which provided direct funding to several companies to develop hardware and software technologies to make supercomputers significantly easier to use. Phase 1 was just some seed money to write proposals, and included about 7 companies. Phase 2 was a much larger ($50M over 3 years to each recipient, if I recall correctly) set of grants for the companies to do significant high-level design work. Phase 3 grants ($200M-$250M over 5 years to each recipient) were awarded to Cray and IBM (ref 2). I was (briefly) the large-scale systems architecture team lead on the IBM Phase 2 project in 2004.

Both the Cray and IBM projects were characterized by a desire to improve effective bisection bandwidth, and both used hierarchical all-to-all interconnect topologies.

(If I recall correctly), the Cray project funded the development of the “Cascade” interconnect which eventually led to the Cray XC30 series of supercomputers (ref 3). Note that the Cray project funded only the interconnect development, while standard AMD and/or Intel processors were used for compute. The inability to influence the processor design limited what Cray was able to do with the interconnect. The IBM grant paid for the development of the “Torrent” bridge/switch chip for an alternate version of an IBM Power7-based server. Because IBM was developing both the processor and the interconnect chip, there was more opportunity to innovate.

I left IBM at the end of 2005 (near the end of Phase 2). IBM did eventually complete the implementation funded by DARPA, but backed out of the “Blue Waters” system at NCSA (ref 4) – an NSF-funded supercomputer that was very deliberately pitched as a “High Productivity Computing System” to benefit from the DARPA HPCS development projects. I have no inside information from the period that IBM backed out, but it is easy to suspect that IBM wanted/needed more money than was available in the budget to deliver the full-scale system. The “Blue Waters” system was replaced by a Cray – but since the DARPA-funded “Cascade” interconnect was not ready, Blue Waters used an older implementation (and older AMD processors). IBM sold only a handful of small Power7 systems with the “Torrent” interconnect, mostly to weather forecasting centers. As far as I can tell, none of the systems were large enough for the Torrent interconnect to show off its (potential) usefulness.

So after about 10 years of effort and a DARPA HPCS budget of over $600M, the high performance computing community got the interconnect for the Cray XC30. I think it was a reasonably good interconnect, but I never used it. TACC just retired our Cray XC40 and its interconnect performance was fine, but not revolutionary. After this unimpressive return on investment, it is perhaps not surprising that there has been a lack of interest in funding for high-bisection-bandwidth systems.

That is not to say that there is an unambiguous market for high-bisection-bandwidth systems! It is easy enough to identify application areas and user communities who will claim to “need” increased bandwidth, but historically they have failed to follow up with actual purchases of the few high-bandwidth systems that have been offered over the years. A common factor in the history of the HPC market has been that given enough time, most users who “demand” special characteristics will figure out how to work around the lack with more software optimization efforts, or with changes to their strategy for computing. A modest fraction will give up on scaling their computations to larger sizes and will just make do with the more modest ongoing improvements in single-node performance to advance their work.

Architectural Considerations

When considering either local memory bandwidth or global bandwidth, a key principle to keep in mind is concurrency. Almost all processor designs have very limited available concurrency per core, so lots of cores are needed just to generate enough cache misses to fill the local memory pipeline. As an example, TACC has recently deployed a few hundred nodes using the Intel Xeon Platinum 8380 processor (3rd generation Xeon Scalable Processor, aka “Ice Lake Xeon”). This has 40 cores and 8 channels of DDR4/3200 DRAM. The latency for memory access is about 80ns. So the latency-bandwidth product is

204.8 GB/s * 80 nanoseconds = 16384 bytes, or 256 cache lines

Each core can directly generate about 10-12 cache misses, and can indirectly generate a few more via the L2 hardware prefetchers – call it 16 concurrent cache line requests per core, so 16 of the (up to) 40 cores are required just to fill the pipeline to DRAM.

For IO-based interconnects, latencies are much higher, but bandwidths are lower. Current state of the art for a node-level interconnect is 200 Gbit, such as the InfiniBand HDR fabric. The high latency favors a “put” model of communication, with one-way RDMA put latency at about 800ns (unloaded) through a couple of switch hops. The latency-bandwidth product is (200/8)*800 = 20000 Bytes. This is only slightly higher than the case for local memory, but it depends on two factors: (1) The bandwidth is fairly low, and (2) the latency is only available for “put” operations – it must be doubled for “get” operations.
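The two latency-bandwidth products above are applications of Little's Law (concurrency = bandwidth x latency), and can be reproduced with a short sketch. The figure of 16 concurrent cache line requests per core is the estimate from the text; the function and variable names are only illustrative.

```python
CACHE_LINE_BYTES = 64


def bytes_in_flight(bandwidth_gb_per_s: float, latency_ns: float) -> float:
    """Little's Law: (GB/s) x (ns) = bytes that must be in flight."""
    return bandwidth_gb_per_s * latency_ns


# Local memory: 8 channels of DDR4/3200 at 8 bytes per transfer, ~80 ns latency
dram_bw = 8 * 3200e6 * 8 / 1e9                 # GB/s
dram_bytes = bytes_in_flight(dram_bw, 80.0)
dram_lines = round(dram_bytes / CACHE_LINE_BYTES)
lines_per_core = 16                            # estimate from the text
cores_to_fill = -(-dram_lines // lines_per_core)   # ceiling division

# InfiniBand HDR: 200 Gbit/s, ~800 ns one-way RDMA put latency
ib_bytes = bytes_in_flight(200 / 8, 800.0)

print(f"DRAM: {round(dram_bytes)} bytes = {dram_lines} cache lines, "
      f"{cores_to_fill} cores needed to fill the pipeline")
print(f"HDR : {round(ib_bytes)} bytes for a put, {round(2 * ib_bytes)} for a get")
```

The same function covers both cases because only the units matter: any increase in bandwidth or latency raises the concurrency that the hardware and software together must sustain.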

A couple of notes:

  • The local latency of 80 ns is dominated by a serialized check of the three levels of the cache (with the 3rd level hashed/interleaved around the chip in Intel processors), along with the need to traverse many asynchronous boundaries. (An old discussion on this blog Memory Latency Components 2011-03-10 is a good start, but newer systems have a 3rd level of cache and an on-chip 2D mesh interconnect adding to the latency components.) The effective local latency could be reduced significantly by using a “scratchpad” memory architecture and a more general block-oriented interface (to more effectively use the open page features of the DRAM), but it would be very challenging to get below about 30 ns read latency.
  • The IO latency of ~800 ns for a put operation is dominated by the incredibly archaic architecture of IO systems. It is hard to get good data, but it typically requires 300-400 ns for a core’s uncached store operation to reach its target PCIe-connected device. This is absolutely not necessary from a physics perspective, but it cannot easily be fixed without a holistic approach to re-architecting the system to support communication and synchronization as first-class features. One could certainly design processors that could put IO requests on external pins in 20ns or less – then the speed of light term becomes important in the latency equation (as it should be).

It has been a while since I reviewed the capabilities of optical systems, but the considerations above usually make extremely high bisection bandwidth effectively un-exploitable with current processor architectures.

Future technology developments (e.g., AttoJoule Optoelectronics for Low-Energy Information Processing and Communication) may dramatically reduce the cost, but a new architecture will be required to reduce the latency enough to make the bandwidth useful.

References:

  1. DOD High Performance Computing for the National Security Community Final Report_2003-04-09
  2. HPCWire 2006-11-24 DARPA selects Cray and IBM for final phase of HPCS-1
  3. HPCWire 2012-11-08 Cray launches Cascade: embraces Intel-based supercomputing
  4. Wikipedia: Blue Waters

Posted in Computer Architecture, Computer Hardware, Computer Interconnects

Disabled Core Patterns and Core Defect Rates in Intel Xeon Phi x200 (Knights Landing) Processors

Posted by John D. McCalpin, Ph.D. on 27th October 2021

Defect rates and chip yields in the fabrication of complex semiconductor chips (like processors) are typically very tightly held secrets.  In the current era of multicore processors even the definition of “yield” requires careful thinking — companies have adapted their designs to tolerate defects in single processor cores, allowing them to sell “partial good” die at lower prices.  This has been good for everyone — the vendors get to sell more of the “stuff” that comes off the manufacturing line, and the customers have a wider range of products to choose from.

Knowing that many processor chips use “partial good” die does not usually help customers to infer anything about the yield of a multicore chip at various core counts, even when purchasing thousands of chips.  It is possible that the Xeon Phi x200 (“Knights Landing”, “KNL”) processors are in a different category — one that allows statistically interesting inferences to be drawn about core defect rates.

Why is the Xeon Phi x200 different?

  • It was developed for (and is of interest to) almost exclusively customers in the High Performance Computing (HPC) market.
  • The chip has 76 cores (in 38 pairs), and the only three product offerings had 64, 68, or 72 cores enabled.
    •  No place to sell chips with many defects.
  • The processor core is slower than most mainstream Xeon processors in both frequency and instruction level parallelism.
    • No place to sell chips that don’t meet the frequency requirements.

The Texas Advanced Computing Center currently runs 4200 compute servers using the Xeon Phi 7250 (68-core) processor.   The first 504 were installed in June 2016 and the remaining 3696 were installed in April 2017.  Unlike the mainstream Xeon processors, the Xeon Phi x200 enables any user to determine which physical cores on the die are disabled, simply by running the CPUID instruction on each active logical processor to obtain that core’s X2APIC ID (used by the interrupt controller).  There is a 1:1 correspondence between the X2APIC IDs and the physical core locations on the die, so any cores that are disabled will result in missing X2APIC values in the list.  More details on the X2APIC IDs are in the technical report “Observations on Core Numbering and “Core ID’s” in Intel Processors” and more details on the mapping of X2APIC IDs to locations on the die are in the technical report Mapping Core, CHA, and Memory Controller Numbers to Die Locations in Intel Xeon Phi x200 (“Knights Landing”, “KNL”) Processors.
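Once the X2APIC IDs have been collected (by whatever CPUID-based tool is available), finding the gaps is straightforward. The sketch below is illustrative only: it assumes that the low 2 bits of the X2APIC ID select one of 4 hardware threads and that the remaining bits number the 76 physical core slots consecutively. That layout is an assumption made for this example, not something taken from the technical reports, and the input list here is synthetic.

```python
THREAD_BITS = 2            # assumption: 4 hardware threads per core in the low bits
PHYSICAL_CORE_SLOTS = 76   # from the text: 76 cores in 38 pairs


def disabled_core_slots(x2apic_ids):
    """Return the core slots for which no X2APIC ID was observed."""
    present = {apic_id >> THREAD_BITS for apic_id in x2apic_ids}
    return sorted(set(range(PHYSICAL_CORE_SLOTS)) - present)


# Synthetic example: a 68-core part with four core pairs switched off
# (the slot numbers are invented for the demonstration)
switched_off = {4, 5, 16, 17, 54, 55, 74, 75}
observed = [core * 4 + thread
            for core in range(PHYSICAL_CORE_SLOTS) if core not in switched_off
            for thread in range(4)]

missing = disabled_core_slots(observed)
print("disabled core slots:", missing)
```

Mapping the missing slots to tile positions on the die requires the additional information in the technical reports cited above.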

The lists of disabled cores were collected at various points over the last 4.5 years and at some point during the COVID-19 pandemic I decided to look at them.  The first result was completely expected: cores are always enabled/disabled in pairs.  This matches the way they are placed on the die: each of the 38 “tiles” consists of 2 cores, a 1MiB shared L2 cache, and a coherence agent.   The second result was unexpected: although every tile had disabled cores in at least some processors, there were four tile positions where the cores were disabled 15x-20x more often than average.   In “Figure 5” below, these “preferred” tiles were the ones immediately above and below the memory controllers IMC0 and IMC1 on the left and right sides of the chip (numbers 2, 8, 27, 37).

Numbering and locations of CHAs and memory controllers in Xeon Phi x200 processors.

After reviewing the patterns in more detail, it seemed that these four “preferred” locations could be considered “spares”.  The cores at the other 34 tiles would be enabled if they were functional, and if any of those tiles had a defect, a “spare” would be enabled to compensate.  If true, this would be a very exciting result because it means that even though every one of the 4200 chips has exactly 4 tiles with disabled cores, the presence of disabled cores anywhere other than the “preferred” locations indicates a defect.  If there were no defects on the chip (or only defects in the spare tiles themselves), then the only four tiles with disabled cores would be 2, 8, 27, 37.  This was actually the case for about 1290 of the 4200 chips.

The number of chips with disabled cores at each of the 34 “standard” (non-preferred) locations varied rather widely, but looked random.    Was there any way to evaluate whether the results were consistent with a model of a small number of random defects, with those cores being replaced by activating cores in the spare tiles?  Yes, there is, and for the statistically minded you can read all about it in the technical report Disabled Core Patterns and Core Defect Rates in Xeon Phi x200 (“Knights Landing”) Processors. The report contains all sorts of mind-numbing discussions of “truncated binomial distributions”, corrections for visibility of defects, and statistical significance tests for several different views of the data — but it does have brightly colored charts and graphs to attempt to offset those soporific effects.

For the less statistically minded, the short description is:

  • For the 504 processors deployed in June 2016, the average number of “defects” was 1.38 per chip.
  • For the 3696 processors deployed in April 2017, the average number of “defects” was 1.19 per chip.
  • The difference in these counts was very strongly statistically significant (3.7 standard deviations).
  • Although some of the observed values are slightly outside the ranges expected for a purely random process, the overall pattern is strongly consistent with a model of random, independent defects.

These are very good numbers — for the full cluster the average number of defects is projected to be 1.36 per chip (including an estimate of defects in the unused “spare” tiles).  With these defect rates, only about 1% of the chips would be expected to have more than 4 defects — and almost all of these would still suffice for the 64-core model.
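As a rough sanity check on the “about 1%” figure, one can treat each of the 38 tiles as independently defective with probability 1.36/38 and compute the binomial tail. This is a simplified sketch under that assumption; it is not the truncated-binomial analysis used in the technical report, and the function name is made up for illustration.

```python
from math import comb


def prob_more_than(k: int, n_tiles: int, mean_defects: float) -> float:
    """P(X > k) for X ~ Binomial(n_tiles, mean_defects / n_tiles)."""
    p = mean_defects / n_tiles
    cdf = sum(comb(n_tiles, j) * p**j * (1.0 - p)**(n_tiles - j)
              for j in range(k + 1))
    return 1.0 - cdf


p_gt4 = prob_more_than(4, 38, 1.36)   # too many defects for the 68-core part
p_gt6 = prob_more_than(6, 38, 1.36)   # too many defects even for the 64-core part

print(f"P(more than 4 defective tiles) = {100 * p_gt4:.2f}%")
print(f"P(more than 6 defective tiles) = {100 * p_gt6:.3f}%")
```

The 64-core model needs only 32 of the 38 tiles, so it tolerates up to 6 defective tiles, which is why almost all of the chips with more than 4 defects would still be usable.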

So does this have anything to do with “yield”?  Probably not a whole lot — all of these chips require that all 8 Embedded DRAM Controllers (EDCs) are fully functional, all 38 Coherence Agents are fully functional, both DDR4 memory controllers are fully functional, and the IO blocks are fully functional.  There is no way to infer how many chips might be lost due to failures in any of those parts because there were no product offerings that allowed any of those blocks to be disabled.  But from the subset of chips that had all the “non-core” parts working, these results paint an encouraging picture with regard to defect rates for the cores.

Posted in Computer Architecture, Computer Hardware

Mapping addresses to L3/CHA slices in Intel processors

Posted by John D. McCalpin, Ph.D. on 10th September 2021

Starting with the Xeon E5 processors “Sandy Bridge EP” in 2012, all of Intel’s mainstream multicore server processors have included a distributed L3 cache with distributed coherence processing. The L3 cache is divided into “slices”, which are distributed around the chip — typically one “slice” for each processor core.
Each core’s L1 and L2 caches are local and private, but outside the L2 cache, addresses are distributed in a random-looking way across the L3 slices all over the chip.

As an easy case, for the Xeon Gold 6142 processor (1st generation Xeon Scalable Processor with 16 cores and 16 L3 slices), every aligned group of 16 cache line addresses is mapped so that one of those 16 cache lines is assigned to each of the 16 L3 slices, using an undocumented permutation generator. The total number of possible permutations of the L3 slice numbers [0…15] is 16! (almost 21 trillion), but measurements on the hardware show that only 16 unique permutations are actually used. The observed sequences are the “binary permutations” of the sequence [0,1,2,3,…,15]. The “binary permutation” operator can be described in several different ways, but the structure is simple:

  • binary permutation “0” of a sequence is just the original sequence
  • binary permutation “1” of a sequence swaps elements in each even/odd pair, e.g.
    • Binary permutation 1 of [0,1,2,3,4,5,6,7] is [1,0,3,2,5,4,7,6]
  • binary permutation “2” of a sequence swaps pairs of elements in each set of four, e.g.,
    • Binary permutation 2 of [0,1,2,3,4,5,6,7] is [2,3,0,1,6,7,4,5]
  • binary permutation “3” of a sequence performs both permutation “1” and permutation “2”
  • binary permutation “4” of a sequence swaps 4-element halves of each set of 8 elements, e.g.,
    • Binary permutation 4 of [0,1,2,3,4,5,6,7] is [4,5,6,7,0,1,2,3]
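One compact way to express the operator described in the list above is that element i of binary permutation k is element (i XOR k) of the original sequence. The sketch below is a reading of the listed examples, not a statement about how the hardware implements the mapping, and the function name is only illustrative.

```python
def binary_permutation(seq, k):
    """Return binary permutation k of seq: output[i] = seq[i XOR k]."""
    n = len(seq)
    if n & (n - 1) or not 0 <= k < n:
        raise ValueError("length must be a power of 2 and 0 <= k < length")
    return [seq[i ^ k] for i in range(n)]


base = list(range(8))
for k in (0, 1, 2, 3, 4):
    print(k, binary_permutation(base, k))

# A 16-element sequence has exactly 16 distinct binary permutations
distinct = {tuple(binary_permutation(list(range(16)), k)) for k in range(16)}
print(len(distinct))
```

Because XOR with a constant only flips selected index bits, the operator costs a handful of gates, which fits the observation that only 16 of the 16! possible orderings appear for 16 slices.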

Binary permutation operators are very cheap to implement in hardware, but are limited to sequence lengths that are a power of 2.  When the number of slices is not a power of 2, using binary permutations requires that you create a power-of-2-length sequence that is bigger than the number of slices, and which contains each slice number approximately an equal number of times.  As an example, the Xeon Scalable Processors (gen1 and gen2) with 24 L3 slices use a 512 element sequence that contains each of the values 0…15 21 times each and each of the values 16…23 22 times each (16*21 + 8*22 = 512).   This almost-uniform “base sequence” is then permuted using the 512 binary permutations that are possible for a 512-element sequence.

Intel does not publish the length of the base sequences, the values in the base sequences, or the formulas used to determine which binary permutation of the base sequence will be used for any particular address in memory.

Over the years, a few folks have investigated the properties of these mappings, and have published a small number of results — typically for smaller L3 slice counts.

Today I am happy to announce the availability of the full base sequences and the binary permutation selector equations for many Intel processors.  The set of systems includes:

  • Xeon Scalable Processors (gen1 “Skylake Xeon” and gen2 “Cascade Lake Xeon”) with 14, 16, 18, 20, 22, 24, 26, 28 L3 slices
  • Xeon Scalable Processors (gen3 “Ice Lake Xeon”) with 28 L3 slices
  • Xeon Phi x200 Processors (“Knights Landing”) with 38 Snoop Filter slices

The results for the Xeon Scalable Processors (all generations) are based on my own measurements.  The results for the Xeon Phi x200 are based on the mappings published by Kommrusch, et al. (e.g., https://arxiv.org/abs/2011.05422), but re-interpreted in terms of the “base sequence” plus “binary permutation” model used for the other processors.

The technical report and data files (base sequences and permutation selector masks for each processor) are available at https://hdl.handle.net/2152/87595

Have fun!

Next up: using these address-to-L3/CHA mapping results to understand observed L3 and Snoop Filter conflicts in these processors….

 

Posted in Cache Coherence Implementations, Computer Architecture

Die Locations of Cores and L3 Slices for Intel Xeon Processors

Posted by John D. McCalpin, Ph.D. on 27th May 2021

Intel provides nice schematic diagrams of the layouts of their processor chips, but provides no guidance on how the user-visible core numbers and L3 slice numbers map to the locations on the die.
Most of the time there is no “need” to know the locations of the units, but there are many performance analyses that do require it, and it is often extremely helpful to be able to visualize the flow of data (and commands and acknowledgements) on the two-dimensional mesh.

In 2018 I spent a fair amount of time developing methodologies to determine the locations of the user-visible core and CHA/SF/LLC numbers on the Xeon Phi 7250 and Xeon Platinum 8160 processors. It required a fair amount of time because some tricks that Intel used to make it easier to design the photolithographic masks had the side effect of modifying the directions of up/down/left/right in different parts of the chip! When combined with the unknown locations of the disabled cores and CHAs, this was quite perplexing….

The Xeon Scalable Processors (Skylake, Cascade Lake, and the new Ice Lake Xeon) “mirror” the photolithographic mask for the “Tile” (Core + CHA/SF/LLC) in alternating columns, causing the meanings of “left” and “right” in the mesh traffic performance counters to alternate as well. This is vaguely hinted at by some of the block diagrams of the chip (such as XeonScalableFamilyTechnicalOverviewFigure5), but is clearer in the die photo:

Intel Skylake Xeon die photo

Here I have added light blue boxes around the 14 (of 28) Tile locations on the die that have the normal meanings of “left” and “right” in the mesh data traffic counters. The Tiles that don’t have blue boxes around them are left-right mirror images of the “normal” cores, and at these locations the mesh data traffic counters report mesh traffic with “left” and “right” reversed. NOTE that the 3rd Generation Intel Xeon Scalable Processors (Ice Lake Xeon) show the same column-by-column reversal as the Skylake Xeon, leading to the same behavior in the mesh data traffic counters.


TACC Frontera System

For the Xeon Platinum 8280 processors in the TACC Frontera system, all 28 of the Tiles are fully enabled, so there are no disabled units at unknown locations to cause the layout and numbering to differ from socket to socket. In each socket, the CHA/SF/LLC blocks are numbered top-to-bottom, left-to-right, skipping over the memory controllers:

The pattern of Logical Processor numbers will vary depending on whether the numbers alternate between sockets (even in 0, odd in 1) or are block-distributed (first half in 0, second half in 1). For the TACC Frontera system, all of the nodes are configured with logical processors alternating between sockets, so all even-numbered logical processors are in socket 0 and all odd-numbered logical processors are in socket 1. For this configuration, the locations of the Logical Processor numbers in socket 0 are:

In socket 1 the layout is the same, but with each Logical Processor number incremented by 1.
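The two numbering schemes can be written down in a few lines. This is a sketch for a 2-socket node only; the function names are illustrative, and the count of 56 logical processors in the usage note is an example value, not something stated above.

```python
def socket_of(logical_cpu, num_logical, alternating=True):
    """Socket (0 or 1) that holds a logical processor in a 2-socket node."""
    if alternating:
        # Even numbers in socket 0, odd numbers in socket 1.
        return logical_cpu % 2
    # Block-distributed: first half in socket 0, second half in socket 1.
    return 0 if logical_cpu < num_logical // 2 else 1


def socket0_equivalent(logical_cpu):
    """For the alternating scheme, the socket 0 logical processor that sits
    at the same die location (socket 1 numbers are the socket 0 numbers + 1)."""
    return logical_cpu - (logical_cpu % 2)
```

For example, with alternating numbering `socket_of(7, 56)` is socket 1, and logical processor 7 sits at the die location that logical processor 6 occupies in socket 0.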

More details are in TR-2021-01b (link below in the references).

TACC Stampede2 System

“Skylake Xeon” partitions

For the Xeon Platinum 8160 processors in the TACC Stampede2 system, 24 of the Tiles are fully enabled and the remaining 4 Tiles have disabled Cores and disabled CHA/SF/LLCs. For these processors, administrative privileges are required to read the configuration registers that allow one to determine the locations of the CHA/SF/LLC units and the Logical Processors. There are approximately 120 different patterns of disabled tiles across the 3472 Xeon Platinum 8160 processors (1736 2-socket nodes) in the Stampede2 “SKX” partitions. The pattern of disabled cores generally has negligible impact on performance, but one needs to know the locations of the cores and CHA/SF/LLC blocks to make any sense of the traffic on the 2D mesh. Fortunately only one piece of information is needed on these systems — the CAPID6 register tells which CHA locations on the die are enabled, and these systems have a fixed mapping of Logical Processor numbers to co-located CHA numbers — so it would not be hard to make this information available to interested users (if any exist).
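Once the CAPID6 value has been read (which requires administrative privileges), turning it into a list of enabled locations is simple bit manipulation. The sketch below makes two assumptions that should be checked against TR-2021-01b: that bit i of the value corresponds to die location i, and that CHA numbers are assigned consecutively to the enabled locations in increasing order. The function names are illustrative.

```python
def enabled_cha_locations(capid6, num_locations=28):
    """Die locations whose CHA is enabled.
    ASSUMPTION: bit i of the register value is set when location i is enabled."""
    return [i for i in range(num_locations) if (capid6 >> i) & 1]


def cha_number_to_location(capid6, num_locations=28):
    """Map each CHA number to its die location.
    ASSUMPTION: CHA numbers are handed out to enabled locations in order."""
    locations = enabled_cha_locations(capid6, num_locations)
    return {cha: loc for cha, loc in enumerate(locations)}
```

With a toy 4-location mask of `0b1011`, locations 0, 1, and 3 are enabled, so CHA 2 sits at location 3.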

More details are in TR-2021-01b (link below in the references).

“Knights Landing” (“KNL”) partitions

For the 4200 Stampede2 nodes with Xeon Phi 7250 processors, all 38 CHA/SF units are active in each chip, and 34 of the 38 tiles have an active pair of cores. Since all 38 CHAs are active, their locations are the same from node to node:


For these processors the information required to determine the locations of the cores is available from user space (i.e., without special privileges). The easiest way to do this is to read “/proc/cpuinfo” and get the “core id” field for each “processor” field. Since each core supports four threads, each “core id” value should appear four times; the first 68 entries cover each active core once. Each tile has two cores, so dividing the “core id” by two and discarding the remainder gives the tile number where each active core is located. A specific example showing the Logical Processor number, the “core id”, and the corresponding “tile” location:

c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:39:28 $ grep ^processor /proc/cpuinfo  | head -68 | awk '{print $NF}' > tbl.procs
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:39:55 $ grep "^core id" /proc/cpuinfo  | head -68 | awk '{print $NF}' > tbl.coreids
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:40:22 $ grep "^core id" /proc/cpuinfo  | head -68 | awk '{print int($NF/2)}' > tbl.tiles
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:40:32 $ paste tbl.procs tbl.coreids tbl.tiles 
0	0	0
1	1	0
2	2	1
3	3	1
4	4	2
5	5	2
6	6	3
7	7	3
8	8	4
9	9	4
10	10	5
11	11	5
12	12	6
13	13	6
14	14	7
15	15	7
16	16	8
17	17	8
18	18	9
19	19	9
20	22	11
21	23	11
22	24	12
23	25	12
24	26	13
25	27	13
26	28	14
27	29	14
28	30	15
29	31	15
30	32	16
31	33	16
32	34	17
33	35	17
34	36	18
35	37	18
36	38	19
37	39	19
38	40	20
39	41	20
40	42	21
41	43	21
42	44	22
43	45	22
44	46	23
45	47	23
46	48	24
47	49	24
48	50	25
49	51	25
50	56	28
51	57	28
52	58	29
53	59	29
54	60	30
55	61	30
56	62	31
57	63	31
58	64	32
59	65	32
60	66	33
61	67	33
62	68	34
63	69	34
64	70	35
65	71	35
66	72	36
67	73	36

For each Logical Processor (column 1), the tile number is in column 3, and the location of the tile is in the figure above.

Since the tile numbers are [0..37], from this list we see that 10, 26, 27, and 37 are missing, so these are the tiles with disabled cores.
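The same bookkeeping can be done in one step. Here is a sketch in Python that parses the text of /proc/cpuinfo, computes the tile for each logical processor the same way as the awk command above, and lists the tiles with no active cores. The function names are illustrative, and the defaults (two cores per tile, 38 tiles) are the Xeon Phi 7250 values from this section.

```python
def tiles_from_cpuinfo(text, cores_per_tile=2):
    """Map logical processor number -> (core id, tile number)."""
    mapping = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key = key.strip()
        if key == "processor":
            current = int(value)
        elif key == "core id" and current is not None:
            core_id = int(value)
            mapping[current] = (core_id, core_id // cores_per_tile)
    return mapping


def disabled_tiles(mapping, num_tiles=38):
    """Tile numbers in [0, num_tiles) that have no active core."""
    active = {tile for _, tile in mapping.values()}
    return sorted(set(range(num_tiles)) - active)
```

On a node, pass the contents of /proc/cpuinfo (for example `open("/proc/cpuinfo").read()`) to `tiles_from_cpuinfo` and the result to `disabled_tiles`. For the node shown above this should reproduce the list 10, 26, 27, 37.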

More details are in TR-2020-01 and in TR-2021-02 (links below in the references).


Presentations:

Detailed References:


What is a CHA/SF/LLC ? This is a portion of each Tile containing a “Coherence and Home Agent” slice, a “Snoop Filter” slice, and a “Last Level Cache” slice. Each physical address in the system is mapped to exactly one of the CHA/SF/LLC blocks for cache coherence and last-level caching, so that (1) any core in the system will automatically make use of all of the LLC slices, and (2) each CHA/SF/LLC has to handle approximately equal amounts of work when all the cores are active.


Posted in Computer Architecture, Computer Hardware, Performance Counters | Comments Off on Die Locations of Cores and L3 Slices for Intel Xeon Processors

The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Posted by John D. McCalpin, Ph.D. on 2nd April 2020

This was a keynote presentation at the “2nd International Workshop on Performance Modeling: Methods and Applications” (PMMA16), June 23, 2016, Frankfurt, Germany (in conjunction with ISC16).

The presentation discusses a family of simple performance models that I developed over the last 20 years — originally in support of processor and system design at SGI (1996-1999), IBM (1999-2005), and AMD (2006-2008), but more recently in support of system procurements at The Texas Advanced Computing Center (TACC) (2009-present).

More notes interspersed below….


Slide01


Most of TACC’s supercomputer systems are national resources, open to (unclassified) scientific research in all areas. We have over 5,000 direct users (logging into the systems and running jobs) and tens of thousands of indirect users (who access TACC resources via web portals). With a staff of slightly over 175 full-time employees (less than 1/2 in consulting roles), we must therefore focus on highly-leveraged performance analysis projects, rather than labor-intensive ones.


An earlier presentation on this topic (including extensions of the method to incorporate cost modeling) is from 2007: “System Performance Balance, System Cost Balance, Application Balance, & the SPEC CPU2000/CPU2006 Benchmarks” (invited presentation at the SPEC Benchmarking Joint US/Europe Colloquium, June 22, 2007, Dresden, Germany).


This data is from the 2007 presentation. All of the SPECfp_rate2000 results were downloaded from www.spec.org, the results were sorted by processor type, and “peak floating-point operations per cycle” was manually added for each processor type. This includes all architectures, all compilers, all operating systems, and all system configurations. It is not surprising that there is a lot of scatter, but the factor of four range in Peak MFLOPS at fixed SPECfp_rate2000/core and the factor of four range in SPECfp_rate2000/core at fixed Peak MFLOPS were larger than I expected….


(Also from the 2007 presentation.) To show that I can criticize my own work as well, here I show that sustained memory bandwidth (using an approximation to the STREAM Benchmark) is also inadequate as a single figure of merit. (It is better than peak MFLOPS, but still has roughly a factor of three range when projecting in either direction.)


Here I assumed a particular analytical function for the amount of memory traffic as a function of cache size to scale the bandwidth time.
Details are not particularly important since I am trying to model something that is a geometric mean of 14 individual values and the results are across many architectures and compilers.
Doing separate models for the 14 benchmarks does not reduce the variance much further – there is about 15% that remains unexplainable in such a broad dataset.

The model can provide much better fit to the data if the HW and SW are restricted, as we will see in the next section…



Why no overlap? The model actually includes some kinds of overlap (discussed in the context of specific models below) and can be extended to include overlap between components. The specific models and results that will be presented here fit the data better when it is assumed that there is no overlap between components. Bounds on overlap are discussed near the end of the presentation, in the slides titled “Analysis”.


The approach is opportunistic. When I started this work over 20 years ago, most of the parameters I was varying could only be changed in the vendor’s laboratory. Over time, the mechanisms introduced for reducing energy consumption (first in laptops) became available more broadly. In most current machines, memory frequency can be configured by the user at boot time, while CPU frequency can be varied on a live system.


The assumption that “memory accesses overlap with other memory accesses about as well as they do in the STREAM Benchmark” is based on trying lots of other formulations and getting poor consistency with observations.
Note that “compute” is not a simple metric like “instructions” or “floating-point operations”. T_cpu is best understood as the time that the core requires to execute the particular workload of interest in the absence of memory references.


Only talking about CPU2006 results today – the CPU2000 results look similar (see the 2007 presentation linked above), but the CPU2000 benchmark codes are less closely related to real applications.


Building separate models for each of the benchmarks was required to get the correct asymptotic properties. The geometric mean used to combine the individual benchmark results into a single metric is the right way to combine relative performance improvements with equal weighting for each code, but it is inconsistent with the underlying “physics” of computer performance for each of the individual benchmark results.


This system would be considered a “high-memory-bandwidth” system at the time these results were collected. In other words, this system would be expected to be CPU-limited more often than other systems (when running the same workload), because it would be memory-bandwidth limited less often. This system also had significantly lower memory latency than many contemporary systems (which were still using front-side bus architectures and separate “NorthBridge” chips).


Many of these applications (e.g., NAMD, Gamess, Gromacs, DealII, WRF, and MILC) are major consumers of cycles on TACC supercomputers (albeit newer versions and different datasets).


The published SPEC benchmarks are no longer useful to support this sensitivity-based modeling approach for two main reasons:

  1. Running N independent copies of a benchmark simultaneously on N cores has a lot of similarities with running a parallelized implementation of the benchmark when N is small (2 or 4, or maybe a little higher), but the performance characteristics diverge as N gets larger (certainly dubious by the time one reaches even half the core count of today’s high-end processors).
  2. For marketing reasons, the published results since 2007 have been limited almost exclusively to the configurations that give the very best results. This includes always running with HyperThreading enabled (and running one copy of the executable on each “logical processor”), always running with automatic parallelization enabled (making it very difficult to compare “speed” and “rate” results, since it is not clear how many cores are used in the “speed” tests), always running with the optimum memory configuration, etc.

The “CONUS 12km” benchmark is a simulation of the weather over the “CONtinental US” at 12km horizontal resolution. Although it is a relatively small benchmark, the performance characteristics have been verified to be quite similar to the “on-node” performance characteristics of higher-resolution test cases (e.g., “CONUS 2.5km”) — especially when the higher-resolution cases are parallelized across multiple nodes.



Note that the execution time varies from about 120 seconds to 210 seconds — this range is large compared to the deviations between the model and the observed execution time.

Note also that the slope of the Model 1 fit is almost 6% off of the desired value of 1.0, while the second model is within 1%.

  • In the 2007 SPECfp_rate tests, a similar phenomenon was seen, and required the addition of a third component to the model: memory latency.
  • In these tests, we did not have the same ability to vary memory latency that I had with the 2007 Opteron systems. In these “real-application” tests, IO is not negligible (while it is required to be <1% of execution time for the SPEC benchmarks), and allowing for a small invariant IO time gave much better results.
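A non-overlapping model with a constant IO term can be fitted by ordinary least squares. The sketch below assumes the model has the form T = W_cpu/R_cpu + W_mem/R_mem + T_io, where R_cpu is the CPU frequency and R_mem is a memory rate (for example relative DRAM frequency). That form, the function names, and the synthetic numbers are assumptions made for illustration, not the exact formulation or data from the slides.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_model(runs):
    """runs: list of (cpu_rate, mem_rate, seconds).
    Returns least-squares estimates (w_cpu, w_mem, t_io)."""
    rows = [(1.0 / f, 1.0 / r, 1.0) for f, r, _ in runs]
    times = [t for _, _, t in runs]
    # Normal equations (A^T A) x = A^T t for the three unknowns.
    ata = [[sum(row[i] * row[j] for row in rows) for j in range(3)]
           for i in range(3)]
    atb = [sum(row[i] * t for row, t in zip(rows, times)) for i in range(3)]
    return tuple(solve_linear(ata, atb))


def predict(params, cpu_rate, mem_rate):
    """Model time: CPU time + memory time + constant IO time, with no overlap."""
    w_cpu, w_mem, t_io = params
    return w_cpu / cpu_rate + w_mem / mem_rate + t_io


# Synthetic example: timings generated from known (made-up) parameters.
true_params = (200.0, 60.0, 5.0)
synthetic_runs = [(f, r, predict(true_params, f, r))
                  for f in (1.6, 2.0, 2.6, 3.1)
                  for r in (1.0, 1.333, 1.6, 1.867)]
fitted = fit_model(synthetic_runs)
```

Dropping the constant column from `rows` gives the two-component variant, which makes it easy to compare how far the slope of modeled versus observed time departs from 1.0 in each case.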

Bottom bars are model CPU time – easy to see the quantization.
Middle bars are model Memory time.
Top bars are (constant) IO time.


Drum roll, please….




Ordered differently, but the same sort of picture.
Here the quantization of memory time is visible across the four groups of varying CPU frequencies.


These NAMD results are not at all surprising — NAMD has extremely high cache re-use and therefore very low rates of main memory access — but it was important to see if this testing methodology replicated this expected result.


Big change of direction here….
At the beginning I said that I was assuming that there would be no overlap across the execution times associated with the various work components.
The extremely close agreement between the model results and observations strongly supports the effectiveness of this assumption.
On the other hand, overlap is certainly possible, so what can this methodology provide for us in the way of bounds on the degree of overlap?


On some HW it is possible (but very rare) to get timings (slightly) outside these bounds – I ignore such cases here.
Note that the maximum ratio of upper bound over lower bound is equal to “N”, the number of degrees of freedom of the model! This is an uncomfortably large range of uncertainty, making it even more important to understand bounds on overlap.
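The factor of N is easy to see numerically. This sketch assumes the bounds are the usual ones: with perfect overlap the total time equals the largest component time, and with no overlap it equals the sum of the component times. The function name and the example times are made up.

```python
def overlap_bounds(component_times):
    """Return (lower, upper) bounds on total execution time.
    lower: every component fully overlapped with the largest one.
    upper: no overlap at all, so the component times simply add."""
    return max(component_times), sum(component_times)


# Worst case: N equal components give an upper/lower ratio of exactly N.
equal_lower, equal_upper = overlap_bounds([40.0, 40.0, 40.0])

# A dominant component gives much tighter bounds.
skewed_lower, skewed_upper = overlap_bounds([100.0, 20.0, 5.0])
```

With three equal components the bounds are 40 and 120 (a ratio of 3), while for the skewed case they are 100 and 125 (a ratio of 1.25).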


Physically this is saying that there can’t be so much work in any of the components that processing that work would exceed the total time observed.
But the goal is to learn something about the ratios of the work components, so we need to go further.



These numbers come from plugging in synthetic performance numbers from a model with variable overlap into the bounds analysis.
Message 1: If you want tight formal bounds on overlap, you need to be able to vary the “rate” parameters over a large range — probably too large to be practical.
Message 2: If one of the estimated time components is small and you cannot vary the corresponding rate parameter over a large enough range, it may be impossible to tell whether the work component is “fully overlapped” or is associated with a negligible amount of work (e.g., the lower bound in the “2:1 R_mem” case in this figure). (See next slide.)







Posted in Computer Architecture, Performance | Comments Off on The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Intel’s future “CLDEMOTE” instruction

Posted by John D. McCalpin, Ph.D. on 18th February 2019

I recently saw a reference to a future Intel “Atom” core called “Tremont” and ran across an interesting new instruction, “CLDEMOTE”, that will be supported in “Future Tremont and later” microarchitectures (ref: “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference”, document 319433-035, October 2018).

The “CLDEMOTE” instruction is a “hint” to the hardware that it might help performance to move a cache line from the cache level(s) closest to the core to a cache level that is further from the core.

What might such a hint be good for?   There are two obvious use cases:

  • Temporal Locality Control:  The cache line is expected to be re-used, but not so soon that it should remain in the closest/smallest cache.
  • Cache-to-Cache Intervention Optimization:  The cache line is expected to be accessed soon by a different core, and cache-to-cache interventions may be faster if the data is not in the closest level(s) of cache.
    • Intel’s instruction description mentions this use case explicitly.

If you are not a “cache hint instruction” enthusiast, this may not seem like a big deal, but it actually represents a relatively important shift in instruction design philosophy.

Instructions that directly pertain to caching can be grouped into three categories:

  1. Mandatory Control
    • The specified cache transaction must take place to guarantee correctness.
    • E.g., In a system with some non-volatile memory, a processor must have a way to guarantee that dirty data has been written from the (volatile) caches to the non-volatile memory.    The Intel CLWB instruction was added for this purpose — see https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction
  2. “Direct” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of specific transactions on a model architecture (with caveats that an implementation might do something different).
    • E.g., Intel’s PREFETCHW instruction requests that the cache line be loaded in anticipation of a store to the line.
      • This allows the cache line to be brought into the processor in advance of the store.
      • More importantly, it also allows the cache coherence transactions associated with obtaining exclusive access to the cache line to be started in advance of the store.
  3. “Indirect” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of the semantics of the program, not in terms of specific cache transactions (though specific transactions might be provided as an example).
    • E.g., “Push For Sharing Instruction” (U.S. Patent 8099557) is a hint to the processor that the current process is finished working on a cache line and that another processing core in the same coherence domain is expected to access the cache line next.  The hardware should move the cache line and/or modify its state to minimize the overhead that the other core will incur in accessing this line.
      • I was the lead inventor on this patent, which was filed in 2008 while I was working at AMD.
      • The patent was an attempt to generalize my earlier U.S. Patent 7194587, “Localized Cache Block Flush Instruction”, filed while I was at IBM in 2003.
    • Intel’s CLDEMOTE instruction is clearly very similar to my “Push For Sharing” instruction in both philosophy and intent.

 

Even though I have contributed to several patents on cache control instructions, I have decidedly mixed feelings about the entire approach.  There are several issues at play here:

  • The gap between processor and memory performance continues to increase, making performance more and more sensitive to the effective use of caches.
  • Cache hierarchies are continuing to increase in complexity, making it more difficult to understand what to do for optimal performance — even if precise control were available.
    • This has led to the near-extinction of “bottom-up” performance analysis in computing — first among customers, but also among vendors.
  • The cost of designing processors continues to increase, with caches & coherence playing a significant role in the increased cost of design and validation.
  • Technology trends show that the power associated with data motion (including cache coherence traffic) has come to far exceed the power required by computations, and that the ratio will continue to increase.
    • This does not currently dominate costs (as discussed in the SC16 talk cited above), but that is in large part because the processors have remained expensive!
    • Decreasing processor cost will require simpler designs — this decreases the development cost that must be recovered and simultaneously reduces the barriers to entry into the market for processors (allowing more competition and innovation).

Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power.  More of the same is not an answer — new thinking is required.

If one starts from current technology (rather than the technology of the late 1980’s), one would naturally design architectures to address the primary challenges:

  • “Vertical” movement of data (i.e., “private” data moving up and down the levels of a memory hierarchy) must be explicitly controllable.
  • “Horizontal” movement of data (e.g., “shared” data used to communicate between processing elements) must be explicitly controllable.

Continuing to apply “band-aids” to the transparent caching architecture of the 1980’s will not help move the industry toward the next disruptive innovation.

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Comments Off on Intel’s future “CLDEMOTE” instruction

New Year’s Updates

Posted by John D. McCalpin, Ph.D. on 9th January 2019

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:

Posted in Computer Architecture, Performance | 2 Comments »

SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Posted by John D. McCalpin, Ph.D. on 7th January 2019

Here are the annotated slides from my SC18 presentation on Snoop Filter Conflicts that cause performance variability in HPL and DGEMM on the Xeon Platinum 8160 processor.

This slide presentation includes data (not included in the paper) showing that Snoop Filter Conflicts occur in all Intel Scalable Processors (a.k.a., “Skylake Xeon”) with 18 cores or more, and also occur on the Xeon Phi x200 processors (“Knights Landing”).

The published paper is available (with ACM subscription) at https://dl.acm.org/citation.cfm?id=3291680


This is less boring than it sounds!


A more exciting version of the title.


This story is very abridged — please read the paper!



Execution times only — no performance counters yet.

500 nodes tested, but only 392 nodes had the 7 runs needed for a good computation of the median performance.

Dozens of different nodes showed slowdowns of greater than 5%.


I measured memory bandwidth first simply because I had the tools to do this easily.
Read memory controller performance counters before and after each execution and compute DRAM traffic.
Write traffic was almost constant — only the read traffic showed significant variability.
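The before/after bookkeeping looks roughly like the sketch below. It assumes the counters read are per-channel CAS counts (reads or writes), that each CAS transfers one 64-byte cache line, and that the counters are 48 bits wide so that a wrap between the two reads can be corrected. Those assumptions and the function names are mine, not taken from the slides.

```python
CACHE_LINE_BYTES = 64   # ASSUMPTION: one CAS moves one 64-byte cache line
COUNTER_BITS = 48       # ASSUMPTION: width of the memory controller counters


def counter_delta(before, after, bits=COUNTER_BITS):
    """Counts between two reads, correcting for a single counter wrap."""
    return (after - before) % (1 << bits)


def dram_bytes(before, after):
    """Total DRAM traffic in bytes from per-channel CAS counts
    read before and after the run."""
    return CACHE_LINE_BYTES * sum(counter_delta(b, a)
                                  for b, a in zip(before, after))
```

Calling `dram_bytes` once with the read CAS counts and once with the write CAS counts separates the two kinds of traffic, which is what makes it possible to see that only the reads vary.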


It is important to decouple the sockets for several reasons.  (1) Each socket manages its frequency independently to remain within the Running Average Power Limit. (2) Cache coherence is managed differently within and between sockets.
The performance counter infrastructure is at https://github.com/jdmccalpin/periodic-performance-counters
Over 25,000 DGEMM runs in total, generating over 240 GiB of performance counter output.


I already saw that slow runs were associated with higher DRAM traffic, but needed to find out which level(s) of the cache were experiencing extra load misses.
The strongest correlation between execution time and cache miss counts was with L2 misses (measured here as L2 cache fills).

The variation of L2 fills for the full-speed runs is surprisingly large, but the slow runs all have L2 fill counts that are at least 1.5x the minimum value.
Some runs tolerate increased L2 fill counts up to 2x the minimum value, but all cases with >2x L2 fills are slow.

This chart looks at the sum of L2 fills for all the cores on the chip — next I will look at whether these misses are uniform across the cores.


I picked 15-20 cases in which a “good” trial (at or above median performance) was followed immediately by a “slow” trial (at least 20% below median performance).
This shows the L2 Fills by core for a “good” trial — the red dashed line corresponds to the minimum L2 fill count from the previous chart divided by 24 to get the minimum per-core value.
Different sets of cores and different numbers of cores had high counts in each run — even on the same node.
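Picking out the “good trial followed immediately by a slow trial” cases can be automated. This sketch uses the thresholds quoted above (good means at or above the median, slow means at least 20% below the median); the function name and the sample numbers are invented for illustration.

```python
from statistics import median


def good_then_slow(perf, slow_fraction=0.20):
    """Indices i where trial i is "good" and trial i+1 is "slow".
    perf: performance of consecutive trials on one node (higher is better)."""
    med = median(perf)
    slow_limit = (1.0 - slow_fraction) * med
    return [i for i in range(len(perf) - 1)
            if perf[i] >= med and perf[i + 1] <= slow_limit]


# Made-up performance values for seven consecutive trials.
example_perf = [100.0, 101.0, 75.0, 99.0, 100.0, 102.0, 100.0]
example_pairs = good_then_slow(example_perf)
```

Here the median is 100, so only the pair starting at index 1 (101 followed by 75) qualifies.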


This adds the “slow” execution that immediately followed the “good” execution.
For the slow runs, most of the cores had highly elevated L2 counts.  Again, different sets of cores and different numbers of cores had high counts in each run.

This data provides a critical clue:  Since L2 caches are private and 2MiB pages fully control the L2 cache index bits, the extra L2 cache misses must be caused by something *external* to the cores.


The Snoop Filter is essentially the same as the directory tags of the inclusive L3 cache of previous Intel Xeon processors, but without room to store the data for the cache lines tracked.
The key concept is “inclusivity” — lines tracked by a Snoop Filter entry must be invalidated before that Snoop Filter entry can be freed to track a different cache line address.


I initially found some poorly documented core counters that looked like they were related to Snoop Filter evictions, then later found counters in the “uncore” that count Snoop Filter evictions directly.
This allowed direct confirmation of my hypothesis, as summarized in the next slides.


About 1% of the runs are more than 10% slower than the fastest run.


Snoop Filter Evictions clearly account for the majority of the excess L2 fills.

But there is one more feature of the “slow” runs….


For all of the “slow” runs, the DRAM traffic is increased.  This means that a fraction of the data evicted from the L2 caches by the Snoop Filter evictions was also evicted from the L3 cache, and so must be retrieved from DRAM.

At high Snoop Filter conflict rates (>4e10 Snoop Filter evictions), all of the cases have elevated DRAM traffic, with something like 10%-15% of the Snoop Filter evictions missing in the L3 cache.

There are some cases in the range of 100-110 seconds that have elevated snoop filter evictions, but not elevated DRAM reads, that show minimal slowdowns.

This suggests that DGEMM can tolerate the extra latency of L2 miss/L3 hit for its data, but not the extra latency of L2 miss/L3 miss/DRAM hit.


Based on my experience in processor design groups at SGI, IBM, and AMD, I wondered if using contiguous physical addresses might avoid these snoop filter conflicts….


Baseline with 2MiB pages.


With 1GiB pages, the tail almost completely disappears in both width and depth.


Zooming in on the slowest 10% of the runs shows no evidence of systematic slowdowns when using 1GiB pages.
The performance counter data confirms that the snoop filter eviction rate is very small.

So we have a fix for single-socket DGEMM, what about HPL?


Intel provided a test version of their optimized HPL benchmark in December 2017 that supported 1GiB pages.

First, I verified that the performance variability for single-node (2-socket) HPL runs was eliminated by using 1GiB pages.

The variation across nodes is strong (due to different thermal characteristics), but the variation across runs on each node is extremely small.

The 8.6% range of average performance for this set of 31 nodes increases to >12% when considering the full 1736-node SKX partition of the Stampede2 system.

So we have a fix for single-node HPL, what about the full system?


Intel provided a full version of their optimized HPL benchmark in March 2018 and we ran it on the full system in April 2018.

The estimated breakdown of performance improvement into individual contributions is a ballpark estimate — it would be a huge project to measure the details at this scale.

The “practical peak performance” of this system is 8.77 PFLOPS on the KNLs plus 3.73 PFLOPS on the SKX nodes, for 12.5 PFLOPS “practical peak”.  The 10.68 PFLOPS obtained is about 85% of this peak performance.


During the review of the paper, I was able to simplify the test further to allow quick testing on other systems (and larger ensembles).

This is mostly new material (not in the paper).


https://github.com/jdmccalpin/SKX-SF-Conflicts

This lets me more directly address my hypothesis about conflicts with contiguous physical addresses, since each 1GiB page is much larger than the 24 MiB of aggregate L2 cache.


It turns out I was wrong — Snoop Filter Conflicts can occur with contiguous physical addresses on this processor.

The pattern repeats every 256 MiB.

If the re-used space is in the 1st 32 MiB of any 1GiB page, there will be no Snoop Filter Conflicts.

What about other processors?


I tested Skylake Xeon processors with 14, 16, 18, 20, 22, 24, 26, 28 cores, and a 68-core KNL (Xeon Phi 7250).

These four processors are the only ones that show Snoop Filter Conflicts with contiguous physical addresses.

But with random 2MiB pages, all processors with more than 16 cores show Snoop Filter conflicts for some combinations of addresses…..


These are average L2 miss rates — individual cores can have significantly higher miss rates (and the maximum miss rate may be the controlling factor in performance for multi-threaded codes).

The details are interesting, but no time in the current presentation….



Overall, the uncertainty associated with this performance variability is probably more important than the performance loss.

Using performance counter measurements to look for codes that are subject to this issue is a serious “needle in haystack” problem — it is probably easier to choose codes that might have the properties above and test them explicitly.


Cache-contained shallow water model, cache-contained FFTs.


The new DGEMM implementation uses dynamic scheduling of the block updates to decouple the memory access patterns.  There is no guarantee that this will alleviate the Snoop Filter Conflict problem, but in this case it does.


I now have a model that predicts all Snoop Filter Conflicts involving 2MiB pages on the 24-core SKX processors.
Unfortunately, the zonesort approach won’t work under Linux because pages are allocated on a “first-come, first-served” basis, so the fine control required is not possible.

An OS with support for page coloring (such as BSD) could be modified to provide this mitigation.


Again, the inability of Linux to use the virtual address as a criterion for selecting the physical page to use will prevent any sorting-based approach from working.


Intel has prepared a response.  If you are interested, you should ask your Intel representative for a copy.





 

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture, Linux, Performance Counters | Comments Off on SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors