John McCalpin's blog

Dr. Bandwidth explains all….

“Memory directories” in Intel processors

Posted by John D. McCalpin, Ph.D. on August 28, 2023

One of the (many) minimally documented features of recent Intel processor implementations is the “memory directory”.   This is used in multi-socket systems to reduce cache coherence traffic between sockets.

I have referred to this in various presentations as:

“A Memory Directory is one or more bits per cache line in DRAM that tell the processor whether another socket might have a dirty copy of the cache line.”
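
The definition above can be illustrated with a toy model.  This is a sketch of my mental model, not Intel's disclosed implementation — the state tracking and write-counting policy shown here are assumptions:

```python
# Toy model of a per-cache-line "memory directory" bit.  This is a
# simplified sketch, NOT Intel's disclosed implementation; the policy
# shown here is an assumption for illustration.

class MemoryDirectory:
    def __init__(self):
        self.dir_bit = {}        # address -> "remote socket might have a dirty copy"
        self.dram_writes = 0     # directory updates require DRAM writes

    def local_read(self, addr):
        # Read from the home socket: if the bit is clear, the DRAM copy
        # is known to be up to date and no cross-socket snoop is needed.
        return self.dir_bit.get(addr, False)   # True -> snoop required

    def remote_read(self, addr):
        # A line returned to a remote socket in E state (or M, via RFO)
        # can be modified silently, so the directory bit must be set --
        # and that update is itself a DRAM write.
        if not self.dir_bit.get(addr, False):
            self.dir_bit[addr] = True
            self.dram_writes += 1

d = MemoryDirectory()
d.remote_read(0x1000)                  # remote socket takes the line in E state
assert d.local_read(0x1000) is True    # local reads must now snoop the remote socket
assert d.local_read(0x2000) is False   # untouched line: DRAM copy is authoritative
assert d.dram_writes == 1
```

The payoff of the clean-bit case is that reads to lines no other socket has touched can be satisfied from DRAM with no cross-socket traffic at all.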

When asked for references to public information about Intel’s memory directory implementation, I had to go back and find the various tiny bits of information scattered around different places.   That was boring, so I am posting my notes here so I can find them again — maybe others will find this useful as well….


Existence:

The clearest admission that the memory directory feature exists is in this “technical overview” presentation – in the section “Directory-Based Coherency” – https://www.intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-technical-overview.html. The language is not as precise as I would prefer, but there are actually quite a few interesting details in that section.

History of Implementations:

Intel has disclosed various details in presentations at the Hot Chips conference series:

(I apologize in advance if the links are broken — it is hard to keep up with web site reorganizations — but the presentations should be relatively easy to find using web search services.)

Implicit Information:

Much of the detailed understanding of the memory directory implementation comes from studying the uncore performance counter events that make reference to memory directories.

As an example, there are direct references to memory directories in several sections of the Intel Uncore Performance Monitoring Guides for their various processor families.    These documents are linked from: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

The version that I have spent the most time with is for the 1st and 2nd generations of Intel Xeon Scalable Processors (“Skylake Xeon” and “Cascade Lake Xeon”):  “Intel® Xeon® Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual”, Intel document 336274.

Specific references in this manual include:

  • CHA Events “DIR_LOOKUP” and “DIR_UPDATE”
  • IMC Event
    • IMC_WRITES_COUNT includes: “NOTE: Directory bits are stored in memory. Remote socket RFOs will result in a directory update which, in turn, will cause a write command.”
    • Aside: it would be easy to be misled by this description.  Remote socket RFOs will result in a directory update but so will ordinary reads that return data in E state (the default for data reads that hit unshared lines).
  • M2M Events “DIRECTORY_HIT”, “DIRECTORY_LOOKUP”, “DIRECTORY_MISS”, “DIRECTORY_UPDATE”
    • These include information on directory states – very helpful for understanding the implementation.
  • Section 3.1.3 “Reference for M2M Packet Matching”, Table 3-9 “SMI3 Opcodes”, mentions the directory in the description of 10 of the 18 transaction types.

It is helpful to compare and contrast the documentation from all of the recent processor generations.  The specific items disclosed in each generation are not the same, and sometimes a generation will have a more verbose explanation that fills in gaps for the other generations.

Reading the documentation is seldom enough to understand the implementation — I typically have to create customized microbenchmarks to generate known counts of transactions of various types and then compare the measured performance counter values with the values I expected to generate.   When there are significant differences, more thinking is required.  Thinking does not always help — sometimes the performance counter event is just broken, sometimes there is just not enough information disclosed to derive reasonable bounds on the implementation, and sometimes the implementation has strongly dynamic/adaptive behavior based on activity metrics that are not visible or not controllable using available HW features.

OEM Vendor Disclosures:

I have run across a few other bits of information scattered around the interwebs:

Encoding the Memory Directory bits:

In various presentations I have said that the “one or more” memory directory bits were “hidden” in the ECC bits.  This is easy to do – standard 64-bit DRAM interfaces with ECC deliver 72 bits with every “64-bit” data transfer and generate 8 contiguous transfers per transaction.   64 bits of data requires 8 bits of ECC for SECDED on every transfer.  Computing the ECC over P transfers only requires log2(P) additional bits, so aggregation frees up 8*(P-1) – log2(P) of the 8*P delivered ECC bits for other uses.  Intel used this aggregation approach very aggressively on the Knights Corner accelerator to store the SECDED ECC bits for 64-Byte cache lines in regular memory – reserving every 32nd cache line for that purpose.
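
The SECDED sizing rule behind this is standard Hamming-code arithmetic, and the numbers are easy to verify:

```python
# SECDED sizing: a Hamming code over k data bits needs the smallest r
# with 2**r >= k + r + 1 check bits, plus one more parity bit for
# double-error detection.  This shows why computing the ECC over a whole
# cache line (P = 8 transfers) frees up bits in the 72-bit interface
# for uses such as directory state.

def secded_bits(k):
    r = 1
    while 2 ** r < k + r + 1:
        r += 1
    return r + 1                       # +1 overall parity bit for DED

assert secded_bits(64) == 8            # per-transfer ECC: 8 bits
assert secded_bits(512) == 11          # whole-line ECC: 8 + log2(8) bits

P = 8                                  # transfers per transaction
spare = 8 * P - secded_bits(64 * P)    # ECC bits delivered minus bits needed
print(spare)                           # 53 bits freed per 64-Byte line
```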


Are there any other important references that I am missing?


Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Comments Off on “Memory directories” in Intel processors

The evolution of single-core bandwidth in multicore processors

Posted by John D. McCalpin, Ph.D. on April 25, 2023

The primary metric for memory bandwidth in multicore processors is the maximum sustained performance when using many cores.  For most high-end processors these values have remained in the range of 75% to 85% of the peak DRAM bandwidth of the system over the past 15-20 years — an amazing accomplishment given the increase in core count (with its associated cache coherence issues), number of DRAM channels, and ever-increasing pipelining of the DRAMs themselves.

This post is about a secondary performance characteristic — sustained memory bandwidth for a single thread running on a single core.  This metric is interesting because we don’t always have the luxury of parallelizing every application we run, and our operating systems almost always process each call (e.g., buffer copies for filesystem access) with a single thread.  In my own experience, I have found that systems with higher single-core bandwidth feel “snappier” when used for interactive work — editing, compiling, debugging, etc.

With that in mind, I decided to mine some historical data (and run a few new experiments) to see how single-thread/single-core memory bandwidth has evolved over the last 10-15 years.  Some of the results were initially surprising, but were all consistent with the fundamental “physics” of bandwidth represented by Little’s Law (lots more on that below).

Looking at sustained single-core bandwidth for a kernel composed of 100% reads, the trends for a large set of high-end AMD and Intel processors are shown in the figure below:

So from 2010 to 2023, the sustainable single-core bandwidth increased by about 2x on Intel processors and about 5x on AMD processors.

Are these “good” improvements?  The table below may provide some perspective:

                       1-core BW   1-core GFLOPS   all-core BW                all-core GFLOPS
2023 vs 2010 (Intel)   ~2x         ~5x             ~10x (DDR5) / ~30x (HBM)   >40x
2023 vs 2010 (AMD)     ~5x         ~5x             ~20x                       ~30x


The single-core bandwidth on the Intel systems is clearly continuing to fall behind the single-core compute capability, and a single core is able to exploit a smaller and smaller fraction of the bandwidth available to a socket.  The single-core bandwidth on AMD systems is increasing at about the same rate as the single-core compute capability, but is also able to sustain a decreasing fraction of the bandwidth of the socket.

These observations naturally give rise to a variety of questions.  Some I will address today, and some in the next blog entry.

Questions and Answers:

  1. Why use a “read-only” memory access pattern?  Why not something like STREAM?
  2. Multicore processors have huge amounts of available DRAM bandwidth – maybe it does not even make sense for a single core to try to use that much?
    • Any recent Intel processor core (Skylake Xeon or newer) has a peak cache bandwidth of (at least) two 64-Byte reads plus one 64-Byte write per cycle.  At a single-core frequency of 3.0 GHz, this is a read BW of 384 GB/s – higher than the full socket bandwidth of 307.2 GB/s with 8 channels of DDR5/4800 DRAM.  I don’t expect all of that, but the core can clearly make use of more than 20 GB/s.
  3. Why is the single-core bandwidth increasing so slowly?
    • To understand what is happening here, we need to understand the way memory bandwidth interacts with memory latency and the concurrency (parallelism) of memory accesses.
    • That is the topic of the next blog entry.  (“Real Soon Now”)
  4. Can this problem be fixed?
    • Sure! We don’t need to violate any physical laws to increase single-core bandwidth — we just need the design to support the very high levels of memory parallelism required, and to provide us with a way of generating that parallelism from application codes.
    • The NEC Vector Engine processors provide a demonstration of very high single-core bandwidth.   On a VE20B (8 cores, 1.6 GHz, 1530 GB/s peak BW from 6 HBM stacks),  I see single-thread sustained memory bandwidth of 304 GB/s on the ReadOnly benchmark used here.
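
The arithmetic behind the answer to question 2 is easy to check, using the core frequency and DRAM configuration quoted above:

```python
# Back-of-the-envelope check of the bandwidth numbers in item 2 above.

# Per-core peak L1 read bandwidth: two 64-Byte loads per cycle at 3.0 GHz.
core_read_bw = 2 * 64 * 3.0e9 / 1e9     # GB/s
assert core_read_bw == 384.0

# Full-socket DRAM bandwidth: 8 channels of DDR5/4800, 8 Bytes wide each.
socket_bw = 8 * 8 * 4.8e9 / 1e9         # GB/s
assert socket_bw == 307.2

# A single core's cache interface exceeds the entire socket's DRAM bandwidth.
print(core_read_bw > socket_bw)         # True
```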


Stay tuned!


Posted in Computer Architecture, Computer Hardware, Performance | Comments Off on The evolution of single-core bandwidth in multicore processors

Why don’t we talk about bisection bandwidth any more?

Posted by John D. McCalpin, Ph.D. on April 11, 2022


Is it unimportant just because we are not talking about it?

I was recently asked for comments about the value of increased bisection bandwidth in computer clusters for high performance computing. That got me thinking that the architecture of most internet news/comment infrastructures is built around “engagement” — effectively amplifying signals that are already strong and damping those that do not meet an interest threshold. While that approach certainly has a role, as a scientist it is also important for me to take note of topics that might be important, but which are currently generating little content. Bisection bandwidth of clustered computers seems to be one such topic….


Some notes relevant to the history of the discussion of bisection bandwidth in High Performance Computing

There has not been a lot of recent discussion of high-bisection-bandwidth computers, but this was a topic that came up frequently in the late 1990’s and early 2000’s. A good reference is the 2002 Report on High End Computing for the National Security Community, which I participated in as the representative of IBM (ref 1). The section starting on page 35 (“Systems, Architecture, Programmability, and Components Working Group”) discusses two approaches to “supercomputing” – one focusing on aggregating cost-effective peak performance (the “type T” (Transistor) systems), and the other focusing on providing the tightest integration and interconnect performance (“type C” (Communication) systems). A major influence on this distinction was Burton Smith, founder of Tera Computer Company and developer of the Tera MTA system. The Tera MTA was a unique architecture with no caches, with memory distributed/interleaved/hashed across the entire system, and with processors designed to tolerate the memory latency and to effectively synchronize on memory accesses (using “full/empty” metadata bits in the memory).

The 2002 report led fairly directly to the DARPA High Productivity Computing Systems (HPCS) project (2002-2010), which provided direct funding to several companies to develop hardware and software technologies to make supercomputers significantly easier to use. Phase 1 was just some seed money to write proposals, and included about 7 companies. Phase 2 was a much larger ($50M over 3 years to each recipient, if I recall correctly) set of grants for the companies to do significant high-level design work. Phase 3 grants ($200M-$250M over 5 years to each recipient) were awarded to Cray and IBM (ref 2). I was (briefly) the large-scale systems architecture team lead on the IBM Phase 2 project in 2004.

Both the Cray and IBM projects were characterized by a desire to improve effective bisection bandwidth, and both used hierarchical all-to-all interconnect topologies.

(If I recall correctly), the Cray project funded the development of the “Cascade” interconnect which eventually led to the Cray XC30 series of supercomputers (ref 3). Note that the Cray project funded only the interconnect development, while standard AMD and/or Intel processors were used for compute. The inability to influence the processor design limited what Cray was able to do with the interconnect. The IBM grant paid for the development of the “Torrent” bridge/switch chip for an alternate version of an IBM Power7-based server. Because IBM was developing both the processor and the interconnect chip, there was more opportunity to innovate.

I left IBM at the end of 2005 (near the end of Phase 2). IBM did eventually complete the implementation funded by DARPA, but backed out of the “Blue Waters” system at NCSA (ref 4) – an NSF-funded supercomputer that was very deliberately pitched as a “High Productivity Computing System” to benefit from the DARPA HPCS development projects. I have no inside information from the period when IBM backed out, but it is easy to suspect that IBM wanted/needed more money than was available in the budget to deliver the full-scale system. The “Blue Waters” system was replaced by a Cray – but since the DARPA-funded “Cascade” interconnect was not ready, Blue Waters used an older implementation (and older AMD processors). IBM sold only a handful of small Power7 systems with the “Torrent” interconnect, mostly to weather forecasting centers. As far as I can tell, none of the systems were large enough for the Torrent interconnect to show off its (potential) usefulness.

So after about 10 years of effort and a DARPA HPCS budget of over $600M, the high performance computing community got the interconnect for the Cray XC30. I think it was a reasonably good interconnect, but I never used it. TACC just retired our Cray XC40 and its interconnect performance was fine, but not revolutionary. After this unimpressive return on investment, it is perhaps not surprising that there has been a lack of interest in funding for high-bisection-bandwidth systems.

That is not to say that there is an unambiguous market for high-bisection-bandwidth systems! It is easy enough to identify application areas and user communities who will claim to “need” increased bandwidth, but historically they have failed to follow up with actual purchases of the few high-bandwidth systems that have been offered over the years. A common factor in the history of the HPC market has been that given enough time, most users who “demand” special characteristics will figure out how to work around the lack with more software optimization efforts, or with changes to their strategy for computing. A modest fraction will give up on scaling their computations to larger sizes and will just make do with the more modest ongoing improvements in single-node performance to advance their work.

Architectural Considerations

When considering either local memory bandwidth or global bandwidth, a key principle to keep in mind is concurrency. Almost all processor designs have very limited available concurrency per core, so lots of cores are needed just to generate enough cache misses to fill the local memory pipeline. As an example, TACC has recently deployed a few hundred nodes using the Intel Xeon Platinum 8380 processor (3rd generation Xeon Scalable Processor, aka “Ice Lake Xeon”). This has 40 cores and 8 channels of DDR4/3200 DRAM. The latency for memory access is about 80ns. So the latency-bandwidth product is

204.8 GB/s * 80 nanoseconds = 16384 bytes, or 256 cache lines

Each core can directly generate about 10-12 cache misses, and can indirectly generate a few more via the L2 hardware prefetchers – call it 16 concurrent cache line requests per core, so 16 of the (up to) 40 cores are required just to fill the pipeline to DRAM.

For IO-based interconnects, latencies are much higher and bandwidths are lower. Current state of the art for a node-level interconnect is 200 Gbit, such as the InfiniBand HDR fabric. The high latency favors a “put” model of communication, with one-way RDMA put latency at about 800 ns (unloaded) through a couple of switch hops. The latency-bandwidth product is (200/8 GB/s)*(800 ns) = 20000 Bytes. This is only slightly higher than the local-memory case, but two caveats apply: (1) the bandwidth is fairly low, and (2) the 800 ns latency applies only to “put” operations – it must be doubled for “get” operations.
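
Both latency-bandwidth products follow directly from Little's Law (bytes in flight = bandwidth × latency), using the values quoted above:

```python
# Little's Law: bytes in flight = bandwidth * latency, expressed here in
# cache lines.  Values are the ones quoted in the surrounding text.

CACHE_LINE = 64  # bytes

def lines_in_flight(bw_gb_per_s, latency_ns):
    # GB/s * ns = bytes, conveniently with no unit-scaling factors
    return bw_gb_per_s * latency_ns / CACHE_LINE

# Local DRAM: 8 channels of DDR4/3200 = 204.8 GB/s at ~80 ns latency.
print(round(lines_in_flight(204.8, 80)))   # 256 cache lines (16384 bytes)

# Interconnect: 200 Gbit/s = 25 GB/s at ~800 ns one-way "put" latency.
print(lines_in_flight(25, 800))            # 312.5 cache lines (20000 bytes)

# At ~16 concurrent cache-line requests per core, filling the local DRAM
# pipeline requires 256 / 16 = 16 cores.
```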

A couple of notes:

  • The local latency of 80 ns is dominated by a serialized check of the three levels of the cache (with the 3rd level hashed/interleaved around the chip in Intel processors), along with the need to traverse many asynchronous boundaries. (An old discussion on this blog Memory Latency Components 2011-03-10 is a good start, but newer systems have a 3rd level of cache and an on-chip 2D mesh interconnect adding to the latency components.) The effective local latency could be reduced significantly by using a “scratchpad” memory architecture and a more general block-oriented interface (to more effectively use the open page features of the DRAM), but it would be very challenging to get below about 30 ns read latency.
  • The IO latency of ~800 ns for a put operation is dominated by the incredibly archaic architecture of IO systems. It is hard to get good data, but it typically requires 300-400 ns for a core’s uncached store operation to reach its target PCIe-connected device. This is absolutely not necessary from a physics perspective, but it cannot easily be fixed without a holistic approach to re-architecting the system to support communication and synchronization as first-class features. One could certainly design processors that could put IO requests on external pins in 20 ns or less – then the speed of light term becomes important in the latency equation (as it should be).

It has been a while since I reviewed the capabilities of optical systems, but the considerations above usually make extremely high bisection bandwidth effectively un-exploitable with current processor architectures.

Future technology developments (e.g., AttoJoule Optoelectronics for Low-Energy Information Processing and Communication) may dramatically reduce the cost, but a new architecture will be required to reduce the latency enough to make the bandwidth useful.

References:

  1. DOD High Performance Computing for the National Security Community, Final Report (2003-04-09)
  2. HPCWire 2006-11-24 DARPA selects Cray and IBM for final phase of HPCS-1
  3. HPCWire 2012-11-08 Cray launches Cascade: embraces Intel-based supercomputing
  4. Wikipedia: Blue Waters

Posted in Computer Architecture, Computer Hardware, Computer Interconnects | Comments Off on Why don’t we talk about bisection bandwidth any more?

Disabled Core Patterns and Core Defect Rates in Intel Xeon Phi x200 (Knights Landing) Processors

Posted by John D. McCalpin, Ph.D. on October 27, 2021

Defect rates and chip yields in the fabrication of complex semiconductor chips (like processors) are typically very tightly held secrets.  In the current era of multicore processors even the definition of “yield” requires careful thinking — companies have adapted their designs to tolerate defects in single processor cores, allowing them to sell “partial good” die at lower prices.  This has been good for everyone — the vendors get to sell more of the “stuff” that comes off the manufacturing line, and the customers have a wider range of products to choose from.

Knowing that many processor chips use “partial good” die does not usually help customers to infer anything about the yield of a multicore chip at various core counts, even when purchasing thousands of chips.  It is possible that the Xeon Phi x200 (“Knights Landing”, “KNL”) processors are in a different category — one that allows statistically interesting inferences to be drawn about core defect rates.

Why is the Xeon Phi x200 different?

  • It was developed for (and is of interest to) almost exclusively customers in the High Performance Computing (HPC) market.
  • The chip has 76 cores (in 38 pairs), and the only three product offerings had 64, 68, or 72 cores enabled.
    •  No place to sell chips with many defects.
  • The processor core is slower than most mainstream Xeon processors in both frequency and instruction level parallelism.
    • No place to sell chips that don’t meet the frequency requirements.

The Texas Advanced Computing Center currently runs 4200 compute servers using the Xeon Phi 7250 (68-core) processor.   The first 504 were installed in June 2016 and the remaining 3696 were installed in April 2017.  Unlike the mainstream Xeon processors, the Xeon Phi x200 enables any user to determine which physical cores on the die are disabled, simply by running the CPUID instruction on each active logical processor to obtain that core’s X2APIC ID (used by the interrupt controller).  There is a 1:1 correspondence between the X2APIC IDs and the physical core locations on the die, so any cores that are disabled will result in missing X2APIC values in the list.  More details on the X2APIC IDs are in the technical report “Observations on Core Numbering and “Core ID’s” in Intel Processors” and more details on the mapping of X2APIC IDs to locations on the die are in the technical report Mapping Core, CHA, and Memory Controller Numbers to Die Locations in Intel Xeon Phi x200 (“Knights Landing”, “KNL”) Processors.
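
The “missing X2APIC values” inference can be sketched in a few lines.  This sketch assumes 4 logical processors per physical core with the thread number in the low 2 bits of the X2APIC ID (the KNL arrangement); see the linked technical reports for the details of the real encoding:

```python
# Sketch of inferring which physical cores are disabled from the set of
# observed X2APIC IDs.  Assumes 4 logical processors per physical core,
# thread number in the low 2 bits of the X2APIC ID (as on KNL); the
# real encoding details are in the technical reports linked above.

def disabled_cores(observed_x2apic_ids, total_cores=76):
    active = {apic_id >> 2 for apic_id in observed_x2apic_ids}
    return sorted(set(range(total_cores)) - active)

# Hypothetical example: a 4-core part with core 2 fused off, so X2APIC
# IDs 8-11 never appear in the enumeration.
ids = [0, 1, 2, 3, 4, 5, 6, 7, 12, 13, 14, 15]
assert disabled_cores(ids, total_cores=4) == [2]
```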

The lists of disabled cores were collected at various points over the last 4.5 years, and at some point during the COVID-19 pandemic I decided to look at them.  The first result was completely expected — cores are always enabled/disabled in pairs.  This matches the way they are placed on the die: each of the 38 “tiles” contains 2 cores, a 1 MiB shared L2 cache, and a coherence agent.   The second result was unexpected — although every tile had disabled cores in at least some processors, there were four tile positions where the cores were disabled 15x-20x more often than average.   In “Figure 5” below, these “preferred” tiles were the ones immediately above and below the memory controllers IMC0 and IMC1 on the left and right sides of the chip — numbers 2, 8, 27, 37.

Numbering and locations of CHAs and memory controllers in Xeon Phi x200 processors.

After reviewing the patterns in more detail, it seemed that these four “preferred” locations could be considered “spares”.  The cores at the other 34 tiles would be enabled if they were functional, and if any of those tiles had a defect, a “spare” would be enabled to compensate.  If true, this would be a very exciting result because it means that even though every one of the 4200 chips has exactly 4 tiles with disabled cores, the presence of disabled cores anywhere other than the “preferred” locations indicated a defect.  If there were no defects on the chip (or only defects in the spare tiles themselves), then the only four tiles with disabled cores would be 2, 8, 27, 37.  This was actually the case for about 1290 of the 4200 chips.

The number of chips with disabled cores at each of the 34 “standard” (non-preferred) locations varied rather widely, but looked random.    Was there any way to evaluate whether the results were consistent with a model of a small number of random defects, with those cores being replaced by activating cores in the spare tiles?  Yes, there is, and for the statistically minded you can read all about it in the technical report Disabled Core Patterns and Core Defect Rates in Xeon Phi x200 (“Knights Landing”) Processors. The report contains all sorts of mind-numbing discussions of “truncated binomial distributions”, corrections for visibility of defects, and statistical significance tests for several different views of the data — but it does have brightly colored charts and graphs to attempt to offset those soporific effects.

For the less statistically minded, the short description is:

  • For the 504 processors deployed in June 2016, the average number of “defects” was 1.38 per chip.
  • For the 3696 processors deployed in April 2017, the average number of “defects” was 1.19 per chip.
  • The difference in these counts was very strongly statistically significant (3.7 standard deviations).
  • Although some of the observed values are slightly outside the ranges expected for a purely random process, the overall pattern is strongly consistent with a model of random, independent defects.

These are very good numbers — for the full cluster the average number of defects is projected to be 1.36 per chip (including an estimate of defects in the unused “spare” tiles).  With these defect rates, only about 1% of the chips would be expected to have more than 4 defects — and almost all of these would still suffice for the 64-core model.
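
The ~1% figure is easy to sanity-check if we assume defects are independent and approximately Poisson distributed with the projected mean — an approximation of my own, since the technical report uses truncated binomial models:

```python
# Sanity check of the "only about 1% of the chips would be expected to
# have more than 4 defects" estimate, assuming independent defects
# approximated by a Poisson distribution with the projected mean of
# 1.36 defects per chip.  (The technical report uses truncated binomial
# models; this simpler approximation is mine.)
import math

lam = 1.36                                   # projected mean defects per chip
p_le_4 = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(5))
p_gt_4 = 1.0 - p_le_4
print(f"P(defects > 4) = {p_gt_4:.2%}")      # roughly 1.3%
```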

So does this have anything to do with “yield”?  Probably not a whole lot — all of these chips require that all 8 Embedded DRAM Controllers (EDCs) are fully functional, all 38 Coherence Agents are fully functional, both DDR4 memory controllers are fully functional, and the IO blocks are fully functional.  There is no way to infer how many chips might be lost due to failures in any of those parts because there were no product offerings that allowed any of those blocks to be disabled.  But from the subset of chips that had all the “non-core” parts working, these results paint an encouraging picture with regard to defect rates for the cores.


Posted in Computer Architecture, Computer Hardware | Comments Off on Disabled Core Patterns and Core Defect Rates in Intel Xeon Phi x200 (Knights Landing) Processors

Mapping addresses to L3/CHA slices in Intel processors

Posted by John D. McCalpin, Ph.D. on September 10, 2021

Starting with the Xeon E5 processors “Sandy Bridge EP” in 2012, all of Intel’s mainstream multicore server processors have included a distributed L3 cache with distributed coherence processing. The L3 cache is divided into “slices”, which are distributed around the chip — typically one “slice” for each processor core.
Each core’s L1 and L2 caches are local and private, but outside the L2 cache addresses are distributed in a random-looking way across the L3 slices all over the chip.

As an easy case, for the Xeon Gold 6142 processor (1st generation Xeon Scalable Processor with 16 cores and 16 L3 slices), every aligned group of 16 cache line addresses is mapped so that one of those 16 cache lines is assigned to each of the 16 L3 slices, using an undocumented permutation generator. The total number of possible permutations of the L3 slice numbers [0…15] is 16! (almost 21 trillion), but measurements on the hardware show that only 16 unique permutations are actually used. The observed sequences are the “binary permutations” of the sequence [0,1,2,3,…,15]. The “binary permutation” operator can be described in several different ways, but the structure is simple:

  • binary permutation “0” of a sequence is just the original sequence
  • binary permutation “1” of a sequence swaps elements in each even/odd pair, e.g.
    • Binary permutation 1 of [0,1,2,3,4,5,6,7] is [1,0,3,2,5,4,7,6]
  • binary permutation “2” of a sequence swaps pairs of elements in each set of four, e.g.,
    • Binary permutation 2 of [0,1,2,3,4,5,6,7] is [2,3,0,1,6,7,4,5]
  • binary permutation “3” of a sequence performs both permutation “1” and permutation “2”
  • binary permutation “4” of a sequence swaps 4-element halves of each set of 8 elements, e.g.,
    • Binary permutation 4 of [0,1,2,3,4,5,6,7] is [4,5,6,7,0,1,2,3]

Binary permutation operators are very cheap to implement in hardware, but are limited to sequence lengths that are a power of 2.  When the number of slices is not a power of 2, using binary permutations requires that you create a power-of-2-length sequence that is bigger than the number of slices, and which contains each slice number approximately an equal number of times.  As an example, the Xeon Scalable Processors (gen1 and gen2) with 24 L3 slices use a 512-element sequence that contains each of the values 0…15 21 times and each of the values 16…23 22 times (16*21 + 8*22 = 512).   This almost-uniform “base sequence” is then permuted using the 512 binary permutations that are possible for a 512-element sequence.

Intel does not publish the length of the base sequences, the values in the base sequences, or the formulas used to determine which binary permutation of the base sequence will be used for any particular address in memory.

Over the years, a few folks have investigated the properties of these mappings, and have published a small number of results — typically for smaller L3 slice counts.

Today I am happy to announce the availability of the full base sequences and the binary permutation selector equations for many Intel processors.  The set of systems includes:

  • Xeon Scalable Processors (gen1 “Skylake Xeon” and gen2 “Cascade Lake Xeon”) with 14, 16, 18, 20, 22, 24, 26, 28 L3 slices
  • Xeon Scalable Processors (gen3 “Ice Lake Xeon”) with 28 L3 slices
  • Xeon Phi x200 Processors (“Knights Landing”) with 38 Snoop Filter slices

The results for the Xeon Scalable Processors (all generations) are based on my own measurements.  The results for the Xeon Phi x200 are based on the mappings published by Kommrusch, et al. (e.g., https://arxiv.org/abs/2011.05422), but re-interpreted in terms of the “base sequence” plus “binary permutation” model used for the other processors.

The technical report and data files (base sequences and permutation selector masks for each processor) are available at https://hdl.handle.net/2152/87595

Have fun!

Next up — using these address-to-L3/CHA mapping results to understand observed L3 and Snoop Filter conflicts in these processors….



Posted in Cache Coherence Implementations, Computer Architecture | Comments Off on Mapping addresses to L3/CHA slices in Intel processors

Die Locations of Cores and L3 Slices for Intel Xeon Processors

Posted by John D. McCalpin, Ph.D. on May 27, 2021

Intel provides nice schematic diagrams of the layouts of their processor chips, but provides no guidance on how the user-visible core numbers and L3 slice numbers map to the locations on the die.
Most of the time there is no “need” to know the locations of the units, but there are many performance analyses that do require it, and it is often extremely helpful to be able to visualize the flow of data (and commands and acknowledgements) on the two-dimensional mesh.

In 2018 I spent a fair amount of time developing methodologies to determine the locations of the user-visible core and CHA/SF/LLC numbers on the Xeon Phi 7250 and Xeon Platinum 8160 processors. It required a fair amount of time because some tricks that Intel used to make it easier to design the photolithographic masks had the side effect of modifying the directions of up/down/left/right in different parts of the chip! When combined with the unknown locations of the disabled cores and CHAs, this was quite perplexing….

The Xeon Scalable Processors (Skylake, Cascade Lake, and the new Ice Lake Xeon) “mirror” the photolithographic mask for the “Tile” (Core + CHA/SF/LLC) in alternating columns, causing the meanings of “left” and “right” in the mesh traffic performance counters to alternate as well. This is vaguely hinted at by some of the block diagrams of the chip (such as Figure 5 of the Xeon Scalable Family Technical Overview), but is clearer in the die photo:

Intel Skylake Xeon die photo

Here I have added light blue boxes around the 14 (of 28) Tile locations on the die that have the normal meanings of “left” and “right” in the mesh data traffic counters. The Tiles that don’t have blue boxes around them are left-right mirror images of the “normal” cores, and at these locations the mesh data traffic counters report mesh traffic with “left” and “right” reversed. NOTE that the 3rd Generation Intel Xeon Scalable Processors (Ice Lake Xeon) show the same column-by-column reversal as the Skylake Xeon, leading to the same behavior in the mesh data traffic counters.


TACC Frontera System

For the Xeon Platinum 8280 processors in the TACC Frontera system, all 28 of the Tiles are fully enabled, so there are no disabled units at unknown locations to cause the layout and numbering to differ from socket to socket. In each socket, the CHA/SF/LLC blocks are numbered top-to-bottom, left-to-right, skipping over the memory controllers:

The pattern of Logical Processor numbers will vary depending on whether the numbers alternate between sockets (even in 0, odd in 1) or are block-distributed (first half in 0, second half in 1). For the TACC Frontera system, all of the nodes are configured with logical processors alternating between sockets, so all even-numbered logical processors are in socket 0 and all odd-numbered logical processors are in socket 1. For this configuration, the locations of the Logical Processor numbers in socket 0 are:

In socket 1 the layout is the same, but with each Logical Processor number incremented by 1.

More details are in TR-2021-01b (link below in the references).

TACC Stampede2 System

“Skylake Xeon” partitions

For the Xeon Platinum 8160 processors in the TACC Stampede2 system, 24 of the Tiles are fully enabled and the remaining 4 Tiles have disabled Cores and disabled CHA/SF/LLCs. For these processors, administrative privileges are required to read the configuration registers that allow one to determine the locations of the CHA/SF/LLC units and the Logical Processors. There are approximately 120 different patterns of disabled tiles across the 3472 Xeon Platinum 8160 processors (1736 2-socket nodes) in the Stampede2 “SKX” partitions. The pattern of disabled cores generally has negligible impact on performance, but one needs to know the locations of the cores and CHA/SF/LLC blocks to make any sense of the traffic on the 2D mesh. Fortunately only one piece of information is needed on these systems — the CAPID6 register tells which CHA locations on the die are enabled, and these systems have a fixed mapping of Logical Processor numbers to co-located CHA numbers — so it would not be hard to make this information available to interested users (if any exist).

More details are in TR-2021-01b (link below in the references).

“Knights Landing” (“KNL”) partitions

For the 4200 Stampede2 nodes with Xeon Phi 7250 processors, all 38 CHA/SF units are active in each chip, and 34 of the 38 tiles have an active pair of cores. Since all 38 CHAs are active, their locations are the same from node to node:


For these processors the information required to determine the locations of the cores is available from user space (i.e., without special privileges). The easiest way to get it is to read “/proc/cpuinfo” and extract the “core id” field for each “processor” field. Since each core supports four threads, each “core id” value appears four times in the full file. Each tile has two cores, so dividing a “core id” by two (integer division) gives the tile number where that core is located. A specific example showing the Logical Processor number, the “core id”, and the corresponding “tile” location:

c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:39:28 $ grep ^processor /proc/cpuinfo  | head -68 | awk '{print $NF}' > tbl.procs
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:39:55 $ grep "^core id" /proc/cpuinfo  | head -68 | awk '{print $NF}' > tbl.coreids
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:40:22 $ grep "^core id" /proc/cpuinfo  | head -68 | awk '{print int($NF/2)}' > tbl.tiles
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:40:32 $ paste tbl.procs tbl.coreids tbl.tiles 
0	0	0
1	1	0
2	2	1
3	3	1
4	4	2
5	5	2
6	6	3
7	7	3
8	8	4
9	9	4
10	10	5
11	11	5
12	12	6
13	13	6
14	14	7
15	15	7
16	16	8
17	17	8
18	18	9
19	19	9
20	22	11
21	23	11
22	24	12
23	25	12
24	26	13
25	27	13
26	28	14
27	29	14
28	30	15
29	31	15
30	32	16
31	33	16
32	34	17
33	35	17
34	36	18
35	37	18
36	38	19
37	39	19
38	40	20
39	41	20
40	42	21
41	43	21
42	44	22
43	45	22
44	46	23
45	47	23
46	48	24
47	49	24
48	50	25
49	51	25
50	56	28
51	57	28
52	58	29
53	59	29
54	60	30
55	61	30
56	62	31
57	63	31
58	64	32
59	65	32
60	66	33
61	67	33
62	68	34
63	69	34
64	70	35
65	71	35
66	72	36
67	73	36

For each Logical Processor (column 1), the tile number is in column 3, and the location of the tile is in the figure above.

Since the tile numbers are [0..37], from this list we see that 10, 26, 27, and 37 are missing, so these are the tiles with disabled cores.
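The core-id-to-tile arithmetic above is simple enough to sketch in Python. This is an illustration, not a replacement for the shell commands shown earlier; the `core_ids` list would come from parsing /proc/cpuinfo (one entry per core).

```python
def tiles_from_core_ids(core_ids, total_tiles=38):
    """Given the 'core id' values read from /proc/cpuinfo, return
    the tile number for each core and the sorted list of tiles
    whose cores are disabled.  Each KNL tile holds two cores, so
    tile = core_id // 2; tiles that never appear in the list are
    the ones with disabled cores."""
    tiles = [cid // 2 for cid in core_ids]
    disabled = sorted(set(range(total_tiles)) - set(tiles))
    return tiles, disabled
```

Feeding in the 68 core ids from the listing above (with `total_tiles=38`) would return [10, 26, 27, 37] as the disabled tiles, matching the observation in the text.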

More details are in TR-2020-01 and in TR-2021-02 (links below in the references).


Presentations:

Detailed References:


What is a CHA/SF/LLC ? This is a portion of each Tile containing a “Coherence and Home Agent” slice, a “Snoop Filter” slice, and a “Last Level Cache” slice. Each physical address in the system is mapped to exactly one of the CHA/SF/LLC blocks for cache coherence and last-level caching, so that (1) any core in the system will automatically make use of all of the LLC slices, and (2) each CHA/SF/LLC has to handle approximately equal amounts of work when all the cores are active.



Posted in Computer Architecture, Computer Hardware, Performance Counters | Comments Off on Die Locations of Cores and L3 Slices for Intel Xeon Processors

The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Posted by John D. McCalpin, Ph.D. on April 2, 2020

This was a keynote presentation at the “2nd International Workshop on Performance Modeling: Methods and Applications” (PMMA16), June 23, 2016, Frankfurt, Germany (in conjunction with ISC16).

The presentation discusses a family of simple performance models that I developed over the last 20 years — originally in support of processor and system design at SGI (1996-1999), IBM (1999-2005), and AMD (2006-2008), but more recently in support of system procurements at The Texas Advanced Computing Center (TACC) (2009-present).

More notes interspersed below….


Slide01


Most of TACC’s supercomputer systems are national resources, open to (unclassified) scientific research in all areas. We have over 5,000 direct users (logging into the systems and running jobs) and tens of thousands of indirect users (who access TACC resources via web portals). With a staff of slightly over 175 full-time employees (less than 1/2 in consulting roles), we must therefore focus on highly-leveraged performance analysis projects, rather than labor-intensive ones.


An earlier presentation on this topic (including extensions of the method to incorporate cost modeling) is from 2007: “System Performance Balance, System Cost Balance, Application Balance, & the SPEC CPU2000/CPU2006 Benchmarks” (invited presentation at the SPEC Benchmarking Joint US/Europe Colloquium, June 22, 2007, Dresden, Germany).


This data is from the 2007 presentation. All of the SPECfp_rate2000 results were downloaded from www.spec.org, the results were sorted by processor type, and “peak floating-point operations per cycle” was manually added for each processor type. This includes all architectures, all compilers, all operating systems, and all system configurations. It is not surprising that there is a lot of scatter, but the factor of four range in Peak MFLOPS at fixed SPECfp_rate2000/core and the factor of four range in SPECfp_rate2000/core at fixed Peak MFLOPS was higher than I expected….


(Also from the 2007 presentation.) To show that I can criticize my own work as well, here I show that sustained memory bandwidth (using an approximation to the STREAM Benchmark) is also inadequate as a single figure of merit. (It is better than peak MFLOPS, but still has roughly a factor of three range when projecting in either direction.)


Here I assumed a particular analytical function for the amount of memory traffic as a function of cache size to scale the bandwidth time.
Details are not particularly important since I am trying to model something that is a geometric mean of 14 individual values and the results are across many architectures and compilers.
Doing separate models for the 14 benchmarks does not reduce the variance much further – there is about 15% that remains unexplainable in such a broad dataset.

The model can provide much better fit to the data if the HW and SW are restricted, as we will see in the next section…



Why no overlap? The model actually includes some kinds of overlap — this will be discussed in the context of specific models below — and can be extended to include overlap between components. The specific models and results that will be presented here fit the data better when it is assumed that there is no overlap between components. Bounds on overlap are discussed near the end of the presentation, in the slides titled “Analysis”.


The approach is opportunistic. When I started this work over 20 years ago, most of the parameters I was varying could only be changed in the vendor’s laboratory. Over time, the mechanisms introduced for reducing energy consumption (first in laptops) became available more broadly. In most current machines, memory frequency can be configured by the user at boot time, while CPU frequency can be varied on a live system.


The assumption that “memory accesses overlap with other memory accesses about as well as they do in the STREAM Benchmark” is based on trying lots of other formulations and getting poor consistency with observations.
Note that “compute” is not a simple metric like “instructions” or “floating-point operations”. T_cpu is best understood as the time that the core requires to execute the particular workload of interest in the absence of memory references.
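The two-component version of this model can be sketched as an ordinary least-squares fit. This is a synthetic illustration of the sensitivity-based approach with hypothetical units and data, not the author's actual fitting code: execution time is modeled as T = W_cpu/f_cpu + W_mem/BW, and the two "work" parameters are recovered from runs at several core-frequency and memory-bandwidth settings.

```python
def fit_two_component(times, cpu_freqs, mem_bws):
    """Least-squares fit of the non-overlapping model
        T = W_cpu / f_cpu + W_mem / BW
    to execution times measured at several (core frequency,
    memory bandwidth) settings.  Returns (W_cpu, W_mem).
    Solves the 2x2 normal equations for T ~ a*x + b*y with
    x = 1/f_cpu and y = 1/BW."""
    xs = [1.0 / f for f in cpu_freqs]
    ys = [1.0 / b for b in mem_bws]
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxt = sum(x * t for x, t in zip(xs, times))
    syt = sum(y * t for y, t in zip(ys, times))
    det = sxx * syy - sxy * sxy
    w_cpu = (sxt * syy - syt * sxy) / det
    w_mem = (syt * sxx - sxt * sxy) / det
    return w_cpu, w_mem
```

If the workload really is non-overlapping, runs at (for example) two core frequencies crossed with two memory-frequency settings are enough to recover both components; the quality of the fit across many settings is what tests the no-overlap assumption.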


Only talking about CPU2006 results today – the CPU2000 results look similar (see the 2007 presentation linked above), but the CPU2000 benchmark codes are less closely related to real applications.


Building separate models for each of the benchmarks was required to get the correct asymptotic properties. The geometric mean used to combine the individual benchmark results into a single metric is the right way to combine relative performance improvements with equal weighting for each code, but it is inconsistent with the underlying “physics” of computer performance for each of the individual benchmark results.
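The property that makes the geometric mean the right way to combine relative improvements can be shown in a few lines. This toy example (with made-up benchmark times) demonstrates that a geometric-mean score preserves the ratio between two systems regardless of which reference machine is chosen, which an arithmetic mean does not.

```python
from math import prod

def geomean(xs):
    return prod(xs) ** (1.0 / len(xs))

def score(sys_times, ref_times):
    """SPEC-style score: geometric mean of (reference time /
    system time) over the individual benchmarks."""
    return geomean([r / s for r, s in zip(ref_times, sys_times)])

# Hypothetical per-benchmark times for two systems and
# two candidate reference machines
sysA = [2.0, 8.0]
sysB = [4.0, 4.0]
ref1 = [1.0, 1.0]
ref2 = [1.0, 4.0]
```

The ratio score(A)/score(B) is the same under either reference machine, so equal-weighted relative comparisons survive the averaging — but, as noted above, the combined metric no longer obeys the additive "physics" that each individual benchmark's execution time does.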


This system would be considered a “high-memory-bandwidth” system at the time these results were collected. In other words, this system would be expected to be CPU-limited more often than other systems (when running the same workload), because it would be memory-bandwidth limited less often. This system also had significantly lower memory latency than many contemporary systems (which were still using front-side bus architectures and separate “NorthBridge” chips).


Many of these applications (e.g., NAMD, Gamess, Gromacs, DealII, WRF, and MILC) are major consumers of cycles on TACC supercomputers (albeit newer versions and different datasets).


The published SPEC benchmarks are no longer useful to support this sensitivity-based modeling approach for two main reasons:

  1. Running N independent copies of a benchmark simultaneously on N cores has a lot of similarities with running a parallelized implementation of the benchmark when N is small (2 or 4, or maybe a little higher), but the performance characteristics diverge as N gets larger (certainly dubious by the time one reaches even half the core count of today’s high-end processors).
  2. For marketing reasons, the published results since 2007 have been limited almost exclusively to the configurations that give the very best results. This includes always running with HyperThreading enabled (and running one copy of the executable on each “logical processor”), always running with automatic parallelization enabled (making it very difficult to compare “speed” and “rate” results, since it is not clear how many cores are used in the “speed” tests), always running with the optimum memory configuration, etc.

The “CONUS 12km” benchmark is a simulation of the weather over the “CONtinental US” at 12km horizontal resolution. Although it is a relatively small benchmark, the performance characteristics have been verified to be quite similar to the “on-node” performance characteristics of higher-resolution test cases (e.g., “CONUS 2.5km”) — especially when the higher-resolution cases are parallelized across multiple nodes.



Note that the execution time varies from about 120 seconds to 210 seconds — this range is large compared to the deviations between the model and the observed execution time.

Note also that the slope of the Model 1 fit is almost 6% off of the desired value of 1.0, while the second model is within 1%.

  • In the 2007 SPECfp_rate tests, a similar phenomenon was seen, and required the addition of a third component to the model: memory latency.
  • In these tests, we did not have the same ability to vary memory latency that I had with the 2007 Opteron systems. In these “real-application” tests, IO is not negligible (while it is required to be <1% of execution time for the SPEC benchmarks), and allowing for a small invariant IO time gave much better results.

Bottom bars are model CPU time – easy to see the quantization.
Middle bars are model Memory time.
Top bars are (constant) IO time.


Drum roll, please….




Ordered differently, but the same sort of picture.
Here the quantization of memory time is visible across the four groups of varying CPU frequencies.


These NAMD results are not at all surprising — NAMD has extremely high cache re-use and therefore very low rates of main memory access — but it was important to see if this testing methodology replicated this expected result.


Big change of direction here….
At the beginning I said that I was assuming that there would be no overlap across the execution times associated with the various work components.
The extremely close agreement between the model results and observations strongly supports the effectiveness of this assumption.
On the other hand, overlap is certainly possible, so what can this methodology provide for us in the way of bounds on the degree of overlap?


On some HW it is possible (but very rare) to get timings (slightly) outside these bounds – I ignore such cases here.
Note that maximum ratio of upper bound over lower bound is equal to “N” – the degrees of freedom of the model! This is an uncomfortably large range of uncertainty – making it even more important to understand bounds on overlap.
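The bounds themselves are simple to state in code. This is a minimal sketch of the observation above: given per-component time estimates, full overlap gives the lower bound on total time (the largest single component) and zero overlap gives the upper bound (the sum), so with N equal components the ratio of the bounds is exactly N.

```python
def overlap_bounds(component_times):
    """Bounds on total execution time given per-component time
    estimates, with no other information about overlap:
      lower bound: everything perfectly overlapped (the max),
      upper bound: nothing overlapped (the sum)."""
    lower = max(component_times)
    upper = sum(component_times)
    return lower, upper
```

For three equal 10-second components the bounds are 10 and 30 seconds — a factor of N = 3 range of uncertainty, which is why further constraints on overlap are needed.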


Physically this is saying that there can’t be so much work in any of the components that processing that work would exceed the total time observed.
But the goal is to learn something about the ratios of the work components, so we need to go further.



These numbers come from plugging in synthetic performance numbers from a model with variable overlap into the bounds analysis.
Message 1: If you want tight formal bounds on overlap, you need to be able to vary the “rate” parameters over a large range — probably too large to be practical.
Message 2: If one of the estimated time components is small and you cannot vary the corresponding rate parameter over a large enough range, it may be impossible to tell whether the work component is fully overlapped or is associated with a negligible amount of work (e.g., the lower bound in the “2:1 R_mem” case in this figure). (See next slide.)








Posted in Computer Architecture, Performance | Comments Off on The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Timing Methodology for MPI Programs

Posted by John D. McCalpin, Ph.D. on March 4, 2019

While working on the implementation of the MPI version of the STREAM benchmark, I realized that there were some subtleties in timing that could easily lead to inaccurate and/or misleading results.  This post is a transcription of my notes as I looked at the issues….

Primary requirement: I want a measure of wall clock time that is guaranteed to start before any rank does work and to end after all ranks have finished their work.

Secondary goal: I also want the start time to be as late as possible relative to the initiation of work by any rank, and for the end time to be as early as possible relative to the completion of the work by all ranks.

I am not particularly concerned about OS scheduling issues, so I can assume that the timers execute very close to the completion of the preceding statement and the initiation of the subsequent statement.  Any deviations caused by stalls between timers, barriers, and work must be in the direction of increasing the reported time, not decreasing it.  (This is a corollary of my primary requirement.)
The discussion here will be based on a simple example, where the “t” variables are (local) wall clock times for MPI rank k and WORK() represents the parallel workload that I am testing.
Generically, I want:
      t_start(k) = time()
      WORK()
      t_end(k) = time()
but for an MPI job, the methodology needs to be provably correct for arbitrary (real) skew across nodes as well as for arbitrary offsets between the absolute time across nodes.  (I am deliberately ignoring the rare case in which a clock is modified on one or more nodes during a run — most time protocols try hard to avoid such shifts, and instead change the rate at which the clock is incremented to drive synchronization.)
After some thinking, I came up with this pseudo-code, which is executed independently by each MPI rank (indexed by “k”):
      t0(k) = time()
      MPI_barrier()
      t1(k) = time()

      WORK()

      t2(k) = time()
      MPI_barrier()
      t3(k) = time()
If the clocks are synchronized, then all I need is:
    tstart = min(t1(k)), k=1..numranks
    tstop  = max(t2(k)), k=1..numranks
If the clocks are not synchronized, then I need to make some use of the barriers — but exactly how?
In the pseudo-code above, the barriers ensure that the following two statements are true:
  • For the start time, t0(k) is guaranteed to be earlier than the initiation of any work.
  • For the end time, t3(k) is guaranteed to be later than the completion of any work.
These statements are true for each rank individually, so the tightest estimate available from the collection of t0(k) and t3(k) values is:
      tstop - tstart = min(t3(k)-t0(k)),   k=1..numranks
This gives a (tstop – tstart) that is at least as large as the time required for the actual work plus the time required for the two MPI_barrier() operations.
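The key claim — that min over ranks of (t3(k) − t0(k)) bounds the true work interval regardless of clock offsets between nodes — can be checked with a small plain-Python simulation. This is not MPI code; it models idealized (zero-cost) barriers and timers, with each rank described by a hypothetical (clock offset, pre-barrier delay, work time) triple.

```python
def measured_time(ranks):
    """Simulate the t0 / barrier / WORK / barrier / t3 pattern.
    Each rank is (clock_offset, pre_barrier_delay, work_time),
    where clock_offset is that node's (arbitrary) difference from
    global time.  A barrier releases everyone when the last rank
    arrives.  Returns (min over ranks of t3(k)-t0(k), true span
    from the first work start to the last work end)."""
    delays = [d for (_, d, _) in ranks]
    release1 = max(delays)               # first barrier exit (global time)
    release2 = max(release1 + w for (_, _, w) in ranks)  # second barrier exit
    t0 = [off + d for (off, d, _) in ranks]     # local clock, before barrier 1
    t3 = [off + release2 for (off, _, _) in ranks]  # local clock, after barrier 2
    true_span = release2 - release1
    return min(b - a for a, b in zip(t0, t3)), true_span
```

In this idealized simulation the measurement equals the true span exactly (real barriers add their own cost, so the measurement can only grow), and changing the per-node clock offsets leaves the result unchanged — each t3(k) − t0(k) is a difference of two timestamps from the same local clock.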

Posted in Performance, Reference | Comments Off on Timing Methodology for MPI Programs

Intel’s future “CLDEMOTE” instruction

Posted by John D. McCalpin, Ph.D. on February 18, 2019

I recently saw a reference to a future Intel “Atom” core called “Tremont” and ran across an interesting new instruction, “CLDEMOTE”, that will be supported in “Future Tremont and later” microarchitectures (ref: “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference”, document 319433-035, October 2018).

The “CLDEMOTE” instruction is a “hint” to the hardware that it might help performance to move a cache line from the cache level(s) closest to the core to a cache level that is further from the core.

What might such a hint be good for?   There are two obvious use cases:

  • Temporal Locality Control:  The cache line is expected to be re-used, but not so soon that it should remain in the closest/smallest cache.
  • Cache-to-Cache Intervention Optimization:  The cache line is expected to be accessed soon by a different core, and cache-to-cache interventions may be faster if the data is not in the closest level(s) of cache.
    • Intel’s instruction description mentions this use case explicitly.

If you are not a “cache hint instruction” enthusiast, this may not seem like a big deal, but it actually represents a relatively important shift in instruction design philosophy.

Instructions that directly pertain to caching can be grouped into three categories:

  1. Mandatory Control
    • The specified cache transaction must take place to guarantee correctness.
    • E.g., In a system with some non-volatile memory, a processor must have a way to guarantee that dirty data has been written from the (volatile) caches to the non-volatile memory.    The Intel CLWB instruction was added for this purpose — see https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction
  2. “Direct” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of specific transactions on a model architecture (with caveats that an implementation might do something different).
    • E.g., Intel’s PREFETCHW instruction requests that the cache line be loaded in anticipation of a store to the line.
      • This allows the cache line to be brought into the processor in advance of the store.
      • More importantly, it also allows the cache coherence transactions associated with obtaining exclusive access to the cache line to be started in advance of the store.
  3. “Indirect” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of the semantics of the program, not in terms of specific cache transactions (though specific transactions might be provided as an example).
    • E.g., “Push For Sharing Instruction” (U.S. Patent 8099557) is a hint to the processor that the current process is finished working on a cache line and that another processing core in the same coherence domain is expected to access the cache line next.  The hardware should move the cache line and/or modify its state to minimize the overhead that the other core will incur in accessing this line.
      • I was the lead inventor on this patent, which was filed in 2008 while I was working at AMD.
      • The patent was an attempt to generalize my earlier U.S. Patent 7194587, “Localized Cache Block Flush Instruction”, filed while I was at IBM in 2003.
    • Intel’s CLDEMOTE instruction is clearly very similar to my “Push For Sharing” instruction in both philosophy and intent.

 

Even though I have contributed to several patents on cache control instructions, I have decidedly mixed feelings about the entire approach.  There are several issues at play here:

  • The gap between processor and memory performance continues to increase, making performance more and more sensitive to the effective use of caches.
  • Cache hierarchies are continuing to increase in complexity, making it more difficult to understand what to do for optimal performance — even if precise control were available.
    • This has led to the near-extinction of “bottom-up” performance analysis in computing — first among customers, but also among vendors.
  • The cost of designing processors continues to increase, with caches & coherence playing a significant role in the increased cost of design and validation.
  • Technology trends show that the power associated with data motion (including cache coherence traffic) has come to far exceed the power required by computations, and that the ratio will continue to increase.
    • This does not currently dominate costs (as discussed in the SC16 talk cited above), but that is in large part because the processors have remained expensive!
    • Decreasing processor cost will require simpler designs — this decreases the development cost that must be recovered and simultaneously reduces the barriers to entry into the market for processors (allowing more competition and innovation).

Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power.  More of the same is not an answer — new thinking is required.

If one starts from current technology (rather than the technology of the late 1980’s), one would naturally design architectures to address the primary challenges:

  • “Vertical” movement of data (i.e., “private” data moving up and down the levels of a memory hierarchy) must be explicitly controllable.
  • “Horizontal” movement of data (e.g., “shared” data used to communicate between processing elements) must be explicitly controllable.

Continuing to apply “band-aids” to the transparent caching architecture of the 1980’s will not help move the industry toward the next disruptive innovation.


Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Comments Off on Intel’s future “CLDEMOTE” instruction

New Year’s Updates

Posted by John D. McCalpin, Ph.D. on January 9, 2019

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:


Posted in Computer Architecture, Performance | 2 Comments »