John McCalpin's blog

Dr. Bandwidth explains all….

Why don’t we talk about bisection bandwidth any more?

Posted by John D. McCalpin, Ph.D. on April 11, 2022


Is it unimportant just because we are not talking about it?

I was recently asked for comments about the value of increased bisection bandwidth in computer clusters for high performance computing. That got me thinking that the architecture of most internet news/comment infrastructures is built around “engagement” — effectively amplifying signals that are already strong and damping those that do not meet an interest threshold. While that approach certainly has a role, as a scientist it is also important for me to take note of topics that might be important, but which are currently generating little content. Bisection bandwidth of clustered computers seems to be one such topic….


Some notes relevant to the history of the discussion of bisection bandwidth in High Performance Computing

There has not been a lot of recent discussion of high-bisection-bandwidth computers, but this was a topic that came up frequently in the late 1990’s and early 2000’s. A good reference is the 2002 Report on High End Computing for the National Security Community, which I participated in as the representative of IBM (ref 1). The section starting on page 35 (“Systems, Architecture, Programmability, and Components Working Group”) discusses two approaches to “supercomputing” – one focusing on aggregating cost-effective peak performance (the “type T” (Transistor) systems), and the other focusing on providing the tightest integration and interconnect performance (“type C” (Communication) systems). A major influence on this distinction was Burton Smith, founder of Tera Computer Company and developer of the Tera MTA system. The Tera MTA was a unique architecture with no caches, with memory distributed/interleaved/hashed across the entire system, and with processors designed to tolerate the memory latency and to effectively synchronize on memory accesses (using “full/empty” metadata bits in the memory).

The 2002 report led fairly directly to the DARPA High Productivity Computing Systems (HPCS) project (2002-2010), which provided direct funding to several companies to develop hardware and software technologies to make supercomputers significantly easier to use. Phase 1 was just some seed money to write proposals, and included about 7 companies. Phase 2 was a much larger ($50M over 3 years to each recipient, if I recall correctly) set of grants for the companies to do significant high-level design work. Phase 3 grants ($200M-$250M over 5 years to each recipient) were awarded to Cray and IBM (ref 2). I was (briefly) the large-scale systems architecture team lead on the IBM Phase 2 project in 2004.

Both the Cray and IBM projects were characterized by a desire to improve effective bisection bandwidth, and both used hierarchical all-to-all interconnect topologies.

(If I recall correctly), the Cray project funded the development of the “Cascade” interconnect which eventually led to the Cray XC30 series of supercomputers (ref 3). Note that the Cray project funded only the interconnect development, while standard AMD and/or Intel processors were used for compute. The inability to influence the processor design limited what Cray was able to do with the interconnect. The IBM grant paid for the development of the “Torrent” bridge/switch chip for an alternate version of an IBM Power7-based server. Because IBM was developing both the processor and the interconnect chip, there was more opportunity to innovate.

I left IBM at the end of 2005 (near the end of Phase 2). IBM did eventually complete the implementation funded by DARPA, but backed out of the “Blue Waters” system at NCSA (ref 4) – an NSF-funded supercomputer that was very deliberately pitched as a “High Productivity Computing System” to benefit from the DARPA HPCS development projects. I have no inside information from the period that IBM backed out, but it is easy to suspect that IBM wanted/needed more money than was available in the budget to deliver the full-scale system. The “Blue Waters” system was replaced by a Cray – but since the DARPA-funded “Cascade” interconnect was not ready, Blue Waters used an older implementation (and older AMD processors). IBM sold only a handful of small Power7 systems with the “Torrent” interconnect, mostly to weather forecasting centers. As far as I can tell, none of the systems were large enough for the Torrent interconnect to show off its (potential) usefulness.

So after about 10 years of effort and a DARPA HPCS budget of over $600M, the high performance computing community got the interconnect for the Cray XC30. I think it was a reasonably good interconnect, but I never used it. TACC just retired our Cray XC40 and its interconnect performance was fine, but not revolutionary. After this unimpressive return on investment, it is perhaps not surprising that there has been a lack of interest in funding for high-bisection-bandwidth systems.

That is not to say that there is an unambiguous market for high-bisection-bandwidth systems! It is easy enough to identify application areas and user communities who will claim to “need” increased bandwidth, but historically they have failed to follow up with actual purchases of the few high-bandwidth systems that have been offered over the years. A common factor in the history of the HPC market has been that given enough time, most users who “demand” special characteristics will figure out how to work around the lack with more software optimization efforts, or with changes to their strategy for computing. A modest fraction will give up on scaling their computations to larger sizes and will just make do with the more modest ongoing improvements in single-node performance to advance their work.

Architectural Considerations

When considering either local memory bandwidth or global bandwidth, a key principle to keep in mind is concurrency. Almost all processor designs have very limited available concurrency per core, so lots of cores are needed just to generate enough cache misses to fill the local memory pipeline. As an example, TACC has recently deployed a few hundred nodes using the Intel Xeon Platinum 8380 processor (3rd generation Xeon Scalable Processor, aka “Ice Lake Xeon”). This has 40 cores and 8 channels of DDR4/3200 DRAM. The latency for memory access is about 80ns. So the latency-bandwidth product is

204.8 GB/s * 80 nanoseconds = 16384 bytes, or 256 cache lines

Each core can directly generate about 10-12 cache misses, and can indirectly generate a few more via the L2 hardware prefetchers – call it 16 concurrent cache line requests per core, so 16 of the (up to) 40 cores are required just to fill the pipeline to DRAM.

For IO-based interconnects, latencies are much higher, while bandwidths are lower. Current state of the art for a node-level interconnect is 200 Gbit/s, such as the InfiniBand HDR fabric. The high latency favors a “put” model of communication, with one-way RDMA put latency at about 800 ns (unloaded) through a couple of switch hops. The latency-bandwidth product is (200/8) GB/s * 800 ns = 20000 bytes. This is only slightly higher than the local-memory case, but two factors must be kept in mind: (1) the bandwidth is fairly low, and (2) the 800 ns latency applies only to “put” operations – it must be doubled for “get” operations.
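The latency-bandwidth arithmetic above is Little's Law in disguise; a quick sketch with the numbers from the text (the helper function is my own, purely for illustration):

```python
# Little's Law sketch of the latency-bandwidth products discussed above.
# The helper name is mine; the numbers are the examples from the text.

CACHE_LINE = 64  # bytes

def in_flight(bw_gb_per_s: float, latency_ns: float):
    """Bytes (and 64 B cache lines) that must be in flight to fill the pipe."""
    nbytes = bw_gb_per_s * latency_ns  # GB/s * ns conveniently yields bytes
    return nbytes, nbytes / CACHE_LINE

# Local DRAM on an Ice Lake Xeon: 8 channels of DDR4/3200 = 204.8 GB/s, ~80 ns
dram_bytes, dram_lines = in_flight(204.8, 80)
print(f"DRAM: {dram_bytes:.0f} bytes = {dram_lines:.0f} lines")  # 16384 bytes = 256 lines

# HDR InfiniBand: 200 Gbit/s = 25 GB/s, ~800 ns one-way "put" latency
ib_bytes, ib_lines = in_flight(25, 800)
print(f"IB:   {ib_bytes:.0f} bytes = {ib_lines:.1f} lines")      # 20000 bytes = 312.5 lines

# At ~16 concurrent cache-line requests per core:
print(f"cores to fill DRAM pipe: {dram_lines / 16:.0f}")         # 16
```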

A couple of notes:

  • The local latency of 80 ns is dominated by a serialized check of the three levels of the cache (with the 3rd level hashed/interleaved around the chip in Intel processors), along with the need to traverse many asynchronous boundaries. (An old discussion on this blog, Memory Latency Components 2011-03-10, is a good start, but newer systems have a 3rd level of cache and an on-chip 2D mesh interconnect adding to the latency components.) The effective local latency could be reduced significantly by using a “scratchpad” memory architecture and a more general block-oriented interface (to more effectively use the open-page features of the DRAM), but it would be very challenging to get below about 30 ns read latency.
  • The IO latency of ~800 ns for a put operation is dominated by the incredibly archaic architecture of IO systems. It is hard to get good data, but it typically requires 300-400 ns for a core’s uncached store operation to reach its target PCIe-connected device. This is absolutely not necessary from a physics perspective, but it cannot easily be fixed without a holistic approach to re-architecting the system to support communication and synchronization as first-class features. One could certainly design processors that could put IO requests on external pins in 20 ns or less – then the speed-of-light term becomes important in the latency equation (as it should be).

It has been a while since I reviewed the capabilities of optical systems, but the considerations above usually make extremely high bisection bandwidth effectively un-exploitable with current processor architectures.

Future technology developments (e.g., AttoJoule Optoelectronics for Low-Energy Information Processing and Communication) may dramatically reduce the cost, but a new architecture will be required to reduce the latency enough to make the bandwidth useful.

References:

  1. DOD High Performance Computing for the National Security Community, Final Report, 2003-04-09
  2. HPCWire 2006-11-24 DARPA selects Cray and IBM for final phase of HPCS-1
  3. HPCWire 2012-11-08 Cray launches Cascade: embraces Intel-based supercomputing
  4. Wikipedia: Blue Waters

Posted in Computer Architecture, Computer Hardware, Computer Interconnects

Disabled Core Patterns and Core Defect Rates in Intel Xeon Phi x200 (Knights Landing) Processors

Posted by John D. McCalpin, Ph.D. on October 27, 2021

Defect rates and chip yields in the fabrication of complex semiconductor chips (like processors) are typically very tightly held secrets.  In the current era of multicore processors even the definition of “yield” requires careful thinking — companies have adapted their designs to tolerate defects in single processor cores, allowing them to sell “partial good” die at lower prices.  This has been good for everyone — the vendors get to sell more of the “stuff” that comes off the manufacturing line, and the customers have a wider range of products to choose from.

Knowing that many processor chips use “partial good” die does not usually help customers to infer anything about the yield of a multicore chip at various core counts, even when purchasing thousands of chips.  It is possible that the Xeon Phi x200 (“Knights Landing”, “KNL”) processors are in a different category — one that allows statistically interesting inferences to be drawn about core defect rates.

Why is the Xeon Phi x200 different?

  • It was developed for (and is of interest to) customers almost exclusively in the High Performance Computing (HPC) market.
  • The chip has 76 cores (in 38 pairs), and the only three product offerings had 64, 68, or 72 cores enabled.
    •  No place to sell chips with many defects.
  • The processor core is slower than most mainstream Xeon processors in both frequency and instruction level parallelism.
    • No place to sell chips that don’t meet the frequency requirements.

The Texas Advanced Computing Center currently runs 4200 compute servers using the Xeon Phi 7250 (68-core) processor.   The first 504 were installed in June 2016 and the remaining 3696 were installed in April 2017.  Unlike the mainstream Xeon processors, the Xeon Phi x200 enables any user to determine which physical cores on the die are disabled, simply by running the CPUID instruction on each active logical processor to obtain that core’s X2APIC ID (used by the interrupt controller).  There is a 1:1 correspondence between the X2APIC IDs and the physical core locations on the die, so any cores that are disabled will result in missing X2APIC values in the list.  More details on the X2APIC IDs are in the technical report “Observations on Core Numbering and “Core ID’s” in Intel Processors” and more details on the mapping of X2APIC IDs to locations on the die are in the technical report Mapping Core, CHA, and Memory Controller Numbers to Die Locations in Intel Xeon Phi x200 (“Knights Landing”, “KNL”) Processors.
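The enumeration described above can be scripted without special privileges. A minimal sketch, assuming Linux's /proc/cpuinfo "apicid" field reflects the X2APIC ID (true on x86 Linux) and that KNL assigns each core's four hardware threads four consecutive APIC IDs (an assumption of this sketch; the function name is mine):

```python
# Sketch: infer disabled physical cores on a Xeon Phi x200 from gaps in
# the X2APIC ID space. Assumes the "apicid" field of /proc/cpuinfo is the
# X2APIC ID and that each core's 4 threads get 4 consecutive APIC IDs.

def disabled_cores(cpuinfo_text: str, threads_per_core: int = 4):
    apic_ids = [
        int(line.split(":")[1])
        for line in cpuinfo_text.splitlines()
        if line.startswith("apicid")
    ]
    present = {a // threads_per_core for a in apic_ids}
    return sorted(set(range(max(present) + 1)) - present)

# Usage on a KNL node:
#   with open("/proc/cpuinfo") as f:
#       print(disabled_cores(f.read()))
```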

The lists of disabled cores were collected at various points over the last 4.5 years and at some point during the COVID-19 pandemic I decided to look at them.  The first result was completely expected — cores are always enabled/disabled in pairs.  This matches the way they are placed on the die: each of the 38 “tiles” contains 2 cores, a 1 MiB shared L2 cache, and a coherence agent.   The second result was unexpected — although every tile had disabled cores in at least some processors, there were four tile positions where the cores were disabled 15x-20x more often than average.   In “Figure 5” below, these “preferred” tiles were the ones immediately above and below the memory controllers IMC0 and IMC1 on the left and right sides of the chip — numbers 2, 8, 27, 37.

Numbering and locations of CHAs and memory controllers in Xeon Phi x200 processors.

After reviewing the patterns in more detail, it seemed that these four “preferred” locations could be considered “spares”.  The cores at the other 34 tiles would be enabled if they were functional, and if any of those tiles had a defect, a “spare” would be enabled to compensate.  If true, this would be a very exciting result, because it means that even though every one of the 4200 chips has exactly 4 tiles with disabled cores, the presence of disabled cores anywhere other than the “preferred” locations indicates a defect.  If there were no defects on the chip (or only defects in the spare tiles themselves), then the only four tiles with disabled cores would be 2, 8, 27, 37.  This was actually the case for about 1290 of the 4200 chips.

The number of chips with disabled cores at each of the 34 “standard” (non-preferred) locations varied rather widely, but looked random.    Was there any way to evaluate whether the results were consistent with a model of a small number of random defects, with those cores being replaced by activating cores in the spare tiles?  Yes, there is, and for the statistically minded you can read all about it in the technical report Disabled Core Patterns and Core Defect Rates in Xeon Phi x200 (“Knights Landing”) Processors. The report contains all sorts of mind-numbing discussions of “truncated binomial distributions”, corrections for visibility of defects, and statistical significance tests for several different views of the data — but it does have brightly colored charts and graphs to attempt to offset those soporific effects.

For the less statistically minded, the short description is:

  • For the 504 processors deployed in June 2016, the average number of “defects” was 1.38 per chip.
  • For the 3696 processors deployed in April 2017, the average number of “defects” was 1.19 per chip.
  • The difference in these counts was very strongly statistically significant (3.7 standard deviations).
  • Although some of the observed values are slightly outside the ranges expected for a purely random process, the overall pattern is strongly consistent with a model of random, independent defects.

These are very good numbers — for the full cluster the average number of defects is projected to be 1.36 per chip (including an estimate of defects in the unused “spare” tiles).  With these defect rates, only about 1% of the chips would be expected to have more than 4 defects — and almost all of these would still suffice for the 64-core model.
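As a sanity check on that last claim, one can assume defects are independent and Poisson-distributed with the projected mean of 1.36 per chip (a simplification of my own for this sketch; the technical report uses truncated binomial models):

```python
# Probability of more than 4 defects per chip under a Poisson model
# with mean 1.36 (a simplifying assumption for this sketch).
from math import exp, factorial

lam = 1.36
p_at_most_4 = sum(exp(-lam) * lam**k / factorial(k) for k in range(5))
print(f"P(defects > 4) = {1 - p_at_most_4:.2%}")  # 1.27%
```

This is consistent with the "about 1%" figure quoted above.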

So does this have anything to do with “yield”?  Probably not a whole lot — all of these chips require that all 8 Embedded DRAM Controllers (EDCs) are fully functional, all 38 Coherence Agents are fully functional, both DDR4 memory controllers are fully functional, and the IO blocks are fully functional.  There is no way to infer how many chips might be lost due to failures in any of those parts because there were no product offerings that allowed any of those blocks to be disabled.  But from the subset of chips that had all the “non-core” parts working, these results paint an encouraging picture with regard to defect rates for the cores.


Posted in Computer Architecture, Computer Hardware

Mapping addresses to L3/CHA slices in Intel processors

Posted by John D. McCalpin, Ph.D. on September 10, 2021

Starting with the Xeon E5 processors “Sandy Bridge EP” in 2012, all of Intel’s mainstream multicore server processors have included a distributed L3 cache with distributed coherence processing. The L3 cache is divided into “slices”, which are distributed around the chip — typically one “slice” for each processor core.
Each core’s L1 and L2 caches are local and private, but outside the L2 cache addresses are distributed in a random-looking way across the L3 slices all over the chip.

As an easy case, for the Xeon Gold 6142 processor (1st generation Xeon Scalable Processor with 16 cores and 16 L3 slices), every aligned group of 16 cache line addresses is mapped so that one of those 16 cache lines is assigned to each of the 16 L3 slices, using an undocumented permutation generator. The total number of possible permutations of the L3 slice numbers [0…15] is 16! (almost 21 trillion), but measurements on the hardware show that only 16 unique permutations are actually used. The observed sequences are the “binary permutations” of the sequence [0,1,2,3,…,15]. The “binary permutation” operator can be described in several different ways, but the structure is simple:

  • binary permutation “0” of a sequence is just the original sequence
  • binary permutation “1” of a sequence swaps elements in each even/odd pair, e.g.
    • Binary permutation 1 of [0,1,2,3,4,5,6,7] is [1,0,3,2,5,4,7,6]
  • binary permutation “2” of a sequence swaps pairs of elements in each set of four, e.g.,
    • Binary permutation 2 of [0,1,2,3,4,5,6,7] is [2,3,0,1,6,7,4,5]
  • binary permutation “3” of a sequence performs both permutation “1” and permutation “2”
  • binary permutation “4” of a sequence swaps 4-element halves of each set of 8 elements, e.g.,
    • Binary permutation 4 of [0,1,2,3,4,5,6,7] is [4,5,6,7,0,1,2,3]
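
The examples above are consistent with a compact formulation: binary permutation p takes the element at index (i XOR p) as element i of the result. A sketch (my phrasing of the operator, not Intel's):

```python
# "Binary permutation" p of a power-of-2-length sequence: element i of
# the result is the original element at index (i XOR p).

def binary_permutation(seq, p):
    n = len(seq)
    assert n & (n - 1) == 0, "length must be a power of 2"
    assert 0 <= p < n
    return [seq[i ^ p] for i in range(n)]

base = list(range(8))
print(binary_permutation(base, 1))  # [1, 0, 3, 2, 5, 4, 7, 6]
print(binary_permutation(base, 2))  # [2, 3, 0, 1, 6, 7, 4, 5]
print(binary_permutation(base, 3))  # [3, 2, 1, 0, 7, 6, 5, 4]
print(binary_permutation(base, 4))  # [4, 5, 6, 7, 0, 1, 2, 3]
```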

Binary permutation operators are very cheap to implement in hardware, but are limited to sequence lengths that are a power of 2.  When the number of slices is not a power of 2, using binary permutations requires that you create a power-of-2-length sequence that is bigger than the number of slices, and which contains each slice number approximately an equal number of times.  As an example, the Xeon Scalable Processors (gen1 and gen2) with 24 L3 slices use a 512 element sequence that contains each of the values 0…16 21 times each and each of the values 17…23 22 times each.   This almost-uniform “base sequence” is then permuted using the 512 binary permutations that are possible for a 512-element sequence.

Intel does not publish the length of the base sequences, the values in the base sequences, or the formulas used to determine which binary permutation of the base sequence will be used for any particular address in memory.

Over the years, a few folks have investigated the properties of these mappings, and have published a small number of results — typically for smaller L3 slice counts.

Today I am happy to announce the availability of the full base sequences and the binary permutation selector equations for many Intel processors.  The set of systems includes:

  • Xeon Scalable Processors (gen1 “Skylake Xeon” and gen2 “Cascade Lake Xeon”) with 14, 16, 18, 20, 22, 24, 26, 28 L3 slices
  • Xeon Scalable Processors (gen3 “Ice Lake Xeon”) with 28 L3 slices
  • Xeon Phi x200 Processors (“Knights Landing”) with 38 Snoop Filter slices

The results for the Xeon Scalable Processors (all generations) are based on my own measurements.  The results for the Xeon Phi x200 are based on the mappings published by Kommrusch, et al. (e.g., https://arxiv.org/abs/2011.05422), but re-interpreted in terms of the “base sequence” plus “binary permutation” model used for the other processors.

The technical report and data files (base sequences and permutation selector masks for each processor) are available at https://hdl.handle.net/2152/87595

Have fun!

Next up — using these address-to-L3/CHA mapping results to understand observed L3 and Snoop Filter conflicts in these processors….

 


Posted in Cache Coherence Implementations, Computer Architecture

Die Locations of Cores and L3 Slices for Intel Xeon Processors

Posted by John D. McCalpin, Ph.D. on May 27, 2021

Intel provides nice schematic diagrams of the layouts of their processor chips, but provides no guidance on how the user-visible core numbers and L3 slice numbers map to the locations on the die.
Most of the time there is no “need” to know the locations of the units, but there are many performance analyses that do require it, and it is often extremely helpful to be able to visualize the flow of data (and commands and acknowledgements) on the two-dimensional mesh.

In 2018 I spent a fair amount of time developing methodologies to determine the locations of the user-visible core and CHA/SF/LLC numbers on the Xeon Phi 7250 and Xeon Platinum 8160 processors. It required a fair amount of time because some tricks that Intel used to make it easier to design the photolithographic masks had the side effect of modifying the directions of up/down/left/right in different parts of the chip! When combined with the unknown locations of the disabled cores and CHAs, this was quite perplexing….

The Xeon Scalable Processors (Skylake, Cascade Lake, and the new Ice Lake Xeon) “mirror” the photolithographic mask for the “Tile” (Core + CHA/SF/LLC) in alternating columns, causing the meanings of “left” and “right” in the mesh traffic performance counters to alternate as well. This is vaguely hinted at by some of the block diagrams of the chip (such as XeonScalableFamilyTechnicalOverviewFigure5), but is clearer in the die photo:

Intel Skylake Xeon die photo

Here I have added light blue boxes around the 14 (of 28) Tile locations on the die that have the normal meanings of “left” and “right” in the mesh data traffic counters. The Tiles that don’t have blue boxes around them are left-right mirror images of the “normal” cores, and at these locations the mesh data traffic counters report mesh traffic with “left” and “right” reversed. NOTE that the 3rd Generation Intel Xeon Scalable Processors (Ice Lake Xeon) show the same column-by-column reversal as the Skylake Xeon, leading to the same behavior in the mesh data traffic counters.


TACC Frontera System

For the Xeon Platinum 8280 processors in the TACC Frontera system, all 28 of the Tiles are fully enabled, so there are no disabled units at unknown locations to cause the layout and numbering to differ from socket to socket. In each socket, the CHA/SF/LLC blocks are numbered top-to-bottom, left-to-right, skipping over the memory controllers:

The pattern of Logical Processor numbers will vary depending on whether the numbers alternate between sockets (even in 0, odd in 1) or are block-distributed (first half in 0, second half in 1). For the TACC Frontera system, all of the nodes are configured with logical processors alternating between sockets, so all even-numbered logical processors are in socket 0 and all odd-numbered logical processors are in socket 1. For this configuration, the locations of the Logical Processor numbers in socket 0 are:

In socket 1 the layout is the same, but with each Logical Processor number incremented by 1.
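For this alternating enumeration the mapping is trivial to compute; a minimal sketch (helper names are mine):

```python
# Frontera-style alternating enumeration: even logical processors are in
# socket 0, odd in socket 1; each even/odd pair shares a die position.

def socket_of(lp: int) -> int:
    return lp % 2

def position_within_socket(lp: int) -> int:
    return lp // 2

print(socket_of(10), position_within_socket(10))  # 0 5
print(socket_of(11), position_within_socket(11))  # 1 5
```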

More details are in TR-2021-01b (link below in the references).

TACC Stampede2 System

“Skylake Xeon” partitions

For the Xeon Platinum 8160 processors in the TACC Stampede2 system, 24 of the Tiles are fully enabled and the remaining 4 Tiles have disabled Cores and disabled CHA/SF/LLCs. For these processors, administrative privileges are required to read the configuration registers that allow one to determine the locations of the CHA/SF/LLC units and the Logical Processors. There are approximately 120 different patterns of disabled tiles across the 3472 Xeon Platinum 8160 processors (1736 2-socket nodes) in the Stampede2 “SKX” partitions. The pattern of disabled cores generally has negligible impact on performance, but one needs to know the locations of the cores and CHA/SF/LLC blocks to make any sense of the traffic on the 2D mesh. Fortunately only one piece of information is needed on these systems — the CAPID6 register tells which CHA locations on the die are enabled, and these systems have a fixed mapping of Logical Processor numbers to co-located CHA numbers — so it would not be hard to make this information available to interested users (if any exist).

More details are in TR-2021-01b (link below in the references).

“Knights Landing” (“KNL”) partitions

For the 4200 Stampede2 nodes with Xeon Phi 7250 processors, all 38 CHA/SF units are active in each chip, and 34 of the 38 tiles have an active pair of cores. Since all 38 CHAs are active, their locations are the same from node to node:


For these processors the information required to determine the locations of the cores is available from user space (i.e., without special privileges). The easiest way to do this is to simply read “/proc/cpuinfo” to get the “core id” field for each “processor” field. Since each core supports four threads, each of the “core id” fields should appear four times. Each tile has two cores, so we take the even-numbered “core id” fields and divide them by two to get the tile number where each of the active cores is located. A specific example showing the Logical Processor number, the “core id”, and the corresponding “tile” location:

c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:39:28 $ grep ^processor /proc/cpuinfo  | head -68 | awk '{print $NF}' > tbl.procs
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:39:55 $ grep "^core id" /proc/cpuinfo  | head -68 | awk '{print $NF}' > tbl.coreids
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:40:22 $ grep "^core id" /proc/cpuinfo  | head -68 | awk '{print int($NF/2)}' > tbl.tiles
c455-003.stampede2:~/Stampede2/Configurations:2021-05-27T12:40:32 $ paste tbl.procs tbl.coreids tbl.tiles 
0	0	0
1	1	0
2	2	1
3	3	1
4	4	2
5	5	2
6	6	3
7	7	3
8	8	4
9	9	4
10	10	5
11	11	5
12	12	6
13	13	6
14	14	7
15	15	7
16	16	8
17	17	8
18	18	9
19	19	9
20	22	11
21	23	11
22	24	12
23	25	12
24	26	13
25	27	13
26	28	14
27	29	14
28	30	15
29	31	15
30	32	16
31	33	16
32	34	17
33	35	17
34	36	18
35	37	18
36	38	19
37	39	19
38	40	20
39	41	20
40	42	21
41	43	21
42	44	22
43	45	22
44	46	23
45	47	23
46	48	24
47	49	24
48	50	25
49	51	25
50	56	28
51	57	28
52	58	29
53	59	29
54	60	30
55	61	30
56	62	31
57	63	31
58	64	32
59	65	32
60	66	33
61	67	33
62	68	34
63	69	34
64	70	35
65	71	35
66	72	36
67	73	36

For each Logical Processor (column 1), the tile number is in column 3, and the location of the tile is in the figure above.

Since the tile numbers are [0..37], from this list we see that 10, 26, 27, and 37 are missing, so these are the tiles with disabled cores.
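The same determination can be scripted directly; a Python equivalent of the shell pipeline above (the function name is mine):

```python
# Find KNL tiles with disabled cores from /proc/cpuinfo text: each tile
# holds two cores, so tile = core_id // 2; tiles in 0..37 that never
# appear have their cores disabled.

def missing_tiles(cpuinfo_text: str, n_tiles: int = 38):
    active = {
        int(line.split(":")[1]) // 2
        for line in cpuinfo_text.splitlines()
        if line.startswith("core id")
    }
    return sorted(set(range(n_tiles)) - active)

# On the node shown above this returns [10, 26, 27, 37].
```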

More details are in TR-2020-01 and in TR-2021-02 (links below in the references).


Presentations:

Detailed References:


What is a CHA/SF/LLC? This is a portion of each Tile containing a “Coherence and Home Agent” slice, a “Snoop Filter” slice, and a “Last Level Cache” slice. Each physical address in the system is mapped to exactly one of the CHA/SF/LLC blocks for cache coherence and last-level caching, so that (1) any core in the system will automatically make use of all of the LLC slices, and (2) each CHA/SF/LLC has to handle approximately equal amounts of work when all the cores are active.



Posted in Computer Architecture, Computer Hardware, Performance Counters

The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Posted by John D. McCalpin, Ph.D. on April 2, 2020

This was a keynote presentation at the “2nd International Workshop on Performance Modeling: Methods and Applications” (PMMA16), June 23, 2016, Frankfurt, Germany (in conjunction with ISC16).

The presentation discusses a family of simple performance models that I developed over the last 20 years — originally in support of processor and system design at SGI (1996-1999), IBM (1999-2005), and AMD (2006-2008), but more recently in support of system procurements at The Texas Advanced Computing Center (TACC) (2009-present).

More notes interspersed below….


Slide01


Most of TACC’s supercomputer systems are national resources, open to (unclassified) scientific research in all areas. We have over 5,000 direct users (logging into the systems and running jobs) and tens of thousands of indirect users (who access TACC resources via web portals). With a staff of slightly over 175 full-time employees (less than 1/2 in consulting roles), we must therefore focus on highly-leveraged performance analysis projects, rather than labor-intensive ones.


An earlier presentation on this topic (including extensions of the method to incorporate cost modeling) is from 2007: “System Performance Balance, System Cost Balance, Application Balance, & the SPEC CPU2000/CPU2006 Benchmarks” (invited presentation at the SPEC Benchmarking Joint US/Europe Colloquium, June 22, 2007, Dresden, Germany).


This data is from the 2007 presentation. All of the SPECfp_rate2000 results were downloaded from www.spec.org, the results were sorted by processor type, and “peak floating-point operations per cycle” was manually added for each processor type. This includes all architectures, all compilers, all operating systems, and all system configurations. It is not surprising that there is a lot of scatter, but the factor of four range in Peak MFLOPS at fixed SPECfp_rate2000/core and the factor of four range in SPECfp_rate2000/core at fixed Peak MFLOPS were higher than I expected….


(Also from the 2007 presentation.) To show that I can criticize my own work as well, here I show that sustained memory bandwidth (using an approximation to the STREAM Benchmark) is also inadequate as a single figure of merit. (It is better than peak MFLOPS, but still has roughly a factor of three range when projecting in either direction.)


Here I assumed a particular analytical function for the amount of memory traffic as a function of cache size to scale the bandwidth time.
Details are not particularly important since I am trying to model something that is a geometric mean of 14 individual values and the results are across many architectures and compilers.
Doing separate models for the 14 benchmarks does not reduce the variance much further – there is about 15% that remains unexplainable in such a broad dataset.

The model can provide much better fit to the data if the HW and SW are restricted, as we will see in the next section…



Why no overlap? The model actually includes some kinds of overlap — this will be discussed in the context of specific models below — and can be extended to include overlap between components. The specific models and results that will be presented here fit the data better when it is assumed that there is no overlap between components. Bounds on overlap are discussed near the end of the presentation, in the slides titled “Analysis”.


The approach is opportunistic. When I started this work over 20 years ago, most of the parameters I was varying could only be changed in the vendor’s laboratory. Over time, the mechanisms introduced for reducing energy consumption (first in laptops) became available more broadly. In most current machines, memory frequency can be configured by the user at boot time, while CPU frequency can be varied on a live system.


The assumption that “memory accesses overlap with other memory accesses about as well as they do in the STREAM Benchmark” is based on trying lots of other formulations and getting poor consistency with observations.
Note that “compute” is not a simple metric like “instructions” or “floating-point operations”. T_cpu is best understood as the time that the core requires to execute the particular workload of interest in the absence of memory references.
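The two-component version of this idea can be sketched concretely. The following is a minimal illustration (all numbers are synthetic, not from the presentation) of fitting the non-overlapping model T = W_cpu/f_cpu + W_mem/f_mem to a set of frequency-sweep measurements by ordinary least squares:

```python
def fit_two_component(samples):
    """samples: list of (f_cpu, f_mem, T_observed) triples.
    Least-squares fit of T = W_cpu/f_cpu + W_mem/f_mem by solving
    the 2x2 normal equations directly."""
    a11 = a12 = a22 = b1 = b2 = 0.0
    for f_cpu, f_mem, t in samples:
        x1, x2 = 1.0 / f_cpu, 1.0 / f_mem      # regressors
        a11 += x1 * x1; a12 += x1 * x2; a22 += x2 * x2
        b1 += x1 * t;   b2 += x2 * t
    det = a11 * a22 - a12 * a12
    return (a22 * b1 - a12 * b2) / det, (a11 * b2 - a12 * b1) / det

# Synthetic "machine": the work terms below are invented for illustration.
W_CPU, W_MEM = 300.0, 200.0     # hypothetical compute work and memory work
settings = [(2.0, 1.0), (2.0, 0.8), (2.6, 1.0), (2.6, 0.8), (3.0, 0.8)]
data = [(fc, fm, W_CPU / fc + W_MEM / fm) for fc, fm in settings]
w_cpu, w_mem = fit_two_component(data)
print(round(w_cpu, 3), round(w_mem, 3))
```

With real measurements, T would come from timing the workload at each (core frequency, memory frequency) setting, and the fitted W terms are the “work” attributed to each component.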


Only talking about CPU2006 results today – the CPU2000 results look similar (see the 2007 presentation linked above), but the CPU2000 benchmark codes are less closely related to real applications.


Building separate models for each of the benchmarks was required to get the correct asymptotic properties. The geometric mean used to combine the individual benchmark results into a single metric is the right way to combine relative performance improvements with equal weighting for each code, but it is inconsistent with the underlying “physics” of computer performance for each of the individual benchmark results.


This system would be considered a “high-memory-bandwidth” system at the time these results were collected. In other words, this system would be expected to be CPU-limited more often than other systems (when running the same workload), because it would be memory-bandwidth limited less often. This system also had significantly lower memory latency than many contemporary systems (which were still using front-side bus architectures and separate “NorthBridge” chips).


Many of these applications (e.g., NAMD, Gamess, Gromacs, DealII, WRF, and MILC) are major consumers of cycles on TACC supercomputers (albeit newer versions and different datasets).


The published SPEC benchmarks are no longer useful to support this sensitivity-based modeling approach for two main reasons:

  1. Running N independent copies of a benchmark simultaneously on N cores has a lot of similarities with running a parallelized implementation of the benchmark when N is small (2 or 4, or maybe a little higher), but the performance characteristics diverge as N gets larger (certainly dubious by the time one reaches even half the core count of today’s high-end processors).
  2. For marketing reasons, the published results since 2007 have been limited almost exclusively to the configurations that give the very best results. This includes always running with HyperThreading enabled (and running one copy of the executable on each “logical processor”), always running with automatic parallelization enabled (making it very difficult to compare “speed” and “rate” results, since it is not clear how many cores are used in the “speed” tests), always running with the optimum memory configuration, etc.

The “CONUS 12km” benchmark is a simulation of the weather over the “CONtinental US” at 12km horizontal resolution. Although it is a relatively small benchmark, the performance characteristics have been verified to be quite similar to the “on-node” performance characteristics of higher-resolution test cases (e.g., “CONUS 2.5km”) — especially when the higher-resolution cases are parallelized across multiple nodes.



Note that the execution time varies from about 120 seconds to 210 seconds — this range is large compared to the deviations between the model and the observed execution time.

Note also that the slope of the Model 1 fit is almost 6% off of the desired value of 1.0, while the second model is within 1%.

  • In the 2007 SPECfp_rate tests, a similar phenomenon was seen, and required the addition of a third component to the model: memory latency.
  • In these tests, we did not have the same ability to vary memory latency that I had with the 2007 Opteron systems. In these “real-application” tests, IO is not negligible (while it is required to be <1% of execution time for the SPEC benchmarks), and allowing for a small invariant IO time gave much better results.

Bottom bars are model CPU time – easy to see the quantization.
Middle bars are model Memory time.
Top bars are (constant) IO time.


Drum roll, please….




Ordered differently, but the same sort of picture.
Here the quantization of memory time is visible across the four groups of varying CPU frequencies.


These NAMD results are not at all surprising — NAMD has extremely high cache re-use and therefore very low rates of main memory access — but it was important to see if this testing methodology replicated this expected result.


Big change of direction here….
At the beginning I said that I was assuming that there would be no overlap across the execution times associated with the various work components.
The extremely close agreement between the model results and observations strongly supports the effectiveness of this assumption.
On the other hand, overlap is certainly possible, so what can this methodology provide for us in the way of bounds on the degree of overlap?


On some HW it is possible (but very rare) to get timings (slightly) outside these bounds – I ignore such cases here.
Note that maximum ratio of upper bound over lower bound is equal to “N” – the degrees of freedom of the model! This is an uncomfortably large range of uncertainty – making it even more important to understand bounds on overlap.
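A toy calculation illustrates why the ratio can reach N: with full overlap the observed time collapses to the largest component time, while the no-overlap assumption attributes the sum of all components (the numbers below are invented):

```python
def overlap_bounds(component_times):
    """Bounds on observed execution time for N work components:
    full overlap -> max(t_i); no overlap -> sum(t_i)."""
    return max(component_times), sum(component_times)

lo, hi = overlap_bounds([0.5, 0.5, 0.5])   # N = 3 equal components
print(lo, hi, hi / lo)                     # the ratio hi/lo equals N here
```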


Physically this is saying that there can’t be so much work in any of the components that processing that work would exceed the total time observed.
But the goal is to learn something about the ratios of the work components, so we need to go further.



These numbers come from plugging in synthetic performance numbers from a model with variable overlap into the bounds analysis.
Message 1: If you want tight formal bounds on overlap, you need to be able to vary the “rate” parameters over a large range — probably too large to be practical.
  • Message 2: If one of the estimated time components is small and you cannot vary the corresponding rate parameter over a large enough range, it may be impossible to tell whether the work component is “fully overlapped” or is associated with a negligible amount of work (e.g., the lower bound in the “2:1 R_mem” case in this figure). (See next slide.)








Posted in Computer Architecture, Performance | Tagged: , , | Comments Off on The Surprising Effectiveness of Non-Overlapping, Sensitivity-Based Performance Models

Timing Methodology for MPI Programs

Posted by John D. McCalpin, Ph.D. on March 4, 2019

While working on the implementation of the MPI version of the STREAM benchmark, I realized that there were some subtleties in timing that could easily lead to inaccurate and/or misleading results.  This post is a transcription of my notes as I looked at the issues….

Primary requirement: I want a measure of wall clock time that is guaranteed to start before any rank does work and to end after all ranks have finished their work.

Secondary goal: I also want the start time to be as late as possible relative to the initiation of work by any rank, and for the end time to be as early as possible relative to the completion of the work by all ranks.

I am not particularly concerned about OS scheduling issues, so I can assume that each timer reads very close to the completion of the preceding statement and the initiation of the subsequent statement.  Any deviations caused by stalls between timers, barriers, and work must be in the direction of increasing the reported time, not decreasing it.  (This is a corollary of my initial requirement.)
The discussion here will be based on a simple example, where the “t” variables are (local) wall clock times for MPI rank k and WORK() represents the parallel workload that I am testing.
Generically, I want:
      t_start(k) = time()
      WORK()
      t_end(k) = time()
but for an MPI job, the methodology needs to be provably correct for arbitrary (real) skew across nodes as well as for arbitrary offsets between the absolute time across nodes.  (I am deliberately ignoring the rare case in which a clock is modified on one or more nodes during a run — most time protocols try hard to avoid such shifts, and instead change the rate at which the clock is incremented to drive synchronization.)
After some thinking, I came up with this pseudo-code, which is executed independently by each MPI rank (indexed by “k”):
      t0(k) = time()
      MPI_barrier()
      t1(k) = time()

      WORK()

      t2(k) = time()
      MPI_barrier()
      t3(k) = time()
If the clocks are synchronized, then all I need is:
    tstart = min(t1(k)), k=1..numranks
    tstop  = max(t2(k)), k=1..numranks
If the clocks are not synchronized, then I need to make some use of the barriers — but exactly how?
In the pseudo-code above, the barriers ensure that the following two statements are true:
  • For the start time, t0(k) is guaranteed to be earlier than the initiation of any work.
  • For the end time, t3(k) is guaranteed to be later than the completion of any work.
These statements are true for each rank individually, so the tightest bound available from the collection of t0(k) and t3(k) values is:
      tstop - tstart > min(t3(k)-t0(k)),   k=1..numranks
This gives a (tstop – tstart) that is at least as large as the time required for the actual work plus the time required for the two MPI_barrier() operations.
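The correctness argument can be exercised with a pure-Python simulation (no MPI required; barrier cost is idealized to zero and all numbers are synthetic): each rank gets an arbitrary clock offset, each barrier releases at the latest arriver, and the reported min(t3(k)-t0(k)) is checked against the true global work interval.

```python
import random

def simulate(num_ranks, seed):
    """Simulate the barrier-based timing pseudo-code with skewed clocks."""
    rng = random.Random(seed)
    offset = [rng.uniform(-100.0, 100.0) for _ in range(num_ranks)]  # clock offsets
    arrive = [rng.uniform(0.0, 5.0) for _ in range(num_ranks)]   # global arrival times
    work   = [rng.uniform(10.0, 20.0) for _ in range(num_ranks)]  # WORK() durations
    b1 = max(arrive)                  # first barrier releases all ranks
    b2 = max(b1 + w for w in work)    # second barrier: after last rank finishes
    t0 = [arrive[k] + offset[k] for k in range(num_ranks)]  # local clock readings
    t3 = [b2 + offset[k] for k in range(num_ranks)]
    estimate  = min(t3[k] - t0[k] for k in range(num_ranks))
    true_span = b2 - b1               # all ranks working start to last rank's end
    return estimate, true_span

for seed in range(20):
    est, true_span = simulate(8, seed)
    assert est >= true_span - 1e-12   # the estimate never undercounts the work
print("bound holds for all trials")
```

The clock offsets cancel inside each rank’s own t3(k)-t0(k) difference, which is why the minimum over ranks is safe even with completely unsynchronized clocks.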

Posted in Performance, Reference | Tagged: | Comments Off on Timing Methodology for MPI Programs

Intel’s future “CLDEMOTE” instruction

Posted by John D. McCalpin, Ph.D. on February 18, 2019

I recently saw a reference to a future Intel “Atom” core called “Tremont” and ran across an interesting new instruction, “CLDEMOTE”, that will be supported in “Future Tremont and later” microarchitectures (ref: “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference”, document 319433-035, October 2018).

The “CLDEMOTE” instruction is a “hint” to the hardware that it might help performance to move a cache line from the cache level(s) closest to the core to a cache level that is further from the core.

What might such a hint be good for?   There are two obvious use cases:

  • Temporal Locality Control:  The cache line is expected to be re-used, but not so soon that it should remain in the closest/smallest cache.
  • Cache-to-Cache Intervention Optimization:  The cache line is expected to be accessed soon by a different core, and cache-to-cache interventions may be faster if the data is not in the closest level(s) of cache.
    • Intel’s instruction description mentions this use case explicitly.

If you are not a “cache hint instruction” enthusiast, this may not seem like a big deal, but it actually represents a relatively important shift in instruction design philosophy.

Instructions that directly pertain to caching can be grouped into three categories:

  1. Mandatory Control
    • The specified cache transaction must take place to guarantee correctness.
    • E.g., In a system with some non-volatile memory, a processor must have a way to guarantee that dirty data has been written from the (volatile) caches to the non-volatile memory.    The Intel CLWB instruction was added for this purpose — see https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction
  2. “Direct” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of specific transactions on a model architecture (with caveats that an implementation might do something different).
    • E.g., Intel’s PREFETCHW instruction requests that the cache line be loaded in anticipation of a store to the line.
      • This allows the cache line to be brought into the processor in advance of the store.
      • More importantly, it also allows the cache coherence transactions associated with obtaining exclusive access to the cache line to be started in advance of the store.
  3. “Indirect” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of the semantics of the program, not in terms of specific cache transactions (though specific transactions might be provided as an example).
    • E.g., “Push For Sharing Instruction” (U.S. Patent 8099557) is a hint to the processor that the current process is finished working on a cache line and that another processing core in the same coherence domain is expected to access the cache line next.  The hardware should move the cache line and/or modify its state to minimize the overhead that the other core will incur in accessing this line.
      • I was the lead inventor on this patent, which was filed in 2008 while I was working at AMD.
      • The patent was an attempt to generalize my earlier U.S. Patent 7194587, “Localized Cache Block Flush Instruction”, filed while I was at IBM in 2003.
    • Intel’s CLDEMOTE instruction is clearly very similar to my “Push For Sharing” instruction in both philosophy and intent.

 

Even though I have contributed to several patents on cache control instructions, I have decidedly mixed feelings about the entire approach.  There are several issues at play here:

  • The gap between processor and memory performance continues to increase, making performance more and more sensitive to the effective use of caches.
  • Cache hierarchies are continuing to increase in complexity, making it more difficult to understand what to do for optimal performance — even if precise control were available.
    • This has led to the near-extinction of “bottom-up” performance analysis in computing — first among customers, but also among vendors.
  • The cost of designing processors continues to increase, with caches & coherence playing a significant role in the increased cost of design and validation.
  • Technology trends show that the power associated with data motion (including cache coherence traffic) has come to far exceed the power required by computations, and that the ratio will continue to increase.
    • This does not currently dominate costs (as discussed in the SC16 talk cited above), but that is in large part because the processors have remained expensive!
    • Decreasing processor cost will require simpler designs — this decreases the development cost that must be recovered and simultaneously reduces the barriers to entry into the market for processors (allowing more competition and innovation).

Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power.  More of the same is not an answer — new thinking is required.

If one starts from current technology (rather than the technology of the late 1980’s), one would naturally design architectures to address the primary challenges:

  • “Vertical” movement of data (i.e., “private” data moving up and down the levels of a memory hierarchy) must be explicitly controllable.
  • “Horizontal” movement of data (e.g., “shared” data used to communicate between processing elements) must be explicitly controllable.

Continuing to apply “band-aids” to the transparent caching architecture of the 1980’s will not help move the industry toward the next disruptive innovation.


Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Tagged: , | Comments Off on Intel’s future “CLDEMOTE” instruction

New Year’s Updates

Posted by John D. McCalpin, Ph.D. on January 9, 2019

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:


Posted in Computer Architecture, Performance | Tagged: , , | 2 Comments »

SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Posted by John D. McCalpin, Ph.D. on January 7, 2019

Here are the annotated slides from my SC18 presentation on Snoop Filter Conflicts that cause performance variability in HPL and DGEMM on the Xeon Platinum 8160 processor.

This slide presentation includes data (not included in the paper) showing that Snoop Filter Conflicts occur in all Intel Scalable Processors (a.k.a., “Skylake Xeon”) with 18 cores or more, and also occurs on the Xeon Phi x200 processors (“Knights Landing”).

The published paper is available (with ACM subscription) at https://dl.acm.org/citation.cfm?id=3291680


This is less boring than it sounds!


A more exciting version of the title.


This story is very abridged — please read the paper!



Execution times only — no performance counters yet.

500 nodes tested, but only 392 nodes had the 7 runs needed for a good computation of the median performance.

Dozens of different nodes showed slowdowns of greater than 5%.


I measured memory bandwidth first simply because I had the tools to do this easily.
Read memory controller performance counters before and after each execution and compute DRAM traffic.
Write traffic was almost constant — only the read traffic showed significant variability.


It is important to decouple the sockets for several reasons.  (1) Each socket manages its frequency independently to remain within the Running Average Power Limit. (2) Cache coherence is managed differently within and between sockets.
The performance counter infrastructure is at https://github.com/jdmccalpin/periodic-performance-counters
Over 25,000 DGEMM runs in total, generating over 240 GiB of performance counter output.


I already saw that slow runs were associated with higher DRAM traffic, but needed to find out which level(s) of the cache were experiencing extra load misses.
The strongest correlation between execution time and cache miss counts was with L2 misses (measured here as L2 cache fills).

The variation of L2 fills for the full-speed runs is surprisingly large, but the slow runs all have L2 fill counts that are at least 1.5x the minimum value.
Some runs tolerate increased L2 fill counts up to 2x the minimum value, but all cases with >2x L2 fills are slow.

This chart looks at the sum of L2 fills for all the cores on the chip — next I will look at whether these misses are uniform across the cores.


I picked 15-20 cases in which a “good” trial (at or above median performance) was followed immediately by a “slow” trial (at least 20% below median performance).
This shows the L2 Fills by core for a “good” trial — the red dashed line corresponds to the minimum L2 fill count from the previous chart divided by 24 to get the minimum per-core value.
Different sets of cores and different numbers of cores had high counts in each run — even on the same node.


This adds the “slow” execution that immediately followed the “good” execution.
For the slow runs, most of the cores had highly elevated L2 counts.  Again, different sets of cores and different numbers of cores had high counts in each run.

This data provides a critical clue:  Since L2 caches are private and 2MiB pages fully control the L2 cache index bits, the extra L2 cache misses must be caused by something *external* to the cores.


The Snoop Filter is essentially the same as the directory tags of the inclusive L3 cache of previous Intel Xeon processors, but without room to store the data for the cache lines tracked.
The key concept is “inclusivity” — lines tracked by a Snoop Filter entry must be invalidated before that Snoop Filter entry can be freed to track a different cache line address.
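A toy model (all parameters invented; real snoop filters are much larger and their address hash is undocumented) shows how inclusivity converts snoop filter evictions into extra misses in an otherwise-adequate private cache:

```python
from collections import OrderedDict

class ToySnoopFilter:
    """Inclusive, set-associative snoop filter with LRU replacement.
    Evicting an entry back-invalidates the tracked line from the core's
    private cache (counted here, not modeled further)."""
    def __init__(self, num_sets, ways):
        self.sets = [OrderedDict() for _ in range(num_sets)]  # LRU order per set
        self.ways = ways
        self.back_invalidations = 0

    def track(self, line_addr):
        s = self.sets[line_addr % len(self.sets)]
        if line_addr in s:
            s.move_to_end(line_addr)       # already tracked: refresh LRU position
            return
        if len(s) == self.ways:
            s.popitem(last=False)          # set full: evict the LRU entry...
            self.back_invalidations += 1   # ...back-invalidating that cache line
        s[line_addr] = True

sf = ToySnoopFilter(num_sets=4, ways=2)
for addr in [0, 4, 8, 12, 16] * 3:   # five distinct lines, all mapping to set 0
    sf.track(addr)
print(sf.back_invalidations)
```

Even though only five lines are in use (far less than the filter’s total capacity of eight entries), the single conflicted set forces an eviction on almost every access — the same pathology, in miniature, as the 2MiB-page conflicts described in the talk.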


I initially found some poorly documented core counters that looked like they were related to Snoop Filter evictions, then later found counters in the “uncore” that count Snoop Filter evictions directly.
This allowed direct confirmation of my hypothesis, as summarized in the next slides.


About 1% of the runs are more than 10% slower than the fastest run.


Snoop Filter Evictions clearly account for the majority of the excess L2 fills.

But there is one more feature of the “slow” runs….


For all of the “slow” runs, the DRAM traffic is increased.  This means that a fraction of the data evicted from the L2 caches by the Snoop Filter evictions was also evicted from the L3 cache, and so must be retrieved from DRAM.

At high Snoop Filter conflict rates (>4e10 Snoop Filter evictions), all of the cases have elevated DRAM traffic, with something like 10%-15% of the Snoop Filter evictions missing in the L3 cache.

There are some cases in the range of 100-110 seconds that have elevated snoop filter evictions, but not elevated DRAM reads, that show minimal slowdowns.

This suggests that DGEMM can tolerate the extra latency of L2 miss/L3 hit for its data, but not the extra latency of L2 miss/L3 miss/DRAM hit.


Based on my experience in processor design groups at SGI, IBM, and AMD, I wondered if using contiguous physical addresses might avoid these snoop filter conflicts….


Baseline with 2MiB pages.


With 1GiB pages, the tail almost completely disappears in both width and depth.


Zooming in on the slowest 10% of the runs shows no evidence of systematic slowdowns when using 1GiB pages.
The performance counter data confirms that the snoop filter eviction rate is very small.

So we have a fix for single-socket DGEMM, what about HPL?


Intel provided a test version of their optimized HPL benchmark in December 2017 that supported 1GiB pages.

First, I verified that the performance variability for single-node (2-socket) HPL runs was eliminated by using 1GiB pages.

The variation across nodes is strong (due to different thermal characteristics), but the variation across runs on each node is extremely small.

The 8.6% range of average performance for this set of 31 nodes increases to >12% when considering the full 1736-node SKX partition of the Stampede2 system.

So we have a fix for single-node HPL, what about the full system?


Intel provided a full version of their optimized HPL benchmark in March 2018 and we ran it on the full system in April 2018.

The estimated breakdown of performance improvement into individual contributions is a ballpark estimate — it would be a huge project to measure the details at this scale.

The “practical peak performance” of this system is 8.77 PFLOPS on the KNLs plus 3.73 PFLOPS on the SKX nodes, for 12.5 PFLOPS “practical peak”.  The 10.68 PFLOPS obtained is about 85% of this peak performance.


During the review of the paper, I was able to simplify the test further to allow quick testing on other systems (and larger ensembles).

This is mostly new material (not in the paper).


https://github.com/jdmccalpin/SKX-SF-Conflicts

This lets me more directly address my hypothesis about conflicts with contiguous physical addresses, since each 1GiB page is much larger than the 24 MiB of aggregate L2 cache.


It turns out I was wrong — Snoop Filter Conflicts can occur with contiguous physical addresses on this processor.

The pattern repeats every 256 MiB.

If the re-used space is in the 1st 32 MiB of any 1GiB page, there will be no Snoop Filter Conflicts.

What about other processors?


I tested Skylake Xeon processors with 14, 16, 18, 20, 22, 24, 26, 28 cores, and a 68-core KNL (Xeon Phi 7250).

These four processors are the only ones that show Snoop Filter Conflicts with contiguous physical addresses.

But with random 2MiB pages, all processors with more than 16 cores show Snoop Filter conflicts for some combinations of addresses…..


These are average L2 miss rates — individual cores can have significantly higher miss rates (and the maximum miss rate may be the controlling factor in performance for multi-threaded codes).

The details are interesting, but no time in the current presentation….



Overall, the uncertainty associated with this performance variability is probably more important than the performance loss.

Using performance counter measurements to look for codes that are subject to this issue is a serious “needle in haystack” problem — it is probably easier to choose codes that might have the properties above and test them explicitly.


Cache-contained shallow water model, cache-contained FFTs.


The new DGEMM implementation uses dynamic scheduling of the block updates to decouple the memory access patterns.  There is no guarantee that this will alleviate the Snoop Filter Conflict problem, but in this case it does.


I now have a model that predicts all Snoop Filter Conflicts involving 2MiB pages on the 24-core SKX processors.
Unfortunately, the zonesort approach won’t work under Linux because pages are allocated on a “first-come, first-served” basis, so the fine control required is not possible.

An OS with support for page coloring (such as BSD) could be modified to provide this mitigation.


Again, the inability of Linux to use the virtual address as a criterion for selecting the physical page to use will prevent any sorting-based approach from working.


Intel has prepared a response.  If you are interested, you should ask your Intel representative for a copy.





 


Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture, Linux, Performance Counters | Tagged: , | Comments Off on SC18 paper: HPL and DGEMM performance variability on Intel Xeon Platinum 8160 processors

Using hardware performance counters to determine how often both logical processors are active on an Intel CPU

Posted by John D. McCalpin, Ph.D. on September 17, 2018

Most Intel microprocessors support “HyperThreading” (Intel’s trademark for their implementation of “simultaneous multithreading”) — which allows the hardware to support (typically) two “Logical Processors” for each physical core. Processes running on the two Logical Processors share most of the processor resources (particularly caches and execution units). Some workloads (particularly heterogeneous ones) benefit from assigning processes to all logical processors, while other workloads (particularly homogeneous workloads, or cache-capacity-sensitive workloads) provide the best performance when running only one process on each physical core (i.e., leaving half of the Logical Processors idle).

Last year I was trying to diagnose a mild slowdown in a code, and wanted to be able to use the hardware performance counters to divide processor activity into four categories:

  1. Neither Logical Processor active
  2. Logical Processor 0 Active, Logical Processor 1 Inactive
  3. Logical Processor 0 Inactive, Logical Processor 1 Active
  4. Both Logical Processors Active

It was not immediately obvious how to obtain this split from the available performance counters.

Every recent Intel processor has:

  • An invariant, non-stop Time Stamp Counter (TSC)
  • Three “fixed-function” performance counters per logical processor
    1. Fixed-Function Counter 0: Instructions retired (not used here)
    2. Fixed-Function Counter 1: Actual Cycles Not Halted
    3. Fixed-Function Counter 2: Reference Cycles Not Halted
  • Two or more (typically 4) programmable performance counters per logical processor
    • A few of the “events” are common across all processors, but most are model-specific.

The fixed-function “Reference Cycles Not Halted” counter increments at the same rate as the TSC, but only while the Logical Processor is not halted. So for any interval, I can divide the change in Reference Cycles Not Halted by the change in the TSC to get the “utilization” — the fraction of the time that the Logical Processor was Not Halted. This value can be computed independently for each Logical Processor, but more information is needed to assign cycles to the four categories.   There are some special cases where partial information is available — for example, if the “utilization” is close to 1.0 for both  Logical Processors for an interval, then the processor must have had “Both Logical Processors Active” (category 4) for most of that interval.    On the other hand, if the utilization on each Logical Processor was close to 0.5 for an interval, the two logical processors could have been active at the same time for 1/2 of the cycles (50% idle + 50% both active), or the two logical processors could have been active at separate times (50% logical processor 0 only + 50% logical processor 1 only), or somewhere in between.

Both the fixed-function counters and the programmable counters have a configuration bit called “AnyThread” that, when set, causes the counter to increment if the corresponding event occurs on any logical processor of the core.  This is definitely going to be helpful, but both the algebra and the specific programming of the counters have some subtleties….

The first subtlety is related to some confusing changes in the clocks of various processors and how the performance counter values are scaled.

  • The TSC increments at a fixed rate.
    • For most Intel processors this rate is the same as the “nominal” processor frequency.
      • Starting with Skylake (client) processors, the story is complicated and I won’t go into it here.
    • It is not clear exactly how often (or how much) the TSC is incremented, since the hardware instruction to read the TSC (RDTSC) requires between ~20 and ~40 cycles to execute, depending on the processor frequency and processor generation.
  • The Fixed-Function “Unhalted Reference Cycles” counts at the same rate as the TSC, but only when the processor is not halted.
    • Unlike the TSC, the Fixed-Function “Unhalted Reference Cycles” counter increments by a fixed amount at each increment of a slower clock.
    • For Nehalem and Westmere processors, the slower clock was a 133 MHz reference clock.
    • For Sandy Bridge through Broadwell processors, the “slower clock” was the 100 MHz reference clock referred to as the “XCLK”.
      • This clock was also used in the definition of the processor frequencies.
      • For example, the Xeon E5-2680 processor had a nominal frequency of 2.7 GHz, so the TSC would increment (more-or-less continuously) at 2.7 GHz, while the Fixed-Function “Unhalted Reference Cycles” counter would increment by 27 once every 10 ns (i.e., once every tick of the 100 MHz XCLK).
    • For Skylake and newer processors, the processor frequencies are still defined in reference to a 100 MHz reference clock, but the Fixed-Function “Unhalted Reference Cycles” counter is incremented less frequently.
      • For the Xeon Platinum 8160 (nominally 2.1 GHz), the 25 MHz “core crystal clock” is used, so the counter increments by 84 once every 40 ns, rather than by 21 once every 10 ns.
  • The programmable performance counter event that most closely corresponds to the Fixed-Function “Unhalted Reference Cycles” counter has changed names and definitions several times
    • Nehalem & Westmere: “CPU_CLK_UNHALTED.REF_P” increments at the same rate as the TSC when the processor is not halted.
      • No additional scaling needed.
    • Sandy Bridge through Broadwell: “CPU_CLK_THREAD_UNHALTED.REF_XCLK” increments at the rate of the 100 MHz XCLK (not scaled!) when the processor is not halted.
      • Results must be scaled by the base CPU ratio.
    • Skylake and newer: “CPU_CLK_UNHALTED.REF_XCLK” increments at the rate of the “core crystal clock” (25 MHz on Xeon Scalable processors) when the processor is not halted.
      • Note that the name still includes “XCLK”, but the definition has changed!
      • Results must be scaled by 4 times the base CPU ratio.
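
The generation-dependent scaling above can be collected into a short helper. This is just a sketch of the rules as stated; the function name and generation labels are mine, and the "base ratio" is the nominal frequency divided by the 100 MHz reference clock:

```python
def scale_ref_xclk(raw_count: int, base_ghz: float, generation: str) -> int:
    """Convert a raw programmable-counter reading into TSC-rate reference
    cycles, following the generation-dependent rules described above.
    base_ghz is the nominal (base) frequency, e.g. 2.7 for a Xeon E5-2680."""
    base_ratio = round(base_ghz * 1e9 / 100e6)   # nominal frequency / 100 MHz
    if generation in ("nehalem", "westmere"):
        return raw_count                   # REF_P already counts at TSC rate
    if generation in ("sandybridge", "ivybridge", "haswell", "broadwell"):
        return raw_count * base_ratio      # counts 100 MHz XCLK ticks
    if generation == "skylake":            # and newer: 25 MHz core crystal clock
        return raw_count * 4 * base_ratio  # e.g. 4 * 21 = 84 for a 2.1 GHz part
    raise ValueError(f"unknown generation: {generation}")
```

For the Xeon E5-2680 example above, each XCLK tick contributes 27 TSC-rate cycles; for the Xeon Platinum 8160, each core-crystal-clock tick contributes 84.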

Once the scaling for the programmable performance counter event is handled correctly, we get to move on to the algebra of converting the measurements from what is available to what I want.

For each interval, I assume that I have the following measurements before and after, with the measurements taken as close to simultaneously as possible on the two Logical Processors:

  • TSC (on either logical processor)
  • Fixed-Function “Unhalted Reference Cycles” (on each logical processor)
  • Programmable CPU_CLK_UNHALTED.REF_XCLK with the “AnyThread” bit set (on either Logical Processor)

So each Logical Processor makes two measurements, but they are asymmetric.
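
For reference, on Linux the TSC and the Fixed-Function counter can be sampled from user space through the msr driver. This is a minimal sketch only (MSR addresses are from the Intel SDM; it requires root and `modprobe msr`, and it assumes the fixed-function counters have already been enabled elsewhere via IA32_FIXED_CTR_CTRL, MSR 0x38D):

```python
import os
import struct

# MSR addresses from the Intel SDM (Volume 4):
IA32_TIME_STAMP_COUNTER = 0x10   # the TSC
IA32_FIXED_CTR2 = 0x30B          # Fixed-Function "Unhalted Reference Cycles"

def decode_u64(raw: bytes) -> int:
    """MSR reads return 8 bytes in little-endian order."""
    return struct.unpack("<Q", raw)[0]

def read_msr(cpu: int, msr: int) -> int:
    """Read one 64-bit MSR on a given logical CPU via the Linux msr driver
    (/dev/cpu/N/msr, where the MSR address is the file offset)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return decode_u64(os.pread(fd, 8, msr))
    finally:
        os.close(fd)
```

The AnyThread programmable counter would be set up separately (e.g., through a kernel performance-monitoring interface); only the raw MSR reads are sketched here.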

From these results, the algebra required to split the counts into the desired categories is not entirely obvious.  I eventually worked up the following sequence:

  1. Neither Logical Processor Active == Elapsed TSC – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
  2. Logical Processor 0 Active, Logical Processor 1 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”
  3. Logical Processor 1 Active, Logical Processor 0 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)”
  4. Both Logical Processors Active == “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)” + “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)” – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
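
The algebra can be checked with a short sketch. The AnyThread count accumulates whenever at least one Logical Processor is unhalted (only-LP0 + only-LP1 + both), while each Fixed-Function count accumulates whenever its own Logical Processor is unhalted (only-LPx + both), so adding the two Fixed-Function counts double-counts “both” exactly once. The function name is mine:

```python
def split_tsc_cycles(elapsed_tsc: int, any_unhalted: int,
                     fixed_lp0: int, fixed_lp1: int):
    """Split an interval's elapsed TSC ticks into the four activity
    categories.  All arguments are in TSC-rate reference cycles:
      any_unhalted -- AnyThread CPU_CLK_UNHALTED.REF_XCLK * scaling_factor
      fixed_lp0/1  -- Fixed-Function Unhalted Reference Cycles per LP
    """
    neither  = elapsed_tsc - any_unhalted            # step 1
    only_lp0 = elapsed_tsc - neither - fixed_lp1     # step 2
    only_lp1 = elapsed_tsc - neither - fixed_lp0     # step 3
    both     = fixed_lp0 + fixed_lp1 - any_unhalted  # step 4
    return neither, only_lp0, only_lp1, both
```

A useful sanity check on real measurements: the four categories must sum to the elapsed TSC for the interval.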

Starting with the Skylake core, there is an additional sub-event of the programmable CPU_CLK_UNHALTED event that increments only when the current Logical Processor is active and the sibling Logical Processor is inactive.  This can certainly be used to obtain the same results, but it does not appear to save any effort.   My approach uses only one programmable counter on one of the two Logical Processors — a number that cannot be reduced by using an alternate programmable counter.   Comparison of the two approaches shows that the results are the same, so in the interest of backward compatibility, I continue to use my original approach.

Posted in Performance, Performance Counters, Reference | Comments Off on Using hardware performance counters to determine how often both logical processors are active on an Intel CPU