Most Intel microprocessors support “HyperThreading” (Intel’s trademark for its implementation of “simultaneous multithreading”), which allows the hardware to support (typically) two “Logical Processors” for each physical core. Processes running on the two Logical Processors share most of the processor resources (particularly caches and execution units). Some workloads (particularly heterogeneous ones) benefit from assigning processes to all Logical Processors, while other workloads (particularly homogeneous or cache-capacity-sensitive workloads) perform best when running only one process on each physical core (i.e., leaving half of the Logical Processors idle).
Last year I was trying to diagnose a mild slowdown in a code, and wanted to be able to use the hardware performance counters to divide processor activity into four categories:
- Neither Logical Processor active
- Logical Processor 0 Active, Logical Processor 1 Inactive
- Logical Processor 0 Inactive, Logical Processor 1 Active
- Both Logical Processors Active
It was not immediately obvious how to obtain this split from the available performance counters.
Every recent Intel processor has:
- An invariant, non-stop Time Stamp Counter (TSC)
- Three “fixed-function” performance counters per logical processor
  - Fixed-Function Counter 0: Instructions retired (not used here)
  - Fixed-Function Counter 1: Actual Cycles Not Halted
  - Fixed-Function Counter 2: Reference Cycles Not Halted
- Two or more (typically 4) programmable performance counters per logical processor
  - A few of the “events” are common across all processors, but most are model-specific.
The fixed-function “Reference Cycles Not Halted” counter increments at the same rate as the TSC, but only while the Logical Processor is not halted. So for any interval, I can divide the change in Reference Cycles Not Halted by the change in the TSC to get the “utilization” — the fraction of the time that the Logical Processor was Not Halted. This value can be computed independently for each Logical Processor, but more information is needed to assign cycles to the four categories. There are some special cases where partial information is available — for example, if the “utilization” is close to 1.0 for both Logical Processors for an interval, then the processor must have had “Both Logical Processors Active” (category 4) for most of that interval. On the other hand, if the utilization on each Logical Processor was close to 0.5 for an interval, the two logical processors could have been active at the same time for 1/2 of the cycles (50% idle + 50% both active), or the two logical processors could have been active at separate times (50% logical processor 0 only + 50% logical processor 1 only), or somewhere in between.
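The utilization arithmetic, and the ambiguity it leaves, can be sketched in a few lines. This is only an illustration: the function names `utilization` and `both_active_bounds` are hypothetical, and the counter values are assumed to have been read already by some other means.

```python
def utilization(ref_start, ref_end, tsc_start, tsc_end):
    """Fraction of the interval in which one Logical Processor was not halted.

    ref_* are readings of Fixed-Function Counter 2 (Reference Cycles Not
    Halted) and tsc_* are TSC readings taken at the same two points.
    """
    elapsed_tsc = tsc_end - tsc_start
    if elapsed_tsc <= 0:
        raise ValueError("TSC must advance over the interval")
    return (ref_end - ref_start) / elapsed_tsc


def both_active_bounds(util0, util1):
    """Range of possible 'both active' fractions given only the two
    per-thread utilizations (no AnyThread measurement)."""
    lower = max(0.0, util0 + util1 - 1.0)  # forced overlap, if any
    upper = min(util0, util1)              # complete overlap
    return lower, upper


if __name__ == "__main__":
    u = utilization(0, 1_050_000_000, 0, 2_100_000_000)
    print(u, both_active_bounds(u, u))
```

With both utilizations at 0.5 the bounds span everything from no overlap to complete overlap, which is why the per-thread counters alone cannot separate the four categories.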
Both the fixed-function counters and the programmable counters have a configuration bit called “AnyThread” that, when set, causes the counter to increment if the corresponding event occurs on any logical processor of the core. This is definitely going to be helpful, but both the algebra and the specific programming of the counters have some subtleties.
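As a rough illustration of what setting “AnyThread” on a programmable counter looks like, the sketch below assembles an event-select value. The bit positions (USR = 16, OS = 17, AnyThread = 21, Enable = 22) and the event/umask pair 0x3C/0x01 for the reference-cycles event are my reading of the Intel SDM, not something stated in this post, and the helper name `perfevtsel` is hypothetical. Check the documentation for your processor before writing any MSR, since support for this bit varies by generation.

```python
# Assumed IA32_PERFEVTSELx bit layout (verify against the Intel SDM).
USR = 1 << 16         # count while in user mode
OS = 1 << 17          # count while in kernel mode
ANY_THREAD = 1 << 21  # count events from any Logical Processor of the core
ENABLE = 1 << 22      # enable the counter


def perfevtsel(event, umask, any_thread=False, user=True, kernel=True):
    """Build an event-select value from its fields (illustrative only)."""
    value = (event & 0xFF) | ((umask & 0xFF) << 8) | ENABLE
    if user:
        value |= USR
    if kernel:
        value |= OS
    if any_thread:
        value |= ANY_THREAD
    return value


if __name__ == "__main__":
    # Reference-cycles event, per-thread and with AnyThread set.
    print(hex(perfevtsel(0x3C, 0x01)))
    print(hex(perfevtsel(0x3C, 0x01, any_thread=True)))
```

The only difference between the two values is bit 21, so the same event can be counted per thread or per core by toggling a single bit.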
The first subtlety is related to some confusing changes in the clocks of various processors and how the performance counter values are scaled.
- The TSC increments at a fixed rate.
  - For most Intel processors this rate is the same as the “nominal” processor frequency.
    - Starting with Skylake (client) processors, the story is complicated and I won’t go into it here.
  - It is not clear exactly how often (or how much) the TSC is incremented, since the hardware instruction to read the TSC (RDTSC) requires between ~20 and ~40 cycles to execute, depending on the processor frequency and processor generation.
- The Fixed-Function “Unhalted Reference Cycles” counts at the same rate as the TSC, but only when the processor is not halted.
  - Unlike the TSC, the Fixed-Function “Unhalted Reference Cycles” counter increments by a fixed amount at each increment of a slower clock.
    - For Nehalem and Westmere processors, the slower clock was a 133 MHz reference clock.
    - For Sandy Bridge through Broadwell processors, the “slower clock” was the 100 MHz reference clock referred to as the “XCLK”.
      - This clock was also used in the definition of the processor frequencies.
      - For example, the Xeon E5-2680 processor had a nominal frequency of 2.7 GHz, so the TSC would increment (more-or-less continuously) at 2.7 GHz, while the Fixed-Function “Unhalted Reference Cycles” counter would increment by 27 once every 10 ns (i.e., once every tick of the 100 MHz XCLK).
    - For Skylake and newer processors, the processor frequencies are still defined in reference to a 100 MHz reference clock, but the Fixed-Function “Unhalted Reference Cycles” counter is incremented less frequently.
      - For the Xeon Platinum 8160 (nominally 2.1 GHz), the 25 MHz “core crystal clock” is used, so the counter increments by 84 once every 40 ns, rather than by 21 once every 10 ns.
- The programmable performance counter event that most closely corresponds to the Fixed-Function “Unhalted Reference Cycles” counter has changed names and definitions several times:
  - Nehalem & Westmere: “CPU_CLK_UNHALTED.REF_P” increments at the same rate as the TSC when the processor is not halted.
    - No additional scaling needed.
  - Sandy Bridge through Broadwell: “CPU_CLK_THREAD_UNHALTED.REF_XCLK” increments at the rate of the 100 MHz XCLK (not scaled!) when the processor is not halted.
    - Results must be scaled by the base CPU ratio.
  - Skylake and newer: “CPU_CLK_UNHALTED.REF_XCLK” increments at the rate of the “core crystal clock” (25 MHz on Xeon Scalable processors) when the processor is not halted.
    - Note that the name still includes “XCLK”, but the definition has changed!
    - Results must be scaled by 4 times the base CPU ratio.
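The scaling rules in the list above reduce to one ratio: the nominal (TSC) frequency divided by the frequency of the clock that drives the programmable event. The sketch below encodes that. The table keys and the function name `ref_xclk_scale` are hypothetical labels chosen for illustration, and the 25 MHz entry applies only to the Xeon Scalable parts mentioned above.

```python
# Clock (in MHz) that drives the programmable reference-cycles event,
# taken from the list above. None means the event already counts at
# the TSC rate, so no scaling is needed.
EVENT_CLOCK_MHZ = {
    "nehalem": None,
    "westmere": None,
    "sandy_bridge": 100,
    "ivy_bridge": 100,
    "haswell": 100,
    "broadwell": 100,
    "skylake_xeon": 25,  # core crystal clock on Xeon Scalable processors
}


def ref_xclk_scale(generation, nominal_mhz):
    """Factor that converts raw programmable-event counts to TSC cycles."""
    clock_mhz = EVENT_CLOCK_MHZ[generation]
    if clock_mhz is None:
        return 1.0
    return nominal_mhz / clock_mhz


if __name__ == "__main__":
    print(ref_xclk_scale("sandy_bridge", 2700))  # Xeon E5-2680
    print(ref_xclk_scale("skylake_xeon", 2100))  # Xeon Platinum 8160
```

For the two processors used as examples above, this gives 27 (the base CPU ratio) for the Xeon E5-2680 and 84 (4 times the base CPU ratio of 21) for the Xeon Platinum 8160.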
Once the scaling for the programmable performance counter event is handled correctly, we get to move on to the algebra of converting the measurements from what is available to what I want.
For each interval, I assume that I have the following measurements before and after, with the measurements taken as close to simultaneously as possible on the two Logical Processors:
- TSC (on either logical processor)
- Fixed-Function “Unhalted Reference Cycles” (on each logical processor)
- Programmable CPU_CLK_UNHALTED.REF_XCLK with the “AnyThread” bit set (on either Logical Processor)
So each Logical Processor makes two measurements, but they are asymmetric.
From these results, the algebra required to split the counts into the desired categories is not entirely obvious. I eventually worked up the following sequence:
- Neither Logical Processor Active == Elapsed TSC – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
- Logical Processor 0 Active, Logical Processor 1 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)”
- Logical Processor 1 Active, Logical Processor 0 Inactive == Elapsed TSC – “Neither Logical Processor Active” – “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)”
- Both Logical Processors Active == “Fixed-Function Reference Cycles Not Halted (Logical Processor 0)” + “Fixed-Function Reference Cycles Not Halted (Logical Processor 1)” – CPU_CLK_UNHALTED.REF_XCLK*scaling_factor
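The split can be sketched as a single function operating on counter deltas for one interval. The function and argument names are hypothetical, and the inputs are assumed to be differences between the "after" and "before" readings. Each per-thread count includes the cycles in which both Logical Processors were active, so the overlap is the sum of the two per-thread counts minus the scaled AnyThread count.

```python
def split_activity(elapsed_tsc, ref0, ref1, any_thread_raw, scaling_factor):
    """Divide an interval into the four activity categories.

    elapsed_tsc     change in the TSC
    ref0, ref1      change in Fixed-Function Reference Cycles Not Halted
                    on Logical Processor 0 and 1
    any_thread_raw  change in the programmable reference-cycles event
                    with AnyThread set (unscaled)
    scaling_factor  converts any_thread_raw to TSC cycles
    """
    any_active = any_thread_raw * scaling_factor
    neither = elapsed_tsc - any_active
    return {
        "neither": neither,
        "lp0_only": elapsed_tsc - neither - ref1,
        "lp1_only": elapsed_tsc - neither - ref0,
        "both": ref0 + ref1 - any_active,
    }


if __name__ == "__main__":
    # Made-up deltas for a 2.1 GHz part with a scaling factor of 84.
    result = split_activity(84_000, 42_000, 33_600, 600, 84)
    print(result)
    assert sum(result.values()) == 84_000
```

On real measurements the categories can come out slightly negative or fail to sum exactly to the elapsed TSC, because the counters are not read at exactly the same instant and the reference counters advance in coarse steps.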
Starting with the Skylake core, there is an additional sub-event of the programmable CPU_CLK_UNHALTED event that increments only when the current Logical Processor is active and the sibling Logical Processor is inactive. This can certainly be used to obtain the same results, but it does not appear to save any effort. My approach uses only one programmable counter on one of the two Logical Processors — a number that cannot be reduced by using an alternate programmable counter. Comparison of the two approaches shows that the results are the same, so in the interest of backward compatibility, I continue to use my original approach.
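For comparison, here is a minimal sketch of how the same four categories could be derived from the Skylake sub-event. It assumes the sub-event is counted on Logical Processor 0 only and that its count has already been scaled to TSC cycles; the function and argument names are hypothetical, and this is my reconstruction of the algebra rather than the code used for the comparison described above.

```python
def split_activity_alt(elapsed_tsc, ref0, ref1, lp0_alone):
    """Four-way split using 'this thread active, sibling inactive' counts.

    elapsed_tsc  change in the TSC
    ref0, ref1   change in Fixed-Function Reference Cycles Not Halted
                 on Logical Processor 0 and 1
    lp0_alone    scaled change in the sub-event, measured on Logical
                 Processor 0 (assumed already in TSC cycles)
    """
    both = ref0 - lp0_alone  # LP0 active but not alone
    lp1_only = ref1 - both   # LP1 active but LP0 not
    neither = elapsed_tsc - lp0_alone - lp1_only - both
    return {
        "neither": neither,
        "lp0_only": lp0_alone,
        "lp1_only": lp1_only,
        "both": both,
    }


if __name__ == "__main__":
    print(split_activity_alt(84_000, 42_000, 33_600, 16_800))
```

This variant still needs one programmable counter on one Logical Processor, which is consistent with the observation that the alternate event does not reduce the counter budget.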