Disabled Core Patterns and Core Defect Rates in Intel Xeon Phi x200 (Knights Landing) Processors

Defect rates and chip yields in the fabrication of complex semiconductor chips (like processors) are typically very tightly held secrets. In the current era of multicore processors even the definition of “yield” requires careful thinking — companies have adapted their designs to tolerate defects in single processor cores, allowing them to sell “partial good” die at lower prices. This has been good for everyone — the vendors get to sell more of the “stuff” that comes off the manufacturing line, and the customers have a wider range of products to choose from.

Knowing that many processor chips use “partial good” die does not usually help customers to infer anything about the yield of a multicore chip at various core counts, even when purchasing thousands of chips. It is possible that the Xeon Phi x200 (“Knights Landing”, “KNL”) processors are in a different category — one that allows statistically interesting inferences to be drawn about core defect rates.

Why is the Xeon Phi x200 different?

It was developed for (and is of interest to) almost exclusively customers in the High Performance Computing (HPC) market.
The chip has 76 cores (in 38 pairs), and the only three product offerings had 64, 68, or 72 cores enabled.
- No place to sell chips with many defects.
The processor core is slower than most mainstream Xeon processors in both frequency and instruction level parallelism.
- No place to sell chips that don’t meet the frequency requirements.

The Texas Advanced Computing Center currently runs 4200 compute servers using the Xeon Phi 7250 (68-core) processor. The first 504 were installed in June 2016 and the remaining 3696 were installed in April 2017. Unlike the mainstream Xeon processors, the Xeon Phi x200 enables any user to determine which physical cores on the die are disabled, simply by running the CPUID instruction on each active logical processor to obtain that core’s X2APIC ID (used by the interrupt controller). There is a 1:1 correspondence between the X2APIC IDs and the physical core locations on the die, so any cores that are disabled will result in missing X2APIC values in the list. More details on the X2APIC IDs are in the technical report “Observations on Core Numbering and “Core ID’s” in Intel Processors” and more details on the mapping of X2APIC IDs to locations on the die are in the technical report Mapping Core, CHA, and Memory Controller Numbers to Die Locations in Intel Xeon Phi x200 (“Knights Landing”, “KNL”) Processors.

The lists of disabled cores were collected at various points over the last 4.5 years and at some point during the COVID-19 pandemic I decided to look at them. The first result was completely expected — cores are always enabled/disabled in pairs. This matches the way they are placed on the die: each of the 38 “tiles” has 2 cores, a 1MiB shared L2 cache, and a coherence agent making up a “tile”. The second result was unexpected — although every tile had disabled cores in at least some processors, there were four tile positions where the cores were disabled 15x-20x more than average. In “Figure 5” below, these “preferred” tiles were the ones immediately above and below the memory controllers IMC0 and IMC1 on the left and right sides of the chip — numbers 2, 8, 27, 37.

Numbering and locations of CHAs and memory controllers in Xeon Phi x200 processors.

After reviewing the patterns in more detail, it seemed that these four “preferred” locations could be considered “spares”. The cores at the other 34 tiles would be enabled if they were functional, and if any of those tiles had a defect, a “spare” would be enabled to compensate. If true, this would be a a very exciting result because it means that even though every one of the 4200 chips has exactly 4 tiles with disabled cores, the presence of disabled cores anywhere other than the “preferred” locations indicated a defect. If there were no defects on the chip (or only defects in the spare tiles themselves), then the only four tiles with disabled cores would be 2, 8, 27, 37. This was actually the case for about 1290 of the 4200 chips.

The number of chips with disabled cores at each of the 34 “standard” (non-preferred) locations varied rather widely, but looked random. Was there any way to evaluate whether the results were consistent with a model of a small number of random defects, with those cores being replaced by activating cores in the spare tiles? Yes, there is, and for the statistically minded you can read all about it in the technical report Disabled Core Patterns and Core Defect Rates in Xeon Phi x200 (“Knights Landing”) Processors. The report contains all sorts of mind-numbing discussions of “truncated binomial distributions”, corrections for visibility of defects, and statistical significance tests for several different views of the data — but it does have brightly colored charts and graphs to attempt to offset those soporific effects.

For the less statistically minded, the short description is:

For the 504 processors deployed in June 2016, the average number of “defects” was 1.38 per chip.
For the 3696 processors deployed in April 2017, the average number of “defects” was 1.19 per chip.
The difference in these counts was very strongly statistically significant (3.7 standard deviations).
Although some of the observed values are slightly outside the ranges expected for a purely random process, the overall pattern is strongly consistent with a model of random, independent defects.

These are very good numbers — for the full cluster the average number of defects is projected to be 1.36 per chip (including an estimate of defects in the unused “spare” tiles). With these defect rates, only about 1% of the chips would be expected to have more than 4 defects — and almost all of these would still suffice for the 64-core model.

So does this have anything to do with “yield”? Probably not a whole lot — all of these chips require that all 8 Embedded DRAM Controllers (EDCs) are fully functional, all 38 Coherence Agents are fully functional, both DDR4 memory controllers are fully functional, and the IO blocks are fully functional. There is no way to infer how many chips might be lost due to failures in any of those parts because there were no product offerings that allowed any of those blocks to be disabled. But from the subset of chips that had all the “non-core” parts working, these results paint an encouraging picture with regard to defect rates for the cores.