Starting with the Xeon E5 processors “Sandy Bridge EP” in 2012, all of Intel’s mainstream multicore server processors have included a distributed L3 cache with distributed coherence processing. The L3 cache is divided into “slices”, which are distributed around the chip — typically one “slice” for each processor core.
Each core’s L1 and L2 caches are local and private, but outside the L2 cache addresses are distributed in a random-looking way across the L3 slices all over the chip.
As an easy case, for the Xeon Gold 6142 processor (1st generation Xeon Scalable Processor with 16 cores and 16 L3 slices), every aligned group of 16 cache line addresses is mapped so that one of those 16 cache lines is assigned to each of the 16 L3 slices, using an undocumented permutation generator. The total number of possible permutations of the L3 slice numbers [0…15] is 16! (almost 21 trillion), but measurements on the hardware show that only 16 unique permutations are actually used. The observed sequences are the “binary permutations” of the sequence [0,1,2,3,…,15]. The “binary permutation” operator can be described in several different ways, but the structure is simple:
- binary permutation “0” of a sequence is just the original sequence
- binary permutation “1” of a sequence swaps the elements in each even/odd pair, e.g.,
  - binary permutation 1 of [0,1,2,3,4,5,6,7] is [1,0,3,2,5,4,7,6]
- binary permutation “2” of a sequence swaps pairs of elements in each set of four, e.g.,
  - binary permutation 2 of [0,1,2,3,4,5,6,7] is [2,3,0,1,6,7,4,5]
- binary permutation “3” of a sequence performs both permutation “1” and permutation “2”
- binary permutation “4” of a sequence swaps the 4-element halves of each set of 8 elements, e.g.,
  - binary permutation 4 of [0,1,2,3,4,5,6,7] is [4,5,6,7,0,1,2,3]
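One compact way to describe the operator: binary permutation “k” of a sequence places the element whose index is (i XOR k) at position i. A short Python sketch (my own illustration, not code from the processors or the report) reproduces the swap patterns above:

```python
def binary_permutation(seq, k):
    """Return binary permutation k of seq.

    Element i of the result is element (i XOR k) of the input; this
    matches the even/odd swaps (k=1), pair-of-pairs swaps (k=2),
    half swaps (k=4), and their compositions described above.
    """
    n = len(seq)
    assert n & (n - 1) == 0, "sequence length must be a power of 2"
    assert 0 <= k < n, "k must be a valid permutation number"
    return [seq[i ^ k] for i in range(n)]

base = [0, 1, 2, 3, 4, 5, 6, 7]
print(binary_permutation(base, 1))  # [1, 0, 3, 2, 5, 4, 7, 6]
print(binary_permutation(base, 2))  # [2, 3, 0, 1, 6, 7, 4, 5]
print(binary_permutation(base, 4))  # [4, 5, 6, 7, 0, 1, 2, 3]
```

Because XOR composes, applying permutations “1” and “2” in either order is the same as applying permutation “3” directly, and an N-element sequence has exactly N distinct binary permutations — matching the 16 unique permutations observed on the 16-slice part.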
Binary permutation operators are very cheap to implement in hardware, but they only apply to sequences whose length is a power of 2. When the number of slices is not a power of 2, using binary permutations requires creating a power-of-2-length sequence that is larger than the number of slices and that contains each slice number approximately the same number of times. As an example, the Xeon Scalable Processors (gen1 and gen2) with 24 L3 slices use a 512-element sequence that contains each of the values 0…15 21 times and each of the values 16…23 22 times (16×21 + 8×22 = 512). This almost-uniform “base sequence” is then permuted using the 512 binary permutations that are possible for a 512-element sequence.
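The counting constraint is easy to see in code. The sketch below builds a maximally-even 512-element sequence for 24 slices — purely illustrative, since which slice numbers get the extra copies, and in what order the values appear, is specific to Intel’s actual base sequence (given in the report) — and checks that a binary permutation never changes the multiset of values:

```python
# Illustrative only: a maximally-even 512-element "base sequence" for a
# 24-slice processor.  The real base sequence values come from the report.
N_SLICES = 24
SEQ_LEN = 512  # a 512-element sequence admits 512 binary permutations

count = SEQ_LEN // N_SLICES   # 21: minimum copies of each slice number
extra = SEQ_LEN % N_SLICES    # 8 slice numbers must get one extra copy

base = []
for s in range(N_SLICES):
    # Arbitrary choice for illustration: give the highest-numbered
    # slices the extra copy.
    copies = count + (1 if s >= N_SLICES - extra else 0)
    base.extend([s] * copies)

assert len(base) == SEQ_LEN

# A binary permutation only reorders indices (i XOR k), so every one of
# the 512 permutations of the base sequence stays near-uniform.
k = 137  # arbitrary permutation number for the demonstration
permuted = [base[i ^ k] for i in range(SEQ_LEN)]
assert sorted(permuted) == sorted(base)
```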
Intel does not publish the length of the base sequences, the values in the base sequences, or the formulas used to determine which binary permutation of the base sequence will be used for any particular address in memory.
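The data files announced below express the permutation selection as bit masks. A common form for such selectors in the published slice-hash reverse-engineering literature — used here as an assumption, not as Intel’s documented mechanism — computes each bit of the permutation number as the parity (XOR-reduction) of the physical address ANDed with a mask:

```python
def parity(x):
    """Return the XOR-reduction (parity) of the bits of x."""
    return bin(x).count("1") & 1

def permutation_selector(paddr, masks):
    """Combine one parity bit per mask into a binary permutation number.

    masks[j] selects which physical-address bits feed selector bit j.
    The mask values used below are hypothetical placeholders; the real
    per-processor masks are in the data files from the report.
    """
    k = 0
    for j, mask in enumerate(masks):
        k |= parity(paddr & mask) << j
    return k

# Hypothetical 4-bit selector for a 16-slice part (placeholder masks):
example_masks = [0x0844, 0x1088, 0x2110, 0x4220]
print(permutation_selector(0x12345678, example_masks))
```

Under this model, the slice for an address is found by using the selected binary permutation of the base sequence, indexed by the appropriate low-order cache-line-address bits.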
Over the years, a few folks have investigated the properties of these mappings, and have published a small number of results — typically for smaller L3 slice counts.
Today I am happy to announce the availability of the full base sequences and the binary permutation selector equations for many Intel processors. The set of systems includes:
- Xeon Scalable Processors (gen1 “Skylake Xeon” and gen2 “Cascade Lake Xeon”) with 14, 16, 18, 20, 22, 24, 26, 28 L3 slices
- Xeon Scalable Processors (gen3 “Ice Lake Xeon”) with 28 L3 slices
- Xeon Phi x200 Processors (“Knights Landing”) with 38 Snoop Filter slices
The results for the Xeon Scalable Processors (all generations) are based on my own measurements. The results for the Xeon Phi x200 are based on the mappings published by Kommrusch, et al. (e.g., https://arxiv.org/abs/2011.05422), but re-interpreted in terms of the “base sequence” plus “binary permutation” model used for the other processors.
The technical report and data files (base sequences and permutation selector masks for each processor) are available at https://hdl.handle.net/2152/87595
Have fun!
Next up — using these address-to-L3/CHA mapping results to understand observed L3 and Snoop Filter conflicts in these processors…