The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.
It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)
The modes that are important are:
- “Flat” vs “Cache”
- In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
- The OS exposes this memory as being on “NUMA node 1”, so it can be accessed using the standard NUMA control facilities (e.g., numactl); a minimal allocation sketch is included after this list.
- Sustained bandwidth from MCDRAM is highest in “Flat” mode.
- In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
- In this mode the MCDRAM is invisible and effectively uncontrollable. I will discuss the performance characteristics of Cache mode at a later date.
- “All-to-All” vs “Quadrant”
- In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
- Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
- In “Quadrant” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but each address is assigned to one of the MCDRAM controllers in the same “quadrant” as the coherence controller.
- This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
- Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
- Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
- “Sub-NUMA-Cluster”
- There are several of these modes, only one of which will be discussed here.
- “Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
- “node 0” owns the 1st quarter of contiguous physical address space.
- The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
- Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
- The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
- Ditto for nodes 1, 2, and 3.
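To make the Flat-mode configuration concrete: MCDRAM can be requested for an unmodified binary with “numactl --membind=1 ./a.out”, or programmatically with libnuma. The sketch below is only an illustration (the 1 GiB array size is arbitrary); it assumes one of the Flat modes, in which node 1 is the MCDRAM node.

```c
/* Minimal sketch: allocate an array from MCDRAM in one of the Flat modes,
 * where the OS exposes MCDRAM as NUMA node 1.  Compile with -lnuma.
 * (Equivalent effect for an unmodified binary: "numactl --membind=1 ./a.out") */
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available on this system\n");
        return 1;
    }

    size_t bytes = 1UL << 30;                 /* 1 GiB test array (arbitrary) */
    double *a = numa_alloc_onnode(bytes, 1);  /* node 1 == MCDRAM in Flat mode */
    if (a == NULL) {
        fprintf(stderr, "MCDRAM allocation failed\n");
        return 1;
    }

    /* Touch the pages so they are actually instantiated in MCDRAM. */
    for (size_t i = 0; i < bytes / sizeof(double); i++)
        a[i] = 1.0;

    printf("allocated %zu bytes on NUMA node 1\n", bytes);
    numa_free(a, bytes);
    return 0;
}
```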
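The conjectured address-to-MCDRAM-controller mappings from the initial testing above can be restated in code. These are not documented mappings; the bit positions assume 64-byte cache lines, and the more complex Flat-Quadrant hash (based on many high-order address bits) is not captured here.

```c
/* Conjectured mappings of physical addresses to the 8 MCDRAM controllers,
 * restating the initial test results described above -- not documented
 * behavior.  Bit positions assume 64-byte cache lines (offset bits 0-5). */

/* Flat-All-to-All mode: simple modulo-8 on the 3 address bits
 * immediately above the cache-line offset. */
static inline unsigned mcdram_controller_all2all(unsigned long paddr)
{
    return (paddr >> 6) & 0x7;
}

/* SNC4 mode: within a "quadrant" (NUMA node), consecutive cache lines
 * appear to alternate between that node's two MCDRAM controllers,
 * e.g., controllers 0 and 1 for node 0. */
static inline unsigned mcdram_controller_snc4(unsigned long paddr, unsigned node)
{
    return 2 * node + ((paddr >> 6) & 0x1);
}
```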
The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).
My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory. The values presented are averaged over many addresses, with the maximum, minimum, and standard deviation rows showing how the per-core average latencies vary across cores.
Mode of Operation | Flat-Quadrant | Flat-All2All | SNC4 local | SNC4 remote |
---|---|---|---|---|
MCDRAM maximum latency (ns) | 156.1 | 158.3 | 153.6 | 164.7 |
MCDRAM average latency (ns) | 154.0 | 155.9 | 150.5 | 156.8 |
MCDRAM minimum latency (ns) | 152.3 | 154.4 | 148.3 | 150.3 |
MCDRAM standard deviation (ns) | 1.0 | 1.0 | 0.9 | 3.1 |
Caveats:
- My latency tester uses permutations of even-numbered cache lines in various sized address range blocks (a sketch of this style of measurement follows these caveats), so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
- Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
- E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
- Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.
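For readers unfamiliar with this style of measurement, the sketch below shows the basic pointer-chasing idea: build a random cyclic permutation of cache lines in a buffer, then time a serial chain of dependent loads. This is an illustration of the general technique only, not the tester used for the numbers above (which controls page size, core binding, and the permutation pattern much more carefully); buffer size, seed, and timing details are arbitrary.

```c
/* Minimal sketch of pointer-chasing latency measurement: a random cyclic
 * permutation of cache lines forces each load to depend on the previous
 * one, so (elapsed time) / (number of loads) approximates the load-to-use
 * latency of the memory holding the buffer.  Illustration only. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LINE  64                 /* cache line size in bytes */
#define LINES (1UL << 20)        /* 64 MiB buffer: far larger than the caches */

int main(void)
{
    char   *buf   = malloc(LINES * LINE);
    size_t *order = malloc(LINES * sizeof(size_t));
    if (!buf || !order) return 1;

    /* Fisher-Yates shuffle of the line indices... */
    for (size_t i = 0; i < LINES; i++) order[i] = i;
    srand(12345);
    for (size_t i = LINES - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }

    /* ...then link each line to the next one in the shuffled order,
     * forming a single cycle of dependent pointers. */
    for (size_t i = 0; i < LINES; i++)
        *(void **)(buf + order[i] * LINE) = buf + order[(i + 1) % LINES] * LINE;

    /* Chase the chain once and report the average time per load. */
    void *p = buf + order[0] * LINE;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < LINES; i++)
        p = *(void **)p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average load latency: %.1f ns (final pointer %p)\n",
           ns / LINES, p);        /* printing p keeps the chase from being optimized away */
    free(order);
    free(buf);
    return 0;
}
```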
Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….
The corresponding data for addresses mapped to DDR4 memory are included in the table below:
Mode of Operation | Flat-Quadrant | Flat-All2All | SNC4 local | SNC4 remote |
---|---|---|---|---|
DDR4 maximum latency (ns) | 133.3 | 136.8 | 130.0 | 141.5 |
DDR4 average latency (ns) | 130.4 | 131.8 | 128.2 | 133.1 |
DDR4 minimum latency (ns) | 128.2 | 128.5 | 125.4 | 126.5 |
DDR4 standard deviation (ns) | 1.2 | 2.4 | 1.1 | 3.1 |
There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.