A Quick Note
There are a lot of topics that could be addressed here, but this short note will focus on bandwidth from main memory (using the STREAM benchmark) as a function of the number of threads used.
Published STREAM Bandwidth Results
- Official STREAM submission at: http://www.cs.virginia.edu/stream/stream_mail/2013/0015.html
- Compiled with `icc -mmic -O3 -openmp -DNTIMES=100 -DSTREAM_ARRAY_SIZE=64000000 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -opt-streaming-stores always stream_5-10.c -o stream_intelopt.100x.mic`
- Configured with an array size of 64 million elements per array and 100 iterations.
- Run with 60 threads (bound to separate physical cores) and Transparent Huge Pages.
Function | Best Rate MB/s | Avg time (sec) | Min time (sec) | Max time (sec) |
---|---|---|---|---|
Copy | 169446.8 | 0.0062 | 0.0060 | 0.0063 |
Scale | 169173.1 | 0.0062 | 0.0061 | 0.0063 |
Add | 174824.3 | 0.0090 | 0.0088 | 0.0091 |
Triad | 174663.2 | 0.0089 | 0.0088 | 0.0091 |
Memory Controllers
The Xeon Phi SE10P has 8 memory controllers, each controlling two 32-bit channels. Each 32-bit channel has two GDDR5 chips, each having a 16-bit-wide interface. Each of the 32 GDDR5 DRAM chips has 16 banks. This gives a *raw* total of 512 DRAM banks. BUT:
- The two GDDR5 chips on each 32-bit channel are operating in “clamshell” mode, emulating a single GDDR5 chip with a 32-bit-wide interface. (This is done for cost reduction: two 2 Gbit chips with x16 interfaces were presumably a cheaper option than one 4 Gbit chip with a x32 interface.) This reduces the effective number of DRAM banks to 256 (but the effective bank size is doubled from 2 KiB to 4 KiB).
- The two 32-bit channels for each memory controller operate in lockstep — creating a logical 64-bit interface. Since every cache line is spread across the two 32-bit channels, this reduces the effective number of DRAM banks to 128 (but the effective bank size is doubled again, from 4 KiB to 8 KiB).
So the Xeon Phi SE10P memory subsystem should be analyzed as a 128-bank resource. Intel has not disclosed the details of the mapping of physical addresses onto DRAM channels and banks, but my own experiments have shown that addresses are mapped to a repeating permutation of the 8 memory controllers in blocks of 62 cache lines. (The other 2 cache lines in each 64-cacheline block are used to hold the error-correction codes for the block.)
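As a quick sanity check on this arithmetic, here is a minimal sketch in C that just reproduces the bank counts and effective bank sizes described above (all of the numbers come from the text; nothing here is measured):

```c
#include <stdio.h>

int main(void)
{
    /* Topology as described above (Xeon Phi SE10P) */
    const int controllers     = 8;   /* memory controllers                  */
    const int channels_per_mc = 2;   /* 32-bit GDDR5 channels per controller */
    const int chips_per_chan  = 2;   /* x16 GDDR5 chips per 32-bit channel   */
    const int banks_per_chip  = 16;  /* banks per GDDR5 chip                 */
    const int raw_bank_kib    = 2;   /* per-chip bank size (KiB)             */

    int raw_banks = controllers * channels_per_mc * chips_per_chan * banks_per_chip;
    printf("raw banks              : %d\n", raw_banks);                     /* 512 */

    /* Clamshell mode: the two x16 chips on a channel act as one x32 chip */
    int clamshell_banks = raw_banks / 2;
    int clamshell_kib   = raw_bank_kib * 2;
    printf("after clamshell        : %d banks of %d KiB\n",
           clamshell_banks, clamshell_kib);                                  /* 256, 4 */

    /* Lockstep channels: the two 32-bit channels form one 64-bit interface */
    int effective_banks = clamshell_banks / 2;
    int effective_kib   = clamshell_kib * 2;
    printf("after lockstep channels: %d banks of %d KiB\n",
           effective_banks, effective_kib);                                  /* 128, 8 */

    /* Of each 64-cache-line block, 62 lines hold data and 2 hold ECC */
    printf("usable data fraction   : %.3f\n", 62.0 / 64.0);                  /* ~0.969 */

    return 0;
}
```

Running it prints 512 raw banks, 256 banks of 4 KiB after clamshell, and 128 banks of 8 KiB after channel lockstep, matching the analysis above.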
STREAM Bandwidth vs Number of Data Access Streams
One “rule of thumb” that I have found on Xeon Phi is that memory-bandwidth-limited jobs run best when the number of read streams across all the threads is close to, but less than, the number of GDDR5 DRAM banks. On the Xeon Phi SE10P coprocessors in the TACC Stampede system, this is 128 (see Note 1). Some data from the STREAM benchmark supports this hypothesis (the table shows the change in bandwidth relative to one thread per core):
Kernel | Reads | Writes | 2 threads/core | 3 threads/core | 4 threads/core |
---|---|---|---|---|---|
Copy | 1 | 1 | -0.8% | -5.2% | -7.3% |
Scale | 1 | 1 | -1.0% | -3.3% | -6.7% |
Add | 2 | 1 | -3.1% | -12.0% | -13.6% |
Triad | 2 | 1 | -3.6% | -11.2% | -13.5% |
From these results you can see that the Copy and Scale kernels deliver about the same performance with either 1 or 2 threads per core (61 or 122 read streams), but drop 3%-7% when generating more than 128 read streams (3 or 4 threads per core). The Add and Triad kernels are definitely best with one thread per core (122 read streams), and drop 3%-14% when generating more than 128 read streams (2, 3, or 4 threads per core).
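To make the stream counting explicit, here is a small sketch that tabulates the total number of read streams for each kernel at 1-4 threads per core (assuming 61 cores, as implied by the 61 and 122 read-stream counts above) and flags the cases that exceed the 128-bank figure:

```c
#include <stdio.h>

int main(void)
{
    const int cores = 61;    /* one thread group per physical core */
    const int banks = 128;   /* effective DRAM banks (see above)   */

    /* read streams per thread for each STREAM kernel */
    const char *kernel[] = { "Copy", "Scale", "Add", "Triad" };
    const int   reads[]  = { 1, 1, 2, 2 };

    printf("%-6s %9s %13s %8s\n", "Kernel", "thr/core", "read streams", ">banks?");
    for (int k = 0; k < 4; k++) {
        for (int tpc = 1; tpc <= 4; tpc++) {
            int streams = reads[k] * cores * tpc;
            printf("%-6s %9d %13d %8s\n",
                   kernel[k], tpc, streams, streams > banks ? "yes" : "no");
        }
    }
    return 0;
}
```

The output shows that Copy and Scale first exceed 128 read streams at 3 threads/core, while Add and Triad already exceed it at 2 threads/core, which is exactly where the bandwidth drops appear in the table.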
So why am I not counting the write streams?
I found this puzzling for a long time, then I remembered that the Xeon E5-2600 series processors have a memory controller that supports multiple modes of prioritization. The default mode is to give priority to reads while buffering stores. Once the store buffers in the memory controller reach a “high water mark”, the mode shifts to giving priority to the stores while buffering reads. The basic architecture is implied by the descriptions of the “major modes” in section 2.5.8 of the Xeon E5-2600 Product Family Uncore Performance Monitoring Guide (document 327043; I use revision 001, dated March 2012).

So *if* Xeon Phi adopts a similar multi-mode strategy, the next question is whether the duration in each mode is long enough that the open page efficiency is determined primarily by the number of streams in each mode, rather than by the total number of streams.

For STREAM Triad, the observed bandwidth is ~175 GB/s. Combining this with the observed average memory latency of about 275 ns (idle), Little's Law says that at least 175 GB/s * 275 ns = 48,125 bytes need to be in flight at any time. This is about 768 cache lines (rounded up to a convenient number), or 96 cache lines per memory controller. For STREAM Triad (two reads and one write per element), this corresponds to an average of 64 cache line reads and 32 cache line writes in each memory controller at all times. If the memory controller switches between “major modes” in which it does 64 cache line reads (from the two read streams, while buffering writes) and 32 cache line writes (from the one write stream, while buffering reads), the number of DRAM banks needed at any one time should be close to the number of banks needed for the read streams only….
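The concurrency arithmetic in the preceding paragraph can also be written out as a short sketch; it just applies Little's Law (concurrency = bandwidth × latency) to the observed ~175 GB/s and ~275 ns figures, with the rounding to 768 cache lines taken from the text:

```c
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double bw_bytes_per_s = 175e9;    /* observed Triad bandwidth (~175 GB/s)   */
    const double latency_s      = 275e-9;   /* observed idle memory latency (~275 ns) */
    const double line_bytes     = 64.0;     /* cache line size                        */
    const int    controllers    = 8;

    /* Little's Law: bytes that must be in flight to sustain this bandwidth */
    double inflight_bytes = bw_bytes_per_s * latency_s;   /* 48125 bytes */
    double inflight_lines = inflight_bytes / line_bytes;  /* ~752 lines  */
    double lines_rounded  = 768.0;                        /* rounded up as in the text */
    double per_controller = lines_rounded / controllers;  /* 96 lines per controller   */

    /* Triad moves two read lines for every one written line */
    double reads_per_mc  = per_controller * 2.0 / 3.0;    /* 64 */
    double writes_per_mc = per_controller * 1.0 / 3.0;    /* 32 */

    printf("bytes in flight      : %.0f\n", inflight_bytes);
    printf("cache lines in flight: %.0f (rounded to %.0f)\n",
           ceil(inflight_lines), lines_rounded);
    printf("lines per controller : %.0f (%.0f reads + %.0f writes)\n",
           per_controller, reads_per_mc, writes_per_mc);
    return 0;
}
```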