I need to use the whiteboard in my office for something else, so it is time to transcribe the table of Opteron system memory latencies from there to here.
The following table presents my most reliable and accurate measurements of open page memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench), carefully implemented to avoid memory prefetching by either the CPU cores or the memory controller.
Processor Family/Revision | 1-socket systems | 2-socket systems | 4-socket systems |
---|---|---|---|
Opteron K8, Rev E/Rev F (2.4 – 3.0 GHz) | 60 ns | 60 ns | 95 ns |
Family10h, Rev B3 (2.3 GHz) | (not yet measured) | 85 ns | 100 ns / 130 ns |
Family10h, Rev C2 (2.9 GHz) | 54 ns | 74 ns | (not yet measured) |
Family10h, Rev D0 (2.6 GHz) | (not yet measured) | 78 ns | 54 ns |
Notes:
- Results updated 2010-10-04 to provide separate 1-socket and 2-socket numbers!
- Memory latency is weakly dependent on CPU frequency, Northbridge frequency, and DRAM frequency in Opteron systems.
- Memory latency is determined by the longer of two times: the time required to get the data from DRAM and the time required to receive “snoop responses” from all the other chips in the system.
- Family10h Revision B3 Opterons have higher latency than the K8 Opterons for a variety of reasons:
- The memory controller in K8 Opterons runs at the full CPU speed, while the memory controller in Family10h Opterons runs at a lower frequency.
- The difference in frequencies between the CPU and memory controller in Family10h Opterons requires an asynchronous boundary between the two. This increases latency.
- Family10h Opterons have a shared L3 cache that must be checked before sending load requests to memory. Since the L3 cache is physically larger than the L2 caches and is shared across all the cores on the chip, extra latency is incurred for requests that miss in the L3 and go to memory.
- Family10h Opterons support the HyperTransport version 3 protocol (though Revisions B & C run in HyperTransport version 1 compatibility mode), which appears to add some latency to the probe responses.
- Family10h Revision B and C Opterons in 4-socket systems may have different latencies for different sockets, depending on the details of the HyperTransport interconnect. On the TACC “Ranger” system, the SunBlade x6420 nodes have two sockets that have direct connections to all the other sockets, and two sockets that are only connected to two of the other three sockets. The sockets with direct connections to all other sockets display a local memory latency of 100 ns, while the sockets that are not directly connected to all other sockets have to wait longer for some probe responses and show a latency of 130 ns.
- Family10h Revision D0 4-socket systems have decreased local memory latency because of the “HT Assist” feature, which uses 1 MB of the shared L3 to maintain a directory of lines that are potentially in a modified state in another processor’s cache. If the cache line is not listed in the directory, then the value in memory is current and probes do not need to be broadcast to other chips if all you want to do is read the data.
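The read-path decision that HT Assist makes can be sketched roughly as follows. The types and function names here are invented purely for illustration and do not correspond to any real hardware interface:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an HT Assist ("probe filter") directory: it records lines
   that may be held modified in a remote processor's cache.
   All names and types here are hypothetical. */
typedef struct {
    unsigned long line_addr;
    bool possibly_modified_remote;
} dir_entry;

/* Returns true if probes must be broadcast to the other sockets before a
   read can complete; false means the copy in local memory is current. */
bool read_needs_probe_broadcast(const dir_entry *dir, size_t n,
                                unsigned long line_addr)
{
    for (size_t i = 0; i < n; i++)
        if (dir[i].line_addr == line_addr && dir[i].possibly_modified_remote)
            return true;   /* a remote copy may be newer: must probe */
    return false;          /* not listed as dirty: read locally, no probes */
}
```

Skipping the probe broadcast is exactly what removes the cross-socket snoop time from the critical path in the 4-socket D0 systems.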
If anyone has a working 4-socket Family10h Revision C2 (“Shanghai”) system, I would like to run this benchmark to fill in the last entry in the table!
John McCalpin says
Someone asked how I avoided prefetching. In this case I avoided the prefetch by using a stride of 5 cache lines (320 Bytes) wrapping around each 4kB region until all the lines are read (then moving to the next 4kB region). The prefetcher in the Opteron core only looks for strides of +/- one cache line and the prefetcher in the Opteron memory controller only looks for strides of up to +/- four cache lines, so neither is activated. More trickiness might be required to avoid prefetch on Intel processors.