John McCalpin's blog

Dr. Bandwidth explains all….

AMD Opteron Local Memory Latency Chart

Posted by John D. McCalpin, Ph.D. on 1st October 2010

I need to use the whiteboard in my office for something else, so it is time to transcribe the table of Opteron system memory latencies from there to here.

The following table presents my most reliable and accurate measurements of open-page memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the memory controller.
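The benchmark source is not reproduced in this post, so the following is only a minimal sketch of the general pointer-chasing idea, not the code used for the table. It assumes a randomly ordered single-cycle chain (built with Sattolo's algorithm) as one common way to defeat stride prefetchers; the names `build_chain`, `chase`, and `time_per_load_ns` are hypothetical. A fully random order does not by itself keep accesses inside open DRAM pages, and in Python the interpreter overhead dominates, so the timing shows the method rather than real DRAM latency.

```python
import random
import time


def build_chain(n, seed=0):
    """Return a list nxt where following i -> nxt[i] visits all n slots in one cycle.

    Sattolo's algorithm yields a permutation made of a single cycle, so the
    chase cannot fall into a short loop that fits in cache.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Perform `steps` dependent loads: each index comes from the previous load."""
    p = start
    for _ in range(steps):
        p = nxt[p]
    return p


def time_per_load_ns(nxt, steps):
    """Average wall-clock time per dependent load, in nanoseconds."""
    t0 = time.perf_counter()
    chase(nxt, steps)
    return (time.perf_counter() - t0) * 1e9 / steps


if __name__ == "__main__":
    chain = build_chain(1 << 20)
    print(f"{time_per_load_ns(chain, 1 << 20):.1f} ns per load (includes interpreter overhead)")
```

Because every load address depends on the result of the previous load, the loads cannot overlap, which is what turns total time divided by step count into a latency figure.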

| Processor Family/Revision | 1-socket systems | 2-socket systems | 4-socket systems |
|---|---|---|---|
| Opteron K8, RevE/RevF (2.4 – 3.0 GHz) | 60 ns | 60 ns | 95 ns |
| Family10h, Rev B3 (2.3 GHz) | (not yet measured) | 85 ns | 100 ns / 130 ns |
| Family10h, Rev C2 (2.9 GHz) | 54 ns | 74 ns | (not yet measured) |
| Family10h, Rev D0 (2.6 GHz) | (not yet measured) | 78 ns | 54 ns |

Notes:

  1. Results updated 2010-10-04 to provide separate 1-socket and 2-socket numbers.
  2. Memory latency is weakly dependent on CPU frequency, Northbridge frequency, and DRAM frequency in Opteron systems.
  3. Memory latency is controlled by the longer of the time required to get the data from DRAM and the time required to receive “snoop responses” from all the other chips in the system.
  4. Family10h Revision B3 Opterons have higher latency than the Family K8 Opterons for a variety of reasons:
    • The memory controller in K8 Opterons runs at the full CPU speed, while the memory controller in Family10h Opterons runs at a lower frequency.
    • The difference in frequencies between the CPU and memory controller in Family10h Opterons requires an asynchronous boundary between the two. This increases latency.
    • Family10h Opterons have a shared L3 cache that must be checked before sending load requests to memory. Since the L3 cache is physically larger than the L2 caches and is shared across all the cores on the chip, extra latency is incurred for requests that miss in the L3 and go to memory.
    • Family10h Opterons support the HyperTransport version 3 protocol (though Revisions B & C run in HyperTransport version 1 compatibility mode), which appears to add some latency to the probe responses.
  5. Family10h Revision B and C Opterons in 4-socket systems may have different latencies for different sockets, depending on the details of the HyperTransport interconnect. On the TACC “Ranger” system, the SunBlade x6420 nodes have two sockets that have direct connections to all the other sockets, and two sockets that are only connected to two of the other three sockets. The sockets with direct connections to all other sockets display a local memory latency of 100 ns, while the sockets that are not directly connected to all other sockets have to wait longer for some probe responses and show a latency of 130 ns.
  6. Family10h Revision D0 4-socket systems have decreased local memory latency because of the “HT Assist” feature, which uses 1 MB of the shared L3 to maintain a directory of lines that are potentially in a modified state in another processor’s cache. If the cache line is not listed in the directory, then the value in memory is current and probes do not need to be broadcast to other chips if all you want to do is read the data.
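Notes 3 and 6 can be summarized as a simple rule of thumb. The sketch below is only an illustration of that rule: the function name `local_read_latency_ns` is hypothetical, and the input numbers are made-up placeholders, not a decomposition of the measurements in the table.

```python
def local_read_latency_ns(dram_ns, probe_response_ns, probes_required=True):
    """Approximate local read latency under the rule in notes 3 and 6.

    dram_ns            time to get the data from local DRAM
    probe_response_ns  list of times to receive each remote chip's probe response
    probes_required    False when a directory (as with HT Assist) shows that no
                       other chip can hold a modified copy of the line
    """
    if not probes_required or not probe_response_ns:
        # No probes to wait for: the DRAM access alone sets the latency.
        return dram_ns
    # Otherwise the load completes only when both the data and the slowest
    # probe response have arrived.
    return max(dram_ns, max(probe_response_ns))


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the rule.
    print(local_read_latency_ns(60, [50, 55, 90]))                         # 90
    print(local_read_latency_ns(60, [50, 55, 90], probes_required=False))  # 60
```

In this simplified picture, a socket that must wait on a slow multi-hop probe response sees a higher latency than one with direct links, and a directory hit that skips the broadcast removes the probe term entirely.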

If anyone has a working 4-socket Family10h Revision C2 (“Shanghai”) system, I would like to run this benchmark to fill in the last entry in the table!

Posted in Computer Hardware

Welcome to Dr. Bandwidth’s Blog

Posted by John D. McCalpin, Ph.D. on 1st October 2010

Welcome to the University of Texas blog of John D. McCalpin, PhD, aka Dr. Bandwidth.

This is the beginning of a series of posts on the exciting topic of memory bandwidth in computer systems! I hope these posts will serve as a reference for folks interested in the increasingly complex issues regarding memory bandwidth. It seems silly for me to write all this stuff down for my own use when I suspect that there is at least one person out there who may find this information useful.

Although I currently work at the Texas Advanced Computing Center of the University of Texas at Austin, my professional career has been split about 50-50 between academia and the computer hardware industry. These blog posts will primarily deal with concrete information about real systems that (in my experience) does not fit well with the priorities of most academic publications.  Most journals or conferences require some level of “research” or “novelty”, while most of these notes are more explanatory in nature.   That is not to say that none of this will ever be published, but much of the material to be posted here is valuable for other reasons.

My interest in memory bandwidth extends back to the late 1980s, leading to the development of the STREAM Benchmark, which I have maintained since 1991. Here is what I have to say about STREAM:



What is STREAM?

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
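To make "simple vector kernels" concrete, here is a minimal sketch of one of them, Triad (`a[i] = b[i] + q*c[i]`), together with the bandwidth arithmetic. It is not the STREAM source: the function names are hypothetical, and a pure-Python loop measures interpreter speed rather than memory bandwidth. The byte count follows the STREAM convention of counting each 8-byte word read or written once per iteration and reporting MB/s with 1 MB = 10^6 bytes.

```python
import time
from array import array

BYTES_PER_WORD = 8  # double precision


def triad(a, b, c, q):
    """Triad kernel: two reads and one write of an 8-byte word per iteration."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_mb_per_s(n, q=3.0):
    """Time one pass of Triad over n elements; return (MB/s, result array)."""
    a = array("d", [0.0]) * n
    b = array("d", [2.0]) * n
    c = array("d", [1.0]) * n
    t0 = time.perf_counter()
    triad(a, b, c, q)
    elapsed = time.perf_counter() - t0
    bytes_moved = 3 * BYTES_PER_WORD * n  # read b, read c, write a
    return bytes_moved / elapsed / 1e6, a


if __name__ == "__main__":
    rate, _ = triad_mb_per_s(1_000_000)
    print(f"Triad: {rate:.1f} MB/s (dominated by interpreter overhead)")
```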


Why should I care?

Computer CPUs are getting faster much more quickly than computer memory systems. As this progresses, more and more programs will be limited in performance by the memory bandwidth of the system, rather than by the computational performance of the CPU.

As an extreme example, most current high-end machines run simple arithmetic kernels for out-of-cache operands at 1-2% of their rated peak speeds — that means that they are spending 98-99% of their time idle and waiting for cache misses to be satisfied.
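The arithmetic behind a figure like 1-2% can be sketched as follows. The function name and the machine numbers are hypothetical, chosen only for illustration; the model assumes a Triad-style kernel that moves 24 bytes and performs 2 floating-point operations per iteration, and that memory bandwidth is the only limit.

```python
def fraction_of_peak(peak_gflops, bandwidth_gb_s, bytes_per_iter=24, flops_per_iter=2):
    """Fraction of peak arithmetic rate a bandwidth-bound kernel can sustain.

    With operands streaming from memory, the iteration rate is limited to
    bandwidth / bytes_per_iter, so the sustained arithmetic rate is
    bandwidth * flops_per_iter / bytes_per_iter.
    """
    sustained_gflops = bandwidth_gb_s * flops_per_iter / bytes_per_iter
    return min(1.0, sustained_gflops / peak_gflops)


if __name__ == "__main__":
    # Hypothetical machine: 10 GFLOP/s peak, 2.4 GB/s sustainable bandwidth.
    print(f"{fraction_of_peak(10.0, 2.4):.1%} of peak")
```

For this made-up machine the kernel sustains about 0.2 GFLOP/s, roughly 2% of peak, with the rest of the time spent waiting on memory.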

The STREAM benchmark is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are (presumably) more indicative of the performance of very large, vector-style applications.


Posted in Reference