John McCalpin's blog

Dr. Bandwidth explains all….

Archive for October, 2010

Opteron/PhenomII STREAM Bandwidth vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In a recent post (link) I showed that local memory latency is weakly dependent on processor and DRAM frequency in a single-socket Phenom II system. Here are some preliminary results on memory bandwidth as a function of CPU frequency and DRAM frequency in the same system.

This table does not include the lowest CPU frequency (0.8 GHz) or the lowest DRAM frequency (DDR3/800) since these results tend to be relatively poor and I prefer to look at big numbers. 🙂

Single-Thread STREAM Triad Bandwidth (MB/s)

Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066
3.2 GHz             | 10,101    |  9,813    |  9,081
2.5 GHz             | 10,077    |  9,784    |  9,053
2.1 GHz             | 10,075    |  9,783    |  9,051

Two-Thread STREAM Triad Bandwidth (MB/s)

Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066
3.2 GHz             | 13,683    | 12,851    | 11,490
2.5 GHz             | 13,337    | 12,546    | 11,252
2.1 GHz             | 13,132    | 12,367    | 11,112

These results show that the sustainable memory bandwidth measured by the STREAM benchmark is only weakly dependent on CPU frequency, but moderately strongly dependent on DRAM frequency. A single thread is clearly not enough to drive the memory system to maximum performance, but the modest speedup from one thread to two suggests that adding more threads will not help very much. More details to follow, along with some attempts to optimize the STREAM benchmark for better sustained performance.
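For readers who want to reproduce this kind of measurement, here is a minimal Triad-style sketch, not the official STREAM source code (which is available from the STREAM web site). The array size, repetition count, and timing approach here are arbitrary simplifications. Compiled with something like gcc -O2 -fopenmp, running it with OMP_NUM_THREADS=1 and then OMP_NUM_THREADS=2 mimics the single-thread and two-thread cases above.

```c
/* Minimal Triad-style bandwidth sketch (an illustration, not the official
 * STREAM benchmark): times a[i] = b[i] + q*c[i] over arrays much larger
 * than the caches and reports the best rate over NTIMES repetitions. */
#include <stdio.h>
#include <omp.h>

#define N      20000000          /* 20M doubles per array, ~480 MB total */
#define NTIMES 10

static double a[N], b[N], c[N];

int main(void)
{
    const double q = 3.0;
    double best = 1.0e30;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (int k = 0; k < NTIMES; k++) {
        double t0 = omp_get_wtime();
#pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + q * c[i];
        double t = omp_get_wtime() - t0;
        if (t < best) best = t;
    }

    /* Triad moves three 8-byte elements per iteration: 2 reads + 1 write. */
    double mbytes = 3.0 * 8.0 * (double)N / 1.0e6;
    printf("Triad best rate: %.0f MB/s with %d thread(s) (a[1] = %.1f)\n",
           mbytes / best, omp_get_max_threads(), a[1]);
    return 0;
}
```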

Posted in Computer Hardware, Reference | 4 Comments »

Opteron Memory Latency vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In an earlier post (link), I made the off-hand comment that local memory latency was only weakly dependent on CPU and DRAM frequency in Opteron systems. Being of the scientific mindset, I went back and gathered a clean set of data to show the extent to which this assertion is correct.

The following table presents my most reliable & accurate measurements of open page local memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller. The measurements used a pointer-chasing stride of 320 bytes, modulo 32768 bytes. This scheme makes five passes through each 32 kB region of memory, loading every cache line, without activating the core prefetcher (which requires a stride of no more than one cache line) or the DRAM prefetcher (which requires a stride of no more than four cache lines on the system under test). Large pages were used to ensure that each 32 kB region was mapped to contiguous DRAM locations, maximizing DRAM page hits.
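The sketch below shows the kind of pointer-chasing loop described above. It is not the code used for these measurements: large-page allocation (and the DRAM page-hit behavior it provides) is omitted, the buffer size is fixed at 256 MiB, and the timer is an ordinary clock_gettime() call, so absolute numbers will differ from the table below. It does, however, use the same access pattern: within each 32 kB block the chain advances 320 bytes modulo 32768, so every 64-byte line is loaded exactly once in an order that defeats the core and DRAM prefetchers.

```c
/* Pointer-chasing latency sketch (an illustration, not the code used for
 * the measurements in this post).  Builds a chain with a 320-byte stride
 * modulo 32 KiB inside a 256 MiB buffer and reports average load-to-use
 * latency.  Compile with optimization, e.g. gcc -O2 (add -lrt on older
 * glibc for clock_gettime). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLOCK   32768            /* 32 KiB region */
#define STRIDE  320              /* hits every 64-byte line after 512 steps */
#define NBLOCKS 8192             /* 8192 * 32 KiB = 256 MiB total */

int main(void)
{
    size_t total = (size_t)NBLOCKS * BLOCK;
    char *buf = malloc(total);
    if (buf == NULL) return 1;

    /* Build the chain: each visited cache line stores the address of the
     * next line in the chase order. */
    char **prev = (char **)buf;
    for (size_t blk = 0; blk < NBLOCKS; blk++) {
        for (size_t k = (blk == 0) ? 1 : 0; k < BLOCK / 64; k++) {
            char *next = buf + blk * BLOCK + (k * STRIDE) % BLOCK;
            *prev = next;
            prev = (char **)next;
        }
    }
    *prev = buf;                 /* close the loop */

    /* Chase the chain once through the whole buffer. */
    size_t nloads = (size_t)NBLOCKS * (BLOCK / 64);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    char **p = (char **)buf;
    for (size_t i = 0; i < nloads; i++)
        p = (char **)*p;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1.0e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average latency: %.2f ns per load\n", ns / (double)nloads);
    return (p == NULL);          /* use p so the loop is not optimized away */
}
```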

The system used for these tests is a home-built system:

  • AMD Phenom II 555 Processor
    • Two cores
    • Supported CPU frequencies of 3.2, 2.5, 2.1, and 0.8 GHz
    • Silicon Revision C2 (equivalent to “Shanghai” Opteron server parts)
  • 4 GB of DDR3/1600 DRAM
    • Two 2GB unregistered, unbuffered DIMMs
    • Dual-rank DIMMs
    • Each rank is composed of eight 1 Gbit parts (128M x 8)

The array sizes used were 256 MB, 512 MB, and 1024 MB, with each test being run three times. Of these nine results, the value chosen for the table below was an “eye-ball” average of the 2nd-best through 5th-best results. (There was very little scatter in the results — typically less than 0.1 ns variation across the majority of the nine results for each configuration.)

Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066 | DDR3/800
3.2 GHz             | 51.58 ns  | 54.18 ns  | 57.46 ns  | 66.18 ns
2.5 GHz             | 52.77 ns  | 54.97 ns  | 58.30 ns  | 65.74 ns
2.1 GHz             | 53.94 ns  | 55.37 ns  | 58.72 ns  | 65.79 ns
0.8 GHz             | 82.86 ns  | 87.42 ns  | 94.96 ns  | 94.51 ns

Notes:

  1. The default frequencies for the system are 3.2 GHz for the CPU cores and 667 MHz for the DRAM (DDR3/1333)

Summary:
The results above show that the variations in (local) memory latency are relatively small for this single-socket system when the CPU frequency is varied between 2.1 and 3.2 GHz and the DRAM frequency is varied between 533 MHz (DDR3/1066) and 800 MHz (DDR3/1600). Relative to the default frequencies, the lowest latency (obtained by overclocking the DRAM by 20%) is almost 5% better. The highest latency (obtained by dropping the CPU core frequency by ~1/3 and the DRAM frequency by about 20%) is only ~8% higher than the default value.

Posted in Computer Hardware, Reference | 2 Comments »

AMD Opteron Local Memory Latency Chart

Posted by John D. McCalpin, Ph.D. on 1st October 2010

I need to use the whiteboard in my office for something else, so it is time to transcribe the table of Opteron system memory latencies from there to here.

The following table presents my most reliable & accurate measurements of open page memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller.

Processor Family/Revision               | 1-socket systems   | 2-socket systems | 4-socket systems
Opteron K8, Rev E/Rev F (2.4 – 3.0 GHz) | 60 ns              | 60 ns            | 95 ns
Family10h, Rev B3 (2.3 GHz)             | (not yet measured) | 85 ns            | 100 ns / 130 ns
Family10h, Rev C2 (2.9 GHz)             | 54 ns              | 74 ns            | (not yet measured)
Family10h, Rev D0 (2.6 GHz)             | (not yet measured) | 78 ns            | 54 ns

Notes:

  1. Results updated 2010-10-04 to provide separate 1-socket and 2-socket numbers.
  2. Memory latency is weakly dependent on CPU frequency, Northbridge frequency, and DRAM frequency in Opteron systems.
  3. Memory latency is controlled by the longer of the time required to get the data from DRAM and the time required to receive “snoop responses” from all the other chips in the system (a toy sketch of this model appears after these notes).
  4. Family10h Revision B3 Opterons have higher latency than the Family K8 Opterons for a variety of reasons:
    • The memory controller in K8 Opterons runs at the full CPU speed, while the memory controller in Family10h Opterons runs at a lower frequency.
    • The difference in frequencies between the CPU and memory controller in Family10h Opterons requires an asynchronous boundary between the two. This increases latency.
    • Family10h Opterons have a shared L3 cache that must be checked before sending load requests to memory. Since the L3 cache is physically larger than the L2 caches and is shared across all the cores on the chip, extra latency is incurred for requests that miss in the L3 and go to memory.
    • Family10h Opterons support the HyperTransport version 3 protocol (though Revisions B & C run in HyperTransport version 1 compatibility mode), which appears to add some latency to the probe responses.
  5. Family10h Revision B and C Opterons in 4-socket systems may have different latencies for different sockets, depending on the details of the HyperTransport interconnect. On the TACC “Ranger” system, the SunBlade x6420 nodes have two sockets that have direct connections to all the other sockets, and two sockets that are only connected to two of the other three sockets. The sockets with direct connections to all other sockets display a local memory latency of 100 ns, while the sockets that are not directly connected to all other sockets have to wait longer for some probe responses and show a latency of 130 ns.
  6. Family10h Revision D0 4-socket systems have decreased local memory latency because of the “HT Assist” feature, which uses 1 MB of the shared L3 cache to maintain a directory of lines that are potentially in a modified state in another processor’s cache. If a cache line is not listed in the directory, then the value in memory is current, and probes do not need to be broadcast to the other chips if all you want to do is read the data.
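As a toy illustration of note 3 (my own sketch, with made-up component times rather than anything measured), the observed local latency can be modeled as the longer of the DRAM access time and the slowest probe response:

```c
/* Toy model of note 3: local memory latency is the longer of the DRAM
 * access time and the slowest probe (snoop) response.  The component
 * times below are hypothetical placeholders, not measured values. */
#include <stdio.h>

static double local_latency_ns(double dram_ns, double slowest_probe_ns)
{
    return (dram_ns > slowest_probe_ns) ? dram_ns : slowest_probe_ns;
}

int main(void)
{
    /* e.g. a socket whose probes all take one HyperTransport hop versus
     * one that must wait for a two-hop probe response (as in note 5). */
    printf("DRAM-limited socket:  %.0f ns\n", local_latency_ns(100.0,  90.0));
    printf("probe-limited socket: %.0f ns\n", local_latency_ns(100.0, 130.0));
    return 0;
}
```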

If anyone has a working 4-socket Family10h Revision C2 (“Shanghai”) system, I would like to run this benchmark to fill in the last entry in the table!

Posted in Computer Hardware | 1 Comment »

Welcome to Dr. Bandwidth’s Blog

Posted by John D. McCalpin, Ph.D. on 1st October 2010

Welcome to the University of Texas blog of John D. McCalpin, PhD — aka Dr. Bandwidth.

This is the beginning of a series of posts on the exciting topic of memory bandwidth in computer systems! I hope these posts will serve as a reference for folks interested in the increasingly complex issues surrounding memory bandwidth — it seems silly to write all this material down only for my own use when I suspect that there is at least one person out there who may find it useful.

Although I currently work at the Texas Advanced Computing Center of the University of Texas at Austin, my professional career has been split about 50-50 between academia and the computer hardware industry. These blog posts will primarily deal with concrete information about real systems that (in my experience) does not fit well with the priorities of most academic publications.  Most journals or conferences require some level of “research” or “novelty”, while most of these notes are more explanatory in nature.   That is not to say that none of this will ever be published, but much of the material to be posted here is valuable for other reasons.

My interest in memory bandwidth extends back to the late 1980s, leading to the development of the STREAM benchmark, which I have maintained since 1991. Here is what I have to say about STREAM:



What is STREAM?

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
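For reference, the four kernels are, in essence, the simple loops below (the real benchmark adds timing, validation, and OpenMP support; the array size and scalar shown here are arbitrary, chosen only so the arrays are much larger than the caches):

```c
/* The four STREAM vector kernels in essence; a[], b[], c[] are large
 * double-precision arrays and 'scalar' is a constant. */
#include <stdio.h>

#define N 10000000
static double a[N], b[N], c[N];

int main(void)
{
    const double scalar = 3.0;
    size_t j;

    for (j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; }       /* initialize */

    for (j = 0; j < N; j++) c[j] = a[j];                      /* Copy  */
    for (j = 0; j < N; j++) b[j] = scalar * c[j];             /* Scale */
    for (j = 0; j < N; j++) c[j] = a[j] + b[j];               /* Add   */
    for (j = 0; j < N; j++) a[j] = b[j] + scalar * c[j];      /* Triad */

    printf("a[0] = %f\n", a[0]);  /* keep the loops from being eliminated */
    return 0;
}
```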


Why should I care?

Computer CPUs are getting faster much more quickly than computer memory systems. As this progresses, more and more programs will be limited in performance by the memory bandwidth of the system, rather than by the computational performance of the CPU.

As an extreme example, most current high-end machines run simple arithmetic kernels for out-of-cache operands at 1-2% of their rated peak speeds — that means that they are spending 98-99% of their time idle and waiting for cache misses to be satisfied.

The STREAM benchmark is specifically designed to work with datasets much larger than the available cache on any given system, so that the results are (presumably) more indicative of the performance of very large, vector style applications.


Posted in Reference | Comments Off on Welcome to Dr. Bandwidth’s Blog