John McCalpin's blog

Dr. Bandwidth explains all….


Opteron/PhenomII STREAM Bandwidth vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In a recent post (link) I showed that local memory latency is weakly dependent on processor and DRAM frequency in a single-socket Phenom II system. Here are some preliminary results on memory bandwidth as a function of CPU frequency and DRAM frequency in the same system.

This table does not include the lowest CPU frequency (0.8 GHz) or the lowest DRAM frequency (DDR3/800) since these results tend to be relatively poor and I prefer to look at big numbers. 🙂

Single-Thread STREAM Triad Bandwidth (MB/s)

Processor Frequency    DDR3/1600     DDR3/1333     DDR3/1066
3.2 GHz                10,101 MB/s    9,813 MB/s    9,081 MB/s
2.5 GHz                10,077 MB/s    9,784 MB/s    9,053 MB/s
2.1 GHz                10,075 MB/s    9,783 MB/s    9,051 MB/s

Two-Thread STREAM Triad Bandwidth (MB/s)

Processor Frequency    DDR3/1600     DDR3/1333     DDR3/1066
3.2 GHz                13,683 MB/s   12,851 MB/s   11,490 MB/s
2.5 GHz                13,337 MB/s   12,546 MB/s   11,252 MB/s
2.1 GHz                13,132 MB/s   12,367 MB/s   11,112 MB/s

These results show that the sustainable memory bandwidth measured by the STREAM benchmark is very weakly dependent on CPU frequency, but moderately strongly dependent on DRAM frequency. A single thread is clearly not enough to drive the memory system to maximum performance, and the modest speedup from one thread to two suggests that additional threads will not help much. More details to follow, along with some attempts to optimize the STREAM benchmark for better sustained performance.
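For reference, the Triad kernel at the heart of these numbers is just a[j] = b[j] + scalar*c[j]. Below is a minimal OpenMP sketch of the measurement, not the official benchmark (the real STREAM code runs each kernel many times, verifies the results, and reports the best trial); the array size and timing harness here are simplified.

    /* Minimal sketch of a STREAM-Triad-style bandwidth measurement.
     * Run with OMP_NUM_THREADS=1 or 2 to reproduce the comparison above.
     * Compile (e.g.): gcc -O3 -fopenmp triad_sketch.c */
    #include <stdio.h>
    #include <omp.h>

    #define N 20000000   /* each array is 160 MB: much larger than the caches */

    static double a[N], b[N], c[N];

    int main(void)
    {
        const double scalar = 3.0;

        /* initialize in parallel so the OS maps the pages before timing */
        #pragma omp parallel for
        for (long j = 0; j < N; j++) { a[j] = 1.0; b[j] = 2.0; c[j] = 0.5; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long j = 0; j < N; j++)
            a[j] = b[j] + scalar * c[j];
        double t1 = omp_get_wtime();

        /* Triad touches three 8-Byte elements per iteration: 2 reads + 1 write */
        double mb = 3.0 * sizeof(double) * (double)N / 1.0e6;
        printf("Triad: %.0f MB/s\n", mb / (t1 - t0));
        return 0;
    }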


Opteron Memory Latency vs CPU and DRAM Frequency

Posted by John D. McCalpin, Ph.D. on 7th October 2010

In an earlier post (link), I made the off-hand comment that local memory latency was only weakly dependent on CPU and DRAM frequency in Opteron systems. Being of the scientific mindset, I went back and gathered a clean set of data to show the extent to which this assertion is correct.

The following table presents my most reliable & accurate measurements of open page local memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller. The measurements used a pointer-chasing stride of 320 Bytes, modulo 32768 Bytes. This scheme makes five passes through each 32 kB region of memory, loading every cache line, but without activating the core prefetcher (which requires the stride to be no more than one cache line) or the DRAM prefetcher (which requires the stride to be no more than four cache lines on the system under test). Large pages were used to ensure that each 32 kB region was mapped to contiguous DRAM locations, maximizing DRAM page hits.
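For concreteness, here is a minimal sketch of this kind of pointer-chasing measurement, assuming 64-Byte cache lines and a 256 MB array. It is not the actual benchmark code: among other simplifications, it does not use large pages, does not pin the thread, and uses an ordinary wall-clock timer.

    /* Minimal sketch of a pointer-chasing latency test with a stride of
     * 320 Bytes modulo 32768 Bytes, as described above.  Simplifications
     * (not in the real code): no large pages, no thread pinning, and a
     * coarse gettimeofday() timer. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    #define LINE   64            /* cache line size in Bytes */
    #define STRIDE 320           /* 5 lines: defeats core & DRAM prefetchers */
    #define BLOCK  32768         /* pattern wraps modulo 32 kB */
    #define ARRAY  (256UL << 20) /* 256 MB total */

    int main(void)
    {
        char *buf = malloc(ARRAY);
        if (buf == NULL) return 1;

        /* Build the chain: within each 32 kB block, step by 320 Bytes
         * modulo 32768.  Since gcd(320, 32768) = 64, every 64-Byte line
         * in the block is visited exactly once ("five passes") before
         * the chain moves on to the next block. */
        size_t nblocks = ARRAY / BLOCK;
        size_t naccesses = 0;
        void **prev = NULL;
        for (size_t blk = 0; blk < nblocks; blk++) {
            size_t off = 0;
            for (size_t i = 0; i < BLOCK / LINE; i++) {
                if (prev != NULL)
                    *prev = (void *)(buf + blk * BLOCK + off);
                prev = (void **)(buf + blk * BLOCK + off);
                off = (off + STRIDE) % BLOCK;
                naccesses++;
            }
        }
        *prev = (void *)buf;     /* close the loop */

        /* Chase the chain once; every load depends on the previous one,
         * so the loop exposes latency rather than bandwidth. */
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        void **p = (void **)buf;
        for (size_t i = 0; i < naccesses; i++)
            p = (void **)*p;
        gettimeofday(&t1, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1.0e9
                  + (t1.tv_usec - t0.tv_usec) * 1.0e3;
        /* print p so the compiler cannot discard the chase loop */
        printf("average load latency: %.2f ns (%p)\n", ns / naccesses, (void *)p);
        return 0;
    }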

The system used for these tests is a home-built system:

  • AMD Phenom II 555 Processor
    • Two cores
    • Supported CPU Frequencies of 3.2, 2.5, 2.1, and 0.8 GHz
    • Silicon Revision C2 (equivalent to “Shanghai” Opteron server parts)
  • 4 GB of DDR3/1600 DRAM
    • Two 2GB unregistered, unbuffered DIMMs
    • Dual-rank DIMMs
    • Each rank is composed of eight 1 Gbit parts (128M x 8)
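
(Capacity check: eight 1 Gbit parts = 1 GB per rank; two ranks = 2 GB per DIMM; two DIMMs = 4 GB total.)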

The array sizes used were 256 MB, 512 MB, and 1024 MB, with each test being run three times. Of these nine results, the value chosen for the table below was an “eye-ball” average of the 2nd-best through 5th-best results. (There was very little scatter in the results — typically less than 0.1 ns variation across the majority of the nine results for each configuration.)
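The selection rule described above is easy to automate. Here is a small sketch (with made-up placeholder numbers, not the actual measurements) that sorts the nine results for one configuration and averages the 2nd- through 5th-best:

    /* Sketch of the data reduction described above: sort the nine results
     * (3 array sizes x 3 runs) and average the 2nd- through 5th-best.
     * The values below are hypothetical placeholders. */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_double(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    int main(void)
    {
        double r[9] = { 51.62, 51.58, 51.55, 51.60, 51.57,
                        51.59, 51.61, 51.56, 51.58 };
        qsort(r, 9, sizeof(double), cmp_double);
        double sum = 0.0;
        for (int i = 1; i <= 4; i++)   /* indices 1..4 = 2nd- through 5th-best */
            sum += r[i];
        printf("reported latency: %.2f ns\n", sum / 4.0);
        return 0;
    }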

Processor Frequency    DDR3/1600    DDR3/1333    DDR3/1066    DDR3/800
3.2 GHz                51.58 ns     54.18 ns     57.46 ns     66.18 ns
2.5 GHz                52.77 ns     54.97 ns     58.30 ns     65.74 ns
2.1 GHz                53.94 ns     55.37 ns     58.72 ns     65.79 ns
0.8 GHz                82.86 ns     87.42 ns     94.96 ns     94.51 ns

Notes:

  1. The default frequencies for the system are 3.2 GHz for the CPU cores and 667 MHz for the DRAM (DDR3/1333).

Summary:
The results above show that variations in (local) memory latency are relatively small for this single-socket system when the CPU frequency is varied between 2.1 and 3.2 GHz and the DRAM frequency is varied between 533 MHz (DDR3/1066) and 800 MHz (DDR3/1600). Relative to the default configuration (3.2 GHz, DDR3/1333, 54.18 ns), the lowest latency (obtained by overclocking the DRAM by 20%) is almost 5% lower: (54.18 - 51.58)/54.18 is about 4.8%. The highest latency in that range (obtained by dropping the CPU core frequency by about 1/3 and the DRAM frequency by about 20%) is only ~8% higher than the default value.


AMD Opteron Local Memory Latency Chart

Posted by John D. McCalpin, Ph.D. on 1st October 2010

I need to use the whiteboard in my office for something else, so it is time to transcribe the table of Opteron system memory latencies from there to here.

The following table presents my most reliable & accurate measurements of open page memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the Memory Controller.

Processor Family/Revision                  1-socket systems     2-socket systems     4-socket systems
Opteron K8, Rev E/Rev F (2.4 – 3.0 GHz)    60 ns                60 ns                95 ns
Family10h, Rev B3 (2.3 GHz)                (not yet measured)   85 ns                100 ns / 130 ns
Family10h, Rev C2 (2.9 GHz)                54 ns                74 ns                (not yet measured)
Family10h, Rev D0 (2.6 GHz)                (not yet measured)   78 ns                54 ns

Notes:

  1. Results updated 2010-10-04 to provide separate 1-socket and 2-socket numbers!
  2. Memory latency is weakly dependent on CPU frequency, Northbridge frequency, and DRAM frequency in Opteron systems.
  3. Memory latency is controlled by the longer of the time required to get the data from DRAM and the time required to receive “snoop responses” from all the other chips in the system (a toy model of this appears after these notes).
  4. Family10h Revision B3 Opterons have higher latency than the Family K8 Opterons for a variety of reasons:
    • The memory controller in K8 Opterons runs at the full CPU speed, while the memory controller in Family10h Opterons runs at a lower frequency.
    • The difference in frequencies between the CPU and memory controller in Family10h Opterons requires an asynchronous boundary between the two. This increases latency.
    • Family10h Opterons have a shared L3 cache that must be checked before sending load requests to memory. Since the L3 cache is physically larger than the L2 caches and is shared across all the cores on the chip, extra latency is incurred for requests that miss in the L3 and go to memory.
    • Family10h Opterons support the HyperTransport version 3 protocol (though Revisions B & C run in HyperTransport version 1 compatibility mode), which appears to add some latency to the probe responses.
  5. Family10h Revision B and C Opterons in 4-socket systems may have different latencies for different sockets, depending on the details of the HyperTransport interconnect. On the TACC “Ranger” system, the SunBlade x6420 nodes have two sockets that have direct connections to all the other sockets, and two sockets that are only connected to two of the other three sockets. The sockets with direct connections to all other sockets display a local memory latency of 100 ns, while the sockets that are not directly connected to all other sockets have to wait longer for some probe responses and show a latency of 130 ns.
  6. Family10h Revision D0 4-socket systems have decreased local memory latency because of the “HT Assist” feature, which uses 1 MB of the shared L3 cache to maintain a directory of lines that are potentially in a modified state in another processor’s cache. If the cache line is not listed in the directory, then the value in memory is current, and probes do not need to be broadcast to the other chips if all you want to do is read the data.
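
As a toy illustration of note 3, the following sketch models local latency as the larger of the DRAM access time and the slowest probe response. All numbers are hypothetical placeholders, not measurements; the point is only that adding one extra HyperTransport hop to a single probe raises the observed latency, as on the indirectly connected Ranger sockets in note 5.

    /* Toy model of note 3: local memory latency is (roughly) the longer of
     * the DRAM access time and the slowest snoop ("probe") response.
     * All timings here are hypothetical, not measurements. */
    #include <stdio.h>

    static double local_latency(double t_dram, const double *t_probe, int n)
    {
        double worst = t_dram;
        for (int i = 0; i < n; i++)
            if (t_probe[i] > worst)
                worst = t_probe[i];
        return worst;
    }

    int main(void)
    {
        /* hypothetical probe-response times (ns) from the 3 other sockets */
        double direct[]   = { 80.0, 80.0,  80.0 };  /* 1 HT hop to every socket */
        double indirect[] = { 80.0, 80.0, 120.0 };  /* one probe needs 2 HT hops */
        printf("all sockets 1 hop away: %.0f ns\n", local_latency(60.0, direct, 3));
        printf("one socket 2 hops away: %.0f ns\n", local_latency(60.0, indirect, 3));
        return 0;
    }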

If anyone has a working 4-socket Family10h Revision C2 (“Shanghai”) system, I would like to run this benchmark to fill in the last entry in the table!
