In a recent post (link) I showed that local memory latency is weakly dependent on processor and DRAM frequency in a single-socket Phenom II system. Here are some preliminary results on memory bandwidth as a function of CPU frequency and DRAM frequency in the same system.
The tables below do not include the lowest CPU frequency (0.8 GHz) or the lowest DRAM frequency (DDR3/800), since those results tend to be relatively poor and I prefer to look at big numbers. 🙂
Single-Thread STREAM Triad Bandwidth (MB/s)
| Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066 |
|---|---|---|---|
| 3.2 GHz | 10,101 | 9,813 | 9,081 |
| 2.5 GHz | 10,077 | 9,784 | 9,053 |
| 2.1 GHz | 10,075 | 9,783 | 9,051 |
Two-Thread STREAM Triad Bandwidth (MB/s)
| Processor Frequency | DDR3/1600 | DDR3/1333 | DDR3/1066 |
|---|---|---|---|
| 3.2 GHz | 13,683 | 12,851 | 11,490 |
| 2.5 GHz | 13,337 | 12,546 | 11,252 |
| 2.1 GHz | 13,132 | 12,367 | 11,112 |
These results show that sustainable memory bandwidth, as measured by the STREAM benchmark, is very weakly dependent on CPU frequency but fairly strongly dependent on DRAM frequency. A single thread is clearly not enough to drive the memory system to its maximum performance, but the modest speedup from one thread to two suggests that adding more threads will not help much. More details to follow, along with some attempts to optimize the STREAM benchmark for better sustained performance.
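For reference, the Triad kernel at the heart of these measurements looks like the following minimal sketch (simplified from the real benchmark: a single trial, no validation, and the array size and scalar are hard-coded):

```c
#include <stdio.h>
#include <omp.h>

#define N      20000000L   /* 20M doubles per array, as in the runs above */
#define SCALAR 3.0

static double a[N], b[N], c[N];

int main(void)
{
    /* Initialize the arrays (this also places the pages under the
     * first-touch policy when running with multiple threads). */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        a[i] = 1.0; b[i] = 2.0; c[i] = 0.5;
    }

    double t0 = omp_get_wtime();

    /* Triad kernel: two loads and one store per iteration. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++)
        a[i] = b[i] + SCALAR * c[i];

    double t1 = omp_get_wtime();

    /* 3 arrays * 8 bytes per element are counted per iteration. */
    double mbytes = 3.0 * sizeof(double) * (double)N / 1.0e6;
    printf("Triad: %.0f MB/s (a[1] = %.1f)\n", mbytes / (t1 - t0), a[1]);
    return 0;
}
```

Compiled with something like `gcc -O3 -fopenmp`, the single-thread and two-thread cases correspond to running with `OMP_NUM_THREADS=1` and `OMP_NUM_THREADS=2`.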
Joshua Mora says
Hi John.
You may want to break things down in terms of reads and writes, and look at what happens if you do not use non-temporal stores or write-combining.
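(For illustration only, a minimal sketch of a Triad inner loop using SSE2 non-temporal stores; this is not the code used for the results above, the function name is a placeholder, and it assumes 16-byte-aligned arrays and an even element count.)

```c
#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_stream_pd, _mm_sfence */

/* Triad with non-temporal (streaming) stores to a[].
 * Without the streaming store, each cache line of a[] is first read
 * into the cache (read-for-ownership) and then written back, adding
 * roughly one third more DRAM traffic to the kernel. */
void triad_nt(double *restrict a, const double *restrict b,
              const double *restrict c, long n, double scalar)
{
    __m128d s = _mm_set1_pd(scalar);
    for (long i = 0; i < n; i += 2) {
        __m128d bv = _mm_load_pd(&b[i]);
        __m128d cv = _mm_load_pd(&c[i]);
        _mm_stream_pd(&a[i], _mm_add_pd(bv, _mm_mul_pd(s, cv)));
    }
    _mm_sfence();   /* make the streaming stores globally visible */
}
```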
I have also observed that for large core counts the size of the array needs to be increased. 20M was good for a few cores, but on a 24- or 48-core system you need an array size > 50M to get a good score.
It is also advisable to disable power management in order to get consistent/repeatable good results.
There is also a very interesting variant of this test: the remote STREAM test, where you pin the memory to a given NUMA node but bind the threads to another NUMA node. Then you also stress the coherent HyperTransport (in AMD systems) or QPI (in Intel systems). The same applies to latency. You can also run STREAM with a few threads running on one NUMA node but the allocations bound to different NUMA nodes, so you stress the processor's ability to handle local and remote traffic at the same time, which is more realistic for OpenMP applications with data that is truly shared among the NUMA nodes. Performance shows a similar behavior (congestion) when you add threads that access remote locations.
Another “interesting” test is to run STREAM using all the cores of all NUMA nodes in interleaved mode, where pages are striped across all the NUMA nodes, so all cores access all the NUMA nodes equally. Again, it stresses both local and remote traffic, and the challenge is to correlate the results with the system frequencies (e.g. CPU, memory, HT, QPI).
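(A minimal sketch of this kind of placement control using libnuma; node numbers 0 and 1 are just for illustration, and only the allocation/binding calls are shown, with the Triad kernel itself omitted.)

```c
#include <numa.h>    /* link with -lnuma */
#include <stdio.h>

#define N 50000000L   /* >50M elements, per the comment above */

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    /* "Remote STREAM": run the threads on node 0 ... */
    numa_run_on_node(0);

    /* ... but place the arrays on node 1, so every access crosses
     * the coherent HyperTransport (or QPI) link. */
    double *a = numa_alloc_onnode(N * sizeof(double), 1);
    double *b = numa_alloc_onnode(N * sizeof(double), 1);
    double *c = numa_alloc_onnode(N * sizeof(double), 1);

    /* Alternatively, stripe the pages across all nodes (interleaved mode):
     *   double *a = numa_alloc_interleaved(N * sizeof(double));          */

    /* ... run the Triad kernel on a, b, c here ... */

    numa_free(a, N * sizeof(double));
    numa_free(b, N * sizeof(double));
    numa_free(c, N * sizeof(double));
    return 0;
}
```

The same placement can also be arranged from the command line, e.g. `numactl --cpunodebind=0 --membind=1 ./stream` or `numactl --interleave=all ./stream`.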
You may also want to add the efficiency with respect to the theoretical peak, so people can see that 1333 MHz may not give a proportional increase in memory throughput (i.e. bandwidth, latency, GUP/s) over 1066 MHz. This is similar to HPL efficiency. For instance, at low core counts the efficiency is better correlated with memory frequency, but as cores are added memory frequency won't be the limiting factor, hence the need to assess the efficiency when using all the cores of the NUMA node.
John McCalpin says
Hi Joshua! Good comments — I will be addressing these details soon.
Pete Stevenson says
Hi John,
Could you please show a comparison of attained bandwidth to peak bandwidth for this result? Or, just clarify the system configuration, i.e., how many DDR channels were in use?
Thank you,
Pete Stevenson
John D. McCalpin, Ph.D. says
The system under test was a single-socket AMD Phenom II with two DDR3 DRAM channels, each configured with one DDR3/1600 DIMM.
So the peak DRAM bandwidth for DDR3/1066 is 17.067 GB/s, for DDR3/1333 it is 21.33 GB/s, and for DDR3/1600 it is 25.6 GB/s.
Delivered bandwidth does not approach peak bandwidth in any of these cases for a variety of reasons that I hope to discuss in future posts.
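For concreteness, here is a small (illustrative, not from the original post) computation of those peak numbers and the corresponding efficiencies for the two-thread Triad results in the table above:

```c
#include <stdio.h>

int main(void)
{
    const char  *name[]     = {"DDR3/1066", "DDR3/1333", "DDR3/1600"};
    const double mtps[]     = {1066.67, 1333.33, 1600.00};  /* mega-transfers/s */
    const double measured[] = {11490.0, 12851.0, 13683.0};  /* two-thread Triad, MB/s */

    for (int i = 0; i < 3; i++) {
        /* 2 channels * 8 bytes per transfer per channel */
        double peak = 2.0 * mtps[i] * 8.0;                   /* MB/s (decimal) */
        printf("%s: peak %6.0f MB/s, measured %6.0f MB/s, efficiency %3.0f%%\n",
               name[i], peak, measured[i], 100.0 * measured[i] / peak);
    }
    return 0;
}
```

This works out to roughly 67% of peak at DDR3/1066, 60% at DDR3/1333, and 53% at DDR3/1600 for the two-thread case.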