Single Thread, Read Only Results Comparison Across Systems
In Part1, Part2, Part3, and Part4, I reviewed performance issues for a single-thread program executing a long vector sum-reduction — a single-array read-only computational kernel — on a 2-socket system with a pair of AMD Family10h Opteron Revision C2 (“Shanghai”) quad-core processors. In today’s post, I will present the results for the same set of 15 implementations run on four additional systems.
Test Systems
- 2-socket AMD Family10h Opteron Revision C2 (“Shanghai”), 2.9 GHz quad-core, dual-channel DDR2/800 per socket. (This is the reference system.)
- 2-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
- 4-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
- 4-socket AMD Family10h Opteron 6174, Revision E0 (“Magny-Cours”), 2.2 GHz twelve-core, four-channel DDR3/1333 per socket.
- 1-socket AMD PhenomII 555, Revision C2, 3.2 GHz dual-core, dual-channel DDR3/1333
All systems were running TACC’s customized Linux kernel, except for the PhenomII, which was running Fedora 13. The same set of binaries, generated by the Intel version 11.1 C compiler, was used in all cases.
The source code, scripts, and results are all available in a tar file: ReadOnly_2010-11-12.tar.bz2
Results
Code Version | Notes | Vector SSE | Large Page | SW Prefetch | 4 KiB pages accessed | Ref System (2p Shanghai) | 2-socket Istanbul | 4-socket Istanbul | 4-socket Magny-Cours | 1-socket PhenomII |
---|---|---|---|---|---|---|---|---|---|---|
Version001 | “-O1” | – | – | – | 1 | 3.401 GB/s | 3.167 GB/s | 4.311 GB/s | 3.734 GB/s | 4.586 GB/s |
Version002 | “-O2” | – | – | – | 1 | 4.122 GB/s | 4.035 GB/s | 5.719 GB/s | 5.120 GB/s | 5.688 GB/s |
Version003 | 8 partial sums | – | – | – | 1 | 4.512 GB/s | 4.373 GB/s | 5.946 GB/s | 5.476 GB/s | 6.207 GB/s |
Version004 | add SW prefetch | – | – | Y | 1 | 6.083 GB/s | 5.732 GB/s | 6.489 GB/s | 6.389 GB/s | 7.571 GB/s |
Version005 | add vector SSE | Y | – | Y | 1 | 6.091 GB/s | 5.765 GB/s | 6.600 GB/s | 6.398 GB/s | 7.580 GB/s |
Version006 | remove prefetch | Y | – | – | 1 | 5.247 GB/s | 5.159 GB/s | 6.787 GB/s | 6.403 GB/s | 6.976 GB/s |
Version007 | add large pages | Y | Y | – | 1 | 5.392 GB/s | 5.234 GB/s | 7.149 GB/s | 6.653 GB/s | 7.117 GB/s |
Version008 | split into triply-nested loop | Y | Y | – | 1 | 4.918 GB/s | 4.914 GB/s | 6.661 GB/s | 6.180 GB/s | 6.616 GB/s |
Version009 | add SW prefetch | Y | Y | Y | 1 | 6.173 GB/s | 5.901 GB/s | 6.646 GB/s | 6.568 GB/s | 7.736 GB/s |
Version010 | multiple pages/loop | Y | Y | Y | 2 | 6.417 GB/s | 6.174 GB/s | 7.569 GB/s | 6.895 GB/s | 7.913 GB/s |
Version011 | multiple pages/loop | Y | Y | Y | 4 | 7.063 GB/s | 6.804 GB/s | 8.319 GB/s | 7.245 GB/s | 8.583 GB/s |
Version012 | multiple pages/loop | Y | Y | Y | 8 | 7.260 GB/s | 6.960 GB/s | 8.378 GB/s | 7.205 GB/s | 8.642 GB/s |
Version013 | Version010 minus SW prefetch | Y | Y | – | 2 | 5.864 GB/s | 6.009 GB/s | 7.667 GB/s | 6.676 GB/s | 7.469 GB/s |
Version014 | Version011 minus SW prefetch | Y | Y | – | 4 | 6.743 GB/s | 6.483 GB/s | 8.136 GB/s | 6.946 GB/s | 8.291 GB/s |
Version015 | Version012 minus SW prefetch | Y | Y | – | 8 | 6.978 GB/s | 6.578 GB/s | 8.112 GB/s | 6.937 GB/s | 8.463 GB/s |
Comments
There are lots of results in the table above, and I freely admit that I don’t understand all of the details. Still, several important patterns in the data are instructive:
- For the most part, the 2p Istanbul results are slightly slower than the 2p Shanghai results. This is exactly what is expected given the slightly better memory latency of the Shanghai system (74 ns vs 78 ns). The effective concurrency (Measured Bandwidth * Idle Latency) is almost identical across all fifteen implementations.
- The 4-socket Istanbul system gets a large boost in performance from the activation of the “HT Assist” feature, AMD’s implementation of what are typically referred to as “probe filters”. By tracking potentially modified cache lines, this feature allows reduction in memory latency for the common case of data that is not modified in other caches. The local memory latency on the 4p Istanbul box is about 54 ns, compared to 78 ns on the 2p Istanbul box (where the “HT Assist” feature is not activated by default). The performance boost seen is not as large as the latency ratio, but the improvements are still large.
- This is my first set of microbenchmark measurements on a “Magny-Cours” system, so there are probably some details that I need to learn about. Idle memory latency on the system is 56.4 ns, slightly higher than on the 4p Istanbul system (as is expected with the slower processor cores: 2.2 GHz vs 2.6 GHz), but the slow-down is worse than would be expected from straight latency ratios. Overall, however, the performance profile of the Magny-Cours is similar to that of the 4p Istanbul box, but with slightly lower effective concurrency in most of the code versions tested here. Note that the Magny-Cours system is configured with much faster DRAM: DDR3/1333 compared to DDR2/800. The similarity of the results strongly supports the hypothesis that sustained bandwidth is controlled by concurrency when running a single thread.
- The best performance is provided by the cheapest box — a single-socket desktop system. This is not surprising given the low memory latency on the single socket system.
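As a quick check of the “HT Assist” observation above, the following sketch compares the 2p and 4p Istanbul columns of the results table against the ratio of the two idle latencies (78 ns and 54 ns). The bandwidth values are copied from the table; the variable names are only illustrative and are not part of the tar file.

```python
# Measured single-thread bandwidth in GB/s for Version001..Version015,
# copied from the "2-socket Istanbul" and "4-socket Istanbul" table columns.
bw_2p_istanbul = [3.167, 4.035, 4.373, 5.732, 5.765, 5.159, 5.234, 4.914,
                  5.901, 6.174, 6.804, 6.960, 6.009, 6.483, 6.578]
bw_4p_istanbul = [4.311, 5.719, 5.946, 6.489, 6.600, 6.787, 7.149, 6.661,
                  6.646, 7.569, 8.319, 8.378, 7.667, 8.136, 8.112]

# Idle local memory latencies quoted in the comments above.
latency_2p_ns = 78.0  # HT Assist not activated
latency_4p_ns = 54.0  # HT Assist activated
latency_ratio = latency_2p_ns / latency_4p_ns

# Bandwidth speedup of the 4p box over the 2p box, one entry per code version.
speedups = [b4 / b2 for b2, b4 in zip(bw_2p_istanbul, bw_4p_istanbul)]

for version, speedup in enumerate(speedups, start=1):
    print(f"Version{version:03d}: {speedup:.3f}x")
print(f"latency ratio: {latency_ratio:.3f}x")
```

Every version runs faster on the 4p system, but none of the speedups reaches the latency ratio of roughly 1.44.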
Several of the comments above refer to the “Effective Concurrency”, which I compute as the product of the measured Bandwidth and the idle memory Latency (see my earlier post for some example data). For the test cases and systems mentioned above, the effective concurrency (measured in cache lines) is presented below:
Computed “Effective Concurrency” (= memory latency * measured bandwidth) for all 15 versions of the ReadOnly code on five test systems.
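A minimal sketch of the effective concurrency calculation, using the Version012 bandwidths from the table and the idle latencies quoted in the comments above. It assumes 64-byte cache lines and 1 GB = 10^9 bytes; the PhenomII is omitted because its idle latency is not quoted in this post. The dictionary and function names are illustrative only.

```python
CACHE_LINE_BYTES = 64  # assumption: 64-byte cache lines

# Idle memory latency in ns, as quoted in the comments above.
idle_latency_ns = {
    "2p Shanghai": 74.0,
    "2p Istanbul": 78.0,
    "4p Istanbul": 54.0,
    "4p Magny-Cours": 56.4,
}

# Version012 bandwidth in GB/s, copied from the results table.
version012_gbs = {
    "2p Shanghai": 7.260,
    "2p Istanbul": 6.960,
    "4p Istanbul": 8.378,
    "4p Magny-Cours": 7.205,
}


def effective_concurrency(bandwidth_gbs, latency_ns):
    """Bandwidth * latency, expressed in cache lines.

    GB/s times ns gives bytes directly (1e9 * 1e-9 = 1), assuming
    1 GB = 1e9 bytes.
    """
    return bandwidth_gbs * latency_ns / CACHE_LINE_BYTES


concurrency = {
    system: effective_concurrency(version012_gbs[system], idle_latency_ns[system])
    for system in version012_gbs
}

for system, lines in concurrency.items():
    print(f"{system}: {lines:.2f} cache lines")
```

For Version012 this gives roughly 8.4 cache lines on the 2p Shanghai system and roughly 8.5 on the 2p Istanbul system, and lower values (about 7.1 and 6.3) on the 4p Istanbul and Magny-Cours systems.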
Pete Stevenson says
Hi John,
I would be very interested to see if your NUMA systems see a boost in performance by using the following:
numactl --interleave all -- cmd
where cmd is your binary.
I have a dual socket Nehalem with 3 channels per socket, all channels populated with DDR3-1333. I calculate the peak theoretical BW as ~60 GB/sec. Using stream, I see about 10 GB/sec, or 1/3 of one socket capacity. When using numactl --interleave all, I see about 20 GB/sec (or 1/3 of total capacity). I don’t get a full 2x, but it’s close. I can e-mail you the detailed results if you are interested. FWIW, I believe Linux will allocate the memory on the socket that first touches it, i.e., for the streaming benchmarks, *all* of the memory is allocated on one socket only (unless the interleave policy is set).
Thank you,
Pete Stevenson
John D. McCalpin, Ph.D. says
A STREAM result of 9-10 GB/s is typical for a single thread on a Nehalem or Westmere system, whether configured with two or three DDR3 DRAM channels.
The standard STREAM implementation initializes the data using the same processor/memory affinity that the benchmark kernels use, so a “first touch” memory allocation policy should result in (almost) all local accesses. Under Linux, I usually run the benchmark using “numactl” to enforce local first touch allocation.
A dual-socket Nehalem EP system with three channels of DDR3/1066 per socket has a peak bandwidth of 51.2 GB/s and typically delivers 30-31 GB/s on STREAM when using 4, 6, or 8 threads (evenly distributed across the two chips).
A dual-socket Westmere EP system with three channels of DDR3/1333 per socket has a peak bandwidth of 64.0 GB/s and typically delivers 40-41 GB/s on STREAM when using 6, 8, 10, or 12 threads (evenly distributed across the two chips).
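The peak figures above follow from sockets × channels × 8 bytes per transfer × transfer rate. The sketch below reproduces them, assuming the nominal DDR3 transfer rates of 1066.67 MT/s and 1333.33 MT/s and a 64-bit (8-byte) channel; the function name is illustrative only.

```python
def peak_bandwidth_gbs(sockets, channels_per_socket, mega_transfers_per_s,
                       bytes_per_transfer=8):
    """Peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes)."""
    total_channels = sockets * channels_per_socket
    return total_channels * bytes_per_transfer * mega_transfers_per_s / 1000.0


# Nominal transfer rates: DDR3/1066 is 3200/3 MT/s, DDR3/1333 is 4000/3 MT/s.
nehalem_peak = peak_bandwidth_gbs(2, 3, 3200 / 3)
westmere_peak = peak_bandwidth_gbs(2, 3, 4000 / 3)

# Typical multi-threaded STREAM ranges quoted above, as fractions of peak.
nehalem_fraction = (30 / nehalem_peak, 31 / nehalem_peak)
westmere_fraction = (40 / westmere_peak, 41 / westmere_peak)

print(f"Nehalem EP peak:  {nehalem_peak:.1f} GB/s")
print(f"Westmere EP peak: {westmere_peak:.1f} GB/s")
print(f"Nehalem EP STREAM fraction:  {nehalem_fraction[0]:.2f} to {nehalem_fraction[1]:.2f}")
print(f"Westmere EP STREAM fraction: {westmere_fraction[0]:.2f} to {westmere_fraction[1]:.2f}")
```

In both cases the typical STREAM result is roughly 60% of the peak value.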
My experience has been that performance with interleaving is highly variable because of the 4kB page granularity. Cache-line interleaving can be very effective if sufficient link bandwidth is provided between the chips (as in POWER4/POWER5/POWER7 MCM-based systems), but page-level interleaving usually results in short-lived “hot spots” limiting the scalability.
With cache-line interleaving, each thread fetches two or three streams of cache lines from alternating sockets. This fine granularity allows the memory controller access reordering to work reasonably effectively. With page-level interleaving, on the other hand, each thread will issue 64 consecutive cache line fetches to each of two or three arrays. It is unlikely that these will be evenly distributed across the two chips, and the large number of contiguous fetches makes it impossible for practical memory controllers to reorder around the blocks.
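To illustrate the granularity difference, here is a simplified model (an assumption for illustration, not a description of any particular memory controller) in which addresses are distributed round-robin across two chips at either cache-line or page granularity. It counts how many consecutive 64-byte cache lines of a single contiguous stream land on the same chip.

```python
CACHE_LINE = 64   # bytes
PAGE = 4096       # bytes (4 KiB pages)


def home_node(addr, granularity, nodes=2):
    """Chip that owns an address under simple round-robin interleaving."""
    return (addr // granularity) % nodes


def run_lengths(granularity, n_lines=256, nodes=2):
    """Lengths of runs of consecutive cache lines served by the same chip."""
    homes = [home_node(i * CACHE_LINE, granularity, nodes) for i in range(n_lines)]
    runs = []
    count = 1
    for prev, cur in zip(homes, homes[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs


line_runs = run_lengths(CACHE_LINE)
page_runs = run_lengths(PAGE)

print("cache-line interleave, longest run:", max(line_runs))
print("page interleave, longest run:", max(page_runs))
```

In this model a contiguous stream alternates chips on every fetch with cache-line interleaving, but stays on one chip for 4096 / 64 = 64 consecutive fetches with page interleaving.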
Pete Stevenson says
John,
Thanks for the note in reply. I still think using a page interleaved policy could improve your single thread results (i.e. for situations where your optimizations didn’t push you to a concurrency of 8). Realizing that this is somewhat beside the point — I want to get a version of stream working on my system that demonstrates its peak sustainable bandwidth.
System specs:
Nehalem/Xeon E5520 @ 2.26 GHz
dual socket
3 channels / socket
12 DIMMs @ 4 GB each for 48 GB
all DIMMs are DDR3/1333
It has come to my attention (feel free to correct me, or comment) that the E5520 part only supports up to DDR3/1066, thus my peak theoretical bw is:
2*3*8*1.066 = 51.2 GB/sec
So far I have gotten stream up to 22 GB/sec and I have seen X86membench from BenchIT go to 26 GB/sec. I do like the fact that stream is a much simpler test bench. The question becomes: what are the essential tricks I need to use to get to the highest peak sustainable bandwidth (i.e. as demonstrated by stream)?
Thank you,
Pete Stevenson
John D. McCalpin, Ph.D. says
If I understand your results correctly, then you are already getting amazingly good bandwidth for a single thread on a two-socket system.
For the four kernels of the STREAM benchmark, there are two cases to look at — 1:1 read/write (COPY and SCALE) and 2:1 read/write (ADD and TRIAD).
Case 1: COPY and SCALE kernels with 1:1 read/write traffic:
The Nehalem E5520 runs its QPI links at up to 5.86 Gtransfers/sec, or a peak of 11.5 GB/s per direction. I don’t know the QPI protocol in detail, but for bidirectional traffic generated by one local read, one local non-temporal store, one remote read, and one remote non-temporal store I would expect a protocol overhead (including outbound requests, probe responses, data packet headers, and flow control) of about 45%, so the limiter should be the sustained value of 11.5*55%= ~6.3 GB/s of read traffic plus ~6.3 GB/s write traffic on the QPI link.
The overall peak bandwidth should therefore be about 25-26 GB/s, consisting of 6.3 GB/s for each of (local and remote) (reads and writes).
This is very close to the value that you quoted from X86membench.
Case 2: ADD and TRIAD kernels with 2:1 read/write traffic:
Again, I don’t know the QPI protocol in detail, but for bidirectional traffic generated by two local reads, one local non-temporal store, two remote reads and one remote non-temporal store I would expect a minimum of about 35% protocol overhead (including outbound requests, probe responses, data packet headers, and flow control), so the limiter should be the sustained value of 11.5*65%= ~7.5 GB/s of read traffic on the QPI link.
The total bandwidth (for the ADD and TRIAD kernels) would then be
7.5 GB/s reads from the remote chip
7.5 GB/s reads from the local chip
3.75 GB/s writes to the remote chip
3.75 GB/s writes to the local chip
----------
22.4 GB/s <-- very close to what you are observing
Of course these values are just estimates, based on my expectations of the types and sizes of the transactions that have to be included in the QPI cache coherence protocol.
These estimates could be tightened by running carefully controlled microbenchmarks and using the Nehalem performance counters to monitor the specific transactions on the QPI interface, but at first glance I would say that you are pretty close to the limits of what the QPI interface can support for this set of transactions.
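The arithmetic behind the two estimates can be sketched as follows. The 45% and 35% protocol overheads are the rough guesses stated above, not measured values, and the function names are illustrative only.

```python
QPI_PEAK_GBS = 11.5  # peak per direction, as quoted above


def copy_scale_estimate(overhead=0.45):
    """Case 1 (1:1 read/write): four equal streams limited by the QPI link."""
    per_stream = QPI_PEAK_GBS * (1.0 - overhead)
    # local reads + local writes + remote reads + remote writes
    return 4 * per_stream


def add_triad_estimate(overhead=0.35):
    """Case 2 (2:1 read/write): writes are half of the read traffic."""
    remote_reads = QPI_PEAK_GBS * (1.0 - overhead)
    local_reads = remote_reads
    remote_writes = remote_reads / 2
    local_writes = remote_writes
    return remote_reads + local_reads + remote_writes + local_writes


case1_total = copy_scale_estimate()
case2_total = add_triad_estimate()

print(f"COPY/SCALE estimate: {case1_total:.1f} GB/s")
print(f"ADD/TRIAD estimate:  {case2_total:.1f} GB/s")
```

Changing the assumed overhead by a few percentage points moves both totals by roughly 1 to 2 GB/s, which is why the agreement with the observed values should be read as approximate.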
Wei says
Dr. McCalpin,
Thank you for these great articles. I have a question about your programs. I hope you can help me with it.
I tried your programs on an Opteron 6174 “Magny-Cours”, the same one as you used. Its memory is also DDR3/1333. The first 6 programs, Version001 to Version006, gave me similar results to yours. However, Version007 (the one with 2M pages) and all other large-page programs gave me only 3 GB/s. I think this is caused by the difference in our memory configurations, probably a different physical memory to DRAM address mapping (rank, bank, row, col etc.)? Also, do you know how to determine the DRAM address mapping on my machine?
Thank you,
Wei