Optimizing AMD Opteron Memory Bandwidth, Part 5: single-thread, read-only
Posted by John D. McCalpin, Ph.D. on 11th November 2010
Single Thread, Read Only Results Comparison Across Systems
In Part 1, Part 2, Part 3, and Part 4, I reviewed performance issues for a single-threaded program executing a long vector sum-reduction (a read-only computational kernel over a single array) on a 2-socket system with a pair of AMD Family10h Opteron Revision C2 (“Shanghai”) quad-core processors. In today’s post, I present results for the same set of 15 implementations run on four additional systems.
Test Systems
- 2-socket AMD Family10h Opteron Revision C2 (“Shanghai”), 2.9 GHz quad-core, dual-channel DDR2/800 per socket. (This is the reference system.)
- 2-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
- 4-socket AMD Family10h Opteron Revision D0 (“Istanbul”), 2.6 GHz six-core, dual-channel DDR2/800 per socket.
- 4-socket AMD Family10h Opteron 6174, Revision E0 (“Magny-Cours”), 2.2 GHz twelve-core, four-channel DDR3/1333 per socket.
- 1-socket AMD PhenomII 555, Revision C2, 3.2 GHz dual-core, dual-channel DDR3/1333.
All systems were running TACC’s customized Linux kernel, except for the PhenomII, which was running Fedora 13. The same set of binaries, generated by the Intel version 11.1 C compiler, was used in all cases.
The source code, scripts, and results are all available in a tar file: ReadOnly_2010-11-12.tar.bz2
Results
Code Version | Notes | Vector SSE | Large Page | SW Prefetch | 4 KiB pages accessed | Ref System (2p Shanghai) | 2-socket Istanbul | 4-socket Istanbul | 4-socket Magny-Cours | 1-socket PhenomII |
---|---|---|---|---|---|---|---|---|---|---|
Version001 | “-O1” | – | – | – | 1 | 3.401 GB/s | 3.167 GB/s | 4.311 GB/s | 3.734 GB/s | 4.586 GB/s |
Version002 | “-O2” | – | – | – | 1 | 4.122 GB/s | 4.035 GB/s | 5.719 GB/s | 5.120 GB/s | 5.688 GB/s |
Version003 | 8 partial sums | – | – | – | 1 | 4.512 GB/s | 4.373 GB/s | 5.946 GB/s | 5.476 GB/s | 6.207 GB/s |
Version004 | add SW prefetch | – | – | Y | 1 | 6.083 GB/s | 5.732 GB/s | 6.489 GB/s | 6.389 GB/s | 7.571 GB/s |
Version005 | add vector SSE | Y | – | Y | 1 | 6.091 GB/s | 5.765 GB/s | 6.600 GB/s | 6.398 GB/s | 7.580 GB/s |
Version006 | remove prefetch | Y | – | – | 1 | 5.247 GB/s | 5.159 GB/s | 6.787 GB/s | 6.403 GB/s | 6.976 GB/s |
Version007 | add large pages | Y | Y | – | 1 | 5.392 GB/s | 5.234 GB/s | 7.149 GB/s | 6.653 GB/s | 7.117 GB/s |
Version008 | split into triply-nested loop | Y | Y | – | 1 | 4.918 GB/s | 4.914 GB/s | 6.661 GB/s | 6.180 GB/s | 6.616 GB/s |
Version009 | add SW prefetch | Y | Y | Y | 1 | 6.173 GB/s | 5.901 GB/s | 6.646 GB/s | 6.568 GB/s | 7.736 GB/s |
Version010 | multiple pages/loop | Y | Y | Y | 2 | 6.417 GB/s | 6.174 GB/s | 7.569 GB/s | 6.895 GB/s | 7.913 GB/s |
Version011 | multiple pages/loop | Y | Y | Y | 4 | 7.063 GB/s | 6.804 GB/s | 8.319 GB/s | 7.245 GB/s | 8.583 GB/s |
Version012 | multiple pages/loop | Y | Y | Y | 8 | 7.260 GB/s | 6.960 GB/s | 8.378 GB/s | 7.205 GB/s | 8.642 GB/s |
Version013 | Version010 minus SW prefetch | Y | Y | – | 2 | 5.864 GB/s | 6.009 GB/s | 7.667 GB/s | 6.676 GB/s | 7.469 GB/s |
Version014 | Version011 minus SW prefetch | Y | Y | – | 4 | 6.743 GB/s | 6.483 GB/s | 8.136 GB/s | 6.946 GB/s | 8.291 GB/s |
Version015 | Version012 minus SW prefetch | Y | Y | – | 8 | 6.978 GB/s | 6.578 GB/s | 8.112 GB/s | 6.937 GB/s | 8.463 GB/s |
Comments
There are lots of results in the table above, and I freely admit that I don’t understand all of the details. There are a couple of important patterns in the data that are instructive:
- For the most part, the 2p Istanbul results are slightly slower than the 2p Shanghai results. This is exactly what is expected given the slightly better memory latency of the Shanghai system (74 ns vs 78 ns). The effective concurrency (Measured Bandwidth * Idle Latency) is almost identical across all fifteen implementations.
- The 4-socket Istanbul system gets a large boost in performance from the activation of the “HT Assist” feature, AMD’s implementation of what are typically referred to as “probe filters”. By tracking potentially modified cache lines, this feature reduces memory latency in the common case of data that is not modified in other caches. The local memory latency on the 4p Istanbul box is about 54 ns, compared to 78 ns on the 2p Istanbul box (where the “HT Assist” feature is not activated by default). The performance boost seen is not as large as the latency ratio, but the improvements are still substantial.
- This is my first set of microbenchmark measurements on a “Magny-Cours” system, so there are probably some details that I still need to learn about. Idle memory latency on the system is 56.4 ns, slightly higher than on the 4p Istanbul system (as expected with the slower processor cores: 2.2 GHz vs 2.6 GHz), but the slowdown in sustained bandwidth is larger than straight latency ratios would predict. Overall, however, the performance profile of the Magny-Cours is similar to that of the 4p Istanbul box, with slightly lower effective concurrency in most of the code versions tested here. Note that the Magny-Cours system is configured with much faster DRAM: DDR3/1333 compared to DDR2/800. The similarity of the results despite this difference strongly supports the hypothesis that sustained bandwidth is controlled by concurrency when running a single thread.
- The best performance is provided by the cheapest box — a single-socket desktop system. This is not surprising given the low memory latency on the single socket system.
Several of the comments above refer to the “Effective Concurrency”, which I compute as the product of the measured bandwidth and the idle memory latency (see my earlier post for some example data). For the test cases and systems mentioned above, the effective concurrency (measured in cache lines) is presented below:

[Figure: Computed “Effective Concurrency” (= idle memory latency * measured bandwidth) for all 15 versions of the ReadOnly code on the five test systems.]
Posted in Computer Hardware