John McCalpin's blog

Dr. Bandwidth explains all….

AMD Opteron Memory Configuration notes

Posted by John D. McCalpin, Ph.D. on 16th December 2015

(These are old notes for relatively old systems — I just found this in my “drafts” folder and decided to switch the status to “public” so I can find it again!)

Some notes on how to determine the DRAM and Memory Controller configuration for a system using AMD Opteron/Phenom or other Family 10h processors.  All of this information is available in AMD’s publication: “BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors”, document number 31116. I am using “Rev 3.48 – April 22, 2010”.

Background: How to Read and Interpret the PCI Configuration Bits

The processor configuration bits are available in PCI configuration space and can be read with the “lspci” program.  Unfortunately it requires a relatively new kernel (2.6.26 or newer) to read the extended configuration bits (i.e., those with offsets greater than 256 Bytes) — I will try to mark the problematic configuration bits as I go along.   To get the configuration bits, run “lspci -xxxx” (as root) and save the text output.

The “lspci” program prints out the output by bus, device, function, and offset.  Here we are only interested in the processor configurations, so we look through the output until we get to a line that looks like:

00:18.0 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] HyperTransport Configuration

The initial characters of the line are interpreted as “bus 0”, “device 18”, “function 0”, followed by a text label for this PCI configuration space region.

The following lines will look like:

00: 22 10 00 12 00 00 10 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 00 00 00 00

These are hexadecimal dumps of the values at various “offsets” into the PCI configuration space.

The values are organized with the higher addresses and most significant bits to the right (except that within each 2-digit hexadecimal number the least significant bits are to the right!).  These PCI configuration space values are organized into 32-bit “registers”, so the first line above corresponds to

Offset        00                      04                      08                      0C
bits          7:0 15:8 23:16 31:24    7:0 15:8 23:16 31:24    7:0 15:8 23:16 31:24    7:0 15:8 23:16 31:24
value (hex)   22  10   00    12       00  00   10    00       00  00   00    06       00  00   80    00

This first line of output corresponds to four 32-bit PCI configuration space registers, shown as offsets 00, 04, 08, and 0C in the table. The BKDG describes these as:
F0x00 Device/Vendor ID Register
Reset: 1200 1022h. <-- note the “little-endian” ordering
F0x04 Status/Command Register
Reset: 0010 0000h. Bit[20] is set to indicate the existence of a PCI-defined capability block.
F0x08 Class Code/Revision ID Register
Reset: 0600 0000h.
F0x0C Header Type Register
Reset: 0080 0000h.
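
To make the byte ordering concrete, here is a small C sketch (my own illustration, not an AMD tool) that reads the configuration space of the 00:18.0 device through the Linux sysfs interface and assembles the bytes into 32-bit registers. The sysfs path below is an assumption; adjust the bus/device/function to match the lspci output on your system, and remember that the extended configuration space (offsets of 100h and above) is only visible with a sufficiently new kernel and sufficient privileges.

/* Sketch: read PCI configuration space for the Family 10h function 0 device
 * at 00:18.0 and print the first few 32-bit registers.
 * The sysfs path is an assumption -- adjust it to match your system. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const char *path = "/sys/bus/pci/devices/0000:00:18.0/config";
    FILE *f = fopen(path, "rb");
    if (f == NULL) {
        perror(path);
        return 1;
    }

    uint8_t bytes[4096];
    size_t n = fread(bytes, 1, sizeof(bytes), f);  /* how much is readable depends on kernel and privileges */
    fclose(f);

    /* Each 32-bit register is stored little-endian: the byte at the lowest
     * offset holds bits 7:0, the next byte bits 15:8, and so on. */
    for (size_t off = 0; off + 3 < n && off < 0x10; off += 4) {
        uint32_t reg = (uint32_t)bytes[off]
                     | ((uint32_t)bytes[off + 1] << 8)
                     | ((uint32_t)bytes[off + 2] << 16)
                     | ((uint32_t)bytes[off + 3] << 24);
        printf("F0x%02zX = %08Xh\n", off, (unsigned)reg);
    }
    return 0;
}

On the example system above, the first register prints as 12001022h, matching the Device/Vendor ID reset value quoted from the BKDG.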

Important Memory Controller and DRAM Configuration Bits

Information About Installed Hardware

The first thing to look at is the properties of the installed hardware. The DIMM configuration information is contained in the F2x[1,0]80 DRAM Bank Address Mapping Register. The Family 10h Opterons have two memory controllers — the information for controller 0 is located at Function 2, Offset 080, while the information for controller 1 is located at Function 2, Offset 180. Note that these offsets are specified in hexadecimal, so that offsets greater than 100 (=256 decimal) are located in the extended PCI configuration area and will not be included in the output of “lspci” on systems running 2.6.25 or earlier Linux kernels. For this configuration register the inability to read the controller 1 values is not likely to be a problem — if the DIMMs have been properly installed in matching pairs, then the two memory controllers will be configured identically by the BIOS.

For the system I am looking at today, the output of lspci for Function 2, offset 80 is

80:	55	00	00	00

Comparing the description of the bit field mappings from the BKDG (page 238) with my data for this configuration register and the device information in Table 85 (page 239) of the BKDG gives:

Bits   Field Name    Value       Meaning
31:16  Reserved      0000h       (it is nice to see that these reserved bits are actually zero)
15:12  Dimm3AddrMap  0h          ignored because this DIMM is not populated
11:8   Dimm2AddrMap  0h          ignored because this DIMM is not populated
7:4    Dimm1AddrMap  5h = 0101b  CS size = 1 GB, Device size/width = 1G, x8, Bank Address bits = 15,14,13
3:0    Dimm0AddrMap  5h = 0101b  CS size = 1 GB, Device size/width = 1G, x8, Bank Address bits = 15,14,13

These interpretations make sense. The DIMMs are composed of 1 Gbit DRAM chips, each with 8 output bits (“x8”). To create a 64-bit DIMM, 8 of these DRAM chips work together as a single “rank” (actually there are 9 chips in a rank, to provide extra bits for ECC error correction). This “rank” will then have a capacity of 1 Gbit/chip * 8 chips = 1 GiB. (Here I am using the newer notation to distinguish between binary and decimal sizes — see http://en.wikipedia.org/wiki/Gibibyte for more information.)
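
Extracting the four 4-bit DimmAddrMap fields from this register value is just a matter of masking and shifting; a minimal C sketch (mine, not from the BKDG) is shown below. Translating each 4-bit code into a chip-select size still requires Table 85 of the BKDG.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* F2x80 assembled little-endian from the dump "80: 55 00 00 00" */
    uint32_t f2x80 = 0x00000055;

    /* Dimm0AddrMap = bits 3:0, Dimm1AddrMap = bits 7:4,
       Dimm2AddrMap = bits 11:8, Dimm3AddrMap = bits 15:12 */
    for (int dimm = 0; dimm < 4; dimm++) {
        unsigned addrmap = (f2x80 >> (4 * dimm)) & 0xF;
        printf("Dimm%dAddrMap = %Xh\n", dimm, addrmap);
    }
    return 0;
}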

A few things to note:

  • The bits in this configuration register don’t tell whether or not a DIMM is installed. The value of ‘0’ for Dimm2AddrMap and Dimm3AddrMap could correspond to a 128 MiB chip select size composed of 256 Mb parts with x16 width. It is quite unlikely that anyone will run across such a DIMM in their system — 256 Mbit parts are old technology and x16 width is quite unusual outside of the embedded processor space — but there may be no guarantee that the BIOS will set the bits here to such an easily identified value in the event that no DIMM is installed in that slot, so you do need to look elsewhere to be sure of the configuration.
  • The bits in this configuration register don’t tell how many “ranks” are included on each DIMM. A DIMM can be constructed with 1, 2, or 4 ranks (though 1 and 2 are by far the most common), so you need to check elsewhere to find the number of ranks.
  • In any system using both DRAM channels (either ganged or unganged-but-interleaved — see below) the address bit numbers in the table above must be incremented by one. The BKDG includes a comment on the bottom of page 238 that the address bits only need to be incremented when running in ganged mode. I think this is incorrect — the “effective” bank size is doubled (corresponding to incrementing the bank address bits) by the use of two DRAM channels, whether the interleaving is within a cache line (as in ganged mode) or between cache lines (as in unganged-but-interleaved mode).

Information About Configuration of Hardware

Memory Controller Channel Interleave or Ganging

There are a number of common configuration choices that determine how the hardware makes use of the two 64-bit DRAM channels.

  • One feature is called “DRAM controller ganging”, which sets up the two DRAM channels to work in lockstep, with each cache line being split between the two channels. This feature is typically activated when the strongest error-correction features are desired. The downside of this approach is that each memory controller is only transferring 32 Bytes per cache line, which corresponds to 4 8-Byte bursts. This is the shortest burst length supported by DDR2 memory and makes the DRAM bus overheads relatively larger. DRAM controller ganging is enabled if F2x110 DRAM Controller Select Low Register Bit 4: DctGangEn: “DRAM controller ganging enable”, is set.
  • If the memory controllers are not ganged, the BIOS will attempt to set up “DRAM channel interleaving”. In this mode, each channel transfers full cache lines, and cache lines are distributed evenly between the two channels.
    DRAM channel interleaving is enabled if F2x110 DRAM Controller Select Low Register Bit 5: DctDatIntLv: “DRAM controller data interleave enable” is set. There are a number of options controlling exactly how the cache lines are mapped to the two DRAM channels. These options are controlled by F2x110 DRAM Controller Select Low Register Bits 7:6 DctSelIntLvAddr: “DRAM controller select channel interleave address bit”. The recommended option is the value “10”, which causes the DRAM channel select bit to be computed as the Exclusive-OR of address bits 20:16 & 6. When the value is “1”, channel 1 is selected, otherwise channel 0 is selected. Bit 6 is the bit above the cache line address, so using bit 6 alone would cause cache lines to be mapped odd/even to the two DRAM channels. By computing the channel select bit using the Exclusive-OR of six bits it is much less likely that an access pattern will repeatedly access only one of the two DRAM channels. A sketch of this channel-select computation is shown after this list.
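
Here is a short C sketch of the recommended DctSelIntLvAddr = 10b channel-select computation described above (my own illustration of the formula, not code from the BKDG): the select bit is the exclusive-OR of physical address bits 20:16 and bit 6. The sketch also shows where the DctGangEn, DctDatIntLv, and DctSelIntLvAddr fields sit within F2x110.

#include <stdio.h>
#include <stdint.h>

/* Channel select for DctSelIntLvAddr = 10b: XOR of address bits 20:16 and bit 6.
   A result of 1 selects DRAM channel 1, otherwise channel 0. */
static int dram_channel(uint64_t paddr)
{
    return (int)(((paddr >> 20) ^ (paddr >> 19) ^ (paddr >> 18) ^
                  (paddr >> 17) ^ (paddr >> 16) ^ (paddr >> 6)) & 1);
}

int main(void)
{
    /* F2x110 DRAM Controller Select Low Register -- substitute the value
       assembled from your own lspci dump (0 is just a placeholder). */
    uint32_t f2x110 = 0;
    int dct_gang_en    = (f2x110 >> 4) & 1;   /* Bit 4: DctGangEn */
    int dct_dat_int_lv = (f2x110 >> 5) & 1;   /* Bit 5: DctDatIntLv */
    int dct_sel_intlv  = (f2x110 >> 6) & 3;   /* Bits 7:6: DctSelIntLvAddr */
    printf("DctGangEn=%d DctDatIntLv=%d DctSelIntLvAddr=%d\n",
           dct_gang_en, dct_dat_int_lv, dct_sel_intlv);

    /* Consecutive cache lines (64 Bytes apart) alternate channels via bit 6;
       the XOR with bits 20:16 scrambles the mapping for larger power-of-two strides. */
    for (uint64_t addr = 0; addr < 8 * 64; addr += 64)
        printf("address 0x%06llx -> channel %d\n",
               (unsigned long long)addr, dram_channel(addr));
    return 0;
}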


AMD Opteron “Shanghai” and “Istanbul” Local and Remote Memory Latencies

Posted by John D. McCalpin, Ph.D. on 27th July 2012

In an earlier post, I documented the local and remote memory latencies for the SunBlade X6420 compute nodes in the TACC Ranger supercomputer, using AMD Opteron “Barcelona” (model 8356) processors running at 2.3 GHz.

Similar latency tests were run on other systems based on AMD Opteron processors in the TACC “Discovery” benchmarking cluster.  The systems reported here include

  • 2-socket node with AMD Opteron “Shanghai” processors (model 2389, quad-core, revision C2, 2.9 GHz)
  • 2-socket node with AMD Opteron “Istanbul” processors (model 2435, six-core, revision D0, 2.6 GHz)
  • 4-socket node with AMD Opteron “Istanbul” processors (model 8435, six-core, revision D0, 2.6 GHz)
  • 4-socket node with AMD Opteron “Magny-Cours” processors (model 6174, twelve-core, revision D1, 2.2 GHz)

Compared to the previous results with the AMD Opteron “Barcelona” processors on the TACC Ranger system, the “Shanghai” and “Istanbul” processors have a more advanced memory controller prefetcher, and the “Istanbul” processor also supports “HT Assist”, which allocates a portion of the L3 cache to serve as a “snoop filter” in 4-socket configurations.  The “Magny-Cours” processor uses 2 “Istanbul” die in a single package.

Note that for the “Barcelona” processors, the hardware prefetcher in the memory controller did not perform cache coherence “snoops” — it just loaded data from memory into a buffer.  When a core subsequently issued a load for that address, the miss in the L3 cache initiated a coherence snoop.   In both 2-socket and 4-socket systems, this snoop took longer than obtaining the data from DRAM, so the memory prefetcher had no impact on the effective latency seen by a processor core.  “Shanghai” and later Opteron processors include a coherent prefetcher, so prefetched lines could be loaded with lower effective latency.   This difference means that latency testing on “Shanghai” and later processors needs to be slightly more sophisticated to prevent memory controller prefetching from biasing the latency measurements.  In practice, using a pointer-chasing code with a stride of 512 Bytes was sufficient to avoid hardware prefetch in “Shanghai”, “Istanbul”, and “Magny-Cours”.
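
The structure of such a pointer-chasing test looks roughly like the C sketch below. This is a simplified stand-in for the modified lmbench “lat_mem_rd” code that was actually used (the array size, stride, and iteration count here are illustrative), but it shows the two essential features: serialized data-dependent loads, and a stride large enough to avoid the hardware prefetchers.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ARRAY_BYTES  (256UL * 1024 * 1024)   /* much larger than any L3 cache */
#define STRIDE_BYTES 512                     /* large enough to defeat HW prefetch */

int main(void)
{
    size_t nptrs = ARRAY_BYTES / sizeof(void *);
    size_t step  = STRIDE_BYTES / sizeof(void *);
    void **array = malloc(ARRAY_BYTES);
    if (array == NULL) return 1;

    /* Build a circular pointer chain with a fixed 512-Byte stride. */
    for (size_t i = 0; i < nptrs; i += step)
        array[i] = &array[(i + step) % nptrs];

    size_t iters = 100 * 1000 * 1000;        /* number of dependent loads to time */
    void **p = array;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < iters; i++)
        p = (void **)*p;                     /* each load depends on the previous one */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    /* Printing 'p' keeps the compiler from discarding the dependent loads. */
    printf("average load-to-use latency: %.1f ns (final pointer %p)\n",
           ns / (double)iters, (void *)p);
    free(array);
    return 0;
}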

Results for 2-socket systems

Processor     cores/package  Frequency (MHz)  Family  Revision  Code Name    Local Latency (ns)  Remote Latency (ns)
Opteron 2222  2              3000             0Fh     --        --           60                  95
Opteron 2356  4              2300             10h     B3        Barcelona    85                  --
Opteron 2389  4              2900             10h     C2        Shanghai     73                  --
Opteron 2435  6              2600             10h     D0        Istanbul     78                  --
Opteron 6174  12             2200             10h     D1        Magny-Cours  --                  --

Notes for 2-socket results:

  1. The values for the Opteron 2222 are from memory, but should be fairly accurate.
  2. The local latency value for the Opteron 2356 is from memory, but should be in the right ballpark.  The latency is higher than the earlier processors because of the lower core frequency, the lower “Northbridge” frequency, the presence of an L3 cache, and the asynchronous clock boundary between the core(s) and the Northbridge.
  3. The script used for the Opteron 2389 (“Shanghai”) did not correctly bind the threads, so no remote latency data was collected.
  4. The script used for the Opteron 2435 (“Istanbul”) did not correctly bind the threads, so no remote latency data was collected.
  5. The Opteron 6174 was not tested in the 2-socket configuration.

Results for 4-socket systems

Processor     Frequency (MHz)  Family  Revision  Code Name    Local Latency (ns)  Remote Latency (ns)  NOTES  1-hop median Latency (ns)  2-hop median Latency (ns)
Opteron 8222  3000             0Fh     --        --           90                  --                   --     --                         --
Opteron 8356  2300             10h     B3        Barcelona    100/133             122-146              1,2    --                         --
Opteron 8389  2900             10h     C2        Shanghai     --                  --                   --     --                         --
Opteron 8435  2600             10h     D0        Istanbul     56                  118                  3,4    --                         --
Opteron 6174  2200             10h     D1        Magny-Cours  56                  121-179              5      124                        179

Notes for 4-socket results:

  1. On the 4-socket Opteron 8356 (“Barcelona”) system, 2 of the 4 sockets have a local latency of 100ns, while the other 2 have a local latency of 133ns.  This is due to the asymmetric/anisotropic HyperTransport interconnect, in which only 2 of the sockets have direct connections to all other sockets, while the other 2 sockets require two hops to reach one of the remote sockets.
  2. On the 4-socket Opteron 8356 (“Barcelona”) system, the asymmetric/anisotropic HyperTransport interconnect gives rise to several different latencies for various combinations of requestor and target socket.  This is discussed in more detail at AMD “Barcelona” Memory Latency.
  3. Starting with “Istanbul”, 4-socket systems have lower local latency than 2-socket systems because “HyperTransport Assist” (a probe filter) is activated.  Enabling this feature reduces the L3 cache size from 6MiB to 5MiB, but enables the processor to avoid sending probes to the other chips in many cases (including this one).
  4. On the 4-socket Opteron 8435 (“Istanbul”) system, the scripts I ran had an error, so they measured only the local latency and the remote latency to 1 of the 3 remote sockets.  Based on other system results, it looks like the remote value was measured for a single “hop”, with no values available for the 2-hop case.
  5. On the 4-socket Opteron 6174 (“Magny-Cours”) system, each package has 2 die, each constituting a separate NUMA node.  The HyperTransport interconnect is asymmetric/anisotropic, with 2 die having direct links to 6 other die (with 1 die requiring 2 hops), and the other 6 die having direct links to 4 other die (with 3 die requiring 2 hops).  The average latency for globally uniform accesses (local and remote) is 133ns, while the average latency for uniformly distributed remote accesses is 144ns.

Comments

This disappointingly incomplete dataset still shows a few important features….

  • Local latency is not improving, and shows occasional degradations
    • Processor frequencies are not increasing.
    • DRAM latencies are essentially unchanged over this period — about 15 ns for DRAM page hits, 30 ns for DRAM page misses, 45 ns for DRAM page conflicts, and 60 ns for DRAM bank cycle time.   The latency benchmark is configured to allow open page hits for the majority of accesses, but these results did not include instrumentation of the DRAM open page hit rate.
    • Many design changes act to increase memory latency.
      • Major factors include the increased number of cores per chip, the addition of a shared L3 cache, the increase in the number of die per package, and the addition of a separate clock frequency domain for the Northbridge.
    • There have been no architectural changes to move away from the transparent, “flat” shared-memory paradigm.
    • Instead, overcoming these latency adders required the introduction of “probe filters” – a useful feature, but one that significantly complicates the implementation, uses 1/6th of the L3 cache, and significantly complicates performance analysis.
  • Remote latency is getting slowly worse
    • This is primarily due to the addition of the L3 cache, the increase in the number of cores, and the increase in the number of die per package.

Magny-Cours Topology

The pointer-chasing latency code was run for all 64 combinations of data binding (NUMA nodes 0..7) and thread binding (1 core chosen from each of NUMA nodes 0..7).   It was not initially clear which topology was used in the system under test, but the observed latency pattern showed very clearly that 2 NUMA nodes had 6 1-hop neighbors, while the other 6 NUMA nodes had only 4 1-hop neighbors.   This property is also shown by the “Max Perf” configuration from Figure 3(c) of the 2010 IEEE Micro paper “Cache Hierarchy and Memory Subsystem of the AMD Opteron Processor” by Conway, et al. (which I highly recommend for its discussion of the cache coherence protocol and probe filter).   The figure below corrects an error from the figure in the paper (which is missing the x16 link between the upper and lower chips in the package on the left), and renumbers the die so that they correspond to the NUMA nodes of the system I was testing.

The latency values are quite easy to understand: all the local values are the same, all the 1-hop values are almost the same, and all the 2-hop values are the same.  (The 1-hop values that are within the same package are about 3.3ns faster than the 1-hop values between packages, but this is a difference of less than 3%, so it will not impact “real-world” performance.)
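
For anyone who wants to reproduce this kind of sweep, the binding pattern looks roughly like the libnuma sketch below. The actual runs used scripts with numactl-style binding, so treat this purely as an illustrative stand-in; the inner measurement is the same pointer chase as in the earlier sketch.

/* Sketch: sweep all (data node, thread node) combinations using libnuma.
 * Compile with -lnuma. */
#include <stdio.h>
#include <numa.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    int nnodes = numa_max_node() + 1;
    for (int data_node = 0; data_node < nnodes; data_node++) {
        for (int cpu_node = 0; cpu_node < nnodes; cpu_node++) {
            numa_run_on_node(cpu_node);                       /* bind the thread */
            size_t bytes = 256UL * 1024 * 1024;
            void *buf = numa_alloc_onnode(bytes, data_node);  /* bind the data */
            if (buf == NULL) continue;

            /* ... build and time a pointer chase in 'buf' as in the earlier
             *     sketch, recording latency[cpu_node][data_node] ... */

            numa_free(buf, bytes);
        }
    }
    return 0;
}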

The bandwidth patterns are much less pretty, but that is a much longer topic for another day….


TACC Ranger Node Local and Remote Memory Latency Tables

Posted by John D. McCalpin, Ph.D. on 26th July 2012

In the previous post, I published my best set of numbers for local memory latency on a variety of AMD Opteron system configurations. Here I expand that to include remote memory latency on some of the systems that I have available for testing.

Ranger is the oldest system still operational here at TACC.  It was brought on-line in February 2008 and is currently scheduled to be decommissioned in early 2013.  Each of the 3936 SunBlade X6420 nodes contains four AMD “Barcelona” quad-core Opteron processors (model 8356), running at a core frequency of 2.3 GHz and a NorthBridge frequency of 1.6 GHz.  (The Opteron 8356 processor supports a higher NorthBridge frequency, but this requires a different motherboard with  “split-plane” power supply support — not available on the SunBlade X6420.)

The on-node interconnect topology of the SunBlade X6420 is asymmetric, making maximum use of the three HyperTransport links on each Opteron processor while still allowing 2 HyperTransport links to be used for I/O.

As seen in the figure below, chips 1 & 2 on each node are directly connected to each of the other three chips, while chips 0 & 3 are only connected to two other chips — requiring two “hops” on the HyperTransport network to access the third remote chip.  Memory latency on this system is bounded below by the time required to “snoop” the caches on the other chips.  Chips 1 & 2 are directly connected to the other chips, so they get their snoop responses back more quickly and therefore have lower memory latency.

Ranger compute node processor interconnect topology.

A variant of the “lat_mem_rd.c” program from “lmbench” (version 2) was used to measure the memory access latency.  The benchmark chases a chain of pointers that have been set up with a fixed stride of 128 Bytes (so that the core hardware prefetchers are not activated) and with a total size that significantly exceeds the size of the 2MiB L3 cache.  For the table below, array sizes of 32MiB to 1024MiB were used, with negligible variations in observed latency.    For this particular system, the memory controller prefetchers were active with the stride of 128 used, but since the effective latency is limited by the snoop response time, there is no change to the effective latency even when the memory controller prefetchers fetch the data early.  (I.e., the processors might get the data earlier due to memory controller prefetch, but they cannot use the data until all the snoop responses have been received.)

Memory latency for all combinations of (chip making request) and (chip holding data) are shown in the table below:

Memory Latency (ns)   Data on Chip 0  Data on Chip 1  Data on Chip 2  Data on Chip 3
Request from Chip 0   133.2           136.9           136.4           145.4
Request from Chip 1   140.3           100.3           122.8           139.3
Request from Chip 2   140.4           122.2           100.4           139.3
Request from Chip 3   146.4           137.4           137.4           134.9
Cache latency and local and remote memory latency for Ranger compute nodes.


AMD Opteron Processor models, families, and revisions

Posted by John D. McCalpin, Ph.D. on 2nd April 2011

Opteron Processor models, families, and revisions/steppings

Opteron naming is not that confusing, but AMD seems intent on making it difficult by rearranging their web site in mysterious ways….

I am creating this blog entry to make it easier for me to find my own notes on the topic!

The Wikipedia page has a pretty good listing:
List of AMD Opteron microprocessors

AMD has useful product comparison reference pages at:
AMD Opteron Processor Solutions
AMD Desktop Processor Solutions
AMD Opteron First Generation Reference (pdf)

Borrowing from those pages, a simple summary is:

First Generation Opteron: models 1xx, 2xx, 8xx.

  • These are all Family K8, and are described in AMD pub 26094.
  • They are usually referred to as “Rev E” or “K8, Rev E” processors.
    This is usually OK since most of the 130 nm parts are gone, but there is a new Family 10h rev E (below).
  • They are characterized by having DDR DRAM interfaces, supporting DDR 266, 333, and (Revision E) 400 MHz.
  • This also includes Athlon 64 and Athlon 64 X2 in sockets 754 and 939.
  • Versions:
    • Single core, 130 nm process: K8 revisions B3, C0, CG
    • Single core, 90 nm process: K8 revisions E4, E6
    • Dual core, 90 nm process: K8 revisions E1, E6

Second Generation Opteron: models 12xx, 22xx, 82xx

  • These are upgraded Family K8 cores, with a DDR2 memory controller.
  • They are usually referred to as “Revision F”, or “K8, Rev F”, and are described in AMD pub 32559 (where they are referred to as “Family NPT 0Fh”, with NPT meaning “New Platform Technology” and referring to the infrastructure related to socket F (aka socket 1207) and socket AM2).
  • This also includes socket AM2 models of Athlon and most Athlon X2 processors (some are Family 11h, described below).
  • There is only one server version, with two steppings:
    • Dual core, 90 nm process: K8 revisions F2, F3

Upgraded Second Generation Opteron: Athlon X2, Sempron, Turion, Turion X2

  • These are very similar to Family 0Fh, revision G (not used in server parts), and are described in AMD document 41256.
  • The memory controller has less functionality.
  • The HyperTransport interface is upgraded to support HyperTransport generation 3.
    This allows a higher frequency connection between the processor chip and the external PCIe controller, so that PCIe gen2 speeds can be supported.

Third Generation Opteron: models 13xx, 23xx, 83xx

  • These are Family 10h cores with an enhanced DDR2 memory controller and are described in AMD publication 41322.
  • All server and most desktop versions have a shared L3 cache.
  • This also includes Phenom X2, X3, and X4 (Rev B3) and Phenom II X2, X3, X4 (Rev C)
  • Versions:
    • Barcelona: Dual core & Quad core, 65 nm process: Family 10h revisions B0, B2, B3, BA
    • Shanghai: Dual core & Quad core, 45 nm process: Family 10h revision C2
    • Istanbul: Up to 6-core, 45 nm process: Family 10h, revision D0
  • Revision D (“Istanbul”) introduced the “HT Assist” probe filter feature to improve scalability in 4-socket and 8-socket systems.

Upgraded Third Generation Opteron: models 41xx & 61xx

  • These are Family 10h cores with an enhanced DDR3-capable memory controller and are also described in AMD publication 41322.
  • All server and most desktop versions have a shared L3 cache.
  • It does not appear that any of the desktop parts use this same stepping as the server parts (D1).
  • There are two versions — both manufactured using a 45nm process:
    • Lisbon: 41xx series have one Family10h revision D1 die per package (socket C32).
    • Magny-Cours: 61xx series have two Family10h revision D1 dice per package (socket G34).
  • Family 10h, Revision E0 is used in the Phenom II X6 products.
    • This revision is the first to offer the “Core Performance Boost” feature.
    • It is also the first to generate confusion about the label “Rev E”.
    • It should be referred to as “Family 10h, Revision E” to avoid ambiguity.

Fourth Generation Opteron: server processor models 42xx & 62xx, and “AMD FX” desktop processors

  • These are socket-compatible with the 41xx and 61xx series, but with the “Bulldozer” core rather than the Family 10h core.
  • The Bulldozer core adds support for:
    • AVX — the extension of SSE from 128 bits wide to 256 bits wide, plus many other improvements. (First introduced in Intel “Sandy Bridge” processors.)
    • AES — additional instructions to dramatically improve performance of AES encryption/decryption. (First introduced in Intel “Westmere” processors.)
    • FMA4 — AMD’s 4-operand multiply-accumulate instructions. (32-bit & 64-bit arithmetic, with 64b, 128b, or 256b vectors.)
    • XOP — AMD’s set of extra integer instructions that were not included in AVX: multiply/accumulate, shift/rotate/permute, etc.
  • All current parts are produced in a 32 nm semiconductor process.
  • Valencia: 42xx series have one Bulldozer revision B2 die per package (socket C32)
  • Interlagos: 62xx series have two Bulldozer revision B2 dice per package (socket G34)
  • “AMD FX”: desktop processors have one Bulldozer revision B2 die per package (socket AM3+)
  • Counting cores and chips is getting more confusing…
    • Each die has 1, 2, 3, or 4 “Bulldozer modules”.
    • Each “Bulldozer module” has two processor cores.
    • The two processor cores in a module share the instruction cache (64kB), some of the instruction fetch logic, the pair of floating-point units, and the 2MB L2 cache.
    • The two processor cores in a module each have a private data cache (16kB), private fixed point functional and address generation units, and schedulers.
    • All modules on a die share an 8 MB L3 cache and the dual-channel DDR3 memory controller.
  • Bulldozer-based systems are characterized by a much larger “turbo” boost frequency increase than previous processors, with almost all models supporting an automatic frequency boost of over 20% when not using all the cores, and some models supporting frequency boosts of more than 30%.
