Posted by John D. McCalpin, Ph.D. on March 10, 2011
A reader of this site asked me if I had a detailed breakdown of the components of memory latency for a modern microprocessor-based system. Since the only real data I have is confidential/proprietary and obsolete, I decided to try to build up a latency equation from memory….
It is possible to estimate pieces of the latency equation on various systems if you combined carefully controlled microbenchmarks with a detailed understanding of the cache hierarchy, the coherence protocol, and the hardware performance monitors. Being able to control the CPU, DRAM, and memory controller frequencies independently is a big help.
On the other hand, if you have not worked in the design team of a modern microprocessor it is unlikely that you will be able to anticipate all the steps that are required in making a “simple” memory access. I spent most of 12 years in design teams at SGI, IBM, and AMD, and I am pretty sure that I cannot think of all the required steps.
Memory Latency Components: Abridged
Here is a sketch of some of the components for a simple, single-chip system (my AMD Phenom II model 555), for which I quoted a pointer-chasing memory latency of 51.58 ns at 3.2 GHz with DDR3/1600 memory. I will start counting when the load instruction is issued (ignoring instruction fetch, decode, and queuing).
- The load instruction queries the (virtually addressed) L1 cache tags — this probably occurs one cycle after the load instruction executes.
Simultaneously, the virtual address is looked up in the TLB. Assuming an L1 Data TLB hit, the corresponding physical address is available ~1 cycle later and is used to check for aliasing in the L1 Data Cache (this is rare). Via sneakiness, the Opteron manages to perform both queries with only a single access to the L1 Data Cache tags.
- Once the physical address is available and it has been determined that the virtual address missed in the L1, the hardware initiates a query of the (private) L2 cache tags and the core’s Miss Address Buffers. In parallel with this, the Least Recently Used entry in the corresponding congruence class of the L1 Data Cache is selected as the “victim” and migrated to the L2 cache (unless the chosen victim entry in the L1 is in the “invalid” state or was originally loaded into the L1 Data Cache using the PrefetchNTA instruction).
- While the L2 tags are being queried, a Miss Address Buffer is allocated and a speculative query is sent to the L3 cache directory.
- Since the L3 is both larger than the L2 and shared, it’s response time will constitute the critical path. I did not measure L3 latency on the Phenom II system, but other AMD Family 10h Revision C processors have an average L3 hit latency of 48.4 CPU clock cycles. (The non-integer average is no surprise at the L3 level, since the 6 MB L3 is composed of several different blocks that almost certainly have slightly different latencies.)
I can’t think of a way to precisely determine the time required to identify an L3 miss, but estimating it as 1/2 of the L3 hit latency is probably in the right ballpark. So 24.2 clock cycles at 3.2 GHz contributes the first 7.56 ns to the latency.
- Once the L3 miss is confirmed, the processor can begin to set up a memory access. The core sends the load request to the “System Request Interface”, where the address is compared against various tables to determine where to send the request (local chip, remote chip, or I/O), so that the message can be prepended with the correct crossbar output address. This probably takes another few cycles, so we are up to about 9.0 ns.
- The load request must cross an asynchronous clock boundary on the way from the core to the memory controller, since they run at different clock frequencies. Depending on the implementation, this can add a latency of several cycles on each side of the clock boundary. An aggressive implementation might take as few as 3 cycles in the CPU clock domain plus 5 cycles in the memory controller clock domain, for a total of ~3.5 ns in the outbound direction (assuming a 3.2 GHz core clock and a 2.0 GHz NorthBridge clock).
- At this point the memory controller begins to do two things in parallel. (Either of these could constitute the critical path in the latency equation, depending on the details of the chip implementation and the system configuration.)
- probe the other caches on the chip, and
- begin to set up the DRAM access.
- For the probes, it looks like four asynchronous crossings are required (requesting core to memory controller, memory controller to other core(s), other cores to memory controller, memory controller to requesting core). (Probe responses from the various cores on each chip are gathered by the chip’s memory controller and then forwarded to the requesting core as a single message per memory controller.) Again assuming 3 cycles on the source side of the interface and 5 cycles on the destination side of the interface, these four crossings take 3.5+3.1+3.5+3.1 = 13.2 ns. Each of the other cores on the chip will take a few cycles to probe its L1 and L2caches — I will assume that this takes about 1/2 of the 15.4 cycle average L2 hit latency, so about 2.4 ns. If there is no overhead in collecting the probe response(s) from the other core(s) on the chip, this adds up to 15.6 ns from the time the System Request Interface is ready to send the request until the probe response is returned to the requesting core. Obviously the core won’t be able to process the probe response instantaneously — it will have to match the probe response with the corresponding load buffer, decide what the probe response means, and send the appropriate signal to any functional units waiting for the register that was loaded to become valid. This is probably pretty fast, especially at core frequencies, but probably kicks the overall probe response latency up to ~17ns.
- For the memory access path, there are also four asynchronous crossings required — requesting core to memory controller, memory controller to DRAM, DRAM to memory controller, and memory controller to core. I will assume 3.5 and 3.1 ns for the core to memory controller boundaries. If I assume the same 3+5 cycle latency for the asynchronous boundary at the DRAMs the numbers are quite high — 7.75 ns for the outbound path and 6.25 ns for the inbound path (assuming 2 GHz for the memory controller and 0.8 GHz for the DRAM channel).
- There is additional latency associated with the time-of-flight of the commands from the memory controller to the DRAM and of the data from the DRAM back to the memory controller on the DRAM bus. These vary with the physical details of the implementation, but typically add on the order of 1 ns in each direction.
- I did not record the CAS latency settings for my system, but CAS 9 is typical for DDR3/1600. This contributes 11.25 ns.
- On the inbound trip, the data has to cross two asynchronous boundaries, as discussed above.
- Most systems are set up to perform “critical word first” memory accesses, so the memory controller returns the 8 to 128 bits requested in the first DRAM transfer cycle (independent of where they are located in the cache line). Once this first burst of data is returned to the core clock domain, it must be matched with the corresponding load request and sent to the corresponding processor register (which then has its “valid” bit set, allowing the out-of-order instruction scheduler to pick any dependent instructions for execution in the next cycle.) In parallel with this, the critical burst and the remainder of the cache line are transferred to the previous chosen “victim” location in the L1 Data Cache and the L1 Data Cache tags are updated to mark the line as Most Recently Used. Again, it is hard to know exactly how many cycles will be required to get the data from the “edge” of the core clock domain into a valid register, but 3-5 cycles gives another 1.0-1.5 ns.
The preceding steps add up all the outbound and inbound latency components that I can think of off the top of my head.
Let’s see what they add up to:
- Core + System Request Interface: outbound: ~9 ns
- Cache Coherence Probes: (~17 ns) — smaller than the memory access path, so probably completely overlapped
- Memory Access Asynchronous interface crossings: ~21 ns
- DRAM CAS latency: 11.25 ns
- Core data forwarding: ~1.5 ns
- Total non-overlapped: ~43 ns
- Measured latency: 51.6 ns
- Unaccounted: ~9 ns = 18 memory controller clock cycles (assuming 2.0 GHz)
- I don’t know how much of the above is correct, but the match to observed latency is closer than I expected when I started….
- The inference of 18 memory controller clock cycles seems quite reasonable given all the queues that need to be checked & such.
- I have a feeling that my estimates of the asynchronous interface delays on the DRAM channels are too high, but I can’t find any good references on this topic at the moment.
Comments and corrections are always welcome. In my career I have found that a good way to learn is to try to explain something badly and have knowledgeable people correct me!