One of the (many) minimally documented features of recent Intel processor implementations is the “memory directory”. This is used in multi-socket systems to reduce cache coherence traffic between sockets.
I have referred to this in various presentations as:
“A Memory Directory is one or more bits per cache line in DRAM that tell the processor whether another socket might have a dirty copy of the cache line.”
When asked for references to public information about Intel's memory directory implementation, I had to go back and find the various tiny bits of information scattered around different places. That was boring, so I am posting my notes here so I can find them again — maybe others will find this useful as well….
Existence:
The clearest admission that the memory directory feature exists is in this “technical overview” presentation – in the section “Directory-Based Coherency” – https://www.intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-technical-overview.html. The language is not as precise as I would prefer, but there are actually quite a few interesting details in that section.
History of Implementations:
Intel has disclosed various details in presentations at the Hot Chips series of conferences:
- The Intel Hot Chips presentation on Westmere-EX describes an early implementation of the memory directory (slides 9-12).
- The Intel Hot Chips presentation on Ivy Bridge EP describes the memory directory feature and says that it was upgraded from one bit to two bits in the IVB-EP processor – describing the 3 states that are still used.
(I apologize in advance if the links are broken — it is hard to keep up with web site reorganizations — but the presentations should be relatively easy to find using web search services.)
Implicit Information:
Much of the detailed understanding of the memory directory implementation comes from studying the uncore performance counter events that make reference to memory directories.
As an example, there are direct references to memory directories in several sections of the Intel Uncore Performance Monitoring Guides for their various processor families. These documents are linked from: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html
The version that I have spent the most time with is for the 1st and 2nd generations of Intel Xeon Scalable Processors (“Skylake Xeon” and “Cascade Lake Xeon”): “Intel® Xeon® Processor Scalable Memory Family Uncore Performance Monitoring Reference Manual”, Intel document 336274.
Specific references in this manual include (a sketch of one way to read these uncore counters under Linux follows this list):
- CHA Events “DIR_LOOKUP” and “DIR_UPDATE”
- IMC Event “IMC_WRITES_COUNT”
  - The event description includes: “NOTE: Directory bits are stored in memory. Remote socket RFOs will result in a directory update which, in turn, will cause a write command.”
  - Aside: it would be easy to be misled by this description. Remote socket RFOs will result in a directory update, but so will ordinary reads that return data in E state (the default for data reads that hit unshared lines).
- M2M Events “DIRECTORY_HIT”, “DIRECTORY_LOOKUP”, “DIRECTORY_MISS”, “DIRECTORY_UPDATE”
  - These include information on directory states – very helpful for understanding the implementation.
- Section 3.1.3 “Reference for M2M Packet Matching”, Table 3-9 “SMI3 Opcodes”, mentions the directory in the description of 10 of the 18 transaction types.
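For completeness, the sketch below shows one way to read a single uncore CHA counter under Linux using the perf_event_open() system call. This is my own illustration, not anything from the Intel documentation: the sysfs path assumes a Linux kernel with the Intel uncore PMU driver loaded, the event/umask values are placeholders that must be filled in from the uncore manual (and cross-checked against the PMU's format/ directory in sysfs), and uncore counting generally requires elevated privileges.
```c
/*
 * Minimal sketch (my own illustration, not from the uncore manual) of reading
 * one uncore CHA counter under Linux via perf_event_open().  The event and
 * umask encodings below are PLACEHOLDERS -- the real values must come from the
 * Uncore Performance Monitoring Reference Manual, cross-checked against
 * /sys/bus/event_source/devices/uncore_cha_0/format/.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    /* Each CHA is exposed as its own PMU; this example uses only the first. */
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_cha_0/type", "r");
    int pmu_type;
    if (!f || fscanf(f, "%d", &pmu_type) != 1) {
        fprintf(stderr, "uncore_cha_0 PMU not found\n");
        return 1;
    }
    fclose(f);

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = pmu_type;
    /* config = event | (umask << 8) for Intel uncore PMUs; both placeholders. */
    attr.config = 0x0 | (0x0 << 8);

    /* Uncore events are socket-wide: pid = -1, cpu = any core on the socket. */
    int fd = perf_event_open(&attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    /* ... run the code being measured here ... */

    long long count;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("uncore_cha_0 count: %lld\n", count);
    close(fd);
    return 0;
}
```
In practice each CHA (uncore_cha_0, uncore_cha_1, …) is a separate PMU, so a real measurement has to open one event per CHA and sum the counts.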
It is helpful to compare and contrast the documentation from all of the recent processor generations. The specific items disclosed in each generation are not the same, and sometimes a generation will have a more verbose explanation that fills in gaps for the other generations.
Reading the documentation is seldom enough to understand the implementation — I typically have to create customized microbenchmarks that generate known counts of transactions of various types and then compare the measured performance counter values with the values I expected to generate. When there are significant differences, more thinking is required. Thinking does not always help — sometimes the performance counter event is just broken, sometimes there is just not enough information disclosed to derive reasonable bounds on the implementation, and sometimes the implementation has strongly dynamic/adaptive behavior based on activity metrics that are not visible or not controllable using available HW features.
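As a concrete illustration of the kind of microbenchmark I mean, here is a minimal sketch (my own, not one of my actual test codes) that tries to generate a predictable number of remote-socket RFOs on a 2-socket Linux system by writing one byte per cache line into a buffer bound to the other socket with libnuma. Assuming the thread is pinned to socket 0 (e.g., with numactl --cpunodebind=0) and the buffer is much larger than the caches, each cache line written should produce roughly one remote RFO at the home (socket 1), and therefore roughly one directory update there; those expected numbers can then be compared against the DIR_UPDATE and IMC write counts described above.
```c
/*
 * Minimal sketch of a microbenchmark with a known "expected" transaction count.
 * Assumes a 2-socket Linux system with libnuma installed.
 * Build:  gcc -O1 -o remote_rfo remote_rfo.c -lnuma
 * Run pinned to socket 0, e.g.:  numactl --cpunodebind=0 ./remote_rfo
 */
#include <stdio.h>
#include <numa.h>

#define CACHE_LINE 64UL
#define BUF_SIZE   (256UL * 1024 * 1024)   /* much larger than the caches */

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need libnuma and at least two NUMA nodes\n");
        return 1;
    }

    /* Bind the buffer to node 1 while the thread runs on node 0, so that
       every store misses to memory homed on the other socket. */
    char *buf = numa_alloc_onnode(BUF_SIZE, 1);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    /* One store per cache line: each should generate a remote-socket RFO at
       the home (socket 1), and therefore roughly one directory update and
       one extra IMC write there. */
    for (unsigned long i = 0; i < BUF_SIZE; i += CACHE_LINE)
        buf[i] = (char)i;

    printf("expected remote RFOs / directory updates: about %lu\n",
           BUF_SIZE / CACHE_LINE);

    numa_free(buf, BUF_SIZE);
    return 0;
}
```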
OEM Vendor Disclosures:
I have run across a few other bits of information scattered around the interwebs:
- https://sp.ts.fujitsu.com/dmsp/Publications/public/wp-skylake-RXTX-bios-settings-primergy-ww-en.pdf
  - In its description of the “Stale AtoS” BIOS option, this document describes the 3 states of the in-memory directory (I, A, S) and the transactions that are enabled by this option. Very helpful. (A sketch of how I think about the three states follows this list.)
- https://lenovopress.lenovo.com/lp1477.pdf
  - Includes a discussion of the Stale AtoS feature (with more words than the Fujitsu document).
  - The section “Snoop Preference (Processor Snoop Mode)” discusses the differences between “Home Snoop” and “Home Snoop with Directory Lookup + Opportunistic Snoop Broadcast (OSB) + HitME cache”.
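Putting the pieces from these documents together, here is the mental model I use for the three in-memory directory states. The names and comments below are my own paraphrase of the Fujitsu and Lenovo descriptions, not Intel's actual bit encoding (which has not been disclosed):
```c
/*
 * A mental model only: my own names and comments, paraphrasing the Fujitsu and
 * Lenovo descriptions of the three in-memory directory states.
 */
enum mem_dir_state {
    DIR_I,  /* no remote socket has a copy: the home can return data from DRAM
               without snooping the other socket(s)                            */
    DIR_S,  /* remote socket(s) may hold clean shared copies: reads need no
               snoop, but RFOs/writes must invalidate the remote copies        */
    DIR_A   /* a remote socket may hold the line exclusively (possibly dirty):
               the home must snoop before returning data; "Stale AtoS" allows a
               read that finds A but gets a clean snoop response to downgrade
               the entry to S                                                  */
};
```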
Encoding the Memory Directory bits:
In various presentations I have said that the “one or more” memory directory bits were “hidden” in the ECC bits. This is easy to do – standard 64-bit DRAM interfaces with ECC provide 72 bits (64 data + 8 ECC) with every “64-bit” data transfer and generate 8 contiguous transfers per transaction. SECDED over 64 bits of data requires 8 bits of ECC on every transfer, but computing a single SECDED code over P transfers requires only about log2(P) additional check bits beyond those 8, while the interface provides 8*(P-1) extra ECC bits over those same P transfers, leaving plenty of room for metadata such as the directory state. Intel used this aggregation approach very aggressively on the Knights Corner accelerator to store the SECDED ECC bits for 64-Byte cache lines in regular memory – reserving every 32nd cache line for that purpose.
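To make the arithmetic concrete for the usual case of P=8 (a 64-Byte cache line delivered as 8 transfers of 72 bits): the interface delivers 576 bits per line, of which 512 are data and 64 are nominally ECC, while a single SECDED code over the 512 data bits needs only 8 + log2(8) = 11 check bits, so on the order of 50 bits per cache line are left over for other metadata such as the directory state. (The exact number depends on the actual code Intel uses, which has not been disclosed.)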
Are there any other important references that I am missing?