Coherence with Cached Memory-Mapped IO

In response to my previous blog entry, a question was asked about how to manage coherence for cached memory-mapped IO regions. Here are some more details…

Maintaining Coherence with Cached Memory-Mapped IO

For the “read-only” range, cached copies of MMIO lines will never be invalidated by external traffic, so repeated reads of the data will always return the cached copy. Since there are no external mechanisms to invalidate the cache line, we need a mechanism that the processor can use to invalidate the line, so the next load to that line will go to the IO device and get fresh data.

There are a number of ways that a processor should be able to invalidate a cached MMIO line. Not all of these will work on all implementations!

Cached copies of MMIO addresses can, of course, be dropped when they become LRU and are chosen as the victim to be replaced by a new line brought into the cache.
A code could read enough conflicting cacheable addresses to ensure that the cached MMIO line would be evicted.
The number is typically 8 for a 32 KiB data cache, but you need to be careful that the reads have not been rearranged to put the cached MMIO read in the middle of the “flushing” reads. There are also some systems for which the pseudo-LRU algorithm has “features” that can break this approach. (HyperThreading and shared caches can both add complexity in this dimension.)
The CLFLUSH instruction operating on the virtual address of the cached MMIO line should evict it from the L1 and L2 caches.
Whether it will evict the line from the L3 depends on the implementation, and I don’t have enough information to speculate on whether this will work on Xeon processors. For AMD Family 10h processors, due to the limitations of the CLFLUSH implementation, cached MMIO lines are only allowed in the L1 cache.
For memory mapped my the MTRRs as WP (“Write Protect”), a store to the address of the cached MMIO line should invalidate that line from the L1 & L2 data caches. This will generate an *uncached* store, which typically stalls the processor for quite a while, so it is not a preferred solution.
The WBINVD instruction (kernel mode only) will invalidate the *entire* processor data cache structure and according to the Intel Architecture Software Developer’s Guide, Volume 2 (document 325338-044), will also cause all external caches to be flushed. Additional details are discussed in the SW Developer’s Guide, Volume 3. Additional caution needs to be taken if running with HyperThreading enabled, as mentioned in the discussion of the CPUID instruction in the SW Developer’s Guide, Vol 2.
The INVD instruction (kernel mode only) will invalidate all the processor caches, but it does this non-coherently (i.e., dirty cache lines are not written back to memory, so any modified data gets lost). This is very likely to crash your system, and is only mentioned here for completeness.
AMD processors support some extensions to the MTRR mechanism that allow read and write operations to the same physical address to be sent to different places (i.e., one to system memory and the other to MMIO). This is *almost* useful for supporting cached MMIO, but (at least on the Family 10h processors), the specific mode that I wanted to set up (see addendum below) is disallowed for ugly microarchitectural reasons that I can’t discuss.

There are likely to be more complexities that I am not remembering right now, but the preferred answer is to bind the process doing the cached MMIO to a single core (and single thread context if using HyperThreading) and use CLFLUSH on the address you want to invalidate. There are no guarantees, but this seems like the approach most likely to work.

Addendum: The AMD almost-solution using MTRR extensions.

The AMD64 architecture provides extensions to the MTRR mechanism called IORRs that allow the system programmer to independently specify whether reads to a certain region go to system memory or MMIO and whether writes to that region go to system memory or MMIO. This is discussed in the “AMD64 Architecture Programmers Manual, Volume 2: System Programming” (publication number 24593).
I am using version 3.22 from September 2012, where this is described in section 7.9.

The idea was to use this to modify the behavior of the “read-only” MMIO mapping so that reads would go to MMIO while writes would go to system memory. At first glance this seems strange — I would be creating a “write-only” region of system memory that could never be read (because reads to that address range would go to MMIO).

So why would this help?

It would help because sending the writes to system memory would cause the cache coherence mechanisms to be activated. A streaming store (for example) to this region would be sent to the memory controller for that physical address range. The memory controller treats streaming stores in the same way as DMA stores from IO devices to system memory, and it sends out invalidate messages to all caches in the system. This would invalidate the cached MMIO line in all caches, which would eliminate both the need to pin the thread to a specific core and the problem of the CLFLUSH not reaching the L3 cache.

At least in the AMD Family 10h processors, this IORR function works, but due to some implementation issues in this particular use case it forces the region to the MTRR UC (uncached) type, which defeats my purpose in the exercise. I think that the implementation issues could be either fixed or worked around, but since this is a fix to a mode that is not entirely supported, it is easy to understand that this never showed up as a high priority to “fix”.