When attempting to build heterogeneous computers with “accelerators” or “coprocessors” on PCIe interfaces, one quickly runs into asymmetries between the data transfer capabilities of processors and IO devices. These asymmetries are often surprising — the tremendously complex processor is actually less capable of generating precisely controlled high-performance IO transactions than the simpler IO device. This leads to ugly, high-latency implementations in which the processor has to program the IO unit to perform the required DMA transfers, after which the IO unit interrupts the processor to signal that the transfers are complete.
For tightly-coupled acceleration, it would be nice to have the option of having the processor directly read and write to memory locations on the IO device. The fundamental capability exists in all modern processors through the feature called “Memory-Mapped IO” (MMIO), but for historical reasons this provides the desired functionality without the desired performance. As discussed below, it is generally possible to set up an MMIO mapping that allows high-performance writes to IO space, but setting up mappings that allow high-performance reads from IO space is much more problematic.
Processors only support high-performance reads when executing loads to cached address ranges. Such reads transfer data in cache-line-sized blocks (64 Bytes on x86 architectures) and can support multiple concurrent read transactions for high throughput. When executing loads to uncached address ranges (such as MMIO ranges), each read fetches only the specific bytes requested (1, 2, 4, or 8 Bytes), and all reads to uncached address ranges are completely serialized with respect to each other and with respect to any other memory references. So even if the latency to the IO device were the same as the latency to memory, using cache-line accesses could easily be (for example) 64 times as fast as using uncached accesses — 8 concurrent transfers of 64 Bytes using cache-line accesses versus one serialized transfer of 8 Bytes.
But is it possible to get modern processors to use their cache-line access mechanisms to read data from MMIO addresses? The answer is a resounding “yes, but…”. The notes below provide an introduction to some of the issues….
It is possible to map IO devices to cacheable memory on at least some processors, but the accesses have to be very carefully controlled to keep within the capabilities of the hardware — some of the transactions to cacheable memory can map to IO transactions and some cannot.
I don’t know the details for Intel processors, but I did go through all the combinations in great detail as the technology lead of the “Torrenza” project at AMD.
Speaking generically, some examples of things that should and should not work (though the details will depend on the implementation):
- Load miss — generates a cache line read — converted to a 64 Byte IO read — works OK.
BUT, there is no way for the IO device to invalidate that line in the processor(s) cache(s), so coherence must be maintained manually using the CLFLUSH instruction. NOTE also that the CLFLUSH instruction may or may not work as expected when applied to addresses that are mapped to MMIO, since the coherence engines are typically associated with the memory controllers, not the IO controllers. At the very least you will need to pin threads doing cached MMIO to a single core to maximize the chances that the CLFLUSH instructions will actually clear the (potentially stale) copies of the cache lines mapped to the MMIO range. (A brief code sketch of this pattern appears after this list.)
- Streaming Store (aka Write-Combining store, aka Non-temporal store) — generates one or more uncached stores — works OK.
This is the only mode that is “officially” supported for MMIO ranges by x86 and x86-64 processors. It was added in the olden days to allow a processor core to execute high-speed stores into a graphics frame buffer (i.e., before there was a separate graphics processor). These stores do not use the caches, but do allow you to write to the MMIO range using full cache line writes and (typically) allows multiple concurrent stores in flight.
The Linux “ioremap_wc()” function maps a region so that all stores are translated to streaming stores, but because the hardware allows this, it is typically possible to explicitly generate streaming stores (MOVNTA instructions) for MMIO regions that are mapped as cached.
- Store Miss (aka “Read For Ownership”/RFO) — generates a request for exclusive access to a cache line — probably won’t work.
The reason that it probably won’t work is that RFO requires that the line be invalidated in all the other caches, with the requesting core not allowed to use the data until it receives acknowledgements from all the other cores that the line has been invalidated — but an IO controller is not a coherence controller, so it (typically) cannot generate the required probe/snoop transactions.
It is possible to imagine implementations that would convert this transaction to an ordinary 64 Byte IO read, but then some component of the system would have to “remember” that this translation took place and would have to lie to the core and tell it that all the other cores had responded with invalidate acknowledgements, so that the core could place the line in “M” state and have permission to write to it.
- Victim Writeback — writes back a dirty line from cache to memory — probably won’t work.
Assuming that you could get past the problems with the “store miss” and get the line in “M” state in the cache, eventually the cache will need to evict the dirty line. Although this superficially resembles a 64 Byte store, from the coherence perspective it is quite a different transaction. A Victim Writeback actually has no coherence implications — all of the coherence was handled by the RFO up front, and the Victim Writeback is just the delayed completion of that operation. Again, it is possible to imagine an implementation that simply mapped the Victim Writeback to a 64 Byte IO store, but when you get into the details there are features that just don’t fit. I don’t know of any processor implementation for which a mapping of Victim Writeback operations to MMIO space is supported.
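To make the cases above concrete, here is a minimal sketch in C (using x86 intrinsics) of the instruction sequences involved. The pointer name and the assumption that the region has already been mapped with a cacheable type are mine, not a description of any particular implementation:

```c
#include <immintrin.h>   /* _mm_clflush, _mm_mfence, _mm_sfence, _mm_stream_si128 */
#include <stdint.h>

/* 'mmio' is a hypothetical pointer into an MMIO region that has been mapped
   with a cacheable (WT/WP) type as discussed above. */
void transaction_examples(volatile uint64_t *mmio)
{
    /* Load miss: an ordinary load through a cached mapping generates a
       64-Byte cache-line read, but coherence is manual -- CLFLUSH the line
       (and fence) before re-reading so the load actually goes to the device. */
    _mm_clflush((const void *) mmio);
    _mm_mfence();
    uint64_t x = mmio[0];

    /* Streaming (write-combining / non-temporal) store: the only mode that
       is "officially" supported for MMIO, and one that can be issued
       explicitly even when the region is mapped as cached. */
    _mm_stream_si128((__m128i *)(uintptr_t) mmio, _mm_set1_epi64x((long long) x));
    _mm_sfence();

    /* Store miss (RFO) and the later victim writeback would be generated by
       an ordinary store such as "mmio[0] = x;" -- as discussed above, these
       probably will not work, so they are deliberately not issued here. */
}
```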
There is one set of mappings that can be made to work on at least some x86-64 processors, and it is based on mapping the MMIO space *twice*, with one mapping used only for reads and the other mapping used only for writes:
- Map the MMIO range with a set of attributes that allow write-combining stores (but only uncached reads). This mode is supported by x86-64 processors and is provided by the Linux “ioremap_wc()” kernel function, which generates an MTRR (“Memory Type Range Register”) of “WC” (write-combining). In this case all stores are converted to write-combining stores, but the use of explicit write-combining store instructions (MOVNTA and its relatives) makes the usage more clear.
- Map the MMIO range a second time with a set of attributes that allow cache-line reads (but only uncached, non-write-combined stores).
For x86 & x86-64 processors, the MTRR type(s) that allow this are “Write-Through” (WT) and “Write-Protect” (WP).
These might be mapped to the same behavior internally, but the nominal difference is that in WT mode stores *update* the corresponding line if it happens to be in the cache, while in WP mode stores *invalidate* the corresponding line if it happens to be in the cache. In our current application it does not matter, since we will not be executing any stores to this region. On the other hand, we will need to execute CLFLUSH operations to this region, since that is the only way to ensure that (potentially) stale cache lines are removed from the cache and that the subsequent read operation to a line actually goes to the MMIO-mapped device and reads fresh data.
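Putting the two mappings together, user-space access routines might look something like the following sketch. The pointer names, offsets, and alignment assumptions are mine; the mappings themselves have to be provided by a driver as described below:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define LINE 64

/* wr_base: hypothetical pointer into the WC-mapped ("write") alias.
   rd_base: hypothetical pointer into the WT/WP-mapped ("read") alias.
   Both alias the same device memory; offsets are in bytes. */

static void dev_write_line(void *wr_base, size_t off, const void *src)
{
    __m128i *dst = (__m128i *)((char *) wr_base + off);
    const __m128i *s = (const __m128i *) src;      /* assumed 16-Byte aligned      */
    for (int i = 0; i < LINE / 16; i++)
        _mm_stream_si128(&dst[i], s[i]);           /* non-temporal stores          */
    _mm_sfence();                                  /* drain the WC buffer in order */
}

static void dev_read_line(const void *rd_base, size_t off, void *dst)
{
    const char *line = (const char *) rd_base + off;
    _mm_clflush(line);                             /* discard any stale copy       */
    _mm_mfence();                                  /* order flush before reload    */
    memcpy(dst, line, LINE);                       /* load miss -> one 64-Byte read from the device */
}
```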
On the particular device that I am fiddling with now, the *device* exports two address ranges using the PCIe BAR functionality. These both map to the same memory locations on the device, but each BAR is mapped to a different *physical* address by the Linux kernel. The different *physical* addresses allow the MTRRs to be set differently (WC for the write range and WT/WP for the read range). These are also mapped to different *virtual* addresses so that the PATs can be set up with values that are consistent with the MTRRs.
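A hedged kernel-side sketch of that setup might look like the following. The BAR numbers are made up for illustration, ioremap_wt() is assumed to be available (on older kernels the WT/WP type may have to be set up by hand), and error handling and the matching iounmap calls are omitted:

```c
#include <linux/pci.h>
#include <linux/io.h>
#include <linux/errno.h>

/* Hypothetical example: BAR 2 and BAR 4 both expose the same device memory.
 * BAR 2 becomes the write-combining "write" alias and BAR 4 the
 * write-through "read" alias. */
static void __iomem *wr_alias;   /* stores only, via streaming stores          */
static void __iomem *rd_alias;   /* reads only, via cached loads plus CLFLUSH  */

static int map_device_aliases(struct pci_dev *pdev)
{
    wr_alias = ioremap_wc(pci_resource_start(pdev, 2), pci_resource_len(pdev, 2));
    rd_alias = ioremap_wt(pci_resource_start(pdev, 4), pci_resource_len(pdev, 4));
    return (wr_alias && rd_alias) ? 0 : -ENOMEM;
}
```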
Because the IO device has no way to generate transactions to invalidate copies of MMIO-mapped addresses in processor caches, it is the responsibility of the software to ensure that cache lines in the “read” region are invalidated (using the CLFLUSH instruction on x86) if the data is updated either by the IO device or by writes to the corresponding (aliased) address in the “write” region. This software-based coherence functionality can be implemented at many different levels of complexity, for example:
- For some applications the data access patterns are based on clear “phases”, so in a “phase” you can leave the data in the cache and simply invalidate the entire block of cached MMIO addresses at the end of the “phase”.
- If you expect only a small fraction of the MMIO addresses to actually be updated during a phase, this approach is overly conservative and will lead to excessive read traffic. In such a case, a simple “directory-based coherence” mechanism can be used. The IO device can keep a bit map of the cache-line-sized addresses that are modified during a “phase”. The processor can read this bit map (presumably packed into a single cache line by the IO device) and only invalidate the specific cache lines that the directory indicates have been updated. Lines that have not been updated are still valid, so copies that stay in the processor cache will be safe to use.
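A sketch of what that directory-based invalidation might look like on the processor side follows. The bitmap layout, offsets, and function name are assumptions about a hypothetical device, not a description of any particular one:

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

#define LINE 64

/* Hypothetical layout: the device packs one "modified" bit per 64-Byte line
 * of an nlines-long region into a single 64-Byte bitmap line at bitmap_off.
 * rd_alias is the WT/WP-mapped read alias described above. */
static void invalidate_updated_lines(const char *rd_alias, size_t bitmap_off,
                                     size_t region_off, unsigned nlines /* <= 512 */)
{
    uint64_t bits[LINE / sizeof(uint64_t)];

    /* Re-read the directory line itself: flush, fence, then load fresh data. */
    _mm_clflush(rd_alias + bitmap_off);
    _mm_mfence();
    memcpy(bits, rd_alias + bitmap_off, LINE);

    /* Flush only the lines the device marked as modified during the phase;
     * unmarked lines stay valid in the cache and can be reused directly. */
    for (unsigned i = 0; i < nlines; i++)
        if (bits[i / 64] & (1ULL << (i % 64)))
            _mm_clflush(rd_alias + region_off + (size_t) i * LINE);
    _mm_mfence();
}
```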
Giving the processor the capability of reading from an IO device at low latency and high throughput allows a designer to think about interacting with the device in new ways, and should open up new possibilities for fine-grained off-loading in heterogeneous systems….