I recently saw a reference to a future Intel “Atom” core called “Tremont” and ran across an interesting new instruction, “CLDEMOTE”, that will be supported in “Future Tremont and later” microarchitectures (ref: “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference”, document 319433-035, October 2018).
The “CLDEMOTE” instruction is a “hint” to the hardware that it might help performance to move a cache line from the cache level(s) closest to the core to a cache level that is further from the core.
What might such a hint be good for? There are two obvious use cases:
- Temporal Locality Control: The cache line is expected to be re-used, but not so soon that it should remain in the closest/smallest cache.
- Cache-to-Cache Intervention Optimization: The cache line is expected to be accessed soon by a different core, and cache-to-cache interventions may be faster if the data is not in the closest level(s) of cache. (A sketch of this use case follows the list.)
  - Intel’s instruction description mentions this use case explicitly.
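To make the second use case concrete, here is a minimal sketch of a producer routine that fills one cache line and then demotes it, in the hope that a consumer on another core will find the data in a shared cache level instead of paying for a cache-to-cache intervention. The routine name and data layout are hypothetical; the _mm_cldemote() intrinsic is the standard way to emit CLDEMOTE, but it requires compiler support (e.g., gcc or clang with -mcldemote) and hardware that implements the instruction.

```c
#include <immintrin.h>
#include <stdint.h>

#define LINE_WORDS 8    /* 8 x 8-byte words = one 64-byte cache line */

/* Hypothetical producer: fill one 64-byte line, then hint that the
 * hardware may demote it away from this core's closest cache level(s).
 * CLDEMOTE is only a hint; removing it cannot change the results. */
void produce_line(uint64_t *line, uint64_t base)
{
    for (int i = 0; i < LINE_WORDS; i++)
        line[i] = base + i;
    _mm_cldemote(line); /* compile with -mcldemote (gcc/clang) */
}
```

A consumer that subsequently loads line[0..7] from another core may then be served from the shared cache rather than by an intervention from the producer's private cache.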
If you are not a “cache hint instruction” enthusiast, this may not seem like a big deal, but it actually represents a relatively important shift in instruction design philosophy.
Instructions that directly pertain to caching can be grouped into three categories:
- Mandatory Control
  - The specified cache transaction must take place to guarantee correctness.
  - E.g., in a system with some non-volatile memory, a processor must have a way to guarantee that dirty data has been written from the (volatile) caches to the non-volatile memory. The Intel CLWB instruction was added for this purpose (see https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction).
- “Direct” Hints
  - A cache transaction is requested, but it is not required for correctness.
  - The instruction definition is written in terms of specific transactions on a model architecture (with caveats that an implementation might do something different).
  - E.g., Intel’s PREFETCHW instruction requests that the cache line be loaded in anticipation of a store to the line.
    - This allows the cache line to be brought into the processor in advance of the store.
    - More importantly, it also allows the cache coherence transactions associated with obtaining exclusive access to the cache line to be started in advance of the store.
- “Indirect” Hints
  - A cache transaction is requested, but it is not required for correctness.
  - The instruction definition is written in terms of the semantics of the program, not in terms of specific cache transactions (though specific transactions might be provided as an example).
  - E.g., “Push For Sharing Instruction” (U.S. Patent 8099557) is a hint to the processor that the current process is finished working on a cache line and that another processing core in the same coherence domain is expected to access the cache line next. The hardware should move the cache line and/or modify its state to minimize the overhead that the other core will incur in accessing this line.
    - I was the lead inventor on this patent, which was filed in 2008 while I was working at AMD.
    - The patent was an attempt to generalize my earlier U.S. Patent 7194587, “Localized Cache Block Flush Instruction”, filed while I was at IBM in 2003.
  - Intel’s CLDEMOTE instruction is clearly very similar to my “Push For Sharing” instruction in both philosophy and intent. (A code sketch contrasting these categories follows this list.)
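To make the categories concrete, here is a hedged sketch contrasting mandatory control with a direct hint (the indirect-hint category is illustrated by the CLDEMOTE example earlier). The function names are hypothetical. _mm_clwb() and _mm_sfence() are the standard intrinsics for CLWB and SFENCE (compile with -mclwb); __builtin_prefetch() with its second argument set to 1 is the GCC/Clang way to request a write-anticipating prefetch, which the compiler can lower to PREFETCHW on targets that support it.

```c
#include <immintrin.h>
#include <stdint.h>

/* Mandatory control: to guarantee that a store reaches non-volatile
 * memory, the dirty line must be explicitly written back (CLWB) and
 * the write-back ordered (SFENCE). Omitting either step breaks the
 * persistence guarantee, so these are not hints. */
void persist_store(uint64_t *nvm_word, uint64_t value)
{
    *nvm_word = value;
    _mm_clwb(nvm_word); /* write the dirty line back toward memory */
    _mm_sfence();       /* order the write-back before later stores */
}

/* Direct hint: request the line in a writable (exclusive) state
 * before the store, so the coherence transaction can overlap other
 * work. Deleting the prefetch changes performance, not correctness. */
void increment_counter(uint64_t *counter)
{
    __builtin_prefetch(counter, 1, 3); /* rw=1 (write), high locality */
    /* ... independent work could overlap the coherence traffic ... */
    *counter += 1;
}
```

Note the asymmetry: the first routine is wrong without its cache instructions, while the second is merely slower.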
Even though I have contributed to several patents on cache control instructions, I have decidedly mixed feelings about the entire approach. There are several issues at play here:
- The gap between processor and memory performance continues to increase, making performance more and more sensitive to the effective use of caches.
  - This is documented in my invited talk at SC16: https://sites.utexas.edu/jdm4372/2016/11/22/sc16-invited-talk-memory-bandwidth-and-system-balance-in-hpc-systems/
- Cache hierarchies are continuing to increase in complexity, making it more difficult to understand what to do for optimal performance, even if precise control were available.
  - This has led to the near-extinction of “bottom-up” performance analysis in computing: first among customers, but also among vendors.
- The cost of designing processors continues to increase, with caches & coherence playing a significant role in the increased cost of design and validation.
  - Protocols must become more complex to deal with increasing core counts and cache sizes without sacrificing frequency and/or latency.
  - Implementations must have more complex behavior to deal with the increasing opportunities for contention (e.g., dynamically adaptive prefetching and dynamically adaptive coherence policies).
  - Implementations must become increasingly effective at power management without degrading cache performance.
  - Increasingly complex designs are more prone to “surprises”, e.g., https://sites.utexas.edu/jdm4372/2019/01/07/sc18-paper-hpl-and-dgemm-performance-variability-on-intel-xeon-platinum-8160-processors/
- Technology trends show that the power associated with data motion (including cache coherence traffic) has come to far exceed the power required by computations, and that the ratio will continue to increase.
  - This does not currently dominate costs (as discussed in the SC16 talk cited above), but that is in large part because the processors have remained expensive!
  - Decreasing processor cost will require simpler designs; this decreases the development cost that must be recovered and simultaneously reduces the barriers to entry into the market for processors (allowing more competition and innovation).
Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power. More of the same is not an answer — new thinking is required.
If one starts from current technology (rather than the technology of the late 1980s), one would naturally design architectures to address the primary challenges:
- “Vertical” movement of data (i.e., “private” data moving up and down the levels of a memory hierarchy) must be explicitly controllable.
- “Horizontal” movement of data (e.g., “shared” data used to communicate between processing elements) must be explicitly controllable.
Continuing to apply “band-aids” to the transparent caching architecture of the 1980s will not help move the industry toward the next disruptive innovation.