John McCalpin's blog

Dr. Bandwidth explains all….

Intel’s future “CLDEMOTE” instruction

Posted by John D. McCalpin, Ph.D. on 18th February 2019

I recently saw a reference to a future Intel “Atom” core called “Tremont” and ran across an interesting new instruction, “CLDEMOTE”, that will be supported in “Future Tremont and later” microarchitectures (ref: “Intel® Architecture Instruction Set Extensions and Future Features Programming Reference”, document 319433-035, October 2018).

The “CLDEMOTE” instruction is a “hint” to the hardware that it might help performance to move a cache line from the cache level(s) closest to the core to a cache level that is further from the core.

What might such a hint be good for?   There are two obvious use cases:

  • Temporal Locality Control:  The cache line is expected to be re-used, but not so soon that it should remain in the closest/smallest cache.
  • Cache-to-Cache Intervention Optimization:  The cache line is expected to be accessed soon by a different core, and cache-to-cache interventions may be faster if the data is not in the closest level(s) of cache.
    • Intel’s instruction description mentions this use case explicitly.
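
As a concrete illustration of the second use case, here is a minimal sketch of how a producer might apply the hint, assuming a compiler that exposes the _mm_cldemote() intrinsic for this (future) instruction; the function name, buffer, and loop are hypothetical:

```c
#include <immintrin.h>   /* _mm_cldemote(); needs a compiler with -mcldemote */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* After filling a buffer, hint that each line be demoted from this
 * core's L1/L2 toward a more distant shared cache level, so that a
 * consuming core's reads can hit in the shared cache rather than
 * requiring a cache-to-cache intervention.  Dropping the hint can
 * change performance, never correctness. */
void fill_and_demote(uint8_t *buf, size_t bytes)
{
    for (size_t i = 0; i < bytes; i++)
        buf[i] = (uint8_t)i;              /* ordinary stores */

    for (size_t i = 0; i < bytes; i += CACHE_LINE)
        _mm_cldemote(buf + i);            /* hint only; may be ignored */
}
```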

If you are not a “cache hint instruction” enthusiast, this may not seem like a big deal, but it actually represents a relatively important shift in instruction design philosophy.

Instructions that directly pertain to caching can be grouped into three categories:

  1. Mandatory Control
    • The specified cache transaction must take place to guarantee correctness.
    • E.g., in a system with some non-volatile memory, a processor must have a way to guarantee that dirty data has been written from the (volatile) caches to the non-volatile memory.  The Intel CLWB instruction was added for this purpose (a sketch follows this list) — see https://software.intel.com/en-us/blogs/2016/09/12/deprecate-pcommit-instruction
  2. “Direct” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of specific transactions on a model architecture (with caveats that an implementation might do something different).
    • E.g., Intel’s PREFETCHW instruction requests that the cache line be loaded in anticipation of a store to the line (a sketch follows this list).
      • This allows the cache line to be brought into the processor in advance of the store.
      • More importantly, it also allows the cache coherence transactions associated with obtaining exclusive access to the cache line to be started in advance of the store.
  3. “Indirect” Hints
    • A cache transaction is requested, but it is not required for correctness.
    • The instruction definition is written in terms of the semantics of the program, not in terms of specific cache transactions (though specific transactions might be provided as an example).
    • E.g., “Push For Sharing Instruction” (U.S. Patent 8099557) is a hint to the processor that the current process is finished working on a cache line and that another processing core in the same coherence domain is expected to access the cache line next.  The hardware should move the cache line and/or modify its state to minimize the overhead that the other core will incur in accessing this line.
      • I was the lead inventor on this patent, which was filed in 2008 while I was working at AMD.
      • The patent was an attempt to generalize my earlier U.S. Patent 7194587, “Localized Cache Block Flush Instruction”, filed while I was at IBM in 2003.
    • Intel’s CLDEMOTE instruction is clearly very similar to my “Push For Sharing” instruction in both philosophy and intent.
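
As a concrete illustration of the first category, here is a minimal sketch of forcing dirty data out to persistent memory with CLWB, assuming a compiler that provides the _mm_clwb() intrinsic; the function name and the 64-byte line size are my own choices, not from the Intel documentation:

```c
#include <immintrin.h>   /* _mm_clwb(), _mm_sfence(); needs -mclwb */
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64

/* Write back every (possibly dirty) line in [p, p+bytes), then fence.
 * Unlike a hint, this sequence is required for correctness: without
 * it there is no guarantee the data reaches the non-volatile medium. */
void persist_range(const void *p, size_t bytes)
{
    uintptr_t line = (uintptr_t)p & ~(uintptr_t)(CACHE_LINE - 1);
    uintptr_t end  = (uintptr_t)p + bytes;

    for (; line < end; line += CACHE_LINE)
        _mm_clwb((void *)line);    /* write back; may retain a clean copy */

    _mm_sfence();                  /* order write-backs before later stores */
}
```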
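
Similarly, for the second category, a sketch of a write-intent prefetch ahead of an expected store, using the GCC/Clang __builtin_prefetch() builtin (which can compile to PREFETCHW on x86 targets that support it); the linked-list traversal is a hypothetical example:

```c
#include <stddef.h>

struct node { struct node *next; long value; };

/* Prefetch the next node with write intent one step ahead of the
 * store, so the coherence transactions that obtain exclusive access
 * to the line overlap with the work on the current node. */
void increment_list(struct node *head)
{
    for (struct node *p = head; p != NULL; p = p->next) {
        if (p->next != NULL)
            __builtin_prefetch(p->next, 1, 3);  /* rw=1: write intent */
        p->value += 1;                          /* the anticipated store */
    }
}
```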


Even though I have contributed to several patents on cache control instructions, I have decidedly mixed feelings about the entire approach.  There are several issues at play here:

  • The gap between processor and memory performance continues to increase, making performance more and more sensitive to the effective use of caches.
  • Cache hierarchies are continuing to increase in complexity, making it more difficult to understand what to do for optimal performance — even if precise control were available.
    • This has led to the near-extinction of “bottom-up” performance analysis in computing — first among customers, but also among vendors.
  • The cost of designing processors continues to increase, with caches & coherence playing a significant role in the increased cost of design and validation.
  • Technology trends show that the power associated with data motion (including cache coherence traffic) has come to far exceed the power required by computations, and that the ratio will continue to increase.
    • This does not currently dominate costs (as discussed in my SC16 invited talk), but that is in large part because the processors have remained expensive!
    • Decreasing processor cost will require simpler designs — this decreases the development cost that must be recovered and simultaneously reduces the barriers to entry into the market for processors (allowing more competition and innovation).

Cache hints are only weakly effective at improving performance, but contribute to the increasing costs of design, validation, and power.  More of the same is not an answer — new thinking is required.

If one starts from current technology (rather than the technology of the late 1980’s), one would naturally design architectures to address the primary challenges:

  • “Vertical” movement of data (i.e., “private” data moving up and down the levels of a memory hierarchy) must be explicitly controllable.
  • “Horizontal” movement of data (e.g., “shared” data used to communicate between processing elements) must be explicitly controllable.

Continuing to apply “band-aids” to the transparent caching architecture of the 1980’s will not help move the industry toward the next disruptive innovation.

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture

Some notes on producer/consumer communication in cached processors

Posted by John D. McCalpin, Ph.D. on 22nd November 2016

In a recent Intel Software Developer Forum discussion (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/700477), I put together a few notes on the steps required for a single-producer, single-consumer communication using separate cache lines for “data” and “flag” values.

Although this was not a carefully-considered formal analysis, I think it is worth re-posting here as a reminder of the ridiculous complexity of implementing even the simplest communication operations in shared memory on cached systems.


I usually implement a producer/consumer code using “data” and “flag” in separate cache lines.  This enables the consumer to spin on the flag while the producer updates the data.   When the data is ready, the producer writes the flag variable.   At a low level, the steps are:
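
As a point of reference, here is a minimal sketch of this pattern using C11 atomics (the structure and names are mine, not from any particular code); alignas(64) keeps the “data” and “flag” in separate cache lines, and the release/acquire pair guarantees the data is visible before the flag:

```c
#include <stdalign.h>
#include <stdatomic.h>

struct channel {
    alignas(64) long       data;   /* payload, in its own cache line */
    alignas(64) atomic_int flag;   /* flag, in a separate cache line */
};

void producer(struct channel *ch, long value)
{
    ch->data = value;                                           /* update the data   */
    atomic_store_explicit(&ch->flag, 1, memory_order_release);  /* then set the flag */
}

long consumer(struct channel *ch)
{
    while (atomic_load_explicit(&ch->flag, memory_order_acquire) == 0)
        ;                                                       /* spin on the flag  */
    return ch->data;                                            /* data is now ready */
}
```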

  1. The producer executes a store instruction, which misses in its L1 Data Cache and L2 cache, thus generating an RFO transaction on the ring.
    • The data for the store is held in a store buffer at least until the producer core has write permission on the cache line in its own L1 Data Cache.
    • The data for the store may be held in the store buffer for longer periods, awaiting other potential stores to the same cache line.
  2. The RFO traverses the ring to find the L3 slice that owns the physical address being used, and the RFO hits in the L3.
  3. The L3 has the data, but also sees that it is marked as being in a clean state in one or more processor private caches, so it issues an invalidate on the cache line containing the flag variable.
    • The L3 may or may not have a way of tracking whether the cache line containing the flag variable is shared in another chip.
    • The L3 may have directories that track which cores may have a copy of the cache line containing the flag variable.  If there are directories, the invalidates may be targeted at the specific caches that may hold the cache line, rather than being broadcast to all cores.
    • The L3 may send the data for the cache line containing the flag to the producer’s cache at this time, but that data cannot be used until the coherence transactions are complete.
  4. The consumer receives the invalidate, invalidates its copy of the flag line, and responds to the L3 that the line has been invalidated.
    • Immediately after responding to the L3 that the line has been invalidated, the spin loop in the consumer tries to load the flag variable again.
  5. The L3 receives the invalidation acknowledgements from all the cores that may have had a shared copy of the line, and notifies the producer core that it now has write permission on the cache line containing the flag data.
    • If you are lucky, the producer core will write the flag variable from the store buffer to the L1 Data Cache immediately on receipt of permission.
    • If you are unlucky, the producing core may lose the line containing the flag variable before the store buffer is written to the L1 Data Cache.  I don’t know if Intel processors can do this, but I know some processors that can lose the line before dumping the store buffer.
  6. Very shortly after sending the write permission notification to the producer core, the L3 will receive a read request from the consumer core for the same cache line containing the flag.
    • Depending on the implementation, several different things might happen.
    • One option is for the L3 to hold the line containing the flag variable in a “transient” state while it waits for an acknowledgement from the Producer core that it has received the write permission message.  In this case the L3 will either:
      • Stall the read request from the consumer core, or
      • NACK the read request from the consumer core (i.e., tell it to try again).
    • Another option is for the L3 to immediately process the read request and send an intervention/downgrade request for the cache line containing the flag variable to the producer core’s cache.
  7. In the “lucky” case, the intervention/downgrade request generated by the read from the consumer core will get the new value of the cache line containing the flag variable and return it to the consumer core and to the L3 slice that owns that physical address.
    • Various implementations have specific ordering requirements here that determine whether the cache line must be sent to the L3 first, then to the consumer core, or whether it can be sent to both at the same time.
    • Some implementations require an extra handshaking step after the consumer core receives the data, before the L3 will give it permission to use the data.  (This is more common in the case of a store than a read.)
  8. Finally the consumer core gets the new value of the flag variable and sees that it has changed!  The data is now ready!
  9. The spin loop on the consumer core now exits, which incurs a 20-cycle mispredicted branch delay.
  10. The consumer core now executes a load instruction to get the data.  This misses in the consumer’s L1 and L2 caches and generates a read request on the ring.
  11. The read request traverses the ring to the slice that owns the physical address of the cache line containing the data (which may be a different slice than the one controlling the cache line containing the flag), and the read request hits in the L3.
  12. The data in the L3 is stale, but the L3 knows exactly which core has write permission on the cache line containing the data, so it issues an intervention/downgrade on the cache line containing the data and targeting the cache of the producer core.
  13. The cache(s) of the producer core receive the intervention/downgrade request and return the new value of the cache line containing the data variable to the L3, simultaneously downgrading the permissions on the line so that it is now “read-only” in the producer’s caches.
  14. As was the case for the cache line containing the flag variable, the cache line containing the data variable makes its way to both the L3 slice that owns the physical address and the consumer core that requested the data.
  15. The cache line containing the data arrives in the consumer core’s cache, and the load instruction is allowed to complete.
  16. Once the consumer core has gotten the data safely into a register, it typically has to re-write the flag variable to let the producer core know that the value has been consumed and that the producer core is free to write to the cache line containing the data variable again.
    • This requires the consumer to make an “upgrade” request on the cache line containing the flag, so it can write to it.   This is similar to the sequence above, but since the consumer already has the data, it does not need the L3 to send it again — it only needs to wait until all other cores have acknowledged the invalidate before it can write to the flag line.
    • Double-buffering can be used to avoid this extra transaction — if the consumer uses a different set of addresses to send data back to the producer, then the fact that the producer has received another message from the consumer means that the consumer must have finished using the original data buffer, so it is safe for the producer to use it again.
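
Here is a minimal sketch of that double-buffering idea, reusing the channel structure from the sketch above (the sequence-number protocol is my own illustration): the producer treats the arrival of reply i on the return channel as proof that request i has been consumed, so no extra flag hand-off is needed:

```c
/* Request/reply ping-pong over two one-way channels.  Seeing reply i
 * on 'rsp' proves the consumer is finished with request i, so the
 * producer may overwrite req->data with no extra flag transaction. */
void producer_loop(struct channel *req, struct channel *rsp, int n)
{
    for (int i = 1; i <= n; i++) {
        req->data = i;                                               /* new payload  */
        atomic_store_explicit(&req->flag, i, memory_order_release);  /* publish i    */
        while (atomic_load_explicit(&rsp->flag, memory_order_acquire) < i)
            ;                            /* reply i implies request i was consumed */
    }
}

void consumer_loop(struct channel *req, struct channel *rsp, int n)
{
    for (int i = 1; i <= n; i++) {
        while (atomic_load_explicit(&req->flag, memory_order_acquire) < i)
            ;                                                        /* wait for i   */
        rsp->data = req->data + 1;                                   /* use the data */
        atomic_store_explicit(&rsp->flag, i, memory_order_release);  /* reply i      */
    }
}
```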

There are many variants and many details that can be non-intuitive in an actual implementation.  These often involve extra round trips required to ensure ordering in ugly corner cases.  A common example is maintaining global ordering across stores that are handled by different coherence controllers.  These can be different L3 slices (and/or Home Agents) within a single package, or, in the more difficult case, controllers in independent packages handling stores that alternate between them.

There are fewer steps in the case where the “data” and “flag” are in the same cache line, but extra care needs to be taken in that case because it is easier for the polling activity of the consumer to take the cache line away from the producer before it has finished doing the updates to the data part of the cache line.  This can result in more performance variability and reduced total performance, especially in cases with multiple producers and multiple consumers (with locks, etc.).
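
For comparison, a sketch of the same-cache-line layout just described (again hypothetical, reusing the headers from the sketches above):

```c
/* "data" and "flag" share one cache line: a single RFO can carry
 * both, but each poll by the consumer can take the line away from
 * the producer before it has finished updating the data portion. */
struct packed_channel {
    alignas(64) long data;
    atomic_int       flag;   /* lands in the same 64-byte line as data */
};
```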

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture