In a recent Intel Software Developer Forum discussion (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/700477), I put together a few notes on the steps required for a single-producer, single-consumer communication using separate cache lines for “data” and “flag” values.
Although this was not a carefully-considered formal analysis, I think it is worth re-posting here as a reminder of the ridiculous complexity of implementing even the simplest communication operations in shared memory on cached systems.
I usually implement a producer/consumer code using “data” and “flag” in separate cache lines. This enables the consumer to spin on the flag while the producer updates the data. When the data is ready, the producer writes the flag variable. At a low level, the steps are:
There are many variants and many details that can be non-intuitive in an actual implementation. These often involve extra round trips required to ensure ordering in ugly corner cases. A common example is maintaining global ordering across stores that are handled by different coherence controllers. These controllers can be different L3 slices (and/or Home Agents) in a single package, or, in the more difficult case, stores can alternate across independent packages.
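As a rough illustration of the separate-cache-line arrangement (this is not the code from the forum discussion), a minimal C++ sketch might look like the following. The 64-byte line size, the names `Channel` and `run_separate_lines`, and the use of C++11 acquire/release atomics are all assumptions made for the example. The sketch shows only the software-visible protocol; the coherence transactions underneath are invisible at this level.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Hypothetical layout: payload and flag each start on their own cache line.
struct Channel {
    alignas(kCacheLine) std::uint64_t data[8];               // payload, one full line
    alignas(kCacheLine) std::atomic<std::uint32_t> flag{0};  // flag, separate line
};

// One producer/consumer hand-off; returns the sum the consumer computed.
std::uint64_t run_separate_lines() {
    Channel ch{};
    std::uint64_t sum = 0;

    std::thread consumer([&] {
        // Spin on the flag line only; the data line is not touched while waiting.
        while (ch.flag.load(std::memory_order_acquire) == 0) { }
        // The acquire load pairs with the producer's release store,
        // so the payload writes are visible here.
        for (std::uint64_t v : ch.data) sum += v;
    });

    std::thread producer([&] {
        // Update the payload first...
        for (std::uint64_t i = 0; i < 8; ++i) ch.data[i] = i + 1;
        // ...then publish it by writing the flag.
        ch.flag.store(1, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return sum;
}
```

Calling `run_separate_lines()` from `main` performs a single hand-off. A real implementation would also need a way for the consumer to reset the flag and signal that the buffer can be reused, which is one of the places where the extra round trips mentioned above come in.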
There are fewer steps in the case where the “data” and “flag” are in the same cache line, but extra care needs to be taken in that case because the consumer's polling can more easily take the cache line away from the producer before it has finished updating the data part of the cache line. This can result in more performance variability and reduced total performance, especially in cases with multiple producers and multiple consumers (with locks, etc.).
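For comparison, the same-cache-line variant can be sketched as below, again only as an illustration under assumptions: a 64-byte line, the hypothetical names `PackedLine` and `run_same_line`, and a `std::this_thread::yield()` between polls as one simple way to make the consumer poll less aggressively. The yield is my own illustrative choice, not a tuned or recommended mitigation.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical layout: seven payload words and the flag share one 64-byte line.
struct alignas(64) PackedLine {
    std::uint64_t data[7];
    std::atomic<std::uint64_t> flag{0};
};
static_assert(sizeof(PackedLine) == 64, "payload and flag must fit in one line");

// One producer/consumer hand-off; returns the sum the consumer computed.
std::uint64_t run_same_line() {
    PackedLine line{};
    std::uint64_t sum = 0;

    std::thread consumer([&] {
        // Every poll reads the line the producer is writing, so each poll
        // can pull the line away from the producer mid-update.
        while (line.flag.load(std::memory_order_acquire) == 0) {
            std::this_thread::yield();  // illustrative back-off between polls
        }
        for (std::uint64_t v : line.data) sum += v;
    });

    std::thread producer([&] {
        // These stores may need to re-acquire ownership of the line
        // whenever the consumer's polling has taken it away.
        for (std::uint64_t i = 0; i < 7; ++i) line.data[i] = i + 1;
        line.flag.store(1, std::memory_order_release);
    });

    producer.join();
    consumer.join();
    return sum;
}
```

The result is correct either way; the difference from the separate-line layout is in how often ownership of the line moves between cores while the producer is still writing, which shows up as timing variability and not as wrong answers.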