Memory systems using caches have a lot more potential flexibility than most implementations are able to exploit – you get the standard behavior all the time, even if an alternative behavior would be allowable and desirable in a specific circumstance. One area in which many vendors have provided an alternative to the standard behavior is in “non-temporal” or “streaming” stores. These are sometimes also referred to as “cache-bypassing” or “non-allocating” stores, and even “Non-Globally-Ordered stores” in some cases.
The availability of streaming stores on various processors can often be tied to the improvements in performance that they provide on the STREAM benchmark, but there are several approaches to implementing this functionality, with some subtleties that may not be obvious. (The underlying mechanism of write-combining was developed long before, and for other reasons, but exposing it to the user in standard cached memory space required the motivation provided by a highly visible benchmark.)
Before getting into details, it is helpful to review the standard behavior of a “typical” cached system for store operations. Most recent systems use a “write-allocate” policy — when a store misses the cache, the cache line containing the store’s target address is read into the cache, then the parts of the line that are receiving the new data are updated. These “write-allocate” cache policies have several advantages (mostly related to implementation correctness), but if the code overwrites all the data in the cache line, then reading the cache line from memory could be considered an unnecessary performance overhead. It is this extra overhead of read transactions that makes the subject of streaming stores important for the STREAM benchmark.
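To make the overhead concrete, consider a simple copy loop of the kind used in STREAM (a minimal sketch; the array names and size are arbitrary):

```c
#include <stddef.h>

#define N (10 * 1000 * 1000)   /* large enough that the arrays do not fit in any cache */

static double a[N], b[N];

/* Every element of a[] is overwritten, so every byte of every cache line of
 * a[] is replaced.  With a write-allocate policy the hardware nonetheless
 * reads each line of a[] from memory before the stores can complete, even
 * though none of that data is ever used; this is the extra read traffic
 * that streaming stores are intended to eliminate.                          */
void copy(void)
{
    for (size_t i = 0; i < N; i++)
        a[i] = b[i];
}
```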
As in many subjects, a lot of mistakes can be avoided by looking carefully at what the various terms mean — considering both similarities and differences.
- “Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”).
  - So there is no benefit to keeping the data in the processor’s cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).
- “Streaming store” is suggestive of stores to contiguous addresses (i.e., high “spatial locality”).
- “Cache-bypassing store” says that at least some aspects of the transaction bypass the cache(s).
- “Non-allocating store” says that a store that misses in a cache will not load the corresponding cache line into the cache before performing the store.
  - This implies that there is some other sort of structure to hold the data from the stores until it is sent to memory. This may be called a “store buffer”, a “write buffer”, or a “write-combining buffer”. (A sketch of how such stores are typically requested in code appears after this list.)
- “Non-globally-ordered store” says that the results of an NGO store might appear in a different order (relative to ordinary stores or to other NGO stores) on other processors.
  - All stores must appear in program order on the processor performing the stores, but may only need to appear in the same order on other processors in some special cases, such as stores related to interprocessor communication and synchronization.
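In practice, the most direct way to request these stores explicitly (rather than hoping the compiler generates them on its own) is through compiler intrinsics. The sketch below uses the x86 SSE2 intrinsics `_mm_stream_pd()` and `_mm_sfence()`; the function name and the alignment/length assumptions are mine, not part of any standard interface:

```c
#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_loadu_pd, _mm_stream_pd, _mm_sfence */

/* Copy n doubles from src to dst using non-temporal ("streaming") stores.
 * Simplifying assumptions for this sketch: dst is 16-Byte aligned and n is
 * a multiple of 2; a real routine would handle the unaligned head/tail.   */
void copy_nt(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(&src[i]);  /* ordinary (cached) load                */
        _mm_stream_pd(&dst[i], v);          /* non-allocating, write-combining store */
    }
    _mm_sfence();  /* the streaming stores are weakly ordered, so fence before any
                      ordinary stores or synchronization that must follow them     */
}
```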
There are at least three issues raised by these terms:
- Caching: Do we want the result of the store to displace something else from the cache?
- Allocating: Can we tolerate the extra memory traffic incurred by reading the cache line before overwriting it?
- Ordering: Do the results of this store need to appear in order with respect to other stores?
Different applications may have very different requirements with respect to these issues and may be best served by different implementations. For example, if the priority is to prevent the stored data from displacing other data in the processor cache, then it may suffice to put the data in the cache, but mark it as Least-Recently-Used, so that it will displace as little useful data as possible. In the case of the STREAM benchmark, the usual priority is the question of allocation — we want to avoid reading the data before over-writing it.
While I am not quite apologizing for that emphasis, it is clear that STREAM over-represents stores in general, and store misses in particular, relative to the high-bandwidth applications that I intended it to represent.
To understand the benefit of non-temporal stores, you need to understand if you are operating in (A) a concurrency-limited regime or (B) a bandwidth-limited regime. (This is not a short story, but you can learn a lot about how real systems work from studying it.)
(Note: The performance numbers below are from the original private version of this post created on 2015-05-29. Newer systems require more concurrency — numbers will be updated when I have a few minutes to dig them up…)
(A) For a single thread you are almost always working in a concurrency-limited regime. For example, my Intel Xeon E5-2680 (Sandy Bridge EP) dual-socket systems have an idle memory latency to local memory of 79 ns and a peak DRAM bandwidth of 51.2 GB/s (per socket). If we assume that some cache miss handling buffer must be occupied for approximately one latency per transaction, then queuing theory dictates that you must maintain 79 ns * 51.2 GB/s = 4045 Bytes “in flight” at all times to “fill the memory pipeline” or “tolerate the latency”. This rounds up to 64 cache lines in flight, while a single core only supports 10 concurrent L1 Data Cache misses. In the absence of other mechanisms to move data, this limits a single thread to a read bandwidth of 10 lines * 64 Bytes/line / 79 ns = 8.1 GB/s.
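For reference, the arithmetic here is just Little’s Law (concurrency = latency * bandwidth); a small sketch using the numbers quoted above:

```c
#include <stdio.h>

int main(void)
{
    const double latency = 79e-9;    /* idle local-memory latency (s)        */
    const double peak_bw = 51.2e9;   /* peak DRAM bandwidth per socket (B/s) */
    const double line    = 64.0;     /* cache line size (Bytes)              */
    const double l1_bufs = 10.0;     /* concurrent L1D misses per core       */

    /* Little's Law: data that must be "in flight" to sustain peak bandwidth */
    double bytes_in_flight = latency * peak_bw;        /* ~4045 Bytes        */
    double lines_in_flight = bytes_in_flight / line;   /* ~63.2 -> 64 lines  */

    /* Bandwidth bound for one core limited to 10 concurrent misses */
    double core_bw = l1_bufs * line / latency;         /* ~8.1 GB/s          */

    printf("in flight: %.0f Bytes (%.1f lines)\n", bytes_in_flight, lines_in_flight);
    printf("one-core bound: %.1f GB/s\n", core_bw / 1e9);
    return 0;
}
```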
Store misses are essentially the same as load misses in this analysis.
L1 hardware prefetchers don’t help performance here because they share the same 10 L1 cache miss buffers. L2 hardware prefetchers do help because bringing the data closer reduces the occupancy required by each cache miss transaction. Unfortunately, L2 hardware prefetchers are not directly controllable, so experimentation is required. The best read bandwidth that I have been able to achieve on this system is about 17.8 GB/s, which corresponds to an “effective concurrency” of about 22 cache lines or an “effective latency” of about 36 ns (for each of the 10 concurrent L1 misses).
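The “effective” numbers above come from running the same Little’s Law relation backward from the measured bandwidth; a brief sketch of that back-calculation:

```c
#include <stdio.h>

int main(void)
{
    const double measured_bw = 17.8e9;  /* best measured single-thread read bandwidth (B/s) */
    const double latency     = 79e-9;   /* idle local-memory latency (s)                    */
    const double line        = 64.0;    /* cache line size (Bytes)                          */
    const double l1_bufs     = 10.0;    /* concurrent L1D misses per core                   */

    /* Concurrency that would be required at the full idle latency */
    double eff_lines = measured_bw * latency / line;    /* ~22 cache lines */

    /* Equivalently, the average occupancy of each of the 10 miss buffers */
    double eff_latency = l1_bufs * line / measured_bw;  /* ~36 ns          */

    printf("effective concurrency: %.1f lines\n", eff_lines);
    printf("effective latency:     %.1f ns\n", eff_latency * 1e9);
    return 0;
}
```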
L2 hardware prefetchers on this system are also able to perform prefetches for store miss streams, thus reducing the occupancy for store misses and increasing the store miss bandwidth.
For non-temporal stores there is no concept corresponding to “prefetch”, so you are stuck with whatever buffer occupancy the hardware gives you. Note that since the non-temporal store does not bring data *to* the processor, there is no reason for its occupancy to be tied to the memory latency. One would expect a cache miss buffer holding a non-temporal store to have a significantly lower occupancy than one holding a cache miss, since it is allocated, filled, and then transfers its contents out to the memory controller (which presumably has many more buffers than the 10 cache miss buffers that the core supports).
But, while non-temporal stores are expected to have a shorter buffer occupancy than that of a cache miss that goes all the way to memory, it is not at all clear whether they will have a shorter buffer occupancy than a store miss that activates an L2 hardware prefetch engine. It turns out that on the Xeon E5-2680 (Sandy Bridge EP), non-temporal stores have *higher* occupancy than store misses that activate the L2 hardware prefetcher, so using non-temporal stores slows down the performance of each of the four STREAM kernels. I don’t have all the detailed numbers in front of me, but IIRC, STREAM Triad runs at about 10 GB/s with one thread on a Xeon E5-2680 when using non-temporal stores and at between 12 and 14 GB/s when *not* using non-temporal stores.
This result is not a general rule. On the Intel Xeon E3-1270 (also a Sandy Bridge core, but with the “client” L3 and memory controller), the occupancy of non-temporal stores in the L1 cache miss buffers appears to be much shorter, so there is not an occupancy-induced slowdown. On the older AMD processors (K8 and Family 10h), non-temporal stores used a set of four “write-combining registers” that were independent of the eight buffers used for L1 data cache misses. In this case the use of non-temporal stores allowed one thread to have more concurrent memory transactions, so it almost always helped performance.
(B) STREAM is typically run using as many cores as is necessary to achieve the maximum sustainable memory bandwidth. In this bandwidth-limited regime non-temporal stores play a very different role. When using all the cores, concurrency is no longer the limiting factor, but there is a big difference in bulk memory traffic. A cached store that misses must read the target line into the L1 cache before overwriting it, while non-temporal stores avoid this (semantically unnecessary) load from memory. Thus for the STREAM Copy and Scale kernels, using cached stores results in two memory reads and one memory write, while using non-temporal stores requires only one memory read and one memory write — a 3:2 ratio. Similarly, the STREAM Add and Triad kernels move 4/3 as much data when using cached stores as when using non-temporal stores.
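Spelling out the traffic accounting behind those ratios (a sketch with schematic array names, assuming 8-Byte elements and arrays much larger than the caches):

```c
#include <stddef.h>

/* Per-iteration memory traffic for the STREAM kernels:
 *
 *                 cached (write-allocate) stores     non-temporal stores
 *  Copy, Scale:   16 B read + 8 B write = 24 B        8 B read + 8 B write = 16 B   (3:2)
 *  Add, Triad:    24 B read + 8 B write = 32 B       16 B read + 8 B write = 24 B   (4:3)
 */
void copy(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i];                 /* 1 load + 1 store (plus the allocate read if cached)  */
}

void triad(double *a, const double *b, const double *c, double s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + s * c[i];      /* 2 loads + 1 store (plus the allocate read if cached) */
}
```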
On most systems the reported performance ratios for STREAM (using all processors) with and without non-temporal stores are very close to these 3:2 and 4:3 ratios. It is also typically the case that STREAM results (again using all processors) are about the same for all four kernels when non-temporal stores are used, while (due to the differing ratios of extra read traffic) the Add and Triad kernels are typically about 1/8 faster (the ratio of 3/2 to 4/3) than the Copy and Scale kernels on systems that use cached stores.