John McCalpin's blog

Dr. Bandwidth explains all….

Archive for November, 2016

SC16 Invited Talk: Memory Bandwidth and System Balance in HPC Systems

Posted by John D. McCalpin, Ph.D. on 22nd November 2016

Slide01

I have been involved in HPC for over 30 years:

  • 12 years as student & faculty user in ocean modeling,
  • 12 years as a performance analyst and system architect at SGI, IBM, and AMD, and
  • over 7 years as a research scientist at TACC.

Slide02

Slide03
  • This history is based on my own study of the market, with many of the specific details from my own re-analysis of the systems in the TOP500 lists.

Slide04
  • Vector systems were in decline by the time the first TOP500 list was collected in 1993, but still dominated the large systems space in the early 1990’s.
    • The large bump in Rmax in 2002 was due to the introduction of the “Earth Simulator” in Japan.
    • The last vector system (2nd gen Earth Simulator) fell off the list in 2014.
  • RISC SMPs and Clusters dominated the installed base in the second half of the 1990’s and the first few years of the 2000’s.
    • The large bump in Rmax in 2011 is the “K Machine” in Japan, based on a Fujitsu SPARC processor.
    • The “RISC era” was very dynamic, seeing the rapid rise and fall of 6-7 different architectures in about a decade.
    • In alphabetical order the major processor architectures were: Alpha, IA-64, i860, MIPS, PA-RISC, PowerPC, POWER, SPARC.
  • x86-based systems rapidly replaced RISC systems starting in around 2003.
    • The first big x86 system on the list was ASCI Red in 1996.
    • The large increase in x86 systems in 2003-2004 was due to several large systems in the top 10, rather than due to a single huge system.
    • The earliest of these systems were 32-bit Intel processors.
    • The growth of the x86 contribution was strongly enhanced by the introduction of the AMD x86-64 processors in 2004, with AMD contributing about 40% of the x86 Rmax by the end of 2006.
    • Intel 64-bit systems replaced 32-bit processors rapidly once they become available.
    • AMD’s share of the x86 Rmax dropped rapidly after 2011, and in the November 2016 list has fallen to about 1% of the Intel x86 Rmax.
  • My definition of “MPP” differs from Dongarra’s and is based on how the development of the most expensive part of the system (usually the processor) was funded.
    • Since 2005 almost all of the MPP’s in this chart have been IBM Blue Gene systems.
    • The big exception is the new #1 system, the Sunway Taihulight system in China.
  • Accelerated systems made their appearance in 2008 with the “RoadRunner” system at LANL.
    • “RoadRunner” was the only significant system using the IBM Cell processor.
    • Subsequent accelerated systems have almost all used NVIDIA GPUs or Intel Xeon Phi processors.
      • The NVIDIA GPUs took their big jump in 2010 with the introduction of the #2-ranked “Nebulae” (Dawning TC3600 Blade System) system in China (4640 GPUS), then took another boost in late 2012 with the upgrade of ORNL’s Jaguar to Titan (>18000 GPUs).
      • The Xeon Phi contribution is dominated by the immensely large Tianhe-2 system in China (48000 coprocessors), and the Stampede system at TACC (6880 coprocessors).
    • Note the rapid growth, then contraction, of the accelerated systems Rmax.
      • More on this topic below in the discussion of “clusters of clusters”.

Slide05
  • Obviously a high-level summary, but backed by a large amount of (somewhat fuzzy) data over the years.
  • With x86, we get (slowly) decreasing price per “core”, but it is not obvious that we will get another major technology replacement soon.
  • The embedded and mobile markets are larger than the x86 server/PC/laptop market, but there are important differences in the technologies and market requirements that make such a transition challenging.
  • One challenge is the increasing degree of parallelism required — think about how much parallelism an individual researcher can “own”.

Slide06
  • Systems on the TOP500 list are almost always shared, typically by scores or hundreds of users.
  • So up to a few hundred users, you don’t need to run parallel applications – your share is less than 1 ”processor”.
  • Beyond a few thousand cores, a user’s allocation will typically be multiple core-years per year, so parallelism is required.
  • A fraction of users can get away with single-node parallelism (perhaps with independent jobs running concurrently on multiple nodes), but the majority of users will need multi-node parallel implementations for turnaround, for memory capacity, or simply for throughput.

Slide07

Instead of building large homogeneous systems, many sites have recognized the benefit of specialization – a type of HW/SW “co-configuration”.
These configurations are easiest when the application profile is stable and well-known. It is much more challenging for a general-purpose site such as TACC.


Slide08

Slide09
  • This aside introduces the STREAM benchmark, which is what got me thinking about “balance” 25 years ago.
  • I have never visited the University of Virginia, but had colleagues there who agreed that STREAM should stay in academia when I moved to industry in 1996, and offered to host my guest account.

Slide10
  • Note that the output of each kernel is used as an input to the next.
    • The earliest versions of STREAM did not have this property and some compilers removed whole loops whose output was not used.
    • Fortunately it is easy to identify cases where this happens so that workarounds can be applied.
    • Another way to say this is that STREAM is resistant to undetected over-optimization.
  • OpenMP directives were added in 1996 or 1997.
  • STREAM in C was made fully 64-bit capable in 2013.
    • The validation code was also fixed to eliminate a problem with round-off error that used to occur for very large array sizes.
  • Output print formats keep getting wider as systems get faster.

Slide11
  • STREAM measures time, not bandwidth, so I have to make assumptions about how much data is moved to and from memory.
    • For the Copy kernel, there are actually three different conventions for how many bytes of traffic to count!
    • I count the reads that I asked for and the writes that I asked for.
    • If the processor requires “write allocates” the maximum STREAM bandwidth will be lower than the peak DRAM bandwidth.
      • The Copy and Scale kernels require 3/2 as much bandwidth as STREAM gives credit for if write allocates are included.
      • The Add and Triad kernels require 4/3 as much bandwidth as STREAM gives credit for if write allocates are included.
  • One weakness of STREAM is that all four kernels are in “store miss” form – none of the arrays are read before being written in a kernel.
    • A counter-example is the DAXPY kernel: A[i] = A[i] + scalar*B[i], for which the store hits in the cache because A[i] was already loaded as an input to the addition operation.
    • Non-allocating/non-temporal/streaming stores are typically required for best performance with the existing STREAM kernels, but these are not supported by all architectures or by all compilers.
      • For example, GCC will not generate streaming stores.
    • In my own work I typically supplement STREAM with “read-only” code (built around DDOT), a standard DAXPY (updating one of the two inputs), and sometimes add a “write-only” benchmark.

Slide12

Back to the main topic….

For performance modeling, I try to find a small number of “performance axes” that are capable of accounting for most of the execution time.


Slide13
  • Using the same performance axes as on the previous slide….
  • All balances are shifting to make data motion relatively more expensive than arithmetic.

Slide14
  • The first “Balance Ratio” to look at is (FP rate) / (Memory BW rate).
    • This is the cost per (64-bit) “word” loaded relative to the cost of a (peak) 64-bit FP operation, and applies to long streaming accesses (for which latency can be overlapped).
    • I refer to this metric as the “STREAM Balance” or “Machine Balance” or “System Balance”.
  • The data points here are from a set of real systems.
    • The systems I chose were both commercially successful and had very good memory subsystem performance relative to their competitors.
      • ~1990: IBM RISC System 6000 Model 320 (IBM POWER processor)
      • ~1993: IBM RISC System 6000 Model 590 (IBM POWER processor)
      • ~1996: SGI Origin2000 (MIPS R10000 processor)
      • ~1999: DEC AlphaServer DS20 (DEC Alpha processor)
      • ~2005: AMD Opteron 244 (single-core, DDR1 memory)
      • ~2006: AMD Opteron 275 (dual-core, DDR1 memory)
      • ~2008: AMD Opteron 2352 (dual-core, DDR2 memory)
      • ~2011: Intel Xeon X5680 (6-core Westmere EP)
      • ~2012: Intel Xeon E5 (8-core Sandy Bridge EP)
      • ~2013: Intel Xeon E5 v2 (10-core Ivy Bridge EP)
      • ~2014: Intel Xeon E5 v3 (12-core Haswell EP)
      • (future: Intel Xeon E5 v5)
  • Because memory bandwidth is understood to be an important performance limiter, the processor vendors have not let it degrade too quickly, but more and more applications become bandwidth-limited as this value increases (especially with essentially fixed cache size per core).
    • Unfortunately every other metric is much worse….

Slide15
  • ERRATA: There is an error in the equation in the title — it should be “(GFLOPS/s)*(Memory Latency)”
  • Memory latency is becoming expensive much more rapidly than memory bandwidth!
    • Memory latency is dominated by the time required for cache coherence in most recent systems.
    • Slightly decreasing clock speeds with rapidly increasing core counts leads to slowly increasing memory latency – even with heroic increases in hardware complexity.
  • Memory latency is not a dominant factor in very many applications, but it was not negligible in 7 of the 17 SPECfp2006 codes using hardware from 2006, so it is likely to be of increasing concern.
    • More on this below — slide 38.
    • The principal way to combat the negative impact of memory latency is to make hardware prefetching more aggressive, which increases complexity and costs a significant amount of power.

Slide16
  • Interconnect bandwidth (again for large messages) is falling behind faster than local memory bandwidth – primarily because interconnect widths have not increased in the last decade (while most chips have doubled the width of the DRAM interfaces over that period).
  • Interconnect *latency* (not shown) is so high that it would ruin the scaling of even a log chart.  Don’t believe me?  OK…

Slide17
  • Interconnect latency is decreasing more slowly than per-core performance, and much more slowly than per-chip performance.
  • Increasing the problem size only partly offsets the increasing relative cost of interconnect latency.
    • The details depend on the scaling characteristics of your application and are left as an exercise….

Slide18
  • Early GPUs had better STREAM Balance (FLOPS/Word) because the double-precision FLOPS rate was low.   This is no longer the case.
  • In 2012, mainstream, manycore, and GPU had fairly similar balance parameters, with both manycore and GPGPU using GDDR5 to get enough bandwidth to stay reasonably balanced.  We expect mainstream to be *more tolerant* of low bandwidth due to large caches and GPUs to be *less tolerant* of low bandwidth due to the very small caches.
  • In 2016, mainstream processors have not been able to increase bandwidth proportionately (~3x increase in peak FLOPS and 1.5x increase in BW gives 2x increase in FLOPS/Word).
  • Both manycore and GPU have required two generations of non-standard memory (GDDR5 in 2012 and MCDRAM and HBM in 2016) to prevent the balance from degrading too far.
    • These rapid changes require more design cost which results in higher product cost.
Slide19

Slide20
  • This chart is based on representative Intel processor models available at the beginning of each calendar year – starting in 2003, when x86 jumped to 36% of the TOP500 Rmax.
    • The specific values are based on the median-priced 2-socket server processor in each time frame.
    • The frequencies presented are the “maximum Turbo frequency for all cores operational” for processors through Sandy Bridge/Ivy Bridge.
    • Starting with Haswell, the frequency presented is the power-limited frequency when running LINPACK (or similar) on all cores.
      • This causes a significant (~25%) drop in frequency relative to operation with less computationally intense workloads, but even without the power limitation the frequency trend is slightly downward (and is expected to continue to drop.
  • Columns shaded with hash marks are for future products (Broadwell EP is shipping now for the 2017 column).
    • Core counts and frequencies are my personal estimates based on expected technology scaling and don’t represent Intel disclosures about those products.
  • How do Intel’s ManyCore (Xeon Phi) processors compare?

Slide21
  • Comparing the components of the per-socket GFLOPS of the Intel ManyCore processors relative to the Xeon ”mainstream” processors at their introduction.
  • The delivered performance ratio is expected to be smaller than the peak performance ratio even in the best cases, but these ratios are large enough to be quite valuable even if only a portion of the speedup can be obtained.
  • The basic physics being applied here is based on several complementary principles:
    • Simple cores are smaller (so you can fit more per chip) and use less power (so you can power & cool more per chip).
    • Adding cores brings a linear increase in peak performance (assuming that the power can be supplied and the resulting heat dissipated).
    • For each core, reducing the operating frequency brings a greater-than-proportional reduction in power consumption.
  • These principles are taken even further in GPUs (with hundreds or thousands of very simple compute elements).

Slide22
  • The DIMM architecture for DRAMs has been great for flexibility, but terrible for bandwidth.
  • Modern serial link technology runs at 25 Gbs per lane, while the fastest DIMM-based DDR4 interfaces run at just under 1/10 of that data rate.

Slide23
  • In 1990, the original “Killer Micros” had a single level of cache.
  • Since about 2008, most x86 processors have had 3 levels of cache.
    • Every design has to consider the potential performance advantage of speculatively probing the next level of cache (before finding out if the request has “hit” in the current level of cache) against the power cost of performing all those extra lookups.
      • E.g., if the miss rate is typically 10%, then speculative probing will increase the cache tag lookup rate by 10x.
      • The extra lookups can actually reduce performance if they delay “real” transactions.
  • Asynchronous clock crossings are hard to implement with low latency.
    • A big, and under-appreciated, topic for another forum…
  • Intel Xeon processors evolution:
    • Monolithic L3: Nehalem/Westmere — 1 coherence protocol supported
    • Sliced L3 on one ring: Sandy Bridge/Ivy Bridge — 2/3 coherence protocols supported
    • Sliced L3 on two rings: Haswell/Broadwell — 3 coherence protocols supported

Slide24

Slide25
  • Hardware is capable of extremely low-latency, low-overhead, high-bandwidth data transfers (on-chip or between chips), but only in integrated systems.
  • Legacy IO architectures have orders of magnitude higher latency and overhead, and are only able to attain full bandwidth with very long messages.
  • Some SW requirements, such as MPI message tagging, have been introduced without adequate input from HW designers, and end up being very difficult to optimize in HW.
    • It may only take one incompatible “required” feature to prevent an efficient HW architecture from being used for communication.

Slide26

Slide27
  • Thanks to Burton Smith for the Little’s Law reference!
  • Before we jump into the numbers, I want to show an illustration that may make Little’s Law more intuitive….

Slide28
  • Simple graphical illustration of Little’s Law.
  • I had to pick an old processor with a small latency-BW product to make the picture small enough to be readable.
  • The first set of six loads is needed to keep the bus busy from 0 to 50 ns.
  • As each cache line arrives, another request needs to be sent out so that the memory will be busy 60 ns in the future.
  • The second set of six loads can re-use the same buffers, so only six buffers are needed

Slide29
  • In the mid-1990’s, processors were just transitioning from supporting a single outstanding cache miss to supporting 4 outstanding cache misses.
  • In 2005, a single core of a first generation AMD Opteron could manage 8 cache misses directly, but only needed 5-6 to saturate the DRAM interface.
  • By mid-2016, a Xeon E5 v4 (Broadwell) processor requires about 100 cache lines “in flight” to fully tolerate the memory latency.
    • Each core can only directly support 10 outstanding L1 Data Cache misses, but the L2 hardware prefetchers can provide additional concurrency.
    • It still requires 6 cores to get within 5% of asymptotic bandwidth, and the processor energy consumed is 6x to 10x the energy consumed in the DRAMs.
  • The “Mainstream” machines are
    • SGI Origin2000 (MIPS R10000)
    • DEC Alpha DS20 (DEC Alpha EV5)
    • AMD Opteron 244 (single-core, DDR1 memory)
    • AMD Opteron 275 (dual-core, DDR1 memory)
    • AMD Opteron 2352 (dual-core, DDR2 memory)
    • Intel Xeon X5680 (6-core Westmere EP)
    • Intel Xeon E5 (8-core Sandy Bridge EP)
    • Intel Xeon E5 v2 (10-core Ivy Bridge EP)
    • Intel Xeon E5 v3 (12-core Haswell EP)
    • Intel Xeon E5 v4 (14-core Broadwell EP)
    • (future: Intel Xeon E5 v5 — a plausible estimate for a future Intel Xeon processor, based on extrapolation of current technology trends.)
  • The GPU/Manycore machines are:
    • NVIDIA M2050
    • Intel Xeon Phi (KNC)
    • NVIDIA K20
    • NVIDIA K40
    • Intel Xeon Phi/KNL
    • NVIDIA P100 (Latency estimated).

Slide30

Slide31

Slide32

Slide33

Slide34
  • Power density matters, but systems remain so expensive that power cost is still a relatively small fraction of acquisition cost.
  • For example, a hypothetical exascale system that costs $100 million a year for power may not be not out of line because the purchase price of such a system would be the range of $2 billion.
  • Power/socket is unlikely to increase significantly because of the difficulty of managing both bulk cooling and hot spots.
  • Electrical cost is unlikely to increase by large factors in locations attached to power grids.
  • So the only way for power costs to become dominant is for the purchase price per socket to be reduced by something like an order of magnitude.
  • Needless to say, the companies that sell processors and servers at current prices have an incentive to take actions to keep prices at current levels (or higher).
  • Even the availability of much cheaper processors is not quite enough to make power cost a first-order concern, because if this were to happen, users would deliberately pay more money for more energy-efficient processors, just to keep the ongoing costs tolerable.
  • In this environment, budgetary/organizational/bureaucratic issues would play a major role in the market response to the technology changes.

Slide35
  • Client processors could reduce node prices by using higher-volume, lower-gross-margin parts, but this does not fundamentally change the technology issues.
    • 25%/year for power might be tolerable with minor adjustments to our organizational/budgetary processes.
    • (Especially since staff costs are typically comparable to system costs, so 25% of hardware purchase price per year might only be about 12% of the annual computing center budget for that system.)
  • Very low-cost parts (”embedded” or “DSP” processors) are in a different ballpark – lifetime power cost can exceed hardware acquisition cost.
  • So if we get cheaper processors, they must be more energy-efficient as well.
  • This means that we need to understand where the energy is being used and have an architecture that allows us to control high-energy operations.
    • Not enough time for that topic today, but there are some speculations in the bonus slides.

Slide36
  • For the purposes of this talk, my speculations focus on the “most likely” scenarios.
  • Alternatives approaches to maintaining the performance growth rate of systems are certainly possible, but there are many obstacles on those paths and it is difficult to have confidence that any will be commercially viable.

Slide37
  • Once an application becomes important enough to justify the allocation of millions of dollars per year of computing resources, it begins to make sense to consider configuring one or more supercomputers to be cost-effective for that application.
    • (This is fairly widespread in practice.)
  • If an application becomes important enough to justify the allocation of tens of millions of dollars per year of computing resources, it begins to make sense to consider designing one or more supercomputers to be cost-effective for that application.
    • (This has clearly failed many times in the past, but the current technology balances makes the approach look attractive again.)
  • Next we will look at examples of “application balance” from various application areas.

Slide38
  • CRITICAL!  Application characterization is a huge topic, but it is easy to find applications that vary by two orders of magnitude or more in requirements for peak FP rate, for pipelined memory accesses (bandwidth), for unexpected memory access (latency), for large-message interconnect bandwidth, and for short-message interconnect latency.
  • The workload on TACC’s systems covers the full range of node-level balances shown here (including at least a half-dozen of the specific codes listed).
  • TACC’s workload includes comparable ranges in requirements for various attributes of interconnect and filesystem IO performance.
  • This chart is based on a sensitivity-based performance model with additive performance components.
    • In 2006/2007 there were enough different configurations available in the SPEC benchmark database for me to perform this analysis using public data.
    • More recent SPEC results are less suitable for this data mining approach for several reasons (notably the use of autoparallelization and HyperThreading).
    • But the modeling approach is still in active use at TACC, and was the subject of my keynote talk at the
      2nd International Workshop on Performance Modeling: Methods and Applications (PMMA16)

Slide39
  • Catch-22 — the more you know about the application characteristics and the more choices you have for computing technology and configuration, the harder it is to come up with cost-effective solutions!

Slide40
  • It is very hard for vendors to back off of existing performance levels….
  • As long as purchase prices remain high enough to keep power costs at second-order, there will be incentive to continue making the fast performance axes as fast as possible.
  • Vendors will only have incentive to develop systems balanced to application-specific performance ratios if there is a large enough market that makes purchases based on optimizing cost/performance for particular application sub-sets.
  • Current processor and system offerings provide a modest degree of configurability of performance characteristics, but the small price range makes this level of configurability a relatively weak lever for overall system optimization.

Slide41
  • This is not the future that I want to see, but it is the future that seems most likely – based on technological, economic, and organizational/bureaucratic factors.
  • There is clearly a disconnect between systems that are increasingly optimized for dense, vectorized floating-point arithmetic and systems that are optimized for low-density “big data” application areas.
    • As long as system prices remain high, vendors will be able to deliver systems that have good performance in both areas.
    • If the market becomes competitive there will be incentives to build more targeted alternatives – but if the market becomes competitive there will also be less money available to design such systems.

Slide42
  • Here I hit the 45-minute mark and ended the talk.
  • I will try to upload and annotate the “Bonus Slides” discussing potential disruptive technologies sometime in the next week.

 

Posted in Computer Architecture, Computer Hardware | 1 Comment »

Some notes on producer/consumer communication in cached processors

Posted by John D. McCalpin, Ph.D. on 22nd November 2016

In a recent Intel Software Developer Forum discussion (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/700477), I put together a few notes on the steps required for a single-producer, single-consumer communication using separate cache lines for “data” and “flag” values.

Although this was not a carefully-considered formal analysis, I think it is worth re-posting here as a reminder of the ridiculous complexity of implementing even the simplest communication operations in shared memory on cached systems.


I usually implement a producer/consumer code using “data” and “flag” in separate cache lines.  This enables the consumer to spin on the flag while the producer updates the data.   When the data is ready, the producer writes the flag variable.   At a low level, the steps are:

  1. The producer executes a store instruction, which misses in its L1 Data Cache and L2 cache, thus generating an RFO transaction on the ring.
    • The data for the store is held in a store buffer at least until the producer core has write permission on the cache line in its own L1 Data Cache.
    • The data for the store may be held in the store buffer for longer periods, awaiting other potential stores to the same cache line.
  2. The RFO traverses the ring to find the L3 slice that owns the physical address being used, and the RFO hits in the L3.
  3. The L3 has the data, but also sees that it is marked as being in a clean state in one or more processor private caches, so it issues an invalidate on the cache line containing the flag variable.
    • The L3 may or may not have a way of tracking whether the cache line containing the flag variable is shared in another chip.
    • The L3 may have directories that track which cores may have a copy of the cache line containing the flag variable.  If there are directories, the invalidates may be targeted at the specific caches that may hold the cache line, rather than being broadcast to all cores.
    • The L3 may send the data for the cache line containing the flag to the producer’s cache at this time, but that data cannot be used until the coherence transactions are complete.
  4. The consumer receives the invalidate, invalidates its copy of the flag line, and responds to the L3 that the line has been invalidated.
    • Immediately after responding to the L3 that the line has been invalidated, the spin loop in the consumer tries to load the flag variable again.
  5. The L3 receives the invalidation acknowledgements from all the cores that may have had a shared copy of the line, and notifies the producer core that it now has write permission on the cache line containing the flag data.
    • If you are lucky, the producer core will write the flag variable from the store buffer to the L1 Data Cache immediately on receipt of permission.
    • If you are unlucky, the producing core may lose the line containing the flag variable before the store buffer is written to the L1 Data Cache.  I don’t know if Intel processors can do this, but I know some processors than can lose the line before dumping the store buffer.
  6. Very shortly after sending the write permission notification to the producer core, the L3 will receive a read request from the consumer core for the same cache line containing the flag.
    • Depending on the implementation, several different things might happen.
    • One option is for the L3 to hold the line containing the flag variable in a “transient” state while it waits for an acknowledgement from the Producer core that it has received the write permission message.  In this case the L3 will either:
      • Stall the read request from the consumer core, or
      • NACK the read request from the consumer core (i.e., tell it to try again).
    • Another option is for the L3 to immediately process the read request and send an intervention/downgrade request for the cache line containing the flag variable to the producer core’s cache.
  7. In the “lucky” case, the intervention/downgrade request generated by the read from the consumer core will get the new value of the cache line containing the flag variable and return it to the consumer core and to the L3 slice that owns that physical address.
    • Various implementations have specific ordering requirements here that determine whether the cache line must be sent to the L3 first, then the to consumer core, or whether it can be sent to both at the same time.
    • Some implementations require an extra handshaking step after the consumer core receives the data, before the L3 will give it permission to use the data.  (This is more common in the case of a store than a read.)
  8. Finally the consumer core gets the new value of the flag variable and sees that it has changed!  The data is now ready!
  9. The spin loop on the consumer core now exits, which incurs a 20-cycle mispredicted branch delay.
  10. The consumer core now executes a load instruction to get the data.  This misses in the consumer’s L1 and L2 caches and generates a read request on the ring.
  11. The read request traverses the ring to the slice that owns the physical address of the cache line containing the data (which may be a different slice than the one controlling the cache line containing the flag), and the read request hits in the L3.
  12. The data in the L3 is stale, but the L3 knows exactly which core has write permission on the cache line containing the data, so it issues an intervention/downgrade on the cache line containing the data and targeting the cache of the producer core.
  13. The cache(s) of the producer core receive the intervention/downgrade request and return the new value of the cache line containing the data variable to the L3, simultaneously downgrading the permissions on the line so that it is now “read-only” in the producer’s caches.
  14. As was the case for the cache line containing the flag variable, the cache line containing the data variable makes its way to both the L3 slice that owns the physical address and the consumer core that requested the data.
  15. The cache line containing the data arrives in the consumer core’s cache, and the load instruction is allowed to complete.
  16. Once the consumer core has gotten the data safely into a register, it typically has to re-write the flag variable to let the producer core know that the value has been consumed and that the producer core is free to write to the cache line containing the data variable again.
    • This requires the consumer to make an “upgrade” request on the cache line containing the flag, so it can write to it.   This is similar to the sequence above, but since the consumer already has the data, it does not need the L3 to send it again — it only needs to wait until all other cores have acknowledge the invalidate before it can write to the flag line.
    • Double-buffering can be used to avoid this extra transaction — if the consumer uses a different set of addresses to send data back to the producer, then the fact that the producer has received another message from the consumer means that the consumer must have finished using the original data buffer, so it is safe for the producer to use it again.

There are many variants and many details that can be non-intuitive in an actual implementation.  These often involve extra round trips required to ensure ordering in ugly corner cases.  A common example is maintaining global ordering across stores that are handled by different coherence controllers.  This can be different L3 slices (and/or Home Agents) in a single package, or the more difficult case of stores that alternate across independent packages.

There are fewer steps in the case where the “data” and “flag” are in the same cache line, but extra care needs to be taken in that case because it is easier for the polling activity of the consumer to take the cache line away from the producer before it has finished doing the updates to the data part of the cache line.  This can result in more performance variability and reduced total performance, especially in cases with multiplier producers and multiple consumers (with locks, etc.).

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Comments Off on Some notes on producer/consumer communication in cached processors

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on 5th November 2016

The art and science of microprocessor architecture is a never-ending struggling to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four (consecutive) 32-bit memory values from memory.   As an example of one mode of operation, the four steps are:

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: http://www.agner.org/optimize/instruction_tables.pdf)
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
    • I am sure someone will be happy to tell me if I am wrong!

So that was a lot of detail — what is the point?

The first point relates the the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.

 

This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.

Implications?

  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g., http://dl.acm.org/citation.cfm?id=1730587), for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many approaches to neural network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?


UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | 2 Comments »