John McCalpin's blog

Dr. Bandwidth explains all….

Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

Posted by John D. McCalpin, Ph.D. on December 6, 2016

The Xeon Phi x200 (Knights Landing) has a lot of modes of operation (selected at boot time), and the latency and bandwidth characteristics are slightly different for each mode.

It is also important to remember that the latency can be different for each physical address, depending on the location of the requesting core, the location of the coherence agent responsible for that address, and the location of the memory controller for that address. Intel has not publicly disclosed the mapping of core numbers (APIC IDs) to physical locations on the chip or the locations of coherence agents (CHA boxes) on the chip, nor has it disclosed the hash functions used to map physical addresses to coherence agents and to map physical addresses to MCDRAM or DDR4 memory controllers. (In some modes of operation the memory mappings are trivial, but not in all modes.)

The modes that are important are:

  • “Flat” vs “Cache”
    • In “Flat” mode, MCDRAM memory is used as directly accessible memory, occupying the upper 16 GiB of physical address space.
      • The OS exposes this memory as being on “NUMA node 1”, so it can be accessed using the standard NUMA control facilities (e.g., numactl).
      • Sustained bandwidth from MCDRAM is highest in “Flat” mode.
    • In “Cache” mode, MCDRAM memory is used as an L3 cache for the main DDR4 memory.
      • In this mode the MCDRAM is invisible and effectively uncontrollable.  I will discuss the performance characteristics of Cache mode at a later date.
  • “All-to-All” vs “Quadrant”
    • In “All-to-All” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers (CHA boxes) distributed across the entire chip using an undocumented hash function, and consecutive physical (cache-line) addresses are assigned to memory controllers (MCDRAM or DDR4) distributed across the entire chip.
      • Initial testing indicates that addresses mapped to MCDRAM are distributed across the 8 MCDRAM controllers using a simple modulo-8 function on the 3 bits above the cache line address.
    • In “Quadrant” mode, consecutive physical (cache-line) addresses are assigned to coherence controllers distributed across the entire chip, but the each address is assigned to one of the MCDRAM controllers in the same “quadrant” as the coherence controller.
      • This reduces the number of “hops” required for request/response/coherence messages on the mesh, and should reduce both latency and contention.
      • Initial testing indicates that addresses mapped to MCDRAM are hashed across the 8 controllers using a complex hash function based on many high-order address bits.
        • Conjecture: This was done to allow the assignment of addresses to coherence agents to remain the same, with the “same quadrant” property enforced by changing the MCDRAM controller owning the address, rather than by changing the coherence agent owning the address.
  • “Sub-NUMA-Cluster”
    • There are several of these modes, only one of which will be discussed here.
    • “Sub-NUMA-Cluster 4” (SNC4) mode divides the chip into four “quadrants”, each of which acts like a NUMA node in a multi-socket system.
      • “node 0” owns the 1st quarter of contiguous physical address space.
        • The cores belonging to “node 0” are “close to” MCDRAM controllers 0 and 1.
        • Initial tests indicate that consecutive cache-line addresses are mapped to MCDRAM controllers 0/1 using a simple even/odd interleave.
        • The physical addresses that belong to “node 0” are mapped to coherence agents that are also located “close to” MCDRAM controllers 0 and 1.
      • Ditto for nodes 1, 2, and 3.

The Knights Landing system at TACC uses the Xeon Phi 7250 processor (68 cores, 1.4 GHz nominal).

My preferred latency tester provides the values in the table below for data mapped to MCDRAM memory.  The values presented are averaged over many addresses, with the ranges showing the variation of average latency across cores.

Mode of OperationFlat-QuadrantFlat-All2AllSNC4 localSNC4 remote
MCDRAM maximum latency (ns)156.1158.3153.6164.7
MCDRAM average latency (ns)154.0155.9150.5156.8
MCDRAM minimum latency (ns)152.3154.4148.3150.3
MCDRAM standard deviation (ns)1.01.00.93.1

Caveats:

  • My latency tester uses permutations of even-numbered cache lines in various sized address range blocks, so it is not guaranteed that my averages are uniformly distributed over all the coherence agents.
  • Variability across nodes is not entirely negligible, in part because different nodes have different patterns of disabled tiles.
    • E.g., Four of the 38 tiles are disabled on each Xeon Phi 7250 processor.
  • Run-to-run variability is typically small (1-2 ns) when using large pages, but there are certain idiosyncrasies that have yet to be explained.

Note that even though the average latency differences are quite small across these modes of operation, the sustained bandwidth differences are much larger. The decreased number of “hops” required for coherence transactions in “Quadrant” and “SNC-4” modes reduces contention on the mesh links and thereby allows higher sustained bandwidths. The difference between sustained bandwidth in Flat-All-to-All and Flat-Quadrant modes suggests that contention on the non-data mesh links (address, acknowledge, and invalidate) is more important than contention on the data transfer links (which should be the same for those two modes of operation). I will post more details to my blog as they become available….

The corresponding data for addresses mapped to DDR4 memory are included in the table below:

Mode of OperationFlat-QuadrantFlat-All2AllSNC4 localSNC4 remote
DDR4 maximum latency (ns)133.3136.8130.0141.5
DDR4 average latency (ns)130.4131.8128.2133.1
DDR4 minimum latency (ns)128.2128.5125.4126.5
DDR4 standard deviation (ns)1.22.41.13.1

There is negligible sustained bandwidth variability across modes for data in DDR4 memory because the DDR4 memory runs out of bandwidth long before the mesh runs out of bandwidth.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Cache Coherence Implementations, Computer Architecture, Computer Hardware, Performance | Tagged: , , | Comments Off on Memory Latency on the Intel Xeon Phi x200 “Knights Landing” processor

SC16 Invited Talk: Memory Bandwidth and System Balance in HPC Systems

Posted by John D. McCalpin, Ph.D. on November 22, 2016

Slide01

I have been involved in HPC for over 30 years:

  • 12 years as student & faculty user in ocean modeling,
  • 12 years as a performance analyst and system architect at SGI, IBM, and AMD, and
  • over 7 years as a research scientist at TACC.

Slide02

Slide03
  • This history is based on my own study of the market, with many of the specific details from my own re-analysis of the systems in the TOP500 lists.

Slide04
  • Vector systems were in decline by the time the first TOP500 list was collected in 1993, but still dominated the large systems space in the early 1990’s.
    • The large bump in Rmax in 2002 was due to the introduction of the “Earth Simulator” in Japan.
    • The last vector system (2nd gen Earth Simulator) fell off the list in 2014.
  • RISC SMPs and Clusters dominated the installed base in the second half of the 1990’s and the first few years of the 2000’s.
    • The large bump in Rmax in 2011 is the “K Machine” in Japan, based on a Fujitsu SPARC processor.
    • The “RISC era” was very dynamic, seeing the rapid rise and fall of 6-7 different architectures in about a decade.
    • In alphabetical order the major processor architectures were: Alpha, IA-64, i860, MIPS, PA-RISC, PowerPC, POWER, SPARC.
  • x86-based systems rapidly replaced RISC systems starting in around 2003.
    • The first big x86 system on the list was ASCI Red in 1996.
    • The large increase in x86 systems in 2003-2004 was due to several large systems in the top 10, rather than due to a single huge system.
    • The earliest of these systems were 32-bit Intel processors.
    • The growth of the x86 contribution was strongly enhanced by the introduction of the AMD x86-64 processors in 2004, with AMD contributing about 40% of the x86 Rmax by the end of 2006.
    • Intel 64-bit systems replaced 32-bit processors rapidly once they become available.
    • AMD’s share of the x86 Rmax dropped rapidly after 2011, and in the November 2016 list has fallen to about 1% of the Intel x86 Rmax.
  • My definition of “MPP” differs from Dongarra’s and is based on how the development of the most expensive part of the system (usually the processor) was funded.
    • Since 2005 almost all of the MPP’s in this chart have been IBM Blue Gene systems.
    • The big exception is the new #1 system, the Sunway Taihulight system in China.
  • Accelerated systems made their appearance in 2008 with the “RoadRunner” system at LANL.
    • “RoadRunner” was the only significant system using the IBM Cell processor.
    • Subsequent accelerated systems have almost all used NVIDIA GPUs or Intel Xeon Phi processors.
      • The NVIDIA GPUs took their big jump in 2010 with the introduction of the #2-ranked “Nebulae” (Dawning TC3600 Blade System) system in China (4640 GPUS), then took another boost in late 2012 with the upgrade of ORNL’s Jaguar to Titan (>18000 GPUs).
      • The Xeon Phi contribution is dominated by the immensely large Tianhe-2 system in China (48000 coprocessors), and the Stampede system at TACC (6880 coprocessors).
    • Note the rapid growth, then contraction, of the accelerated systems Rmax.
      • More on this topic below in the discussion of “clusters of clusters”.

Slide05
  • Obviously a high-level summary, but backed by a large amount of (somewhat fuzzy) data over the years.
  • With x86, we get (slowly) decreasing price per “core”, but it is not obvious that we will get another major technology replacement soon.
  • The embedded and mobile markets are larger than the x86 server/PC/laptop market, but there are important differences in the technologies and market requirements that make such a transition challenging.
  • One challenge is the increasing degree of parallelism required — think about how much parallelism an individual researcher can “own”.

Slide06
  • Systems on the TOP500 list are almost always shared, typically by scores or hundreds of users.
  • So up to a few hundred users, you don’t need to run parallel applications – your share is less than 1 ”processor”.
  • Beyond a few thousand cores, a user’s allocation will typically be multiple core-years per year, so parallelism is required.
  • A fraction of users can get away with single-node parallelism (perhaps with independent jobs running concurrently on multiple nodes), but the majority of users will need multi-node parallel implementations for turnaround, for memory capacity, or simply for throughput.

Slide07

Instead of building large homogeneous systems, many sites have recognized the benefit of specialization – a type of HW/SW “co-configuration”.
These configurations are easiest when the application profile is stable and well-known. It is much more challenging for a general-purpose site such as TACC.


Slide08

Slide09
  • This aside introduces the STREAM benchmark, which is what got me thinking about “balance” 25 years ago.
  • I have never visited the University of Virginia, but had colleagues there who agreed that STREAM should stay in academia when I moved to industry in 1996, and offered to host my guest account.

Slide10
  • Note that the output of each kernel is used as an input to the next.
    • The earliest versions of STREAM did not have this property and some compilers removed whole loops whose output was not used.
    • Fortunately it is easy to identify cases where this happens so that workarounds can be applied.
    • Another way to say this is that STREAM is resistant to undetected over-optimization.
  • OpenMP directives were added in 1996 or 1997.
  • STREAM in C was made fully 64-bit capable in 2013.
    • The validation code was also fixed to eliminate a problem with round-off error that used to occur for very large array sizes.
  • Output print formats keep getting wider as systems get faster.

Slide11
  • STREAM measures time, not bandwidth, so I have to make assumptions about how much data is moved to and from memory.
    • For the Copy kernel, there are actually three different conventions for how many bytes of traffic to count!
    • I count the reads that I asked for and the writes that I asked for.
    • If the processor requires “write allocates” the maximum STREAM bandwidth will be lower than the peak DRAM bandwidth.
      • The Copy and Scale kernels require 3/2 as much bandwidth as STREAM gives credit for if write allocates are included.
      • The Add and Triad kernels require 4/3 as much bandwidth as STREAM gives credit for if write allocates are included.
  • One weakness of STREAM is that all four kernels are in “store miss” form – none of the arrays are read before being written in a kernel.
    • A counter-example is the DAXPY kernel: A[i] = A[i] + scalar*B[i], for which the store hits in the cache because A[i] was already loaded as an input to the addition operation.
    • Non-allocating/non-temporal/streaming stores are typically required for best performance with the existing STREAM kernels, but these are not supported by all architectures or by all compilers.
      • For example, GCC will not generate streaming stores.
    • In my own work I typically supplement STREAM with “read-only” code (built around DDOT), a standard DAXPY (updating one of the two inputs), and sometimes add a “write-only” benchmark.

Slide12

Back to the main topic….

For performance modeling, I try to find a small number of “performance axes” that are capable of accounting for most of the execution time.


Slide13
  • Using the same performance axes as on the previous slide….
  • All balances are shifting to make data motion relatively more expensive than arithmetic.

Slide14
  • The first “Balance Ratio” to look at is (FP rate) / (Memory BW rate).
    • This is the cost per (64-bit) “word” loaded relative to the cost of a (peak) 64-bit FP operation, and applies to long streaming accesses (for which latency can be overlapped).
    • I refer to this metric as the “STREAM Balance” or “Machine Balance” or “System Balance”.
  • The data points here are from a set of real systems.
    • The systems I chose were both commercially successful and had very good memory subsystem performance relative to their competitors.
      • ~1990: IBM RISC System 6000 Model 320 (IBM POWER processor)
      • ~1993: IBM RISC System 6000 Model 590 (IBM POWER processor)
      • ~1996: SGI Origin2000 (MIPS R10000 processor)
      • ~1999: DEC AlphaServer DS20 (DEC Alpha processor)
      • ~2005: AMD Opteron 244 (single-core, DDR1 memory)
      • ~2006: AMD Opteron 275 (dual-core, DDR1 memory)
      • ~2008: AMD Opteron 2352 (dual-core, DDR2 memory)
      • ~2011: Intel Xeon X5680 (6-core Westmere EP)
      • ~2012: Intel Xeon E5 (8-core Sandy Bridge EP)
      • ~2013: Intel Xeon E5 v2 (10-core Ivy Bridge EP)
      • ~2014: Intel Xeon E5 v3 (12-core Haswell EP)
      • (future: Intel Xeon E5 v5)
  • Because memory bandwidth is understood to be an important performance limiter, the processor vendors have not let it degrade too quickly, but more and more applications become bandwidth-limited as this value increases (especially with essentially fixed cache size per core).
    • Unfortunately every other metric is much worse….

Slide15
  • ERRATA: There is an error in the equation in the title — it should be “(GFLOPS/s)*(Memory Latency)”
  • Memory latency is becoming expensive much more rapidly than memory bandwidth!
    • Memory latency is dominated by the time required for cache coherence in most recent systems.
    • Slightly decreasing clock speeds with rapidly increasing core counts leads to slowly increasing memory latency – even with heroic increases in hardware complexity.
  • Memory latency is not a dominant factor in very many applications, but it was not negligible in 7 of the 17 SPECfp2006 codes using hardware from 2006, so it is likely to be of increasing concern.
    • More on this below — slide 38.
    • The principal way to combat the negative impact of memory latency is to make hardware prefetching more aggressive, which increases complexity and costs a significant amount of power.

Slide16
  • Interconnect bandwidth (again for large messages) is falling behind faster than local memory bandwidth – primarily because interconnect widths have not increased in the last decade (while most chips have doubled the width of the DRAM interfaces over that period).
  • Interconnect *latency* (not shown) is so high that it would ruin the scaling of even a log chart.  Don’t believe me?  OK…

Slide17
  • Interconnect latency is decreasing more slowly than per-core performance, and much more slowly than per-chip performance.
  • Increasing the problem size only partly offsets the increasing relative cost of interconnect latency.
    • The details depend on the scaling characteristics of your application and are left as an exercise….

Slide18
  • Early GPUs had better STREAM Balance (FLOPS/Word) because the double-precision FLOPS rate was low.   This is no longer the case.
  • In 2012, mainstream, manycore, and GPU had fairly similar balance parameters, with both manycore and GPGPU using GDDR5 to get enough bandwidth to stay reasonably balanced.  We expect mainstream to be *more tolerant* of low bandwidth due to large caches and GPUs to be *less tolerant* of low bandwidth due to the very small caches.
  • In 2016, mainstream processors have not been able to increase bandwidth proportionately (~3x increase in peak FLOPS and 1.5x increase in BW gives 2x increase in FLOPS/Word).
  • Both manycore and GPU have required two generations of non-standard memory (GDDR5 in 2012 and MCDRAM and HBM in 2016) to prevent the balance from degrading too far.
    • These rapid changes require more design cost which results in higher product cost.
Slide19

Slide20
  • This chart is based on representative Intel processor models available at the beginning of each calendar year – starting in 2003, when x86 jumped to 36% of the TOP500 Rmax.
    • The specific values are based on the median-priced 2-socket server processor in each time frame.
    • The frequencies presented are the “maximum Turbo frequency for all cores operational” for processors through Sandy Bridge/Ivy Bridge.
    • Starting with Haswell, the frequency presented is the power-limited frequency when running LINPACK (or similar) on all cores.
      • This causes a significant (~25%) drop in frequency relative to operation with less computationally intense workloads, but even without the power limitation the frequency trend is slightly downward (and is expected to continue to drop.
  • Columns shaded with hash marks are for future products (Broadwell EP is shipping now for the 2017 column).
    • Core counts and frequencies are my personal estimates based on expected technology scaling and don’t represent Intel disclosures about those products.
  • How do Intel’s ManyCore (Xeon Phi) processors compare?

Slide21
  • Comparing the components of the per-socket GFLOPS of the Intel ManyCore processors relative to the Xeon ”mainstream” processors at their introduction.
  • The delivered performance ratio is expected to be smaller than the peak performance ratio even in the best cases, but these ratios are large enough to be quite valuable even if only a portion of the speedup can be obtained.
  • The basic physics being applied here is based on several complementary principles:
    • Simple cores are smaller (so you can fit more per chip) and use less power (so you can power & cool more per chip).
    • Adding cores brings a linear increase in peak performance (assuming that the power can be supplied and the resulting heat dissipated).
    • For each core, reducing the operating frequency brings a greater-than-proportional reduction in power consumption.
  • These principles are taken even further in GPUs (with hundreds or thousands of very simple compute elements).

Slide22
  • The DIMM architecture for DRAMs has been great for flexibility, but terrible for bandwidth.
  • Modern serial link technology runs at 25 Gbs per lane, while the fastest DIMM-based DDR4 interfaces run at just under 1/10 of that data rate.

Slide23
  • In 1990, the original “Killer Micros” had a single level of cache.
  • Since about 2008, most x86 processors have had 3 levels of cache.
    • Every design has to consider the potential performance advantage of speculatively probing the next level of cache (before finding out if the request has “hit” in the current level of cache) against the power cost of performing all those extra lookups.
      • E.g., if the miss rate is typically 10%, then speculative probing will increase the cache tag lookup rate by 10x.
      • The extra lookups can actually reduce performance if they delay “real” transactions.
  • Asynchronous clock crossings are hard to implement with low latency.
    • A big, and under-appreciated, topic for another forum…
  • Intel Xeon processors evolution:
    • Monolithic L3: Nehalem/Westmere — 1 coherence protocol supported
    • Sliced L3 on one ring: Sandy Bridge/Ivy Bridge — 2/3 coherence protocols supported
    • Sliced L3 on two rings: Haswell/Broadwell — 3 coherence protocols supported

Slide24

Slide25
  • Hardware is capable of extremely low-latency, low-overhead, high-bandwidth data transfers (on-chip or between chips), but only in integrated systems.
  • Legacy IO architectures have orders of magnitude higher latency and overhead, and are only able to attain full bandwidth with very long messages.
  • Some SW requirements, such as MPI message tagging, have been introduced without adequate input from HW designers, and end up being very difficult to optimize in HW.
    • It may only take one incompatible “required” feature to prevent an efficient HW architecture from being used for communication.

Slide26

Slide27
  • Thanks to Burton Smith for the Little’s Law reference!
  • Before we jump into the numbers, I want to show an illustration that may make Little’s Law more intuitive….

Slide28
  • Simple graphical illustration of Little’s Law.
  • I had to pick an old processor with a small latency-BW product to make the picture small enough to be readable.
  • The first set of six loads is needed to keep the bus busy from 0 to 50 ns.
  • As each cache line arrives, another request needs to be sent out so that the memory will be busy 60 ns in the future.
  • The second set of six loads can re-use the same buffers, so only six buffers are needed

Slide29
  • In the mid-1990’s, processors were just transitioning from supporting a single outstanding cache miss to supporting 4 outstanding cache misses.
  • In 2005, a single core of a first generation AMD Opteron could manage 8 cache misses directly, but only needed 5-6 to saturate the DRAM interface.
  • By mid-2016, a Xeon E5 v4 (Broadwell) processor requires about 100 cache lines “in flight” to fully tolerate the memory latency.
    • Each core can only directly support 10 outstanding L1 Data Cache misses, but the L2 hardware prefetchers can provide additional concurrency.
    • It still requires 6 cores to get within 5% of asymptotic bandwidth, and the processor energy consumed is 6x to 10x the energy consumed in the DRAMs.
  • The “Mainstream” machines are
    • SGI Origin2000 (MIPS R10000)
    • DEC Alpha DS20 (DEC Alpha EV5)
    • AMD Opteron 244 (single-core, DDR1 memory)
    • AMD Opteron 275 (dual-core, DDR1 memory)
    • AMD Opteron 2352 (dual-core, DDR2 memory)
    • Intel Xeon X5680 (6-core Westmere EP)
    • Intel Xeon E5 (8-core Sandy Bridge EP)
    • Intel Xeon E5 v2 (10-core Ivy Bridge EP)
    • Intel Xeon E5 v3 (12-core Haswell EP)
    • Intel Xeon E5 v4 (14-core Broadwell EP)
    • (future: Intel Xeon E5 v5 — a plausible estimate for a future Intel Xeon processor, based on extrapolation of current technology trends.)
  • The GPU/Manycore machines are:
    • NVIDIA M2050
    • Intel Xeon Phi (KNC)
    • NVIDIA K20
    • NVIDIA K40
    • Intel Xeon Phi/KNL
    • NVIDIA P100 (Latency estimated).

Slide30

Slide31

Slide32

Slide33

Slide34
  • Power density matters, but systems remain so expensive that power cost is still a relatively small fraction of acquisition cost.
  • For example, a hypothetical exascale system that costs $100 million a year for power may not be not out of line because the purchase price of such a system would be the range of $2 billion.
  • Power/socket is unlikely to increase significantly because of the difficulty of managing both bulk cooling and hot spots.
  • Electrical cost is unlikely to increase by large factors in locations attached to power grids.
  • So the only way for power costs to become dominant is for the purchase price per socket to be reduced by something like an order of magnitude.
  • Needless to say, the companies that sell processors and servers at current prices have an incentive to take actions to keep prices at current levels (or higher).
  • Even the availability of much cheaper processors is not quite enough to make power cost a first-order concern, because if this were to happen, users would deliberately pay more money for more energy-efficient processors, just to keep the ongoing costs tolerable.
  • In this environment, budgetary/organizational/bureaucratic issues would play a major role in the market response to the technology changes.

Slide35
  • Client processors could reduce node prices by using higher-volume, lower-gross-margin parts, but this does not fundamentally change the technology issues.
    • 25%/year for power might be tolerable with minor adjustments to our organizational/budgetary processes.
    • (Especially since staff costs are typically comparable to system costs, so 25% of hardware purchase price per year might only be about 12% of the annual computing center budget for that system.)
  • Very low-cost parts (”embedded” or “DSP” processors) are in a different ballpark – lifetime power cost can exceed hardware acquisition cost.
  • So if we get cheaper processors, they must be more energy-efficient as well.
  • This means that we need to understand where the energy is being used and have an architecture that allows us to control high-energy operations.
    • Not enough time for that topic today, but there are some speculations in the bonus slides.

Slide36
  • For the purposes of this talk, my speculations focus on the “most likely” scenarios.
  • Alternatives approaches to maintaining the performance growth rate of systems are certainly possible, but there are many obstacles on those paths and it is difficult to have confidence that any will be commercially viable.

Slide37
  • Once an application becomes important enough to justify the allocation of millions of dollars per year of computing resources, it begins to make sense to consider configuring one or more supercomputers to be cost-effective for that application.
    • (This is fairly widespread in practice.)
  • If an application becomes important enough to justify the allocation of tens of millions of dollars per year of computing resources, it begins to make sense to consider designing one or more supercomputers to be cost-effective for that application.
    • (This has clearly failed many times in the past, but the current technology balances makes the approach look attractive again.)
  • Next we will look at examples of “application balance” from various application areas.

Slide38
  • CRITICAL!  Application characterization is a huge topic, but it is easy to find applications that vary by two orders of magnitude or more in requirements for peak FP rate, for pipelined memory accesses (bandwidth), for unexpected memory access (latency), for large-message interconnect bandwidth, and for short-message interconnect latency.
  • The workload on TACC’s systems covers the full range of node-level balances shown here (including at least a half-dozen of the specific codes listed).
  • TACC’s workload includes comparable ranges in requirements for various attributes of interconnect and filesystem IO performance.
  • This chart is based on a sensitivity-based performance model with additive performance components.
    • In 2006/2007 there were enough different configurations available in the SPEC benchmark database for me to perform this analysis using public data.
    • More recent SPEC results are less suitable for this data mining approach for several reasons (notably the use of autoparallelization and HyperThreading).
    • But the modeling approach is still in active use at TACC, and was the subject of my keynote talk at the
      2nd International Workshop on Performance Modeling: Methods and Applications (PMMA16)

Slide39
  • Catch-22 — the more you know about the application characteristics and the more choices you have for computing technology and configuration, the harder it is to come up with cost-effective solutions!

Slide40
  • It is very hard for vendors to back off of existing performance levels….
  • As long as purchase prices remain high enough to keep power costs at second-order, there will be incentive to continue making the fast performance axes as fast as possible.
  • Vendors will only have incentive to develop systems balanced to application-specific performance ratios if there is a large enough market that makes purchases based on optimizing cost/performance for particular application sub-sets.
  • Current processor and system offerings provide a modest degree of configurability of performance characteristics, but the small price range makes this level of configurability a relatively weak lever for overall system optimization.

Slide41
  • This is not the future that I want to see, but it is the future that seems most likely – based on technological, economic, and organizational/bureaucratic factors.
  • There is clearly a disconnect between systems that are increasingly optimized for dense, vectorized floating-point arithmetic and systems that are optimized for low-density “big data” application areas.
    • As long as system prices remain high, vendors will be able to deliver systems that have good performance in both areas.
    • If the market becomes competitive there will be incentives to build more targeted alternatives – but if the market becomes competitive there will also be less money available to design such systems.

Slide42
  • Here I hit the 45-minute mark and ended the talk.
  • I will try to upload and annotate the “Bonus Slides” discussing potential disruptive technologies sometime in the next week.

 

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Architecture, Computer Hardware | Tagged: , , , | 1 Comment »

Some notes on producer/consumer communication in cached processors

Posted by John D. McCalpin, Ph.D. on November 22, 2016

In a recent Intel Software Developer Forum discussion (https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/700477), I put together a few notes on the steps required for a single-producer, single-consumer communication using separate cache lines for “data” and “flag” values.

Although this was not a carefully-considered formal analysis, I think it is worth re-posting here as a reminder of the ridiculous complexity of implementing even the simplest communication operations in shared memory on cached systems.


I usually implement a producer/consumer code using “data” and “flag” in separate cache lines.  This enables the consumer to spin on the flag while the producer updates the data.   When the data is ready, the producer writes the flag variable.   At a low level, the steps are:

  1. The producer executes a store instruction, which misses in its L1 Data Cache and L2 cache, thus generating an RFO transaction on the ring.
    • The data for the store is held in a store buffer at least until the producer core has write permission on the cache line in its own L1 Data Cache.
    • The data for the store may be held in the store buffer for longer periods, awaiting other potential stores to the same cache line.
  2. The RFO traverses the ring to find the L3 slice that owns the physical address being used, and the RFO hits in the L3.
  3. The L3 has the data, but also sees that it is marked as being in a clean state in one or more processor private caches, so it issues an invalidate on the cache line containing the flag variable.
    • The L3 may or may not have a way of tracking whether the cache line containing the flag variable is shared in another chip.
    • The L3 may have directories that track which cores may have a copy of the cache line containing the flag variable.  If there are directories, the invalidates may be targeted at the specific caches that may hold the cache line, rather than being broadcast to all cores.
    • The L3 may send the data for the cache line containing the flag to the producer’s cache at this time, but that data cannot be used until the coherence transactions are complete.
  4. The consumer receives the invalidate, invalidates its copy of the flag line, and responds to the L3 that the line has been invalidated.
    • Immediately after responding to the L3 that the line has been invalidated, the spin loop in the consumer tries to load the flag variable again.
  5. The L3 receives the invalidation acknowledgements from all the cores that may have had a shared copy of the line, and notifies the producer core that it now has write permission on the cache line containing the flag data.
    • If you are lucky, the producer core will write the flag variable from the store buffer to the L1 Data Cache immediately on receipt of permission.
    • If you are unlucky, the producing core may lose the line containing the flag variable before the store buffer is written to the L1 Data Cache.  I don’t know if Intel processors can do this, but I know some processors than can lose the line before dumping the store buffer.
  6. Very shortly after sending the write permission notification to the producer core, the L3 will receive a read request from the consumer core for the same cache line containing the flag.
    • Depending on the implementation, several different things might happen.
    • One option is for the L3 to hold the line containing the flag variable in a “transient” state while it waits for an acknowledgement from the Producer core that it has received the write permission message.  In this case the L3 will either:
      • Stall the read request from the consumer core, or
      • NACK the read request from the consumer core (i.e., tell it to try again).
    • Another option is for the L3 to immediately process the read request and send an intervention/downgrade request for the cache line containing the flag variable to the producer core’s cache.
  7. In the “lucky” case, the intervention/downgrade request generated by the read from the consumer core will get the new value of the cache line containing the flag variable and return it to the consumer core and to the L3 slice that owns that physical address.
    • Various implementations have specific ordering requirements here that determine whether the cache line must be sent to the L3 first, then the to consumer core, or whether it can be sent to both at the same time.
    • Some implementations require an extra handshaking step after the consumer core receives the data, before the L3 will give it permission to use the data.  (This is more common in the case of a store than a read.)
  8. Finally the consumer core gets the new value of the flag variable and sees that it has changed!  The data is now ready!
  9. The spin loop on the consumer core now exits, which incurs a 20-cycle mispredicted branch delay.
  10. The consumer core now executes a load instruction to get the data.  This misses in the consumer’s L1 and L2 caches and generates a read request on the ring.
  11. The read request traverses the ring to the slice that owns the physical address of the cache line containing the data (which may be a different slice than the one controlling the cache line containing the flag), and the read request hits in the L3.
  12. The data in the L3 is stale, but the L3 knows exactly which core has write permission on the cache line containing the data, so it issues an intervention/downgrade on the cache line containing the data and targeting the cache of the producer core.
  13. The cache(s) of the producer core receive the intervention/downgrade request and return the new value of the cache line containing the data variable to the L3, simultaneously downgrading the permissions on the line so that it is now “read-only” in the producer’s caches.
  14. As was the case for the cache line containing the flag variable, the cache line containing the data variable makes its way to both the L3 slice that owns the physical address and the consumer core that requested the data.
  15. The cache line containing the data arrives in the consumer core’s cache, and the load instruction is allowed to complete.
  16. Once the consumer core has gotten the data safely into a register, it typically has to re-write the flag variable to let the producer core know that the value has been consumed and that the producer core is free to write to the cache line containing the data variable again.
    • This requires the consumer to make an “upgrade” request on the cache line containing the flag, so it can write to it.   This is similar to the sequence above, but since the consumer already has the data, it does not need the L3 to send it again — it only needs to wait until all other cores have acknowledge the invalidate before it can write to the flag line.
    • Double-buffering can be used to avoid this extra transaction — if the consumer uses a different set of addresses to send data back to the producer, then the fact that the producer has received another message from the consumer means that the consumer must have finished using the original data buffer, so it is safe for the producer to use it again.

There are many variants and many details that can be non-intuitive in an actual implementation.  These often involve extra round trips required to ensure ordering in ugly corner cases.  A common example is maintaining global ordering across stores that are handled by different coherence controllers.  This can be different L3 slices (and/or Home Agents) in a single package, or the more difficult case of stores that alternate across independent packages.

There are fewer steps in the case where the “data” and “flag” are in the same cache line, but extra care needs to be taken in that case because it is easier for the polling activity of the consumer to take the cache line away from the producer before it has finished doing the updates to the data part of the cache line.  This can result in more performance variability and reduced total performance, especially in cases with multiplier producers and multiple consumers (with locks, etc.).

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Cache Coherence Implementations, Cache Coherence Protocols, Computer Architecture | Tagged: , | Comments Off on Some notes on producer/consumer communication in cached processors

Intel discloses “vector+SIMD” instructions for future processors

Posted by John D. McCalpin, Ph.D. on November 5, 2016

The art and science of microprocessor architecture is a never-ending struggling to balance complexity, verifiability, usability, expressiveness, compactness, ease of encoding/decoding, energy consumption, backwards compatibility, forwards compatibility, and other factors.   In recent years the trend has been to increase core-level performance by the use of SIMD vector instructions, and to increase package-level performance by the addition of more and more cores.

In the latest (October 2016) revision of  Intel’s Instruction Extensions Programming Reference, Intel has disclosed a fairly dramatic departure from these “traditional” approaches.   Chapter 6 describes a small number of future 512-bit instructions that I consider to be both “vector” instructions (in the sense of performing multiple consecutive operations) and “SIMD” instructions (in the sense of performing multiple simultaneous operations on the elements packed into the SIMD registers).

Looking at the first instruction disclosed, the V4FMADDPS instruction performs 4 consecutive multiply-accumulate operations with a single 512-bit accumulator register, four different (consecutively-numbered) 512-bit input registers, and four (consecutive) 32-bit memory values from memory.   As an example of one mode of operation, the four steps are:

  1. Load the first 32 bits from memory starting at the requested memory address, broadcast this single-precision floating-point value across the 16 “lanes” of the 512-bit SIMD register, multiply the value in each lane by the corresponding value in the named 512-bit SIMD input register, then add the results to the values in the corresponding lanes of the 512-bit SIMD accumulator register.
  2. Repeat step 1, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  3. Repeat step 2, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.
  4. Repeat step 3, but the first input value comes from the next consecutive 32-bit memory location and the second input value comes from the next consecutive register number.  The results are added to the same accumulator.

This remarkably specific sequence of operations is exactly the sequence used in the inner loop of a highly optimized dense matrix multiplication (DGEMM or SGEMM) kernel.

So why does it make sense to break the fundamental architectural paradigm in this way?

Understanding this requires spending some time reviewing the low-level details of the implementation of matrix multiplication on recent processors, to see what has been done, what the challenges are with current instruction sets and implementations, and how these might be ameliorated.

So consider the dense matrix multiplication operation C += A*B, where A, B, and C are dense square matrices of order N, and the matrix multiplication operation is equivalent to the pseudo-code:

for (i=0; i<N; i++) {
   for (j=0; j<N; j++) {
      for (k=0; k<N; k++) {
         C[i][j] += A[i][k] * B[k][j];
      }
   }
}

Notes on notation:

  • C[i][j] is invariant in the innermost loop, so I refer to the values in the accumulator as elements of the C array.
  • In consecutive iterations of the innermost loop, A[i][k] and B[k][j] are accessed with different strides.
    • In the implementation I use, one element of A is multiplied against a vector of contiguous elements of B.
    • On a SIMD processor, this is accomplished by broadcasting a single value of A across a full SIMD register, so I will refer to the values that get broadcast as the elements of the A array.
    • The values of B are accessed with unit stride and re-used for each iteration of the outermost loop — so I refer to the values in the named input registers as the elements of the B array.
  • I apologize if this breaks convention — I generally get confused when I look at other people’s code, so I will stick with my own habits.

Overview of GEMM implementation for AVX2:

  • Intel processors supporting the AVX2 instruction set also support the FMA3 instruction set.  This includes Haswell and newer cores.
  • These cores have 2 functional units supporting Vector Fused Multiply-Add instructions, with 5-cycle latency on Haswell/Broadwell and 4-cycle latency on Skylake processors (ref: http://www.agner.org/optimize/instruction_tables.pdf)
  • Optimization requires vectorization and data re-use.
    • The most important step in enabling these is usually referred to as “register blocking” — achieved by unrolling all three loops and “jamming” the results together into a single inner loop body.
  • With 2 FMA units that have 5-cycle latency, the code must implement at least 2*5=10 independent accumulators in order to avoid stalls.
    • Each of these accumulators must consist of a full-width SIMD register, which is 4 independent 64-bit values or 8 independent 32-bit values with the AVX2 instruction set.
    • Each of these accumulators must use a different register name, and there are only 16 SIMD register names available.
  • The number of independent accumulators is equal to the product of three terms:
    1. the unrolling factor for the “i” loop,
    2. the unrolling factor for the “j” loop,
    3. the unrolling factor for the “k” loop divided by the number of elements per SIMD register (4 for 64-bit arithmetic, 8 for 32-bit arithmetic).
      • So the “k” loop must be unrolled by at least 4 (for 64-bit arithmetic) or 8 (for 32-bit arithmetic) to enable full-width SIMD vectorization.
  • The number of times that a data item loaded into a register can be re-used also depends on the unrolling factors.
    • Elements of A can be re-used once for each unrolling of the “j” loop (since they are not indexed by “j”).
    • Elements of B can be re-used once for each unrolling of the “i” loop (since they are not indexed by “i”).
    • Note that more unrolling of the “k” loop does not enable additional re-use of elements of A and B, so unrolling of the “i” and “j” loops is most important.
  • The number accumulators is bounded below (at least 10) by the pipeline latency times the number of pipelines, and is bounded above by the number of register names (16).
    • Odd numbers are not useful — they correspond to not unrolling one of the loops, and therefore don’t provide for register re-use.
    • 10 is not a good number — it comes from unrolling factors of 2 and 5, and these don’t allow enough register re-use to keep the number of loads per cycle acceptably low.
    • 14 is not a good number — the unrolling factors of 2 and 7 don’t allow for good register re-use, and there are only 2 register names left that can be used to save values.
    • 12 is the only number of accumulators that makes sense.
      • Of the two options to get to 12 (3×4 and 4×3), only one works because of the limit of 16 register names.
      • The optimum register blocking is therefore based on
        • Unrolling the “i” loop 4 times
        • Unrolling the “j” loop 3 times
        • Unrolling the “k” loop 4/8 times (1 vector width for 64-bit/32-bit)
      • The resulting code requires all 16 registers:
        • 12 registers to hold the 12 SIMD accumulators,
        • 3 registers to hold the 3 vectors of B that are re-used across 4 iterations of “i”, and
        • 1 register to hold the elements of A that are loaded one element at a time and broadcast across the SIMD lanes of the target register.
  • I have been unable to find any other register-blocking scheme that has enough accumulators, fits in the available registers, and requires less than 2 loads per cycle.
    • I am sure someone will be happy to tell me if I am wrong!

So that was a lot of detail — what is the point?

The first point relates the the new Xeon Phi x200 (“Knights Landing”) processor.   In the code description above, the broadcast of A requires a separate load with broadcast into a named register.  This is not a problem with Haswell/Broadwell/Skylake processors — they have plenty of instruction issue bandwidth to include these separate loads.   On the other hand this is a big problem with the Knights Landing processor, which is based on a 2-instruction-per-cycle core.  The core has 2 vector FMA units, so any instruction that is not a vector FMA instruction represents a loss of 50% of peak performance for that cycle!

The reader may recall that SIMD arithmetic instructions allow memory arguments, so the vector FMA instructions can include data loads without breaking the 2-instruction-per-cycle limit.   Shouldn’t this fix the problem?   Unfortunately, not quite….

In the description of the AVX2 code above there are two kinds of loads — vector loads of contiguous elements that are placed into a named register and used multiple times, and scalar loads that are broadcast across all the lanes of a named register and only used once.   The memory arguments allowed for AVX2 arithmetic instructions are contiguous loads only.  These could be used for the contiguous input data (array B), but since these loads don’t target a named register, those vectors would have to be re-loaded every time they are used (rather than loaded once and used 4 times).   The core does not have enough load bandwidth to perform all of these extra load operations at full speed.

To deal with this issue for the AVX-512 implementation in Knights Landing, Intel added the option for the memory argument of an arithmetic instruction to be a scalar that is implicitly broadcast across the SIMD lanes.  This reduces the instruction count for the GEMM kernel considerably. Even combining this rather specialized enhancement with a doubling of the number of named SIMD registers (to 32), the DGEMM kernel for Knights Landing still loses almost 20% of the theoretical peak performance due to non-FMA instructions (mostly loads and prefetches, plus the required pointer updates, and a compare and branch at the bottom of the loop).   (The future “Skylake Xeon” processor with AVX-512 support will not have this problem because it is capable of executing at least 4 instructions per cycle, so “overhead” instructions will not “displace” the vector FMA instructions.)

To summarize: instruction issue limits are a modest problem with the current Knights Landing processor, and it is easy to speculate that this “modest” problem could become much more serious if Intel chose to increase the number of functional units in a future processor.

 

This brings us back to the newly disclosed “vector+SIMD” instructions.   A first reading of the specification implies that the new V4FMADD instruction will allow two vector units to be fully utilized using only 2 instruction slots every 4 cycles instead of 2 slots per cycle.  This will leave lots of room for “overhead” instructions, or for an increase in the number of available functional units.

Implications?

  • The disclosure only covers the single-precision case, but since this is the first disclosure of these new “vector” instructions, there is no reason to jump to the conclusion that this is a complete list.
  • Since this disclosure is only about the instruction functionality, it is not clear what the performance implications might be.
    • This might be a great place to introduce a floating-point accumulator with single-cycle issue rate (e.g., http://dl.acm.org/citation.cfm?id=1730587), for example, but I don’t think that would be required.
  • Implicit in all of the above is that larger and larger computations are required to overcome the overheads of starting up these increasingly-deeply-pipelined operations.
    • E.g., the AVX2 DGEMM implementation discussed above requires 12 accumulators, each 4 elements wide — equivalent to 48 scalar accumulators.
    • For short loops, the reduction of the independent accumulators to a single scalar value can exceed the time required for the vector operations, and the cross-over point is moving to bigger vector lengths over time.
  • It is not clear that any compiler will ever use this instruction — it looks like it is designed for Kazushige Goto‘s personal use.
  • The inner loop of GEMM is almost identical to the inner loop of a convolution kernel, so the V4FMADDPS instruction may be applicable to convolutions as well.
    • Convolutions are important in many approaches to neural network approaches to machine learning, and these typically require lower arithmetic precision, so the V4FMADDPS may be primarily focused on the “deep learning” hysteria that seems to be driving the recent barking of the lemmings, and may only accidentally be directly applicable to GEMM operations.
    • If my analyses are correct, GEMM is easier than convolutions because the alignment can be controlled — all of the loads are either full SIMD-width-aligned, or they are scalar loads broadcast across the SIMD lanes.
    • For convolution kernels you typically need to do SIMD loads at all element alignments, which can cause a lot more stalls.
      • E.g., on Haswell you can execute two loads per cycle of any size or alignment as long as neither crosses a cache-line boundary.  Any load crossing a cache-line boundary requires a full cycle to execute because it uses both L1 Data Cache ports.
      • As a simpler core, Knights Landing can execute up to two 512-bit/64-Byte aligned loads per cycle, but any load that crosses a cache-line boundary introduces a 2-cycle stall. This is OK for DGEMM, but not for convolutions.
      • It is possible to write convolutions without unaligned loads, but this requires a very large number of permute operations, and there is only one functional unit that can perform permutes.
      • On Haswell it is definitely faster to reload the data from cache (except possibly for the case where an unaligned load crosses a 4KiB page boundary) — I have not completed the corresponding analysis on KNL.

Does anyone else see the introduction of “vector+SIMD” instructions as an important precedent?


UPDATE: 2016-11-13:

I am not quite sure how I missed this, but the most important benefit of the V4FMADDPS instruction may not be a reduction in the number of instructions issued, but rather the reduction in the number of Data Cache accesses.

With the current AVX-512 instruction set, each FMA with a broadcast load argument requires an L1 Data Cache access.    The core can execute two FMAs per cycle, and the way the SGEMM code is organized, each pair of FMAs will be fetching consecutive 32-bit values from memory to (implicitly) broadcast across the 16 lanes of the 512-bit vector units.   It seems very likely that the hardware has to be able to merge these two load operations into a single L1 Data Cache access to keep the rate of cache accesses from being the performance bottleneck.

But 2 32-bit loads is only 1/8 of a natural 512-bit cache access, and it seems unlikely that the hardware can merge cache accesses across multiple cycles.   The V4FMADDPS instruction makes it trivial to coalesce 4 32-bit loads into a single L1 Data Cache access that would support 4 consecutive FMA instructions.

This could easily be extended to the double-precision case, which would require 4 64-bit loads, which is still only 1/2 of a natural 512-bit cache access.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Algorithms, Computer Architecture, Computer Hardware, Performance | Tagged: , , | 2 Comments »

Invited Talk at SuperComputing 2016!

Posted by John D. McCalpin, Ph.D. on October 16, 2016

“Memory Bandwidth and System Balance in HPC Systems”

If you are planning to attend the SuperComputing 2016 conference in Salt Lake City next month, be sure to reserve a spot on your calendar for my talk on Wednesday afternoon (4:15pm-5:00pm).

I will be talking about the technology and market trends that have driven changes in deployed HPC systems, with a particular emphasis on the increasing relative performance cost of memory accesses (vs arithmetic).   The talk will conclude with a discussion of near-term trends in HPC system balances and some ideas on the fundamental architectural changes that will be required if we ever want to obtain large reductions in cost and power consumption.

The official announcement:

SC16 Invited Talk Spotlight: Dr. John D. McCalpin Presents “Memory Bandwidth and System Balance in HPC Systems”

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Architecture, Computer Hardware, Performance | Tagged: , , , , , | Comments Off on Invited Talk at SuperComputing 2016!

AMD Opteron Memory Configuration notes

Posted by John D. McCalpin, Ph.D. on December 16, 2015

(These are old notes for relatively old systems — I just found this in my “drafts” folder and decided to switch the status to “public” so I can find it again!)

Some notes on how to determine the DRAM and Memory Controller configuration for a system using AMD Opteron/Phenom or other Family 10h processors.  All of this information is available in AMD’s publication: “BIOS and Kernel Developer’s Guide (BKDG) For AMD Family 10h Processors”, document number 31116. I am using “Rev 3.48 – April 22, 2010”

Background: How to Read and Interpret the PCI Configuration Bits

The processor configuration bits are available in PCI configuration space and can be read with the “lspci” program.  Unfortunately it requires a relatively new kernel (2.26 or newer) to read the extended configuration bits (i.e., those with offsets greater than 256 Bytes) — I will try to mark the problematic configuration bits as I go along.   To get the configuration bits, run “lspci -xxxx” (as root) and save the text output.

The “lspci” program prints out the output by bus, device, function, and offset.  Here we are only interested in the processor configurations, so we look through the output until we get to a line that looks like:

00:18.0 Host bridge: Advanced Micro Devices [AMD] K10 [Opteron, Athlon64, Sempron] HyperTransport Configuration

The initial characters of the line are interpreted as “bus 0”, “device 18”, “function 0”, followed by a text label for this PCI configuration space region.

The following lines will look like:

00: 22 10 00 12 00 00 10 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 80 00 00 00 00 00 00 00 00 00 00 00

These are hexadecimal dumps of the values at various “offsets” into the PCI configuration space.

The values are organized with the higher addresses and most significant bits to the right (except that within each 2-digit hexadecimal number the least significant bits are to the right!).  These PCI configuration space values are organized into 32-bit “registers”, so the first line above corresponds to

Offset 00 04 08 0C
bits 7:0 15:8 23:16 31:24 7:0 15:8 23:16 31:24 7:0 15:8 23:16 31:24 7:0 15:8 23:16 31:24
value (hex) 22 10 00 12 00 00 10 00 00 00 00 06 00 00 80 00

This first line of output corresponds to four 32-bit PCI configuration space registers, shown as offsets 00, 04, 08, and 0C in the table. The BKDG describes these as:
F0x00 Device/Vendor ID Register
Reset: 1200 1022h. <– note the “little-endian” ordering
F0x04 Status/Command Register
Reset: 0010 0000h. Bit[20] is set to indicate the existence of a PCI-defined capability block.
F0x08 Class Code/Revision ID Register
Reset: 0600 0000h.
F0x0C Header Type Register
Reset: 0080 0000h.

Important Memory Controller and DRAM Configuration Bits

Information About Installed Hardware

The first thing to look at are the properties of the installed hardware. The DIMM configuration information is contained in F2x[1,0]80 DRAM Bank Address Mapping Register. The Family 10h Opterons have two memory controllers — the information for controller 0 is located at Function 2, Offset 080, while the information for controller 1 is located at Function 2, Offset 180. Note that these offsets are specified in hexadecimal, so that offsets greater than 100 (=256 decimal) are located in the extended PCI configuration area and will not be included in the output of “lspci” on systems running 2.6.25 or earlier Linux kernels. For this configuration bit the inability to read the controller 1 values is not likely to be a proble — if the DIMMs have been properly installed in matching pairs, then the two memory controllers will be configured identically by the BIOS.

For the system I am looking at today, the output of lspci for Function 2, offset 80 is

80:	55	00	00	00

Comparing the description of the bit field mappings from the BKDG (page 238) with my data for this configuration register and the device information in Table 85 (page 239) of the BKDG gives:

Bits Field Name Value Meaning
31:16 Reserved 00 00h (it is nice to see that these reserved bits are actually zero)
15:12 Dimm3AddrMap 0h ignored because this DIMM is not populated
11:8 Dimm2AddrMap 0h ignored because this DIMM is not populated
7:4 Dimm1AddrMap 5h = 1010b CS size = 1 GB, Device size/width = 1G, x8, Bank Address bits = 15,14,13
3:0 Dimm0AddrMap 5h = 1010b CS size = 1 GB, Device size/width = 1G, x8, Bank Address bits = 15,14,13

These interpretations make sense. The DIMMs are composed of 1 Gbit DRAM chips, each with 8 output bits (“x8”). To create a 64-bit DIMM, 8 of these DRAM chips work together as a single “rank” (actually there are 9 chips in a rank, to provide extra bits for ECC error correction). This “rank” will then have a capacity of 1 Gbit/chip * 8 chips = 1 GiB. (Here I am using the newer notation to distinguish between binary and decimal sizes — see http://en.wikipedia.org/wiki/Gibibyte for more information.)

A few things to note:

  • The bits in this configuration register don’t tell whether or not a DIMM is installed. The value of ‘0’ for Dimm2AddrMap and Dimm3AddrMap could correspond to a 128 MiB chip select size composed of 256 Mb parts with x16 width. It is quite unlikely that anyone will run across such a DIMM in their system — 256 Mbit parts are old technology and x16 width is quite unusual outside of the embedded processor space — but there may be no guarantee that the BIOS will set the bits here to such an easily identified value in the event that no DIMM is installed in that slot, so you do need to look elsewhere to be sure of the configuration.
  • The bits in this configuration register don’t tell how many “ranks” are included on each DIMM. A DIMM can be constructed with 1, 2, or 4 ranks (though 1 and 2 are by far the most common), so you need to check elsewhere to find the number of ranks.
  • In any system using both DRAM channels (either ganged or unganged-but-interleaved — see below) the address bit numbers in above must be incremented by one. The BKDG includes a comment on the bottom of page 238 that the address bits only need to be incremented when running in ganged mode. I think this is incorrect — the “effective” bank size is doubled (corresponding to incrementing the bank address bits) by the use of two DRAM channels, whether the interleaving is within a cache line (as in ganged mode) or between cache lines (as in unganged-but-interleaved) mode.

Information About Configuration of Hardware

Memory Controller Channel Interleave or Ganging

There are a number of common configurations choices that determine how the hardware makes use of the two 64-bit DRAM channels.

  • One feature is called “DRAM controller ganging”, which sets up the two DRAM channels to work in lockstep, with each cache line being split between the two channels. This feature is typically activated when the strongest error-correction features are desired. The downside of this approach is that each memory controller is only transferring 32 Bytes per cache line, which corresponds to 4 8-Byte bursts. This is the shortest burst length supported by DDR2 memory and makes the DRAM bus overheads relatively larger. DRAM controller ganging is enabled if F2x110 DRAM Controller Select Low Register Bit 4: DctGangEn: “DRAM controller ganging enable”, is set.
  • If the memory controllers are not ganged, the BIOS will attempt to set up “DRAM channel interleaving”. In this mode, each channel transfers full cache lines, and cache lines are distributed evenly between the two channels.
    DRAM channel interleaving is enabled if F2x110 DRAM Controller Select Low Register Bit 5: DctDatIntLv: “DRAM controller data interleave enable” is set. There are a number of options controlling exactly how the cache lines are mapped to the two DRAM channels. These options are controlled by F2x110 DRAM Controller Select Low Register Bits 7:6 DctSelIntLvAddr: “DRAM controller select channel interleave address bit”. The recommended option is the value “10”, which causes the DRAM channel select bit to be computed as the Exclusive-OR of address bits 20:16 & 6. When the value is “1”, channel 1 is selected, otherwise channel 0 is selected. Bit 6 is the bit above the cache line address, so using bit 6 alone would cause cache lines to be mapped odd/even to the two DRAM channels. By computing the channel select bit using the Exclusive-OR of six bits it is much less likely that an access pattern will repeatedly access only one of the two DRAM channels.
Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Hardware | Tagged: , | Comments Off on AMD Opteron Memory Configuration notes

Memory Bandwidth Requirements of the HPL benchmark

Posted by John D. McCalpin, Ph.D. on September 11, 2014

The High Performance LINPACK (HPL) benchmark is well known for delivering a high fraction of peak floating-point performance. The (historically) excellent scaling of performance as the number of processors is increased and as the frequency is increased suggests that memory bandwidth has not been a performance limiter.

But this does not mean that memory bandwidth will *never* be a performance limiter. The algorithms used by HPL have lots of data re-use (both in registers and from the caches), but the data still has to go to and from memory, so the bandwidth requirement is not zero, which means that at some point in scaling the number of cores or frequency or FP operations per cycle, we are going to run out of the available memory bandwidth. The question naturally arises: “Are we (almost) there yet?”

Using Intel’s optimized HPL implementation, a medium-sized (N=18000) 8-core (single socket) HPL run on a Stampede compute node (3.1 GHz, 8 cores/chip, 8 FP ops/cycle) showed about 15 GB/s sustained memory bandwidth at about 165 GFLOPS. This level of bandwidth utilization should be no trouble at all (even when running on two sockets), given the 51.2 GB/s peak memory bandwidth (~38 GB/s sustainable) on each socket.

But if we scale this to the peak performance of a new Haswell EP processor (e.g., 2.6 GHz, 12 cores/chip, 16 FP ops/cycle), it suggests that we will need about 40 GB/s of memory bandwidth for a single-socket HPL run and about 80 GB/s of memory bandwidth for a 2-socket run. A single Haswell chip can only deliver about 60 GB/s sustained memory bandwidth, so the latter value is a problem, and it means that we expect LINPACK on a 2-socket Haswell system to require attention to memory placement.

A colleague here at TACC ran into this while testing on a 2-socket Haswell EP system. Running in the default mode showed poor scaling beyond one socket. Running the same binary under “numactl —interleave=0,1” eliminated most (but not all) of the scaling issues. I will need to look at the numbers in more detail, but it looks like the required chip-to-chip bandwidth (when using interleaved memory) may be slightly higher than what the QPI interface can sustain.

Just another reminder that overheads that are “negligible” may not stay that way in the presence of exponential performance growth.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Algorithms, Performance | Tagged: , , , | Comments Off on Memory Bandwidth Requirements of the HPL benchmark

Notes on the mystery of hardware cache performance counters

Posted by John D. McCalpin, Ph.D. on July 14, 2013


In response to a question on the PAPI mailing list, I scribbled some notes to try to help users understand the complexity of hardware performance counters for cache accesses and cache misses, and thought they might be helpful here….

For any interpretation of specific hardware performance counter events, it is absolutely essential to precisely specify the processor that you are using.

Cautionary Notes

Although it may not make a lot of sense, the meanings of “cache miss” and “cache access” are almost always quite different across different vendors’ CPUs, and can be quite different for different CPUs from the same vendor. It is actually rather *uncommon* for L1 cache misses to match L2 cache accesses, for a variety of reasons that are difficult to summarize concisely.

Some examples of behavior that could make the L1 miss counter larger than the L2 access counter:

  • If an instruction fetch misses in the L1 Icache, the fetch may be retried several times before the instructions have been returned to the L1 Icache. The L1 Icache miss event might be incremented every time the fetch is attempted, while the L2 cache access counter may only be incremented on the initial fetch.
  • L1 caches (both data and instruction) typically have hardware prefetch engines. The L1 Icache miss counter may only be incremented when the instruction fetcher requests data that is not found in the L1 Icache, while the L2 cache access counter may be incremented every time the L2 receives either an L1 Icache miss or an L1 Icache prefetch.
  • The processor may attempt multiple instruction fetches of different addresses in the same cache line. The L1 Icache miss event might be incremented on each of these fetch attempts, while the L2 cache access counter might only be incremented once for the cache line request.
  • The processor may be fetching data that is not allowed to be cached in the L2 cache, such as ROM-resident code. It may not be allowed in the L1 Instruction cache either, so every instruction fetch would miss in the L1 cache (because it is not allowed to be there), then bypass access to the L2 cache (because it is not allowed to be there), then get retrieved directly from memory. (I don’t know of any specific processors that work this way, but it is certainly plausible.)

An example of behavior that could make the L1 miss counter smaller than the L2 access counter: (this is a very common scenario)

  • The L1 instruction cache miss counter might be incremented only once when an instruction fetch misses in the L1 Icache, while the L2 cache might be accessed repeatedly until the data actually arrives in the L2. This is especially common in the case of L2 cache misses — the L1 Icache miss might request data from the L2 dozens of times before it finally arrives from memory.

A Recommended Procedure

Given the many possible detailed meanings of such counters, the procedure I use to understand the counter events is:

  1. Identify the processor in detail.
    This includes vendor, family, model, and stepping.
  2. Determine the precise mapping of PAPI events to underlying hardware events.
    (This is irritatingly difficult on Linux systems that use the “perf-events” subsystem — that is a long topic in itself.)
  3. Look up the detailed descriptions of the hardware events in the vendor processor documentation.
    For AMD, this is the relevant “BIOS and Kernel Developers Guide” for the processor family.
    For Intel, this Volume 3 of the “Intel 64 and IA-32 Architecture Software Developer’s Guide”.
  4. Check the vendor’s published processor errata to see if there are known bugs associated with the counter events in question.
    For AMD these documents are titled “Revision Guide for the AMD Family [nn] Processors”.
    For Intel these documents are usually given a title including the words “Specification Update”.
  5. Using knowledge of the cache sizes and associativities, build a simple test code whose behavior should be predictable by simple paper-and-pencil analysis.
    The STREAM Benchmark is an example of a code whose data access patterns and floating point operation counts are easy to determine and easy to modify.
  6. Compare the observed performance counter results for the simple test case with the expected results and try to work out a model that bridges between the two.
    The examples of different ways to count given at the beginning of this note should be very helpful in attempting to construct a model.
  7. Decide which counters are “close enough” to be helpful, and which counters cannot be reliably mapped to performance characteristics of interest.

An example of a counter that (probably) cannot be made useful

As an example of the final case — counters that cannot be reliably mapped to performance characteristics of interest — consider the floating point instruction counters on the Intel “Sandy Bridge” processor series. These counters are incremented on instruction *issue*, not on instruction *execution* or instruction *retirement*. If the inputs to the instruction are not “ready” when the instruction is *issued*, the instruction issue will be rejected and the instruction will be re-issued later, and may be re-issued many times before it is finally able to execute. The most common cause for input arguments to not be “ready” is that they are coming from memory and have not arrived in processor registers yet (either explicit load instructions putting data in registers or implicit register loads via memory arguments to the floating-point arithmetic instruction itself).

For a workload with a very low cache miss rate (e.g., DGEMM), the “overcounting” of FP instruction issues relative to the more interesting FP instruction execution or retirement can be as low as a few percent. For a workload with a high cache miss rate (e.g., STREAM), the “overcounting” of FP instructions can be a factor of 4 to 6 (perhaps worse), depending on how many cores are in use and whether the memory accesses are fully localized (on multi-chip platforms). In the absence of detailed information about the processor’s internal algorithm for retrying operations, it seems unlikely that this large overcount can be “corrected” to get an accurate estimate of the number of floating-point operations actually executed or retired. The amount of over-counting will likely depend on at least the following factors:

  • the instruction retry rate (which may depend on how many instructions are available for attempted issue in the processor’s reorder buffer, including whether or not HyperThreading is enabled),
  • the instantaneous frequency of the processor (which can vary from 1.2 GHz to 3.5 GHz on the Xeon E5-2670 “Sandy Bridge” processors in the TACC “Stampede” system),
  • the detailed breakdown of latency for the individual loads (i.e., the average latency may not be good enough if the retry rate is not fixed),
  • the effectiveness of the hardware prefetchers at getting the data into the data before it is needed (which, in turn, is a function of the number of data streams, the locality of the streams, the contention at the memory controllers)

There are likely other applicable factors as well — for example the Intel “Sandy Bridge” processors support several mechanisms that allow the power management unit to bias behavior related to the trade-off of performance vs power consumption. One mechanism is referred to as the “performance and energy bias hint”, and is described as as a “hint to guide the hardware heuristic of power management features to favor increasing dynamic performance or conserve energy consumption” (Intel 64 and IA-32 Architectures Software Developer’s Manual: Volume 3, Section 14.3.4, Document 325384-047US, June 2013). Another mechanism (apparently only applicable to “Sandy Bridge” systems with integrated graphics units) is a pair of “policy” registers (MSR_PP0_POLICY and MSR_PP1_POLICY) that define the relative priority of the processor cores and the graphics unit in dividing up the chip’s power budget. The specific mechanisms by which these features work, and the detailed algorithms used to control those mechanisms, are not publicly disclosed — but it seems likely that at least some of the mechanisms involved may impact the floating-point instruction retry rate.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Computer Hardware, Performance, Performance Counters | Tagged: , | Comments Off on Notes on the mystery of hardware cache performance counters

Coherence with Cached Memory-Mapped IO

Posted by John D. McCalpin, Ph.D. on May 30, 2013

In response to my previous blog entry, a question was asked about how to manage coherence for cached memory-mapped IO regions.   Here are some more details…

Maintaining Coherence with Cached Memory-Mapped IO

For the “read-only” range, cached copies of MMIO lines will never be invalidated by external traffic, so repeated reads of the data will always return the cached copy.   Since there are no external mechanisms to invalidate the cache line, we need a mechanism that the processor can use to invalidate the line, so the next load to that line will go to the IO device and get fresh data.

There are a number of ways that a processor should be able to invalidate a cached MMIO line.  Not all of these will work on all implementations!

  1. Cached copies of MMIO addresses can, of course, be dropped when they become LRU and are chosen as the victim to be replaced by a new line brought into the cache.
    A code could read enough conflicting cacheable addresses to ensure that the cached MMIO line would be evicted.
    The number is typically 8 for a 32 KiB data cache, but you need to be careful that the reads have not been rearranged to put the cached MMIO read in the middle of the “flushing” reads.   There are also some systems for which the pseudo-LRU algorithm has “features” that can break this approach.  (HyperThreading and shared caches can both add complexity in this dimension.)
  2. The CLFLUSH instruction operating on the virtual address of the cached MMIO line should evict it from the L1 and L2 caches.
    Whether it will evict the line from the L3 depends on the implementation, and I don’t have enough information to speculate on whether this will work on Xeon processors.   For AMD Family 10h processors, due to the limitations of the CLFLUSH implementation, cached MMIO lines are only allowed in the L1 cache.
  3. For memory mapped my the MTRRs as WP (“Write Protect”), a store to the address of the cached MMIO line should invalidate that line from the L1 & L2 data caches.  This will generate an *uncached* store, which typically stalls the processor for quite a while, so it is not a preferred solution.
  4. The WBINVD instruction (kernel mode only) will invalidate the *entire* processor data cache structure and according to the Intel Architecture Software Developer’s Guide, Volume 2 (document 325338-044), will also cause all external caches to be flushed.  Additional details are discussed in the SW Developer’s Guide, Volume 3.    Additional caution needs to be taken if running with HyperThreading enabled, as mentioned in the discussion of the CPUID instruction in the SW Developer’s Guide, Vol 2.
  5. The INVD instruction (kernel mode only) will invalidate all the processor caches, but it does this non-coherently (i.e., dirty cache lines are not written back to memory, so any modified data gets lost).   This is very likely to crash your system, and is only mentioned here for completeness.
  6. AMD processors support some extensions to the MTRR mechanism that allow read and write operations to the same physical address to be sent to different places (i.e., one to system memory and the other to MMIO).  This is *almost* useful for supporting cached MMIO, but (at least on the Family 10h processors), the specific mode that I wanted to set up (see addendum below) is disallowed for ugly microarchitectural reasons that I can’t discuss.

There are likely to be more complexities that I am not remembering right now, but the preferred answer is to bind the process doing the cached MMIO to a single core (and single thread context if using HyperThreading) and use CLFLUSH on the address you want to invalidate.   There are no guarantees, but this seems like the approach most likely to work.

 

Addendum: The AMD almost-solution using MTRR extensions.

The AMD64 architecture provides extensions to the MTRR mechanism called IORRs that allow the system programmer to independently specify whether reads to a certain region go to system memory or MMIO and whether writes to that region go to system memory or MMIO.   This is discussed in the “AMD64 Architecture Programmers Manual, Volume 2: System Programming” (publication number 24593).
I am using version 3.22 from September 2012, where this is described in section 7.9.

The idea was to use this to modify the behavior of the “read-only” MMIO mapping so that reads would go to MMIO while writes would go to system memory.  At first glance this seems strange — I would be creating a “write-only” region of system memory that could never be read (because reads to that address range would go to MMIO).

So why would this help?

It would help because sending the writes to system memory would cause the cache coherence mechanisms to be activated.   A streaming store (for example) to this region would be sent to the memory controller for that physical address range.  The memory controller treats streaming stores in the same way as DMA stores from IO devices to system memory, and it sends out invalidate messages to all caches in the system.  This would invalidate the cached MMIO line in all caches, which would eliminate both the need to pin the thread to a specific core and the problem of the CLFLUSH not reaching the L3 cache.

At least in the AMD Family 10h processors, this IORR function works, but due to some implementation issues in this particular use case it forces the region to the MTRR UC (uncached) type, which defeats my purpose in the exercise.   I think that the implementation issues could be either fixed or worked around, but since this is a fix to a mode that is not entirely supported, it is easy to understand that this never showed up as a high priority to “fix”.

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Accelerated Computing, Computer Hardware, Linux | Tagged: , , , | Comments Off on Coherence with Cached Memory-Mapped IO

Notes on Cached Access to Memory-Mapped IO Regions

Posted by John D. McCalpin, Ph.D. on May 29, 2013

When attempting to build heterogeneous computers with “accelerators” or “coprocessors” on PCIe interfaces, one quickly runs into asymmetries between the data transfer capabilities of processors and IO devices.  These asymmetries are often surprising — the tremendously complex processor is actually less capable of generating precisely controlled high-performance IO transactions than the simpler IO device.   This leads to ugly, high-latency implementations in which the processor has to program the IO unit to perform the required DMA transfers and then interrupt the processor when the transfers are complete.

For tightly-coupled acceleration, it would be nice to have the option of having the processor directly read and write to memory locations on the IO device.  The fundamental capability exists in all modern processors through the feature called “Memory-Mapped IO” (MMIO), but for historical reasons this provides the desired functionality without the desired performance.   As discussed below, it is generally possible to set up an MMIO mapping that allows high-performance writes to IO space, but setting up mappings that allow high-performance reads from IO space is much more problematic.

Processors only support high-performance reads when executing loads to cached address ranges.   Such reads transfer data in cache-line-sized blocks (64 Bytes on x86 architectures) and can support multiple concurrent read transactions for high throughput.  When executing loads to uncached address ranges (such as MMIO ranges), each read fetches only the specific bits requested (1, 2, 4, or 8 Bytes), and all reads to uncached address ranges are completely serialized with respect to each other and with respect to any other memory references.   So even if the latency to the IO device were the same as the latency to memory, using cache-line accesses could easily be (for example) 64 times as fast as using uncached accesses — 8 concurrent transfers of 64 Bytes using cache-line accesses versus one serialized transfer of 8 Bytes.

But is it possible to get modern processors to use their cache-line access mechanisms to read data from MMIO addresses?   The answer is a resounding, “yes, but….“.    The notes below provide an introduction to some of the issues….

It is possible to map IO devices to cacheable memory on at least some processors, but the accesses have to be very carefully controlled to keep within the capabilities of the hardware — some of the transactions to cacheable memory can map to IO transactions and some cannot.
I don’t know the details for Intel processors, but I did go through all the combinations in great detail as the technology lead of the “Torrenza” project at AMD.

Speaking generically, some examples of things that should and should not work (though the details will depend on the implementation):

  • Load miss — generates a cache line read — converted to a 64 Byte IO read — works OK.
    BUT, there is no way for the IO device to invalidate that line in the processor(s) cache(s), so coherence must be maintained manually using the CLFLUSH instruction. NOTE also that the CLFLUSH instruction may or may not work as expected when applied to addresses that are mapped to MMIO, since the coherence engines are typically associated with the memory controllers, not the IO controllers. At the very least you will need to pin threads doing cached MMIO to a single core to maximize the chances that the CLFLUSH instructions will actually clear the (potentially stale) copies of the cache lines mapped to the MMIO range.
  • Streaming Store (aka Write-Combining store, aka Non-temporal store) — generates one or more uncached stores — works OK.
    This is the only mode that is “officially” supported for MMIO ranges by x86 and x86-64 processors. It was added in the olden days to allow a processor core to execute high-speed stores into a graphics frame buffer (i.e., before there was a separate graphics processor). These stores do not use the caches, but do allow you to write to the MMIO range using full cache line writes and (typically) allows multiple concurrent stores in flight.
    The Linux “ioremap_wc” maps a region so that all stores are translated to streaming stores, but because the hardware allows this, it is typically possible to explicitly generate streaming stores (MOVNTA instructions) for MMIO regions that are mapped as cached.
  • Store Miss (aka “Read For Ownership”/RFO) — generates a request for exclusive access to a cache line — probably won’t work.
    The reason that it probably won’t work is that RFO requires that the line be invalidated in all the other caches, with the requesting core not allowed to use the data until it receives acknowledgements from all the other cores that the line has been invalidated — but an IO controller is not a coherence controller, so it (typically) cannot generate the required probe/snoop transactions.
    It is possible to imagine implementations that would convert this transaction to an ordinary 64 Byte IO read, but then some component of the system would have to “remember” that this translation took place and would have to lie to the core and tell it that all the other cores had responded with invalidate acknowledgements, so that the core could place the line in “M” state and have permission to write to it.
  • Victim Writeback — writes back a dirty line from cache to memory — probably won’t work.
    Assuming that you could get past the problems with the “store miss” and get the line in “M” state in the cache, eventually the cache will need to evict the dirty line. Although this superficially resembles a 64 Byte store, from the coherence perspective it is quite a different transaction. A Victim Writeback actually has no coherence implications — all of the coherence was handled by the RFO up front, and the Victim Writeback is just the delayed completion of that operation. Again, it is possible to imagine an implementation that simply mapped the Victim Writeback to a 64 Byte IO store, but when you get into the details there are features that just don’t fit. I don’t know of any processor implementation for which a mapping of Victim Writeback operations to MMIO space is supported.

There is one set of mappings that can be made to work on at least some x86-64 processors, and it is based on mapping the MMIO space *twice*, with one mapping used only for reads and the other mapping used only for writes:

  • Map the MMIO range with a set of attributes that allow write-combining stores (but only uncached reads). This mode is supported by x86-64 processors and is provided by the Linux “ioremap_wc()” kernel function, which generates an MTRR (“Memory Type Range Register”) of “WC” (write-combining).  In this case all stores are converted to write-combining stores, but the use of explicit write-combining store instructions (MOVNTA and its relatives) makes the usage more clear.
  • Map the MMIO range a second time with a set of attributes that allow cache-line reads (but only uncached, non-write-combined stores).
    For x86 & x86-64 processors, the MTRR type(s) that allow this are “Write-Through” (WT) and “Write-Protect” (WP).
    These might be mapped to the same behavior internally, but the nominal difference is that in WT mode stores *update* the corresponding line if it happens to be in the cache, while in WP mode stores *invalidate* the corresponding line if it happens to be in the cache. In our current application it does not matter, since we will not be executing any stores to this region. On the other hand, we will need to execute CLFLUSH operations to this region, since that is the only way to ensure that (potentially) stale cache lines are removed from the cache and that the subsequent read operation to a line actually goes to the MMIO-mapped device and reads fresh data.

On the particular device that I am fiddling with now, the *device* exports two address ranges using the PCIe BAR functionality. These both map to the same memory locations on the device, but each BAR is mapped to a different *physical* address by the Linux kernel. The different *physical* addresses allow the MTRRs to be set differently (WC for the write range and WT/WP for the read range). These are also mapped to different *virtual* addresses so that the PATs can be set up with values that are consistent with the MTRRs.

Because the IO device has no way to generate transactions to invalidate copies of MMIO-mapped addresses in processor caches, it is the responsibility of the software to ensure that cache lines in the “read” region are invalidated (using the CLFLUSH instruction on x86) if the data is updated either by the IO device or by writes to the corresponding (aliased) address in the “write” region.   This software based coherence functionality can be implemented at many different levels of complexity, for example:

  • For some applications the data access patterns are based on clear “phases”, so in a “phase” you can leave the data in the cache and simply invalidate the entire block of cached MMIO addresses at the end of the “phase”.
  • If you expect only a small fraction of the MMIO addresses to actually be updated during a phase, this approach is overly conservative and will lead to excessive read traffic.  In such a case, a simple “directory-based coherence” mechanism can be used.  The IO device can keep a bit map of the cache-line-sized addresses that are modified during a “phase”.  The processor can read this bit map (presumably packed into a single cache line by the IO device) and only invalidate the specific cache lines that the directory indicates have been updated.   Lines that have not been updated are still valid, so copies that stay in the processor cache will be safe to use.

Giving the processor the capability of reading from an IO device at low latency and high throughput allows a designer to think about interacting with the device in new ways, and should open up new possibilities for fine-grained off-loading in heterogeneous systems….

 

Share and Enjoy: These icons link to social bookmarking sites where readers can share and discover new web pages.
  • Facebook
  • LinkedIn

Posted in Accelerated Computing, Computer Hardware, Linux | Tagged: , , , , | Comments Off on Notes on Cached Access to Memory-Mapped IO Regions