John McCalpin's blog

Dr. Bandwidth explains all….

Memory Bandwidth on Xeon Phi (Knights Corner)

Posted by John D. McCalpin, Ph.D. on 5th December 2013

A Quick Note

There are a lot of topics that could be addressed here, but this short note will focus on bandwidth from main memory (using the STREAM benchmark) as a function of the number of threads used.

Published STREAM Bandwidth Results

  • Official STREAM submission at: http://www.cs.virginia.edu/stream/stream_mail/2013/0015.html
  • Compiled with icc -mmic -O3 -openmp -DNTIMES=100 -DSTREAM_ARRAY_SIZE=64000000 -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 -opt-streaming-stores always stream_5-10.c -o stream_intelopt.100x.mic
  • Configured with an array size of 64 million elements per array and 100 iterations (matching -DNTIMES=100 in the compile line).
  • Run with 60 threads (bound to separate physical cores) and Transparent Huge Pages.

 

Function    Best Rate MB/s   Avg time (sec)   Min time (sec)   Max time (sec)
Copy        169446.8         0.0062           0.0060           0.0063
Scale       169173.1         0.0062           0.0061           0.0063
Add         174824.3         0.0090           0.0088           0.0091
Triad       174663.2         0.0089           0.0088           0.0091

Memory Controllers

The Xeon Phi SE10P has 8 memory controllers, each controlling two 32-bit channels.  Each 32-bit channel has two GDDR5 chips, each having a 16-bit-wide interface.   Each of the 32 GDDR5 DRAM chips has 16 banks.  This gives a *raw* total of 512 DRAM banks.  BUT:

  • The two GDDR5 chips on each 32-bit channel are operating in “clamshell” mode — emulating a single GDDR5 chip with a 32-bit-wide interface.  (This is done for cost reduction — two 2 Gbit chips with x16 interfaces were presumably a cheaper option than one 4 Gbit chip with a x32 interface).  This reduces the effective number of DRAM banks to 256 (but the effective bank size is doubled from 2KiB to 4 KiB).
  • The two 32-bit channels for each memory controller operate in lockstep — creating a logical 64-bit interface.  Since every cache line is spread across the two 32-bit channels, this reduces the effective number of DRAM banks to 128 (but the effective bank size is doubled again, from 4 KiB to 8 KiB).

So the Xeon Phi SE10P memory subsystem should be analyzed as a 128-bank resource.   Intel has not disclosed the details of the mapping of physical addresses onto DRAM channels and banks, but my own experiments have shown that addresses are mapped to a repeating permutation of the 8 memory controllers in blocks of 62 cache lines.  (The other 2 cache lines in each 64-cacheline block are used to hold the error-correction codes for the block.)

Bandwidth vs. Number of Data Access Streams

One “rule of thumb” that I have found on Xeon Phi is that memory-bandwidth-limited jobs run best when the number of read streams across all the threads is close to, but less than, the number of GDDR5 DRAM banks.  On the Xeon Phi SE10P coprocessors in the TACC Stampede system, this is 128 (see Note 1).    Some data from the STREAM benchmark supports this hypothesis:

Kernel   Reads   Writes   2 threads/core   3 threads/core   4 threads/core
Copy     1       1        -0.8%            -5.2%            -7.3%
Scale    1       1        -1.0%            -3.3%            -6.7%
Add      2       1        -3.1%            -12.0%           -13.6%
Triad    2       1        -3.6%            -11.2%           -13.5%

(Bandwidth change relative to running one thread per core.)

From these results you can see that the Copy and Scale kernels have about the same performance with either 1 or 2 threads per core (61 or 122 read streams), but drop 3%-7% when generating more than 128 address streams, while the Add and Triad kernels are definitely best with one thread per core (122 read streams), and drop 3%-13% when generating more than 128 address streams.

So why am I not counting the write streams?

I found this puzzling for a long time, then I remembered that the Xeon E5-2600 series processors have a memory controller that supports multiple modes of prioritization. The default mode is to give priority to reads while buffering stores. Once the store buffers in the memory controller reach a “high water mark”, the mode shifts to giving priority to the stores while buffering reads. The basic architecture is implied by the descriptions of the “major modes” in section 2.5.8 of the Xeon E5-2600 Product Family Uncore Performance Monitoring Guide (document 327043 — I use revision 001, dated March 2012).

So *if* Xeon Phi adopts a similar multi-mode strategy, the next question is whether the duration in each mode is long enough that the open page efficiency is determined primarily by the number of streams in each mode, rather than by the total number of streams.

For STREAM Triad, the observed bandwidth is ~175 GB/s. Combining this with the observed average memory latency of about 275 ns (idle) means that at least 175 Bytes/ns * 275 ns = 48,125 Bytes need to be in flight at any time. This is about 768 cache lines (rounded up to a convenient number) or 96 cache lines per memory controller. For STREAM Triad, this corresponds to an average of 64 cache line reads and 32 cache line writes in each memory controller at all times. If the memory controller switches between “major modes” in which it does 64 cache line reads (from two read streams, and while buffering writes) and 32 cache line writes (from one write stream, and while buffering reads), the number of DRAM banks needed at any one time should be close to the number of banks needed for the read streams only….

Posted in Computer Architecture, Performance | Comments Off on Memory Bandwidth on Xeon Phi (Knights Corner)

Coherence with Cached Memory-Mapped IO

Posted by John D. McCalpin, Ph.D. on 30th May 2013

In response to my previous blog entry, a question was asked about how to manage coherence for cached memory-mapped IO regions.   Here are some more details…

Maintaining Coherence with Cached Memory-Mapped IO

For the “read-only” range, cached copies of MMIO lines will never be invalidated by external traffic, so repeated reads of the data will always return the cached copy.   Since there are no external mechanisms to invalidate the cache line, we need a mechanism that the processor can use to invalidate the line, so the next load to that line will go to the IO device and get fresh data.

There are a number of ways that a processor should be able to invalidate a cached MMIO line.  Not all of these will work on all implementations!

  1. Cached copies of MMIO addresses can, of course, be dropped when they become LRU and are chosen as the victim to be replaced by a new line brought into the cache.
    A code could read enough conflicting cacheable addresses to ensure that the cached MMIO line would be evicted.
    The number of conflicting addresses needed matches the cache associativity — typically 8 for a 32 KiB, 8-way set-associative data cache — but you need to be careful that the reads have not been rearranged to put the cached MMIO read in the middle of the “flushing” reads.   There are also some systems for which the pseudo-LRU algorithm has “features” that can break this approach.  (HyperThreading and shared caches can both add complexity in this dimension.)
  2. The CLFLUSH instruction operating on the virtual address of the cached MMIO line should evict it from the L1 and L2 caches.
    Whether it will evict the line from the L3 depends on the implementation, and I don’t have enough information to speculate on whether this will work on Xeon processors.   For AMD Family 10h processors, due to the limitations of the CLFLUSH implementation, cached MMIO lines are only allowed in the L1 cache.
  3. For memory mapped by the MTRRs as WP (“Write Protect”), a store to the address of the cached MMIO line should invalidate that line from the L1 & L2 data caches.  This will generate an *uncached* store, which typically stalls the processor for quite a while, so it is not a preferred solution.
  4. The WBINVD instruction (kernel mode only) will invalidate the *entire* processor data cache structure and according to the Intel Architecture Software Developer’s Guide, Volume 2 (document 325338-044), will also cause all external caches to be flushed.  Additional details are discussed in the SW Developer’s Guide, Volume 3.    Additional caution needs to be taken if running with HyperThreading enabled, as mentioned in the discussion of the CPUID instruction in the SW Developer’s Guide, Vol 2.
  5. The INVD instruction (kernel mode only) will invalidate all the processor caches, but it does this non-coherently (i.e., dirty cache lines are not written back to memory, so any modified data gets lost).   This is very likely to crash your system, and is only mentioned here for completeness.
  6. AMD processors support some extensions to the MTRR mechanism that allow read and write operations to the same physical address to be sent to different places (i.e., one to system memory and the other to MMIO).  This is *almost* useful for supporting cached MMIO, but (at least on the Family 10h processors), the specific mode that I wanted to set up (see addendum below) is disallowed for ugly microarchitectural reasons that I can’t discuss.

There are likely to be more complexities that I am not remembering right now, but the preferred answer is to bind the process doing the cached MMIO to a single core (and single thread context if using HyperThreading) and use CLFLUSH on the address you want to invalidate.   There are no guarantees, but this seems like the approach most likely to work.

 

Addendum: The AMD almost-solution using MTRR extensions.

The AMD64 architecture provides extensions to the MTRR mechanism called IORRs that allow the system programmer to independently specify whether reads to a certain region go to system memory or MMIO and whether writes to that region go to system memory or MMIO.   This is discussed in the “AMD64 Architecture Programmers Manual, Volume 2: System Programming” (publication number 24593).
I am using version 3.22 from September 2012, where this is described in section 7.9.

The idea was to use this to modify the behavior of the “read-only” MMIO mapping so that reads would go to MMIO while writes would go to system memory.  At first glance this seems strange — I would be creating a “write-only” region of system memory that could never be read (because reads to that address range would go to MMIO).

So why would this help?

It would help because sending the writes to system memory would cause the cache coherence mechanisms to be activated.   A streaming store (for example) to this region would be sent to the memory controller for that physical address range.  The memory controller treats streaming stores in the same way as DMA stores from IO devices to system memory, and it sends out invalidate messages to all caches in the system.  This would invalidate the cached MMIO line in all caches, which would eliminate both the need to pin the thread to a specific core and the problem of the CLFLUSH not reaching the L3 cache.

At least in the AMD Family 10h processors, this IORR function works, but due to some implementation issues in this particular use case it forces the region to the MTRR UC (uncached) type, which defeats my purpose in the exercise.   I think that the implementation issues could be either fixed or worked around, but since this is a fix to a mode that is not entirely supported, it is easy to understand that this never showed up as a high priority to “fix”.

Posted in Accelerated Computing, Computer Hardware, Linux | Comments Off on Coherence with Cached Memory-Mapped IO

Notes on Cached Access to Memory-Mapped IO Regions

Posted by John D. McCalpin, Ph.D. on 29th May 2013

When attempting to build heterogeneous computers with “accelerators” or “coprocessors” on PCIe interfaces, one quickly runs into asymmetries between the data transfer capabilities of processors and IO devices.  These asymmetries are often surprising — the tremendously complex processor is actually less capable of generating precisely controlled high-performance IO transactions than the simpler IO device.   This leads to ugly, high-latency implementations in which the processor has to program the IO unit to perform the required DMA transfers and then interrupt the processor when the transfers are complete.

For tightly-coupled acceleration, it would be nice to have the option of having the processor directly read and write to memory locations on the IO device.  The fundamental capability exists in all modern processors through the feature called “Memory-Mapped IO” (MMIO), but for historical reasons this provides the desired functionality without the desired performance.   As discussed below, it is generally possible to set up an MMIO mapping that allows high-performance writes to IO space, but setting up mappings that allow high-performance reads from IO space is much more problematic.

Processors only support high-performance reads when executing loads to cached address ranges.   Such reads transfer data in cache-line-sized blocks (64 Bytes on x86 architectures) and can support multiple concurrent read transactions for high throughput.  When executing loads to uncached address ranges (such as MMIO ranges), each read fetches only the specific bits requested (1, 2, 4, or 8 Bytes), and all reads to uncached address ranges are completely serialized with respect to each other and with respect to any other memory references.   So even if the latency to the IO device were the same as the latency to memory, using cache-line accesses could easily be (for example) 64 times as fast as using uncached accesses — 8 concurrent transfers of 64 Bytes using cache-line accesses versus one serialized transfer of 8 Bytes.

But is it possible to get modern processors to use their cache-line access mechanisms to read data from MMIO addresses?   The answer is a resounding, “yes, but….“.    The notes below provide an introduction to some of the issues….

It is possible to map IO devices to cacheable memory on at least some processors, but the accesses have to be very carefully controlled to keep within the capabilities of the hardware — some of the transactions to cacheable memory can map to IO transactions and some cannot.
I don’t know the details for Intel processors, but I did go through all the combinations in great detail as the technology lead of the “Torrenza” project at AMD.

Speaking generically, some examples of things that should and should not work (though the details will depend on the implementation):

  • Load miss — generates a cache line read — converted to a 64 Byte IO read — works OK.
    BUT, there is no way for the IO device to invalidate that line in the processor(s) cache(s), so coherence must be maintained manually using the CLFLUSH instruction. NOTE also that the CLFLUSH instruction may or may not work as expected when applied to addresses that are mapped to MMIO, since the coherence engines are typically associated with the memory controllers, not the IO controllers. At the very least you will need to pin threads doing cached MMIO to a single core to maximize the chances that the CLFLUSH instructions will actually clear the (potentially stale) copies of the cache lines mapped to the MMIO range.
  • Streaming Store (aka Write-Combining store, aka Non-temporal store) — generates one or more uncached stores — works OK.
    This is the only mode that is “officially” supported for MMIO ranges by x86 and x86-64 processors. It was added in the olden days to allow a processor core to execute high-speed stores into a graphics frame buffer (i.e., before there was a separate graphics processor). These stores do not use the caches, but do allow you to write to the MMIO range using full cache line writes and (typically) allows multiple concurrent stores in flight.
    The Linux “ioremap_wc” maps a region so that all stores are translated to streaming stores, but because the hardware allows this, it is typically possible to explicitly generate streaming stores (MOVNTA instructions) for MMIO regions that are mapped as cached.
  • Store Miss (aka “Read For Ownership”/RFO) — generates a request for exclusive access to a cache line — probably won’t work.
    The reason that it probably won’t work is that RFO requires that the line be invalidated in all the other caches, with the requesting core not allowed to use the data until it receives acknowledgements from all the other cores that the line has been invalidated — but an IO controller is not a coherence controller, so it (typically) cannot generate the required probe/snoop transactions.
    It is possible to imagine implementations that would convert this transaction to an ordinary 64 Byte IO read, but then some component of the system would have to “remember” that this translation took place and would have to lie to the core and tell it that all the other cores had responded with invalidate acknowledgements, so that the core could place the line in “M” state and have permission to write to it.
  • Victim Writeback — writes back a dirty line from cache to memory — probably won’t work.
    Assuming that you could get past the problems with the “store miss” and get the line in “M” state in the cache, eventually the cache will need to evict the dirty line. Although this superficially resembles a 64 Byte store, from the coherence perspective it is quite a different transaction. A Victim Writeback actually has no coherence implications — all of the coherence was handled by the RFO up front, and the Victim Writeback is just the delayed completion of that operation. Again, it is possible to imagine an implementation that simply mapped the Victim Writeback to a 64 Byte IO store, but when you get into the details there are features that just don’t fit. I don’t know of any processor implementation for which a mapping of Victim Writeback operations to MMIO space is supported.

There is one set of mappings that can be made to work on at least some x86-64 processors, and it is based on mapping the MMIO space *twice*, with one mapping used only for reads and the other mapping used only for writes:

  • Map the MMIO range with a set of attributes that allow write-combining stores (but only uncached reads). This mode is supported by x86-64 processors and is provided by the Linux “ioremap_wc()” kernel function, which generates an MTRR (“Memory Type Range Register”) of “WC” (write-combining).  In this case all stores are converted to write-combining stores, but the use of explicit write-combining store instructions (MOVNTA and its relatives) makes the usage more clear.
  • Map the MMIO range a second time with a set of attributes that allow cache-line reads (but only uncached, non-write-combined stores).
    For x86 & x86-64 processors, the MTRR type(s) that allow this are “Write-Through” (WT) and “Write-Protect” (WP).
    These might be mapped to the same behavior internally, but the nominal difference is that in WT mode stores *update* the corresponding line if it happens to be in the cache, while in WP mode stores *invalidate* the corresponding line if it happens to be in the cache. In our current application it does not matter, since we will not be executing any stores to this region. On the other hand, we will need to execute CLFLUSH operations to this region, since that is the only way to ensure that (potentially) stale cache lines are removed from the cache and that the subsequent read operation to a line actually goes to the MMIO-mapped device and reads fresh data.

On the particular device that I am fiddling with now, the *device* exports two address ranges using the PCIe BAR functionality. These both map to the same memory locations on the device, but each BAR is mapped to a different *physical* address by the Linux kernel. The different *physical* addresses allow the MTRRs to be set differently (WC for the write range and WT/WP for the read range). These are also mapped to different *virtual* addresses so that the PATs can be set up with values that are consistent with the MTRRs.

Because the IO device has no way to generate transactions to invalidate copies of MMIO-mapped addresses in processor caches, it is the responsibility of the software to ensure that cache lines in the “read” region are invalidated (using the CLFLUSH instruction on x86) if the data is updated either by the IO device or by writes to the corresponding (aliased) address in the “write” region.   This software based coherence functionality can be implemented at many different levels of complexity, for example:

  • For some applications the data access patterns are based on clear “phases”, so in a “phase” you can leave the data in the cache and simply invalidate the entire block of cached MMIO addresses at the end of the “phase”.
  • If you expect only a small fraction of the MMIO addresses to actually be updated during a phase, this approach is overly conservative and will lead to excessive read traffic.  In such a case, a simple “directory-based coherence” mechanism can be used.  The IO device can keep a bit map of the cache-line-sized addresses that are modified during a “phase”.  The processor can read this bit map (presumably packed into a single cache line by the IO device) and only invalidate the specific cache lines that the directory indicates have been updated.   Lines that have not been updated are still valid, so copies that stay in the processor cache will be safe to use.

Giving the processor the capability of reading from an IO device at low latency and high throughput allows a designer to think about interacting with the device in new ways, and should open up new possibilities for fine-grained off-loading in heterogeneous systems….

 

Posted in Accelerated Computing, Computer Hardware, Linux | Comments Off on Notes on Cached Access to Memory-Mapped IO Regions

Some comments on the Xeon Phi coprocessor

Posted by John D. McCalpin, Ph.D. on 17th November 2012

As many of you know, the Texas Advanced Computing Center is in the midst of installing “Stampede” — a large supercomputer using both Intel Xeon E5 (“Sandy Bridge”) and Intel Xeon Phi (aka “MIC”, aka “Knights Corner”) processors.

In his blog “The Perils of Parallel”, Greg Pfister commented on the Xeon Phi announcement and raised a few questions that I thought I should address here.

I am not in a position to comment on Greg’s first question about pricing, but “Dr. Bandwidth” is happy to address Greg’s second question on memory bandwidth!
This has two pieces — local memory bandwidth and PCIe bandwidth to the host. Greg also raised some issues regarding ECC and regarding performance relative to the Xeon E5 processors that I will address below. Although Greg did not directly raise issues of comparisons with GPUs, several of the topics below seemed to call for comments on similarities and differences between Xeon Phi and GPUs as coprocessors, so I have included some thoughts there as well.

Local Memory Bandwidth

The Intel Xeon Phi 5110P is reported to have 8 GB of local memory supporting 320 GB/s of peak bandwidth. The TACC Stampede system employs a slightly different model Xeon Phi, referred to as the Xeon Phi SE10P — this is the model used in the benchmark results reported in the footnotes of the announcement of the Xeon Phi 5110P. The Xeon Phi SE10P runs its memory slightly faster than the Xeon Phi 5110P, but memory performance is primarily limited by available concurrency (more on that later), so the sustained bandwidth is expected to be essentially the same.

Background: Memory Balance

Since 1991, I have been tracking (via the STREAM benchmark) the “balance” between sustainable memory bandwidth and peak double-precision floating-point performance. This is often expressed in “Bytes/FLOP” (or more correctly “Bytes/second per FP Op/second”), but these numbers have been getting too small (<< 1), so for the STREAM benchmark I use "FLOPS/Word" instead (again, more correctly "FLOPs/second per Word/second", where "Word" is whatever size was used in the FP operation). The design target for the traditional "vector" systems was about 1 FLOP/Word, while cache-based systems have been characterized by ratios anywhere between 10 FLOPS/Word and 200 FLOPS/Word. Systems delivering the high sustained memory bandwidth of 10 FLOPS/Word are typically expensive and applications are often compute-limited, while systems delivering the low sustained memory bandwidth of 200 FLOPS/Word are typically strongly memory bandwidth-limited, with throughput scaling poorly as processors are added.

Some real-world examples from TACC's systems:

  • TACC’s Ranger system (4-socket quad-core Opteron Family 10h “Barcelona” processors) sustains about 17.5 GB/s (2.19 GW/s for 8-Byte Words) per node, and has a peak FP rate of 2.3 GHz * 4 FP Ops/Hz/core * 4 cores/socket * 4 sockets = 147.2 GFLOPS per node. The ratio is therefore about 67 FLOPS/Word.
  • TACC’s Lonestar system (2-socket 6-core Xeon 5600 “Westmere” processors) sustains about 41 GB/s (5.125 GW/s) per node, and has a peak FP rate of 3.33 GHz * 4 Ops/Hz/core * 6 cores/socket * 2 sockets = 160 GFLOPS per node. The ratio is therefore about 31 FLOPS/Word.
  • TACC’s forthcoming Stampede system (2-socket 8-core Xeon E5 “Sandy Bridge” processors) sustains about 78 GB/s (9.75 GW/s) per node, and has a peak FP rate of 2.7 GHz * 8 FP Ops/Hz * 8 cores/socket * 2 sockets = 345.6 GFLOPS per node. The ratio is therefore a bit over 35 FLOPS/Word.

Again, the Xeon Phi SE10P coprocessors being installed at TACC are not identical to the announced product version, but the differences are not particularly large. According to footnote 7 of Intel’s announcement web page, the Xeon Phi SE10P has a peak performance of about 1.06 TFLOPS, while footnote 8 reports a STREAM benchmark performance of up to 175 GB/s (21.875 GW/s). The ratio is therefore about 48 FLOPS/Word — a bit less bandwidth per FLOP than the Xeon E5 nodes in the TACC Stampede system (or the TACC Lonestar system), but a bit more bandwidth per FLOP than provided by the nodes in the TACC Ranger system. (I will have a lot more to say about sustained memory bandwidth on the Xeon Phi SE10P over the next few weeks.)

The earlier GPUs had relatively high ratios of bandwidth to peak double-precision FP performance, but as the double-precision FP performance was increased, the ratios have shifted to relatively low amounts of sustainable bandwidth per peak FLOP. For the NVIDIA M2070 “Fermi” GPGPU, the peak double-precision performance is reported as 515.2 GFLOPS, while I measured sustained local bandwidth of about 105 GB/s (13.125 GW/s) using a CUDA port of the STREAM benchmark (with ECC enabled). This corresponds to about 39 FLOPS/Word. I don’t have sustained local bandwidth numbers for the new “Kepler” K20X product, but the data sheet reports that the peak memory bandwidth has been increased by 1.6x (250 GB/s vs 150 GB/s) while the peak FP rate has been increased by 2.5x (1.31 TFLOPS vs 0.515 TFLOPS), so the ratio of peak FLOPS to sustained local bandwidth must be significantly higher than the 39 for the “Fermi” M2070, and is likely in the 55-60 range — slightly higher than the value for the Xeon Phi SE10P.

Although the local memory bandwidth ratios are similar between GPUs and Xeon Phi, the Xeon Phi has a lot more cache to facilitate data reuse (thereby decreasing bandwidth demand). The architectures are quite different, but the NVIDIA Kepler K20x appears to have a total of about 2MB of registers, L1 cache, and L2 cache per chip. In contrast, the Xeon Phi has a 32kB data cache and a private 512kB L2 cache per core, giving a total of more than 30 MB of cache per chip. As the community develops experience with these products, it will be interesting to see how effective the two approaches are for supporting applications.

PCIe Interface Bandwidth

There is no doubt that the PCIe interface between the host and a Xeon Phi has a lot less sustainable bandwidth than what is available for either the Xeon Phi to its local memory or for the host processor to its local memory. This will certainly limit the classes of algorithms that can map effectively to this architecture — just as it limits the classes of algorithms that can be mapped to GPU architectures.

Although many programming models are supported for the Xeon Phi, one that looks interesting (and which is not available on current GPUs) is to run MPI tasks on the Xeon Phi card as well as on the host.

  • MPI codes are typically structured to minimize external bandwidth, so the PCIe interface is used only for MPI messages and not for additional offloading traffic between the host and coprocessor.
  • If the application allows different amounts of “work” to be allocated to each MPI task, then you can use performance measurements for your application to balance the work allocated to each processing component.
  • If the application scales well with OpenMP parallelism, then one option is to place one MPI task on each Xeon E5 chip on the host (with 8 threads per task) and one MPI task on the Xeon Phi (with anywhere from 60 to 240 threads per task, depending on how your particular application scales).
  • Xeon Phi supports multiple MPI tasks concurrently (with environment variables to control which cores an MPI task’s threads can run on), so applications that do not easily allow different amounts of work to be allocated to each MPI task might run multiple MPI tasks on the Xeon Phi, with the number chosen to balance performance with the performance of the host processors. For example if the Xeon Phi delivers approximately twice the performance of a Xeon E5 host chip, then one might allocate one MPI task on each Xeon E5 (with OpenMP threading internal to the task) and two MPI tasks on the Xeon Phi (again with OpenMP threading internal to the task). If the Xeon Phi delivers three times the performance of the Xeon E5, then one would allocate three MPI tasks to the Xeon Phi, etc….

Running a full operating system on the Xeon Phi allows more flexibility in code structure than is available on (current) GPU-based coprocessors. Possibilities include:

  • Run on host and offload loops/functions to the Xeon Phi.
  • Run on Xeon Phi and offload loops/functions to the host.
  • Run on Xeon Phi and host as peers, for example with MPI.
  • Run only on the host and ignore the Xeon Phi.
  • Run only on the Xeon Phi and use the host only for launching jobs and providing external network and file system access.

Lots of things to try….

ECC

Like most (all?) GPUs that support ECC, the Xeon Phi implements ECC “inline” — using a fraction of the standard memory space to hold the ECC bits. This requires memory controller support to perform the ECC checks and to hide the “holes” in memory that contain the ECC bits, but it allows the feature to be turned on and off without incurring extra hardware expense for widening the memory interface to support the ECC bits.

Note that widening the memory interface from 64 bits to 72 bits is straightforward with x4 and x8 DRAM parts — just use 18 x4 chips instead of 16, or use 9 x8 chips instead of 8 — but is problematic with the x32 GDDR5 DRAMs used in GPUs and in Xeon Phi. A single x32 GDDR5 chip has a minimum burst of 32 Bytes so a cache line can be easily delivered with a single transfer from two “ganged” channels. If one wanted to “widen” the interface to hold the ECC bits, the minimum overhead is one extra 32-bit channel — a 50% overhead. This is certainly an unattractive option compared to the 12.5% overhead for the standard DDR3 ECC DIMMs. There are a variety of tricky approaches that might be used to reduce this overhead, but the inline approach seems quite sensible for early product generations.

Intel has not disclosed details about the implementation of ECC on Xeon Phi, but my current understanding of their implementation suggests that the performance penalty (in terms of bandwidth) is actually rather small. I don’t know enough to speculate on the latency penalty yet. All of TACC’s Xeon Phi’s have been running with ECC enabled, but any Xeon Phi owner should be able to reboot a node with ECC disabled to perform direct latency and bandwidth comparisons. (I have added this to my “To Do” list….)

Speedup relative to Xeon E5

Greg noted the surprisingly reasonable claims for speedup relative to Xeon E5. I agree that this is a good thing, and that it is much better to pay attention to application speedup than to the peak performance ratios. Computer performance history has shown that every approach used to double performance results in less than doubling of actual application performance.

Looking at some specific microarchitectural performance factors:

  1. Xeon Phi supports a 512-bit vector instruction set, which can be expected to be slightly less efficient than the 256-bit vector instruction set on Xeon E5.
  2. Xeon Phi has slightly lower L1 cache bandwidth (in terms of Bytes/Peak FP Op) than the Xeon E5, resulting in slightly lower efficiency for overlapping compute and data transfers to/from the L1 data cache.
  3. Xeon Phi has ~60 cores per chip, which can be expected to give less efficient throughput scaling than the 8 cores per Xeon E5 chip.
  4. Xeon Phi has slightly less bandwidth per peak FP Op than the Xeon E5, so the memory bandwidth will result in a higher overhead and a slightly lower percentage of peak FP utilization.
  5. Xeon Phi has no L3 cache, so the total cache per core (32kB L1 + 512kB L2) is lower than that provided by the Xeon E5 (32kB L1 + 256kB L2 + 2.5MB L3, i.e., 1/8 of the 20 MB shared L3).
  6. Xeon Phi has higher local memory latency than the Xeon E5, which has some impact on sustained bandwidth (already considered), and results in additional stall cycles in the occasional case of a non-prefetchable cache miss that cannot be overlapped with other memory transfers.

None of these are “problems” — they are intrinsic to the technology required to obtain higher peak performance per chip and higher peak performance per unit power. (That is not to say that the implementation cannot be improved, but it is saying that any implementation using comparable design and fabrication technology can be expected to show some level of efficiency loss due to each of these factors.)

The combined result of all these factors is that the Xeon Phi (or any processor obtaining its peak performance using much more parallelism with lower-power, less complex processors) will typically deliver a lower percentage of peak on real applications than a state-of-the-art Xeon E5 processor. Again, this is not a “problem” — it is intrinsic to the technology. Every application will show different sensitivity to each of these specific factors, but few applications will be insensitive to all of them.

Similar issues apply to comparisons between the “efficiency” of GPUs vs state-of-the-art processors like the Xeon E5. These comparisons are not as uniformly applicable because the fundamental architecture of GPUs is quite different than that of traditional CPUs. For example, we have all seen the claims of 50x and 100x speedups on GPUs. In these cases the algorithm is typically a poor match to the microarchitecture of the traditional CPU and a reasonable match to the microarchitecture of the GPU. We don’t expect to see similar speedups on Xeon Phi because it is based on a traditional microprocessor architecture and shows similar performance characteristics.

On the other hand, something that we don’t typically see is the list of 0x speedups for algorithms that do not map well enough to the GPU to make the porting effort worthwhile. Xeon Phi is not better than Xeon E5 on all workloads, but because it is based on general-purpose microprocessor cores it will run any general-purpose workload. The same cannot be said of GPU-based coprocessors.

Of course these are all general considerations. Performing careful direct comparisons of real application performance will take some time, but it should be a lot of fun!

Posted in Computer Hardware | Comments Off on Some comments on the Xeon Phi coprocessor