Is it unimportant just because we are not talking about it?
I was recently asked for comments about the value of increased bisection bandwidth in computer clusters for high performance computing. That got me thinking that most internet news/comment infrastructures are built around "engagement": they amplify signals that are already strong and damp those that do not meet an interest threshold. While that approach certainly has a role, as a scientist I also need to take note of topics that might be important but are currently generating little content. Bisection bandwidth of clustered computers seems to be one such topic….
Some notes relevant to the history of the discussion of bisection bandwidth in High Performance Computing
There has not been a lot of recent discussion of high-bisection-bandwidth computers, but this was a topic that came up frequently in the late 1990s and early 2000s. A good reference is the 2002 Report on High End Computing for the National Security Community, which I participated in as the representative of IBM (ref 1). The section starting on page 35 ("Systems, Architecture, Programmability, and Components Working Group") discusses two approaches to "supercomputing" – one focusing on aggregating cost-effective peak performance (the "type T" (Transistor) systems), and the other focusing on providing the tightest integration and interconnect performance ("type C" (Communication) systems). A major influence on this distinction was Burton Smith, founder of Tera Computer Company and developer of the Tera MTA system. The Tera MTA was a unique architecture with no caches, with memory distributed/interleaved/hashed across the entire system, and with processors designed to tolerate the memory latency and to effectively synchronize on memory accesses (using "full/empty" metadata bits in the memory).
The 2002 report led fairly directly to the DARPA High Productivity Computing Systems (HPCS) project (2002-2010), which provided direct funding to several companies to develop hardware and software technologies to make supercomputers significantly easier to use. Phase 1 was just some seed money to write proposals, and included about 7 companies. Phase 2 was a much larger ($50M over 3 years to each recipient, if I recall correctly) set of grants for the companies to do significant high-level design work. Phase 3 grants ($200M-$250M over 5 years to each recipient) were awarded to Cray and IBM (ref 2). I was (briefly) the large-scale systems architecture team lead on the IBM Phase 2 project in 2004.
Both the Cray and IBM projects were characterized by a desire to improve effective bisection bandwidth, and both used hierarchical all-to-all interconnect topologies.
(If I recall correctly), the Cray project funded the development of the “Cascade” interconnect which eventually led to the Cray XC30 series of supercomputers (ref 3). Note that the Cray project funded only the interconnect development, while standard AMD and/or Intel processors were used for compute. The inability to influence the processor design limited what Cray was able to do with the interconnect. The IBM grant paid for the development of the “Torrent” bridge/switch chip for an alternate version of an IBM Power7-based server. Because IBM was developing both the processor and the interconnect chip, there was more opportunity to innovate.
I left IBM at the end of 2005 (near the end of Phase 2). IBM did eventually complete the implementation funded by DARPA, but backed out of the "Blue Waters" system at NCSA (ref 4) – an NSF-funded supercomputer that was very deliberately pitched as a "High Productivity Computing System" to benefit from the DARPA HPCS development projects. I have no inside information from the period when IBM backed out, but it is easy to suspect that IBM wanted/needed more money than was available in the budget to deliver the full-scale system. The "Blue Waters" system was replaced by a Cray – but since the DARPA-funded "Cascade" interconnect was not ready, Blue Waters used an older interconnect implementation (and older AMD processors). IBM sold only a handful of small Power7 systems with the "Torrent" interconnect, mostly to weather forecasting centers. As far as I can tell, none of these systems were large enough for the Torrent interconnect to show off its (potential) usefulness.
So after about 10 years of effort and a DARPA HPCS budget of over $600M, the high performance computing community got the interconnect for the Cray XC30. I think it was a reasonably good interconnect, but I never used it. TACC just retired our Cray XC40 and its interconnect performance was fine, but not revolutionary. After this unimpressive return on investment, it is perhaps not surprising that there has been a lack of interest in funding for high-bisection-bandwidth systems.
Nor is there an unambiguous market for high-bisection-bandwidth systems! It is easy enough to identify application areas and user communities who will claim to "need" increased bandwidth, but historically they have failed to follow up with actual purchases of the few high-bandwidth systems that have been offered over the years. A recurring pattern in the HPC market has been that, given enough time, most users who "demand" special characteristics will figure out how to work around their absence with additional software optimization effort, or with changes to their strategy for computing. A modest fraction will give up on scaling their computations to larger sizes and will just make do with the more modest ongoing improvements in single-node performance to advance their work.
Architectural Considerations
When considering either local memory bandwidth or global bandwidth, a key principle to keep in mind is concurrency. Almost all processor designs have very limited available concurrency per core, so lots of cores are needed just to generate enough cache misses to fill the local memory pipeline. As an example, TACC has recently deployed a few hundred nodes using the Intel Xeon Platinum 8380 processor (3rd generation Xeon Scalable Processor, aka "Ice Lake Xeon"). This has 40 cores and 8 channels of DDR4/3200 DRAM, for a peak memory bandwidth of 8 channels * 3200 MT/s * 8 Bytes = 204.8 GB/s. The latency for memory access is about 80 ns, so the latency-bandwidth product is
204.8 GB/s * 80 nanoseconds = 16384 bytes, or 256 cache lines
Each core can directly sustain about 10-12 concurrent cache misses, and can indirectly generate a few more via the L2 hardware prefetchers; call it 16 concurrent cache line requests per core, so 256/16 = 16 of the (up to) 40 cores are required just to fill the pipeline to DRAM.
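This concurrency requirement is just Little's Law (bytes in flight = bandwidth * latency). A minimal sketch of the arithmetic using the numbers above; the 16 outstanding misses per core is the rough estimate from the text, not a measured value:

```python
# Little's Law: bytes in flight = bandwidth * latency
CACHE_LINE_BYTES = 64

peak_bw_Bps = 8 * 3200e6 * 8    # 8 channels * 3200 MT/s * 8 Bytes = 204.8 GB/s
latency_s   = 80e-9             # ~80 ns average load latency to DRAM

lines_in_flight = peak_bw_Bps * latency_s / CACHE_LINE_BYTES   # 256 cache lines
misses_per_core = 16            # ~10-12 demand misses plus L2 prefetches (estimate)
cores_needed    = lines_in_flight / misses_per_core            # 16 cores

print(f"{lines_in_flight:.0f} cache lines in flight -> "
      f"{cores_needed:.0f} cores needed just to fill the DRAM pipeline")
```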
For IO-based interconnects, latencies are much higher and bandwidths are lower. The current state of the art for a node-level interconnect is 200 Gbit/s (25 GB/s), such as the InfiniBand HDR fabric. The high latency favors a "put" model of communication, with one-way RDMA put latency at about 800 ns (unloaded) through a couple of switch hops. The latency-bandwidth product is 25 GB/s * 800 ns = 20000 Bytes. This is only slightly higher than the local-memory case, but that is only because (1) the bandwidth is fairly low, and (2) the ~800 ns latency applies only to "put" operations; it must be roughly doubled for "get" operations, which require a round trip.
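The same Little's Law sketch applied to the fabric; the put latency is the unloaded figure quoted above, and the get latency is simply assumed to be twice the put latency:

```python
# Bytes that must be in flight to saturate one 200 Gb/s link (Little's Law).
link_bw_Bps   = 200e9 / 8          # 200 Gbit/s = 25 GB/s
put_latency_s = 800e-9             # ~800 ns one-way RDMA put, unloaded, a few switch hops
get_latency_s = 2 * put_latency_s  # assumption: a get requires a full round trip

print(f"put: {link_bw_Bps * put_latency_s:.0f} Bytes in flight")  # 20000 Bytes
print(f"get: {link_bw_Bps * get_latency_s:.0f} Bytes in flight")  # 40000 Bytes
```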
A couple of notes:
- The local latency of 80 ns is dominated by a serialized check of the three levels of the cache (with the 3rd level hashed/interleaved around the chip in Intel processors), along with the need to traverse many asynchronous boundaries. (An old discussion on this blog, "Memory Latency Components" (2011-03-10), is a good start, but newer systems have a 3rd level of cache and an on-chip 2D mesh interconnect adding to the latency components.) The effective local latency could be reduced significantly by using a "scratchpad" memory architecture and a more general block-oriented interface (to more effectively use the open-page features of the DRAM), but it would be very challenging to get below about 30 ns read latency.
- The IO latency of ~800 ns for a put operation is dominated by the incredibly archaic architecture of IO systems. It is hard to get good data, but it typically requires 300-400 ns for a core's uncached store operation to reach its target PCIe-connected device. This is absolutely not necessary from a physics perspective, but it cannot easily be fixed without a holistic approach to re-architecting the system to support communication and synchronization as first-class features. One could certainly design processors that could put IO requests on external pins in 20 ns or less; then the speed-of-light term becomes important in the latency equation (as it should be), as the rough estimate below illustrates.
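A back-of-the-envelope sketch of that speed-of-light term; the ~5 ns per meter propagation rate is the standard figure for signals traveling at roughly 2/3 the speed of light in fiber or twinax, and the 30 m cable run is an assumed, illustrative value rather than a measurement of any particular system:

```python
# Speed-of-light (propagation) contribution to one-way latency.
NS_PER_METER   = 5.0    # ~2/3 c in fiber or twinax -> roughly 5 ns per meter
cable_length_m = 30.0   # assumed rack-to-rack cable run, for illustration only

propagation_ns = NS_PER_METER * cable_length_m
print(f"~{propagation_ns:.0f} ns of one-way propagation delay")
# If the processor-to-pin path took only ~20 ns, this propagation term
# (plus switch traversal) would dominate the latency budget.
```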
It has been a while since I reviewed the capabilities of optical systems, but the considerations above usually make extremely high bisection bandwidth effectively un-exploitable with current processor architectures.
Future technology developments (e.g., "AttoJoule Optoelectronics for Low-Energy Information Processing and Communication") may dramatically reduce the cost, but a new architecture will be required to reduce the latency enough to make the bandwidth useful.
References: