According to Dr. Bandwidth, performance analysis has two recurring themes:
- How fast should this code (or “simple” variations on this code) run on this hardware?
- If I am analyzing (apparent) performance shortfalls, how can I distinguish between cause and effect?
For very simple codes it may be possible to do a high-level analysis of performance limitations, but once the code becomes complex, it is often necessary to investigate the full stack. This can start with either a “top-down” or “bottom-up” approach, but in complex codes running on complex hardware, what is really required is both approaches, iterated until the interactions between all the components are understood. This is an intellectually challenging and labor-intensive exercise, requiring detailed review of the published details of each component of the system, and usually requiring significant “detective work” (customized microbenchmarks, hardware performance counter analysis, and creative thinking) to fill in the gaps.
MPI is a commonly used standard for implementing message-passing programs on computer clusters. The basic interfaces are simple to learn, and I suspect that many people have no idea how many levels of interacting components are involved in executing an MPI program on a cluster.
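As a point of reference, here is roughly the smallest useful MPI program, a sketch in C (the two-rank exchange and the buffer contents are my choices, for illustration). Everything below this level of the stack is invisible in the source code:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                             /* receive from rank 0 */
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Compile with `mpicc` and run with at least two ranks (e.g., `mpirun -np 2 ./a.out`).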
A discussion with a colleague prompted me to try to write down a list of the interacting components, along with some comments on why the interactions are so hard to understand.
Interacting components in the execution of an MPI job — a brief outline (from memory):
- The user source code, which contains an ordered set of calls to MPI routines.
  - There are many different types of MPI calls that can be used to perform the required communication.
  - When a program contains multiple MPI calls, there are (typically) many orders in which the routines can be called that are all functionally equivalent.
  - The MPI standard requires that many sequences of calls (appear to) be ordered, but there are exceptions. (The first sketch after this outline shows two functionally equivalent implementations of a simple pairwise exchange.)
- The user execution environment.
  - This will typically include environment variables that can influence the behavior of the MPI runtime, and might include environment variables that can influence the behavior of the lower-level shared-memory transport and/or network hardware interfaces.
  - The user environment defines the mapping of MPI ranks to hardware resources (cores, sockets, nodes).
- The MPI runtime library.
  - An MPI runtime library will typically include multiple implementations of each MPI function, and will choose the implementation(s) to execute based on environment variables and run-time information (sizes, communicators, etc.), in ways that are seldom transparent.
  - The source code for the library may not be available.
  - Low-level MPI tracing may be required to determine what the runtime library decided to execute, and even that may not be enough to fully capture the “decisions” made at lower levels of the implementation.
  - An MPI library may aggregate or re-order messages in any manner that provides the appearance of conforming to the ordering rules of the MPI specification.
    - In particular, the MPI library may violate ordering rules of the MPI standard if it can prove (typically via runtime checks) that the result is the same as what would be obtained by code that explicitly followed the ordering rules. This sort of speculation means that multiple runs of the same code on the same nodes may end up performing different sequences of operations.
  - MPI runtime libraries are typically built to interact with one or more lower-level interfaces to shared-memory transport implementation(s) and networking hardware.
    - The implementations of such interfaces may be based on anything from a superficial understanding of the lower-level interface to a deeply detailed one.
    - Different levels of understanding may lead to different choices for exactly how the MPI library implements a function. For example, the MPI library may (or may not) know specific details of the ordering rules followed by the low-level interface, and so may (or may not) be able to choose different sequences of low-level operations.
- The implementation(s) of the low-level interface(s) to the shared-memory transport and to the networking hardware.
  - These vary in complexity and sophistication.
  - The explicit functionality exposed in these low-level APIs may or may not map well to the functionality of the MPI standard(s) and/or to the functionality of the underlying hardware.
  - Mismatches in semantics or ordering rules force an implementation to choose overly conservative sequences of low-level operations, which typically reduces performance.
- The processor hardware available to support shared-memory transport.
  - Current processor architectures do not include cross-core communication or synchronization as first-class architectural concepts, so communication and synchronization must be implemented using ordered sequences of loads and stores. (A minimal flag-based handshake is sketched after this outline.)
  - Detailed analysis of the cache coherence transactions involved in shared-memory synchronization (especially across sockets) shows that the lower bound on (uncontested) latency is at least an order of magnitude higher than what a hardware implementation should be able to support. For highly contested accesses, shared-memory synchronization latency is typically several orders of magnitude higher than what a hardware implementation should be able to support. Few implementations have performance that approaches the lower bounds on shared-memory synchronization latency.
  - For almost all recent systems, a single thread of execution can only generate a fraction of the bandwidth available within or between sockets, limiting the performance of the “one MPI task per socket” hybrid programming model. (A threaded bandwidth sketch appears after this outline.)
    - Full bandwidth to/from local DDR4 memory typically requires 6 or more cores per socket.
    - Full bandwidth to/from high-bandwidth memory or L3 cache typically requires half or more of the cores in the socket.
    - Full bandwidth between sockets varies by processor generation and access pattern, but typically also requires 6 or more cores.
- The networking hardware.
  - Low-level implementation details and performance characterization data are not typically available for network interfaces and switches.
  - Most networking hardware was not designed to support highly efficient implementation of the MPI standards.
  - The MPI standards were not designed to support highly efficient implementations on most networking hardware.
- Runtime contention.
  - Although most supercomputing systems use “space-sharing” to provide exclusive access to nodes, most use an interconnect that does not separate traffic between different users’ jobs.
  - Many current interconnects (InfiniBand, OmniPath) use static routing, making self-conflicts unavoidable (in practical terms) for any MPI job spanning nodes on more than one leaf switch.
    - It is theoretically possible to avoid self-conflicts on a dedicated (non-oversubscribed) system, but this requires a herculean effort to load and parse the routing tables and to schedule all communications so that no link in the multi-level switch hierarchy carries more than one pairwise communication at a time.
    - Because routing tables are not required to be symmetric, it might not be possible to create a non-self-conflicting schedule of MPI rank pairs if bidirectional communication is used.
    - Scheduling communications to avoid self-conflicts requires global barriers between steps of pairwise communications (e.g., between steps in an all-to-all communication). This is not “natural” for many/most MPI jobs, and the barrier overhead may result in overall performance degradation if the messages in the original communication are short. (A phased pairwise-exchange sketch appears after this outline.)
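To make the point about functionally equivalent call orders concrete, here is a sketch of two implementations of the same pairwise exchange. The function names and the even-ranks-send-first convention are mine; the point is that the two versions expose very different ordering constraints to the MPI library:

```c
#include <mpi.h>

/* Version 1: blocking calls with an explicit order (even ranks send first)
 * to avoid deadlock.  Assumes each rank's partner has opposite parity. */
void exchange_blocking(int partner, int sendval, int *recvval, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank % 2 == 0) {
        MPI_Send(&sendval, 1, MPI_INT, partner, 0, comm);
        MPI_Recv(recvval, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(recvval, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
        MPI_Send(&sendval, 1, MPI_INT, partner, 0, comm);
    }
}

/* Version 2: nonblocking calls; the MPI library is free to progress the
 * two transfers in either order, or to overlap them. */
void exchange_nonblocking(int partner, int sendval, int *recvval, MPI_Comm comm)
{
    MPI_Request req[2];
    MPI_Irecv(recvval, 1, MPI_INT, partner, 0, comm, &req[0]);
    MPI_Isend(&sendval, 1, MPI_INT, partner, 0, comm, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
```

Both are correct MPI; which one runs faster depends on everything else in this outline.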
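The shared-memory bullets note that cross-core communication must be synthesized from ordered loads and stores. A minimal sketch of what that means, using a single-word handshake built with C11 atomics (real shared-memory transports use more elaborate queues, but the underlying mechanism is the same):

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static int payload;             /* the "message" */
static atomic_int flag = 0;     /* 0 = empty, 1 = full */

static void *producer(void *arg)
{
    payload = 42;                                  /* plain store of the data */
    atomic_store_explicit(&flag, 1,
                          memory_order_release);   /* ordered store publishes it */
    return NULL;
}

static void *consumer(void *arg)
{
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                          /* spin on ordered loads */
    printf("received %d\n", payload);              /* safe to read after acquire */
    return NULL;
}

int main(void)
{
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```

Each handshake costs at least one cache line transfer for the flag and another for the payload, which is where the coherence-transaction latencies discussed above come from.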
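The cores-per-socket bandwidth claims are easy to observe with a STREAM-style kernel timed at increasing thread counts. A sketch using OpenMP (the array size is an arbitrary choice, this is not a calibrated benchmark, and the 24-bytes-per-element accounting ignores write-allocate traffic):

```c
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 80000000L   /* ~640 MB per array: far larger than any cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    /* Parallel first-touch so pages are distributed across the NUMA domains. */
    #pragma omp parallel for
    for (long i = 0; i < N; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

    for (int nt = 1; nt <= omp_get_max_threads(); nt *= 2) {
        omp_set_num_threads(nt);
        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (long i = 0; i < N; i++)
            a[i] = b[i] + 3.0 * c[i];            /* STREAM "triad" kernel */
        double t1 = omp_get_wtime();
        /* 3 arrays x 8 bytes per element: read b, read c, write a */
        printf("%2d threads: %6.1f GB/s\n", nt,
               3.0 * N * sizeof(double) / (t1 - t0) / 1e9);
    }
    free(a); free(b); free(c);
    return 0;
}
```

On most recent multicore processors the reported bandwidth keeps rising well past one thread, consistent with the 6-or-more-cores observation in the outline.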
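Finally, a sketch of the barrier-separated pairwise scheduling described in the last bullets. With a power-of-two number of ranks, XOR-ing the rank with a phase index enumerates a perfect pairing in every phase; the MPI_Barrier between phases is exactly the overhead that makes this approach unattractive for short messages (the function name and the use of MPI_Sendrecv are my choices for illustration):

```c
#include <mpi.h>

/* All-to-all as (nranks - 1) barrier-separated pairwise phases.
 * Assumes nranks is a power of two so that (rank ^ phase) is a
 * perfect pairing in every phase. */
void phased_alltoall(const double *sendbuf, double *recvbuf,
                     int count, MPI_Comm comm)
{
    int rank, nranks;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nranks);

    for (int phase = 1; phase < nranks; phase++) {
        int partner = rank ^ phase;   /* every rank has exactly one partner */
        MPI_Sendrecv(&sendbuf[(size_t)partner * count], count, MPI_DOUBLE,
                     partner, 0,
                     &recvbuf[(size_t)partner * count], count, MPI_DOUBLE,
                     partner, 0, comm, MPI_STATUS_IGNORE);
        /* Keep phases aligned so transfers from different phases never
         * share a link at the same time. */
        MPI_Barrier(comm);
    }
    /* Copy the local block (the "phase 0" exchange with ourselves). */
    for (int i = 0; i < count; i++)
        recvbuf[(size_t)rank * count + i] = sendbuf[(size_t)rank * count + i];
}
```

Even a perfect phase schedule only avoids self-conflicts if the routing maps each phase’s pairs onto disjoint links, which is exactly the routing-table problem described above.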
Understanding any one of these components is not usually too hard (unless the implementation is undocumented).
Understanding the interaction of any two components is a bit more work, but is typically manageable by a single person.
Understanding enough of the details of all of the components to be able to reason about the interactions typically requires at least 2-3 people with complementary expertise. Keeping up with changes to the hardware and software, and applying the analyses to the ever-expanding ensemble of user applications would make this a large fraction of a full-time assignment for the team members.
To make it worse, a supercomputing center like TACC has many different systems with a variety of interconnect hardware (several generations of Mellanox InfiniBand, Intel OmniPath, Cray Aries) and several different MPI stacks (Intel MPI, Cray’s version of MPICH, MVAPICH, OpenMPI). Some of these are sufficiently different that it may be necessary to have independent teams specializing in “full-stack” performance analyses for different systems.