While working on the MPI implementation of the STREAM benchmark, I realized that there are some subtleties in timing that can easily lead to inaccurate or misleading results. This post is a transcription of my notes as I worked through the issues.
Primary requirement: I want a measure of wall clock time that is guaranteed to start before any rank does work and to end after all ranks have finished their work.
Secondary goal: I also want the start time to be as late as possible relative to the initiation of work by any rank, and for the end time to be as early as possible relative to the completion of the work by all ranks.
I am not particularly concerned about OS scheduling issues, so I will assume that each timer call executes very close in time to the completion of the preceding statement and the initiation of the subsequent statement. Any deviations caused by stalls between the timers, barriers, and work must be in the direction of increasing the reported time, never decreasing it. (This is a corollary of the primary requirement.)
The discussion here will be based on a simple example, where the “t” variables are (local) wall clock times for MPI rank k and WORK() represents the parallel workload that I am testing.
Generically, I want:
    t_start(k) = time()
    WORK()
    t_end(k)   = time()
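In C with MPI, the per-rank body might look like this fragment. This is only a sketch: do_work() is a hypothetical stand-in for WORK(), and MPI_Wtime() (which returns wall clock seconds as a double) is used as the timer:

    #include <mpi.h>

    void do_work(void);   /* hypothetical stand-in for WORK() */

    double t_start, t_end;

    t_start = MPI_Wtime();
    do_work();
    t_end   = MPI_Wtime();
    /* (t_end - t_start) is local to this rank and says nothing
       about when the other ranks started or finished.          */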
For an MPI job, however, the methodology needs to be provably correct for arbitrary (real) clock skew across nodes, as well as for arbitrary offsets in absolute time between nodes. (I am deliberately ignoring the rare case in which a clock is modified on one or more nodes during a run; most time protocols try hard to avoid such shifts, and instead change the rate at which the clock is incremented to drive synchronization.)
After some thinking, I came up with this pseudo-code, which is executed independently by each MPI rank (indexed by “k”):
    t0(k) = time()
    MPI_Barrier()
    t1(k) = time()
    WORK()
    t2(k) = time()
    MPI_Barrier()
    t3(k) = time()
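Rendered as a complete (if minimal) C program, the four-timestamp scheme might look like the sketch below. The do_work() function is again a hypothetical stand-in for the workload, and all four timestamps are taken on each rank's own local clock:

    #include <mpi.h>

    /* Hypothetical stand-in for the WORK() being measured. */
    static void do_work(void) { /* ... parallel workload ... */ }

    int main(int argc, char **argv)
    {
        double t0, t1, t2, t3;

        MPI_Init(&argc, &argv);

        t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);   /* no rank exits until all ranks have entered */
        t1 = MPI_Wtime();

        do_work();

        t2 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);   /* no rank exits until all ranks have entered,
                                          i.e., until all of the work is complete    */
        t3 = MPI_Wtime();

        /* ... reductions over t0..t3 as discussed below ... */

        MPI_Finalize();
        return 0;
    }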
If the clocks are synchronized, then all I need is:
    tstart = min(t1(k)), k=1..numranks
    tstop  = max(t2(k)), k=1..numranks
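With synchronized clocks, this pair of reductions is all that is needed. A sketch, continuing from the t1 and t2 values in the program above and collecting the results on rank 0:

    double tstart, tstop;

    /* Global start: earliest post-barrier timestamp across all ranks. */
    MPI_Reduce(&t1, &tstart, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    /* Global stop: latest pre-barrier timestamp across all ranks.     */
    MPI_Reduce(&t2, &tstop,  1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    /* On rank 0, (tstop - tstart) is the reportable elapsed time. */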
If the clocks are not synchronized, then I need to make some use of the barriers — but exactly how?
In the pseudo-code above, the barriers ensure that the following two statements are true:
- For the start time, t0(k) is guaranteed to be earlier than the initiation of work on any rank, because no rank can leave the first barrier (and begin its work) until every rank has entered it (and therefore already recorded its t0).
- For the end time, t3(k) is guaranteed to be later than the completion of work on all ranks, because no rank can leave the second barrier (and record its t3) until every rank has entered it (and therefore already completed its work).
These statements are true for each rank individually, so each (t3(k) - t0(k)) is an interval, measured entirely on rank k's own clock, that is guaranteed to contain all of the work. The tightest bound available from the collection of t0(k) and t3(k) values is therefore:

    tstop - tstart = min(t3(k) - t0(k)), k=1..numranks

This gives a (tstop - tstart) that is at least as large as the time required for the actual work plus the time required for the two MPI_Barrier() operations, so it may over-report the elapsed time, but it can never under-report it.
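Because each rank's (t3 - t0) difference is taken on a single clock, it is immune to inter-node offsets, and the minimum across ranks can be computed with one more reduction. A sketch, continuing from the timestamps in the program above:

    double t_local, t_elapsed;

    /* Each (t3 - t0) is measured on one clock and contains all of the work, */
    /* so the tightest defensible estimate is the minimum across ranks.      */
    t_local = t3 - t0;
    MPI_Reduce(&t_local, &t_elapsed, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);

    /* On rank 0: t_elapsed >= actual work time + two MPI_Barrier() calls. */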