John McCalpin's blog

Dr. Bandwidth explains all….

Archive for July, 2018

Comments on timing short code sections on Intel processors

Posted by John D. McCalpin, Ph.D. on 23rd July 2018

(From a recent post of mine on the Intel software developer forums — some potentially useful words to go along with my new low-overhead-timers project…)

Updates on 2019-01-23 in blue.

There are lots of topics that you need to be aware of when attempting fine-grain timing.  A few of the more important ones are:

  • The RDTSC instruction increments at the rate of the “base” (or “nominal”) processor frequency, while instructions are executed at the “core frequency”.  The “core frequency” may be higher or lower than the “base” frequency, and it may change during your measurement interval.
    • If you have the ability to “pin” the processor frequency to match the “base” frequency, interpreting the results is often easier.
    • Whether you can fix the frequency or not, you will still need to measure several different things to be sure that you can unambiguously interpret the results.  More on this below.
  • With Turbo mode enabled, Intel processors will change their frequency based on how many cores are active.  When running a single user thread, you will often get the advertised single-core Turbo frequency, but if the operating system enables more cores to handle (even very short-lived) background processes, your frequency may drop unexpectedly.
  • Recent Intel processors often throttle down to a low frequency when not in use, and (depending on processor generation, BIOS settings, and OS settings) it may take longer than expected for the frequency to ramp back up to the expected values.
    • I usually precede the code that I want to test with a “warm-up” loop consisting of at least a few seconds of execution of instructions using the same SIMD width as the code that I want to test.
  • Always pin the thread you want to test to a single logical processor (if possible).
    • This allows you to use the RDPMC instruction to read the logical processor’s fixed-function performance counters.
    • It also reduces the chance of frequency changes or other stalls that may be incurred when moving a thread context to a different core.

For measurements of short duration (<< 1 second)

  • Intel processors will be halted during frequency changes, and recent Intel processors (Haswell and newer) will also be halted when activating and/or deactivating the portions of the pipeline(s) needed for 256-bit SIMD instructions and for 512-bit SIMD instructions.
    • The duration of these halts varies by product and in some cases by the amount of the frequency change.  I have seen values as low as 6 microseconds and as high as 50 microseconds for these types of transitions.

For measurements of very short duration (< 100’s of cycles)

  • The RDTSC instruction is not ordered with respect to the execution of other instructions.  Intel processors have gained increasing ability to execute instructions out of order over the past decade, allowing the execution of these instructions to be moved further away from where one might expect — in either direction.
  • The RDTSCP instruction is partially ordered — it will not execute until all prior instructions in program order have executed.
    • RDTSCP can still be executed later than expected, but not earlier.
    • This partial ordering can help expose the execution time of long-latency instructions (such as memory accesses or mispredicted branches) that occur shortly before the final value of the TSC is read using RDTSCP.
  • The LFENCE (“Load Fence”) instruction was originally intended to order memory load operations, but was later extended (architecturally) to order instruction execution.
    • The LFENCE instruction will not execute until all prior instructions have “completed locally”, and no later instructions will begin execution (even speculatively) until the LFENCE instruction completes.
      • It may not be safe to assume that “completed locally” and “retired” are equivalent.
      • “Completed locally” is definitely not the same as “globally visible” — SFENCE is still required if you need to ensure that stores are globally visible before continuing.
    • LFENCE is a fairly lightweight instruction, though the cost depends on the nature of the surrounding instructions. 
      • The repeat rate for consecutive LFENCE instructions is 4 cycles for mainstream Intel processors starting with Sandy Bridge (through at least Skylake).
    • The combination “LFENCE; RDTSC” has a slightly stronger ordering than RDTSCP.
      • RDTSCP waits until prior instructions have completed, but does not prevent later instructions from beginning execution before the RDTSCP instruction executes.  
        • If you want a lower bound on the execution time between “start” and “stop” instructions, the minimum requirement would be to add “RDTSC; LFENCE” before the “start” instruction and to add “RDTSCP” after the “stop” instruction. 
  • The Intel branch predictors are stranger than you might expect, and branch misprediction overheads are not trivial.
    • If you repeatedly execute an inner loop with a trip count of less than about 30, the branch predictor will “remember” which iteration is the final iteration of the loop, and it will correctly predict the loop exit.
    • If you increase the inner loop trip count to 35 or more, the branch predictor will not “remember” which iteration is the final iteration, so the final loop iteration will include a mispredicted branch, with an associated overhead of 15-20 cycles.
    • This can be very hard to understand if you are looking at results for loop trip counts from (for example) 16 to 64 and you see an unexpected bump of 15-20 cycles once the trip count exceeds a limit (typically in the 32-34 range).
    • This is even more confusing when you consider vectorization and loop unrolling, which the compiler may change significantly from one compilation to the next as you fiddle with your code.

Some recommendations:

  • A set of interfaces to the RDTSC and RDPMC instructions that have very low overheads are available at low-overhead-timers
  • I recommend measuring a minimum of four values:
    • Elapsed TSC cycles (using RDTSC or RDTSCP)
    • Instructions — using the RDPMC instruction with counter number (1<<30)+0
    • Core Cycles not halted — using the RDPMC instruction with counter number (1<<30)+1
    • Reference Cycles not halted — using the RDPMC instruction with counter number (1<<30)+2
  • If you have the ability to program the general-purpose core performance counters, I also recommend measuring at least two more values:
    • Instructions executed in kernel mode.
    • Core cycles not halted in kernel mode.
  • Compute these metrics:
    • Core Utilization = (Elapsed Reference Cycles not Halted) / (Elapsed TSC cycles)
      • If this is not very close to 1.000, the processor has been halted for frequency and/or pipeline activation issues, and you need to try to figure out why.
    • Average frequency while not halted = (Elapsed Core Cycles not Halted) / (Elapsed Reference Cycles not Halted) * Base_GHz
      • This should be compared to the expected frequency for your processor, given the number of cores that you think should be active.
    • Average net frequency = (Elapsed Core Cycles not Halted) / (Elapsed TSC cycles) * Base_GHz
      • This will tell you how much of your expected frequency has been lost due to processor halts.
    • Instructions Retired / Instructions Expected
      • For simple loops, you can look at the assembly code and count instructions.
      • This value will change significantly (and repeatably) if the compiler changes the vectorization of the loop.
      • This will change randomly (upward) if the OS schedules another process on the same logical processor during your measured section.
      • For measurements of 10,000 instructions or less, this will increase by a noticeable amount if an OS timer interrupt occurs during your measured section.
    • Kernel instructions / Total instructions
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.
    • Core Cycles not Halted in Kernel Mode / Core Cycles not Halted
      • Should be zero for short intervals (<1 millisecond) that don’t include a kernel timer interrupt.   Discard tests with non-zero values for these short cases.
      • Should be very small (<<1%) for any test that does not include an explicit call to a system routine.

Posted in Computer Architecture, Performance, Performance Counters | Comments Off on Comments on timing short code sections on Intel processors