Intuition might suggest that defining what a “stall cycle” is on a processor should be relatively straightforward. For some processors, this
is actually the case — particularly in-order processors with a very small number of execution units and a very small number of non-pipelined
instructions. For modern out-of-order processors, coming up with a precise and quantitative definition of “stall” involves numerous subtleties,
and deriving a methodology to measure such stalls is even more difficult.
This week I did some testing of “stalls” using the hardware performance counters in the Intel Xeon E5-2680 (“Sandy Bridge EP”) processors in
the Stampede system at TACC.
I found performance counter events that count stalls at two different places in the processor pipeline (with a third mentioned below, but not tested here):
- Two events count cycles in which uops are not sent from the RAT (Register Alias Table — the register renaming unit) to the RS (Reservation Station — queues uops until the instructions defining their source operands have been dispatched, then dispatches “ready” uops to the execution ports):
  - Event 0x0E, Umask 0x01: UOPS_ISSUED with the CMASK and INVERT flags: 0x01c3010e
    - Intel’s VTune calls this UOPS_ISSUED.STALL_CYCLES
  - Event 0xA2, Umask 0x01: RESOURCE_STALLS.ANY
    - Consistently delivers values about 1% to 3% lower than the UOPS_ISSUED.STALL_CYCLES event in my tests.
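For anyone trying to reproduce these measurements, the hex values above follow the bit layout of the IA32_PERFEVTSELx registers described in Vol 3 of the SW Developer’s Guide. A minimal decoder (my illustration here, not part of the tests) makes the field placement explicit:

```c
/*
 * Decode the raw IA32_PERFEVTSELx encodings quoted in the text,
 * e.g. 0x01c3010e for UOPS_ISSUED.STALL_CYCLES.  Field layout per
 * Vol 3 of the Intel SW Developer's Manual; the PC/INT/AnyThread
 * bits (19-21) are omitted for brevity.
 */
#include <stdio.h>
#include <stdint.h>

static void decode_perfevtsel(uint32_t enc)
{
    printf("encoding 0x%08x:\n", enc);
    printf("  event select = 0x%02x\n", enc & 0xff);         /* bits 7:0   */
    printf("  umask        = 0x%02x\n", (enc >> 8) & 0xff);  /* bits 15:8  */
    printf("  USR          = %u\n", (enc >> 16) & 1);        /* bit 16     */
    printf("  OS           = %u\n", (enc >> 17) & 1);        /* bit 17     */
    printf("  edge         = %u\n", (enc >> 18) & 1);        /* bit 18     */
    printf("  enable       = %u\n", (enc >> 22) & 1);        /* bit 22     */
    printf("  invert       = %u\n", (enc >> 23) & 1);        /* bit 23     */
    printf("  cmask        = %u\n", (enc >> 24) & 0xff);     /* bits 31:24 */
}

int main(void)
{
    decode_perfevtsel(0x01c3010e);  /* UOPS_ISSUED.STALL_CYCLES          */
    decode_perfevtsel(0x044304a3);  /* CYCLE_ACTIVITY.CYCLES_NO_DISPATCH */
    return 0;
}
```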
- Two events count cycles in which no uops are dispatched from the RS to any of the execution units (aka “ports”):
  - Event 0xA3, Umask 0x04: CYCLE_ACTIVITY.CYCLES_NO_DISPATCH with CMASK=4: 0x044304a3
    - I got the CMASK value from VTune — the documentation in Vol 3 of the SW Developer’s Guide is not very helpful.
  - Event 0xB1, Umask 0x02: UOPS_DISPATCHED.STALL_CYCLES_CORE: 0x01c302b1
    - This is very similar to an event used by VTune, but I use Umask 0x02 rather than 0x01. This will only make a difference on a system with HyperThreading enabled, and I don’t have any systems configured that way to test right now.
  - These two events differed by no more than a part per million in my tests.
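To collect these counts on Linux without programming the MSRs directly, the raw encodings can be handed to perf_event_open. Here is a minimal sketch under a couple of assumptions: the kernel supplies the USR/OS/enable bits itself (so only the event, Umask, CMASK, and INVERT fields go into the raw config), and the work() loop is just a stand-in for whatever code section you actually want to measure:

```c
/* Minimal perf_event_open sketch: measure the fraction of cycles with
 * no uops dispatched (CYCLE_ACTIVITY.CYCLES_NO_DISPATCH, CMASK=4).
 * Assumes a Linux kernel with perf support on a Sandy Bridge core. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_counter(uint32_t type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = config;
    attr.exclude_kernel = 1;              /* user-mode counts only */
    /* pid=0, cpu=-1: this thread, whichever CPU it runs on */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

/* Hypothetical workload; substitute the loop you want to measure. */
static double work(void)
{
    static double a[1 << 20];
    double sum = 0.0;
    for (int i = 0; i < (1 << 20); i++) sum += a[i];
    return sum;
}

int main(void)
{
    /* Event 0xA3, Umask 0x04, CMASK=4; USR/OS/EN added by the kernel. */
    int stall_fd = open_counter(PERF_TYPE_RAW,
                                (4ULL << 24) | (0x04 << 8) | 0xa3);
    int cycle_fd = open_counter(PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES);
    if (stall_fd < 0 || cycle_fd < 0) { perror("perf_event_open"); return 1; }

    volatile double sink = work();
    (void)sink;

    uint64_t stalls = 0, cycles = 0;
    if (read(stall_fd, &stalls, sizeof(stalls)) != (ssize_t)sizeof(stalls) ||
        read(cycle_fd, &cycles, sizeof(cycles)) != (ssize_t)sizeof(cycles)) {
        perror("read");
        return 1;
    }
    printf("no-dispatch cycles: %llu of %llu (%.1f%%)\n",
           (unsigned long long)stalls, (unsigned long long)cycles,
           cycles ? 100.0 * stalls / cycles : 0.0);
    return 0;
}
```

The counters start running as soon as they are opened, so for anything more precise you would bracket the region of interest with the PERF_EVENT_IOC_RESET and PERF_EVENT_IOC_ENABLE/DISABLE ioctls.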
As discussed in the Intel forum thread (link), the first two events can easily overcount stalls in codes that have a “stall-free” IPC of less than 4. For example, a code with a “stall-free” IPC of 1 could show 75% stall cycles using these events, with uops transferred from the RAT to the RS in one block of 4 uops every 4 cycles (leaving 3 cycles idle).
The second two events typically undercount stalls because they consider a cycle to be a “non-stall” cycle if any uops are dispatched from the RS to the execution units, even when those uops subsequently get rejected and retried because their input data is not in the cache. Using the STREAM benchmark as my test case, I often saw that the total number of uops dispatched to the execution ports was 20%-50% higher than the number of uops issued from the RAT to the RS. (This was based on a small number of test cases which were not intended to approach the upper bound on uop retries, so I assume that the worst case fraction of retries could be much higher. I have seen retries of floating-point instructions exceeding 12x, and that was not intended to be a worst-case upper bound either.)
Unfortunately, there is no way to count these execution retries directly, and no way to determine how many cycles had instructions dispatched that were all rejected and retried.
Note that one can also count cycles in which no instructions are retired. This was also discussed in the forum thread above, and has the same theoretical problem as counting at issue — the processor can retire at least four instructions per cycle, so if the non-stalled IPC is less than four, burstiness of instruction retirement can result in non-zero stall cycle counts even if there are some instructions executing every cycle.
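If you want to try the retirement-based count, the encoding is, as far as I can tell, Event 0xC2 (UOPS_RETIRED), Umask 0x01, with the same CMASK=1 and INVERT trick, i.e. 0x01c301c2 in the notation used above; treat that as an assumption to check against the event list for your processor. Reusing open_counter() from the sketch above:

```c
/* Cycles with no uops retired -- assumed Sandy Bridge encoding
 * (Event 0xC2 UOPS_RETIRED, Umask 0x01, CMASK=1, INVERT=1);
 * verify against the event list before trusting the numbers. */
int retire_stall_fd = open_counter(PERF_TYPE_RAW,
                                   (1ULL << 24) |  /* CMASK = 1    */
                                   (1ULL << 23) |  /* INVERT       */
                                   (0x01 << 8)  |  /* Umask 0x01   */
                                   0xc2);          /* UOPS_RETIRED */
```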
None of this discussion so far has explicitly dealt with the cause of the stalls. Intel provides a very interesting performance counter event that gives some insight into this issue. Event 0xA3 CYCLE_ACTIVITY has Umasks for “CYCLES_L2_PENDING” (0x01) and “CYCLES_NO_DISPATCH” (0x04). Again, the documentation in Vol 3 of the SW Developer’s Guide is not adequate to understand how to program this unit, but fortunately Intel’s VTune provides an example. The VTune event CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING is built from this event by combining the two Umasks and including a CMASK value of 5, giving the encoding 0x054305a3. (It is not at all clear why the CMASK value should be 5 in this case, but the event is clearly non-standard, since the combined Umask values are treated as a logical AND rather than the logical OR typically assumed for combined Umasks.)
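In perf_event_open terms the same event looks like the following (the kernel again owns the USR/OS/enable bits, so only the combined Umask and the CMASK survive into the raw config). This reuses open_counter() from the earlier sketch:

```c
/* CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING as VTune programs it:
 * combined Umask 0x01|0x04 = 0x05 (treated as a logical AND by the
 * hardware for this particular event) together with CMASK = 5. */
int l2_stall_fd = open_counter(PERF_TYPE_RAW,
                               (5ULL << 24) |  /* CMASK = 5          */
                               (0x05 << 8)  |  /* Umask 0x01 | 0x04  */
                               0xa3);          /* CYCLE_ACTIVITY     */
```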
In experiments with the STREAM benchmark, where the actual number of stall cycles should be around 90%, the values produced by CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING varied between 30% and 93% of the CYCLE_ACTIVITY.CYCLES_NO_DISPATCH counts (without the L2_PENDING qualifier). The lower values were seen with tests using streaming (nontemporal) stores, while the higher values were seen using ordinary (allocating) stores. This pattern makes it clear that this event counts store misses (RFOs) in the “L2_PENDING” category, but it leaves a “hole” in the memory stall cycle identification in the case where the memory stalls are due to streaming stores.
- For AVX codes there is an event that catches this reasonably well: Event 0xA2, Umask 0x08: RESOURCE_STALLS.SB (cycles with no issue from the RAT to the RS because the store buffers are full) shows that 70%-91% of the total cycles have issue stalls due to full store buffers. So looking at the max of CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING and RESOURCE_STALLS.SB gives a good indication of stalls due to memory for codes with either allocating stores or streaming stores (a sketch of this heuristic follows the list).
- For SSE codes with streaming stores the RESOURCE_STALLS.SB event is only 20%-37% of the total cycles. Even if you add the percentage stalls from this number to the percentage stalls using CYCLE_ACTIVITY.STALL_CYCLES_L2_PENDING, you only get 45%-59% of the total cycles, so I don’t yet have a set of events that can identify that all of the stall cycles are actually memory stalls. (Adding stall cycles in this way is not generally a good idea, since cycles can be stalled for both reasons. I only add the two here to show that they are both much too small to account for all of the stall cycles.)
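A trivial sketch of that heuristic, again my illustration rather than anything from the event list: take the larger of the two counts rather than their sum, since a given cycle can be stalled for both reasons at once.

```c
#include <stdint.h>

/* Memory-stall estimate: the max of the two counters, not their sum,
 * because a cycle can be stalled for both reasons simultaneously. */
static uint64_t memory_stall_estimate(uint64_t l2_pending_stalls,
                                      uint64_t sb_full_stalls)
{
    return l2_pending_stalls > sb_full_stalls ? l2_pending_stalls
                                              : sb_full_stalls;
}
```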
Summary: There are resources available to help identify memory-related stall cycles, but they are not as precise as one might like. In most cases these counters can identify when memory stalls are dominating execution time, and this is really what a performance analyst is looking for. Once the problem is identified, tuning work is primarily based on execution time of the code section of interest, with hardware performance counters playing (at most) an advisory role.