As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.
This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:
anon says
Short comment re: https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/
On most Intel processors, the branch predictor remembers a distribution over short (length ~30) subsequences of branch outcomes. This is often enough to reconstruct much longer sequences of branches, similar to the way short reads can be assembled into a long genome. A typical Intel processor can almost perfectly predict a random 1:1 taken/not-taken pattern with a period of 2000 branches, and a benchmarking loop induces exactly this kind of periodic pattern. See https://discourse.julialang.org/t/psa-microbenchmarks-remember-branch-history/17436 for a discussion on the JuliaLang forums.
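A minimal sketch of this effect (the pattern length, repetition count, and the use of __rdtsc/x86intrin.h below are assumptions for illustration, not measurements from any specific machine): when a benchmark replays the same data-dependent taken/not-taken pattern on every iteration, the later iterations run faster because the predictor has learned the pattern.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>   /* __rdtsc(); assumes GCC/Clang on x86 */

#define N    2000        /* pattern period, roughly the scale mentioned above */
#define REPS 20          /* benchmark repetitions */

int main(void)
{
    /* Fixed pseudo-random taken/not-taken pattern: the data never change,
     * so every benchmark iteration replays the same branch sequence. */
    static int pattern[N];
    srand(42);
    for (int i = 0; i < N; i++)
        pattern[i] = rand() & 1;

    volatile long sink = 0;
    for (int rep = 0; rep < REPS; rep++) {
        uint64_t t0 = __rdtsc();
        long sum = 0;
        for (int i = 0; i < N; i++) {
            if (pattern[i])          /* data-dependent branch with a periodic outcome */
                sum += i;
            else
                sum -= i;
        }
        uint64_t t1 = __rdtsc();
        sink += sum;
        /* Expect the cycle count to drop after the first few repetitions as
         * the predictor learns the repeated pattern (compile with -O0 or -O1;
         * aggressive optimization may if-convert the branch and hide the effect). */
        printf("rep %2d: %llu cycles\n", rep, (unsigned long long)(t1 - t0));
    }
    return (int)(sink & 1);
}
```

By contrast, regenerating a fresh random pattern before each repetition should keep the miss rate near 50% and hold the per-repetition time roughly constant.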
Depending on context, missing a branch can be much more expensive than the nominal misprediction penalty: the mispredicted branch can lead speculative execution down a rabbit hole that consumes memory bandwidth, replaces useful cache entries with garbage, and misses the opportunity to fetch the correct cache lines. This gets worse if the speculative execution window is especially long, e.g., when the mispredicted branch is waiting on a load from memory before it can resolve.
Sorry for replying here. The comment section of the relevant post was already closed (feel free to move this reply there).
John D. McCalpin, Ph.D. says
Thanks for the comments! (I can’t figure out how to move the comment in WordPress, so I will leave it here…)
I have not done many experiments with the branch predictor for nested loops or other sequences of conditional branches. In some unpublished work on signal processing algorithms, I noticed that the branch predictor “remembered” which array indices would pass or fail certain compares, so performance increased when I re-ran the filter on the same input data. If I recall correctly, it took 4-5 iterations to reach asymptotic performance, but since this performance artifact was not representative of the actual use case (which would only see the data once), I was more focused on avoiding it than on understanding the details.
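For illustration only (this is not the original signal-processing code; the threshold, input data, and pass count are made up, and POSIX clock_gettime is assumed), the warm-up can be seen by timing repeated passes of a data-dependent compare over the same input:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      100000     /* made-up array length */
#define PASSES 8          /* enough passes to see the warm-up */
#define THRESH 0.5        /* made-up threshold */

int main(void)
{
    /* Fixed input: which indices pass the compare is determined once,
     * so the predictor can gradually "remember" the pattern across passes. */
    static double x[N];
    srand(1);
    for (int i = 0; i < N; i++)
        x[i] = (double)rand() / RAND_MAX;

    for (int pass = 0; pass < PASSES; pass++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        double sum = 0.0;
        long npass = 0;
        for (int i = 0; i < N; i++) {
            if (x[i] > THRESH) {   /* data-dependent branch, same outcome every pass */
                sum += x[i];
                npass++;
            }
        }

        clock_gettime(CLOCK_MONOTONIC, &t1);
        double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
        /* Note: an aggressively optimizing compiler may turn the branch into a
         * conditional move or masked vector code, which would hide the effect. */
        printf("pass %d: %8.1f us  (npass=%ld, sum=%.3f)\n", pass, us, npass, sum);
    }
    return 0;
}
```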
I have added some updates to that post relating to the LFENCE instruction….
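For readers who have not seen that post, the general shape of the technique is sketched below (a sketch only, assuming GCC/Clang-style intrinsics; it is not the exact code from the post): LFENCE is used as a cheap local serializer so that the RDTSC reads are not reordered around the code being timed.

```c
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>   /* _mm_lfence(), __rdtsc(); assumes GCC/Clang on x86 */

/* Time a short code section with RDTSC, fenced with LFENCE so that the
 * out-of-order engine cannot move the timestamp reads across the work. */
static inline uint64_t timed_section(void)
{
    _mm_lfence();                  /* wait until preceding instructions complete locally */
    uint64_t start = __rdtsc();
    _mm_lfence();                  /* keep the timed work from starting before the first read */

    /* ... short code section to be timed goes here ... */

    _mm_lfence();                  /* let the timed work complete before the second read */
    uint64_t stop = __rdtsc();
    _mm_lfence();

    return stop - start;
}

int main(void)
{
    /* With an empty section this measures the fencing + RDTSC overhead itself;
     * the first few measurements are usually noisier than the rest. */
    for (int i = 0; i < 5; i++)
        printf("measurement %d: %llu cycles\n", i,
               (unsigned long long)timed_section());
    return 0;
}
```

Whether the extra fences are worth their overhead depends on how short the timed section is.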