The University of Texas at Austin

January 9, 2019, Filed Under: Computer Architecture, Performance

New Year’s Updates

As part of my attempt to become organized in 2019, I found several draft blog entries that had never been completed and made public.

This week I updated three of those posts — two really old ones (primarily of interest to computer architecture historians), and one from 2018:

  • July 2012: Local and Remote Memory Latency on AMD Processors in 2-socket and 4-socket servers
  • December 2013: Notes on Memory Bandwidth on the Xeon Phi (Knights Corner) Coprocessor
  • January 2018: A Peculiar Throughput Limitation in the Intel Xeon Phi x200 (Knights Landing) Processor

Comments

  1. anon says

    January 18, 2019 at 9:59 am

    Short comment re: https://sites.utexas.edu/jdm4372/2018/07/23/comments-on-timing-short-code-sections-on-intel-processors/

    On most Intel processors, the branch predictor remembers a distribution over short (length ~30) subsequences. This is often enough to reconstruct much longer sequences of branches, much as short reads can be assembled into a long genome. A typical Intel processor can almost perfectly predict a 1:1 random periodic branching pattern with a period of 2000, and a benchmarking loop induces exactly such a periodic pattern. See https://discourse.julialang.org/t/psa-microbenchmarks-remember-branch-history/17436 for a discussion on the Julia language forums.

    Depending on context, mispredicting a branch can be much more expensive than expected: the mispredicted branch can lead speculative execution down a rabbit hole that eats memory bandwidth, replaces good cache entries with garbage, and misses the opportunity to fetch the correct lines. If the speculative-execution window is especially long (e.g., the mispredicted branch is waiting on memory in order to resolve), this gets worse.

    Sorry for replying here. The comment section of your relevant post was already closed (feel free to move this reply there).
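    The periodic-pattern effect described in this comment can be sketched in C. This is a minimal illustration, not code from the original discussion: the pattern length, repetition count, and function names are all chosen for the example. The idea is that a benchmarking loop replays the same (random but fixed) taken/not-taken sequence on every iteration, which gives the predictor the chance to memorize the whole period.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define PERIOD 2000   /* length of the repeated branch pattern */
    #define REPS   100    /* outer benchmark repetitions */

    /* Walk the pattern once; each element drives a data-dependent branch
     * whose outcome is fixed by the data, not by the loop index. */
    long run_pattern(const int *pattern, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            if (pattern[i])
                sum += i;
            else
                sum -= i;
        }
        return sum;
    }

    int main(void) {
        int pattern[PERIOD];
        for (int i = 0; i < PERIOD; i++)
            pattern[i] = rand() & 1;  /* 1:1 random, but fixed, pattern */

        volatile long sink = 0;       /* keep the work from being optimized away */
        for (int rep = 0; rep < REPS; rep++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            sink += run_pattern(pattern, PERIOD);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            long ns = (t1.tv_sec - t0.tv_sec) * 1000000000L
                    + (t1.tv_nsec - t0.tv_nsec);
            if (rep < 5 || rep == REPS - 1)
                printf("rep %3d: %6ld ns\n", rep, ns);
        }
        (void)sink;
        return 0;
    }
    ```

    On processors with the history-based predictors described above, the later repetitions typically run faster than the first few, because the predictor has learned the full period; regenerating the pattern before each repetition removes that advantage and restores the ~50% misprediction rate one might naively expect.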

  2. John D. McCalpin, Ph.D. says

    January 23, 2019 at 12:40 pm

    Thanks for the comments! (I can’t figure out how to move the comment in WordPress, so I will leave it here…)

    I have not done many experiments with the branch predictor for nested loops or other sequences of conditional branches. In some unpublished work on signal processing algorithms, I noticed that the branch predictor “remembered” which array indices would pass/fail certain compares, so performance increased if I re-ran the filter on the same input data. If I recall correctly, it took 4-5 iterations to reach asymptotic performance. Since this performance artifact was inappropriate for the actual use case (which would see the data only once), I was more focused on avoiding it than on understanding the details.

    I have added some updates to that post relating to the LFENCE instruction….
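    A hypothetical reconstruction of the warm-up effect described in the reply above: a threshold compare whose outcomes are fixed by the input data, so re-running the same filter on the same array lets the predictor “remember” which elements pass. The function name, array size, cutoff, and pass count are all illustrative, not taken from the unpublished work.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Count samples above a cutoff; one data-dependent branch per element. */
    long filter_count(const float *x, long n, float cutoff) {
        long count = 0;
        for (long i = 0; i < n; i++)
            if (x[i] > cutoff)
                count++;
        return count;
    }

    int main(void) {
        enum { N = 4096 };   /* small enough for the predictor to learn */
        float *x = malloc(N * sizeof *x);
        for (long i = 0; i < N; i++)
            x[i] = (float)rand() / (float)RAND_MAX;

        /* Re-run the identical filter; on the processors discussed above,
         * passes after the first few typically run faster because the
         * branch outcomes for this data have been memorized. */
        for (int pass = 0; pass < 8; pass++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            volatile long c = filter_count(x, N, 0.5f);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double us = (t1.tv_sec - t0.tv_sec) * 1e6
                      + (t1.tv_nsec - t0.tv_nsec) * 1e-3;
            printf("pass %d: count=%ld, %.1f us\n", pass, (long)c, us);
        }
        free(x);
        return 0;
    }
    ```

    If, as in the use case described above, only first-pass performance matters, one way to avoid the artifact is to regenerate (or reshuffle) the input data before each timed pass.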

