Following up on Part 1, Part 2, and Part 3, it is time to get into the ugly stuff: trying to control DRAM bank and rank access patterns and working to improve the effectiveness of the memory controller prefetcher. Background: Banks and Ranks The DRAM installed in the system… read more
Optimizing AMD Opteron Memory Bandwidth, Part 3: single-thread, read-only
Following up on Part 1 and Part 2, it is time to look at adding explicit prefetching to try to increase read bandwidth. About Prefetching The AMD Opteron Family10h processors have two different “hardware” prefetch mechanisms, and also allow “software” prefetch instructions. The “core prefetcher” is (as the name implies)… read more
Optimizing AMD Opteron Memory Bandwidth, Part 2: single-thread, read-only
In a previous entry, I started discussing the issues related to memory bandwidth for a read-only kernel on a sample AMD Opteron system. The naive implementation gave a performance of 3.393 GB/s when compiled at “-O1” (hereafter “Version 001”) and 4.145 GB/s when compiled at “-O2” (hereafter “Version 002”). Today… read more