One of the (many) minimally documented features of recent Intel processor implementations is the “memory directory”. This is used in multi-socket systems to reduce cache coherence traffic between sockets. I have referred to this in various presentations as: “A Memory Directory is one or more bits per cache line… read more
microprocessors
Mapping addresses to L3/CHA slices in Intel processors
Starting with the Xeon E5 processors “Sandy Bridge EP” in 2012, all of Intel’s mainstream multicore server processors have included a distributed L3 cache with distributed coherence processing. The L3 cache is divided into “slices”, which are distributed around the chip — typically one “slice” for each processor core. Each… read more
A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing)
A peculiar throughput limitation on Intel’s Xeon Phi x200 (Knights Landing) Introduction: In December 2017, my colleague Damon McDougall (now at AMD) asked for help in porting the fused multiply-add example code from a Colfax report (https://colfaxresearch.com/skl-avx512/) to the Xeon Phi x200 (Knights Landing) processors here at TACC. There was… read more