Opteron Memory Latency vs CPU and DRAM Frequency

In an earlier post (link), I made the off-hand comment that local memory latency was only weakly dependent on CPU and DRAM frequency in Opteron systems. Being of the scientific mindset, I went back and gathered a clean set of data to show the extent to which this assertion is correct.

The following table presents my most reliable and accurate measurements of open-page local memory latency for systems based on AMD Opteron microprocessors. These values were obtained using a pointer-chasing code (similar to lat_mem_rd from lmbench) carefully implemented to avoid memory prefetching by either the CPU cores or the memory controller. These results were obtained using a pointer-chasing stride of 320 Bytes, modulo 32768 Bytes. This scheme results in five passes through each 32 kB region of memory, loading every 64-Byte cache line without activating the core prefetcher (which requires the stride to be no more than one cache line) or the DRAM prefetcher (which requires that the stride be no more than four cache lines on the system under test). Large pages were used to ensure that each 32 kB region was mapped to contiguous DRAM locations to maximize DRAM page hits.
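
The claim that a 320-Byte stride wrapped at 32768 Bytes touches every cache line in five passes can be checked with a few lines of C. The sketch below is only an illustration and is not part of the benchmark itself; the function name lines_visited and the 64-Byte line size are assumptions made for the example.

```c
#include <assert.h>

#define LINE_BYTES   64      /* assumed cache line size */
#define REGION_BYTES 32768   /* 32 kB wrap-around region */

/*
 * Walk offset -> (offset + stride) % REGION_BYTES starting at offset 0 until
 * the walk returns to 0. Returns the number of distinct cache lines touched
 * and stores the number of wrap-arounds (passes through the region) in *passes.
 */
static int lines_visited(int stride_bytes, int *passes)
{
    unsigned char seen[REGION_BYTES / LINE_BYTES] = {0};
    int offset = 0, count = 0, wraps = 0;

    assert(stride_bytes > 0 && stride_bytes % LINE_BYTES == 0);
    assert(stride_bytes < REGION_BYTES);

    do {
        int line = offset / LINE_BYTES;
        if (!seen[line]) {
            seen[line] = 1;
            count++;
        }
        if (offset + stride_bytes >= REGION_BYTES)
            wraps++;                 /* this step wraps past the end of the region */
        offset = (offset + stride_bytes) % REGION_BYTES;
    } while (offset != 0);

    *passes = wraps;
    return count;
}
```

Under these assumptions, lines_visited(320, &passes) reports 512 lines and 5 passes, which is every line in the region. A stride of 256 Bytes (four lines) would touch only 128 of the 512 lines, so a stride that is an odd number of cache lines is needed for full coverage.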

The system used for these tests is a home-built system:

AMD Phenom II 555 Processor
- Two cores
- Supported CPU Frequencies of 3.2, 2.5, 2.1, and 0.8 GHz
- Silicon Revision C2 (equivalent to “Shanghai” Opteron server parts)
4 GB of DDR3/1600 DRAM
- Two 2GB unregistered, unbuffered DIMMs
- Dual-rank DIMMs
- Each rank is composed of eight 1 Gbit parts (128Mbit x8)

The array sizes used were 256 MB, 512 MB, and 1024 MB, with each test being run three times. Of these nine results, the value chosen for the table below was an “eyeball” average of the 2nd-best through 5th-best results. (There was very little scatter in the results: typically less than 0.1 ns variation across the majority of the nine results for each configuration.)
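
The table values were picked by eye, so no code was involved in that step. For anyone who wants a repeatable version of the same selection, the sketch below takes the arithmetic mean of the 2nd-lowest through 5th-lowest of nine latencies. It is a stand-in for the eyeball average, not the procedure actually used; the name trimmed_latency is hypothetical, and "best" is assumed to mean lowest latency.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NRESULTS 9   /* three array sizes, three runs each */

/* qsort comparator: ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Mean of the 2nd-lowest through 5th-lowest of nine latencies (in ns). */
static double trimmed_latency(const double results[NRESULTS])
{
    double sorted[NRESULTS];

    assert(results != NULL);
    memcpy(sorted, results, sizeof sorted);          /* leave the input untouched */
    qsort(sorted, NRESULTS, sizeof sorted[0], cmp_double);
    return (sorted[1] + sorted[2] + sorted[3] + sorted[4]) / 4.0;
}
```

Dropping the single best value and the four worst values discards both a possibly lucky outlier and any runs disturbed by other activity on the system.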

Processor Frequency	DDR3/1600	DDR3/1333	DDR3/1066	DDR3/800
3.2 GHz	51.58 ns	54.18 ns	57.46 ns	66.18 ns
2.5 GHz	52.77 ns	54.97 ns	58.30 ns	65.74 ns
2.1 GHz	53.94 ns	55.37 ns	58.72 ns	65.79 ns
0.8 GHz	82.86 ns	87.42 ns	94.96 ns	94.51 ns

Notes:

The default frequencies for the system are 3.2 GHz for the CPU cores and 667 MHz for the DRAM (DDR3/1333).

Summary:
The results above show that the variations in (local) memory latency are relatively small for this single-socket system when the CPU frequency is varied between 2.1 and 3.2 GHz and the DRAM frequency is varied between 533 MHz (DDR3/1066) and 800 MHz (DDR3/1600). Relative to the default frequencies (54.18 ns), the lowest latency (51.58 ns, obtained by overclocking the DRAM by 20%) is almost 5% better. The highest latency in this range (58.72 ns, obtained by dropping the CPU core frequency by ~1/3 and the DRAM frequency by about 20%) is only ~8% higher than the default value.
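
The percentages in this summary follow directly from the table. A minimal sketch of the arithmetic is given below; the helper name percent_change is hypothetical, and the inputs listed in the comment are copied from the table and notes above.

```c
#include <assert.h>

/* Percent change of `value` relative to `base` (negative means lower). */
static double percent_change(double base, double value)
{
    assert(base != 0.0);
    return 100.0 * (value - base) / base;
}

/*
 * Inputs taken from the table and notes above:
 *   percent_change(54.18, 51.58)     lowest latency vs. default:      about -4.8
 *   percent_change(54.18, 58.72)     2.1 GHz, DDR3/1066 vs. default:  about +8.4
 *   percent_change(1333.0, 1600.0)   DRAM overclock:                  about +20
 *   percent_change(1333.0, 1066.0)   DRAM slowdown:                   about -20
 *   percent_change(3.2, 2.1)         CPU frequency drop:              about -34
 */
```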


Comments

Kshitij Sudan says

April 26, 2011 at 6:30 pm

Could you please elaborate a bit more on the pointer-chasing code that performs open-page accesses? Specifically, this line in the blog text: “These results were obtained using a pointer-chasing stride of 320 Bytes, modulo 32768 Bytes. This scheme results in five passes through each 32 kB region of memory …”

John D. McCalpin, Ph.D. says

May 4, 2011 at 3:28 pm

One version of the pointer-chasing code that I use is set up for a stride of 5 cache lines (320 Bytes), but set up to wrap around each 32kB region. I use this on large pages so that 32kB of aligned contiguous virtual addresses maps to 32kB of aligned contiguous physical addresses, which is small enough to be accessed in “open page” mode on any of my test systems.

Ensuring “open page” mode reduces the DRAM component of the latency, since only the column address needs to be sent.

Using a stride of 320 Bytes ensures that neither the core prefetcher nor the DRAM prefetcher on AMD Opteron processors will prefetch the data.

Allowing the addresses to wrap in each 32kB region enables me to load every cache line in the region, minimizing the overhead of opening the DRAM page (or pages, depending on the specifics of the DRAM configuration).

The code fragment looks like:

    #define REGION 32768    /* Bytes in each wrap-around region */
    int nregions, nlines;
    int source, target;
    int imax;               /* offset of the line that closes the cycle in a region */

    /* Declared elsewhere (presumably): addr0 is the byte-addressed base of the
       array, len is the array size in Bytes, stride is 320 in these tests,
       and i, j are integer loop counters. */
    printf("Modulo Initialization: base addr0 : %p \n", (void *)addr0);

    nregions = len/REGION;
    nlines = REGION/64;     /* 64-Byte cache lines */

    printf(" SEGSIZE corresponds to %d regions\n", nregions);
    printf(" Each region contains %d cache lines\n", nlines);

    for (j=0; j<nregions; j++) {
        printf("starting setting for %d kB region number %d\n", REGION/1024, j);

        imax = (j+1)*REGION - stride;

        /* Point every cache line at the line `stride` Bytes ahead,
           wrapping within the region. */
        for (i=0; i<nlines; i++) {
            source = j*REGION + (i*64) % REGION;
            target = j*REGION + (i*64 + stride) % REGION;
            /* printf("Source/target: %d, %d\n",source,target); */
            *(char **)&addr0[source] = (char *)&addr0[target];
        }
        /* The line at imax would wrap to the start of this region;
           send it to the start of the next region instead. */
        source = imax;
        target = (j+1)*REGION;
        *(char **)&addr0[source] = (char *)&addr0[target];
        /* printf("Overwrite Source/target: %d, %d\n",source,target); */
    }
    /* The closing line of the last region points back to the start of the array. */
    source = imax;
    target = 0;
    *(char **)&addr0[source] = (char *)&addr0[target];
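
The fragment above only builds the pointer chain; the timed traversal is not shown. A minimal self-contained sketch of both steps might look like the following. It is not the code used for the measurements: the names init_chain, chase, and ns_per_load are hypothetical, the timer is POSIX clock_gettime with CLOCK_MONOTONIC, and the buffer is assumed to be supplied by the caller. With an ordinary malloc buffer rather than large pages, the 32 kB regions are not guaranteed to be physically contiguous, so the open-page behaviour described above is not guaranteed either.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define REGION 32768   /* Bytes per wrap-around region, as in the fragment */
#define LINE   64      /* assumed cache line size in Bytes */

/* Build the chain: stride within each region, then on to the next region,
   and from the last region back to the start of the buffer. */
static void init_chain(char *buf, size_t len, size_t stride)
{
    size_t nregions = len / REGION;

    assert(nregions > 0 && len % REGION == 0);
    assert(stride > 0 && stride % LINE == 0 && stride < REGION);

    for (size_t j = 0; j < nregions; j++) {
        char *base = buf + j * REGION;
        for (size_t off = 0; off < REGION; off += LINE)
            *(char **)(base + off) = base + (off + stride) % REGION;
        /* Redirect the line that would wrap to this region's start. */
        char *next = (j + 1 < nregions) ? base + REGION : buf;
        *(char **)(base + REGION - stride) = next;
    }
}

/* Follow the chain until it returns to the start; returns the number of loads. */
static size_t chase(char *buf)
{
    char *p = buf;
    size_t loads = 0;

    do {
        p = *(char **)p;   /* each load depends on the previous one */
        loads++;
    } while (p != buf);
    return loads;
}

/* Average nanoseconds per dependent load over `reps` full traversals. */
static double ns_per_load(char *buf, int reps)
{
    struct timespec t0, t1;
    size_t loads = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        loads += chase(buf);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)loads;
}
```

A caller would allocate a buffer whose size is a multiple of 32768 Bytes (much larger than the caches), call init_chain(buf, len, 320) once, and then call ns_per_load(buf, reps). Because every load address comes from the previous load, the loads cannot overlap, and the average time per load approximates the memory latency.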
