John McCalpin's blog

Dr. Bandwidth explains all….

Archive for January, 2013

STREAM version 5.10 released

Posted by John D. McCalpin, Ph.D. on 17th January 2013

After much too long a delay, version 5.10 of the STREAM benchmark has been released (at least in the C language version).

Although version 5.10 of the benchmark still measures exactly the same thing as previous versions, a number of long-awaited features have finally been integrated.

  • Updated validation code
  • Array indexing now allows arrays with more than 2 billion elements
  • Data type used can now be overridden from the default “double” to “float” with a single compile flag
  • Many small output formatting changes to account for computers getting bigger and faster

The validation code update is the biggest change to version 5.10 of stream.c.
With previous versions, the validation code was subject to accumulated round-off error that could cause the code to report that validation failed with large array sizes — even if nothing was actually wrong. The revised code eliminates this problem and has been tested to array sizes of 10 billion elements with no problems.
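
One possible way to keep round-off from growing with array size is to accumulate per-element errors rather than full-size values. The sketch below illustrates that general idea only; it is not taken from stream.c, and the names avg_abs_error and validate, as well as the caller-chosen tolerance, are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: mean absolute deviation of each element from the
 * value it is expected to hold. Each accumulated term is an error, not a
 * full-size value, so the sum stays small when the array is correct. */
double avg_abs_error(const double *x, size_t n, double expected)
{
    double sum_err = 0.0;
    for (size_t j = 0; j < n; j++) {
        double d = x[j] - expected;
        sum_err += (d < 0.0) ? -d : d;
    }
    return (n > 0) ? sum_err / (double)n : 0.0;
}

/* Hypothetical pass/fail check. epsilon is a relative tolerance supplied
 * by the caller; this sketch assumes expected is nonzero. */
int validate(const double *x, size_t n, double expected, double epsilon)
{
    double scale = (expected < 0.0) ? -expected : expected;
    return avg_abs_error(x, n, expected) / scale <= epsilon;
}
```

A caller would compute the expected value by repeating the benchmark's arithmetic on a single scalar, then pass each array to validate with a tolerance suited to the element type.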

Previous versions of STREAM were limited to 32-bit array indices. Version 5.10 defines the array indices using a type that maps to a 64-bit integer on 64-bit machines, which allows arrays with more than 2 billion elements. Most compilers also require an additional command-line flag like “-mcmodel=medium” to allow full 64-bit addressing. The changes to STREAM in version 5.10 are required in addition to that extra command-line flag.
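
As a rough illustration of the indexing change, the sketch below loops with an index type that is 64 bits wide on typical 64-bit platforms and takes its element type from a compile-time macro. The choice of ptrdiff_t, the macro name STREAM_TYPE, and the helpers fill and array_bytes are assumptions of this sketch; check stream.c itself for the names and types it actually uses.

```c
#include <stddef.h>

/* Element type, overridable at compile time (e.g. -DSTREAM_TYPE=float).
 * The macro name is an assumption of this sketch. */
#ifndef STREAM_TYPE
#define STREAM_TYPE double
#endif

/* Assumption: ptrdiff_t as the index type. It is 64 bits wide on typical
 * 64-bit platforms and 32 bits wide on typical 32-bit ones. */
typedef ptrdiff_t index_t;

/* Fill a[0..n-1] with a constant. With a 64-bit index_t the loop counter
 * can go past 2^31 - 1 without overflowing. */
void fill(STREAM_TYPE *a, index_t n, STREAM_TYPE value)
{
    for (index_t j = 0; j < n; j++)
        a[j] = value;
}

/* Total bytes needed for n elements, computed in 64-bit arithmetic. */
unsigned long long array_bytes(unsigned long long n)
{
    return n * (unsigned long long)sizeof(STREAM_TYPE);
}
```

With the default double element type, array_bytes shows that an array of 10 billion elements needs 80 billion bytes, which is why a wider index alone is not enough and the addressing flag is also needed.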

Dr. Bandwidth also found eight older submissions (from 2009 through early 2012) that somehow got lost in my mailbox and were never posted to the site. These are listed on the STREAM benchmark What’s New page.

Along with these older submissions, four new submissions have just been added to the site, ranging from a Raspberry Pi delivering about 200 MB/s to a Xeon Phi SE10P coprocessor delivering over 160,000 MB/s — that’s an 800 to 1 ratio of sustained memory bandwidths measured for single-chip systems!

Users of systems at the Texas Advanced Computing Center will be interested in seeing new results posted for three different components of the Stampede system:

  • Stampede Compute Nodes: Dell_DCS8000 servers with two Xeon E5-2680 (8-core, 2.7 GHz) processors
  • Stampede Coprocessors: Intel_XeonPhi_SE10P Coprocessor (61-core, 1.1 GHz)
  • Stampede Large Memory Nodes: Dell_PowerEdge_820 servers with four Xeon E5-4650 (8-core, 2.7 GHz) processors


Counting binary vs decimal powers in the STREAM benchmark

Posted by John D. McCalpin, Ph.D. on 5th January 2013

A question came up recently about my choice of definitions for “MB” used in the computation of memory bandwidth (in “MB/s”) in the STREAM benchmark.

According to this reference from NIST, the convention is:

Binary powers
Power   Value            Abbreviation   Full name
2^10    1,024            KiB            kibibyte
2^20    1,048,576        MiB            mebibyte
2^30    1,073,741,824    GiB            gibibyte

Decimal powers
Power   Value            Abbreviation   Full name
10^3    1,000            kB             kilobyte
10^6    1,000,000        MB             megabyte
10^9    1,000,000,000    GB             gigabyte

Since its inception in 1991, the STREAM benchmark has reported the amount of memory used in MiB (2^20) and (more recently) GiB (2^30), but always reports the transfer rates in MB/s (10^6).

An example may make my motivation clearer.

Suppose a computer system reads 524,288 Bytes and writes 524,288 Bytes in 1.000 seconds, for a total of 1,048,576 Bytes transferred in 1.000 seconds.
The corresponding performance could be reported in a variety of ways:

  • Option 1: report as 1,048,576 Bytes/s
  • Option 2: report as 1.000000 MiB/s
  • Option 3: report as 1.049 MB/s
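
The arithmetic behind Options 2 and 3 can be sketched in a few lines of C. This is only an illustration of the unit conversion, not code from stream.c; the function and macro names are hypothetical.

```c
/* Bytes per binary megabyte (MiB) and per decimal megabyte (MB). */
#define BYTES_PER_MIB 1048576.0 /* 2^20 */
#define BYTES_PER_MB  1000000.0 /* 10^6 */

/* Transfer rate in MiB/s (Option 2). */
double rate_mib_per_s(double bytes, double seconds)
{
    return bytes / BYTES_PER_MIB / seconds;
}

/* Transfer rate in MB/s (Option 3). */
double rate_mb_per_s(double bytes, double seconds)
{
    return bytes / BYTES_PER_MB / seconds;
}
```

For 1,048,576 Bytes in 1.000 seconds, the first function gives 1.000000 MiB/s and the second gives 1.048576 MB/s, which rounds to the 1.049 MB/s shown in Option 3.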

From my perspective:

  • Option 1 gives inconveniently large numbers.
  • Option 2 is consistent with typical units for memory storage, but:
    • it is not consistent with typical units for counting arithmetic operations (more on that below), and
    • it would allow unscrupulous parties (or simply parties with different opinions about how to “properly” count) to change the definition of “MB” from 2^20 to 10^6, allowing them to report values that were almost 4.9% higher than the *same* performance on other systems.
  • Option 3 is what I chose. It is consistent with how FLOPS are counted and it preempts the potential “performance inflation” from abusing Option 2.

Note that if floating-point arithmetic operation counts define “MFLOPS” as 10^6 FP Ops/s (as is typical), then “balance” ratios of (MB/s)/MFLOPS require that (MB/s) also be defined using a decimal base.
These “balance” ratios are an important output of the STREAM benchmark project.
(Aside: I would not encourage anyone to consider a 5% difference in “balance” to mean very much — these are intended as relatively coarse scaling estimates.)
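
A minimal sketch of the ratio described above, assuming both inputs use decimal prefixes so that the 10^6 factors cancel and the result is in Bytes per floating-point operation. The function name is hypothetical.

```c
/* Balance as bytes of sustained memory traffic per floating-point
 * operation. Both arguments are assumed to use decimal prefixes:
 * mb_per_s in units of 10^6 Bytes/s, mflops in units of 10^6 FP Ops/s. */
double balance_bytes_per_flop(double mb_per_s, double mflops)
{
    return mb_per_s / mflops;
}
```

If the bandwidth were instead quoted in MiB/s while MFLOPS stayed decimal, the same call would understate the ratio by a factor of 1.048576, because the prefixes would no longer cancel.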
