Sometimes things that seem incredibly difficult aren’t really that bad….
I have been reviewing technology challenges for “exascale” computing and ran across an interesting comment in the 2008 “Technology Challenges in Achieving Exascale Systems” report.
In Section 5.8 “Application Assessments”, Figure 5.16 on page 82 places “Ordered Summation” in the upper right hand corner (serial and non-local) with the annotation “Just plain hard to speed up”.
The most obvious use for ordered summation is in computing sums or dot products in such a way that the result does not depend on the order of the computations, or on the number of partial sums used in intermediate stages.
Interestingly, “ordered summation” is not necessary to obtain sums or dot products that are “exact” independent of ordering or grouping. For the very important case of “exact” computation of inner products, the groundwork was laid out 30 years ago by Kulisch (e.g., Kulisch, U. and Miranker, W. L.: Computer Arithmetic in Theory and Practice, Academic Press 1981, and US Patents 4,622,650 and 4,866,653). Kulisch proposes using a very long fixed point accumulator that can handle the full range of products of 64-bit IEEE values — from minimum denorm times minimum denorm to maximum value times maximum value. Working out the details and allowing extra bits to prevent overflow in the case of adding lots of maximum values, Kulisch proposed an accumulator of 4288 bits to handle the accumulation of products of 64-bit IEEE floating-point values. (ref and ref).
For a long time this proposal of a humongously long accumulator (4288 bits = 536 Bytes = 67 64-bit words) was considered completely impractical, but as technology has changed, I think the approach makes a fair amount of sense now — trading computation of these exact inner products for the potentially much more expensive communication required to re-order the summation.
I have not looked at the current software implementations of this exact accumulator in detail, but it appears that on a current Intel microprocessor you can add two exact accumulators in ~133 cycles — one cycle for the first 64-bit addition, and 2 cycles for each of the next 66 64-bit add with carry operations. (AMD processors provide similar capability, with slightly different latency and throughput details.) Although the initial bit-twiddling to convert from two IEEE 64-bit numbers to a 106-bit fixed point value is ugly, the operations should not take very long in the common case, so presumably a software implementation of the exact accumulator would spend most of its time updating exact accumulator from some “base” point (where the low-order bits of the product sit) up to the top of the accumulator. (It is certainly possible to employ various tricks to know when you can stop carrying bits “upstream”, but I am trying to be conservative in the performance estimates here.)
Since these exact accumulations are order-independent you use all of the cores on the chip to run multiple accumulators in parallel. You can also get a bit of speedup by pipelining two accumulations on one core (1.5 cycles per Add with Carry throughput versus 2 cycles per Add with Carry Latency). To keep the control flow separate, this is probably done most easily via HyperThreading. Assuming each pair of 64-bit IEEE inputs generates outputs that average ~1/3 of the way up the exponent range, a naïve implementation would require ~44 Add With Carry operations per accumulate, or about 30 cycles per update in a pipelined implementation. Add another ~25 cycles per element for bit twiddling and control overhead gives ~55 cycles per element on one core, or ~7.5 cycles per element on an 8-core processor. Assume a 2.5GHz clock to get ~3ns per update. Note that the update is associated with 16 Bytes of memory traffic to read the two input arrays, and that the resulting 5.3 GB/s of DRAM bandwidth is well within the chip’s capability. Interestingly, the chip’s sustained bandwidth limitation of 10-15 GB/s means that accumulating into a 64-bit IEEE value is only going to be 2-3 times as fast as this exact technique.
Sending exact accumulators between nodes for a tree-based summation is easy — with current interconnect fabrics the time required to send 536 Bytes is almost the same as the time required to send the 8 Byte IEEE partial sums currently in use. I.e., with QDR Infiniband, the time required to send a message via MPI is something like 1 microsecond plus (message length / 3.2 GB/s). This works out to 1.0025 microseconds with an 8 Byte partial sum and 1.167 microseconds for a 536 Byte partial sum, with the difference expected to decrease as FDR Infiniband is introduced.
I don’t know of anyone using these techniques in production, but it looks like we are getting close to the point where we pay a slight performance penalty (on what was almost certainly a small part of the code’s overall execution time) and never again need to worry about ordering or grouping leading to slightly different answers in sum reductions or dot products. This sounds like a step in the right direction….