
Is that just because multiplication is so commonly needed that there is double hardware allocated for FMA?

Like, if I had a special purpose adding machine and a special purpose multiply machine, and I bought a second multiply machine, then I could multiply large numbers faster than adding, but that's not insane, that's resource allocation.



A floating point add works roughly like this:

0. Extract mantissa and exponents from the inputs [basically free in a HW implementation].

1. Compare the two exponents and shift the mantissas so that they share the same scale.

2. Add the two mantissas together.

3. Normalize the result back into a floating-point number.
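The four steps above can be sketched in software. This is a hypothetical illustration using C's `frexp`/`ldexp` to stand in for the hardware's exponent/mantissa handling; it ignores signs of the shift direction, rounding, and special values like NaN and infinity:

```c
#include <math.h>

/* Software sketch of the hardware FP-add steps (illustrative only). */
double sketch_fadd(double a, double b) {
    int ea, eb;
    double ma = frexp(a, &ea);   /* step 0: mantissa in [0.5, 1) and exponent */
    double mb = frexp(b, &eb);
    int e = ea > eb ? ea : eb;   /* step 1: align both mantissas to the larger exponent */
    ma = ldexp(ma, ea - e);
    mb = ldexp(mb, eb - e);
    double m = ma + mb;          /* step 2: add the aligned mantissas */
    return ldexp(m, e);          /* step 3: normalize back into a float */
}
```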

Now look at the process for a floating-point multiply operation:

0. Extract mantissa and exponents from the inputs [basically free in a HW implementation].

1. Add the two exponents together.

2. Multiply the two mantissas together.

3. Normalize the result back into a floating-point number.
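The multiply steps admit the same kind of sketch. Again a hypothetical illustration in C, ignoring signs, rounding, and special values; note how only steps 1 and 2 differ from the add case:

```c
#include <math.h>

/* Software sketch of the hardware FP-multiply steps (illustrative only). */
double sketch_fmul(double a, double b) {
    int ea, eb;
    double ma = frexp(a, &ea);   /* step 0: mantissa in [0.5, 1) and exponent */
    double mb = frexp(b, &eb);
    int e = ea + eb;             /* step 1: add the two exponents */
    double m = ma * mb;          /* step 2: multiply mantissas, result in [0.25, 1) */
    return ldexp(m, e);          /* step 3: normalize back into a float */
}
```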

Steps 0 and 3 are identical in both cases. I'm not deeply familiar with hardware multiplier implementations, but a hardware multiplier typically already ends in a wide adder (to sum the partial products), so folding one more addend into it can be pretty close to free. In any case, bolting an extra adder onto the pipeline is quite cheap, even if not free.

What this means is that there is not much extra hardware to turn a FMUL ALU op into an FMA ALU op... and if you make just a single FMA ALU op, you can use that hardware to implement FADD, FMUL, and FMA with nothing more than microcode.

In other words, instead of thinking of FMA as something worth pouring extra resources into, think of FADD as not being worth its own independent unit.


FP units are not used in integer bignum operations. I'm still curious if "multiplies can be faster" applies to modern integer ALUs.


Many important numerical algorithms like dot products, matrix multiplication, FFT can be implemented in terms of FMA. So yes, it makes sense for processor designers to devote resources to FMA units.

Also, FMA units can be used for pure adds or multiplies as well. For instruction sets that have separate addition and multiplication instructions in addition to FMA (like x86), I'm not sure whether they all execute on the same functional units or on separate ones. And if they share units, what limits the throughput to one add per cycle when two FMAs per cycle are possible, as on Haswell?



