Fast Exponentiation and Floating Point Representation

Where I briefly explain an old-school trick to approximate the exponential function by abusing the way floating points are represented in memory. The method has been known since at least 1999, when it appeared in a paper by Nicol Schraudolph.

Floating Point Representation

Floating points are absolutely ubiquitous in programming these days, and virtually every architecture implements the IEEE 754 standard. Paradoxically, it seems few software engineers are aware of the definition and subtleties of the representation, and even fewer know of the kind of abuse they lend themselves to.

An almost mythical example of such hacks is the (misattributed) Quake III fast inverse square root algorithm:

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;
    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}  

There is nothing I can say about it that has not been extensively said already and isn’t documented and linked from the Wikipedia page, so I won’t waste any time on that and focus on the exponential instead. By the end, that should still demystify the line:

    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?

The IEEE 754 Standard

Although there have historically been many ways to represent floating points (different bases, with or without gradual underflow, etc.), we have eventually settled on a standard, which I’ll explain. Unless you code on some pretty old or esoteric machine, or do AI work that requires more exotic formats, chances are this is what you’re dealing with.

I’ll focus on 32-bit floating points (float in C/C++) here, but the standard is defined similarly for 16-bit and 64-bit floating points, just with a few different constants and a different bit partition.

Let $f$ be a 32-bit floating point. Its binary representation is partitioned as follows:

s e7 e6 e5 e4 e3 e2 e1 e0 m22 m21 ... m1 m0

Let
$$e = \sum_{i = 0}^{7} e_i 2^i \quad \text{and} \quad m = \sum_{i = 0}^{22} m_i 2^i,$$

then,

$$f = (-1)^s \times 2^{e - 127} \times \left(1 + 2^{-23}m\right).$$

Remark: There may be something surprising about having the exponent be the biased integer $e - 127$, with the exponent bits read as an unsigned integer, rather than using the usual two’s complement representation of a signed integer. This is for good reason: it makes the ordering of (positive) floating points coincide with the lexicographic ordering of their binary representations.

There are some subtleties here, as the previous formula does not hold for $e = 0$ or $e = 255$: the former allows for gradual underflow with subnormal numbers, and the latter provides dedicated representations of infinity and NaN. It’s good to be aware of this to have the full picture, but we won’t need it here, and we ignore these edge cases for simplicity.


The Algorithm

Code

Now that we know what floating points look like under the hood, let’s see how to abuse them to approximate the exponential. Let’s start with a somewhat less hacky way to do type punning than the one used in the fast inverse square root algorithm, by defining the following union:

typedef union F32 {
    float value;
    int32_t bits;
} F32;

The C++ masochists are welcome to reinterpret_cast or bit_cast to their hearts’ content instead.

With that, we can now also write a seemingly undecipherable algorithm to approximate the exponential.

float fast_exp(float f) {
    float fb = f * 12102203;
    F32 expo = {.bits = (int)fb + 1065307417};
    return expo.value;
}

Explanation

There are only three lines, and essentially three things to explain in that piece of code:

  1. What is the constant 12102203?
  2. What does the cast to int, then the reinterpretation of the bits as a floating point, achieve?
  3. What is the constant 1065307417?

The Constant 12102203

Why 12102203? Well, it turns out this is $\log_2(e) \times 2^{23}$, rounded to the nearest integer. Notice that, because $\log_2(e)$ is in the range $\left[1, 2\right)$, this value is a 24-bit integer, so it is still exactly representable within the 24 bits of significand precision (1 implicit bit plus 23 mantissa bits) of a 32-bit floating point.

The Constant 1065307417

This one is more tricky. “Half” of it is straightforward, and the other half is an optimisation. The constant 1065307417 can be written as:
$$1065307417 = (127 \ll 23) - 45799.$$

Here, 127 is the exponent bias of 32-bit floating points, shifted left by 23 bits (or multiplied by $2^{23}$), and 45799 is an optimisation constant computed to minimise the maximum relative error between our approximation and the true exponential function. A mathematical derivation of the optimisation constant is presented in Schraudolph’s paper. These days, computers are powerful enough that you could brute-force one out (though you’d have to do the same for every floating point format, instead of having a closed formula…).

Cast and Bit/Reinterpret Cast

Perhaps the most interesting, and most illuminating, part is realising that reinterpreting a (non-negative) signed integer as a floating point is akin to applying the function $x \mapsto 2^{x - 127}$ to its exponent field, and thus an exponential. Let’s make this a bit more precise.

Let $x$ be a non-negative integer, with binary representation $x_{30} \ldots x_0$ ($x_{31} = 0$ since the integer is non-negative). Reinterpreting the bits of $x$ as a float yields the floating point
$$x_f = 2^{(x \gg 23) - 127} \left( 1 + \sum_{i=0}^{22} x_i 2^{i-23}\right).$$

Making Sense of it All

Given the previous expression of $x_f$, the goal is the following: for a given floating point $f$ that we wish to get the exponential of, we try and compute an integer $f_b$ such that $2^{(f_b \gg 23) - 127}$ is as close as possible to $e^f$.

First, remark that
$$2^{x\, \log_2(e)} = e^{x\, \log_2(e)\, \ln(2)} = e^{x\, \ln(e)} = e^{x},$$
since $\log_2(e)\, \ln(2) = \ln(e) = 1$.

Thus, we want $(f_b \gg 23) - 127$ close to $f \times \log_2(e)$. Let’s derive this:

(fb23)127f×log2(e) (f_b \gg 23) - 127 \simeq f \times \text{log}_2(e) is equivalent to, by multiplying each side by 2232^{23} fb127×223f×log2(e)×223 f_b - 127 \times 2^{23} \simeq f \times \text{log}_2(e) \times 2^{23} and we get fbf×log2(e)×223+127×223. f_b \simeq f \times \text{log}_2(e) \times 2^{23} + 127 \times 2^{23}. We should recognise here (part of) the constants which were necessary to readjust the exponentiation induced by the bit-cast of an integer to a floating point.

This kind of computation can be extended to understand what happens between the $2^i$ and $2^{i+1}$ images (spoiler: it is a piecewise affine function), as well as to do a rigorous analysis of the errors. This is done in the original paper mentioned at the very start.


Conclusion

So there you have it. By taking advantage of the hardware representation of floating points, it is possible to compute an (a priori) fast approximation of the exponential function, with only one multiplication, one cast to int, and one addition.

If you were to check the relative error over the entire 32-bit floating-point range (with some care around possible edge cases at 0 and INF/NaN), you’d find it bounded by around 6% with the constant used here. This is not bad for what little work our function does, and it can be improved with further adjustments… at the cost of performance, of course.

Now, as always with these kinds of old hacks, we must remember that hardware (and compilers) have come a long way. In particular, hardware has historically been optimised to remove many of the old bottlenecks (for instance, we went from being mostly limited by the speed of floating point arithmetic to being limited by memory).

Because of this, more often than not, what used to be efficient is now counterproductive. Some of these hacks can certainly be adapted to more modern hardware and still provide a speedup, but it probably won’t come “for free” anymore. And in many cases, modern hardware handles these approximations for us in some form. Time is far better spent learning how to leverage SIMD instructions than doing obscure type punning and constructing magic constants in software (whether or not that is what actually happens in hardware).