2026-02-13

Fast Exponentiation and Floating Point Representations

Where I briefly explain an old school trick to approximate the exponential function by abusing the way floating points are represented in memory. The method has been known since at least 1999, when it appeared in a paper by Nicol Schraudolph.

Floating Point Representation

Floating points are absolutely ubiquitous in programming these days, and virtually every architecture implements the IEEE 754 standard. Paradoxically, it seems few software engineers are aware of the definition and subtleties of the representation, and even fewer of the sort of abuse it lends itself to.

An almost mythical example of such hacks is the (misattributed) Quake III fast inverse square root algorithm:

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;
    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}  

There is nothing I can say about it that has not been said extensively already and isn’t documented and linked from the Wikipedia page, so I won’t waste any time on that and will focus on the exponential instead. By the end, this should still demystify the line:

    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?

The IEEE 754 Standard

Although there have historically been many ways to represent floating points (different bases, with or without gradual underflow, etc.), we have eventually settled on a standard, which I’ll explain. Unless you code on some pretty old/esoteric machine, or do some AI work that requires more exotic formats, chances are this is what you’re dealing with.

We’ll focus on 32-bit floating points (float in C/C++) here, but the standard is defined similarly for 16-bit and 64-bit floating points, just with different constants and bit partitions.

Let f be a 32-bit floating point. Its binary expression is partitioned as follows:

s e7 e6 e5 e4 e3 e2 e1 e0 m22 m21 ... m1 m0

Let e = \sum_{i = 0}^{7} e_i 2^i \quad \text{and} \quad m = \sum_{i = 0}^{22} m_i 2^i,

then,

f = (-1)^s \times 2^{e - 127} \times \left(1 + 2^{-23}m\right).

Remark: It may be surprising that the exponent is stored as the biased integer e - 127, with the exponent field read as an unsigned integer, rather than using the usual two’s complement representation of a signed integer. But this is for good reasons, as it allows the ordering of (positive) floating points to coincide with the lexicographic ordering of their binary representations.

There are some subtleties here, as the previous formula does not hold for e = 0 or e = 255: these reserved values allow, on the one hand, for gradual underflow with subnormal numbers and, on the other, for dedicated representations of infinity and NaN. It’s good to be aware of them to have the full picture, but we won’t need them here, so we ignore these edge cases for simplicity.
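To make the formula concrete, here is a small standalone C sketch (my addition, not from the original post) that splits a float into its three fields and rebuilds the value from them, following the notation above.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Type punning through a union, as we will also do later in the article.
    union { float value; uint32_t bits; } u = {.value = 6.5f};

    uint32_t s = u.bits >> 31;           // sign bit
    uint32_t e = (u.bits >> 23) & 0xFF;  // 8 exponent bits
    uint32_t m = u.bits & 0x7FFFFF;      // 23 mantissa bits

    // Rebuild f = (-1)^s * 2^(e - 127) * (1 + m * 2^-23).
    double f = (s ? -1.0 : 1.0) * ldexp(1.0 + m / 8388608.0, (int)e - 127);

    printf("s=%u e=%u m=%u rebuilt=%f original=%f\n", s, e, m, f, (double)u.value);
    return 0;
}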

The Algorithm

Code

Now that we know what floating points look like under the hood, let’s see how to abuse them to approximate the exponential. Let’s start with a somewhat less hacky way to do type punning than the one used in the fast inverse square root algorithm, by defining the following union:

#include <stdint.h>

typedef union F32 {
    float value;    // the floating point value
    int32_t bits;   // the same 32 bits, read as an integer
} F32;

The C++ masochists are welcome to reinterpret_cast or bit_cast to their heart’s content instead.

With that, we can now write a seemingly undecipherable algorithm to approximate the exponential:

float fast_exp(float f) {
    float fb = f * 12102203;
    F32 expo = {.bits = (int)fb + 1065307417};
    return expo.value;
}
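To see it in action, a small driver comparing the approximation with expf could look like this (an illustration of mine, assuming the F32 union and fast_exp above are in scope, along with math.h and stdio.h):

int main(void) {
    for (float x = -4.0f; x <= 4.0f; x += 1.0f) {
        float approx = fast_exp(x);
        float exact  = expf(x);
        printf("x=%5.1f  fast_exp=%12.6f  expf=%12.6f  rel. err=%+.4f\n",
               x, approx, exact, (approx - exact) / exact);
    }
    return 0;
}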

Explanation

There are only three lines in fast_exp, and essentially three things to explain:

  1. What is the constant 12102203?
  2. What does the cast to int, then the reinterpretation of the bits to a floating point achieve?
  3. What is the constant 1065307417?

The Constant 12102203

Why 12102203? Well, it turns out this is the value of \text{log}_2(e) \times 2^{23}, rounded to the nearest integer. The scaling factor 2^{23} is there because we have 23 bits of mantissa in a 32-bit floating point.
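Spelled out: \text{log}_2(e) \times 2^{23} = 1.4426950\ldots \times 8388608 = 12102203.16\ldots, which rounds to 12102203.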

The Constant 1065307417

This one is more tricky. “Half” of it is straightforward, and the other half is an optimisation. The constant 1065307417 can be written as: 1065307417 = (127 \ll 23) - 45799.

Here, 127 is the bias of 32-bit floating points, shifted left 23 times (or, equivalently, multiplied by 2^{23}), and 45799 is an optimisation constant computed to minimise the maximum relative error between our approximation and the hardware-computed value of the exponential function. A mathematical derivation of the optimisation constant is presented in Schraudolph’s paper. Perhaps sadly, these days, you can just brute-force one.
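For the curious, a brute-force search could look like the following sketch (my own illustration, not the author’s code): scan candidate adjustments, keep the one with the smallest maximum relative error over a grid of sample inputs, and refine around the winner if needed.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

typedef union { float value; int32_t bits; } F32;

// Same approximation as fast_exp, but with the additive constant as a parameter.
static float approx_exp(float f, int32_t offset) {
    F32 e = {.bits = (int32_t)(f * 12102203.0f) + offset};
    return e.value;
}

int main(void) {
    const int32_t base = 127 << 23;
    int32_t best_adjust = 0;
    double best_err = INFINITY;

    // Coarse scan of the adjustment; a finer pass around the best value can follow.
    for (int32_t adjust = 0; adjust < 600000; adjust += 64) {
        double max_err = 0.0;
        for (double x = -10.0; x <= 10.0; x += 0.01) {
            double err = fabs(approx_exp((float)x, base - adjust) / exp(x) - 1.0);
            if (err > max_err) max_err = err;
        }
        if (max_err < best_err) { best_err = max_err; best_adjust = adjust; }
    }
    printf("best adjustment: %d (max relative error %.4f)\n", best_adjust, best_err);
    return 0;
}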

Cast and Bit/Reinterpret Cast

Perhaps the most interesting, and most illuminating, part is realising that reinterpreting the bits of a (non-negative) signed integer as a floating point is akin to applying the function x \mapsto 2^{x - 127} (up to a scaling of x), and thus, an exponential. Let’s make this a bit more precise.

Let x be a non-negative integer, with binary representation x_{30} \ldots x_0 (x_{31} = 0 since the integer is non-negative). Reinterpreting the bits of x as a float yields the floating point

x_f = 2^{(x \gg 23) - 127} \left( 1 + \sum_{i=0}^{22} x_i 2^{i-23}\right).
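For instance, if x = 127 \ll 23 = 1065353216, then x \gg 23 = 127 and every mantissa bit is zero, so x_f = 2^{127 - 127}(1 + 0) = 1: that bit pattern is exactly the float 1.0.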

Making Sense of it All

Given the previous expression of x_f, the goal is the following: for a given floating point f that we wish to get the exponential of, we try and compute an integer f_b such that 2^{(f_b \gg 23) - 127} is as close as possible to e^f.

First, remark that 2^{x\, \text{log}_2(e)} = e^{x\, \text{log}_2(e)\, \text{ln}(2)}, so 2^{x\, \text{log}_2(e)} = e^{x\, \text{ln}(e)}, which implies 2^{x\, \text{log}_2(e)} = e^{x}.

Thus, we want (f_b \gg 23) - 127 close to f \times \text{log}_2(e). Let’s derive this:

(f_b \gg 23) - 127 \simeq f \times \text{log}_2(e)

is equivalent, after multiplying each side by 2^{23} (and noting that (f_b \gg 23) \times 2^{23} \simeq f_b, since the shift only discards the 23 low bits), to

f_b - 127 \times 2^{23} \simeq f \times \text{log}_2(e) \times 2^{23},

and we get

f_b \simeq f \times \text{log}_2(e) \times 2^{23} + 127 \times 2^{23}.

We should recognise here (part of) the constants which were necessary to readjust the exponentiation induced by the bit-cast of an integer to a floating point.
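As a quick sanity check, take f = 1 and ignore the small correction constant for a moment: f_b \simeq 12102203 + 127 \times 2^{23} = 1077455419. Reinterpreted as a float, this is 2^{128 - 127}\left(1 + 3713595 \times 2^{-23}\right) \approx 2.885, already in the right ballpark of e \approx 2.718; the correction constant is precisely there to trim this systematic overshoot.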

This kind of computation can be extended to understand what happens between the 2^i and 2^{i + 1} images (spoiler: it’s a piecewise affine function), as well as to do a rigorous analysis of the errors. This is done in the original paper I mentioned at the very start.

Conclusion

So there you have it. Taking advantage of the hardware representation of floating points, it is possible to compute an (a priori) fast approximation of the exponential function, with only one multiplication, one cast to int, and one addition.
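And, as promised, the same reading takes the mystery out of the Quake line from the introduction. For a positive float, its bit pattern read as an integer is roughly 2^{23}(\text{log}_2(x) + 127), i.e. a scaled and shifted base-2 logarithm. Since \text{log}_2(x^{-1/2}) = -\tfrac{1}{2}\, \text{log}_2(x), the bits of an approximate inverse square root can be obtained as \tfrac{3}{2} \times 127 \times 2^{23} - (i \gg 1), and \tfrac{3}{2} \times 127 \times 2^{23} = 1598029824 = \text{0x5F400000}: the published constant 0x5f3759df is that value with its own tuning offset baked in.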

If you were to check the relative error over the entire range of 32-bit floating points (with some attention to possible edge cases around 0 and INF/NaN), you would find it to be bounded by around 3%. This is not bad for how little work our function does, but it can be improved with further adjustments… at the cost of performance, of course.

Now, as always with these kinds of old hacks, we must remember that hardware (and compilers) have come a long way, and it can be hard to predict whether or not such tricks are more efficient than the current floating point instructions and compiler optimisations, and in which circumstances. Some of them may be better suited to modern hardware than others (easier to SIMD-vectorise, for instance, or useful in unoptimised shader code). One shouldn’t implement them blindly (and nostalgically), but rather understand, profile, and validate.