Where I briefly explain an old school trick to approximate the exponential function, by abusing the way floating points are represented in memory. The method has been known since at least 1999, when it appeared in a paper by Nicol Schraudolph.
Floating points are absolutely ubiquitous in programming these days, and virtually every architecture implements the IEEE 754 standard. Paradoxically, it seems few software engineers are aware of the definition and subtleties of the representation, and even fewer of the sort of abuse it lends itself to.
An almost mythical example of such hacks is the (misattributed) Quake III fast inverse square root algorithm:
```c
float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what the fuck?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
```

There isn't anything I can say about it that hasn't already been said extensively and isn't documented and linked from the Wikipedia page, so I won't waste any time on that and will focus on the exponential instead. By the end, that should still demystify the line:
```c
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
```

Although there have historically been many ways to represent floating points (different bases, with or without gradual underflow, etc.), we eventually settled on a standard, which I'll explain. Unless you code on some pretty old/esoteric machine, or do some AI stuff that requires more exotic formats, chances are this is what you're dealing with.
We'll focus on 32-bit floating points (float in C/C++) here, but the standard is similarly defined for 16-bit and 64-bit floating points, just with different constants and a different bit partition.
Let $x$ be a 32-bit floating point. Its binary representation is partitioned as follows:

| s | e7 | e6 | e5 | e4 | e3 | e2 | e1 | e0 | m22 | m21 | … | m1 | m0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

Let

$$e = \sum_{i=0}^{7} e_i \, 2^i \quad \text{and} \quad m = \sum_{i=0}^{22} m_i \, 2^{i-23},$$

then

$$x = (-1)^s \cdot 2^{e - 127} \cdot (1 + m).$$
Remark: it may seem surprising that the exponent is stored as a biased unsigned integer rather than in the usual two's complement representation of signed integers, but this is for a good reason: it makes the ordering of (positive) floating points coincide with the lexicographic ordering of their binary representations.
There are some subtleties here, as the previous formula does not hold for $e = 0$ or $e = 255$: the former allows for gradual underflow with subnormal numbers, while the latter provides dedicated representations of infinity and NaN. It's good to be aware of this to have the full picture, but we won't need it here, and we ignore these edge cases for simplicity.
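To make the formula concrete, here is a minimal sketch of how one could pull the three fields out of a float and rebuild its value (it uses memcpy for type punning, ahead of the union introduced just below; the value 3.5f is an arbitrary example):

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
    float x = 3.5f;                 /* arbitrary example value */
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits); /* grab the raw 32 bits of x */

    uint32_t s = bits >> 31;            /* sign bit */
    uint32_t e = (bits >> 23) & 0xFF;   /* 8 exponent bits, biased by 127 */
    uint32_t m = bits & 0x7FFFFF;       /* 23 mantissa bits */

    /* Rebuild x from the fields: (-1)^s * 2^(e-127) * (1 + m/2^23) */
    double rebuilt = (s ? -1.0 : 1.0) * ldexp(1.0 + m / 8388608.0, (int)e - 127);

    printf("x = %g, s = %u, e = %u, m = %u, rebuilt = %g\n",
           x, (unsigned)s, (unsigned)e, (unsigned)m, rebuilt);
    return 0;
}
```

For 3.5f, this should report s = 0, e = 128 and m = 6291456, i.e. $2^1 \cdot (1 + 0.75)$.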
Now that we know what floating points look like under the hood, let's see how to abuse them to approximate the exponential. Let's start with a somewhat less hacky way to do type punning than the one used in the fast inverse square root algorithm, by defining the following union:
```c
#include <stdint.h>   /* for int32_t */

typedef union F32 {
    float value;
    int32_t bits;
} F32;
```

The C++ masochists are welcome to reinterpret_cast or bit_cast to their heart's content instead.
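As a quick sanity check of both the union and the formula above: 1.0f has sign 0, exponent 127 and an empty mantissa, so its bits should read exactly 127 · 2^23 (a number that will come back very soon):

```c
#include <stdio.h>
#include <stdint.h>

typedef union F32 {
    float value;
    int32_t bits;
} F32;

int main(void) {
    F32 one = {.value = 1.0f};
    /* 1.0 = (-1)^0 * 2^(127-127) * (1 + 0), so the bits should be 127 << 23. */
    printf("bits of 1.0f: %d (0x%08X), 127 * 2^23 = %d\n",
           one.bits, (unsigned)one.bits, 127 << 23);
    return 0;
}
```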
With that, we can now also write a seemingly undecipherable algorithm to approximate the exponential:

```c
float fast_exp(float f) {
    float fb = f * 12102203;
    F32 expo = {.bits = (int)fb + 1065307417};
    return expo.value;
}
```

There are only three lines, and essentially three things to explain in that piece of code:

1. Why multiply by 12102203?
2. Why add 1065307417?
3. What do the cast to int, then the reinterpretation of the bits to a floating point, achieve?

Why 12102203? Well, it turns out this is the value of $2^{23}/\ln 2$, rounded to the nearest integer. Notice that rounding it to an integer costs us essentially nothing, since we only have 23 bits of mantissa precision with 32-bit floating points anyway.
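If you want to convince yourself, recomputing it takes one line (lround and log come from the standard math header):

```c
#include <stdio.h>
#include <math.h>

int main(void) {
    /* 2^23 / ln(2) ~= 12102203.16; fast_exp uses it rounded to an integer. */
    double scale = 8388608.0 / log(2.0);   /* 8388608 = 2^23 */
    printf("2^23 / ln(2) = %.3f, rounded: %ld\n", scale, lround(scale));
    return 0;
}
```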
This one is more tricky. "Half" of it is straightforward, and the other half is an optimisation. The constant 1065307417 can be written as:

$$1065307417 = 127 \cdot 2^{23} - 45799.$$

Here, 127 is the bias of 32-bit floating points, it is shifted left 23 times (or multiplied by $2^{23}$), and 45799 is an optimisation constant computed to minimise the maximum relative error between our approximation of the exponential and the hardware-returned value of the exponential function. A mathematical derivation of the optimisation constant is presented in Schraudolph's paper. Perhaps sadly, these days, you can just brute-force one.
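For the curious, here is a sketch of what such a brute force could look like. The helper fast_exp_c below is just fast_exp with the additive constant exposed as a parameter; the candidate grid, input range and sampling step are arbitrary choices to be refined as needed:

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

typedef union F32 {
    float value;
    int32_t bits;
} F32;

/* Same approximation as fast_exp, but with the additive constant as a parameter. */
static float fast_exp_c(float f, int32_t c) {
    float fb = f * 12102203;
    F32 expo = {.bits = (int32_t)fb + c};
    return expo.value;
}

int main(void) {
    int32_t best_c = 0;
    double best_err = INFINITY;

    /* Coarse scan of candidate constants around 127 * 2^23 = 1065353216. */
    for (int32_t c = 1064800000; c <= 1065400000; c += 1000) {
        double max_err = 0.0;
        /* Sample the input range; the relative error pattern is (nearly)
         * periodic in f, so a modest range is enough to see the worst case. */
        for (double f = -20.0; f <= 20.0; f += 0.001) {
            double exact = exp(f);
            double err = fabs(fast_exp_c((float)f, c) - exact) / exact;
            if (err > max_err) max_err = err;
        }
        if (max_err < best_err) {
            best_err = max_err;
            best_c = c;
        }
    }
    printf("best constant: %d, sampled max relative error: %.4f\n", best_c, best_err);
    return 0;
}
```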
Perhaps the most interesting, and most illuminating, part is realising that reinterpreting a (non-negative) signed integer as a floating point is akin to applying the function $n \mapsto 2^{n/2^{23} - 127}$, and thus, an exponential. Let's make this a bit more precise.
Let $n$ be a non-negative integer, with binary representation $n = \sum_{i=0}^{31} b_i \, 2^i$ (with $b_{31} = 0$ since the integer is non-negative). Reinterpreting those bits as a float yields the floating point

$$\phi(n) = 2^{\lfloor n/2^{23} \rfloor - 127} \left( 1 + \frac{n \bmod 2^{23}}{2^{23}} \right).$$
Given the previous expression of $\phi(n)$, the goal is the following: for a given floating point $f$ that we wish to get the exponential of, we try to compute an integer $n$ such that $\phi(n)$ is as close as possible to $e^f$.
First, remark that $1 + m \approx 2^m$ for $m \in [0, 1)$, so

$$\phi(n) \approx 2^{\lfloor n/2^{23} \rfloor - 127} \cdot 2^{(n \bmod 2^{23})/2^{23}},$$

which implies

$$\phi(n) \approx 2^{\frac{n}{2^{23}} - 127}.$$

Thus, we want $2^{\frac{n}{2^{23}} - 127}$ close to $e^f = 2^{f/\ln 2}$. Let's derive this:

$$\frac{n}{2^{23}} - 127 = \frac{f}{\ln 2}$$

is equivalent to, by multiplying each side by $2^{23}$ and moving the bias term to the right-hand side,

$$n = \frac{2^{23}}{\ln 2} \, f + 127 \cdot 2^{23}.$$

We should recognise here (part of) the constants which were necessary to readjust the exponentiation induced by the bit-cast of an integer to a floating point.
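To get a feel for how tight the approximation $\phi(n) \approx 2^{n/2^{23} - 127}$ is, one can bit-cast a few arbitrary integers and compare; a small self-contained sketch:

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

typedef union F32 {
    float value;
    int32_t bits;
} F32;

int main(void) {
    /* A few arbitrary non-negative integers (exponent field neither 0 nor 255). */
    int32_t samples[] = {1065353216, 1069547520, 1083179008, 1234567890};
    for (int i = 0; i < (int)(sizeof samples / sizeof samples[0]); i++) {
        int32_t n = samples[i];
        F32 pun = {.bits = n};
        /* The smooth approximation 2^(n / 2^23 - 127) derived above. */
        double approx = pow(2.0, n / 8388608.0 - 127.0);
        printf("n = %d: bitcast = %g, 2^(n/2^23 - 127) = %g\n", n, pun.value, approx);
    }
    return 0;
}
```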
This kind of computation can be extended to understand what the approximation does between the points where it is exact (spoiler: it's a piecewise affine function), as well as to do a rigorous analysis of the errors. This is done in the original paper I mentioned at the very start.
So there you have it. Taking advantage of the hardware representation of floating points, it is possible to do an (a priori) fast approximation of the exponential function, with only one multiplication, one cast to int, and one addition.
If you were to check the relative error over the entire range of 32-bit floating points (with some attention to the edge cases around 0 and INF/NaN), you'd find it to be bounded by around 3%. This is not bad for what little work our function does, but it can be improved with further adjustments… at the cost of performance, of course.
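A sketch of such a check could look like the following; it samples a restricted input range, here [-80, 80], an arbitrary choice that keeps both the true exponential and the approximation well away from overflow, underflow and the 0/INF/NaN edge cases, rather than walking every bit pattern exhaustively:

```c
#include <stdio.h>
#include <stdint.h>
#include <math.h>

typedef union F32 {
    float value;
    int32_t bits;
} F32;

static float fast_exp(float f) {
    float fb = f * 12102203;
    F32 expo = {.bits = (int)fb + 1065307417};
    return expo.value;
}

int main(void) {
    double max_err = 0.0, worst = 0.0;
    /* Sample a range where both exp and the approximation stay within
     * normal floats; step and bounds are arbitrary. */
    for (double f = -80.0; f <= 80.0; f += 0.0001) {
        double exact = exp(f);
        double err = fabs(fast_exp((float)f) - exact) / exact;
        if (err > max_err) { max_err = err; worst = f; }
    }
    printf("sampled max relative error: %.4f at f = %g\n", max_err, worst);
    return 0;
}
```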
Now, as always with these kinds of old hacks, we must remember that hardware (and compilers) have come a long way, and it can be hard to predict whether these tricks are still more efficient than plain floating point operations under current hardware and compiler optimisations, and in which circumstances. Some of them may be better suited to modern hardware than others (perhaps easier to vectorise with SIMD, for instance, or useful in unoptimised shader code). One shouldn't implement them blindly (and nostalgically), but rather understand, profile, and validate.