Floating points are absolutely ubiquitous in programming these days, and virtually every architecture implements the IEEE 754 standard. Paradoxically, it seems few software engineers are aware of the definition and subtleties of the representation, and even fewer know of the kind of abuse they lend themselves to.
An almost mythical example of such hacks is the (misattributed) Quake III fast inverse square root algorithm:
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
} There is nothing I can say about it that has not been extensively said already and isn’t documented and linked from the Wikipedia page, so I won’t waste any time on that and focus on the exponential instead. By the end, that should still demystify the line:
i = 0x5f3759df - ( i >> 1 ); // what the fuck?Although there has historically been many ways to represent floating points (different bases, with or without gradual underflow etc…), we have eventually settled on a standard that I’ll explain. Unless you code on some pretty old/esoteric machine, or do some AI stuff that require the usage of more exotic formats, chances are this is what you’re dealing with.
I’ll focus on 32-bit floating points (float in C/C++) here, but the standard is similarly defined for 16-bit and 64-bit floating points, just with a couple different constants and bit partition.
Let be a 32-bit floating point. Its binary expression is partitioned as follows:
| s | e7 | e6 | e5 | e4 | e3 | e2 | e1 | e0 | m22 | m21 | … | m1 | m0 |
|---|
Let
then,
Remark: There may be something surprising about having the exponent being the biased integer and using the binary representation of the exponent as an unsigned integer, rather than doing the usual “ones complement” representation to have a signed integer instead. This is for good reasons as it allows the ordering of floating points to coincide with the lexicographic ordering of their binary representation.
There are some subtleties here, as the previous formula does not hold for or , which allows on the one hand for gradual underflow with subnormal numbers and for having dedicated representations of infinity and NaN on the other hand. It’s good to be aware of it to have the full picture, but we won’t need that here and we ignore edge cases for simplicity.
Now we know what floating points look like under the hood, let’s see how to abuse them to approximate the exponential. Let’s start with a somewhat less hacky way to do type punning than is used in the fast inverse square root algorithm, by defining the following union:
typedef union F32 {
float value;
int32_t bits;
} F32;The C++ masochists are welcome to reinterpret_cast
or bit_cast to their hearts’ content instead.
With that, we can now also write a seemingly undecipherable algorithm to approximate the exponential.
float fast_exp(float f) {
float fb = f * 12102203;
F32 expo = {.bits = (int)fb + 1065307417};
return expo.value;
}There are only three lines, and essentially three things to explain in that piece of code:
int, then the
reinterpretation of the bits to a floating point achieve?Why 12102203? Well, it turns out this is the value of . Notice this is an integer because we have 23 bits of precision with 32-bit floating points, and because is in the range .
This one is more tricky. “Half” of it is straightforward, and the other one is an optimisation. The constant 1065307417 can be written as:
Here, 127 is the bias of 32-bit floating points, it is shifted left 23 times (or multiplied by ), and 45779 is an optimisation constant computed to minimise the maximum relative error between our approximation of exponential and the hardware returned value of the exponential function. A mathematical derivation of the optimisation constant is presented in Schraudolph’s paper. These days, computers are powerful enough that you could brute-force one out (though you’d have to do the same for every floating point format, instead of having a closed formula…).
Perhaps the most interesting, and most illuminating part, is realising that reinterpreting a (non-negative) signed integer as a floating point is akin to applying the function , and thus, the exponential. Let’s make this a bit more precise.
Let be a non-negative integer, with binary representation ( since the integer is non-negative). Reinterpreting bits as a float yields the floating point
Given the previous expression of , the goal is the following: for a given floating point that we wish to get the exponential of, we try and compute an integer such that is as close as possible to .
First, remark that so which implies
Thus, we want close to . Let’s derive this:
is equivalent to, by multiplying each side by and we get We should recognise here (part of) the constants which were necessary to readjust the exponentiation induced by the bit-cast of an integer to a floating point.
This kind of computation can be extended to understand what happens between the and images — spoilers: it’s a piecewise affine function — as well as doing a rigorous analysis of the errors. This is done in the original paper I mentioned at the very start.
So there you have it. Taking advantage of the hardware
representation of floating points, it is possible to do a (a
priori) fast approximation of the exponential function, with
only one multiplication, one cast to int, and one
addition.
If you were to check the relative error on the entire 32-bit floating points range (with some attention about possible edge cases with 0 and INF/NaN), you’ll find it to be bounded by around 3%. This is not bad for what little work our function does but can be improved with further adjustments… at the cost of performance of course.
Now, as always with these kinds of old hacks, we must remember that hardware (and compilers) have come a long way. In particular, we have historically optimised to remove a lot of the bottlenecks of old (for instance, we went from being mostly limited by the speed of floating point arithmetics to being limited by memory).
Because of this, more often than not, what used to be efficient is now counterproductive. Some of them can certainly be adapted on more modern hardware and still provide a speedup, but it probably won’t come “for free” anymore. And in many cases, modern hardware handles these for us in some form. Time is far better spent understanding how to leverage SIMD instructions than doing obscure type punning and constructions of magic constants in software (whether this is what actually happens in hardware or not).