The IEEE-754 standard for floating-point numbers was first introduced in 1985 to solve the problem of diverse floating-point implementations preventing code from being portable, and to improve stability across platforms.
It has been widely adopted and revised over the last 30 years, and if you have dealt with any real numbers in your applications, they were probably following this standard.
My work over the last year has had me analysing the error of different mathematical functions, how that error can accumulate, and how it can be mitigated through different programming patterns. One of the things I looked into was the basic mathematical functions used for activation functions in neural networks and how they can be approximated for better performance. In doing so, we met some resistance from people who are passionate about mathematical functions being correctly implemented and conforming to the standards, in particular a promise of 1 unit in the last place (ULP) of correctness for the elementary functions.
Being interested in further work on approximating these functions, I set out to discover how they are written to ensure the correctness they promise and, if they are only correct to 1 ULP, where on the function domain the mistakes lie.
In doing so, I have discovered that none of the major mathematical libraries used throughout computing are actually rounding correctly as demanded by any version of IEEE 754 after the original 1985 release.
Implementation
I am going to use the example of the sin function, as it is relatively simple to explain and works as a good demonstration of everything I am trying to show without burdening this page with tens of pages of results.
The ‘sin’ function repeats and has an offset symmetry in y, and each segment of the curve is the same shape in a different orientation. What I mean is that the sine curve repeats every 2π, that the part of the curve in negative y is a reflection of the part in positive y, and that every piece of the sine curve can be constructed from a single section.
This lets us do two things: treat any input x as the remainder of x/2π to find out where in the [0, 2π) range we lie (this is commonly called range reduction), and then determine which section of the curve we are in and evaluate a more precise approximation of that simpler curve.
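Concretely, the symmetries being exploited are the quarter-wave identities, which are easy to sanity-check numerically in Julia:

# Quarter-wave symmetries of sine, checked numerically:
x = 0.3
sin(x + pi/2)  ≈  cos(x)    # second quarter looks like a cosine
sin(x + pi)    ≈ -sin(x)    # third quarter is a negated sine
sin(x + 3pi/2) ≈ -cos(x)    # fourth quarter is a negated cosine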
Without getting into too much detail, the implementation looks a little bit like this:
function sinnaive(x::AbstractFloat)
    # Take the value and range reduce to [0, pi/2)
    C = pi / 2.0
    k = Int64(floor(x / C))
    phase = k % 4           # naive range reduction
    y = x - (k * C)
    if phase == 0
        return approxsincurve(y)
    elseif phase == 1
        return approxcoscurve(y)
    elseif phase == 2
        return -approxsincurve(y)
    elseif phase == 3
        return -approxcoscurve(y)
    else
        @assert false "Invalid Range Reduction"
    end
    return Inf
end
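The kernels approxsincurve and approxcoscurve are left undefined above. As hypothetical stand-ins, truncated Taylor polynomials are enough to make the sketch runnable; real libraries instead use carefully fitted minimax polynomials:

# Hypothetical stand-ins for the polynomial kernels. Real implementations
# use minimax polynomials fitted over [0, pi/2]; truncated Taylor series
# are only here to make the sketch above runnable.
approxsincurve(y) = y - y^3/6 + y^5/120 - y^7/5040
approxcoscurve(y) = 1 - y^2/2 + y^4/24 - y^6/720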
This makes the secret to getting really good, well-fitted sine results rely on getting a very good approximation of the curve segments and making sure you are in the right place. There are lots and lots of other tricks that go into making these implementations as accurate as they are, mostly ensuring correct rounding of sub-expressions and making use of high-internal-precision instructions where possible (fused multiply-add, for example).
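To give a flavour of those tricks, here is a minimal sketch of one of them, Cody-Waite style range reduction, where the constant π/2 is split into a high part and a low part so the subtraction cancels far less precision (the split constants below are the ones I believe fdlibm-style implementations use; treat the exact values as an assumption of this sketch):

# Cody-Waite style range reduction: pi/2 split into high and low parts
# so that x - k*(pi/2) loses much less accuracy to cancellation.
const PIO2_HI = 1.57079632673412561417e+00   # leading bits of pi/2
const PIO2_LO = 6.07710050650619224932e-11   # the bits that were cut off
reduce_codywaite(x, k) = (x - k * PIO2_HI) - k * PIO2_LO

# fma(a, b, c) computes a*b + c with a single rounding step, another
# common source of extra internal precision in these implementations.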
But naturally, as this relies on an approximation at the base level, it is very tricky to get a perfect answer when working in limited precision. This is where we get the 1 ULP error bound that all the engineers who work on this will tell you about.
Here is an interesting thing though: as the exact technique to use isn’t specified in the standard, and correctness may rely on hardware instructions, we are left with different implementations producing errors in different positions. As an example, here is the sin function from Microsoft’s CRT library that is used for C and C++ in the MSVC compiler, and the results from the implementation of sin in the Julia language (a language which is being pushed as a fast numerical-computing alternative to Python).
The vertical lines each represent a value which is incorrectly rounded. In total, Julia has fewer of them than Microsoft’s implementation. However, they are much more spread out, whereas Microsoft has chosen an implementation which yields large blocks of correct results with all the errors concentrated into a narrow region roughly every π/2.
If we take the engineers at their word, these are both accurate to within 1 ULP, but they achieve it in very different ways. With this method of implementation we are effectively producing a side-channel of information which could tell a user exactly which library you linked against, if they are able to query basic functions in your application. Probably mostly harmless, but it is information the common user probably doesn’t know is there.
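As a hypothetical sketch of what that side-channel could look like: probe sin at a handful of inputs where implementations are known to disagree and hash the exact bit patterns. The probe set here is made up; a real fingerprint would be built from measured mismatch locations like the ones plotted above:

# Hypothetical library fingerprint: implementations that differ by even
# 1 ULP at any probe input produce a different hash.
function libm_fingerprint(probes::Vector{Float32})
    return hash([reinterpret(UInt32, sin(p)) for p in probes])
end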
Non-compliance
It was this thought of side-channel information that led me to consult different mathematical software communities, to investigate whether people were aware of this and whether the different concentrations of errors had ever caused any problems. My thinking was: “If the error is concentrated, there is likely less consistency, and gradient descent algorithms may incur reduced efficiency, as sequential numbers are more likely than average to be the same if rounded in the wrong direction”.
I was resoundingly told that the absolute error in the numbers is too small to be a problem. Frankly, I did not believe this. Having worked on games where tiny floating-point errors propagated into large visual messes, I wanted to look further. So I consulted the standard to see if there were any extra conditions on this 1 ULP error I had been told about, so that we wouldn’t have the cross-platform mismatches the standard originally sought to prevent.
This is where I found out that the 1 ULP constraint on elementary functions is not in the standard. It is not in the 1985 version: the 1985 version places no restrictions on the correctness of these functions. To my understanding, the only mention of them is to say that the standard is provided to guide their development.
In the next revision, IEEE 754-2008, that changes a lot. It has “Section 9. Recommended operations”, which lists all the recommended functions, the exceptions they should throw, their domains and, most importantly, how correct they should be:
So what does correctly rounded mean? It means less than 0.5 ULP of error. It means that there should be zero mismatches between the correct result and the result given. It means that all IEEE-compliant mathematics libraries should return the same results when in the same rounding mode. This statement remains in the IEEE 754-2019 revision that was published only last July.
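In code terms, a correctly rounded 32-bit sin is what you get by evaluating an (effectively) infinitely precise result and rounding it exactly once, which we can express with a high-precision reference (256 bits of precision is an arbitrary but comfortable choice here):

# The correctly rounded result: evaluate at high precision, round once.
setprecision(BigFloat, 256)
reference_sin(x::Float32) = Float32(sin(BigFloat(x)))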
All the standard maths libraries I have tested which claim to be IEEE compliant fail to meet this constraint.
Response
I initially spoke to a friend of mine from an old games company where we used to work together, as we had run into problems with these functions when trying to do deterministic planet generation. People on different machines were seeing different patterns being generated, which broke an aspect of our multiplayer game. We had chalked this up to GPUs having their own rules and the 1 ULP error on the CPU side. I want to open this little sub-section with that anecdote to show that this non-conformance does have real-world impact on sensitive applications.
I next brought my findings to the Julia community (as I have been using Julia for numerical computing in my approximation-analysis tools) and to the C++ community, because that is where I am most knowledgeable.
The response I received was very jovial but discouraging. It seems that many people who are involved with implementing these functions are aware that they do not follow that requirement of the standard but are blasé about it. Others stated things which showed they did not understand, or had not read, the standard. And some dismissed the requirements of the standard altogether on the grounds that they were too restrictive.
There were some very interesting quotes (names removed) when I asked why this is known about but not fixed, ranging from benign disregard for the standard to placing the burden of correctness on the user, who should know that the functions are wrong:
My favourite response, however, was in the Julia community. A backhanded call-to-arms, if you will:
I am not a great writer, nor the most diplomatic person you would want answering your e-mails. But I do care about performance, and 30x feels to me like an overstatement of the cost of correct rounding at 32 bits. So I figured: I will handle the complaints by implementing it in a way which isn’t 30x slower, and then there will be no complaints and my job is done!
IEEE Correct Implementation
As it turns out, most modern 64-bit processors have pretty decent 64-bit floating-point units. Which leads me to think: why don’t we just implement 32-bit sin as 64-bit sin and round the result when we are done? The theory is that in most cases the rounded value will be correct; it should only be incorrect in the rare cases of double rounding that can occur. At the density of numbers in the output range -1.0f to 1.0f this is both incredibly unlikely to occur and very easy to check, as we can trivially test every input value in the reduced range of 0.f to 2.f*pi on a modern machine, and go much further if we want absolute confidence.
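For anyone unfamiliar with double rounding, here is a hand-constructed illustration (not taken from sin itself) of how rounding to Float64 first and then to Float32 can land on the wrong neighbour:

# Double-rounding hazard: the true value sits just above the halfway point
# between two Float32 neighbours, but rounding it to Float64 first lands
# exactly on that halfway point, and the second rounding then goes the
# wrong way under round-to-nearest, ties-to-even.
midpoint  = 1.0 + 2.0^-24                 # halfway between 1.0f0 and nextfloat(1.0f0)
truevalue = big(midpoint) + big(2.0)^-60  # the "infinitely precise" result
Float32(Float64(truevalue))               # 1.0f0        (double rounded: wrong)
Float32(truevalue)                        # 1.0000001f0  (rounded once: correct)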
So we implement 32-bit sin like this:
function IEEE_sin(x::Float32, correctlyRounded=true)::Float32
    if correctlyRounded
        # evaluate in double precision, then round once back to single
        return Float32(sin(Float64(x)))
    else
        return sin(x)
    end
end

# fallback for all other types: defer to the existing implementation
function IEEE_sin(x::Any, correctlyRounded=true)::Any
    return sin(x)
end
And then we test it.
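The exact harness I used isn’t reproduced here, but a minimal sketch of the idea, walking every Float32 in the reduced range and comparing against a BigFloat reference, looks like this (exhaustive_test is my name for this sketch, not a real tool):

# Minimal sketch of the exhaustive test: visit every Float32 in the range
# and compare against a high-precision reference rounded to Float32.
setprecision(BigFloat, 256)
function exhaustive_test(f; lo=0.0f0, hi=Float32(2pi))
    mismatches = 0
    maxulp = 0.0
    x = lo
    while x <= hi
        exact = sin(big(x))        # high-precision reference
        ref   = Float32(exact)     # the correctly rounded answer
        got   = f(x)
        got != ref && (mismatches += 1)
        # error in ULPs, measured against the high-precision result
        maxulp = max(maxulp, Float64(abs(exact - got) / eps(ref)))
        x = nextfloat(x)
    end
    return (mismatches, maxulp)
end

Running something like exhaustive_test(IEEE_sin) and exhaustive_test(sin) produces the kind of numbers reported below.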
The current Julia implementation:
sin: Max Err(ULP): 5.00897385615018681127840279120244109e-01. Mean Err(ULP): 2.50005459168491000625254536923056698e-01
sin: Max Err(Abs): 5.9604645e-8. Mean Err(Abs): 7.545918e-12
sin: Mismatch Count: 10827
Our IEEE Compliant Implementation:
IEEE_sin: Max Err(ULP): 4.99999995510513399926379631449132211e-01. Mean Err(ULP): 2.50005375827442751842250455717785186e-01
IEEE_sin: Max Err(Abs): 0.0. Mean Err(Abs): 0.0
IEEE_sin: Mismatch Count: 0
Performance Comparison:
Base.sin: Trial(1.042 ms)
IEEE compliant sin: Trial(1.041 ms)
What we end up with is a function which is correctly rounded, consistent across implementations, and which runs (based on benchmarking using the standard Julia tools) at roughly the same cost.
I repeated these tests for a large selection of the recommended functions in that section of IEEE 754. This approach did not work for log, which still produced a small number of mismatches. There are a number of ways to fix this; the slow way is to go to higher precision, but I also experimented with measuring the distance to the nearest float at a higher precision and then doing a selection. This seems to have resolved the issue.
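My reading of that selection fix, as a sketch: compute in Float64 as before, but when the Float64 result lands suspiciously close to halfway between two Float32 neighbours, fall back to higher precision to decide the rounding. The IEEE_log name follows the IEEE_sin naming above, and the threshold is an assumption for illustration:

# Sketch of the selection fix for log: detect results that are nearly
# halfway between two Float32 values and resolve only those at high precision.
function IEEE_log(x::Float32)::Float32
    y64 = log(Float64(x))
    y32 = Float32(y64)
    # distance from the Float64 result to its Float32 rounding, in Float32 ULPs
    frac = abs(y64 - Float64(y32)) / Float64(eps(y32))
    if frac > 0.49999                # threshold chosen for illustration only
        return Float32(log(big(x)))  # slow path: round once from BigFloat
    end
    return y32
end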
At the end of this work, I had a selection of most of the recommended functions in 32-bit form, all correctly rounded. This worked in all the languages I tested.
Conclusion
To wrap this up: it seems we could be compliant with IEEE 754 in 32-bit implementations without a significant impact on performance on most modern hardware. Being correct for all values at higher precision is more expensive and difficult to verify (the table maker’s dilemma), but that is no reason not to make such a function available, as the standard says, rather than picking an arbitrary level of error and sticking to it without any constraints on the negative side-effects this may cause.
At the end of the day, the point of the standard is to prevent unexpected behaviour, and the current approaches have lots of unexpected behaviour in them, especially between implementations. We can start by fixing the 32-bit implementations and then move on to the more difficult problems.
A criticism I faced when presenting this was that many people thought I didn’t want the fast 1 ULP solutions. I do want those; in fact, I want a dynamic library implementing arbitrary levels of precision so that we do not waste processing on work we do not have to do. It seems we have all been doing that without knowing, because of this hidden quirk of all the libraries not following this specific part of the standard.
In my dynamic library, though, I would ensure that the results are deterministic between implementations. sin(x) in library A really should match sin(x) in library B. Without basic notions of function and value equality, the whole floating-point model falls apart in quite scary ways.
Thanks for reading!
For anyone hiring, I am coming to the end of my PhD and looking for interesting work. If that is you feel free to contact me at ngt26 at cam.ac.uk.