Tracking the Floating Point Error
Floating Point Format
Before discussing the error of floating point numbers, let us first review their format. There are two types of floating point numbers: binary floating point and decimal floating point. In this article, the discussion focuses on binary floating point, which is commonly used for representing real numbers in computers.
The basic concept of the floating point number can be described by the following:
- Sign bit $s$, which is a 1-bit field, represents the positive or negative sign of the value
- Exponent bits $e$ describe the power of 2 as $2^{e-\mathrm{bias}}$
- Significand bits divide the range between $2^{e-\mathrm{bias}}$ and $2^{e-\mathrm{bias}+1}$ into $2^t$ equal intervals to represent real numbers in between, where $t$ is the number of significand bits
This concept can be described by Eq. (1), and Figure 1 shows a 32-bit floating point number representing PI as an example.

$$x = (-1)^s \times 2^{e-\mathrm{bias}} \times \left(1 + \frac{m}{2^t}\right) \tag{1}$$

Here $m$ is the integer stored in the significand bits.
Figure 1: Structure of the IEEE 754 floating point format, with PI as a 32-bit floating point number example.
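As a quick illustration, the bit layout above can be inspected directly in code. The following C++ sketch assumes the platform's `float` is IEEE binary32, reinterprets its bits as a 32-bit integer, and masks out the sign, exponent, and significand fields of PI.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    const float x = 3.14159265f;  // PI as a 32-bit floating point number

    // Reinterpret the float bits as an unsigned 32-bit integer.
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits));

    const unsigned sign        = bits >> 31;            // 1 bit
    const unsigned exponent    = (bits >> 23) & 0xFFu;  // 8 bits, biased by 127
    const unsigned significand = bits & 0x7FFFFFu;      // 23 bits

    std::printf("bits        = 0x%08X\n", (unsigned)bits);  // 0x40490FDB for PI
    std::printf("sign        = %u\n", sign);                 // 0
    std::printf("exponent    = %u (2^%d)\n", exponent, (int)exponent - 127);  // 128 (2^1)
    std::printf("significand = 0x%06X\n", significand);      // 0x490FDB
    return 0;
}
```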
For typical floating point formats in 16, 32, 64, or 128 bits, the corresponding parameters are specified by the following table.
Parameter | binary16 | binary32 | binary64 | binary128 |
---|---|---|---|---|
bias | 15 | 127 | 1023 | 16383 |
sign bit, $s$ | 1 | 1 | 1 | 1 |
exponent bits, $w$ | 5 | 8 | 11 | 15 |
significand bits, $t$ | 10 | 23 | 52 | 112 |
precision in bits, $p = t + 1$ | 11 | 24 | 53 | 113 |
unit roundoff, $u = 2^{-p}$ | $2^{-11}$ | $2^{-24}$ | $2^{-53}$ | $2^{-113}$ |
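The unit roundoff values in the table can be cross-checked with the standard library. Note that `std::numeric_limits<T>::epsilon()` returns the machine epsilon $2^{-t}$, so the unit roundoff under round-to-nearest is half of it; the sketch below assumes `float` and `double` are binary32 and binary64.

```cpp
#include <cstdio>
#include <limits>

int main() {
    // epsilon() is the spacing between 1.0 and the next representable value, i.e. 2^-t.
    // The unit roundoff under round-to-nearest is half of that: u = 2^-(t+1) = 2^-p.
    const double u32 = std::numeric_limits<float>::epsilon() / 2.0;   // expect 2^-24
    const double u64 = std::numeric_limits<double>::epsilon() / 2.0;  // expect 2^-53

    std::printf("binary32 unit roundoff: %g\n", u32);  // ~5.96046e-08
    std::printf("binary64 unit roundoff: %g\n", u64);  // ~1.11022e-16
    return 0;
}
```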
Rounding Error
To handle real numbers in a computer, a given number has to be converted to a floating point number. In IEEE 754-2019 [1], the following five rounding-direction attributes are defined for this conversion. For binary floating point numbers, the default attribute is roundTiesToEven.
Rounding Attribute | Rounding to … |
---|---|
roundTiesToEven | the nearest floating point number. If two floating point numbers are equally near, one with an even least significant digit shall be returned. |
roundTiesToAway | the nearest floating point number. If two floating point numbers are equally near, one with greater magnitude shall be returned. |
roundTowardPositive | the floating point number closest to and no less than the given number. |
roundTowardNegative | the floating point number closest to and no greater than the given number. |
roundTowardZero | the floating point number closest to and no greater in magnitude than the given number. |
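The behavior of roundTiesToEven can be observed by converting a double value that lies exactly halfway between two adjacent float values. In the sketch below (the halfway values are chosen for illustration), the tie is resolved toward the candidate whose least significant significand bit is even.

```cpp
#include <cstdio>

int main() {
    // Near 1.0, adjacent binary32 values are spaced 2^-23 apart.
    // 1 + 2^-24 lies exactly halfway between 1.0 and 1 + 2^-23.
    const double half_way_low  = 1.0 + 0x1p-24;      // ties: 1.0 (even) vs 1 + 2^-23 (odd)
    // 1 + 3*2^-24 lies exactly halfway between 1 + 2^-23 and 1 + 2^-22.
    const double half_way_high = 1.0 + 3 * 0x1p-24;  // ties: 1 + 2^-23 (odd) vs 1 + 2^-22 (even)

    const float rounded_low  = (float)half_way_low;   // rounds down to 1.0
    const float rounded_high = (float)half_way_high;  // rounds up to 1 + 2^-22

    std::printf("%.10f -> %.10f\n", half_way_low,  (double)rounded_low);
    std::printf("%.10f -> %.10f\n", half_way_high, (double)rounded_high);
    return 0;
}
```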
Error Estimation
When representing a real number $x$ as a floating point number $\mathrm{fl}(x)$, the amount of error varies depending on $x$. However, the error ratio can generally be described in the following form, where $u$ is the unit roundoff.

$$\frac{|\mathrm{fl}(x) - x|}{|x|} \le u$$

As the first step, the absolute amount of error is smaller than the spacing of adjacent floating point numbers, i.e. the amount of 1 significand bit. With $e$ denoting the exponent part of $\mathrm{fl}(x)$ and $t$ the number of significand bits,

$$|\mathrm{fl}(x) - x| < 2^{e - \mathrm{bias} - t}.$$

Furthermore, assuming that $x$ is rounded to the nearest floating point number, the amount of error can be reduced to half.

$$|\mathrm{fl}(x) - x| \le 2^{e - \mathrm{bias} - t - 1}$$

Based on this fact, the error ratio can be evaluated as shown below.

$$\frac{|\mathrm{fl}(x) - x|}{|x|} \le \frac{2^{e - \mathrm{bias} - t - 1}}{2^{e - \mathrm{bias}}} = 2^{-t-1} = u$$

If the exponent part of the floating point number is $e$, the range of the original real number should be roughly $2^{e - \mathrm{bias}} \le |x| < 2^{e - \mathrm{bias} + 1}$, although $x$ may also lie slightly below $2^{e - \mathrm{bias}}$ and still round up to it. So, it looks insufficient to represent the lower bound of $|x|$ by $2^{e - \mathrm{bias}}$ when evaluating the error ratio. However, if $|x| < 2^{e - \mathrm{bias}}$, then $x$ belongs to the binade below, where the spacing is halved, so the maximum error is less than half of $2^{e - \mathrm{bias} - t - 1}$. Thus, such a case does not correspond to the maximum error ratio, and $2^{e - \mathrm{bias}}$ can be used as the lower bound of $|x|$ for the error ratio evaluation.
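As a numerical sanity check of this bound, the following sketch rounds a few real values to binary32, using double as a stand-in for the exact value, and compares the measured error ratio against $u = 2^{-24}$. The sample values are arbitrary.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    // Treat double as a stand-in for the exact real value and float as the rounded one.
    const double values[] = {3.141592653589793, 0.1, 1.0 / 3.0, 12345.6789};
    const double u = std::numeric_limits<float>::epsilon() / 2.0;  // 2^-24

    for (const double x : values) {
        const float  fx  = (float)x;                                   // round to nearest binary32
        const double err = std::fabs((double)fx - x) / std::fabs(x);   // error ratio
        std::printf("x = %-22.16g relative error = %.3e (u = %.3e)\n", x, err, u);
    }
    return 0;
}
```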
Error Amount for Inner Product
When $a$ and $b$ are error-free floating point numbers, it is common to assume that the error ratio caused by a single arithmetic operation is described by

$$\mathrm{fl}(a \circ b) = (a \circ b)(1 + \delta), \qquad |\delta| \le u, \qquad \circ \in \{+, -, \times, /\}.$$
For now, we won’t go into the details of the internal calculation steps of each operation, but accept this assumption for further error evaluation. The main focus of the following discussion is the error amount of the inner product as shown in Eq. (7). Since the amount of error depends on the order of operations, we assess the error amount when executing the operations from the left side to the right side.

$$s_n = \sum_{i=1}^{n} a_i b_i \tag{7}$$
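The dependence on the order of operations is easy to observe even for a plain sum; the following small sketch (with arbitrarily chosen values) shows that regrouping three float additions changes the result.

```cpp
#include <cstdio>

int main() {
    const float a = 1.0e8f, b = -1.0e8f, c = 1.0f;

    // Left-to-right: (a + b) cancels exactly, then adding c gives 1.
    const float left_to_right = (a + b) + c;
    // Different grouping: c is smaller than half the spacing (8) of float values
    // near 1e8, so b + c rounds back to b and c is lost entirely.
    const float regrouped     = a + (b + c);

    std::printf("(a + b) + c = %g\n", left_to_right);  // 1
    std::printf("a + (b + c) = %g\n", regrouped);      // 0
    return 0;
}
```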
Looking into the inner product operation concretely from the first terms, the evaluated values are the following.

$$\hat{s}_1 = \mathrm{fl}(a_1 b_1) = a_1 b_1 (1 + \delta_1)$$

$$\hat{s}_2 = \mathrm{fl}(\hat{s}_1 + a_2 b_2) = \bigl(a_1 b_1 (1 + \delta_1) + a_2 b_2 (1 + \delta_2)\bigr)(1 + \delta_3)$$

The $n$-th order inner product can be described by the recursion

$$\hat{s}_n = \mathrm{fl}(\hat{s}_{n-1} + a_n b_n) = \bigl(\hat{s}_{n-1} + a_n b_n (1 + \delta_{2n-2})\bigr)(1 + \delta_{2n-1}).$$

Expanding this recursion, each term $a_i b_i$ is multiplied by at most $n$ factors of the form $(1 + \delta_k)$. Since $|\delta_k| \le u$ for any $k$, it is possible to bound the error range in a simpler form. In the discussion of [2], the range of the multiple product is described by the following relation: if $|\delta_i| \le u$ and $\rho_i = \pm 1$ for $i = 1, \dots, n$, and $nu < 1$, then

$$\prod_{i=1}^{n} (1 + \delta_i)^{\rho_i} = 1 + \theta_n, \qquad |\theta_n| \le \frac{nu}{1 - nu} =: \gamma_n. \tag{12}$$
Using this relation, the amount of error of the inner product can be bounded as shown below.

$$|\hat{s}_n - s_n| \le \gamma_n \sum_{i=1}^{n} |a_i b_i|$$
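The following sketch checks this bound numerically: it evaluates an inner product in binary32 from left to right, uses a binary64 evaluation as the reference value, and compares the observed error against $\gamma_n \sum_i |a_i b_i|$. The test vectors are arbitrary.

```cpp
#include <cmath>
#include <cstdio>
#include <limits>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<float> a(n), b(n);
    for (int i = 0; i < n; ++i) {          // arbitrary, reproducible test vectors
        a[i] = 1.0f / (float)(i + 1);
        b[i] = (float)std::cos(0.1 * i);
    }

    float  s32 = 0.0f;     // left-to-right inner product in binary32
    double s64 = 0.0;      // reference computed in binary64
    double abs_sum = 0.0;  // sum of |a_i * b_i|
    for (int i = 0; i < n; ++i) {
        s32 += a[i] * b[i];
        s64 += (double)a[i] * (double)b[i];
        abs_sum += std::fabs((double)a[i] * (double)b[i]);
    }

    const double u       = std::numeric_limits<float>::epsilon() / 2.0;  // 2^-24
    const double gamma_n = n * u / (1.0 - n * u);
    const double error   = std::fabs((double)s32 - s64);

    std::printf("observed error = %.3e\n", error);
    std::printf("gamma_n bound  = %.3e\n", gamma_n * abs_sum);
    return 0;
}
```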
Bernoulli’s Inequality
Actually, the relation of Eq. (12) can be written in a slightly tighter form, as shown in the equations below [3]. The discussion assumes the parameter range of $x > -1$, $nx < 1$, and an integer $n \ge 1$.

$$(1 + x)^n \le \frac{1}{1 - nx}$$

$$(1 + x)^n \ge 1 + nx \tag{14}$$

The lower bound, Eq. (14), corresponds to Bernoulli’s inequality, and it can be confirmed easily by mathematical induction. When $n = 1$, the following inequality is valid (with equality).

$$(1 + x)^1 \ge 1 + 1 \cdot x$$

Assuming that $(1 + x)^n \ge 1 + nx$ is valid, the following relation can be obtained by multiplying $(1 + x) > 0$ on both sides.

$$(1 + x)^{n+1} \ge (1 + nx)(1 + x) = 1 + (n + 1)x + nx^2 \ge 1 + (n + 1)x$$
This completes the induction proof for Eq. (14).
Applying $x = -u$ to Eq. (14), we obtain the following relation.

$$(1 - u)^n \ge 1 - nu$$
If $nu < 1$, the following relation is obtained.

$$(1 - u)^{-n} \le \frac{1}{1 - nu} = 1 + \frac{nu}{1 - nu} = 1 + \gamma_n$$
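As a small numerical confirmation, the sketch below compares $(1 - u)^{\pm n}$ with the bounds $1 + \gamma_n$ and $1 - nu$ for binary32's unit roundoff and a few values of $n$ (chosen arbitrarily, with $nu < 1$).

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
    const double u    = std::numeric_limits<float>::epsilon() / 2.0;  // binary32 unit roundoff, 2^-24
    const int    ns[] = {10, 1000, 100000};

    for (const int n : ns) {
        const double gamma_n = n * u / (1.0 - n * u);
        const double upper   = std::pow(1.0 - u, -n);  // (1 - u)^-n
        const double lower   = std::pow(1.0 - u,  n);  // (1 - u)^n
        std::printf("n = %6d: (1-u)^-n - 1 = %.6e <= gamma_n = %.6e, "
                    "(1-u)^n - 1 = %.6e >= -nu = %.6e\n",
                    n, upper - 1.0, gamma_n, lower - 1.0, -n * u);
    }
    return 0;
}
```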
References
- [1] “IEEE Standard for Floating-Point Arithmetic,” IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
- [2] Nicholas J. Higham, “Accuracy and Stability of Numerical Algorithms,” Second Edition, Society for Industrial and Applied Mathematics, 2002.
- [3] Dragoslav S. Mitrinović, “Analytic Inequalities,” Springer, Berlin, Heidelberg, 1970.