# Notes on floating-point numbers

Date: 2020-10-19

I read a number of articles on IEEE 754 floating-point arithmetic and its implementations:

• IEEE 754 (IEEE 754-2019)
• Floating-point arithmetic
• Significand
• JavaScript floating point number trap and solution
• Basic field: on floating point numbers
• In depth analysis of floating point numbers
• What is the difference between quiet NaN and signaling NaN?
• Why is the largest safe integer in JavaScript 2 to the 53rd power minus one?
• How numbers are encoded in JavaScript
• How to understand the rounding scheme of IEEE754? -Haifeng’s answer – Zhihu
• ECMA 262
• Discussion on the precision of IEEE 754 floating point number
• Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic – Kahan
• What Every Computer Scientist Should Know About Floating-Point Arithmetic
• What you should know about floating point numbers

Notes from the reading (unless otherwise specified below, the symbols used in expressions are the initial letters of the corresponding English terms; "floating-point number" means binary floating-point number; everything is based on IEEE 754-2019)

| Name | Radix | Significand bits (including 1 implicit integer bit) | Decimal digits (precision = lg(2^significand bits)) | Exponent bits | Bias (fixed offset value) | E min | E max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| binary16, half precision | 2 | 1 + 10 = 11 | lg(2^11) ≈ 3.31 | 5 | 2^(5-1) − 1 = 15 | 1 − 15 = −14 | 2^(5-1) − 1 = +15 |
| binary32, single precision | 2 | 24 | 7.22 | 8 | 127 | −126 | +127 |
| binary64, double precision | 2 | 53 | 15.95 | 11 | 1023 | −1022 | +1023 |
| binary128, quadruple precision | 2 | 113 | 34.02 | 15 | 16383 | −16382 | +16383 |
| binary256, octuple precision | 2 | 237 | 71.34 | 19 | 262143 | −262142 | +262143 |
``````
                       31 (sign)
                       |  30    23  22                    0
                       |  |      |  |                     |
type                   s  exponent  fraction                 value
---------------------  -  --------  -----------------------  -----
special value (zero)   *  00000000  00000000000000000000000  ±0.0

min subnormal number   *  00000000  00000000000000000000001  ±2^−23 × 2^−126 = ±2^−149 ≈ ±1.4×10^−45
max subnormal number   *  00000000  11111111111111111111111  ±(1 − 2^−23) × 2^−126 ≈ ±1.18×10^−38

min normal number      *  00000001  00000000000000000000000  ±2^−126 ≈ ±1.18×10^−38
±1.0                   *  01111111  00000000000000000000000  ±1.0
max normal number      *  11111110  11111111111111111111111  ±(2 − 2^−23) × 2^127 ≈ ±3.4×10^38

special value (±∞)     *  11111111  00000000000000000000000  ±∞
special value (qNaN)   0  11111111  10000000000000000000000  qNaN
special value (sNaN)   0  11111111  01000000000000000000000  sNaN

* = either sign bit (0 is positive, 1 is negative)
The implicit integer bit sits between the exponent and the fraction and is not stored.

32-bit single-precision floating-point number
``````
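The columns of the parameter table all follow from two numbers, the exponent width and the fraction width. The sketch below reproduces that arithmetic; the helper name `formatParams` and its argument names are made up for illustration and are not part of any library.

```javascript
// Derive the table's columns from the exponent and fraction widths.
// `formatParams` is a hypothetical helper used only for this illustration.
function formatParams(exponentBits, fractionBits) {
  const significandBits = fractionBits + 1;   // stored fraction plus the implicit integer bit
  const bias = 2 ** (exponentBits - 1) - 1;   // fixed offset value
  return {
    significandBits,
    decimalDigits: significandBits * Math.log10(2), // lg(2^significandBits)
    bias,
    eMin: 1 - bias,
    eMax: bias,
  };
}

const single = formatParams(8, 23);
const double = formatParams(11, 52);
console.log(single.bias, single.eMin, single.eMax);
console.log(double.significandBits, double.decimalDigits.toFixed(2));
```

Calling `formatParams(5, 10)` or `formatParams(15, 112)` gives the half-precision and quadruple-precision rows in the same way.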

## A floating-point number is stored in three parts

1. S: the sign bit
   • 0 is positive
   • 1 is negative
2. E: the exponent bits
   • The exponent field (the biased exponent, also known as the order code) is an unsigned integer in the range `[0, 2^e - 1]`, where biased exponent = actual exponent + bias, and bias = `2^(e-1) - 1`
     1. Biased exponent `0` represents a subnormal number or the special value ±0
        1. If the fraction is not 0, the value is a subnormal number
        2. If the fraction is 0, the value is the special value ±0 (depending on the sign bit)
     2. A biased exponent in `(0, 2^(e-1) - 1)` represents a negative actual exponent
     3. Biased exponent `2^(e-1) - 1` represents an actual exponent of 0
     4. A biased exponent in `(2^(e-1) - 1, 2^e - 1)` represents a positive actual exponent
     5. Biased exponent `2^e - 1` represents the special value ±∞ or the special value NaN
        1. If the fraction is 0, the value is the special value ±∞ (depending on the sign bit)
        2. If the fraction is not 0, the value is the special value NaN
           • qNaN (quiet NaN): the most significant bit of the fraction is 1
             • Changing that bit to 0 may produce the special value ±∞ (depending on the sign bit) when no other fraction bit is set
           • sNaN (signaling NaN): the most significant bit of the fraction is 0 and at least one other fraction bit is 1
             • Changing the most significant bit to 1 turns it into a qNaN
           • Generally, a qNaN lets the operation continue normally, while an sNaN is used to raise an exception (whether an exception is actually raised depends on the state of the floating-point unit, FPU). See "What is the difference between quiet NaN and signaling NaN?" in the reading list
   • Advantage of the biased exponent: the actual exponent can be stored as an unsigned integer of e bits, which makes it easier to compare the exponents of two floating-point numbers
3. M: the significand (also called the mantissa; it equals the implicit bit plus the fraction)
   • Normal and subnormal floating-point numbers
     • A biased exponent in `(0, 2^e - 1)`, that is `[1, 2^e - 2]`, represents a normal floating-point number. The implicit integer bit of a normal number is 1
     • A biased exponent of 0 with a non-zero fraction represents a subnormal floating-point number. The implicit integer bit of a subnormal number is 0
     • The biased exponent of a subnormal number is 1 less than that of the smallest normal number, but both share the same actual exponent
       • For example, the smallest normal single-precision number (32 bits = 1 S + 8 E + 23 F) has biased exponent 1 (−126 + 127) and actual exponent −126; a subnormal single-precision number has biased exponent 0 (−126 + 127 − 1), and its actual exponent is still −126, not −127
   • Advantage of the implicit integer bit: it adds 1 bit to the effective length of the significand
   • Advantage of subnormal numbers (gradual underflow): they avoid abrupt underflow, so the gap between adjacent floating-point numbers near zero stays uniform and equal to `2^(-f + (1 - (2^(e-1) - 1)))`
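The classification rules above can be checked against real bit patterns. The following is a minimal sketch for JavaScript numbers (binary64, so e = 11 and f = 52); `decodeDouble` is a hypothetical helper name, while `DataView` and `BigInt` are standard.

```javascript
// Split a JavaScript Number (binary64) into its three stored fields
// and classify it with the rules from the list above.
// `decodeDouble` is a hypothetical helper used only for this illustration.
function decodeDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);                                  // big-endian by default
  const bits = view.getBigUint64(0);
  const sign = Number(bits >> 63n);                       // 1 bit
  const biasedExponent = Number((bits >> 52n) & 0x7ffn);  // 11 bits
  const fraction = bits & 0xfffffffffffffn;               // 52 bits
  let kind;
  if (biasedExponent === 0) {
    kind = fraction === 0n ? "zero" : "subnormal";
  } else if (biasedExponent === 0x7ff) {
    kind = fraction === 0n ? "infinity" : "nan";
  } else {
    kind = "normal";
  }
  return { sign, biasedExponent, fraction, kind };
}

console.log(decodeDouble(1));                 // biased exponent 1023, i.e. actual exponent 0
console.log(decodeDouble(-0));                // zero with sign bit 1
console.log(decodeDouble(Number.MIN_VALUE));  // subnormal with fraction 1n
console.log(decodeDouble(-Infinity));
console.log(decodeDouble(NaN));
```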

## Characteristics of floating point numbers

• A value is stored exactly only if it can be written in binary scientific notation `(-1)^s * m * 2^e` with m fitting in the available precision. If m needs more bits than the precision allows, it is rounded automatically

This is why decimal values such as 0.1 and 1.1 cannot be stored exactly

``````
// JavaScript numbers are double-precision floating-point numbers,
// with a precision of 15.95, i.e. about 16 significant decimal digits.
// Significant digits are counted from the first non-zero digit of the number.

(0.1).toPrecision(16);  // "0.1000000000000000"  0.1 with 16 significant digits
(0.1).toPrecision(17);  // "0.10000000000000001" 0.1 with 17 significant digits
(0.1).toPrecision(18);  // "0.100000000000000006"
(0.1).toPrecision(22);  // "0.1000000000000000055511"

(1.1).toPrecision(16);  // "1.100000000000000"   1.1 with 16 significant digits
(1.1).toPrecision(17);  // "1.1000000000000001"  1.1 with 17 significant digits
(1.1).toPrecision(18);  // "1.10000000000000009"
(1.1).toPrecision(22);  // "1.100000000000000088818"

1.000000000000001;   // 1.000000000000001, 16 significant digits
1.0000000000000001;  // 1, the 1 in the 17th digit is dropped
``````
• Maximum normal floating-point value: `±(1 + (2^-1 + 2^-2 + ... + 2^-f)) * 2^(2^(e-1) - 1)` <=> `±(2 - 2^-f) * 2^(2^(e-1) - 1)`.

For double-precision floating-point numbers, the maximum normal value is `±(2 - 2^-52) * 2^1023 === ±1.7976931348623157e+308`. `1.7976931348623157e+308` is also the value of the static property `MAX_VALUE` of the JavaScript `Number` object (note that it is not a safe integer). A result that rounds to something larger than this value becomes ∞ (`Number.MAX_VALUE * 1.000000000000001 === Infinity; Number.MAX_VALUE + 1e+292 === Infinity`)

• Minimum subnormal floating-point value: `±2^(-f + (1 - (2^(e-1) - 1)))`.

For double-precision floating-point numbers, the minimum subnormal value is `±2^(-52-1022) === ±5e-324`. `5e-324` is also the value of the static property `MIN_VALUE` of the JavaScript `Number` object. A non-zero magnitude smaller than this cannot be represented and is rounded to either 0 or `MIN_VALUE`

• Safe integer range of floating-point numbers (inside the safe integer range, floating-point numbers and integers correspond one-to-one): `[-(2^m - 1), 2^m - 1]`, where m is the number of significand bits. For double-precision floating-point numbers, the safe integers are `±(2^53 - 1) === ±9007199254740991`. Outside this range, one floating-point number corresponds to several integers, as shown in the figure below

These are also the values of the static properties `MAX_SAFE_INTEGER` and `MIN_SAFE_INTEGER` of the JavaScript `Number` object

`2^53 + 1` is written in binary as `1000...0001` (54 bits in total; the two ones are 2^53 and 2^0), that is `1.000...0001 * 2^53`. Since the fraction of a double-precision floating-point number holds at most 52 binary digits, the last 1 has to be dropped, so `2^53 + 1` and `2^53` are stored identically, i.e. `2^53 === 2^53 + 1` (in JavaScript, `2**53 === 2**53 + 1` is `true`). For this reason 2^53 is not a safe integer

• Integers outside the safe integer range that can still be represented exactly. Taking double-precision floating-point numbers as an example: since the fraction stores at most 52 binary digits, there are two kinds of integers beyond the safe integer range that are represented exactly

• The first kind keeps the significand at `1.0` and only increases the exponent within the exponent range: `2^54`, `2^55`, `2^56`, …, `2^1023`. These are all exact

• The second kind changes both the exponent and the significand. For integers in [2^53, 2^54), writing the value as `1.xxx * 2^53` needs 53 fraction digits, and the 53rd digit is always dropped. As long as the 53rd fraction digit is 0, the number is stored exactly. The even numbers in [2^53, 2^54) have a 53rd digit of 0, so they are represented exactly

In the same way, for integers in [2^54, 2^55) the 53rd and 54th fraction digits are always dropped. As long as both digits are 0, the number is stored exactly. In [2^54, 2^55) the spacing therefore becomes 4: the multiples of 4 have both digits equal to 0 and are represented exactly, and so on for higher ranges
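The extreme values and the integer spacing described in this section can be checked directly in JavaScript. This is a sketch under the assumption that the engine supports `BigInt` and `DataView.prototype.setBigUint64`; `fromBits` and `isExactInDouble` are hypothetical helper names.

```javascript
// Build a double from its raw 64-bit pattern (sign | exponent | fraction).
// `fromBits` is a hypothetical helper used only for this illustration.
function fromBits(bits) {
  const view = new DataView(new ArrayBuffer(8));
  view.setBigUint64(0, bits);
  return view.getFloat64(0);
}

// Largest normal number: exponent field 0x7fe, fraction all ones.
const maxNormal = fromBits(0x7fefffffffffffffn);
// Smallest subnormal number: exponent field 0, fraction 1.
const minSubnormal = fromBits(0x0000000000000001n);
console.log(maxNormal === Number.MAX_VALUE);
console.log(minSubnormal === Number.MIN_VALUE);

// True when the integer n (a BigInt) survives a round trip through a double.
// `isExactInDouble` is also a hypothetical helper.
function isExactInDouble(n) {
  const x = Number(n);              // rounds to the nearest double
  return Number.isFinite(x) && BigInt(x) === n;
}

const p53 = 2n ** 53n;
const p54 = 2n ** 54n;
console.log(isExactInDouble(p53 - 1n)); // the largest safe integer
console.log(isExactInDouble(p53 + 1n)); // odd number in [2^53, 2^54)
console.log(isExactInDouble(p53 + 2n)); // even number in [2^53, 2^54)
console.log(isExactInDouble(p54 + 2n)); // not a multiple of 4 in [2^54, 2^55)
console.log(isExactInDouble(p54 + 4n)); // multiple of 4 in [2^54, 2^55)
console.log(Number.isSafeInteger(Number.MAX_SAFE_INTEGER + 1));
```

Using `BigInt` for the integers keeps the comparison itself exact, so any mismatch comes only from the conversion to a double.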

## Comparison of floating point numbers

• Floating-point numbers are compared field by field, in the order sign bit, exponent field, fraction field. Every positive number is greater than every negative number. For two positive numbers, the one with the larger biased exponent is larger; if the exponents are equal, the one with the larger fraction is larger. For two negative numbers the order is reversed, because a larger magnitude means a smaller value. NaN is unordered with respect to every value, and +0 and −0 compare equal
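Because the exponent is stored with a bias and sits above the fraction, the raw bit pattern of a non-negative double already sorts like an unsigned integer. The sketch below turns that observation into a sort key for binary64 values; it is one common trick, not something prescribed by the standard, and the helper names `toBits` and `orderedKey` are made up. NaN is excluded, and −0 sorts just below +0 even though the two compare equal.

```javascript
// Read the raw 64-bit pattern of a double as an unsigned BigInt.
// `toBits` and `orderedKey` are hypothetical helpers for this illustration.
function toBits(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  return view.getBigUint64(0);
}

// Map a non-NaN double to an unsigned key whose integer order matches numeric order.
function orderedKey(x) {
  const bits = toBits(x);
  const signMask = 1n << 63n;
  return (bits & signMask)
    ? BigInt.asUintN(64, ~bits) // negative: flip every bit, so a larger magnitude sorts lower
    : bits | signMask;          // non-negative: set the top bit, so it sorts above all negatives
}

const values = [3.5, -2, 1e-320, -Infinity, 0.1, 1e300, -1e-5, Infinity];
const byKey = [...values].sort((a, b) => {
  const ka = orderedKey(a);
  const kb = orderedKey(b);
  return ka < kb ? -1 : ka > kb ? 1 : 0;
});
const byValue = [...values].sort((a, b) => (a < b ? -1 : a > b ? 1 : 0));
console.log(byKey);
console.log(byKey.every((v, i) => v === byValue[i]));
```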

## The five floating-point rounding modes (binary floating-point numbers require four of them)

• Round to the nearest value

• Round to nearest, ties to even (roundTiesToEven): rounds the result to the nearest representable value. If two representable values are equally near, it selects the one whose least significant digit is even (for binary, the one whose least significant significand bit is 0). If both candidates have the same parity (for example, when decimal 9.5 is rounded to one digit, the least significant digits of 9 and 1 * 10^1 are both odd), it selects the one with the larger magnitude (for positive numbers the larger value, for negative numbers the smaller value). This is the default rounding mode for binary floating-point numbers and the recommended default for decimal floating-point numbers

``````
// Round to nearest, ties to even: examples

// 9.5 written in binary scientific notation, rounded to an integer
9.5 => 1001.1 => 1.0011 * 2^3
// Its two nearest candidates are 10 and 9
10  => 1010   => 1.010 * 2^3
9   => 1001   => 1.001 * 2^3
// Distances from 10 and from 9 to 9.5
1.010 * 2^3 - 1.0011 * 2^3 = 0.0001 * 2^3  // binary 0.1, i.e. 0.5
1.0011 * 2^3 - 1.001 * 2^3 = 0.0001 * 2^3  // binary 0.1, i.e. 0.5
// The distances are equal, so compare the least significant bits
// The least significant bit of 1.010 * 2^3 is 0 (even)
// The least significant bit of 1.001 * 2^3 is 1 (odd)
// Therefore 9.5 rounds to 10 rather than 9

// 0.95 as a stored binary value (expansion truncated), rounded to one decimal place
0.95 => 0.11 1100 1100 1100 1100 1100 1
// Its two nearest candidates are 1 and 0.9
1    => 1.00 0000 0000 0000 0000 0000 0
0.9  => 0.11 1001 1001 1001 1001 1001 1
// Distances from 1 and from 0.9 to the stored 0.95
1.00 0000 0000 0000 0000 0000 0 - 0.11 1100 1100 1100 1100 1100 1 = 0.00 0011 0011 0011 0011 0011 1
0.11 1100 1100 1100 1100 1100 1 - 0.11 1001 1001 1001 1001 1001 1 = 0.00 0011 0011 0011 0011 0011 0

0.00 0011 0011 0011 0011 0011 1 > 0.00 0011 0011 0011 0011 0011 0
// The stored 0.95 is slightly closer to 0.9, so rounding it to one decimal place gives 0.9 rather than 1
``````
• Round to nearest, ties away from zero (roundTiesToAway): rounds the result to the nearest representable value. If two representable values are equally near, it selects the one with the larger magnitude (for positive numbers the larger value, for negative numbers the smaller value). Binary floating-point implementations are not required to provide this mode, while decimal floating-point implementations should offer it as a user-selectable option

• Directed rounding

• Round toward positive (roundTowardPositive), also known as rounding up or ceil: rounds the result toward positive infinity
• Round toward negative (roundTowardNegative), also known as rounding down or floor: rounds the result toward negative infinity
• Round toward zero (roundTowardZero), also known as truncation: rounds the result toward 0
• Rounding performed by the JavaScript static method `Math.round(x)`

• Returns the integer closest to x. If two integers are equally close, it returns the one closer to +∞; if x is already an integer, it returns x itself
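JavaScript does not let a program select an IEEE 754 rounding mode, but the default mode can still be observed, and `Math.ceil`, `Math.floor` and `Math.trunc` behave like the three directed modes at integer granularity. The sketch below is an illustration under those assumptions, not an implementation of the rounding modes themselves.

```javascript
// Round to nearest, ties to even, observed when converting integers to doubles.
// Between 2^53 and 2^54 only even integers are representable, so every odd
// integer in that range is an exact tie between its two neighbours.
const base = 2n ** 53n;
const tieDown = Number(base + 1n); // candidates: 2^53 and 2^53 + 2
const tieUp = Number(base + 3n);   // candidates: 2^53 + 2 and 2^53 + 4
console.log(BigInt(tieDown) - base); // 0n, the neighbour with an even significand
console.log(BigInt(tieUp) - base);   // 4n, again the neighbour with an even significand

// The stored value of 0.95 is slightly below 0.95, so it is not a tie at all.
console.log((0.95).toFixed(1)); // "0.9"

// Integer-level analogues of the directed modes.
console.log(Math.ceil(-1.5), Math.floor(-1.5), Math.trunc(-1.5)); // -1 -2 -1

// Math.round sends ties toward +∞, which differs from both tie rules above.
console.log(Math.round(2.5), Math.round(-2.5)); // 3 -2
```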

## Exception handling for binary floating-point numbers

• Invalid operation. The operation is not defined mathematically, such as 0 / 0 or sqrt(−1.0). A qNaN is returned by default
• Division by zero. The divisor is zero and the dividend is a finite non-zero number. ±∞ is returned by default
• Overflow. The result of the operation exceeds the range that the maximum exponent E max can express. ±∞ is returned by default (under the default rounding mode)
• Underflow. The result of the operation is smaller in magnitude than the normal range of the floating-point format. A subnormal number or 0 is returned by default (following the rounding rule)
• Inexact. The result of the operation cannot be represented exactly. The rounded value of the exact result is returned by default (following the rounding rule)
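JavaScript does not expose the IEEE 754 status flags, so only the default results of the five exceptions can be observed. A minimal sketch of those defaults (the variable name `defaults` is arbitrary):

```javascript
// Default results of the five exceptions as seen from JavaScript.
// The status flags themselves are not accessible from the language.
const minNormal = 2.2250738585072014e-308; // 2^-1022, the smallest normal double

const defaults = {
  invalid: [0 / 0, Math.sqrt(-1)],        // qNaN
  divisionByZero: [1 / 0, -1 / 0],        // ±∞
  overflow: Number.MAX_VALUE * 2,         // ∞
  underflowToSubnormal: minNormal / 3,    // a subnormal number
  underflowToZero: Number.MIN_VALUE / 2,  // 0 (tie, rounded to even)
  inexact: 0.1 + 0.2,                     // rounded sum, not exactly 0.3
};

console.log(defaults);
console.log(defaults.underflowToSubnormal > 0 && defaults.underflowToSubnormal < minNormal);
console.log(defaults.inexact === 0.3);
```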

## Online converters for floating-point numbers (binary and decimal)

• IEEE-754 Floating Point Converter – Single precision 32-bit
• IEEE754 Single precision 32-bit
• IEEE754 Double precision 64-bit