Link to the original text: https://ssshooter.com/2020-09…

The image above is from Wikipedia.

IEEE-754 is a floating-point number standard. Its formats include 32-bit, 64-bit, and 128-bit (the two pictures above show the 32-bit and 64-bit formats, which share the same structure). JavaScript uses the 64-bit format, commonly known as “double precision”. This article uses the 64-bit format as the example for explaining the standard.

As the figure shows, IEEE-754 divides the 64 bits into three parts:

- **sign**: 1 bit; 0 means positive and 1 means negative
- **exponent**: 11 bits
- **fraction**: the fractional part, 52 bits

For convenience, we will use the following number to introduce the IEEE-754 standard:

0100000001101101001000000000000000000000000000000000000000000000

Exactly 64 bits, no more and no less. Count them if you don’t believe it.

## sign

Bit 63 (the first bit when reading from left to right) is the **sign**. In the example its value is 0, which means this is a positive number.

## fraction

Bits 0 to 51 (52 bits in total) are called the **fraction** because these digits are placed after `1.` (there are exceptions, which I will get to later).

In the example, the 52 bits belonging to the fraction are:

`1101001000000000000000000000000000000000000000000000`

Call these 52 bits `f` (f stands for fraction). Prefixing them with `1.` gives the so-called `1.f`, which looks like this:

`1.1101001000000000000000000000000000000000000000000000`

As for why a 1 is stuck in front, I haven’t looked into it; in short, that is how the standard defines it. Either way, the 52 bits really are a “fractional” part.

But now that we have this long `1.f` string, how do we use it? We have to combine it with the exponent part.

## exponent

To explain the **exponent** more clearly, here is a table of the conversion from binary to decimal:

```
%00000000000 0 → −1023 (lowest number)
%01111111111 1023 → 0
%11111111111 2047 → 1024 (highest number)
%10000000000 1024 → 1
%01111111110 1022 → −1
```

Please note that 01111111111 stands for 0; larger values are positive exponents and smaller values are negative ones.

Taking bits 52 to 62 (11 bits in total) of the example above gives `10000000110`, which is 1030 in decimal. Because 1023 stands for 0, subtract 1023 to get the real exponent, which is 7.

To use this exponent, we multiply the `1.f` obtained above by 2 to the 7th power (trailing zeros are omitted to save space):

1.f × 2^{e−1023} = 1.1101001 × 2^{7} = 11101001

(**Note: this is binary!** The decimal analogy is 1.3828171 × 10^{7} = 13828171.)

This is why floating-point numbers are called **floating point**: the position of the point floats left and right with the value of the exponent, so that a number can be represented more precisely.

**Fixed point** is the opposite. For example, if the largest number is 1111111111.1111111111 and the point is always fixed in the middle, there is no way to express a number whose absolute value is greater than 1111111111.1111111111, or a nonzero number whose absolute value is smaller than 0.0000000001.

After combining the fraction and the exponent to get 11101001, convert it to decimal and apply the sign bit, which needs little explanation (0 means positive).

So the example number

0100000001101101001000000000000000000000000000000000000000000000

is actually how IEEE-754 stores `233`.
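The decoding above can be checked with a short sketch. The helper names `toBits` and `fields` are made up for this illustration; only `DataView`, `padStart` and `parseInt` are standard JavaScript.

```javascript
// Return the 64-bit IEEE-754 pattern of a number as a string of 0s and 1s.
function toBits(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x); // big-endian by default, so byte 0 holds the sign bit
  let bits = "";
  for (let i = 0; i < 8; i++) {
    bits += view.getUint8(i).toString(2).padStart(8, "0");
  }
  return bits;
}

// Split the pattern into the three parts described in this article.
function fields(bits) {
  return {
    sign: bits.slice(0, 1),      // bit 63
    exponent: bits.slice(1, 12), // bits 62..52
    fraction: bits.slice(12),    // bits 51..0
  };
}

const parts = fields(toBits(233));
console.log(parts.sign);                         // "0"
console.log(parts.exponent);                     // "10000000110"
console.log(parseInt(parts.exponent, 2) - 1023); // 7
console.log(parseInt("11101001", 2));            // 233
```

Reading the bytes through a `DataView` avoids any dependence on the byte order of the machine, because `setFloat64` writes big-endian unless told otherwise.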

## exceptional case

When the exponent is −1023 (the minimum value, represented by eleven zeros in binary), we have a special case called denormalized.

In that case the formula for the value changes to:

0.f × 2^{−1022}

This is the special case in which f is not preceded by 1; it can be used to represent very small numbers.

## summary

The expert’s summary is spot on:

Value | Condition |
---|---|
(−1)^{s} × %1.f × 2^{e−1023} | normalized, 0 < e < 2047 |
(−1)^{s} × %0.f × 2^{e−1022} | denormalized, e = 0, f > 0 |
(−1)^{s} × 0 | e = 0, f = 0 |
NaN | e = 2047, f > 0 |
(−1)^{s} × ∞ (infinity) | e = 2047, f = 0 |

The first row is the normal case, the second row is the denormalized `0.f` case described above, and the third row is simply zero, with the exponent and fraction bits all 0.

In the fourth and fifth rows the 11 bits of e are all 1. If f is greater than 0 the value is NaN, and if f equals 0 it is infinity.
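The five cases of the summary translate fairly directly into code. The following is a minimal sketch, not a reference implementation; `decode` is a name invented here, and it assumes its input is a string of exactly 64 characters, each `0` or `1`.

```javascript
// Decode a 64-character bit string by following the cases of the summary.
function decode(bits) {
  const s = bits[0] === "1" ? -1 : 1;
  const e = parseInt(bits.slice(1, 12), 2); // 0..2047
  const fBits = bits.slice(12);             // 52 fraction bits

  // Value of 0.f: the i-th bit after the point contributes 2^-i.
  let f = 0;
  for (let i = 0; i < 52; i++) {
    if (fBits[i] === "1") f += Math.pow(2, -(i + 1));
  }

  if (e === 2047) return f > 0 ? NaN : s * Infinity; // NaN or infinity
  if (e === 0) return s * f * Math.pow(2, -1022);    // denormalized, or zero when f = 0
  return s * (1 + f) * Math.pow(2, e - 1023);        // normalized
}

console.log(decode("0" + "10000000110" + "1101001".padEnd(52, "0"))); // 233
console.log(decode("0".repeat(63) + "1"));                  // 5e-324
console.log(decode("1" + "1".repeat(11) + "0".repeat(52))); // -Infinity
```

The second call is the smallest denormalized number: only the last fraction bit is set, so the value is 2^{−52} × 2^{−1022}, which JavaScript exposes as `Number.MIN_VALUE`.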

## Converting to IEEE-754 by hand

Using the formulas summarized above, it should not be hard to convert IEEE-754 back to decimal. But how do you work out the IEEE-754 form of a decimal number by hand?

Let’s make up a simple number, −432.1, and paste the 64-bit layout diagram again so that we don’t have to scroll back up.

### step1

When we see the minus sign, there is no doubt that the sign bit is 1. We have the first piece of the puzzle: **s = 1**.

### step2

The second step is to convert 432.1 to binary.

**Integer part conversion**: divide by 2 repeatedly until the result is 0:

calculation | result | remainder |
---|---|---|
432/2 | 216 | 0 |
216/2 | 108 | 0 |
108/2 | 54 | 0 |
54/2 | 27 | 0 |
27/2 | 13 | 1 |
13/2 | 6 | 1 |
6/2 | 3 | 0 |
3/2 | 1 | 1 |
1/2 | 0 | 1 |

Write the remainders from bottom to top: 110110000
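The division table can be sketched in a few lines. `intToBinary` is a name made up for this example; the built-in `toString(2)` gives the same answer and is shown only as a cross-check.

```javascript
// Convert a non-negative integer to binary by repeated division by 2.
function intToBinary(n) {
  if (n === 0) return "0";
  const remainders = [];
  while (n > 0) {
    remainders.push(n % 2); // the "remainder" column
    n = Math.floor(n / 2);  // the "result" column
  }
  return remainders.reverse().join(""); // read from bottom to top
}

console.log(intToBinary(432));  // "110110000"
console.log((432).toString(2)); // "110110000"
```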

**Fractional part conversion**: multiply by 2 repeatedly until the result is 0:

calculation | result | integer digit |
---|---|---|
0.1*2 | 0.2 | 0 |
0.2*2 | 0.4 | 0 |
0.4*2 | 0.8 | 0 |
0.8*2 | 1.6 | 1 |
0.6*2 | 1.2 | 1 |
0.2*2 | 0.4 | 0 |
0.4*2 | 0.8 | 0 |
0.8*2 | 1.6 | 1 |
0.6*2 | 1.2 | 1 |
0.2*2 | 0.4 | 0 |
0.4*2 | 0.8 | 0 |
0.8*2 | 1.6 | 1 |
0.6*2 | 1.2 | 1 |

It never ends. Sharp-eyed readers will have noticed that this has entered an infinite loop.

Just as one third in decimal is 0.33333333…, one tenth in binary is 0.00011001100110011…; both are infinitely repeating fractions.

Then combine the integer part and the fractional part: 110110000.0[0011], where the digits in brackets repeat forever.
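The multiplication table can be sketched as well. To keep the arithmetic exact, this version takes the fraction as a numerator and denominator and uses `BigInt`; that choice, like the name `fractionToBinary`, is an assumption of this example and not something the article prescribes.

```javascript
// First `count` binary digits of num/den (with 0 <= num/den < 1),
// found by repeated doubling. Stops early if the fraction terminates.
function fractionToBinary(num, den, count) {
  let digits = "";
  for (let i = 0; i < count && num !== 0n; i++) {
    num *= 2n;
    if (num >= den) {
      digits += "1"; // the doubled value reached 1: emit 1 and keep the rest
      num -= den;
    } else {
      digits += "0";
    }
  }
  return digits;
}

console.log(fractionToBinary(1n, 10n, 17)); // "00011001100110011"
console.log(fractionToBinary(5n, 8n, 17));  // "101"
```

One tenth never reaches zero, so the loop only stops at the digit limit, while 5/8 terminates after three digits.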

### step3

Rewrite the number in the form 1.f × 2^{e−1023}:

1.1011000000011001100110011001100110011001100110011010 × 2^{8}

Fill the 52 bits of f with the infinitely repeating fraction, rounding at the last bit:

**f = 1011000000011001100110011001100110011001100110011010**

8 = e − 1023, so e is 1031, which in binary is

**e = 10000000111**
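As a rough sketch of this step, the exponent follows from how far the point has to move. The variable names below are made up, and the fraction is simply cut off after 52 bits instead of being rounded.

```javascript
// Assumed inputs: the integer and fractional bit strings from step 2.
const intBits = "110110000";
const fracBits = "0" + "0011".repeat(13); // more repeating digits than needed

// Moving the point to just after the leading 1 shifts it intBits.length - 1 places.
const shift = intBits.length - 1; // 8
const e = shift + 1023;           // 1031
const eBits = e.toString(2).padStart(11, "0");

// Everything after the leading 1, cut to 52 bits (truncated, not rounded).
const fTruncated = (intBits.slice(1) + fracBits).slice(0, 52);

console.log(shift);                   // 8
console.log(eBits);                   // "10000000111"
console.log(fTruncated.slice(0, 17)); // "10110000000110011"
console.log(fTruncated.slice(-3));    // "001"
```

The truncated string ends in `001`, while the f shown above ends in `010`: the digits that were cut off amount to more than half of the last place, so the last bit is rounded up.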

### step4

All the puzzle pieces are ready. Let’s put them together: s + e + f!

1100000001111011000000011001100110011001100110011001100110011010

This is the true form of −432.1 as an IEEE-754 double-precision floating-point number.
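The hand-built pattern can be compared with what the JavaScript engine actually stores. This is a small sketch: `bitsOf` is a made-up helper, and `expected` is assembled piece by piece so that the three parts stay visible.

```javascript
// 64-bit pattern of a number, read back as one unsigned 64-bit integer.
function bitsOf(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  return view.getBigUint64(0).toString(2).padStart(64, "0");
}

const expected =
  "1" +                                      // s
  "10000000111" +                            // e = 1031
  "10110000" + "0" + "0011".repeat(10) + "010"; // f, last bit rounded up

console.log(bitsOf(-432.1) === expected); // true
```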

## Why it is inaccurate

Programmers are tormented by loss of precision. The problem is not unique to JavaScript, although poor JavaScript does have more odd quirks of its own. We often pin the 0.1 + 0.2 problem on JavaScript, but Java and other languages that use the IEEE-754 standard have it too (Java at least has BigDecimal, while JavaScript can only cry).

So why is it not accurate?

### Situation one

Let’s start with the most common situation:

```
0.1 + 0.2 // 0.30000000000000004
1 - 0.9 // 0.09999999999999998
0.0532 * 100 // 5.319999999999999
```

I used to think that if you multiply by 100 to get integers and then add or subtract, no precision is lost. In fact, the product of the multiplication is itself already off.

Back to the cause: it is the same as in the conversion of 0.1 above, the binary expansion never terminates.

But why? When printed, it is clearly a normal 0.1! Why does 1 − 0.9, which should be 0.1, not come out as 0.1?

Let me make a rough guess:

```
console.log((0.1).toFixed(30))
//Output '0.100000000000000005551115123126'
console.log((1.1 - 1).toFixed(30))
//Output '0.100000000000000088817841970013'
```

With `toFixed` we can see more precisely what number `0.1` really is, and we can also see clearly that `0.1` and `1.1 - 1` are not the same number at all. `0.1` may be exact in decimal, but in binary its expansion never terminates, so the results differ slightly once you calculate with it.

Under what circumstances will a number be regarded as “0.1”? The answer is when it is:

- Less than 0.1000000000000000124 (etc.)
- Greater than 0.0999999999999999987 (etc.)

As for exactly how IEEE-754 does this “rounding”, the answer may be found here. Curious readers can dig into it.

In short, because the binary expansion never terminates and calculation introduces error, once the error passes a certain threshold one number becomes another.
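A few comparisons make the threshold visible. The literals below were chosen for this sketch to sit just inside and just outside the interval of decimal values that round to the same double as `0.1`.

```javascript
// Decimal literals close enough to 0.1 round to the very same double.
console.log(0.1000000000000000124 === 0.1); // true
console.log(0.0999999999999999987 === 0.1); // true

// One step further out, they land on a neighbouring double.
console.log(0.1000000000000000125 === 0.1); // false
console.log(0.0999999999999999986 === 0.1); // false

// Results of arithmetic are rounded in the same way.
console.log(1.1 - 1 === 0.1);   // false
console.log(0.1 + 0.2 === 0.3); // false
```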

### Situation 2

The second kind of inaccuracy arises because **the number is too big**.

We know that a double-precision floating-point number has 52 fraction bits. Counting the leading 1 as well, the largest integer up to which **every integer can be represented exactly** is `Math.pow(2, 53)`.

```
console.log(Math.pow(2, 53))
//Output 9007199254740992
console.log(Math.pow(2, 53) + 1)
//Output 9007199254740992
console.log(Math.pow(2, 53) + 2)
//Output 9007199254740994
```

Why is + 2 accurate again? Because in this range multiples of 2 can still be represented exactly. Going further up, once the numbers reach `Math.pow(2, 54)`, only multiples of 4 can be represented exactly; from the 55th power, only multiples of 8; and so on.

```
console.log(Math.pow(2, 54))
//Output 18014398509481984
console.log(Math.pow(2, 54) + 2)
//Output 18014398509481984
console.log(Math.pow(2, 54) + 4)
//Output 18014398509481988
```

So although floating-point numbers can represent very large and very small numbers, they cannot always do so exactly. Still, that is better than fixed point, which cannot represent them at all.
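The widening gaps can be measured directly. The sketch below uses a made-up helper, `nextUp`, which assumes a positive finite input and adds 1 to its 64-bit pattern to reach the next representable double.

```javascript
// Next representable double above a positive finite x.
function nextUp(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  view.setBigUint64(0, view.getBigUint64(0) + 1n); // step to the next bit pattern
  return view.getFloat64(0);
}

for (const p of [52, 53, 54, 55]) {
  const x = Math.pow(2, p);
  console.log(p, nextUp(x) - x);
}
// 52 1
// 53 2
// 54 4
// 55 8
```

Each time the exponent grows by one, the distance between neighbouring doubles doubles too, which is why only multiples of 2, then 4, then 8 survive.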

## Practical links

Decimal to IEEE-754

IEEE-754 to decimal

Do it yourself conversion from decimal to IEEE-754