Link to the original text: https://ssshooter.com/2020-09…
The image above is from Wikipedia.
As can be seen from the figure, IEEE-754 standard divides 64 bits into three parts
- sign, 1 bit, 0 is positive and 1 is negative
- exponent, index, 11 bit
- fraction, decimal part, 52 bit
For example convenience, we use the following numbers to introduce the IEEE-754 standard
No more than 64 bits. Count if you don’t believe it
Number 63 (also the first number seen from left to right), in the example,SignThe value of 0 means that this is a positive number.
The reason why 0 to 51 (52 bits in total) is“Fraction”Because this number will be placed in the
1.(there will be exceptions, I’ll say later).
In the example, the 52 bits belonging to the fraction are:
These 52 digits are referred to as
f(f stands for fraction), plus the
If you ask why the fortress 1 is in front, I haven’t checked it. In short, it is so stipulated. It is really a “decimal”
But get this long string
1.fHow to use it? We have to combine the exponent part.
For a clearer explanationExponentThe conversion from binary system to decimal system is a “table” in this paper
%00000000000 0 → −1023 (lowest number) %01111111111 1023 → 0 %11111111111 2047 → 1024 (highest number) %10000000000 1024 → 1 %01111111110 1022 → −1
Please note that 011111111111 stands for 0, positive up and negative down
Taking out the 52-62 bits (11 bits in total) of the above example, we get the following results:
10000000110, and then to the decimal number 1030, because 1023 is 0, so subtract 1023 to get the real result, which is 7.
To use this exponent, we multiply the 1. F obtained above to the 7th power of 2 (omit the following 0 to save space)
1.f × 2e−1023 = 1.1101001 × 27 = 11101001
（Attention, this is two! In! System!Analogy to decimal system is similar: 1.3828171 × 107 = 13828171）
This is what floating point numbers are calledFloating pointThe position of the decimal point can drift left and right with the value of the index, so that a number can be represented more precisely;
On the contraryFixed pointFor example, if the maximum number is 1111111111.1111111, and the decimal point is always fixed in the middle, there is no way to express the number whose absolute value is less than or greater than 1111111111.1111111111.
After the combination of “fraction” and “exponent” to get 11101001, it can be converted to decimal system, plus the sign (sign bit) with little explanation (0 means positive number)
So for example
It’s actually stored in IEEE-754
When the exponent is – 1023 (that is, the minimum value, represented by seven zeros in binary), it is a special case called denormalized.
The calculation formula of the current value is changed as follows:
0.f × 2−1022
This is a special case where f does not precede 1, which can be used to represent very small numbers
The big man’s summary is too incisive:
|(−1)s × %1.f × 2e−1023||normalized, 0 < e < 2047|
|(−1)s × %0.f × 2e−1022||denormalized, e = 0, f > 0|
|(−1)s × 0||e = 0, f = 0|
|NaN||e = 2047, f > 0|
|(−1)s × ∞ (infinity)||e = 2047, f = 0|
The first line is normal, the second line is the above
0.fDenormalized, the third line is actually all zeros.
The fourth and fifth line is that the 11 bits of E are all 1. If f is greater than 0, it is Nan, and if f is equal to 0, it is infinite.
Hands on conversion IEEE-754
Using the formula summarized above, it should not be difficult to calculate IEEE-754 back to decimal, but how to calculate IEEE-754 through decimal number by yourself?
Let’s put together a simple figure: – 5432.1, and then paste the 64 bit composition diagram, so that we don’t have to go over it
When we see the minus sign, there’s no doubt that sign is 1. We’ve got the first puzzle,s = 1。
The second step is to convert 432.1 to binary.
Positive part conversionUntil the result is 0:
Write the result from bottom to top: 110110000
Negative part conversionUntil the result is 0:
It’s endless. Smart people should see that this has entered an infinite cycle.
It’s like one-third of the decimal system equals 0.33333333 The binary “ten” is equal to 0.00011001100110011 , are infinite recurring decimals.
Then combine the integer and the decimal part: 110110000.0 
F × 2e−1023Format of
1.1011000000011001100110011001100110011001100110011010 × 28
Fill 52 places of F with infinite recurring decimals,
f = 1011000000011001100110011001100110011001100110011010
8 = e − 1023, then E is 1031, converted to binary,
e = 10000000111
The jigsaw puzzle is all together. Let’s put it together! s + e + f！
This is the real body of IEEE-754 double precision floating-point number-5432.1.
So why is it not accurate?
Let’s start with the most common situation
0.1 + 0.2 // 0.30000000000000004 1 - 0.9 // 0.09999999999999998 0.0532 * 100 // 5.319999999999999
I used to think that if you multiply 100 into an integer and then add and subtract, you won’t lose precision. But the fact is, the number calculated by multiplication itself has already gone out of shape.
Let’s go back to the cause. In fact, it’s the same as the above calculation of 0.1, because it can’t be divided completely.
But why?! Clearly printed out, he is the normal 0.1 ah! Why 1 – 0.9 out of 0.1 is not 0.1!
I would like to make a superficial guess:
console.log((0.1).toFixed(30)) //Output '0.10000000000000 551115123126' console.log((1.1 - 1).toFixed(30)) //10000001883 '
toFixedWe can see more accurate
0.1What’s the number, and you can see it clearly
1.1 - 1It’s not the same number at all, even if it’s in the decimal system
0.1But in binary terms, it’s an inexhaustible number, so it’s slightly different when you do the calculation.
Under what circumstances will “0.1” be regarded as “0.1”? The answer is:
- Less than 0.100000000000124 (etc.)
- Greater than 0.09999999999999987 (etc.)
As for how to know exactly how IEEE-754 does “valuation”, the answer may be found here. Curious babies can delve into it
In a word, because of the inexhaustible division and the error in the calculation, beyond a certain value, one number becomes another.
The second kind of uncertainty is becauseIt’s too big。
We know that double precision floating-point numbers have 52 decimal places. If you add the previous 1, then the maximum isAnd it can be expressed accuratelyThat’s the integer of
console.log(Math.pow(2, 53)) //Output 9007199254740992 console.log(Math.pow(2, 53) + 1) //Output 9007199254740992 console.log(Math.pow(2, 53) + 2) //Output 9007199254740994
Why is + 2 accurate again? Because in this range, the multiple of 2 can still be accurately expressed. Up again, when the numbers arrive
Math.pow(2,54)After that, you can only accurately express the multiple of 4, the 55th power is the multiple of 8, and so on.
console.log(Math.pow(2, 54)) //Output 18014398509481984 console.log(Math.pow(2, 53) + 2) //Output 18014398509481984 console.log(Math.pow(2, 53) + 4) //Output 18014398509481988
So although floating-point numbers can represent the maximum and minimum numbers, they are not so accurate. However, they are better than fixed points which can not be expressed at all.
Decimal to IEEE-754
IEEE-754 to decimal
Do it yourself conversion from decimal to IEEE-754