Why is precision lost? Teach you to understand IEEE-754!


Link to the original text: https://ssshooter.com/2020-09…


The image above is from Wikipedia.

IEEE-754 is a floating-point standard. It defines three common formats: 32-bit, 64-bit, and 128-bit (the two pictures above show the 32-bit and 64-bit formats, which share the same structure). JavaScript uses the 64-bit format, commonly known as “double precision”. This article uses the 64-bit format as an example to explain the IEEE-754 standard.

As the figure shows, IEEE-754 divides the 64 bits into three parts:

  • sign, 1 bit: 0 is positive and 1 is negative
  • exponent, 11 bits
  • fraction (the fractional part), 52 bits
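As a minimal sketch of this layout, the following JavaScript splits a number into the three fields. The helper name toFields is made up for illustration; DataView is a standard built-in that exposes the raw bytes of a number.

```javascript
// Hypothetical helper: split a number into its IEEE-754 double-precision fields.
function toFields(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x); // writes the 8 bytes in big-endian order by default
  const bits = view.getBigUint64(0).toString(2).padStart(64, '0');
  return {
    sign: bits.slice(0, 1),      // bit 63
    exponent: bits.slice(1, 12), // bits 62 to 52
    fraction: bits.slice(12),    // bits 51 to 0
  };
}

console.log(toFields(1));  // sign '0', exponent '01111111111', fraction all zeros
console.log(toFields(-2)); // sign '1', exponent '10000000000', fraction all zeros
```

The fields come back as strings of 0s and 1s so that they can be compared directly with the bit patterns written out in this article.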

For convenience, we will use the following number to introduce the IEEE-754 standard (spaces separate the three parts):

0 10000000110 1101001000000000000000000000000000000000000000000000

That is exactly 64 bits. Count them if you don’t believe it.


Bit 63 (the first bit, reading from left to right) is the sign. In the example its value is 0, which means this is a positive number.


Bits 0 to 51 (52 bits in total) are called the “fraction” because these digits are placed after 1. (there are exceptions, which I’ll cover later).

In the example, the 52 bits belonging to the fraction are:

1101001000000000000000000000000000000000000000000000

These 52 digits are referred to as f (f stands for fraction). With the leading 1. added, the so-called 1.f looks like this:

1.1101001000000000000000000000000000000000000000000000

If you ask why a 1 is put in front, I haven’t looked into it. In short, that is what the standard stipulates. Either way, 1.f really is a fractional number.

But once we have this long string 1.f, how do we use it? We have to combine it with the exponent part.


To explain the exponent more clearly, here is a “table” for converting it from binary to decimal. Each row shows the binary pattern, its unsigned decimal value, and the exponent it stands for:

%00000000000 0 → −1023 (lowest number)

%01111111110 1022 → −1

%01111111111 1023 → 0

%10000000000 1024 → 1

%11111111111 2047 → 1024 (highest number)

Note that %01111111111 stands for 0: larger values are positive exponents and smaller values are negative ones.

Taking bits 52 to 62 (11 bits in total) from the example above gives 10000000110, which is 1030 in decimal. Because 1023 stands for 0, we subtract 1023 to get the real exponent, which is 7.

To use this exponent, we multiply the 1.f obtained above by 2 to the 7th power (trailing zeros are omitted to save space):

1.f × 2^(e−1023) = 1.1101001 × 2^7 = 11101001

Note that this is binary! (The decimal analogy is similar: 1.3828171 × 10^7 = 13828171.)
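The arithmetic above can be checked with a few lines of JavaScript. This is only a sketch: the variable names are made up for illustration, and the fraction is written without its trailing zeros, as in the text.

```javascript
const exponentBits = '10000000110';
const fractionBits = '1101001'; // trailing zeros omitted

const e = parseInt(exponentBits, 2); // stored exponent
const exponent = e - 1023;           // real exponent after removing the bias

// Build 1.f: the i-th fraction bit (counting from the left) is worth 2^-(i+1).
let significand = 1;
[...fractionBits].forEach((bit, i) => {
  significand += Number(bit) * 2 ** -(i + 1);
});

const value = significand * 2 ** exponent;
console.log(e, exponent, significand, value); // 1030 7 1.8203125 233
```

Multiplying by 2 ** exponent is exactly the “floating” of the point: it shifts the binary point seven places to the right.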

This is why floating-point numbers are called “floating point”: the position of the point can drift left and right with the value of the exponent, so the same number of bits can represent both very large and very small numbers.

The opposite is “fixed point”. For example, if the largest number is 1111111111.1111111111 and the point is always fixed in the middle, there is no way to express a number whose absolute value is greater than 1111111111.1111111111 or, apart from zero, less than 0.0000000001.

After combining the “fraction” and the “exponent” to get 11101001, we convert it to decimal and apply the sign bit, which needs little explanation (0 means positive).

So the example

0 10000000110 1101001000000000000000000000000000000000000000000000

is actually how IEEE-754 stores 233.

Exceptional case

When the exponent is −1023 (that is, the minimum value, represented by eleven zeros in binary), the number is a special case called denormalized.

In this case the formula for the value changes to:

0.f × 2^(−1022)

In this special case f is not preceded by 1., which makes it possible to represent very small numbers.


The expert’s summary is remarkably concise (here e is the stored 11-bit value):

expression                      value
(−1)^s × %1.f × 2^(e−1023)      normalized, 0 < e < 2047
(−1)^s × %0.f × 2^(e−1022)      denormalized, e = 0, f > 0
(−1)^s × 0                      e = 0, f = 0
NaN                             e = 2047, f > 0
(−1)^s × ∞ (infinity)           e = 2047, f = 0

The first row is the normal case, the second row is the denormalized 0.f case described above, and the third row is simply all zeros.

In the fourth and fifth rows, all 11 bits of e are 1. If f is greater than 0 the value is NaN, and if f equals 0 it is infinity.
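The five rows can be turned into a small decoder. This is a sketch under assumptions, not a reference implementation: the function name decodeDouble is hypothetical, and it expects a string of exactly 64 characters, each 0 or 1.

```javascript
// Hypothetical helper: apply the five rows of the summary table to a 64-bit string.
function decodeDouble(bits) {
  const s = bits[0] === '1' ? -1 : 1;           // (−1)^s
  const e = parseInt(bits.slice(1, 12), 2);     // stored exponent, 0 to 2047
  const f = parseInt(bits.slice(12), 2) / 2 ** 52; // fraction as a value in [0, 1)

  if (e === 2047) return f > 0 ? NaN : s * Infinity; // rows 4 and 5
  if (e === 0) return s * f * 2 ** -1022;            // rows 2 and 3 (0.f)
  return s * (1 + f) * 2 ** (e - 1023);              // row 1 (1.f)
}

const example = '0' + '10000000110' + '1101001' + '0'.repeat(45);
console.log(decodeDouble(example)); // 233
```

Rows 2 and 3 share one branch because 0.f with f = 0 is simply zero, and the sign still applies, which is how −0 arises.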

Converting to IEEE-754 by hand

Using the formulas summarized above, it should not be difficult to convert IEEE-754 back to decimal. But how do you convert a decimal number to IEEE-754 by hand?

Let’s pick a fairly simple number, −432.1, and paste the 64-bit layout diagram here again so that we don’t have to scroll back up.


Seeing the minus sign, there is no doubt that the sign is 1. We have the first piece of the puzzle, s = 1.


The second step is to convert 432.1 to binary.

Integer part: divide by 2 repeatedly until the result is 0:

calculation result remainder
432/2 216 0
216/2 108 0
108/2 54 0
54/2 27 0
27/2 13 1
13/2 6 1
6/2 3 0
3/2 1 1
1/2 0 1

Write the result from bottom to top: 110110000
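The repeated division can be sketched as a short loop. The function name intToBinary is made up for illustration; it assumes a non-negative integer.

```javascript
// Hypothetical helper: repeated division by 2, collecting remainders.
function intToBinary(n) {
  if (n === 0) return '0';
  let digits = '';
  while (n > 0) {
    digits = (n % 2) + digits; // prepend, which is "bottom to top" in the table
    n = Math.floor(n / 2);
  }
  return digits;
}

console.log(intToBinary(432)); // '110110000'
```

In practice (432).toString(2) gives the same string; the loop is only there to mirror the table.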

Fractional part: multiply by 2 repeatedly until the result is 0:

calculation result integer digit
0.1*2 0.2 0
0.2*2 0.4 0
0.4*2 0.8 0
0.8*2 1.6 1
0.6*2 1.2 1
0.2*2 0.4 0
0.4*2 0.8 0
0.8*2 1.6 1
0.6*2 1.2 1
0.2*2 0.4 0
0.4*2 0.8 0
0.8*2 1.6 1
0.6*2 1.2 1

It never ends. Sharp-eyed readers will have noticed that this has entered an infinite loop.

Just as one third in decimal equals 0.33333333…, one tenth in binary equals 0.00011001100110011…; both are infinitely repeating fractions.
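The repeated multiplication can be sketched in code as well. To keep the sketch free of the very rounding errors being discussed, it works on an exact fraction num/den using BigInt rather than on a floating-point 0.1; the function name fracToBinary and the digit limit are assumptions made for illustration.

```javascript
// Hypothetical helper: binary digits of the fraction num/den (0 <= num < den),
// stopping when the remainder reaches 0 or after maxDigits digits.
function fracToBinary(num, den, maxDigits) {
  let digits = '';
  while (num !== 0n && digits.length < maxDigits) {
    num *= 2n;          // multiply by 2
    if (num >= den) {   // integer digit is 1: record it and keep the remainder
      digits += '1';
      num -= den;
    } else {            // integer digit is 0
      digits += '0';
    }
  }
  return digits;
}

console.log(fracToBinary(1n, 10n, 17)); // '00011001100110011'
console.log(fracToBinary(1n, 4n, 17));  // '01'
```

One tenth keeps producing digits until the limit is hit, while one quarter terminates after two digits, which is the difference between a repeating and an exact binary fraction.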

Then combine the integer part and the fractional part: 110110000.0[0011], where [0011] repeats forever.


Next, rewrite it in the 1.f × 2^(e−1023) format:

1.1011000000011001100110011001100110011001100110011010 × 2^8

Fill the 52 bits of f with the infinitely repeating digits, rounding the last bit:

f = 1011000000011001100110011001100110011001100110011010

8 = e − 1023, so e is 1031. Converted to binary:

e = 10000000111


All the pieces of the puzzle are ready. Let’s put them together: s + e + f!

1 10000000111 1011000000011001100110011001100110011001100110011010

This is the true form of −432.1 as an IEEE-754 double-precision floating-point number.
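As a sanity check, the three pieces can be glued together and reinterpreted as a double. This sketch uses the standard DataView and BigInt built-ins; the variable names are only illustrative.

```javascript
const s = '1';
const e = '10000000111';
const f = '1011000000011001100110011001100110011001100110011010';

const view = new DataView(new ArrayBuffer(8));
view.setBigUint64(0, BigInt('0b' + s + e + f)); // store the raw 64-bit pattern
const value = view.getFloat64(0);               // read the same bytes as a double

console.log(value); // -432.1
```

The printed value is −432.1 because that is the shortest decimal string that identifies this double, not because the stored value is exactly −432.1.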

Why is it inaccurate?

Programmers suffer from loss of precision. This problem does not only happen in JavaScript, but poor JavaScript already has plenty of strange quirks, so we often tie the 0.1 + 0.2 problem to JavaScript. In fact, Java and other languages that use the IEEE-754 standard have the same problem (although Java has BigDecimal, while JavaScript can only cry).

So why is it not accurate?

Situation 1

Let’s start with the most common situation:

0.1 + 0.2 // 0.30000000000000004

1 - 0.9 // 0.09999999999999998

0.0532 * 100 // 5.319999999999999

I used to think that multiplying by 100 to get an integer before adding and subtracting would avoid losing precision. But in fact the result of the multiplication itself is already off.

Back to the cause. It is the same as in the conversion of 0.1 above: the value cannot be divided out exactly in binary.

But why?! When printed, it is clearly a normal 0.1! Why is the 0.1 that comes out of 1 − 0.9 not 0.1?

Here is my rough guess:

console.log((0.1).toFixed(30))

//Output '0.100000000000000005551115123126'

console.log((1.1 - 1).toFixed(30))

//Output '0.100000000000000088817841970013'

With toFixed we can see more precisely what number 0.1 really is, and we can clearly see that 0.1 and 1.1 - 1 are not the same number at all. 0.1 is a tidy number in decimal, but in binary it never terminates, so calculations come out slightly different.

Under what circumstances is a decimal value treated as “0.1”? Any value in roughly this range is stored as the same double as 0.1:

  • Less than 0.1000000000000000124 (approximately)
  • Greater than 0.0999999999999999987 (approximately)
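The range can be probed directly with number literals. This is only a sketch: the four literals are chosen for illustration, two just inside the range and two just outside it, and the array names are made up.

```javascript
// Literals inside the range are stored as the same double as 0.1.
const inside = [0.1000000000000000124, 0.0999999999999999987];
// Literals just outside the range land on a neighbouring double.
const outside = [0.1000000000000000125, 0.0999999999999999986];

console.log(inside.map((x) => x === 0.1));  // [ true, true ]
console.log(outside.map((x) => x === 0.1)); // [ false, false ]
```

The boundaries sit halfway between the double nearest to 0.1 and its two neighbours, which is why the range is so narrow.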

As for exactly how IEEE-754 does this “rounding”, the answer may be found here. Curious readers can dig into it.

In short, because the division never terminates and calculations introduce error, once the error exceeds a certain threshold one number becomes another.

Situation 2

The second kind of inaccuracy arises because the number is too big.

We know that a double-precision floating-point number has 52 fraction bits. Adding the leading 1 gives 53 significant bits, so the largest value up to which every integer can be represented exactly is Math.pow(2, 53).

console.log(Math.pow(2, 53))

//Output 9007199254740992

console.log(Math.pow(2, 53) + 1)

//Output 9007199254740992

console.log(Math.pow(2, 53) + 2)

//Output 9007199254740994

Why is + 2 accurate again? Because in this range, multiples of 2 can still be represented exactly. Going further up, from Math.pow(2, 54) onward only multiples of 4 can be represented exactly, from the 55th power only multiples of 8, and so on.

console.log(Math.pow(2, 54))

//Output 18014398509481984

console.log(Math.pow(2, 54) + 2)

//Output 18014398509481984

console.log(Math.pow(2, 54) + 4)

//Output 18014398509481988

So although floating-point numbers can represent extremely large and extremely small numbers, they are not that accurate. Still, that is better than fixed point, which cannot represent such numbers at all.
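The widening gaps can be measured directly. The sketch below steps to the next bit pattern with DataView and subtracts; the helper name gapAbove is hypothetical, and it assumes a positive finite input.

```javascript
// Hypothetical helper: distance from x to the next larger double (x > 0, finite).
function gapAbove(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);
  view.setBigUint64(0, view.getBigUint64(0) + 1n); // next bit pattern up
  return view.getFloat64(0) - x;
}

console.log(gapAbove(1));       // 2.220446049250313e-16 (Number.EPSILON)
console.log(gapAbove(2 ** 53)); // 2
console.log(gapAbove(2 ** 54)); // 4
console.log(gapAbove(2 ** 55)); // 8
```

Each time the exponent grows by one, the gap doubles, because the same 52 fraction bits are spread over a range twice as wide.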

Practical links

Decimal to IEEE-754

IEEE-754 to decimal

Do it yourself conversion from decimal to IEEE-754
