A first look at distributed systems: MIT 6.824 series (I)

Time: 2021-10-21

Preface

This series grew out of a reading activity started by the "Manon turned over" Knowledge Planet group, on the recommendation of @My UDP does not lose packets. This round is a little different: we set aside traditional books and turned to the top-tier graduate course 6.824, taught at MIT by Robert Morris, who created the Morris worm many years ago. The course is taught through video lectures, lab experiments (in Go), and papers. The whole course is in English, which makes it difficult.

What counts as a distributed system

  • Multiple cooperating computers
  • Storage for big web sites, MapReduce, peer-to-peer sharing
  • Lots of critical infrastructure is distributed

MapReduce: a computing system for large-scale data sets. For example, summing the integers from 1 to 100 billion can be done on a single computer, or the work can be split across many computers with this technique and the partial results combined, which greatly improves efficiency.
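The split-then-combine idea can be sketched on one machine, with goroutines standing in for separate computers. This is only an illustration of the pattern, not the MapReduce system itself; the names `sumRange` and `parallelSum` are made up for this example, and it sums 1 to 1,000,000 because the sum up to 100 billion would overflow a `uint64`.

```go
package main

import (
	"fmt"
	"sync"
)

// sumRange is the per-worker step: it sums the integers in [lo, hi].
func sumRange(lo, hi uint64) uint64 {
	var s uint64
	for i := lo; i <= hi; i++ {
		s += i
	}
	return s
}

// parallelSum splits [1, n] into `workers` chunks, sums each chunk in its
// own goroutine, and then combines the partial sums.
func parallelSum(n uint64, workers int) uint64 {
	partial := make([]uint64, workers)
	chunk := n / uint64(workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		lo := uint64(w)*chunk + 1
		hi := lo + chunk - 1
		if w == workers-1 {
			hi = n // the last worker also takes the remainder
		}
		wg.Add(1)
		go func(w int, lo, hi uint64) {
			defer wg.Done()
			partial[w] = sumRange(lo, hi) // each worker writes only its own slot
		}(w, lo, hi)
	}
	wg.Wait()

	// Combine step: add up the partial results.
	var total uint64
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	n := uint64(1000000)
	// Compare against the closed form n(n+1)/2.
	fmt.Println(parallelSum(n, 4), n*(n+1)/2)
}
```

Each worker writes to its own slot of `partial`, so no lock is needed; the only synchronization is waiting for all workers before combining.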

Why distributed systems

  • To increase capacity via parallelism
  • To tolerate faults via replication
  • To place computing physically close to external entities
  • To achieve security via isolation

Fault tolerance: there are two main aspects, availability and recoverability.

In a distributed system, the servers generally do not all fail at the same time. Service availability and data safety are therefore better guaranteed than with a single-machine service.

Difficulties of distributed systems

  • Concurrent programming demands extra care, which sharply raises the skill required of developers
  • The interactions within the system are very complex
  • Unexpected errors: partial failure (local error)
  • Actual performance often falls short of expected performance

Partial failure (local error): suppose each machine fails on any given day with probability one in a thousand. A single-machine application may run for a long time without trouble, but a distributed system has many more machines, so some machine may fail nearly every day. This is partial failure: part of the system breaks while the rest keeps running. It is hard to troubleshoot and almost unavoidable.
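The arithmetic behind this can be checked directly. Assuming machines fail independently (a simplification made for this sketch, not something the course states), the chance that at least one of n machines fails on a given day is 1 - (1 - p)^n. The function name `probAnyFailure` is made up for this example.

```go
package main

import (
	"fmt"
	"math"
)

// probAnyFailure returns the probability that at least one of n machines
// fails on a given day, assuming each fails independently with probability p.
func probAnyFailure(p float64, n int) float64 {
	return 1 - math.Pow(1-p, float64(n))
}

func main() {
	// p = 0.001 is the one-in-a-thousand daily failure rate from the text.
	for _, n := range []int{1, 100, 1000, 10000} {
		fmt.Printf("%6d machines: %.4f\n", n, probAnyFailure(0.001, n))
	}
}
```

With 1,000 machines the daily chance of at least one failure is already about 63%, which is why failure has to be treated as the normal case.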

Here is a comparison between a single-machine application and a distributed application. The picture is from Geek Time · Listening to the wind in left ear

Solutions for distributed systems

Overall goal

We need to design a set of abstractions that hide the complexity of distributed systems.

Why set this goal?

Because distributed systems are inherently complex, they must be simplified.

What does simplification have to do with abstraction?

The best abstraction I know of so far is the file.

“UNIX files are essentially a bag of bytes.” – “The Art of UNIX Programming”

In UNIX, any device that supports I/O, whether a file, a socket, or a driver, is represented by a file descriptor once it is opened. UNIX reduces reading and writing all of these devices to read and write: you only pass the open file descriptor to these two functions, and the kernel uses the descriptor to find the specific device. The details of reading and writing each kind of device are hidden inside the kernel and are transparent to the user. You only need to open the device to get an fd and then perform the corresponding operations.
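A small Go sketch shows the same idea. In Go, `*os.File` wraps a file descriptor, so the same helper can write to and read from a pipe or a regular file without knowing which one it was given. The helper names `writeThenRead`, `viaPipe`, and `viaFile` are made up for this example.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// writeThenRead writes msg through one descriptor and reads it back through
// another. It does not know what kind of object sits behind either of them.
func writeThenRead(w, r *os.File, msg string) (string, error) {
	if _, err := w.Write([]byte(msg)); err != nil {
		return "", err
	}
	buf := make([]byte, len(msg))
	if _, err := io.ReadFull(r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}

// viaPipe runs the helper over an anonymous pipe.
func viaPipe(msg string) (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()
	defer w.Close()
	return writeThenRead(w, r, msg)
}

// viaFile runs the same helper over a temporary regular file,
// opened once for writing and once for reading.
func viaFile(msg string) (string, error) {
	w, err := os.CreateTemp("", "fd-demo-*")
	if err != nil {
		return "", err
	}
	defer os.Remove(w.Name())
	defer w.Close()
	r, err := os.Open(w.Name())
	if err != nil {
		return "", err
	}
	defer r.Close()
	return writeThenRead(w, r, msg)
}

func main() {
	got, err := viaPipe("hello")
	fmt.Println("pipe:", got, err)
	got, err = viaFile("hello")
	fmt.Println("file:", got, err)
}
```

The point is that `writeThenRead` has no branch for "pipe" versus "file"; the kernel resolves that from the descriptor.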

Research angles

  • Implementation:

    • RPC, threads, and concurrency control
  • Performance:

    • Usually we want a system whose performance scales.
    • Parallel capacity can be increased simply by adding computers to the system, which goes part of the way toward scaling its performance:

      • This works well when there are no complex interactions
      • You don’t have to hire expensive programmers to redesign the system.
    • Simply adding computers does not always improve system performance:

      • As the number of computers grows, load imbalance, uneven performance among the machines, and code that cannot run in parallel (such as initialization and interaction) all reduce the performance of the system.
      • Access to shared resources, such as the network or a database, can also become a bottleneck
    • Some performance problems cannot be solved by adding computers at all:

      • For example, a fast response time for a single user request
      • For example, when all users want to update the same data.
      • These situations usually call for better programming rather than more computers.
  • Fault tolerance:

    • A large number of servers plus a large system usually means that something is always failing
    • We need to hide these failures from the application
    • We usually want the system to have availability and recoverability

      • Availability: the system can continue to run even if a failure occurs
      • Recoverability: after the failure is fixed, the system can resume operation
    • Fault tolerance can usually be improved with a standby server
  • Consistency:

    • It is often difficult to build a system that behaves correctly:

      • Keeping a server consistent with its backup server is difficult and expensive
      • The client may fail partway through an operation.
      • The server may crash after processing and before replying
      • A poor network may leave healthy servers unable to provide service
    • Consistency and performance are often at odds:

      • Strong consistency requires a lot of communication between the pieces of infrastructure
      • Many designs are forced to provide only weak consistency in order to improve performance
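One of the cases above, where the server processes a request but the reply never arrives, can be sketched as follows. A common way to make client retries safe is to tag each request with an ID and have the server remember the reply it already produced. This is a minimal sketch under stated assumptions: the `Server` type and its `Increment` method are made up for this example, and the table lives only in memory, so it models a lost reply rather than a real crash. A real server would have to persist or replicate the table.

```go
package main

import (
	"fmt"
	"sync"
)

// Server is a hypothetical counter service that remembers the reply it
// produced for each request ID, so a retried request is not applied twice.
type Server struct {
	mu      sync.Mutex
	counter int
	seen    map[string]int // request ID -> reply already computed
}

// NewServer returns a server with an empty duplicate table.
func NewServer() *Server {
	return &Server{seen: make(map[string]int)}
}

// Increment applies the request at most once per request ID.
func (s *Server) Increment(reqID string) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	if reply, ok := s.seen[reqID]; ok {
		return reply // duplicate: return the saved reply without re-executing
	}
	s.counter++
	s.seen[reqID] = s.counter
	return s.counter
}

func main() {
	s := NewServer()
	s.Increment("req-1")              // executed, but suppose the reply is lost
	fmt.Println(s.Increment("req-1")) // the client retries with the same ID
	fmt.Println(s.Increment("req-2")) // a genuinely new request
}
```

Without the `seen` table, the retry would increment the counter a second time, and the client could not tell whether its first attempt had taken effect.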

Consistency: this seems to be the hardest problem to solve, because it touches on performance, fault tolerance, data consistency, and more.

As mentioned earlier, fault tolerance and disaster recovery require data backups. In a distributed system, if service A modifies a value in database A, should the value in database B change immediately or after a delay? What should happen if a synchronous update fails, and what should happen if an asynchronous update fails?

In the end, the industry has found these problems hard to solve completely, so the mainstream approach is eventual consistency.

That is, data is allowed to be inconsistent for a short time, and eventual consistency is relied on to keep both performance and data safety acceptable.
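The idea of tolerating a short window of inconsistency can be sketched with a primary that acknowledges writes immediately and ships them to a backup later. This is an illustrative sketch only: the `Primary` and `Backup` types are made up for this example, replication is triggered by an explicit call so the stale window is visible, and failures during replication are not modeled.

```go
package main

import "fmt"

// update is one queued write waiting to be shipped to the backup.
type update struct{ key, value string }

// Primary accepts writes immediately and queues them for the backup.
type Primary struct {
	data    map[string]string
	pending []update
}

// Backup serves reads from its own, possibly stale, copy.
type Backup struct {
	data map[string]string
}

func NewPrimary() *Primary { return &Primary{data: map[string]string{}} }
func NewBackup() *Backup   { return &Backup{data: map[string]string{}} }

// Put returns as soon as the primary has the value; the backup is not contacted.
func (p *Primary) Put(key, value string) {
	p.data[key] = value
	p.pending = append(p.pending, update{key, value})
}

// Replicate applies the queued updates to the backup in order. A real system
// would do this in the background; here it is explicit for clarity.
func (p *Primary) Replicate(b *Backup) {
	for _, u := range p.pending {
		b.data[u.key] = u.value
	}
	p.pending = nil
}

// Get reads from the backup's local copy.
func (b *Backup) Get(key string) (string, bool) {
	v, ok := b.data[key]
	return v, ok
}

func main() {
	p, b := NewPrimary(), NewBackup()
	p.Put("x", "1")
	_, ok := b.Get("x")
	fmt.Println("before replication, backup has x:", ok)
	p.Replicate(b)
	v, _ := b.Get("x")
	fmt.Println("after replication, backup x =", v)
}
```

Writes are fast because `Put` never waits for the backup, and the price is that a read from the backup between `Put` and `Replicate` sees old data.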

Mind map (continuously updated)

File sharing address: https://www.processon.com/vie…

Contents of the next chapter

In the next chapter, we will work through lab 1 of 6.824: implementing a simple MapReduce system in Go.

Go has been one of the most popular languages in recent years, arguably valued even more highly than the already popular Python.

Requirements of this chapter

  • Understand the origin and challenges of distributed systems
  • Understand the distributed system solutions covered in the 6.824 course
  • Set up a Go environment and write HelloWorld (Go syntax and the MapReduce implementation will be studied in the next chapter)

Finally

Related resources:

Go official mirror site

Go language IDE

Go language environment building tutorial

Go language beginner + HelloWorld

MIT curriculum home page

Video with Chinese translation on Bilibili (station B)
