1、 Historical background
Around 2000, China’s Internet was still in the portal era, while the Internet worldwide had begun to take off.
On the one hand, mainstream Internet companies, with Google as the leading example, were starting to face explosive data growth. On the other hand, Internet companies at the time generally ran on low-cost, low-spec servers. The gap between rapidly growing data volumes and limited computing and storage capacity therefore became one of the main problems these companies faced.
Supercomputers could solve some large-scale computing problems, but they were expensive and were used mainly in laboratories and research institutions. The industry had no unified, commercially widespread distributed computing framework. Some distributed computing frameworks had been developed by then, but most of them never left the laboratory or the research institution.
The reason they did not spread is fairly simple: those frameworks were too complex in design to be understood well, let alone to be used for writing and running distributed computations.
2、 The birth of Hadoop
Against this background, let’s look at how Hadoop was born.
The first milestone came in October 2003, when Google published its first paper, “The Google File System” (hereinafter GFS), which mainly describes how to solve the problem of distributed storage.
The second milestone came in December 2004, when Google published its second paper, “MapReduce: Simplified Data Processing on Large Clusters”. It describes a simplified data processing model for large clusters, which we will analyze in detail later.
In addition, in 2006 Google published a third paper, “Bigtable: A Distributed Storage System for Structured Data”.
The publication of these three papers essentially marks the arrival of the big data era, and they are commonly called the troika of big data. The first paper addresses distributed storage, the second addresses distributed computing, and the third addresses the storage and querying of large-scale structured data.
Before publishing these papers, Google had already built the corresponding systems and used them widely inside the company. Google is a search engine company, and its core problem is crawling the Internet and building an index over it. When the volume of data on the Internet is very large, building the index quickly becomes a hard problem. Before MapReduce was born, the departments inside Google each had their own implementations, but the problems they ran into and the general ideas for solving them were similar, for example how to split tasks across multiple machines and how to handle fault tolerance. Everyone was essentially reinventing the wheel, and the implementation and maintenance costs were very high. It was at this point that Jeff Dean, a renowned engineer at Google, abstracted the MapReduce architecture from everyone’s practice.
But what do Google’s three papers have to do with Hadoop? Here we need to mention a second person, Doug Cutting. Doug Cutting is an open source enthusiast who, in the same period as Google, had also implemented an open source search engine.
However, his engine could never match the speed of Google’s search and index building. After Google published the GFS and MapReduce papers, Doug Cutting was inspired and implemented open source versions of GFS and MapReduce. At first the two systems were mainly used to solve the data statistics problems encountered in search engines, but people soon found that the MapReduce framework was actually very flexible and general.
So in February 2006 the Hadoop project was officially established. It has two parts: MapReduce, which solves the problem of distributed computing, and HDFS, which solves the problem of distributed storage.
As an aside, where does the name Hadoop come from? Doug Cutting once explained that it was the name of his son’s plush toy elephant, which also inspired Hadoop’s yellow elephant logo. When we talk about working with big data, we often say that elephants can dance, an image that emphasizes the flexible computing power Hadoop brings to big data.
3、 Characteristics of HDFS
Hadoop includes HDFS and MapReduce. Let’s look at how HDFS is designed and what its characteristics are.
The first characteristic: error detection and automatic recovery are a core architectural goal of HDFS. Why was error recovery a core goal at the time? It has a lot to do with the server configurations of that period. A typical server used by Google might have had 2 cores and 4 GB of memory, and such servers failed easily. To save money, commercial companies such as Google often used cheap commodity servers, or even bought CPUs, memory and disks and assembled the servers themselves, and the reliability of such assembled servers was very low.
Moreover, in a large cluster the failure rate is magnified many times. Consider a simple calculation: if each server fails with probability 1%, what is the probability that at least one server fails when we use 100 servers at the same time? Assuming the servers fail independently, the answer is 1 - 0.99^100, or about 63%, so we are more likely than not to run into a server failure. As we use more servers, this probability increases further. Error detection and fault recovery were therefore among the core objectives in designing distributed systems at that time.
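The 63% figure can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes that servers fail independently, and the function name is made up for illustration.

```python
def prob_at_least_one_failure(p_single: float, n_servers: int) -> float:
    """Probability that at least one of n independent servers fails."""
    # P(no failure at all) = (1 - p)^n, so P(at least one) is its complement.
    return 1.0 - (1.0 - p_single) ** n_servers


for n in (10, 100, 1000):
    print(n, round(prob_at_least_one_failure(0.01, n), 3))
```

With a 1% failure probability per server, 100 servers give roughly 0.634, and the value approaches 1 as the cluster grows.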
The second characteristic is large data sets. A typical file on HDFS is gigabytes to terabytes in size.
The third characteristic is streaming data access: when we read data stored on HDFS, we generally scan it in batches rather than randomly accessing parts of it.
The fourth characteristic is a simple consistency model. We generally write data to HDFS once and then read it many times for analysis, which is one of the important prerequisites for a high-throughput design.
The last characteristic is that moving computation is cheaper than moving data, which is also a core idea. Generally, reading data over the network costs much more than reading it from a local disk, which in turn costs much more than reading it from local memory. The idea is therefore to run the computation where the data is located, rather than pulling the data to wherever we eventually want to do the computation for unified processing. Moving data is expensive, whereas placing the computation next to the data keeps it local, which is efficient and cost-effective.
4、 HDFS data block
The data block is a core concept of HDFS. A data block is the basic unit in which an HDFS file is read and written, and a typical block size is about 64 MB or 128 MB. Why do we need the concept of a data block?
The first reason is that it minimizes seek overhead. Take a mechanical disk as an example: to read or write data, the disk must first move its arm to position the head over the right location, and only then can the head read or write the data. This seek is very time-consuming. If the data block is large enough, the data can be stored contiguously, so the time spent seeking is greatly reduced relative to the time spent transferring data, and the overall read and write efficiency of the disk is at its highest.
The second reason is that, with data blocks, the size of an HDFS file is simply a matter of how many blocks it has. A file is not limited by the capacity of any single disk, so in theory an HDFS file can be arbitrarily large.
The third reason is that the overall architecture of the storage subsystem becomes simpler. The storage layer only needs to treat a server’s disks as a collection of data blocks.
The last reason is that replica management and fault tolerance also work at the level of data blocks. When a block is lost or damaged, we only need to repair that block, and the file as a whole is not damaged.
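To make the idea of "a file is a sequence of blocks" concrete, here is a minimal sketch that cuts a file of a given size into fixed-size blocks. The 128 MB block size comes from the text; the function name and the (offset, length) representation are assumptions made for illustration, not the way HDFS stores its metadata.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, one of the typical sizes mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


one_gb = 1024 ** 3
print(len(split_into_blocks(one_gb)))        # 8 full blocks
print(len(split_into_blocks(one_gb + 1)))    # 9 blocks, the last one holds 1 byte
```

Because each block is described independently, a damaged block can be repaired or re-replicated without touching the other blocks of the file.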
5、 HDFS overall architecture
The overall architecture of HDFS is shown in the figure. HDFS has two roles: the NameNode and the DataNode. A typical HDFS cluster has one active NameNode and many active DataNodes, so it is a typical master-slave architecture.
The NameNode maintains all metadata, which mainly means the HDFS file names, the directory structure, and the actual storage locations of the blocks that make up each HDFS file.
A DataNode is generally deployed on each storage server and is mainly responsible for managing reads and writes of the file data stored on that server. For example, to read a file we first ask the NameNode for the locations of the file’s blocks and then read the actual file data from the corresponding DataNodes. Writing is similar: we first ask the NameNode to allocate DataNodes and location information, and then write the data to those DataNodes.
Another concept here is the rack. In a large HDFS cluster, racks are taken into account when reading and writing data, because the machines in a rack usually share one switch. Reading a file from machines in the same rack reduces the network transfer in between.
Likewise, writing files within the same rack is more efficient. However, because the rack shares a switch, the machines in a rack are also likely to fail together. When writing files, we therefore have to consider not only efficiency but also the reliability of the data.
5.1 HDFS file reading process
As shown in the figure, when we read a file from HDFS, the client performs three steps, numbered 1, 3 and 6. In step 1 we open the file with the open operation; in step 3 we read its contents with the read operation; and in step 6, after reading, we close the file with the close operation.
From the user’s point of view this is almost the same as reading a local file, but what happens behind the scenes is completely different. When the open operation in step 1 runs, the HDFS client sends a request to the NameNode to obtain the actual storage locations of the specified file. Once the read operation starts, the client accesses the data on the corresponding DataNodes directly and no longer needs to go through the NameNode. Because the data may be stored on several DataNodes, the client reads from them one after another. After reading, we close the file stream, and a complete HDFS file read is finished.
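The division of labour in the read path can be sketched with a toy in-memory model. This is not the HDFS client API: the classes, the `get_block_locations` method and the block ids below are all hypothetical, and each block has a single location to keep the example short.

```python
class DataNode:
    """Toy DataNode: stores block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Toy NameNode: holds metadata only, never file contents."""

    def __init__(self):
        # file path -> ordered list of (block_id, datanode)
        self.block_map = {}

    def get_block_locations(self, path):
        return self.block_map[path]


def read_file(namenode, path):
    # "open": a single metadata request to the NameNode.
    locations = namenode.get_block_locations(path)
    # "read": fetch each block directly from its DataNode, in order.
    chunks = [datanode.blocks[block_id] for block_id, datanode in locations]
    # "close": nothing to release in this toy model.
    return b"".join(chunks)


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hdfs"
nn = NameNode()
nn.block_map["/demo.txt"] = [("blk_1", dn1), ("blk_2", dn2)]
print(read_file(nn, "/demo.txt"))  # b'hello hdfs'
```

The point of the sketch is that the NameNode is contacted once for locations, while the bulk data flows only between the client and the DataNodes.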
5.2 HDFS file writing process
Now let’s look at how HDFS writes a file. It is similar to reading: from the user’s point of view, writing also has three steps. In step 1 we run the create operation to create the file. In step 2 we run the write operation and keep writing the contents of the file. In step 3, when we have finished writing, we run the close operation to close the whole pipeline.
What actually happens behind the scenes?
First, the client sends a request to the NameNode, which checks whether the client has the required permission. If the check passes, the NameNode allocates DataNodes and location information to tell the client where to store the file. With this information, the client writes the file to the corresponding DataNode. After a block has been written to the first DataNode, the first DataNode forwards the data to the second DataNode, and the second DataNode in turn forwards it to the third.
Once the data has been written, the third node returns an acknowledgement to the second node, the second returns an acknowledgement to the first, and the first returns the final acknowledgement to the client, confirming that the data has been written. The client then sends a message to the NameNode to update the corresponding metadata.
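The forwarding and acknowledgement chain described above can be sketched as a recursive call through a list of nodes. This is a simplified illustration under stated assumptions: the `DataNode` class and `write_block` method are hypothetical, a block is written in one piece, and there is no failure handling.

```python
class DataNode:
    """Toy DataNode that stores a block and forwards it downstream."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write_block(self, block_id, data, downstream):
        """Store the block, forward it down the pipeline, then acknowledge."""
        self.blocks[block_id] = data
        if downstream:
            next_node, rest = downstream[0], downstream[1:]
            acked = next_node.write_block(block_id, data, rest)
        else:
            acked = []  # last node in the pipeline, nothing to forward
        # Acknowledgements travel back in reverse order of the writes.
        return acked + [self.name]


pipeline = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
acks = pipeline[0].write_block("blk_1", b"payload", pipeline[1:])
print(acks)  # ['dn3', 'dn2', 'dn1']
```

The client talks only to the first node, and the order of the returned names mirrors the text: the third node acknowledges first, and the first node acknowledges last.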
To keep data highly available, HDFS has many fault-tolerance mechanisms, and the core one is replication. By default, replication ensures that every piece of data is stored as 3 replicas, and the placement of the replicas is chosen carefully. The first replica is generally stored on the machine where the client is located; the second is stored on another node in the same rack as the first; and the third is stored on a node in a different rack from the first two.
This choice takes write efficiency into account on the one hand and, on the other, keeps the replicas as independently available as possible. In addition, HDFS computes checksums when writing a file. The data is considered successfully written only when the checksums on all nodes match, which guarantees the integrity of the data during the write.
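The placement rule as described in this section can be written down as a small function. This is a sketch of the rule stated above, not Hadoop’s own placement code, and actual Hadoop releases may place replicas differently in detail. The topology dictionary, the node names and the function name are hypothetical, and the client is assumed to run on a node inside the cluster.

```python
import random


def choose_replica_nodes(client_node, topology, rng=random):
    """Pick three nodes following the placement described in the text.

    topology maps rack name -> list of node names (hypothetical layout).
    """
    local_rack = next(r for r, nodes in topology.items() if client_node in nodes)
    # Replica 1: the node the client runs on.
    first = client_node
    # Replica 2: another node in the same rack as the first.
    second = rng.choice([n for n in topology[local_rack] if n != first])
    # Replica 3: any node in a different rack.
    remote_nodes = [n for r, nodes in topology.items() if r != local_rack for n in nodes]
    third = rng.choice(remote_nodes)
    return [first, second, third]


topology = {
    "rack1": ["n1", "n2", "n3"],
    "rack2": ["n4", "n5", "n6"],
}
replicas = choose_replica_nodes("n1", topology, random.Random(0))
print(replicas)
```

Two replicas share a rack, which keeps most of the write traffic behind one switch, while the third replica survives the loss of that whole rack.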
6、 Basic principles of MapReduce
We have solved the problem of distributed storage with HDFS. Now let’s look at MapReduce, the other important part of Hadoop.
MapReduce is mainly used to solve the problem of distributed computing. Its first core idea is divide and conquer: split a large task into many small tasks of the same kind, so that completing the small tasks is equivalent to completing the large one.
The second idea is to move computation rather than data. We mentioned this when introducing the design of HDFS; the goal is to keep computation as local as possible.
The third idea is to tolerate the failure of some nodes, which we also covered in the HDFS section. Hardware failures are very common in large clusters, so the computing framework must be able to tolerate such partial failures.
With these three ideas in mind, let’s look at a concrete example: how do we count the frequency of each word in an article? If the amount of data is small, a simple script is enough. Now suppose we split the work into two steps, written as two functions, map and reduce.
In the first function, map, we read the article line by line, split each line into words, and emit each word together with an occurrence count of 1.
The second function, reduce, takes the output of map as its input, with the entries for the same word already grouped together. For each word we only need to iterate over its counts and add them up to obtain the number of times the word occurs.
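The two functions just described can be sketched in plain Python on a single machine. The names `map_fn`, `reduce_fn` and `word_count` are made up for this illustration, and the dictionary-based grouping merely stands in for the grouping that the framework performs between map and reduce.

```python
from collections import defaultdict


def map_fn(line):
    """Emit (word, 1) for every word in one line of text."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """Add up the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    grouped = defaultdict(list)  # stands in for the framework's grouping step
    for line in lines:
        for word, one in map_fn(line):
            grouped[word].append(one)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


print(word_count(["hello hadoop", "hello world"]))
# {'hello': 2, 'hadoop': 1, 'world': 1}
```

Neither function knows anything about machines or files, which is what lets a framework run many copies of them in parallel.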
The calculation in this example is very simple. Now suppose the scenario changes to one with a large amount of data: the document to be processed may be terabytes in size and may be stored across multiple machines. How do we complete the word count then? Let’s look at how MapReduce actually executes it.
Assume that a large document file is already stored on HDFS. We divide the file into four pieces, corresponding to split1 through split4 in the figure, and each map task takes one piece as its input. The map task splits the content into words as in the map function described above and outputs key-value pairs, where the key is the word we want to count and the value is its number of occurrences. In the map output, the value is always 1.
The map output is partitioned and stored on the local disk. Partitioning determines which reduce task each piece of map output will be sent to. For example, the data in the yellow partition 1 in the figure is eventually sent to reduce 1 as its input.
The input of reduce has the form (key, list<value>), where the key is the same as the key output by map and list<value> is the collection of values for that key output by all the map tasks. In the reduce stage we obtain the desired statistics by accumulating the values in the collection, and the output of reduce is written to HDFS.
The intermediate process from the output of map to the input of reduce is called shuffle. Shuffle involves sorting and redistributing data, and it is the most central and most complex part of MapReduce. When a map task outputs data, the data is partitioned, and the data within each partition is sorted by key.
After the map phase completes, each reduce task pulls the data of its partition from the output directories of the map tasks. To ensure that the combined output of multiple map tasks is ordered, the reduce task also performs a merge sort on its input.
From the overall MapReduce execution process we can see that if a map task fails, only that single map task needs to be re-executed. It can run on any node, reread its input split, and write its output to the local disk. If a reduce task fails, only that single reduce task needs to be re-executed. It can also run on any new node: it goes back to the map output locations, pulls the data of its partition again, computes the final result, and writes it out.
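The partition, sort and merge steps of shuffle can be simulated in one process. This is a sketch under stated assumptions, not Hadoop code: the function names are invented, two reduce tasks are assumed, and a CRC32 hash is used as the partitioner simply because it is deterministic across runs.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # assumed number of reduce tasks


def partition(key, num_reducers=NUM_REDUCERS):
    """Deterministic hash partitioner: a key always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_map_task(split):
    """Map one split, then partition and sort its output (map side of shuffle)."""
    partitions = [[] for _ in range(NUM_REDUCERS)]
    for line in split:
        for word in line.split():
            partitions[partition(word)].append((word, 1))
    return [sorted(p) for p in partitions]


def run_reduce_task(reducer_id, map_outputs):
    """Pull this reducer's partition from every map output, merge, and sum."""
    runs = [output[reducer_id] for output in map_outputs]
    merged = heapq.merge(*runs)  # merges runs that are already sorted
    return {word: sum(count for _, count in group)
            for word, group in groupby(merged, key=itemgetter(0))}


splits = [["a b a"], ["b c"], ["a c c"], ["d"]]
map_outputs = [run_map_task(s) for s in splits]
result = {}
for r in range(NUM_REDUCERS):
    result.update(run_reduce_task(r, map_outputs))
print(sorted(result.items()))  # [('a', 3), ('b', 2), ('c', 3), ('d', 1)]
```

Because `run_map_task` and `run_reduce_task` depend only on their inputs, re-running a single failed task reproduces exactly the same output, which is the property the fault-tolerance argument above relies on.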
7、 YARN job scheduling process
So far we have covered the basic principles and main execution flow of MapReduce. Another important question is how the many map and reduce tasks are assigned to and executed on different nodes, and how the tasks are coordinated. This is where YARN comes in.
YARN is a component introduced in Hadoop 2, mainly to handle resource management and allocation for tasks. YARN also has two roles, the ResourceManager and the NodeManager. The ResourceManager coordinates the NodeManagers to manage and allocate resources globally. A NodeManager corresponds to a specific execution node and is responsible for managing the task instances dispatched to that node. The scheduling process of a typical MapReduce job is shown in the figure above:
(1) The client starts the MapReduce job.
(2) The client asks the ResourceManager for a new application ID.
(3) The client copies the code, configuration files and other local resources required to run the job to a distributed storage system such as HDFS.
(4) The client then sends the job submission request to the ResourceManager.
(5) The ResourceManager finds a relatively idle NodeManager, and that NodeManager starts a container in which the ApplicationMaster is launched.
(6) The ApplicationMaster acts as the management center for the lifetime of this MapReduce job, and it begins by initializing the job.
(7) The ApplicationMaster fetches the job’s resources from HDFS and computes the input split information. At this point it knows how many execution resources the MapReduce program needs.
(8) The ApplicationMaster asks the ResourceManager to allocate execution resources, mainly hardware resources such as CPU and memory. In YARN these are allocated in the form of containers.
(9) Once the resources have been granted, the ApplicationMaster starts the containers on the corresponding NodeManagers.
(10) Each NodeManager starts the new container and fetches the code and split configuration needed to run the job.
(11) A map or reduce task is started in the container.
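Steps (7) to (9) can be illustrated with a toy calculation: derive the number of map tasks from the input size, then place one container per task on the node with the most free memory. This is only a rough sketch. YARN’s real schedulers are far more elaborate, and the function names, memory figures and "most free memory first" rule are assumptions made for this example.

```python
import math


def count_map_tasks(input_size, split_size):
    """One map task per input split (step 7)."""
    return math.ceil(input_size / split_size)


def allocate_containers(num_containers, container_mb, free_mb_by_node):
    """Place each container on the node with the most free memory (steps 8-9).

    Returns a list of node names, one per granted container. Containers that
    do not fit anywhere are simply not granted in this toy model.
    """
    free = dict(free_mb_by_node)  # work on a copy, leave the caller's data alone
    placements = []
    for _ in range(num_containers):
        node = max(free, key=free.get)
        if free[node] < container_mb:
            break
        free[node] -= container_mb
        placements.append(node)
    return placements


MB = 1024 * 1024
tasks = count_map_tasks(input_size=500 * MB, split_size=128 * MB)
placements = allocate_containers(tasks, 1024, {"nm1": 4096, "nm2": 2048})
print(tasks, placements)
```

A 500 MB input with 128 MB splits yields 4 map tasks, and each granted container then hosts one map or reduce task as in step (11).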
It is worth mentioning that, as an independent component, YARN is positioned as a general-purpose resource management and coordination center. It can therefore run not only MapReduce jobs; Spark and Flink jobs can also be submitted to YARN for management and execution.
8、 Hadoop ecosystem
Earlier we introduced the two core parts of Hadoop: HDFS, which solves the problem of distributed storage, and MapReduce, which solves the problem of distributed computing. We also introduced YARN, added in Hadoop 2 as an independent resource management and coordination center.
On top of these basic components, the open source community has developed more and more new components around Hadoop to solve the various problems that arise in big data processing. Today, when we talk about Hadoop in general, we often mean the whole Hadoop ecosystem, which contains a growing number of functional components.
Take Spark and Flink, which have been popular recently. Both were created to solve the problem of distributed computing, so they sit at the same level as MapReduce. The files that Spark and Flink read and write are generally stored on HDFS, and Spark and Flink jobs can also be scheduled and executed on YARN.
After MapReduce came into wide use, people found that it has two core weaknesses. One is that MapReduce has low execution efficiency and is suitable only for offline computing, not for scenarios with strict latency requirements. The other is that the MapReduce model is too simple: it supports only the map and reduce operators, and many complex computations are hard to implement directly with just these two.
Moreover, MapReduce exposes a relatively low-level API, so developing programs with it requires a fair amount of knowledge about distributed systems. On the one hand, people have done a lot of patching work on top of MapReduce, such as the introduction of Hive. On the other hand, newer computing frameworks have appeared, among which Spark and Flink stand out. Spark and Flink not only greatly improve computing performance but also unify stream and batch processing, machine learning, SQL queries and more on one underlying architecture. They expose higher-level APIs that are friendlier to developers of big data programs.
Although MapReduce is gradually being replaced by newer computing frameworks such as Spark and Flink, its basic ideas have been inherited and carried forward. Understanding how MapReduce executes is one of the foundations for understanding any distributed computing system.
At the same time, Hadoop is the foundation of the big data ecosystem, and the distributed data processing systems that have grown up around HDFS and YARN keep making the Hadoop ecosystem more complete and more active. The basic principles and applications of Hadoop are well worth studying and thinking about for everyone who is getting started with big data. Thank you.
Source: official account Shence Technology Community