How to learn Hadoop


How to learn Hadoop, how to learn Hadoop, little partners interested in big data development can learn about it with Xiaobian.
Hadoop implements a Hadoop distributed file system (HDFS). HDFS has the characteristics of high fault tolerance and is designed to be deployed on low-cost hardware; Moreover, it provides high throughput to access application data, which is suitable for applications with large data sets. HDFS relaxes the requirements of POSIX and can stream access the data in the file system.
The core design of Hadoop framework is HDFS and MapReduce. HDFS provides storage for massive data, and MapReduce provides calculation for massive data. In a word, Hadoop is storage plus computing.
Hadoop is a distributed computing platform that can be easily constructed and used by users. Users can easily develop and run applications that handle massive amounts of data on Hadoop. It has the following advantages:
1. The ability of Hadoop to store and process data by bit with high reliability is trustworthy.
2. High scalability Hadoop allocates data among available computer clusters and completes computing tasks. These clusters can be easily extended to thousands of nodes.
3. High efficiency Hadoop can dynamically move data between nodes and ensure the dynamic balance of each node, so the processing speed is very fast.
4. Hadoop with high fault tolerance can automatically save multiple copies of data and automatically reallocate failed tasks.
5. Low cost compared with all-in-one computers, commercial data warehouses, qlikview, Yonghong z-suite and other data marts, Hadoop is open source, so the software cost of the project will be greatly reduced.
Hadoop has a framework written in the Java language, so it is ideal to run on the Linux production platform. Applications on Hadoop can also be written in other languages, such as c++.
Significance of Hadoop big data processing:
Hadoop can be widely used in big data processing applications thanks to its natural advantages in data extraction, deformation and loading (ETL). The distributed architecture of Hadoop makes the big data processing engine as close to storage as possible. It is relatively suitable for batch operations such as ETL, because batch results of such operations can be directly stored. The MapReduce function of Hadoop breaks a single task, sends the fragmented task (map) to multiple nodes, and then loads (reduces) it into the data warehouse in the form of a single data set.
Hadoop consists of the following items:
1. Hadoop common: a module at the bottom of the Hadoop system, which provides various tools for Hadoop sub projects, such as configuration files and log operations.
2. HDFS: distributed file system, which provides high-throughput application data access. For external clients, HDFS is like a traditional hierarchical file system. You can create, delete, move, or rename files, and so on. However, the architecture of HDFS is based on a specific set of nodes, which is determined by its own characteristics. These nodes include namenode (only one), which provides metadata services within HDFS; Datanode, which provides storage blocks for HDFS.
Since there is only one namenode, this is a drawback of HDFS (single point of failure). Files stored in HDFS are partitioned and then copied to multiple computers (datanodes). This is very different from the traditional raid architecture. The size of the blocks (typically 64MB) and the number of blocks copied are determined by the client when the file is created. Namenode can control all file operations. All communication within HDFS is based on standard tcp/ip protocol.
3. MapReduce: a software framework set computing cluster for distributed massive data processing.
4. Avro: the RPC project hosted by Doug Cutting is mainly responsible for data serialization. It is similar to Google’s protobuf and Facebook’s thrift. Avro is used as the RPC of Hadoop in the future, which makes the communication speed of the RPC module of Hadoop faster and the data structure more compact.
5. Hive: similar to cloudbase, it is also a set of software based on Hadoop distributed computing platform that provides the SQL function of data warehouse. It simplifies the summary of massive data stored in Hadoop and ad hoc query. Hive provides a set of QL query language, which is based on SQL and is very convenient to use.
6. HBase: Based on Hadoop distributed file system, it is an open source, extensible distributed database based on column storage model, and supports the storage of structured data in large tables.
7. Pig: it is an advanced data flow language and execution framework for parallel computing. SQL like language is an advanced query language built on MapReduce. It compiles some operations into map and reduce of MapReduce model, and users can define their own functions.
8. Zookeeper: an open source implementation of Chubby in Google. It is a reliable coordination system for large-scale distributed systems. Its functions include configuration maintenance, name service, distributed synchronization, group service, etc. The goal of zookeeper is to package complex and error prone key services, and provide users with simple and easy-to-use interfaces and systems with efficient performance and stable functions.
9. Chukwa: a data acquisition system for managing large-scale distributed systems is contributed by Yahoo.
10. Cassandra: a scalable multi master database without single point of failure.
11. Mahout: an extensible machine learning and data mining library.
At the beginning of Hadoop design, the goal is to locate high reliability, high scalability, high fault tolerance and high efficiency. It is these inherent advantages of design that make Hadoop popular with many large companies as soon as it appears, and also cause widespread concern in the research community. So far, Hadoop technology has been widely used in the Internet field, such as Yahoo, Facebook, adobe, IBM, Baidu, Alibaba, Tencent, Huawei, China Mobile, etc.
As for how to learn Hadoop, we must first understand and deeply understand what is Hadoop, its principle and function, including what the basic composition is and what the functions are. Of course, before learning, we must master at least one basic language, so that we can get twice the result with half the effort.