Big data training: the Hadoop ecosystem

Time:2022-5-27

Hadoop overview

Hadoop is a distributed computing framework. It lets you process large data sets across clusters of computers using a simple programming model. Hadoop is scalable: it can grow from a single server to thousands of servers, each providing local computation and storage. Rather than relying solely on hardware for high availability, the software library itself protects data and handles failures at the application layer, so it can provide a highly available service on top of a cluster of computers. The core ecosystem components of Hadoop are shown in the figure.


Hadoop ecosystem

Hadoop includes the following four basic modules.

1) Hadoop Common: the basic function library, a set of common utilities that support the other Hadoop modules.

2) HDFS: a distributed file system that provides high-throughput access to application data.

3) YARN: a framework for job scheduling and cluster resource management.

4) MapReduce: a YARN-based system for parallel processing of large data sets.
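
The programming model behind the MapReduce module can be sketched in a few lines. The example below is a single-process illustration in plain Python, not the Hadoop API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, and on a real cluster the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # In this sketch every line is mapped in the same process; a cluster
    # would map each input split on the node that stores it.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = [
        "Hadoop stores data in HDFS",
        "YARN schedules jobs",
        "hadoop runs MapReduce jobs",
    ]
    print(word_count(sample))
```

The point of the split is that map_phase and reduce_phase only ever see one line or one key at a time, which is what allows a framework to spread them over many machines.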

In addition to the basic modules, Hadoop also includes the following related projects.

1) Ambari: a web-based tool for configuring, managing and monitoring Hadoop clusters. It supports HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard that displays cluster health, for example as heatmaps, and lets users view MapReduce, Pig and Hive applications graphically, so performance problems can be diagnosed in a user-friendly way.

2) Avro: a data serialization system.

3) Cassandra: a scalable multi-master NoSQL database with no single point of failure.

4) Chukwa: a data collection system for large distributed systems.

5) HBase: a scalable distributed database that supports structured data storage for large tables.

6) Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying.

7) Mahout: a scalable machine learning and data mining library.

8) Pig: a high-level data flow language and execution framework for parallel computing.

9) Spark: a general-purpose computing engine that can process Hadoop data at high speed. Spark provides a simple and expressive programming model that supports ETL, machine learning, stream processing, graph computation and other applications.

10) Tez: a general data flow programming framework built on YARN. It provides a powerful and flexible engine that can execute an arbitrary directed acyclic graph (DAG) of data processing tasks, and it supports both batch and interactive use cases. Tez has been adopted by Hadoop ecosystem components such as Hive and Pig to replace MapReduce as the underlying execution engine.

11) ZooKeeper: a high-performance coordination service for distributed applications.
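
The DAG execution model described for Tez in item 10 can be illustrated with a small scheduler. This is a sketch under stated assumptions, not the Tez API: run_dag, sum_by_name, TASKS and DEPENDENCIES are hypothetical names, the tasks are ordinary Python functions, and everything runs in one process. It relies on graphlib from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_dag(tasks, dependencies):
    """Run every task after all of its upstream tasks have finished.

    tasks: dict of task name -> function taking a dict of upstream results
    dependencies: dict of task name -> set of upstream task names
    """
    # Make sure tasks without dependencies are part of the graph too.
    graph = {name: dependencies.get(name, set()) for name in tasks}
    results = {}
    order = []
    for name in TopologicalSorter(graph).static_order():
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](upstream)
        order.append(name)
    return order, results


def sum_by_name(rows):
    """Aggregate (name, amount) rows into a dict of totals per name."""
    totals = {}
    for name, amount in rows:
        totals[name] = totals.get(name, 0) + amount
    return totals


# Hypothetical four-stage job: two scans feed a join, which feeds an aggregate.
TASKS = {
    "scan_orders": lambda up: [("u1", 30), ("u2", 5), ("u1", 12)],
    "scan_users": lambda up: {"u1": "Ann", "u2": "Bo"},
    "join": lambda up: [
        (up["scan_users"][uid], amount) for uid, amount in up["scan_orders"]
    ],
    "aggregate": lambda up: sum_by_name(up["join"]),
}
DEPENDENCIES = {
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}

if __name__ == "__main__":
    order, results = run_dag(TASKS, DEPENDENCIES)
    print(order)
    print(results["aggregate"])
```

Expressing the job as one graph lets an engine pass intermediate results directly from stage to stage, instead of chaining several separate map and reduce jobs.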

In addition to the officially recognized Hadoop ecosystem components above, there are many other excellent components outside that list. They are also widely used, for example Presto, Impala and Kylin, which were built to speed up Hive-style queries.

In addition, a group of “partners” has grown up around the Hadoop ecosystem. Although they are not deeply integrated into it, they are closely tied to Hadoop and play an irreplaceable role in their respective fields. The following figure shows the Hadoop ecosystem components integrated by the Alibaba Cloud E-MapReduce platform, a more powerful combination than the one provided by Apache.

The following is a brief introduction to the more important members.

1) Presto: an open source distributed SQL query engine suited to interactive analytic queries over data volumes from gigabytes to petabytes. Presto can query multiple data sources. It is an MPP query engine that computes in memory.

2) Kudu: a columnar distributed storage engine similar to HBase that supports fast updates and deletes. It is a big data storage engine that supports both random reads and writes and OLAP analysis.

3) Impala: an efficient, fast query engine with an MPP architecture. It builds on Hive (it reuses Hive's metadata), computes in memory, and can also handle ETL work. Its advantages include real-time querying, batch processing and high concurrency.

4) Kylin: an open source distributed analytical data warehouse that provides a SQL query interface and multidimensional analysis (OLAP) on Hadoop and Spark, and supports sub-second queries on extremely large data sets.

5) Flink: a distributed processing engine for streaming and batch data with high throughput and low latency. It is a rising star in the field of real-time processing.

6) Hudi: an open source data lake solution originally developed at Uber. Hudi (Hadoop Upserts Deletes and Incrementals) supports modification and incremental update of data stored on HDFS.
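
The idea behind item 6, updating records in place by key and then reading only what changed, can be shown with a toy in-memory table. This is a conceptual sketch and not the Hudi API: the class VersionedTable and its methods are hypothetical, and a real data lake table stores versioned files on HDFS or object storage rather than a Python dict.

```python
class VersionedTable:
    """Toy keyed table in which every write is stamped with a commit number."""

    def __init__(self):
        self._rows = {}   # record key -> (commit number, row)
        self._commit = 0  # last commit number handed out

    def upsert(self, records, key="id"):
        """Insert new keys and overwrite existing keys in one commit."""
        self._commit += 1
        for record in records:
            self._rows[record[key]] = (self._commit, dict(record))
        return self._commit

    def snapshot(self):
        """Return the latest version of every row."""
        return [row for _, row in self._rows.values()]

    def incremental(self, since_commit):
        """Return only the rows written after the given commit."""
        return [
            row for commit, row in self._rows.values() if commit > since_commit
        ]


if __name__ == "__main__":
    table = VersionedTable()
    first = table.upsert([{"id": 1, "city": "Hangzhou"}, {"id": 2, "city": "Beijing"}])
    table.upsert([{"id": 2, "city": "Shanghai"}, {"id": 3, "city": "Shenzhen"}])
    print(table.snapshot())          # latest state of all three rows
    print(table.incremental(first))  # only the rows changed after the first commit
```

A downstream job that remembers the last commit it processed can call incremental with that number and avoid rescanning the whole table.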

Advantages and disadvantages of Hadoop

Today, Hadoop has evolved into an ecosystem. Its components differ greatly in maturity: some are still incubating, some are in full bloom, and some are aging. The most enduring components are HDFS and Hive. The short-lived ones include HBase, MapReduce and Presto. Spark and Flink are in their prime.

As the old saying goes, “Xiao He made him, and Xiao He undid him”: the cause of a success can also be the cause of its failure. Open source is the core reason big data succeeded, and it is also its biggest problem. Many components mature quickly thanks to open source, but once they mature, the ecosystem becomes disordered and versions fragment. Hive is the most typical example.

Hive versions before 1.x were functionally incomplete. Versions 1.x and 2.x were gradually optimized until they were basically usable. Version 3.x has various problems, and most cloud platforms still ship Hive 2.x, so adoption of the new version is weak. Hive's choice of computing engine is also contentious. Hive is mainly used with MapReduce, Tez, Spark and Presto. MapReduce's computing speed has not improved over the past decade. Tez is fast, but installing it requires custom compilation and deployment. Spark is the fastest, but its JDBC support is poor. Presto is fast and supports JDBC, but its syntax is inconsistent with Hive's. To be clear, “fast” here is only relative to the MapReduce engine; these engines are still one to two orders of magnitude slower than traditional databases.

Generally speaking, a big data platform built on Hadoop has the following characteristics.

1) Scalability: it can reliably store and process petabyte-scale data. The Hadoop ecosystem generally uses HDFS as its storage component, which offers high throughput, stability and reliability.

2) Low cost: data can be distributed and processed on clusters of cheap commodity machines. These clusters can scale to thousands of nodes.

3) High efficiency: because data is distributed, Hadoop can process it in parallel on the nodes where it is stored, which makes processing very fast.

4) Reliability: Hadoop automatically maintains multiple copies of the data and automatically reschedules computing tasks after a task fails.
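
The reliability point can be made concrete with a small model of block replication. The sketch below assumes the HDFS defaults of a 128 MB block size and a replication factor of 3. The round-robin placement and the function names (place_blocks, readable_after_failure) are simplifications invented for illustration; real HDFS placement is rack-aware.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size in Hadoop 2.x and later
REPLICATION = 3      # HDFS default replication factor


def place_blocks(file_size_mb, nodes, block_size_mb=BLOCK_SIZE_MB,
                 replication=REPLICATION):
    """Split a file into blocks and put each block on `replication` nodes.

    Round-robin placement is an assumption of this sketch, chosen only
    because it guarantees that the replicas of a block sit on distinct nodes.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """A file stays readable while every block has at least one live replica."""
    failed = set(failed_nodes)
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


if __name__ == "__main__":
    cluster = ["n1", "n2", "n3", "n4", "n5"]
    placement = place_blocks(1000, cluster)
    print(len(placement), "blocks")
    print(readable_after_failure(placement, ["n1", "n2"]))
    print(readable_after_failure(placement, ["n1", "n2", "n3"]))
```

With three replicas per block, any two node failures leave at least one copy of every block, which is why cheap machines can still add up to a reliable store. The same placement map also tells a scheduler which nodes hold a block locally, which is the basis of the "high efficiency" point above.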

The Hadoop ecosystem also has many disadvantages.

1) Hadoop stores data in a file system, so reads and writes are not timely. So far, no component supports both fast updates and efficient queries.

2) The Hadoop ecosystem keeps growing more complex, compatibility between components is poor, and installation and maintenance are difficult.

3) Each Hadoop component serves a fairly narrow purpose, with obvious strengths and weaknesses.

4) The cloud ecosystem has had a very visible impact on Hadoop. Cloud vendors each build their own customized components, which further widens version differences and prevents the community from pulling in one direction.

5) The whole ecosystem is developed in Java, with poor fault tolerance, low availability, and components that crash easily.

Source: shucang baby Library