Common technology stack of big data


When it comes to big data, we have to mention the 5V characteristics proposed by IBM: Volume (large amount of data), Velocity (high speed), Variety (diverse types), Value (low value density), and Veracity (authenticity). The daily work of big data practitioners revolves closely around these 5Vs. Big data technology has developed rapidly over the past two decades; Hadoop and Spark in particular have built huge technology ecosystems.

First, let’s look at some commonly used technologies in the big data field through a chart. Of course, big data involves more technologies than these.

BigData Stack:

[Figure: common big data technology stack]

The following sections introduce each technology layer by layer. Note that the layers are not strictly divided: Hive, for example, provides both data processing and data storage functionality, but it is classified here under the data analysis layer.

1. Data acquisition and transmission layer

  • Flume
    Flume is a distributed, reliable, and highly available system for data collection, aggregation, and transmission. It is commonly used in log collection systems. It supports customizing various data senders to collect data, lightly preprocessing the data with user-defined interceptors, and delivering it to various receivers such as HDFS, HBase, and Kafka. It was originally developed by Cloudera and later donated to Apache.
  • Logstash
    A member of the ELK stack (Elasticsearch, Logstash, Kibana), also commonly used for data collection. It is an open-source, server-side data processing pipeline.
  • Sqoop
    Sqoop is a tool for importing and exporting data through a set of commands. Its underlying engine relies on MapReduce, and it is mainly used to transfer data between Hadoop (e.g., HDFS, Hive, HBase) and relational databases (e.g., MySQL, Oracle).
  • Kafka
    A distributed messaging system built on the producer-consumer model. It provides features similar to JMS, but its design is completely different and it does not follow the JMS specification. For example, Kafka lets multiple consumers actively pull data, whereas in JMS only consumers in point-to-point mode pull data. It is mainly used for data buffering, asynchronous communication, data collection, and decoupling systems.
  • Pulsar
    A distributed pub-sub messaging platform with a flexible messaging model and an intuitive client API. It is similar to Kafka, but Pulsar supports multi-tenancy and has the concepts of tenant and namespace. A tenant represents an organizational unit in the system: if a Pulsar cluster supports multiple applications, each tenant in the cluster can represent a team, a core feature, or a product line. A tenant can contain multiple namespaces, and a namespace can contain any number of topics.
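
The pull model mentioned under Kafka above can be sketched with a toy in-memory log (plain Python, not the Kafka client API): the broker keeps an append-only log per partition, and each consumer tracks its own offset and pulls at its own pace, so consumers do not interfere with one another.

```python
class Log:
    """An append-only topic log, like a single Kafka partition."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

class Consumer:
    """Each consumer keeps its own offset and pulls batches on demand."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_records=10):
        batch = self.log.records[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

log = Log()
for i in range(3):
    log.append(f"event-{i}")

# Two independent consumers pull the same data without interfering.
c1, c2 = Consumer(log), Consumer(log)
print(c1.poll())   # ['event-0', 'event-1', 'event-2']
print(c2.poll())   # ['event-0', 'event-1', 'event-2']
```

In a push model the broker would decide the pace; here each consumer controls it, which is the design point the JMS comparison above is making.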

2. Data storage layer

  • HBase
    An open-source implementation based on Google’s BigTable: a highly reliable, high-performance, column-oriented, scalable NoSQL database with a typical key/value distributed storage model. It is mainly used to store massive amounts of structured and semi-structured data and sits between NoSQL and RDBMS. Data can only be retrieved by row key or by a row-key range. Writes to a single row are atomic, and only single-row transactions are supported (complex operations such as multi-table joins can be achieved through Hive). HBase’s query capability is very simple: it supports neither joins and other complex operations nor cross-row or cross-table transactions.
  • Kudu
    A distributed, column-oriented storage engine that sits between HDFS and HBase: it offers HBase-like real-time access, HDFS-like high throughput, and SQL support (typically via engines such as Impala) like a traditional database.
  • HDFS
    A distributed file storage system with high fault tolerance, high throughput, and high availability. HDFS is well suited to applications with large-scale datasets: it provides high-throughput data access and can be deployed on inexpensive machines. It relaxes some POSIX requirements so that file system data can be accessed as a stream. It mainly provides massive data storage for distributed computing frameworks such as Spark and MapReduce, and it is also the underlying storage for HBase.
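
The row-key-only access model described under HBase above can be illustrated with a toy table (plain Python, not the HBase client API): rows are kept sorted by row key, and a read is either a single get by key or a scan over a key range.

```python
import bisect

class Table:
    """Rows kept sorted by row key, as within an HBase region."""
    def __init__(self):
        self.keys = []      # sorted row keys
        self.rows = {}      # row key -> {column: value}

    def put(self, key, row):
        if key not in self.rows:
            bisect.insort(self.keys, key)
        self.rows[key] = row

    def get(self, key):
        return self.rows.get(key)

    def scan(self, start, stop):
        """Half-open key range [start, stop), like an HBase Scan."""
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, stop)
        return [(k, self.rows[k]) for k in self.keys[lo:hi]]

t = Table()
t.put("user#001", {"info:name": "ann"})
t.put("user#002", {"info:name": "bob"})
t.put("user#010", {"info:name": "eve"})
print(t.get("user#002"))                               # {'info:name': 'bob'}
print([k for k, _ in t.scan("user#001", "user#003")])  # ['user#001', 'user#002']
```

This is why row-key design matters so much in HBase: any query that cannot be expressed as a key lookup or key range turns into a full scan.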
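
HDFS’s storage model can be sketched in a few lines of plain Python: a file is chopped into fixed-size blocks, and each block is replicated across DataNodes. The 128 MB default and the round-robin placement below are simplifications; real HDFS placement is rack-aware.

```python
def split_into_blocks(size_bytes, block_size=128 * 1024 * 1024):
    """Split a file of `size_bytes` into fixed-size blocks, HDFS-style."""
    blocks, offset = [], 0
    while offset < size_bytes:
        length = min(block_size, size_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=3):
    """Naive round-robin placement of `replication` copies per block
    (real HDFS uses a rack-aware placement policy)."""
    placement = {}
    for i, (offset, _) in enumerate(blocks):
        placement[offset] = [datanodes[(i + r) % len(datanodes)]
                             for r in range(replication)]
    return placement

blocks = split_into_blocks(300 * 1024 * 1024)   # a 300 MB file
print(len(blocks))                              # 3 blocks: 128 + 128 + 44 MB
print(place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"]))
```

Replication is what gives HDFS its fault tolerance: losing one machine loses at most one copy of each block stored there.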

3. Data analysis layer

  • Spark
    Spark is a fast, general-purpose, scalable, fault-tolerant big data analysis engine built on in-memory iterative computing. Its ecosystem currently includes Spark RDD and Spark SQL for batch processing, Spark Streaming and Structured Streaming for stream processing, Spark MLlib for machine learning, GraphX for graph computing, and SparkR for statistical analysis. It provides APIs in Java, Scala, Python, and R.
  • Flink
    A distributed big data processing engine that can run computations over both bounded and unbounded data streams. Flink was designed from the start as a stream processor and later moved into batch processing. Compared with Spark (whose streaming is micro-batch based), it is a true real-time computing engine.
  • Storm
    A distributed real-time computing system under Apache. Storm is a data stream processing engine without batch processing capability. It provides a low-level API, so users must implement much of the complex logic themselves.
  • MapReduce
    A programming framework for distributed computing, suitable for offline (batch) data processing scenarios. Internally, processing is divided into two phases: Map and Reduce.
  • Hive
    Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides HQL (a SQL-like language) for querying; its storage depends on HDFS. It supports multiple execution engines, such as MapReduce (the default), Tez, and Spark; multiple storage formats, such as TextFile, SequenceFile, RCFile, ORC, and Parquet (commonly used); and multiple compression formats, such as Gzip, LZO, Snappy (commonly used), and Bzip2.
  • Tez
    An open-source computing framework that supports DAG jobs. It performs better than MapReduce mainly because it describes a job as a DAG (directed acyclic graph), similar to Spark.
  • Pig
    A Hadoop-based platform for large-scale data analysis. It includes a scripting language, Pig Latin, for describing data flows, plus a parallel data flow execution engine, providing a simple operational and programming interface for complex parallel computation over massive data. Pig Latin provides many traditional data operations and allows users to develop custom functions to read, process, and write data. The Pig compiler translates these SQL-like data analysis requests into a series of optimized MapReduce jobs.
  • Mahout
    Mahout includes implementations of many machine learning algorithms, such as clustering, classification, collaborative filtering (recommendation), and frequent itemset mining. Mahout can scale out to the cloud by building on Apache Hadoop.
  • Phoenix
    A SQL layer built on top of HBase that lets us manipulate HBase data through the standard JDBC API. Phoenix is written entirely in Java and is embedded in HBase as a JDBC driver. Its query engine transforms SQL queries into one or more HBase scans and orchestrates their execution to produce standard JDBC result sets.
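
The two-phase flow described under MapReduce above can be sketched as a local word count in plain Python, with the shuffle step (which the framework normally performs between the two phases) made explicit:

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

lines = ["big data big compute", "big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 3, 'data': 2, 'compute': 1}
```

In a real job, map tasks run in parallel across HDFS blocks, and the shuffle moves each key’s values over the network to the reducer responsible for it.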

4. OLAP engine

  • Druid
    An open-source, column-oriented, distributed storage system suitable for real-time data analysis. It supports fast aggregation, flexible filtering, millisecond-level queries, and low-latency data ingestion. It speeds up columnar queries with bitmap indexes and compresses them with the Concise algorithm, so the generated segments are much smaller than the raw text files. Its components are loosely coupled; if real-time data is not needed, the real-time nodes can simply be omitted.
  • Kylin
    A distributed analysis engine initially developed by eBay Inc. and contributed to the open-source community. It provides a SQL query interface and OLAP capability over Hadoop/Spark to support extremely large datasets, and can query huge Hive tables in sub-second time. Users need a deep understanding of data warehouse modeling and must build cubes in advance. It works with many visualization tools, such as Tableau and Power BI, so users can analyze Hadoop data with BI tools.
  • Impala
    A big data query and analysis engine open-sourced by Cloudera. It provides high-performance, low-latency interactive SQL queries over data in HDFS, HBase, and other stores. It builds on Hive, reusing Hive’s metadata, and computes in memory, giving it real-time responsiveness and high concurrency.
  • Presto
    An open-source distributed SQL query engine for big data, suitable for interactive analytic queries. It can combine data from multiple data sources and read data directly from HDFS, without heavy ETL before use.
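
The bitmap indexing mentioned under Druid above can be illustrated in a few lines of plain Python, using integers as uncompressed bitmaps (Druid additionally compresses them, e.g. with the Concise algorithm):

```python
# One bitmap per distinct value of a column: bit i set = row i matches.
rows = ["US", "CN", "US", "DE", "CN", "US"]

bitmaps = {}
for i, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << i)

def matching_rows(bitmap):
    """Decode a bitmap back into the list of matching row numbers."""
    return [i for i in range(len(rows)) if bitmap >> i & 1]

# Filters become cheap bitwise operations instead of row scans:
us_or_de = bitmaps["US"] | bitmaps["DE"]
print(matching_rows(us_or_de))   # [0, 2, 3, 5]
```

AND/OR over bitmaps is what makes Druid’s flexible filtering fast: combining predicates costs a few word-wide bit operations per block of rows.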

5. Resource management

  • Yarn
    Yarn is a resource scheduling platform responsible for allocating resources to computing programs and scheduling them; it does not participate in the internal work of user programs. Its core components include the ResourceManager (the global resource manager) and the NodeManager (the per-node resource and task manager).
  • Kubernetes
    Kubernetes, also known as k8s, is an open-source platform for automating container operations. It provides resource scheduling, deployment and operations, failover and disaster recovery, service registration, and scale-out/scale-in for containerized applications. Concretely, it automates the deployment and replication of containers, scales the number of containers up or down at any time, organizes containers into groups, and provides load balancing across containers. Kubernetes supports Docker and rkt; Docker can be regarded as a low-level component used inside Kubernetes.
  • Mesos
    Similar to Yarn, Mesos is a distributed resource management platform that runs MPI and Spark jobs in a unified resource management environment. It supports Hadoop 2.0 well, but it is not widely used in China.

6. Workflow scheduler

  • Oozie
    A task scheduling framework based on a workflow engine; it can schedule and coordinate MapReduce jobs, Pig jobs, and more.
  • Azkaban
    Open-sourced by LinkedIn and lighter-weight than Oozie, Azkaban runs a group of tasks in a specific order within a workflow. It uses a key-value file format to define dependencies between tasks and provides an easy-to-use web interface for users to maintain and track running workflows.

7. Others

  • Ambari
    A web-based installation and deployment tool that supports managing and monitoring most Hadoop components, such as HDFS, MapReduce, Hive, Pig, and HBase.
  • Zookeeper
    A distributed coordination service that provides coordination for distributed applications: master-slave coordination, dynamic joining and leaving of server nodes, unified configuration management, distributed shared locks, and so on. ZooKeeper is itself a distributed program (deployed on an odd number of machines; as long as more than half of the nodes survive, the cluster can serve normally). It is an open-source implementation of Google’s Chubby.
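
The majority rule in the parenthesis above is easy to check in plain Python, and it also shows why odd-sized ensembles are recommended:

```python
def can_serve(cluster_size, alive):
    """A ZooKeeper ensemble serves requests only while a strict
    majority of its nodes is alive."""
    return alive > cluster_size // 2

# 4 nodes tolerate no more failures than 3, so the extra node buys nothing.
for size in (3, 4, 5):
    tolerated = max(f for f in range(size + 1) if can_serve(size, size - f))
    print(size, "nodes tolerate", tolerated, "failure(s)")
# 3 nodes tolerate 1 failure(s)
# 4 nodes tolerate 1 failure(s)
# 5 nodes tolerate 2 failure(s)
```

This is the standard quorum argument: any two majorities intersect, which is what keeps the surviving nodes consistent.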