How can a single big data cluster scale beyond 20,000 nodes?


Summary: Users need fused analysis across multiple scenarios, so the cluster cannot be split: separating the data analysis workloads would break the associations between business modules. Huawei therefore set out to run a single cluster at the 20,000-node scale.

On July 9, at the Big Data Industry Summit results conference, the China Academy of Information and Communications Technology (CAICT) issued certificates for products that passed its big data product capability evaluation. Huawei Cloud FusionInsight MRS passed with full marks on every test item and broke through the 20,000-node mark in a single cluster, setting a new industry benchmark.


To keep pace with the rapid development of 5G and IoT, big data technology continues to build on its foundation of distributed batch processing. FusionInsight MRS, Huawei's big data product based on the Hadoop ecosystem, has long explored how much a single large cluster can carry, so that Huawei's self-developed big data products can keep up smoothly as data grows exponentially. As digital transformation accelerates, data volumes have surged faster than expected. At the same time, users need fused analysis across multiple scenarios, which means the cluster cannot be split and the analysis workloads cannot be separated without losing the correlation between business modules. Huawei's big data R&D team therefore began exploring a single cluster of 20,000 nodes.

Technical pain points of large-scale clusters

In a distributed system, problems that are simple on a small cluster become extremely complex as the cluster grows. With enough nodes, even a simple heartbeat mechanism can overwhelm the master node. A FusionInsight MRS cluster of 20,000 nodes faces many challenges:

1. How to schedule mixed batch, stream and interactive workloads efficiently in multi-tenant scenarios, scale cluster size and processing capacity linearly, and let engines reuse each other's resources by staggering peaks and troughs

A large cluster solves the problem of storing data centrally, but stored data alone produces no value; value comes only from extensive analysis. Generating fixed reports through batch jobs is a common use of a big data platform, but using hundreds of petabytes of data only for batch runs wastes both the data and massive computing resources. Time is money and time is efficiency: ingesting data into the lake at T+0 with real-time updates continuously shortens the time it takes to realize the value of data. To maximize the value of the platform, a large cluster should support T+0 real-time ingestion into the lake, batch analysis of the full data set, and interactive exploration by data analysts. An important problem for the scheduling system in a large cluster is therefore how to serve real-time T+0 ingestion and batch analysis while also meeting the ad hoc query needs of a large number of analysts, so that computing resources are both isolated and shared.

2. How to meet new challenges in storage, computing and management, and break through the bottlenecks of multiple components

Computing: As the cluster grows, YARN's ResourceManager has more resources to schedule and more tasks running in parallel, which places higher demands on the central scheduling process. If the scheduling speed cannot keep up, jobs pile up at the entrance to the cluster and its computing resources cannot be used effectively.
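The backlog effect can be illustrated with a minimal sketch. It assumes a constant arrival rate and a constant scheduling rate, and the function name and the numbers are hypothetical, chosen only for illustration.

```python
def backlog_after(arrival_per_sec: int, scheduled_per_sec: int, seconds: int) -> int:
    """Simulate, second by second, how many containers wait at the cluster entrance."""
    backlog = 0
    for _ in range(seconds):
        # New requests arrive, the scheduler drains what it can, the rest waits.
        backlog = max(0, backlog + arrival_per_sec - scheduled_per_sec)
    return backlog


# A scheduler that is 200 containers/s too slow accumulates work every second.
print(backlog_after(arrival_per_sec=1_000, scheduled_per_sec=800, seconds=60))  # 12000
```

Once the scheduling rate exceeds the arrival rate, the backlog in this model stays at zero, which is why scheduler throughput becomes the limiting factor as the cluster grows.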

Storage: As storage capacity grows, HDFS must manage more file objects, and the volume of HDFS NameNode metadata grows accordingly. The community provides the NameNode Federation mechanism, but the application layer must be aware of the namespaces of the different NameNodes, which makes use and maintenance extremely complex. Data also tends to be unevenly distributed across namespaces. Meanwhile, as data volume grows, Hive metadata increases sharply and puts great pressure on the metadata database. SQL statements can easily pile up at the metadata query step and block.
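To make the namespace-awareness burden concrete, the following sketch shows the kind of path-to-namespace lookup an application needs under federation. The mount table, the namespace names (ns1, ns2, ns3) and the function are hypothetical; this is a simplified illustration, not the HDFS client API.

```python
from typing import Dict

# Hypothetical mount table: which namespace owns which directory subtree.
MOUNT_TABLE: Dict[str, str] = {
    "/user": "ns1",
    "/warehouse": "ns2",
    "/warehouse/logs": "ns3",
}


def resolve_namespace(path: str, mounts: Dict[str, str]) -> str:
    """Return the namespace whose mount point is the longest prefix of path."""
    best = None
    for prefix in mounts:
        # Match whole path components only, so "/userdata" does not match "/user".
        if path == prefix or path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best):
                best = prefix
    if best is None:
        raise KeyError(f"no namespace mounted for {path}")
    return mounts[best]


print(resolve_namespace("/warehouse/logs/2020/07", MOUNT_TABLE))  # ns3
```

Every application that reads or writes files has to carry, and keep up to date, a mapping of this kind, which is the maintenance cost the paragraph above describes.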

Operation and maintenance management: Beyond the computing and storage bottlenecks, the platform's operation and maintenance capability also hits bottlenecks as the scale grows. In the monitoring system, for example, when the cluster grows from 5,000 to 20,000 nodes, the number of monitoring metrics processed per second rises from 600,000 to more than 2 million.
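As a rough back-of-the-envelope check on these figures, the sketch below assumes the metric rate grows linearly with the node count. That linearity is an assumption; the text gives only the two endpoints.

```python
# Figures quoted in the text for the smaller cluster.
NODES_BEFORE = 5_000
METRICS_PER_SEC_BEFORE = 600_000


def metrics_per_second(nodes: int) -> int:
    """Scale the quoted 5,000-node metric rate linearly to another cluster size."""
    per_node = METRICS_PER_SEC_BEFORE // NODES_BEFORE  # 120 metrics/s per node
    return per_node * nodes


print(metrics_per_second(20_000))  # 2400000
```

Under this assumption a 20,000-node cluster produces about 2.4 million metrics per second, which is consistent with the "more than 2 million" quoted above.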

3. How to improve the reliability and the operation and maintenance capability of large-scale clusters so that they never stop serving

Platform reliability has always been the focus of the platform operation and maintenance department. When a cluster handles the unified processing and analysis of all the data of an entire group, it must stay online 24 hours a day, yet technology keeps developing. The platform must support subsequent updates and upgrades so that the cluster can continue to evolve.

In addition, as the cluster grows, insufficient equipment room space becomes a prominent problem. Simply deploying a large cluster across equipment rooms creates major challenges for bandwidth load and reliability. Achieving equipment-room-level reliability is therefore also very important for a large-scale cluster.

Optimization practice for ultra-large-scale clusters

To address these challenges, FusionInsight MRS was systematically optimized in version 3.0. Scaling from 500 to 5,000 nodes in earlier years relied mainly on code-level optimization, but scaling from 5,000 to 20,000 nodes cannot be achieved by code-level optimization alone; many problems require optimization at the architecture level.

1. The self-developed Superior scheduler solves scheduling efficiency at ultra-large scale and mixed workloads in multi-tenant scenarios

FusionInsight introduces a data virtualization engine that provides interactive query on a unified large cluster and addresses query efficiency for analysts. To support diverse workloads on the same ultra-large cluster, the self-developed Superior scheduler allocates both reserved resources and shared resources to tenants, so that tenants keep their entitlement to reserved resources while still being able to share resources. For more important businesses, a fixed resource pool can be bound to a tenant, dedicating a batch of fixed machines to it for physical isolation. With the computing engines and the scheduling engine working together, the business loop is closed on one large platform without data leaving the lake.
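The combination of reserved and shared resources can be sketched as follows. This is a simplified illustration, not the Superior scheduler's actual algorithm: the tenant names, the fixed-size shared pool and the first-come order over sorted tenant names are all assumptions made for the example.

```python
from typing import Dict


def allocate(reserved: Dict[str, int], shared_pool: int,
             demand: Dict[str, int]) -> Dict[str, int]:
    """Grant each tenant its reserved share first, then draw on the shared pool."""
    grants: Dict[str, int] = {}
    remaining = shared_pool
    for tenant in sorted(demand):  # deterministic order, chosen only for the example
        # Reserved resources are always available to their owner.
        from_reserved = min(demand[tenant], reserved.get(tenant, 0))
        # Demand beyond the reservation competes for what is left in the shared pool.
        extra = min(demand[tenant] - from_reserved, remaining)
        remaining -= extra
        grants[tenant] = from_reserved + extra
    return grants


grants = allocate(
    reserved={"batch": 40, "stream": 20},
    shared_pool=30,
    demand={"batch": 60, "stream": 10, "adhoc": 25},
)
print(grants)  # {'adhoc': 25, 'batch': 45, 'stream': 10}
```

In this run the "stream" tenant is fully served from its reservation, "adhoc" has no reservation and lives entirely on shared resources, and "batch" receives its reservation plus whatever remains in the shared pool.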

In terms of multi-tenant capability, as the number of tenants grows, resource isolation between tenants has become a core user demand. The Hadoop community provides queue-based isolation of computing resources and quota-based limits on storage resources. However, tasks or read and write operations assigned to the same host still compete for resources. For this scenario, MRS provides the following finer-grained isolation methods:

  • Label storage: DataNodes that host storage resources are labeled, and a label is specified when a file is written, which isolates storage resources as far as possible. This feature works well for hot and cold data storage and for heterogeneous hardware.
  • Multi-service: Multiple services of the same kind are deployed on different hosts in the same cluster. Different applications use their own service resources as needed and do not interfere with each other.
  • Multi-instance: Multiple instances of the same service are deployed independently on the same host in the same cluster to make full use of host resources, and each instance's resources are not shared with other service instances. Examples include HBase multi-instance, Elasticsearch multi-instance and Redis multi-instance.
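Label storage, the first method in the list, can be sketched as a filter over labeled DataNodes. The node names, the labels ("hot", "cold") and the function are hypothetical, and real label policies are richer than this single-label match.

```python
from typing import Dict, List


def pick_datanodes(datanodes: Dict[str, str], label: str, replicas: int) -> List[str]:
    """Choose replica targets only among DataNodes that carry the requested label."""
    candidates = sorted(node for node, node_label in datanodes.items()
                        if node_label == label)
    if len(candidates) < replicas:
        raise ValueError(f"only {len(candidates)} DataNodes labeled {label!r}")
    return candidates[:replicas]


# Hypothetical cluster: fast disks labeled "hot", dense disks labeled "cold".
DATANODES = {"dn1": "hot", "dn2": "cold", "dn3": "hot", "dn4": "hot", "dn5": "cold"}

print(pick_datanodes(DATANODES, "hot", replicas=2))  # ['dn1', 'dn3']
```

Because a file written with the "hot" label can only land on "hot" nodes, its I/O never competes with data placed under another label.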

2. Tackle technical difficulties and break through bottlenecks in computing, storage, management and other aspects

For computing task scheduling, a patented scheduling algorithm converts one-dimensional scheduling into two-dimensional scheduling, which makes it several times more efficient than the open-source scheduler. In a real large-scale production cluster, the self-developed Superior scheduler compares with the open-source Capacity scheduler as follows:

  • With synchronous scheduling, Superior is 30 times faster than Capacity
  • With asynchronous scheduling, Superior is twice as fast as Capacity

Through in-depth optimization for 20,000-node clusters, the Superior scheduler in FusionInsight MRS 3.0 reaches a scheduling rate of 350,000 containers per second, well above what users expect of a large-scale cluster. Cluster resource utilization exceeds 98%, nearly double that of the open-source Capacity scheduler, which lays a solid foundation for stable business on large-scale clusters.

The following figures show the "resource utilization" monitoring view under Superior and Capacity respectively: Superior reaches nearly 100% resource utilization, while resources under Capacity cannot be fully used.

Figure: Superior resource utilization

Figure: Capacity resource utilization

Storage: To relieve the HDFS bottleneck in file object management, the Hadoop community introduced Federation. However, introducing many separate namespaces directly increases the complexity of developing, managing and maintaining upper-layer applications. To address this, the community also introduced the Router-Based Federation feature, but because it adds a Router layer in front of the NameNodes, performance degrades.

To solve these problems, FusionInsight MRS optimized the solution as follows:

  • By identifying the key bottlenecks in a large production cluster, FusionInsight MRS merges the interactions within a single read or write flow and uses an improved compression algorithm for data communication, keeping the performance degradation within 4%.
  • To fix data imbalance between namespaces, FusionInsight MRS uses DataMovementTool to balance data between namespaces automatically, which greatly reduces cluster maintenance cost.
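The balancing idea in the second item can be sketched as a plan of moves from over-full namespaces to under-full ones. The article does not describe how DataMovementTool works internally, so the greedy strategy, the namespace names and the units below are assumptions for illustration only.

```python
from typing import Dict, List, Tuple


def plan_moves(usage: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Plan (source, destination, amount) moves that bring namespaces to the average."""
    target = sum(usage.values()) // len(usage)
    surplus = {ns: used - target for ns, used in usage.items() if used > target}
    deficit = {ns: target - used for ns, used in usage.items() if used < target}
    moves: List[Tuple[str, str, int]] = []
    for src in sorted(surplus):
        for dst in sorted(deficit):
            # Move as much as the source can give and the destination can take.
            amount = min(surplus[src], deficit[dst])
            if amount > 0:
                moves.append((src, dst, amount))
                surplus[src] -= amount
                deficit[dst] -= amount
    return moves


# Hypothetical usage in TB per namespace.
print(plan_moves({"ns1": 900, "ns2": 300, "ns3": 300}))
# [('ns1', 'ns2', 200), ('ns1', 'ns3', 200)]
```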

As data volume grows, Hive metadata also becomes a serious bottleneck when facing massive numbers of tables and partitions. The Hive community introduced the MetaStore cache solution, but it does not solve cache consistency across multiple MetaStore instances, so it cannot be used commercially on a large-scale cluster. FusionInsight MRS makes the MetaStore cache usable by introducing the distributed cache Redis as an alternative, combined with distributed locks, a cache blacklist and whitelist mechanism, cache life cycle management and other techniques.
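How a lock, a blacklist and a cache life cycle fit together can be sketched in a few lines. This is not the FusionInsight implementation: a plain dictionary stands in for Redis, a thread lock stands in for the distributed lock, and the class and parameter names are hypothetical.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional, Set, Tuple


class MetadataCache:
    """Toy shared metadata cache with a TTL, a blacklist and a lock."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float,
                 blacklist: Optional[Set[str]] = None,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader              # reads from the metadata database
        self._ttl = ttl_seconds            # cache life cycle
        self._blacklist = blacklist or set()
        self._clock = clock
        self._store: Dict[str, Tuple[Any, float]] = {}  # stands in for Redis
        self._lock = threading.Lock()      # stands in for a distributed lock

    def get(self, table: str) -> Any:
        if table in self._blacklist:
            return self._loader(table)     # blacklisted tables are never cached
        with self._lock:                   # one loader at a time keeps entries consistent
            entry = self._store.get(table)
            now = self._clock()
            if entry is not None and entry[1] > now:
                return entry[0]            # fresh cache hit
            value = self._loader(table)
            self._store[table] = (value, now + self._ttl)
            return value

    def invalidate(self, table: str) -> None:
        """Drop an entry after a DDL change so the next read reloads it."""
        with self._lock:
            self._store.pop(table, None)
```

Because every MetaStore instance would read and invalidate the same shared store under the same lock, they cannot hold diverging copies of a table's metadata, which is the consistency problem the paragraph above describes.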

In terms of operation and maintenance management, when the cluster grows to 20,000 nodes, the operation and maintenance pressure increases sharply:

  • The number of monitoring metrics the system must collect rises from more than 600,000 per second to more than 2 million per second
  • Concurrent alarm processing rises from 200 per second to 1,000 per second
  • The total number of configuration management items rises from 500,000 to more than 2 million

The monitoring, alarm, configuration and metadata storage modules in the original FusionInsight MRS architecture used a master-slave mode and were severely challenged by the soaring data volume. To solve this, the new version uses mature distributed components such as Flink, HBase, Hadoop and Elasticsearch to turn the original centralized master-slave mode into an elastic, scalable distributed mode. This solved the operation and maintenance management problems and laid the foundation for secondary value mining of operation and maintenance data.

3. Ensure continuous and stable operation of the platform through rolling upgrades and patches, task-level "breakpoint continuation", cross-AZ high availability and other deployment capabilities

Rolling upgrades and patches: FusionInsight has supported rolling upgrades since version 2.7, so that upgrades, patches and similar platform operations are imperceptible to the business. Over time, however, the community stopped supporting rolling upgrades in some cases, such as the major version upgrade from Hadoop 2 to Hadoop 3, which means many ultra-large clusters would have to stay on the old version. That is unacceptable to the business. Through compatibility handling of community interfaces, FusionInsight MRS achieved rolling upgrades across major Hadoop versions and completed a rolling upgrade of a cluster with more than 10,000 nodes in Q2 2020. Among FusionInsight customers, rolling upgrade is a required capability for clusters of more than 500 nodes.
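The general shape of a rolling upgrade, in which only a small batch of nodes is out of service at any time, can be sketched as follows. The batch size, the node names and the upgrade callback are hypothetical, and the sketch leaves out the health checks and interface compatibility handling a real upgrade needs.

```python
from typing import Callable, List


def rolling_batches(nodes: List[str], batch_size: int) -> List[List[str]]:
    """Split the node list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [nodes[i:i + batch_size] for i in range(0, len(nodes), batch_size)]


def rolling_upgrade(nodes: List[str], batch_size: int,
                    upgrade: Callable[[str], None]) -> List[str]:
    """Upgrade one batch at a time; nodes outside the batch keep serving."""
    upgraded: List[str] = []
    for batch in rolling_batches(nodes, batch_size):
        for node in batch:
            upgrade(node)
        upgraded.extend(batch)
    return upgraded


print(rolling_batches(["n1", "n2", "n3", "n4", "n5"], batch_size=2))
# [['n1', 'n2'], ['n3', 'n4'], ['n5']]
```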

Task-level "breakpoint continuation": Large-scale clusters continuously run some very large tasks, which often contain hundreds of thousands of containers and run for a long time. If an individual failure occurs midway, the whole task may need to be re-executed, wasting a large amount of computing resources.
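The idea of resuming from the point of failure rather than from the start can be sketched as follows. The article does not describe the actual mechanism, so the in-memory set of completed container IDs and the function name are assumptions for illustration.

```python
from typing import Callable, Iterable, Set


def run_with_resume(containers: Iterable[str], run: Callable[[str], None],
                    completed: Set[str]) -> Set[str]:
    """Run each container once; skip those already recorded as completed."""
    for container_id in containers:
        if container_id in completed:
            continue                   # finished in an earlier attempt, do not redo
        run(container_id)              # may raise if the container fails
        completed.add(container_id)    # record progress only after success
    return completed
```

If a failure interrupts the task, calling run_with_resume again with the same completed set re-executes only the containers that had not finished, so the work already done is not wasted.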

FusionInsight MRS provides a variety of mechanisms to ensure that tasks run reliably, such as:

  • For storage, an AZ-aware file placement policy puts a file and its replicas in different AZs. Read and write operations look for resources in the local AZ first, and cross-AZ read and write traffic occurs only in the extreme case of an AZ failure.
  • For computing, an AZ-aware task scheduling mechanism allocates all of a user's submitted task to a single AZ, which avoids consuming network resources between computing units of the same task in different AZs.

With this storage block placement policy and localized scheduling of computing tasks, a single cluster can also be highly available across AZs. When a single AZ fails, core data and computing tasks are not affected.
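The placement and read-preference behavior described above can be sketched as follows. The AZ names, the round-robin placement and the function names are assumptions for illustration, not the product's actual policy.

```python
from typing import FrozenSet, List


def place_replicas(azs: List[str], replicas: int) -> List[str]:
    """Assign replicas to AZs round-robin so that copies span different AZs."""
    return [azs[i % len(azs)] for i in range(replicas)]


def choose_read_az(replica_azs: List[str], local_az: str,
                   failed: FrozenSet[str] = frozenset()) -> str:
    """Prefer the reader's own AZ; cross AZs only when the local copy is unavailable."""
    healthy = [az for az in replica_azs if az not in failed]
    if not healthy:
        raise RuntimeError("no healthy replica available")
    return local_az if local_az in healthy else healthy[0]


replica_azs = place_replicas(["az1", "az2", "az3"], replicas=3)
print(choose_read_az(replica_azs, local_az="az2"))                            # az2
print(choose_read_az(replica_azs, local_az="az2", failed=frozenset({"az2"})))  # az1
```

In normal operation every read stays inside the reader's AZ, and only the failure case generates cross-AZ traffic, matching the behavior described in the list above.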


In July 2020, a single FusionInsight MRS cluster of 21,000 nodes received the big data product capability evaluation certificate issued by the China Academy of Information and Communications Technology (CAICT), making it the first commercial big data platform in the industry to exceed 20,000 nodes in a single cluster and setting a new industry benchmark. In the future, FusionInsight MRS will continue to deepen its exploration of big data technology. Building on large-cluster technology, it will further separate storage from computing, and separate data (data plus metadata) from computing through unified metadata and security management, so that data can be shared more widely. The goal is one copy of data with flexible deployment and elastic scaling of multiple computing clusters. A smoothly scalable architecture can then support clusters of 100,000 or even millions of nodes and keep adapting to enterprises' core need for multi-scenario convergence in big data applications.

Figure: Future architecture evolution direction

For more than a decade, FusionInsight has been building enterprise-grade intelligent data lakes for more than 3,000 government and enterprise customers in more than 60 countries and regions. With its platform-plus-ecosystem strategy and more than 800 business partners, FusionInsight is widely used in finance, telecom carriers, government, energy, healthcare, manufacturing, transportation and other industries, unlocking the value of data in the digital transformation of governments and enterprises and helping their business grow rapidly. MRS originates from the open big data ecosystem and adds key enterprise-grade capabilities on top. It stays open while giving customers an enterprise-grade integrated big data platform that helps them achieve T+0 data ingestion into the lake and one-stop converged analysis, and makes data "smart".
