Big data cluster upgraded across multiple versions with zero service interruption: the technology behind it

Date: 2022-06-02

Summary: On April 21, 2021, China Pacific Insurance Group and Huawei Cloud completed the world’s first cross-version rolling upgrade of a big data cluster.

This article is shared from the Huawei Cloud community post “Huawei Cloud FusionInsight helps CPIC upgrade across multiple versions without business interruption”, by hourglass.

On April 21, 2021, China Pacific Insurance (Group) Co., Ltd. (CPIC) and Huawei Cloud completed the world’s first cross-version rolling upgrade of a big data cluster. Breaking with the traditional approach of taking the cluster offline and upgrading in several separate steps, the core production cluster was upgraded in one pass from FusionInsight HD C70 to FusionInsight MRS 8.0.2, spanning the intermediate C80 and 6.5.1 releases, and the big data cluster was at the same time transformed from physical machines to a cloud service. This was a first for the financial industry and set a new benchmark for it. Over the two-week upgrade, CPIC’s upper-layer services kept running smoothly: cluster operation was never interrupted and performance was unaffected. The success of this cross-version rolling upgrade is significant for financial technology, marking a new benchmark for cross-version upgrades, business continuity, and sustainable evolution of big data services in the financial industry.

1. Project background

China Pacific Insurance Group chose Huawei Cloud FusionInsight to build its insurance big data platform in 2017. As the cooperation between CPIC and Huawei Cloud deepened, its main internal business systems moved onto the Huawei Cloud big data platform. In the early stage, however, each business system built its own independent cluster, so data could not be interconnected, data was duplicated, and maintaining many clusters was difficult. By the time of the upgrade, 18 big data clusters had been built, mostly running FusionInsight HD C70.

With the rapid development of CPIC’s business, new demands emerged for unified management, data sharing, and upgrade and evolution of the big data platform. The goal was to upgrade and merge the 18 production clusters in a unified way and to give the big data clusters the ability to evolve sustainably in the future.

For this reason, CPIC and Huawei Cloud decided to upgrade the existing 18 big data clusters from FusionInsight HD C70 to MRS 8.0. The main objectives of the upgrade were:

  • Upgrade and merge the original clusters into a single large cluster and improve resource utilization through resource consolidation;
  • Unify on the MRS platform, whose resource monitoring is more complete and whose problem diagnosis is more accurate;
  • Move to the cloud platform so that resources can be allocated flexibly on demand, an evolvable lake-warehouse integrated architecture can be realized, and other advanced services can be added.

2. Project content

2.1 Technical challenges

The CPIC big data cluster deploys HBase, Hive, HDFS, ZooKeeper, YARN, Oozie, Hue, Spark, and other components as required.

In addition, tens of thousands of jobs run in the cluster every day, which makes an upgrade that is imperceptible to the business even harder. The main challenges were:

  1. For the cross-major-version upgrade of the Hadoop component kernel from 2.x to 3.x, the community only provides rolling-upgrade capability for HDFS; because the YARN protocol differs between the source and target versions, YARN cannot be rolling-upgraded;
  2. During a rolling upgrade, the community-native HDFS does not physically delete removed files but moves them to a trash directory. For large-capacity clusters this puts pressure on storage resources and hinders the cleanup of residual data; if the trash is not cleaned up in time, the disks fill up;
  3. For the cross-major-version upgrade of the Hive component kernel to 3.x, incompatible metadata formats, API changes between versions, and incompatible syntax mean that the community-native version cannot be rolling-upgraded;
  4. For the cross-major-version upgrade of the HBase component kernel to 2.x, major API changes between versions mean that the community-native version cannot be rolling-upgraded;
  5. Tens of thousands of jobs run every day; they must keep running smoothly throughout the rolling upgrade, especially core scenarios such as profit-and-loss analysis and impairment measurement;
  6. In a big data cluster of 600+ nodes, unexpected events will inevitably occur during the upgrade; hardware failures (disk, memory, etc.) must be handled quickly without affecting the upgrade;
  7. More than 70 business systems and hundreds of workloads run on this cluster; during the rolling upgrade, none of them may be damaged.

2.2 Technical support

A rolling upgrade uses FusionInsight MRS’s high-availability mechanisms (active/standby roles, multiple replicas, rack-aware placement, and so on) to upgrade and restart only a subset of nodes at a time without affecting the cluster’s overall business, repeating the cycle until every node in the cluster runs the new version.

The following figure illustrates a rolling upgrade of the HDFS components:
[Figure: rolling upgrade of HDFS components]
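To make the batch-by-batch flow above concrete, here is a minimal Python sketch (not FusionInsight’s actual implementation); the helpers `upgrade_node`, `service_healthy`, and `node_is_faulty` are hypothetical stand-ins for the real cluster-management APIs.

```python
import time

BATCH_SIZE = 10              # nodes upgraded per batch (assumed value)
HEALTH_CHECK_INTERVAL = 30   # seconds between health polls (assumed value)


def rolling_upgrade(nodes, upgrade_node, service_healthy, node_is_faulty):
    """Upgrade `nodes` in small batches without taking the service down.

    The three callbacks are hypothetical: they stand in for the real
    cluster-management APIs that upgrade a node, report overall service
    health, and report whether a node has been isolated as faulty.
    """
    remaining = list(nodes)
    while remaining:
        batch, remaining = remaining[:BATCH_SIZE], remaining[BATCH_SIZE:]
        for node in batch:
            if node_is_faulty(node):
                # An isolated faulty node is skipped so that a hardware
                # failure does not block the overall upgrade.
                print(f"skipping isolated node {node}")
                continue
            upgrade_node(node)   # upgrade/restart one node at a time
        # Only move to the next batch once HA roles, replicas, etc. have
        # recovered, so the cluster keeps serving business traffic.
        while not service_healthy():
            time.sleep(HEALTH_CHECK_INTERVAL)
```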

To meet these technical challenges, the project set up a rolling-upgrade team made up of community PMC members, community committers, and version developers, which provided the following technical support:

  • Protocol synchronization, metadata mapping and conversion, API encapsulation, and similar measures resolved the compatibility problems caused by differing community protocols, metadata formats, and API changes, so that lower- and intermediate-version component clients continued to work normally throughout the rolling upgrade;
  • To solve the problem that files are not physically deleted during the community HDFS upgrade, automatic cleaning of the trash directory was added, turning logical deletion into physical deletion, together with a tool that periodically cleans the old version’s trash directory. This keeps infrastructure resources effectively utilized and reduces storage costs (a minimal sketch of this cleanup idea follows this list);
  • Based on component performance before and after the upgrade, the expected upgrade duration, and the bottlenecks that could appear during and after the upgrade, the architecture was adjusted and optimized accordingly, helping to keep the rolling upgrade globally controllable, imperceptible to the business throughout, and correct overall;
  • For operation and maintenance, the project team developed a dedicated upgrade-management service interface that drives the rolling upgrade end to end and step by step, so that upgrade status can be viewed and individual components controlled. To reduce the impact on key, time-critical services, the project implemented suspension by upgrade batch, so the upgrade can be paused during key jobs or peak hours and the business is not affected. In addition, to keep unexpected incidents from interrupting the upgrade, the project implemented faulty-node isolation: when a fault occurs, the upgrade step for the affected node can be skipped, so fault handling and the upgrade proceed in parallel.
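As a rough sketch of the trash-directory cleanup mentioned in the list above (the article does not describe Huawei’s actual tool, so the trash path and retention window below are assumptions for illustration only), the following Python script turns the upgrade’s logical deletions into physical ones by periodically purging old trash files:

```python
import os
import time

# Assumed location of blocks parked during the rolling upgrade; the real
# path depends on the cluster's data-directory configuration.
TRASH_DIR = "/srv/bigdata/datanode/current/trash"
RETENTION_SECONDS = 24 * 3600   # assumed retention window before physical deletion


def purge_trash(trash_dir=TRASH_DIR, retention=RETENTION_SECONDS):
    """Physically delete trash files older than the retention window."""
    now = time.time()
    freed = 0
    for root, _dirs, files in os.walk(trash_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                stat = os.stat(path)
            except FileNotFoundError:
                continue  # file already removed by a concurrent cleaner
            if now - stat.st_mtime > retention:
                freed += stat.st_size
                os.remove(path)   # convert the logical delete into a physical one
    return freed


if __name__ == "__main__":
    # Run periodically (for example from cron) so the trash never fills the disks.
    print(f"reclaimed {purge_trash()} bytes")
```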

2.3 Organizational guarantee

After the project was launched, a joint project team was established, with the relevant CPIC leaders as project managers and Huawei’s delivery and R&D staff plus CPIC’s R&D and operations staff as members. The upgrade involved more than 20 application departments and a large number of complex businesses on the platform. To ensure the rolling upgrade succeeded with zero business interruption, the project team built a thorough organizational guarantee system, led by Huawei and closely supported by the customer’s business departments, covering the six months before, during, and after the upgrade.
[Figure: organizational guarantee of the CPIC upgrade project]

  1. Preparation before the upgrade: under the overall coordination of the project team and with Huawei’s R&D support, code for 70+ applications was adapted and verified and test reports were produced. To identify risks fully, Huawei provided hardware resources for a test environment, and the project team worked with the application departments on three joint upgrade drills. To satisfy the upgrade preconditions, Huawei experts investigated and gave guidance, carrying out preparations such as consolidating small files in the cluster, rectifying clients, performing repeated cluster inspections, and repeatedly reviewing and refining the upgrade plan;
  2. Support during the upgrade: during the two-week upgrade, Huawei provided on-site support from R&D and solution experts and, together with the CPIC joint project team, established procedures such as 24-hour shift support, information feedback and communication between the joint project team and the application departments (business verification and confirmation were required after each component finished its rolling upgrade), authorization of upgrade operations by the joint project team, and screen recording of upgrade operations;
  3. Observation after the upgrade: after the rolling upgrade finished, the joint project team coordinated with each application department to verify the applications, and business operation reports were produced for all of them. The Huawei project team then continued to observe the system for two weeks and finalized the upgrade after confirming that the platform and applications were running normally.

3. Summary and outlook

This first cross-version rolling upgrade of a big data cluster in the financial industry, completed by CPIC and Huawei, kept the upper-layer business unaware of the upgrade, kept the whole cluster running without interruption, and left performance unaffected, effectively protecting the customer’s core interests and setting a new benchmark for the financial industry.

As digital technology continues to iterate and upgrade, the traditional insurance operating model will change. In the future, changes will come in three directions:

  1. From large numbers to small numbers: risk will be characterized digitally at a much finer granularity, shifting from the law-of-large-numbers probabilities of the past to sensitivity to individual, “small-number” risk, which will fundamentally change the traditional operating model;
  2. From physical to virtual: data has become a key means of production, and identifying and assessing the risks of new kinds of assets through massive data will become a core competence of the insurance industry;
  3. From insurance to governance: digitalization will improve insurers’ risk management capabilities, allowing them to participate more in national and urban risk governance and to shift gradually from loss compensation to risk management and governance.

Facing the future, CPIC will join hands with Huawei to keep innovating, continuously improve the risk ecosystem, implement its “customer demand oriented” strategy, and build a “first-class insurance and financial services group focused on its core insurance business, with sustained value growth and international competitiveness”.

