Best practices of data lake on the cloud


Introduction: Shuhe Technology has had a big data team and a big data platform since the company was founded, running its own Cloudera Hadoop cluster on ECS instances. However, as the company’s Internet finance business expanded rapidly, the big data team’s responsibilities grew heavier. Each business line raised new demands (real-time data warehousing, log analysis, ad hoc queries, data analysis, and so on) that severely tested the capacity of this Cloudera Hadoop cluster. To relieve the pressure on the Cloudera cluster, we built a data lake on Alibaba Cloud that fits Shuhe’s current situation.

1. Shuhe Technology

Founded in August 2015, Shuhe Technology is a Series C fintech company jointly backed by Focus Media, Sequoia Capital, and Sina. The company’s vision is to be an intelligent financier that accompanies users throughout their lives. Guided by the values of openness, challenge, professionalism, and innovation, it aims to let everyone enjoy the best possible financial services. Its main products are Huanbei and Latte Smart Investment, which mainly provide credit, wealth management, e-commerce, and other services to 80 million registered users. As a representative Chinese fintech company, Shuhe Technology has taken the lead in applying big data and AI to intelligent customer acquisition, risk control, operations, customer service, and other areas. So far it has partnered with more than 100 financial institutions, including banks, credit providers, licensed consumer finance companies, funds, and insurers.

2. Self-built CDH on the cloud

From the start, Shuhe Technology has had a big data team and a big data platform hosted with a cloud vendor. We purchased EC2 instances from that vendor and built our own Cloudera Hadoop cluster on them.
In the early days, this Cloudera Hadoop cluster served only as a T+1 offline data warehouse. Late at night, after the business day ended, we used Sqoop to extract full or incremental data from the business databases into the Hadoop cluster. After a series of ETL cleaning steps in Hive, the results were sent to management for decision-making, pushed to a database for Tableau reports, or inserted into the business databases for business systems to call.
However, as the company’s Internet finance business expanded rapidly, the big data team’s responsibilities grew heavier. Each business line raised new demands (real-time data warehousing, log analysis, ad hoc queries, data analysis, and so on) that severely tested the capacity of this Cloudera Hadoop cluster. To support the real-time data warehouse, we installed HBase on the Cloudera cluster. To support log analysis, we installed Flume and Kafka. To support ad hoc queries, we installed Presto. To support data analysis, we installed Jupyter. Every added business requirement was a serious challenge to the stability of the existing system.

Cloudera cluster

On top of the growing business demands, the company’s organizational structure became more complex, headcount kept increasing, and the total volume of all kinds of data grew exponentially. The Cloudera cluster’s weaknesses became apparent, and it gradually could no longer cope with these challenges.

  • Poor scalability

Scaling the cluster has to be done in Cloudera Manager, which requires skilled operations staff and carries operational risk. In addition, when an emergency or a temporary need calls for large-scale expansion, we first have to buy a large number of EC2 machines and then join them to the cluster through a series of complex operations. Afterwards, releasing those machines takes another series of complex operations, and performing these operations online puts the stability of the cluster’s production workloads at risk.

  • High cost

On the storage side, we did not anticipate how quickly data volume would grow. We used three-replica HDFS storage in the Cloudera cluster, and the EC2 machines were equipped with SSD disks. Weekly data backups also consumed a lot of disk space, so disk costs stayed high. On the compute side, there were many tasks at night and too few computing resources, but few tasks during the day and computing resources sitting idle.

  • Difficult cluster update

We run Cloudera 5.5.1. To keep the cluster stable, we have not dared to upgrade it for several years. Building a new-version Cloudera cluster and migrating to it would take a great deal of people and resources, so the old version has stayed in service. Compatibility constraints either prevent us from using new open source components or force us to spend a lot of effort adapting them, which hinders the adoption of new technologies.

  • High maintenance threshold

Building and maintaining a Cloudera cluster places high technical demands on operations staff, and solving real production problems demands even more. In addition, Cloudera Manager is not open source and the Cloudera community is not very active, which makes cluster operations harder.

  • Poor disaster tolerance

Data disaster recovery: the three HDFS replicas cannot span availability zones. Service disaster recovery: service nodes cannot be deployed across availability zones. A zone failure therefore affects the stability of the whole cluster.

3. Hybrid architecture on cloud

To relieve the pressure on the Cloudera cluster, we wanted to move part of our workload onto the cloud vendor’s products and gradually form a hybrid architecture on the cloud.

  • Build several cloud EMR clusters, divided by business and function

These cloud EMR clusters share storage and metadata with one another. However, because the EMR Hive version is incompatible with the Cloudera Hive version, the metadata could not be unified with Cloudera’s, and we ended up with two sets of metadata: Cloudera Hive and EMR Hive. These EMR clusters relieved the pressure on the Cloudera cluster.

  • To relieve the pressure on Cloudera, we designed Chive, a hybrid architecture built on EMR Hive

The Chive architecture connects EMR Hive to the Cloudera Hive metadata, which in effect uses the storage of Cloudera HDFS but the computing resources of EMR. The Chive hybrid architecture also greatly reduced the pressure on the Cloudera cluster.

  • Separation of hot and cold data

Hot data on the Cloudera cluster is stored on HDFS, while cold data is written by Cloudera Hive to an S3 bucket, where a lifecycle rule periodically moves the data into cold storage.

Hybrid architecture on cloud

With the hybrid architecture on the cloud in place, we already had the prototype of a big data lake. We took the opportunity of migrating from that cloud vendor to Alibaba Cloud to build a data lake that fits Shuhe’s current situation.

4. First-generation data lake on Alibaba Cloud


4.1 What is a data lake

A data lake is a centralized repository that lets you store all your structured and unstructured data at any scale. You can store data as-is, without having to structure it first, and then run different types of analytics on it, including big data processing, visualization, real-time analytics, and machine learning, to guide better decisions.

Data lake compared with data warehouse

| Characteristic | Data warehouse | Data lake |
|---|---|---|
| Data | Relational data from transactional systems, operational databases, and line-of-business applications | Non-relational and relational data from IoT devices, websites, mobile applications, social media, and enterprise applications |
| Schema | Designed before the data warehouse is implemented (schema-on-write) | Written at analysis time (schema-on-read) |
| Price/performance | Faster query results come with higher storage cost | Faster query results need only lower-cost storage |
| Data quality | Highly curated data that can serve as the authoritative version of the facts | Any data, curated or not (for example, raw data) |
| Users | Business analysts | Data scientists, data developers, and business analysts (using curated data) |
| Analytics | Batch reporting, BI, and visualization | Machine learning, predictive analytics, data discovery, and profiling |

Basic elements of a data lake solution

  • Data movement

A data lake lets you import any amount of data, including data that arrives in real time. You can collect data from multiple sources and move it into the data lake in its original form. This lets you scale to data of any size while saving the time of defining data structures, schemas, and transformations up front.

  • Secure storage and cataloging of data

A data lake lets you store both relational and non-relational data. It also lets you understand what data is in the lake by crawling, cataloging, and indexing it. Finally, the data must be secured so that your data assets are protected.

  • Analytics

A data lake lets the various roles in an organization, such as data scientists, data developers, and business analysts, access data with the analytics tools and frameworks of their choice. This includes open source frameworks such as Apache Hadoop, Presto, and Apache Spark, as well as commercial products from data warehouse and business intelligence vendors. A data lake lets you run analytics without moving the data to a separate analytics system.

  • Machine learning

A data lake lets organizations generate different types of insights, including reports on historical data and machine learning, where models are built to predict likely outcomes and to recommend a range of prescribed actions for achieving the best result.
Based on this definition and these basic elements, we implemented a first-generation data lake solution on Alibaba Cloud that fits Shuhe’s current situation.

4.2 Design of the Alibaba Cloud data lake

4.2.1 Overall architecture of the Alibaba Cloud data lake

Overall architecture of the Alibaba Cloud data lake

A VPC (Virtual Private Cloud) is a custom private network that users create on Alibaba Cloud. Different VPCs are logically isolated from each other at layer 2. Users can create and manage cloud product instances, such as ECS, load balancers, and RDS, inside their own VPCs.
We place the company’s workloads in two VPCs: a business VPC and a big data VPC. The data extraction EMR pulls data from RDS, OSS, and Kafka in the business VPC and lands it in the data lake on OSS, forming the ODS layer. The core data warehouse EMR runs T+1 ETL on the ODS layer to produce the CDM data warehouse layer and the ADS data mart layer, which the other big data EMR clusters and the business EMR clusters consume.
The following sections describe our solutions and practices on the Alibaba Cloud data lake.

4.2.2 Unified storage and metadata management

Unified storage means using OSS object storage as the data lake’s storage layer, shared by several EMR clusters. Alibaba Cloud Object Storage Service (OSS) is a massive, secure, low-cost, and highly durable cloud storage service. It is designed for data durability of at least 99.9999999999% (twelve 9s). OSS exposes a platform-independent RESTful API, so any type of data can be stored and accessed from any application, at any time, from anywhere. You can also use the APIs, SDKs, or OSS migration tools provided by Alibaba Cloud to move massive amounts of data into or out of OSS. Once data is in OSS, you can use the Standard storage class as the primary storage and choose the cheaper Infrequent Access, Archive, or Cold Archive storage classes for data that is rarely accessed and kept for a long time. These features make OSS well suited to data lake storage.
Unified metadata means that the EMR components in the data lake, such as Hive, Ranger, and Hue, share a single set of metadata. We keep this EMR metadata in external RDS instances. Alibaba Cloud Relational Database Service (RDS) is a stable, reliable, and elastically scalable online database service. Built on Alibaba Cloud’s distributed file system and high-performance SSD storage, it lets us stand up stable, reliable database services quickly, and it is cheaper and easier to use than a self-built database. It offers flexible billing, on-demand configuration, high performance, a high-availability architecture, multiple disaster recovery options, and strong security. It is therefore well suited to storing unified metadata.

4.2.3 Multi-EMR, multi-OSS-bucket design

Building on the unified OSS storage and unified metadata architecture, we designed a multi-EMR, multi-OSS-bucket framework.

Design of multi EMR and multi OSS buckets on data Lake

The data extraction EMR pulls business RDS data into the data lake on a T+1 basis. The core data warehouse EMR runs a series of ETL operations across the warehouse layers to produce the CDM common dimension layer. Each business EMR runs ETL on the CDM data to produce the ADS data mart layer. The EMR Presto cluster serves ad hoc queries on CDM and ADS data.
A business EMR mainly provides an ad hoc query service and a DAG scheduling service. Users can submit ad hoc queries and scheduled tasks only to their own department’s EMR, and we configure YARN queue resources to isolate the resources used by the two kinds of tasks.

Business EMR services

4.2.4 Design of the distributed scheduling system

Airflow is a platform for programmatically authoring, scheduling, and monitoring workflows. Using directed acyclic graphs (DAGs), Airflow lets you define a group of dependent tasks and runs them in dependency order. It provides a rich set of command-line tools for system administration, and its web UI makes it easy to control and schedule tasks and to monitor their status in real time, which simplifies operations and maintenance.
The Airflow system runs several daemons that together provide all of its functions: the web server, the scheduler, the worker (execution unit), the message queue monitoring tool Flower, and so on. These daemons are independent of each other; they neither depend on nor know about one another. Each daemon only handles the tasks assigned to it at runtime. Taking advantage of this, we built a highly available, distributed Airflow scheduling cluster on the data lake.
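
To make the idea of running tasks in dependency order concrete, here is a minimal sketch that uses only the Python standard library instead of Airflow itself. The task names are hypothetical stand-ins for a T+1 pipeline and do not come from our production DAGs.

```python
from graphlib import TopologicalSorter

# Hypothetical T+1 pipeline: each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "extract_rds_to_ods": set(),
    "check_ods_quality": {"extract_rds_to_ods"},
    "build_cdm": {"extract_rds_to_ods"},
    "build_ads": {"build_cdm"},
    "refresh_reports": {"build_ads"},
}


def execution_order(dependencies):
    """Return one valid run order; raises graphlib.CycleError if there is a cycle."""
    return list(TopologicalSorter(dependencies).static_order())


def runnable_batches(dependencies):
    """Group tasks into batches whose members could run in parallel."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(runnable_batches(DEPENDENCIES), start=1):
        print(number, batch)
```

A scheduler works in the same spirit: tasks in the same batch can be handed to different workers at once, while later batches wait for their upstream tasks to finish.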

Airflow distributed scheduling system on data Lake

To run tasks conveniently on EMR, we deploy the Airflow workers on the EMR gateways, because a gateway has the client commands and configuration for every component deployed in that EMR cluster.
We can scale a worker vertically by increasing the number of worker processes on a single worker node, or scale horizontally by adding more gateways (one worker per gateway); either way raises the cluster’s task concurrency. In practice, to raise task concurrency and reduce the load on any single gateway, we configure two gateways, each with an Airflow worker, for the highly concurrent core data warehouse cluster and data extraction cluster.
In the future, we plan to deploy two Airflow master nodes to remove the master as a single point of failure.

4.2.5 Design of the user permission system

The user permission system is always at the core of architecture design. We designed a three-layer permission system on the data lake: the first layer is RAM access control, the second is access to the EMR execution engines, and the third is access to big data interactive analysis.

Three layer user authority system on data Lake

Layer 1: Resource Access Management (RAM) is the Alibaba Cloud service for managing user identities and resource access permissions. RAM lets you create and manage multiple identities under one Alibaba Cloud account and assign different permissions to a single identity or a group of identities, so that different users have different access rights to resources. We bind an ECS application role to each EMR cluster, and each ECS application role can access only its corresponding OSS buckets in the data lake.
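
As an illustration of restricting a role to one bucket, the sketch below builds a RAM policy document as a plain Python dictionary. The bucket name is hypothetical, and the action list is an assumption chosen for the example rather than the exact policy we attach to our ECS application roles.

```python
import json


def bucket_policy(bucket, read_only=False):
    """Build a RAM policy document that grants access to a single OSS bucket."""
    actions = ["oss:GetObject", "oss:ListObjects"]
    if not read_only:
        actions += ["oss:PutObject", "oss:DeleteObject"]
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": [
                    f"acs:oss:*:*:{bucket}",    # the bucket itself (needed for listing)
                    f"acs:oss:*:*:{bucket}/*",  # every object inside the bucket
                ],
            }
        ],
    }


if __name__ == "__main__":
    # "example-ads-bucket" is a made-up name used only for illustration.
    print(json.dumps(bucket_policy("example-ads-bucket"), indent=2))
```

Because the policy lists only one bucket in `Resource`, a role carrying it gets no access to other buckets in the data lake unless another policy grants it.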
Layer 2: access rights to the EMR execution engines, including HiveServer2, Presto, Spark, and others.
First, note the distinction: authentication verifies that a user is who they claim to be, while authorization verifies that the user has permission to perform an operation.
HiveServer2 supports several authentication methods: NONE, NOSASL, KERBEROS, LDAP, PAM, CUSTOM, and so on. Authorization can use Hive’s own permission system or open source components such as Ranger and Sentry.
Through Presto’s Hive connector, Presto and Hive can share the same user permission system. With support from the Alibaba Cloud EMR big data team, the Spark client can also use this permission system.
In the end, we use EMR OpenLDAP to store user and group information, and EMR Ranger to provide a centralized permission management framework. The users and groups in EMR OpenLDAP are synchronized from the company’s AD (Active Directory): information about employees who join or leave is synchronized from AD to EMR OpenLDAP on a T+1 basis.

Openldap and ranger user rights management system
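
The T+1 synchronization from AD to OpenLDAP boils down to a set comparison. The following is a simplified sketch of that comparison only: it assumes the two user lists have already been fetched, and the account names are invented for the example.

```python
def plan_sync(ad_users, ldap_users):
    """Compare AD (the source of truth) with OpenLDAP and list the changes to apply."""
    ad, ldap = set(ad_users), set(ldap_users)
    return {
        "add": sorted(ad - ldap),     # new employees missing from OpenLDAP
        "remove": sorted(ldap - ad),  # departed employees still present in OpenLDAP
    }


if __name__ == "__main__":
    # Invented account names, for illustration only.
    print(plan_sync(["alice", "bob", "carol"], ["bob", "dave"]))
```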

Layer 3: access to big data interactive analysis. We built a unified big data interactive analysis and query system, similar to Hue. By restricting which EMR clusters this system may access on a user’s behalf, users can reach only their own department’s EMR.
Together, the three layers cover essentially all of our users’ data access scenarios.

4.2.6 EMR elastic scaling design

EMR’s elastic scaling feature lets you set scaling policies according to business needs. Once elastic scaling is enabled and configured, EMR automatically adds task nodes to guarantee computing power when business demand rises, and automatically removes task nodes to save cost when demand falls.
We run a large number of EMR clusters on the data lake. Elastic scaling is exactly what lets us meet business needs while saving cost and improving execution efficiency. It is also one of the most important advantages of big data on the cloud over a traditional self-built cluster in an IDC.
Our elastic scaling rules mainly follow one principle: the bar for triggering scale-out is lower than the bar for triggering scale-in, so the cluster grows readily and shrinks cautiously.
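
One way to express an asymmetric rule of this kind is sketched below. The metric (a usage percentage sampled once per minute), the thresholds, and the window lengths are all assumptions for illustration; they are not the values configured on our clusters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    scale_out_above: float  # usage (%) above which the cluster should grow
    out_minutes: int        # consecutive minutes required to trigger scale-out
    scale_in_below: float   # usage (%) below which the cluster may shrink
    in_minutes: int         # consecutive minutes required to trigger scale-in


def decide(samples, rule):
    """samples: per-minute usage percentages, oldest first. Returns the action to take."""
    recent_out = samples[-rule.out_minutes:]
    recent_in = samples[-rule.in_minutes:]
    if len(recent_out) == rule.out_minutes and all(s > rule.scale_out_above for s in recent_out):
        return "scale_out"
    if len(recent_in) == rule.in_minutes and all(s < rule.scale_in_below for s in recent_in):
        return "scale_in"
    return "hold"


# Hypothetical rule: scale out after 2 busy minutes, scale in only after 10 quiet minutes.
EXAMPLE_RULE = ScalingRule(scale_out_above=80, out_minutes=2, scale_in_below=30, in_minutes=10)

if __name__ == "__main__":
    print(decide([85, 90], EXAMPLE_RULE))
    print(decide([20] * 5, EXAMPLE_RULE))
```

The short scale-out window and long scale-in window are what make growth easy and shrinkage cautious, which reduces flapping when load briefly dips.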

4.2.7 Load balancing management

EMR clusters are stateless and can be created and destroyed at any time. However, the service endpoints that EMR clusters expose must remain stable regardless of clusters being created or destroyed, so we designed a unified service interface layer for EMR clusters on the data lake.
HAProxy provides high availability, load balancing, and proxying for TCP- and HTTP-based applications, and supports virtual hosts. It is a free, fast, and reliable solution. We use HAProxy’s layer-4 (transport-layer) load balancing, in TCP mode, to provide unified services.
In practice, we mainly use HAProxy to proxy each EMR cluster’s HiveServer2, ResourceManager, Hive Metastore, and Presto HTTP endpoints, and we have HAProxy load multiple per-module configuration files, which makes maintenance and restarts easier.
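
The per-module configuration files can be generated from a small description of each service. The sketch below renders one HAProxy `listen` section in TCP mode; the host names and the choice of round-robin balancing are assumptions for illustration, not a copy of our configuration.

```python
def render_listen(name, port, backends):
    """Render an HAProxy 'listen' section that proxies one TCP service."""
    lines = [
        f"listen {name}",
        f"    bind *:{port}",
        "    mode tcp",
        "    balance roundrobin",
    ]
    for index, (host, backend_port) in enumerate(backends, start=1):
        # 'check' enables health checks so that dead backends are skipped.
        lines.append(f"    server {name}-{index} {host}:{backend_port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical EMR header nodes running HiveServer2 on port 10000.
    print(render_listen(
        "hiveserver2",
        10000,
        [("emr-a-header-1.example.internal", 10000),
         ("emr-a-header-2.example.internal", 10000)],
    ))
```

Writing one such file per service keeps each module small; when a cluster is recreated, only the backend list in its file needs to change while clients keep using the same proxy address.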

4.2.8 OSS bucket lifecycle management

Unlike the other warehouse layers, ODS-layer data cannot be regenerated: the business RDS databases purge data periodically, so the data warehouse also serves as the backup. We keep ODS-layer data in a versioning-enabled bucket, which gives us the same kind of periodic backup that snapshots provided on Cloudera Hadoop. We therefore set lifecycle rules on the ODS bucket that both keep ODS-layer data safe and keep data volume growing at a steady rate.

Life cycle setting of ODS multi version bucket

Hadoop HDFS has a trash mechanism: deleted data is moved to a trash directory rather than removed immediately, which guards against accidentally deleting important files, and data in the trash can be restored. HDFS creates a trash directory for each user at /user/<username>/.Trash/, which holds the files and directories that user deleted. The trash has a retention interval (fs.trash.interval), after which HDFS permanently deletes the data. In a data lake architecture, however, the trash directory lives in an OSS bucket, and HDFS does not purge it periodically. We therefore set an OSS lifecycle rule (delete data older than 3 days) to clean up these files regularly.

Lifecycle settings for the trash directory
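
A rule like "delete trash older than 3 days" can be written as an OSS lifecycle configuration document. The sketch below builds that XML with the standard library; the rule ID and the trash prefix are hypothetical, and the real prefix depends on where the cluster’s trash directory is mapped in the bucket.

```python
import xml.etree.ElementTree as ET


def expiration_rule_xml(rule_id, prefix, days):
    """Build a lifecycle configuration that expires objects under a prefix after N days."""
    root = ET.Element("LifecycleConfiguration")
    rule = ET.SubElement(root, "Rule")
    ET.SubElement(rule, "ID").text = rule_id
    ET.SubElement(rule, "Prefix").text = prefix
    ET.SubElement(rule, "Status").text = "Enabled"
    expiration = ET.SubElement(rule, "Expiration")
    ET.SubElement(expiration, "Days").text = str(days)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical prefix for the hadoop user's trash directory.
    print(expiration_rule_xml("expire-trash", "user/hadoop/.Trash/", 3))
```

The same rule can be created in the OSS console; generating it in code is mainly useful when many buckets need identical rules.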

4.2.9 Log management

Log Service (SLS) is a one-stop service for log data. Without any development work, users can quickly set up data collection, consumption, shipping, and query and analysis. It improves operations efficiency and provides the capacity to process massive volumes of logs in the DT (data technology) era.
Because EMR component logs are deleted periodically, we must collect the historical component logs from EMR in one place for later troubleshooting. SLS fits the scenario of collecting logs from multiple EMR clusters on the data lake. We collected the

4.2.10 Terminal permission management

Developers need login permissions on specific EMR instances for development work.

Terminal authority management

The terminal login path is shown above: through the company’s bastion host, users log in to a designated Linux jump server in the big data VPC, and from there to EMR instances. Operators in different roles have specific login permissions. Big data operations staff can use a unified key pair to log in to any instance of an EMR Hadoop cluster as root, and after switching to the hadoop account they can log in to any other instance of the cluster.

4.2.11 Component UI management

As shown above, Knox’s address is hard to remember, so we use Alibaba Cloud DNS.
Alibaba Cloud DNS is a secure, fast, stable, and scalable authoritative DNS service. Enterprises and developers use it to translate easy-to-manage, recognizable domain names into the numeric IP addresses that computers use to communicate, so that user requests are routed to the right website or application server.
We use a CNAME (alias) record to point an easy-to-remember domain name at the Knox domain name, which solves the problem neatly.

4.2.12 Monitoring and alerting management

EMR APM gives EMR cluster users, especially cluster operations staff, a complete set of tools for monitoring clusters, services, and overall job execution, and for diagnosing and resolving cluster problems.
The YARN-HOME chart is often used to view the history of elastically scaled instances.

YARN-HOME chart on the EMR APM dashboard

The YARN-QUEUE chart shows the historical resource usage and task execution of each queue.

YARN-QUEUE chart on the EMR APM dashboard

CloudMonitor is a service for monitoring Alibaba Cloud resources and Internet applications. It collects metrics for Alibaba Cloud resources as well as custom metrics, probes service availability, and lets you set alerts on metrics. It gives you a full picture of resource usage, business status, and health on Alibaba Cloud, and lets you respond to alerts promptly to keep applications running smoothly.
We feed the alerts from the core components of the data lake’s EMR clusters into CloudMonitor, which then notifies the responsible people in a unified way by phone, DingTalk, and email.

4.2.13 Ad hoc query design

Ad hoc query capability is a test of a platform’s ability to deliver data. We developed a unified big data interactive query system that supports HiveServer2 and Presto. By restricting the query entry points in this system, users can submit ad hoc query jobs only to their own department’s EMR. Because the computing resources used by Presto and those used by Hadoop would otherwise interfere with each other, we built a separate, dedicated EMR Presto cluster to provide the Presto ad hoc query service for the unified query system.

Design of ad hoc query on data Lake

Beyond meeting users’ basic ad hoc query needs, we also implemented many custom features:

  • Integration with the company’s ticket approval system
  • Monitoring and alerts for component service status
  • Conversion between HiveSQL and PrestoSQL syntax
  • Metadata display, including sample data, data lineage, scheduling information, statistics, and so on

4.2.14 Cluster security group design

An ECS security group is a virtual firewall with stateful inspection and packet filtering, used to define security domains in the cloud. A security group is a logical grouping of instances in the same region that share the same security requirements and trust one another.
Every EMR cluster on the data lake must be bound to specific security groups before it can serve external clients. We assign different security groups to the different instance groups of a big data cluster.

4.2.15 Data masking design

Sensitive data mainly includes customer information, technical information, personal information, and other high-value data, and it exists in the big data warehouse in many forms. A leak of sensitive data can cause serious financial and brand damage.
EMR Ranger supports data masking for Hive: it masks the results returned by SELECT so that sensitive information is hidden from users. However, EMR Ranger’s masking applies only to HiveServer2 scenarios, not to Presto.

Sensitive fields in the data lake are scanned against preset rules, with hourly incremental scans and daily full scans. The scan results are written to Ranger’s metadata database through the Ranger masking RESTful API. When a user’s ad hoc query goes through HiveServer2 and hits a sensitive field, only a preset number of leading characters are shown in the clear, and every remaining character is replaced with X.

Effect of Ranger desensitization
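
The masking effect described above (keep the first few characters, replace the rest with X) can be reproduced with a small function. This is only a sketch of the visible behavior, not Ranger’s implementation; the default of three visible characters and the sample value are assumptions.

```python
def mask_value(value, visible=3, mask_char="X"):
    """Keep the first `visible` characters and replace every remaining one with mask_char."""
    if value is None:
        return None
    text = str(value)
    return text[:visible] + mask_char * max(len(text) - visible, 0)


def mask_row(row, sensitive_columns, visible=3):
    """Return a copy of a result row with the sensitive columns masked."""
    return {
        column: mask_value(value, visible) if column in sensitive_columns else value
        for column, value in row.items()
    }


if __name__ == "__main__":
    # Made-up sample row, for illustration only.
    print(mask_row({"user_id": 42, "mobile": "13812345678"}, {"mobile"}))
```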

4.2.16 YARN queue design

A business EMR mainly provides an ad hoc query service and a DAG scheduling service. Users can submit ad hoc queries and scheduled tasks only to their own department’s EMR, and we configure YARN queue resources to isolate the resources used by the two kinds of tasks.
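
As a sketch of how two isolated queues might be described, the function below produces Capacity Scheduler properties for a set of queues. It assumes the cluster uses the YARN Capacity Scheduler; the queue names `default` (ad hoc queries) and `dw` (scheduled tasks) and the 40/60 split are illustrative values, not our production settings.

```python
def capacity_scheduler_properties(capacities):
    """Turn {queue name: capacity %} into Capacity Scheduler properties for root's child queues."""
    total = sum(capacities.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")
    properties = {"yarn.scheduler.capacity.root.queues": ",".join(capacities)}
    for queue, percent in capacities.items():
        properties[f"yarn.scheduler.capacity.root.{queue}.capacity"] = str(percent)
    return properties


if __name__ == "__main__":
    # Hypothetical split between ad hoc queries and scheduled DAG tasks.
    for key, value in capacity_scheduler_properties({"default": 40, "dw": 60}).items():
        print(f"{key}={value}")
```

Validating that the capacities sum to 100 before writing the configuration catches a common mistake early, because the scheduler rejects sibling queues whose capacities do not add up.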

4.3 EMR governance on the data lake

EMR governance plays an important role in data lake governance. It covers stability, security, execution efficiency, and cost.

4.3.1 Adjusting the EMR pre-scale-out time

The overnight T+1 jobs are time-sensitive, so sufficient computing resources must be ready before the data warehouse jobs start at 00:00. Because of a limitation in EMR’s current elastic scaling architecture, graceful decommissioning prevents scale-in and scale-out from running in parallel.

  • Delay the pre-scale-out time as far as possible without affecting the 00:00 data warehouse jobs
    We call the EMR OpenAPI on a schedule to temporarily shorten the graceful decommissioning parameter, which lets us delay the pre-scale-out time from 22:00 to 23:30.
  • Review task monitoring and resume elastic scaling as early as possible
    By checking EMR APM monitoring and observing task execution times, we adjusted the elastic scaling lower limit ahead of time and moved the time at which elastic scaling resumes from 10:00 forward to 6:00.
    Comparing before and after the optimization, the average number of online nodes between 22:00 and 10:00 fell from 52 to 44.

4.3.2 Changing the EMR elastic scaling policy

The elastic scaling feature lets you set scaling policies according to business needs. Once elastic scaling is enabled and configured, EMR automatically adds task nodes to guarantee computing power when business demand rises, and automatically removes task nodes to save cost when demand falls. The billing options for task nodes are subscription (yearly or monthly), pay-as-you-go instances, and preemptible (spot) instances. When a cluster is fully elastic, we should use preemptible instances as much as possible; see Alibaba Cloud’s best practice for EMR elastic, low-cost offline big data analysis.

  • Prefer preemptible instances, with pay-as-you-go instances as the fallback
    This scheme balances the stability of cluster computing power, cost, and elasticity by using as many preemptible instances as possible. Pay-as-you-go instances are used only when ECS inventory in the zone is short.
    Elastic scaling configuration
  • Zone migration
    Inventory differs between zones. We should try to deploy or migrate EMR clusters to zones with abundant inventory, so that we can use preemptible instances as much as possible to reduce cost.
  • Elastic policy adjustment
    Night-time and daytime workloads differ in nature. For example, at night the dw queue mainly runs scheduled tasks, while during the day the default queue mainly serves ad hoc queries. We use scheduled jobs to refresh queue resources periodically, so that queue resources are used effectively and not wasted.
    After this series of optimizations, EMR cluster cost fell by about one fifth.
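
The "preemptible first, pay-as-you-go as the fallback" policy amounts to a simple allocation rule, sketched below. The node counts are hypothetical inputs; in practice EMR’s elastic scaling configuration applies this rule, not user code.

```python
def plan_purchase(nodes_needed, preemptible_available):
    """Fill the request with preemptible instances first, then fall back to pay-as-you-go."""
    if nodes_needed < 0:
        raise ValueError("nodes_needed must not be negative")
    preemptible = min(nodes_needed, max(preemptible_available, 0))
    return {"preemptible": preemptible, "pay_as_you_go": nodes_needed - preemptible}


if __name__ == "__main__":
    # Hypothetical request: 20 task nodes while the zone can supply only 12 preemptible ones.
    print(plan_purchase(20, 12))
```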

4.3.3 Optimizing EMR cloud disk space

EMR elastic instances can use cloud disks, which come in three types: ultra disk, SSD, and ESSD.

  • ESSD cloud disk: an ultra-high-performance cloud disk built on a new-generation distributed block storage architecture. Combined with 25 GE networking and RDMA, a single disk delivers up to 1 million random read/write IOPS with low one-way latency. Recommended for large OLTP databases, NoSQL databases, ELK distributed logging, and similar scenarios.
  • SSD cloud disk: a high-performance cloud disk with stable, high random read/write performance and high reliability. Recommended for I/O-intensive applications, small and medium-sized relational databases, and NoSQL databases.
  • Ultra cloud disk: a cost-effective cloud disk with medium random read/write performance and high reliability. Recommended for development and test workloads and for system disks.
    At present, weighing cost against performance, we chose ESSD cloud disks. Based on daily cloud disk monitoring of the elastic nodes, we size the number and capacity of data disks on elastic instances appropriately.
    4.3.4 Selection of EMR machine groups
    A business EMR cluster mainly provides an ad hoc query service and a DAG task scheduling service. Elastic scaling suits DAG scheduling but not ad hoc queries, because ad hoc queries are short and frequent. For that reason we prefer to reserve a fixed number of task instances for them, and prepaid billing is more appropriate for these instances.
    So we set up two task machine groups: a prepaid task machine group and a postpaid task machine group. The prepaid group mainly serves ad hoc queries, and the postpaid elastic group serves DAG scheduling tasks.

4.3.5 EMR cost control


In our company's cost breakdown by product, ECS accounts for a large share of the total cost, and EMR elastic instances account for most of the ECS cost. We therefore need to watch EMR spending closely to control cost effectively.
We can use the bill subscription service, calling SubscribeBillToOSS, to export Alibaba Cloud billing details to OSS, load them into a Hive table on the big data platform, and compute a daily cost report for each EMR cluster through a series of ETL jobs. EMR cost mainly consists of subscription instance cost, pay-as-you-go instance cost, preemptible instance cost, cloud disk cost and reserved instance deductions. Alibaba Cloud supports cost allocation by tagging resources. Specifically, we tag EMR clusters to split costs among multiple business clusters. Please refer to [best practice of enterprise account splitting under single account](…
Through the report, we found that the cost of emr-a's 30 machines was out of proportion to that of emr-b's 50 machines. Analyzing the cost composition showed that emr-a is in a zone short of resources and uses many pay-as-you-go instances and reserved instances, while emr-b is in a zone with surplus resources and uses many preemptible instances. The cost of pay-as-you-go instances plus reserved instances is much higher than that of preemptible instances.
In addition, we calculate the cost of each SQL job in EMR to push business teams to optimize large SQL jobs and retire useless ones. We pull the memorySeconds metric from the ResourceManager and use the formula: SQL cost = memorySeconds of the SQL / total memorySeconds of the EMR cluster * total EMR cost.
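
The allocation formula above can be sketched as follows. This is a minimal illustration, not the production ETL: the function name and the input shape (a mapping from a SQL job identifier to its memorySeconds) are assumptions, and the cluster total is taken to be the sum over all jobs passed in.

```python
def allocate_sql_cost(memory_seconds_by_sql: dict, total_emr_cost: float) -> dict:
    """Split the cluster cost across SQL jobs in proportion to memorySeconds.

    SQL cost = memorySeconds of the SQL / total memorySeconds * total EMR cost
    """
    # Assumption: the cluster total is the sum of the jobs supplied here.
    total_memory_seconds = sum(memory_seconds_by_sql.values())
    if total_memory_seconds <= 0:
        raise ValueError("total memorySeconds must be positive")
    return {
        sql_id: memory_seconds / total_memory_seconds * total_emr_cost
        for sql_id, memory_seconds in memory_seconds_by_sql.items()
    }


# Hypothetical figures: two jobs sharing a daily cluster cost of 100 yuan.
costs = allocate_sql_cost({"sql_a": 3000, "sql_b": 1000}, 100.0)
```

Sorting the result by cost in descending order gives a list of the jobs worth optimizing first.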

4.3.6 Purchase of reserved instances (RI)

A reserved instance is a deduction voucher that offsets the bills of pay-as-you-go instances (excluding preemptible instances) and can also reserve instance resources. Compared with subscription instances, combining reserved instances with pay-as-you-go instances balances flexibility and cost.
Reserved instances can be scoped to a region or to a zone. A region-level reserved instance matches pay-as-you-go instances across zones in the specified region. A zone-level reserved instance matches only pay-as-you-go instances in the same zone.
Reserved instances support three payment options: all upfront, partial upfront and no upfront. Different payment options have different prices.
Because our elastic policy prefers preemptible instances and falls back to pay-as-you-go instances, we purchased some region-level (cross-zone) reserved instances with no upfront payment to offset the pay-as-you-go instances created by elastic scale-out. The figure below shows the usage of the reserved instances in each billing period.


It can be seen that the utilization rates of the reserved instances for the two ECS instance types are 0% and 62.5% respectively, short of the expected 100%. The reason is that the resources were later switched from pay-as-you-go instances to preemptible instances, which reserved instances do not cover. Overall, reserved instances saved about 40% of the pay-as-you-go cost. For more details, please refer to “RI and SCU full link use practice”.

Elasticity assurance

Elasticity assurance provides a 100% resource certainty guarantee for the day-to-day elastic resource needs of pay-as-you-go billing. By paying a relatively low assurance fee, we get guaranteed resource availability for a fixed term (from 1 month to 5 years). When purchasing elasticity assurance, we set attributes such as zone and instance type, and the system reserves matching resources in a private pool. When creating a pay-as-you-go instance, we can choose to use the private pool capacity so that creation is guaranteed to succeed.
Resources on Alibaba Cloud tend to be in short supply around Double 11, and some of the company's T+1 tasks are extremely important. To protect EMR elastic resources during Double 11 at low cost, we bound elasticity assurance private pools to some important EMR clusters on the data lake, so that their elastic resources are available during this period.

4.4 Data lake OSS governance

The above covers the governance of EMR on the data lake. The following covers the governance of OSS, the storage medium of the data lake.

4.4.1 Versioned bucket management for the data warehouse ODS layer

Versioning is a data protection feature at the OSS bucket level. Once versioning is enabled, overwrites and deletions are saved as historical versions. If an object is overwritten or deleted by mistake, it can be restored to a historical version at any time.
In our self-built Cloudera Hadoop cluster, we used HDFS snapshots to keep data safe. In the data lake architecture, we use OSS versioning for the same purpose.
OSS supports lifecycle rules that automatically delete expired files and fragments, or transition expired files to the Infrequent Access or Archive storage class, to save storage cost. We also set a lifecycle rule on versioned buckets to save cost: keep the current version and automatically delete historical versions after 3 days.
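
To make the effect of the 3-day rule concrete, the following sketch simulates which historical versions such a rule would remove. It is a local model only and does not call OSS. The record fields (`version_id`, `is_current`, `noncurrent_since`) are hypothetical names chosen for the example, and the real service may evaluate expiry at a coarser, day-level granularity.

```python
from datetime import datetime, timedelta

# Retention for historical (noncurrent) versions, as stated in the text.
NONCURRENT_RETENTION = timedelta(days=3)


def versions_to_delete(versions: list, now: datetime) -> list:
    """Return IDs of historical versions older than the retention period.

    The current version is always kept, regardless of its age.
    """
    doomed = []
    for version in versions:
        if version["is_current"]:
            continue
        if now - version["noncurrent_since"] >= NONCURRENT_RETENTION:
            doomed.append(version["version_id"])
    return doomed


# Hypothetical object with one current and two historical versions.
example_versions = [
    {"version_id": "v3", "is_current": True, "noncurrent_since": None},
    {"version_id": "v2", "is_current": False, "noncurrent_since": datetime(2021, 1, 8)},
    {"version_id": "v1", "is_current": False, "noncurrent_since": datetime(2021, 1, 5)},
]
expired = versions_to_delete(example_versions, now=datetime(2021, 1, 10))
```

In this example only `v1` has been noncurrent for at least 3 days, so it is the only version selected for deletion.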


4.4.2 Log bucket management

As the figure below shows, standard storage grew linearly before September 28. After the cold storage lifecycle rule was set on September 28, cold storage grew linearly while standard storage stayed basically flat. The unit price of standard storage is 0.12 yuan/GB/month and that of archive storage is 0.033 yuan/GB/month, so converting 330 TB of data to cold storage saved about 72.5% of the cost.
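
The 72.5% figure follows directly from the two unit prices. The short sketch below reproduces the arithmetic. The conversion of 1 TB to 1024 GB is an assumption made for the example, and the calculation ignores request, retrieval and minimum-storage-duration charges.

```python
STANDARD_PRICE = 0.12   # yuan per GB per month, from the text
ARCHIVE_PRICE = 0.033   # yuan per GB per month, from the text
GB_PER_TB = 1024        # assumption: binary terabytes


def saving_ratio(old_price: float, new_price: float) -> float:
    """Fraction of the storage cost saved by moving to the cheaper class."""
    return 1 - new_price / old_price


def monthly_saving(size_tb: float, old_price: float, new_price: float) -> float:
    """Absolute monthly saving in yuan for `size_tb` terabytes of data."""
    return size_tb * GB_PER_TB * (old_price - new_price)


ratio = saving_ratio(STANDARD_PRICE, ARCHIVE_PRICE)            # about 0.725
saving = monthly_saving(330, STANDARD_PRICE, ARCHIVE_PRICE)    # about 29,399 yuan per month
```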


4.4.3 Management of data warehouse and data mart buckets

Under the data lake architecture, the HDFS trash directory of EMR is located on an OSS bucket, and HDFS does not purge these trash files periodically. We therefore set a lifecycle rule on the HDFS trash directory to delete the files in it on a regular basis.


4.4.4 Monitoring objects in buckets

OSS supports bucket inventory. It regularly exports information about the objects in a bucket to a specified bucket, which helps us understand object status and simplifies and speeds up workflows and big data tasks. Bucket inventory scans the objects in the bucket on a weekly basis and, after each scan, generates an inventory report in CSV format and stores it in the specified bucket. The report can selectively include object metadata such as file size and encryption status.
We configure bucket inventory to export the CSV files and load them into a Hive table. We then produce regular reports to monitor changes to objects in the bucket, spot abnormal growth and deal with it.
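
The monitoring step can be sketched outside Hive as a comparison of two inventory snapshots. This is an illustration under stated assumptions: the CSV is assumed to carry a header row with `key` and `size` columns (real inventory reports may need their columns mapped first), objects are grouped by their first path segment, and the growth threshold of 2x is a hypothetical value.

```python
import csv
import io
from collections import defaultdict


def size_by_prefix(inventory_csv: str, depth: int = 1) -> dict:
    """Sum object sizes per key prefix (the first `depth` path segments)."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(inventory_csv)):
        prefix = "/".join(row["key"].split("/")[:depth])
        totals[prefix] += int(row["size"])
    return dict(totals)


def abnormal_growth(previous: dict, current: dict, ratio_threshold: float = 2.0) -> list:
    """Return prefixes that are new or grew by at least `ratio_threshold` times."""
    flagged = []
    for prefix, size in current.items():
        before = previous.get(prefix, 0)
        if before == 0 or size / before >= ratio_threshold:
            flagged.append(prefix)
    return sorted(flagged)


# Hypothetical snapshots from two consecutive weekly reports.
last_week = "key,size\nods/a,100\ndw/b,100\n"
this_week = "key,size\nods/a,120\ndw/b,150\ndw/c,150\n"
flagged = abnormal_growth(size_by_prefix(last_week), size_by_prefix(this_week))
```

Here the `dw` prefix tripled in size while `ods` grew by 20%, so only `dw` is flagged for follow-up.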

5. Alibaba Cloud’s second-generation data lake


The first-generation data lake uses EMR as its execution engine and OSS as its storage medium. When our company introduced the Dataphin data middle platform, whose execution engine and storage are both MaxCompute, we ended up with two heterogeneous execution engines, because our current data warehouse runs on EMR. This causes the following problems:

  • Storage redundancy
    EMR stores its data on OSS object storage, while MaxCompute stores its data on Pangu, so storage is duplicated.
  • Inconsistent metadata
    EMR metadata is kept in an external RDS database, while MaxCompute metadata is kept in the MaxCompute metadata store. The two sets of metadata are not unified and cannot be shared.
  • User permissions are not unified
    EMR's user permission system is built on OpenLDAP and Ranger, while MaxCompute uses its own user permission system.
  • Computing cannot flow freely between the lake and the warehouse
    Given the nature of tasks and the billing rules, tasks with high throughput, high complexity and low concurrency suit EMR, while tasks with low throughput, low complexity and high concurrency suit MaxCompute. In addition, we could move Double 11 computing workloads to MaxCompute to relieve the shortage of EMR resources. At present, however, the execution engine cannot be chosen freely.

Alibaba Cloud provides two lakehouse integration solutions. One is based on HDFS storage and maps Hive metadata to MaxCompute by creating external projects. We use the other, which is based on Data Lake Formation (DLF). We will migrate EMR metadata and MaxCompute metadata to DLF and use OSS as the unified underlying storage, connecting the data lake built on EMR with the data warehouse built on MaxCompute so that data and computing can flow freely between the two and the lake and the warehouse are truly integrated. That is the essence of a data lake: unified storage, unified metadata and a freely chosen execution engine.

5.1 Alibaba Cloud Data Lake Formation

Alibaba Cloud Data Lake Formation (DLF) is a fully managed service that helps users quickly build data lakes on the cloud. It provides unified permission management, metadata management and automatic metadata extraction for data lakes on the cloud.
  • Unified data storage
    DLF uses Alibaba Cloud Object Storage Service (OSS) as the unified storage of the data lake on the cloud. Multiple computing engines can serve different big data computing scenarios on the cloud, such as open source big data E-MapReduce, real-time computing, MaxCompute, interactive analytics (Hologres) and Machine Learning PAI, while all of them share one data lake storage scheme, which avoids the complexity and operations cost of data synchronization.
  • Diverse ingestion templates
    DLF can extract data from multiple data sources into the data lake. It currently supports relational databases (MySQL), Alibaba Cloud Log Service (SLS), Alibaba Cloud Tablestore (OTS), Alibaba Cloud Object Storage Service (OSS) and Kafka. Users can specify the storage format to improve computing and storage efficiency.
  • Data lake metadata management
    Users can define the format of data and manage metadata centrally to ensure data quality.

5.2 Alibaba Cloud data lake solution

We mainly use the unified metadata management and unified user permission management of Alibaba Cloud's data lake products. As shown in the figure, EMR and MaxCompute share DLF's metadata, user and permission management.
Data lake system architecture based on DLF
The data flow of the data lake is as follows:
Data flow diagram
  • EMR etlx extracts data from RDS and OSS into the data lake, that is, the ODS layer data lands in the data lake.
  • In the Dataphin data middle platform, the data in the data lake goes through dimensional modeling (the intermediate tables, including logical fact tables and logical dimension tables, use MaxCompute internal tables and do not land in the data lake). The final results of dimensional modeling are written to the CDM or ADS layer on the data lake.
  • EMR or other execution engines run ad hoc queries or scheduled jobs on the ADS layer data in the data lake.

Author: Cheng Junjie
This article is the original content of Alibaba cloud and cannot be reproduced without permission