[Practice case] Applying Databricks DataInsight at Midea HVAC & Building Division

Date: 2021-06-11

Introduction: for more detailed information about Databricks DataInsight, see the product details page: https://www.aliyun.com/produc…

Author
Intelligent Technology Department, Advanced Research Center, Midea HVAC & Building Division

Background of the Midea HVAC IoT data platform

Midea HVAC & Building Division (hereinafter "Midea HVAC") is one of the five major business segments of Midea Group. Its products cover multi-split units, large chillers, unitary air conditioners, room air conditioners, escalators, elevators, freight elevators, building automation software, and integrated building ELV (weak-current) solutions, and are sold in more than 200 countries and regions. At present, the division's device data in the cloud is only being stored: there is no platform for mining its value, so a large amount of data goes to waste while continuously consuming storage resources and driving up storage and maintenance costs. On the other hand, the existing data-driven applications lack a deployment platform and struggle to deliver practical value. A unified, general-purpose IoT data platform is therefore urgently needed to support rapid analysis and modeling of device operating data.

Our IoT data platform is built on Databricks DataInsight, Alibaba Cloud's fully managed Spark product. The figure below shows the overall business architecture. In the following sections, we share some thoughts on the technology selection for the IoT data platform and our practical experience with the Spark technology stack, especially Delta Lake.
[Figure: overall business architecture]

Choosing Spark & Delta Lake

For the compute engine layer of the data platform, since our data team had only just been established, we did a lot of research before settling on an architecture. Taking everything into account, we wanted a mature and unified platform that could support not only data processing and data analysis scenarios but also data science scenarios. In addition, the team members have rich experience with Python and Spark, so we locked onto the Spark technology stack from the start.

Choosing Databricks DataInsight and Delta Lake

Through several technical exchanges and a proof of concept with the Alibaba Cloud computing platform team, we finally chose Alibaba Cloud's Databricks DataInsight product. The commercial Spark engine from the company founded by the original creators of Spark, the fully managed Spark technology stack, and the unification of data engineering and data science were all important reasons for choosing Databricks DataInsight.

Specifically, the core advantages Databricks DataInsight provides are as follows:

· SaaS fully managed Spark: no operations burden and no need to manage underlying resources, which reduces O&M costs and lets the team focus on business analysis
· Complete Spark technology stack integration: one-stop integration of the Spark engine and the Delta Lake data lake, 100% compatible with the open source community edition of Spark, and the fastest way to experience the latest Spark features
· Lower total cost: the commercial versions of Spark and Delta Lake have significant performance advantages; in addition, the compute/storage-separated architecture stores data in Alibaba Cloud OSS object storage accelerated by the JindoFS cache layer, which effectively reduces total cluster cost
· High-quality support and SLA guarantees: Alibaba Cloud and Databricks provide technical support covering the whole Spark stack, with a commercial SLA and 7x24 expert support for Databricks

Overall architecture of the IoT data platform

[Figure: overall architecture of the IoT data platform]

The overall architecture is shown in the figure above.

The IoT data we ingest comes in two parts: historical (stock) data and real-time data. Historical data is currently imported into the Delta Lake table in daily batches from the relational databases of different customers using Spark SQL; real-time data is collected to cloud Kafka through the IoT platform and, after being consumed by Spark Structured Streaming, written into the Delta Lake table in real time. In this process we sink both the real-time data and the historical data into the same Delta table, which greatly simplifies our ETL pipeline (refer to the case section below). Downstream of the data pipeline we connect data analysis and data science workflows.
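
A minimal sketch of the batch side of this pipeline, sketched here with the DataFrame API rather than raw Spark SQL and assuming a hypothetical MySQL source table, credentials, and Delta table path (the real connection parameters and schema differ):

%pyspark
from pyspark.sql.functions import current_date, date_format

# Hypothetical JDBC source holding one customer's historical device data
jdbc_df = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://example-host:3306/hvac")
    .option("dbtable", "device_history")
    .option("user", "reader")
    .option("password", "******")
    .load())

# Append the historical batch into the same Delta table the streaming job writes to
(jdbc_df
    .withColumn("dt", date_format(current_date(), "yyyyMMdd"))
    .write
    .format("delta")
    .mode("append")
    .save("oss://bucket/delta/delta_device_table"))  # hypothetical OSS path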

IoT data collection: from little data to big data

As a typical IoT application, Midea HVAC's core data comes from IoT terminal devices, and countless terminal sensors are distributed throughout the IoT environment. Viewed individually, the data produced by a single sensor is small data (or "little data"); but when all the sensors are connected into a large IoT network, with data from the different sensors collected through gateways and sent to the cloud, it ultimately forms big data in the cloud.

In our scenario, the IoT platform first parses the data of the different protocols and sends the parsed semi-structured JSON data to cloud Kafka over the network via customized gateway hardware. Cloud Kafka serves as the entrance of the whole data pipeline.
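
As an illustration only, here is a hedged sketch of how such a JSON payload could be parsed from Kafka with from_json; the brokers, topic, and schema are hypothetical, and the production job shown later in this article parses a space-delimited payload instead:

%pyspark
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Hypothetical schema for the parsed JSON messages
payload_schema = StructType([
    StructField("uuid", StringType()),
    StructField("device_id", StringType()),
    StructField("indoor_temperature", DoubleType()),
    StructField("electricity", DoubleType()),
])

raw = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-host:9092")  # hypothetical brokers
    .option("subscribe", "hvac-telemetry")                 # hypothetical topic
    .load())

# Decode the Kafka value and expand the JSON fields into columns;
# a writeStream sink would be attached to actually start the query
parsed = (raw
    .select(from_json(col("value").cast("string"), payload_schema).alias("m"))
    .select("m.*"))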

Data into the lake: Delta Lake

Data in IoT scenarios has the following characteristics:

· Time-series data: the records generated by sensors carry timestamps, and the data itself has a time attribute, so different data streams may be correlated; joining different time series with an as-of join is the basis of downstream prediction and analysis (see the sketch after this list)
· Real-time data: sensors generate data continuously and transmit it to the data pipeline with minimal delay, triggering the rule engine, generating alarms and events, and notifying the relevant staff
· Huge data volume: in the IoT network, thousands of devices and their sensors around the world funnel massive amounts of data to the platform through the access service
· Diverse data protocols: the device types accessing the IoT platform differ, so the upload protocols are diverse and the data encoding formats are not unified
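
A hedged sketch of such an as-of join between two tiny hypothetical time series (temperature readings and power readings). Spark has no built-in as-of join, so this version uses a non-equi join plus a window; the table and column names are illustrative only:

%pyspark
from pyspark.sql import Window
from pyspark.sql.functions import col, expr, row_number

# Tiny hypothetical time series: temperature readings and power readings per device
temps = spark.createDataFrame(
    [("dev1", 10, 21.5), ("dev1", 20, 22.0)],
    ["device_id", "ts", "indoor_temperature"])
power = spark.createDataFrame(
    [("dev1", 8, 3.2), ("dev1", 15, 3.6)],
    ["device_id", "ts", "electricity"])

# For every temperature reading, attach the latest power reading at or before its timestamp
joined = temps.alias("a").join(
    power.alias("b"),
    expr("a.device_id = b.device_id AND b.ts <= a.ts"),
    "left")

w = Window.partitionBy(col("a.device_id"), col("a.ts")).orderBy(col("b.ts").desc())
as_of = (joined
    .withColumn("rn", row_number().over(w))
    .filter(col("rn") == 1)
    .select("a.*", col("b.electricity"), col("b.ts").alias("power_ts")))

as_of.show()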

These characteristics of IoT data bring many challenges to data processing, data analysis, and data science. Fortunately, they can be handled well with Spark and Delta Lake: Delta Lake provides ACID transaction guarantees and supports incremental table updates with simultaneous streaming and batch writes, and with Spark Structured Streaming, IoT time-series data can enter the lake in real time.

The following is Delta Lake's classic three-level data table architecture. For Midea HVAC's IoT data scenario, we define the tables at each level as follows:
[Figure: Delta Lake three-level (bronze/silver/gold) table architecture]

· Bronze table: stores the raw data. After data is consumed from Kafka via Spark Structured Streaming, it is written into this Delta Lake table, which is the single source of truth
· Silver table: an intermediate table built by processing the data in the bronze table. In Midea HVAC's scenario, the processing involves some complex time-series calculation logic, which is packaged in pandas UDFs and handed to Spark for computation (see the sketch after this list)
· Gold table: data from the silver table is subjected to schema constraints and further cleaning before being imported into the gold table, which serves downstream ad hoc queries and data science
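
A hedged sketch of the kind of per-device time-series processing done when building the silver table, written here with the grouped-map flavor of pandas UDFs (applyInPandas); the sample data, columns, and smoothing logic are hypothetical stand-ins for the actual calculation logic:

%pyspark
import pandas as pd

# Tiny hypothetical bronze sample; the real bronze table is read from Delta Lake
bronze = spark.createDataFrame(
    [("dev1", "20210601", 21.0), ("dev1", "20210602", 23.0), ("dev1", "20210603", 22.0)],
    ["device_id", "dt", "indoor_temperature"])

# Per-device rolling mean over the temperature series (stand-in for the real logic)
def smooth_device(pdf: pd.DataFrame) -> pd.DataFrame:
    pdf = pdf.sort_values("dt")
    pdf["indoor_temperature_smooth"] = (
        pdf["indoor_temperature"].rolling(window=2, min_periods=1).mean())
    return pdf

silver = bronze.groupBy("device_id").applyInPandas(
    smooth_device,
    schema=bronze.schema.add("indoor_temperature_smooth", "double"))

# The real pipeline would write this back to the silver Delta table, e.g.:
# silver.write.format("delta").mode("overwrite").save("oss://bucket/delta/silver_device_table")
silver.show()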

Data analysis: ad hoc query

Our internal SQL query and data visualization platform is customized on top of open source Superset. We connect to the Databricks DataInsight Spark Thrift Server through PyHive and can submit SQL to the cluster. The commercial version of the Thrift Server has been enhanced in availability and performance, and Databricks DataInsight provides LDAP-based user authentication to secure JDBC connections. With Superset, data analysts and data scientists can quickly and efficiently explore the Delta Lake tables.
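
A minimal sketch of that PyHive connection from the analysis side, assuming a hypothetical Thrift Server host, LDAP account, and gold table name:

from pyhive import hive

# Hypothetical connection parameters for the DataInsight Spark Thrift Server
conn = hive.connect(
    host="thrift-server.example.internal",
    port=10000,
    username="analyst",
    password="******",
    auth="LDAP",
)

cursor = conn.cursor()
cursor.execute(
    "SELECT device_id, avg(indoor_temperature) FROM delta_gold_table GROUP BY device_id LIMIT 10")
for row in cursor.fetchall():
    print(row)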

Data science: workspace

Building energy consumption prediction and equipment fault diagnosis and prediction are the two main business objectives of the Midea HVAC IoT big data platform. Downstream of the IoT data pipeline, a machine learning platform needs to be connected. At this stage, to support data science scenarios more quickly and conveniently, we connect the Databricks DataInsight cluster to Alibaba Cloud's data development platform DDC. DDC integrates Jupyter Notebook, which is friendlier for data science work: using PySpark in Jupyter, jobs can be run on the DataInsight cluster, and jobs can also be scheduled with Apache Airflow. Considering the basic steps of machine learning model construction, iterative training, metric monitoring, and deployment, we are also exploring MLOps, which is still in preparation.
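
A hedged sketch of how such a job could be scheduled with Airflow; the DAG id, schedule, and spark-submit command are hypothetical, and the actual way jobs are submitted to the DataInsight cluster may differ:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily DAG that submits a PySpark job to the cluster
with DAG(
    dag_id="hvac_silver_to_gold",
    start_date=datetime(2021, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build_gold = BashOperator(
        task_id="build_gold_table",
        bash_command="spark-submit --master yarn /opt/jobs/build_gold_table.py",
    )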

Introduction to typical application scenarios

Delta Lake data ingestion (unified stream and batch)

Defining the upsert function that merges streaming micro-batches into the Delta Lake table

%spark
import org.apache.spark.sql._
import io.delta.tables._

// Function to upsert `microBatchOutputDF` into the Delta table using MERGE
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {
  // Register the micro-batch DataFrame as a temporary view
  microBatchOutputDF.createOrReplaceTempView("updates")
  // Use the view name to apply MERGE
  // NOTE: you have to use the SparkSession that was used to define the `updates` DataFrame
  microBatchOutputDF.sparkSession.sql(s"""
    MERGE INTO delta_{table_name} t
    USING updates s
    ON s.uuid = t.uuid
    WHEN MATCHED THEN UPDATE SET
      t.device_id = s.device_id,
      t.indoor_temperature = s.indoor_temperature,
      t.ouoor_temperature = s.ouoor_temperature,
      t.chiller_temperature = s.chiller_temperature,
      t.electricity = s.electricity,
      t.protocal_version = s.protocal_version,
      t.dt = s.dt,
      t.update_time = current_timestamp()
    WHEN NOT MATCHED THEN INSERT
      (t.uuid, t.device_id, t.indoor_temperature, t.ouoor_temperature, t.chiller_temperature,
       t.electricity, t.protocal_version, t.dt, t.create_time, t.update_time)
    VALUES
      (s.uuid, s.device_id, s.indoor_temperature, s.ouoor_temperature, s.chiller_temperature,
       s.electricity, s.protocal_version, s.dt, current_timestamp(), current_timestamp())
  """)
}

Writing to Delta Lake in real time with Spark Structured Streaming

%spark

import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

def getquery(checkpoint_dir: String, tableName: String, servers: String, topic: String) {
  // Read the raw stream from Kafka
  var streamingInputDF =
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", servers)
      .option("subscribe", topic)
      .option("startingOffsets", "latest")
      .option("minPartitions", "10")
      .option("failOnDataLoss", "true")
      .load()

  // Split the message payload into columns
  val resDF = streamingInputDF
    .select(col("value").cast("string"))
    .withColumn("newMessage", split(col("value"), " "))
    .filter(col("newMessage").getItem(7).isNotNull)
    .select(
      col("newMessage").getItem(0).as("uuid"),
      col("newMessage").getItem(1).as("device_id"),
      col("newMessage").getItem(2).as("indoor_temperature"),
      col("newMessage").getItem(3).as("ouoor_temperature"),
      col("newMessage").getItem(4).as("chiller_temperature"),
      col("newMessage").getItem(5).as("electricity"),
      col("newMessage").getItem(6).as("protocal_version")
    )
    .withColumn("dt", date_format(current_date(), "yyyyMMdd"))

  // Write the stream to Delta Lake, merging each micro-batch via upsertToDelta
  val query = resDF
    .writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_dir)
    .trigger(Trigger.ProcessingTime("60 seconds")) // time interval for stream processing
    .foreachBatch(upsertToDelta _)                 // reference the upsertToDelta function
    .outputMode("update")

  query.start()
}

Data disaster recovery: deep clone

Only real-time data is ingested into Delta Lake through the streaming path; the historical stock data is sunk into the Delta Lake table in a one-off batch with Spark SQL. In this way we maintain a single Delta table across both stream and batch processing, so the two parts of the data only need to be merged once at the beginning. To ensure high data safety, we use Databricks DEEP CLONE for data disaster recovery, refreshing a standby replica table once a day. For each day's new data, DEEP CLONE only inserts new rows and updates the rows that changed, which greatly improves execution efficiency.

CREATE OR REPLACE TABLE delta.delta_{table_name}_clone
DEEP CLONE delta.delta_{table_name};

Performance optimization: OPTIMIZE & Z-Ordering

Stream processing generates a large number of small files, and too many small files seriously degrade read performance. Delta Lake provides the OPTIMIZE command, which merges and compacts small files. In addition, because our ad hoc queries touch multiple dimensions of a single table, the Z-Ordering mechanism provided by Delta Lake lets us effectively improve query performance and thus greatly speed up table reads. Delta Lake also offers an auto-optimize option, but it sacrifices a small amount of write performance and increases the latency of writes to the Delta table. By contrast, running the OPTIMIZE command does not affect write performance, because Delta Lake supports MVCC and allows writes to proceed concurrently while OPTIMIZE is running. We therefore trigger OPTIMIZE periodically: we merge small files with OPTIMIZE once an hour and run VACUUM to clean up expired data files.

OPTIMIZE delta.delta_{table_name} ZORDER BY device_id, indoor_temperature;
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM delta.delta_{table_name} RETAIN 1 HOURS;


Summary and outlook

Based on the commercial Spark and Delta Lake technology stack provided by Alibaba Cloud's Databricks DataInsight product, we quickly built an IoT data processing platform. The fully managed, O&M-free service, the performance of the commercial engine, and the compute/storage-separated architecture have reduced our overall costs. At the same time, the rich features of the DataInsight product greatly improve our data team's productivity and lay the foundation for fast development and delivery of the data analysis business. Going forward, Midea HVAC hopes to work with the Alibaba Cloud Databricks DataInsight team to produce more advanced industry solutions for IoT scenarios.

This article is the original content of Alibaba cloud and cannot be reproduced without permission.