Abstract: Dirty data has a serious impact on the correctness of data computation. We therefore need a method that guarantees the reliability and correctness of Spark writes to Elasticsearch.
The combination of Spark and Elasticsearch (ES) has been a popular big data solution in recent years: one is an excellent distributed computing engine, the other an excellent search engine. More and more mature solutions built on them have appeared in industry products, including the familiar Spark + ES + HBase log analysis platform.
At present, Huawei's DLI service fully supports Spark/Flink cross-source access to Elasticsearch. In this article, we discuss the classic distributed consistency problem in this context.
Distributed consistency problem
Data fault tolerance is one of the main problems for a big data computing engine. Mainstream open source engines such as Apache Spark and Apache Flink have fully implemented exactly-once semantics, ensuring the correctness of internal data processing. However, when computation results are written to an external data source, the diversity of external data source architectures and access methods means that no unified solution exists to guarantee consistency (we call this the consistency problem of the sink operator). In addition, ES itself has no transaction support, so how to guarantee consistency when writing data to ES has become a hot topic.
Let's illustrate with a simple example. In Figure 1, a Spark RDD (assume a single task here) is shown with each blue line representing 1 million records; 10 blue lines represent 10 million records ready to be written into an index of CSS (Huawei Cloud Search Service, an internal ES). During the write, the system fails, so only half of the data (5 million records) is written successfully.
A task is the smallest unit of execution in Spark; if a task fails, it must be executed again. So when we rerun the write operation (Figure 2; this time the same 10 million records are shown in red), the 5 million records left behind by the previous failure still exist (the blue lines) and become dirty data. Dirty data has a serious impact on the correctness of data computation, so we need a method that guarantees the reliability and correctness of Spark writes to ES.
Figure 1: part of the data is written to ES when the Spark task fails
Figure 2: data written in the previous attempt becomes dirty after the task is successfully retried
1. Overwrite
From the figures above we can see quite intuitively that before each task inserts data, it should clear the data already in the ES index. Each write operation can then be regarded as the combination of the following three steps:
- Step 1: check whether the current index contains data
- Step 2: clear the data in the current index
- Step 3: write the data to the index
In other words, no matter whether a write was performed before, or how many times it was attempted, we only need the current write to complete independently and correctly. This property is called idempotence.
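The three steps above can be sketched in a few lines. This is a minimal, language-agnostic simulation (in Python, with a plain dict standing in for the ES index; the `overwrite` helper is hypothetical, not an ES or Spark API) that only illustrates why the operation is idempotent:

```python
# Minimal simulation of the three-step overwrite. The "index" is modeled
# as a plain dict; ES specifics are intentionally omitted.

def overwrite(index, records):
    """Clear the index, then write all records. Safe to re-run."""
    # Step 1: check whether the current index contains data
    if index:
        # Step 2: clear the data in the current index
        index.clear()
    # Step 3: write the data to the index
    for doc_id, doc in records:
        index[doc_id] = doc
    return index

index = {}
batch = [(1, "John"), (2, "Bob")]

overwrite(index, batch)           # first (possibly partial) attempt
result = overwrite(index, batch)  # retry: identical final state
print(sorted(result.items()))     # [(1, 'John'), (2, 'Bob')]
```

Running the write once or ten times leaves the index in exactly the same state, which is the idempotence the text describes.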
Idempotent writing is a common way to solve the consistency problem of the big data sink operator; another approach is called final (eventual) consistency. The simplest idempotent approach is "insert overwrite": when a Spark write to ES fails and is retried, the residual data in the index is covered by the overwrite.
Figure: with overwrite mode, the previous data is overwritten when the task is retried
In DLI, you can set the mode to Overwrite in the DataFrame interface to implement overwrite writes to ES:
```scala
val dfWriter = sparkSession.createDataFrame(rdd, schema)

// Write data to ES, overwriting whatever the index already contains
dfWriter.write
  .format("es")
  .option("es.resource", resource)
  .option("es.nodes", nodes)
  .mode(SaveMode.Overwrite)
  .save()
```
You can also use SQL statements directly:
```scala
// Insert data into ES, overwriting the existing contents of the index
sparkSession.sql("insert overwrite table es_table values(1, 'John'),(2, 'Bob')")
```
2. Final consistency
There is a big drawback to solving fault tolerance with overwrite: if correct data already exists in ES and this write only needs to append to it, the overwrite destroys all the correct data previously in the index.
For example, suppose multiple tasks write data concurrently. If one task fails while the others succeed, re-executing the failed task in overwrite mode will wipe out the data the other tasks wrote successfully. As another example, in a streaming scenario, overwriting the index with every batch of data is clearly unreasonable.
Figure: Spark appends data to ES
Figure: if overwrite is used, the original correct data is overwritten
In fact, all we want is to clean up the dirty data, not all the data in the index. So the core problem becomes: how do we identify dirty data? Other databases point the way. MySQL has the syntax insert ignore into: if there is a primary key conflict, that row is ignored; otherwise a normal insert is performed. In this way, the granularity of overwriting is refined to the row level.
Does ES have a similar capability? If every record in ES had a primary key that triggers an overwrite on conflict (either ignoring or overwriting would solve our problem), then when a task fails and retries, only the dirty data would be covered.
Let's look at a comparison between concepts in Elasticsearch and a relational database.
We know that the primary key in MySQL is the unique identifier of a row. As the comparison shows, a row corresponds to a document in ES. So does a document have a unique identifier?
The answer is yes. Every document has an ID, the doc_id. The doc_id is configurable, and the combination of index, type, and doc_id uniquely identifies a document. Moreover, when inserting into ES, if the index, type, and doc_id are all the same, the original document is overwritten. Therefore the doc_id gives us the equivalent of MySQL's "insert, ignoring primary key conflicts", or more precisely an "overwrite on doc_id conflict" function.
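The doc_id semantics just described can be sketched with the same dict-as-index model as before (a hedged Python simulation; `upsert` is a hypothetical helper, not an ES API):

```python
# Sketch of doc_id semantics: writing a document whose doc_id already
# exists replaces it in place, so a retried write converges on the same
# final state instead of duplicating records.

def upsert(index, docs):
    """Insert documents keyed by doc_id; conflicts overwrite in place."""
    for doc_id, doc in docs:
        index[doc_id] = doc  # same (index, type, doc_id) => overwrite
    return index

index = {}
batch = [(1, "John"), (2, "Bob")]

# A failed first attempt leaves partial ("dirty") data behind...
upsert(index, batch[:1])
# ...but the retry overwrites it row by row instead of duplicating it.
upsert(index, batch)
print(len(index))  # 2 documents, no dirty leftovers
```

Contrast this with an ID-less append, where the retry would leave 3 records in the index instead of 2.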
Accordingly, the SQL syntax of DLI provides the configuration item es.mapping.id to specify a field as the document ID, for example:
```sql
create table es_table(id int, name string) using es options(
  'es.nodes' 'localhost:9200',
  'es.resource' '/mytest/anytype',
  'es.mapping.id' 'id')
```
Here the field "id" is designated as the doc_id of ES: when data is inserted, the value of "id" becomes the ID of the inserted document. Note that the value of "id" must be unique, otherwise documents with the same "id" will overwrite each other.
With this in place, if a job or task fails, you can simply execute it again. Once the job finally succeeds, no residual dirty data remains in ES; that is, final consistency is achieved.
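The "just run it again" behavior can be sketched end to end. This is a hedged Python simulation: the failure injection (`flaky_write` raising partway through) is artificial and exists only to show that idempotent doc_id writes converge under retries.

```python
# Retry-until-success with idempotent doc_id writes: a write that fails
# partway leaves dirty data, but because each record lands under a fixed
# doc_id, re-running the whole write converges to the correct final state.

def flaky_write(index, docs, fail_after):
    """Write docs by doc_id; raise after `fail_after` records to mimic a crash."""
    for n, (doc_id, doc) in enumerate(docs):
        if n == fail_after:
            raise RuntimeError("task failed mid-write")
        index[doc_id] = doc

def run_until_success(index, docs, failures):
    """Re-run the same write until it succeeds; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        try:
            # fail partway on the first `failures` attempts, then succeed
            fail_at = 1 if attempts <= failures else len(docs)
            flaky_write(index, docs, fail_after=fail_at)
            return attempts
        except RuntimeError:
            continue  # task retry: just run the same write again

index = {}
docs = [(1, "John"), (2, "Bob"), (3, "Ann")]
attempts = run_until_success(index, docs, failures=2)
print(attempts, sorted(index))  # 3 [1, 2, 3]
```

However many attempts it takes, the final index holds exactly the intended documents and nothing else.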
Figure: setting the primary key as the doc_id when inserting data, using idempotent inserts to achieve final consistency
This article can be summed up as "using the doc_id to achieve final consistency when writing to ES". In fact, this problem does not require so much exploration: in the native ES API, the doc_id is specified when inserting data, which should be basic common knowledge (for a detailed API description, see https://www.elastic.co/guide/…).
Figure: ES uses the bulk interface to write data
Thanks to the BASE theory, final consistency has become one of the most important approaches to consistency in distributed computing. This solution has its limitations (for example, the data must have a primary key), and the industry offers many other distributed consistency solutions (such as 2PC and 3PC); but weighing the workload against the final effect, I think final consistency is a very effective and simple choice.
Extended reading: Elasticsearch DataSource
DataSource is a unified interface provided by Apache Spark for accessing external data sources. Spark uses an SPI mechanism for plug-in management of DataSources, so you can customize the Elasticsearch access logic through Spark's DataSource module.
Huawei Cloud's DLI (Data Lake Insight) service fully implements the ES DataSource capability. Users can access ES from Spark through simple SQL statements or the Spark DataFrame API.
Details on accessing ES through Spark can be found in the official DLI documentation: https://support.huaweicloud.com/usermanual-dli/dli_01_0410.html (Elasticsearch here is provided by the Huawei Cloud CSS cloud search service).
You can use the Spark DataFrame API to read and write data:
```scala
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Initialization settings

// Set the /index/type of ES (ES 6.x does not support multiple types in the
// same index, and 7.x does not support setting a type)
val resource = "/mytest/anytype"

// Set the ES connection address (format: "node1:port,node2:port,...").
// Thanks to the ES replica mechanism, only one address needs to be
// configured even when accessing an ES cluster.
val nodes = "localhost:9200"

// Construct the data
val schema = StructType(Seq(
  StructField("id", IntegerType, false),
  StructField("name", StringType, false)))
val rdd = sparkSession.sparkContext.parallelize(Seq(Row(1, "John"), Row(2, "Bob")))
val dfWriter = sparkSession.createDataFrame(rdd, schema)

// Write data to ES
dfWriter.write
  .format("es")
  .option("es.resource", resource)
  .option("es.nodes", nodes)
  .mode(SaveMode.Append)
  .save()

// Read data from ES
val dfReader = sparkSession.read
  .format("es")
  .option("es.resource", resource)
  .option("es.nodes", nodes)
  .load()
dfReader.show()
```
You can also use Spark SQL to access ES:
```scala
// Create a Spark temporary table associated with the ES /index/type;
// the table itself stores no actual data
val sparkSession = SparkSession.builder().getOrCreate()
sparkSession.sql("create table es_table(id int, name string) using es options(" +
  "'es.nodes' 'localhost:9200', 'es.resource' '/mytest/anytype')")

// Insert data into ES
sparkSession.sql("insert into es_table values(1, 'John'),(2, 'Bob')")

// Read data from ES
val dataFrame = sparkSession.sql("select * from es_table")
dataFrame.show()
```