Data Lake – basic operation of Delta Lake

Time: 2021-12-31

1. Brief introduction to the data lake

1.1 official website

https://delta.io/

The official website shows the following architecture diagram:

datalake.png

1.2 features:

1. No restriction on format: data in any format can flow into the lake.
2. Centralized storage with access from everywhere.
3. High-performance analysis capability: with engines such as Spark, MapReduce and Spark SQL, massive amounts of data can be analyzed.
4. Raw data storage.
5. A data lake is a large repository that stores an enterprise's raw data of every kind, where the data can be accessed, processed, analyzed and transferred.

1.3 comparison of data lake, data warehouse and data mart

| Comparison | Data warehouse | Data mart | Data lake |
| --- | --- | --- | --- |
| Scope of application | Whole company | Department or group | Whole company |
| Data type | Structured data processing | Structured data processing | Processing of data in any format |
| Storage scale | Large | Medium (a small warehouse) | Massive |
| Data application | Dimensional modeling, metric analysis | Small-scale data analysis | Analysis of massive data in any format; unlimited application types |
| New application development cycle | Long | Long | Short |

1.4 write-time mode (schema-on-write)

Before writing data, you must first define the schema of the data, and the data is then written according to that schema definition.
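A minimal schema-on-write sketch in the Spark shell; the path and column names below are illustrative, not taken from the original text:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Schema-on-write: the structure is declared before any data is written,
// and the write must conform to it.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

val rows = spark.sparkContext.parallelize(Seq(Row(1L, "a"), Row(2L, "b")))
val df = spark.createDataFrame(rows, schema)

// The data lands on disk already shaped by the schema defined above.
df.write.mode("overwrite").parquet("/tmp/schema-on-write-demo")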

1.5 read-time mode (schema-on-read)

When writing data, you do not need to define a schema; a schema is only applied when the data is used. The write-time mode and the read-time mode are two completely different ways of handling data, and the data lake is a concrete embodiment of the read-time (schema-on-read) idea.
1. Compared with the write-time mode, the read-time mode defines the model structure (schema) only when the data is used, which makes the data model more flexible to define and lets different upper-layer applications meet their own efficient analysis requirements.
2. With the write-time mode, changing the schema afterwards is costly.
3. With the read-time mode, the schema is defined at the moment of use, which is very flexible: the same data set can be read with different schemas to obtain different results (see the sketch after this list).
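A minimal schema-on-read sketch in the Spark shell, assuming the raw data is stored as plain JSON text; the paths and field names are illustrative:

import spark.implicits._
import org.apache.spark.sql.types._

// The raw records are stored untouched, with no schema imposed at write time.
Seq("""{"id":1,"name":"a","city":"x"}""",
    """{"id":2,"name":"b","city":"y"}""").toDS
  .write.mode("overwrite").text("/tmp/raw-events")

// One consumer only needs the id column.
val idOnly = StructType(Seq(StructField("id", LongType, true)))
spark.read.schema(idOnly).json("/tmp/raw-events").show()

// Another consumer applies a richer schema to exactly the same files.
val full = StructType(Seq(
  StructField("id", LongType, true),
  StructField("name", StringType, true),
  StructField("city", StringType, true)
))
spark.read.schema(full).json("/tmp/raw-events").show()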

1.6 advantages of the data lake

1. Easy data collection (schema-on-read): one big difference between a data lake and a data warehouse is schema-on-read, that is, schema information is only required when the data is used; a data warehouse is schema-on-write, that is, the schema must be designed when the data is stored. Since there are no restrictions on what can be written, a data lake can collect data more easily.
2. No need to care about the data structure: there are no restrictions on how data is stored; data in any format can be stored as long as it can be analyzed later.
3. All data is shared (centralized storage): multiple business units or researchers can use all of the data. In the past, aggregating and summarizing data was troublesome because the data was spread across different systems.
4. More value can be discovered from the data (analysis capability): a data warehouse or data mart can only answer pre-defined questions because it keeps only some attributes of the data, whereas a data lake stores all of the original, detailed data and can therefore answer more questions. Moreover, the data lake allows the various roles in an organization to analyze the data with self-service analysis tools (MapReduce, Spark, Spark SQL, etc.) and to apply AI and machine learning to extract more value from the data.
5. Better scalability and agility: a data lake can store data on a distributed file system, so it scales well, and using open-source technology also reduces storage costs. Because the structure of a data lake is not as strict, it is naturally more flexible, which improves agility.

1.7 data lake requirements

1. Security: centralized data storage places higher demands on data security and requires stricter permission control.
2. Scalability: as the business grows and the data volume increases, the data lake system must be able to expand its capacity on demand.
3. Reliability: as a centralized data store, reliability is critical; the system cannot be failing every few days.
4. Throughput: as storage for massive amounts of data, the data lake must support high data throughput.
5. Storage in the original format: the data lake is defined as the centralized repository of all raw data, so the data stored in the lake is unmodified, original data.
6. Support for multiple data sources: no data type is excluded; any data can be written.
7. Support for multiple analysis frameworks: because the data comes in many formats and is not all structured, the lake must support multiple frameworks for extracting and analyzing its data, including but not limited to batch, real-time, streaming, machine learning and graph computing.

1.8 principles of the data lake

1. Separation of data and business
2. Separation of storage and compute (optional; more applicable to cloud platforms)
3. Lambda architecture vs. Kappa architecture vs. IOTA architecture
4. The importance of management services and choosing the appropriate tools
    4.1 security (Kerberos)
    4.2 permissions (Ranger)

2. Basic operation of Delta Lake

2.1 features of Delta Lake

1. ACID transaction control: Delta Lake brings ACID transactions to your data lake. It provides serializability, the strongest isolation level.
2. Scalable metadata handling: Delta Lake easily handles petabyte-scale tables with billions of partitions and files.
3. Data version control: Delta Lake provides data snapshots so that developers can access and restore earlier versions of the data for audits, rollbacks or reproducing experiments (see the sketch after this list).
4. Open data format: all data in Delta Lake is stored in the Apache Parquet format, so Delta Lake benefits from Parquet's efficient compression and encoding schemes.
5. Unified source and sink for batch and streaming: a Delta Lake table is both a batch table and a streaming source and sink.
6. Schema enforcement: Delta Lake can specify and enforce a schema. This helps ensure that data types are correct and that required columns exist, preventing bad data from corrupting the table.
7. Schema evolution: big data changes constantly. Delta Lake lets you change a table's schema, applied automatically, without cumbersome DDL.
8. Audit history: the Delta Lake transaction log records detailed information about every change made to the data, providing a complete audit trail.
9. Updates and deletes: Delta Lake supports Scala/Java APIs to merge, update and delete data sets.
10. 100% compatible with the Apache Spark API.
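A minimal sketch of data version control (time travel) and the audit history, assuming the /tmp/delta-table02 table created in section 2.4 below already has a few commits:

import io.delta.tables._

// Every write, overwrite, update, delete or merge creates a new table version
// recorded in the transaction log (_delta_log).
val deltaTable = DeltaTable.forPath("/tmp/delta-table02")

// Audit history: one row per commit, with timestamp, operation and parameters.
deltaTable.history().show()

// Time travel: read the table as it looked at an earlier version.
spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table02").show()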

2.2 Delta Lake operation in the Spark Scala shell (requires Spark version >= 2.4.2)

bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0

The operation is shown below:

[[email protected] spark-2.4.7-bin-hadoop2.7]# bin/spark-shell --packages io.delta:delta-core_2.11:0.5.0
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/opt/module/spark-2.4.7-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
io.delta#delta-core_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-811cb329-3b4b-4a62-ab7a-d2287a1901dc;1.0
    confs: [default]
    found io.delta#delta-core_2.11;0.5.0 in central
    found org.antlr#antlr4;4.7 in central
    found org.antlr#antlr4-runtime;4.7 in central
    found org.antlr#antlr-runtime;3.5.2 in central
    found org.antlr#ST4;4.0.8 in central
    found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in central
    found org.glassfish#javax.json;1.0.4 in central
    found com.ibm.icu#icu4j;58.2 in central
:: resolution report :: resolve 376ms :: artifacts dl 6ms
    :: modules in use:
    com.ibm.icu#icu4j;58.2 from central in [default]
    io.delta#delta-core_2.11;0.5.0 from central in [default]
    org.abego.treelayout#org.abego.treelayout.core;1.0.3 from central in [default]
    org.antlr#ST4;4.0.8 from central in [default]
    org.antlr#antlr-runtime;3.5.2 from central in [default]
    org.antlr#antlr4;4.7 from central in [default]
    org.antlr#antlr4-runtime;4.7 from central in [default]
    org.glassfish#javax.json;1.0.4 from central in [default]
    ---------------------------------------------------------------------
    |                  |            modules            ||   artifacts   |
    |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
    ---------------------------------------------------------------------
    |      default     |   8   |   0   |   0   |   0   ||   8   |   0   |
    ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-811cb329-3b4b-4a62-ab7a-d2287a1901dc
    confs: [default]
    0 artifacts copied, 8 already retrieved (0kB/9ms)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/06/09 23:27:26 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark context Web UI available at http://master01.pxx.com:4041
Spark context available as 'sc' (master = local[*], app id = local-1623252446880).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/
         
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_251)
Type in expressions to have them evaluated.
Type :help for more information.


2.3 command from the official website:

bin/spark-shell --packages io.delta:delta-core_2.12:1.0.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
In fact, you can simply use bin/spark-shell --packages io.delta:delta-core_2.12:1.0.0
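With the DeltaSparkSessionExtension and DeltaCatalog configured as in the command above (Spark 3.x with Delta 1.0.0), Delta tables can also be created and queried through SQL. A hedged sketch; the table name and location below are illustrative:

// SQL DDL/DML on a Delta table, enabled by the extension and catalog settings above.
spark.sql("CREATE TABLE IF NOT EXISTS demo_events (id LONG) USING DELTA LOCATION '/tmp/delta-sql-demo'")
spark.sql("INSERT INTO demo_events VALUES (1), (2), (3)")
spark.sql("SELECT * FROM demo_events").show()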

2.4 following the steps on the official website

1. Create a table and read the table

scala> val data = spark.range(0, 5)
data: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> data.write.format("delta").save("/tmp/delta-table02")                                                                         
scala> spark.read.format("delta").load("/tmp/delta-table02").toDF.show()
+---+
| id|
+---+
|  2|
|  0|
|  4|
|  3|
|  1|
+---+
scala> 
2. Overwrite the table
scala> val data01 = spark.range(5,10)
data01: org.apache.spark.sql.Dataset[Long] = [id: bigint]
scala> data01.write.format("delta").mode("overwrite").save("/tmp/delta-table02")                                                                             
scala> spark.read.format("delta").load("/tmp/delta-table02").toDF.show()
+---+
| id|
+---+
|  8|
|  7|
|  5|
|  6|
|  9|
+---+
scala> 

3. Delta Lake provides a programming API for conditionally updating, deleting, and merging (upsert) data into tables

scala> import io.delta.tables._
import io.delta.tables._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val deltaTable = DeltaTable.forPath("/tmp/delta-table02")
deltaTable: io.delta.tables.DeltaTable = [email protected]

// Update: add 100 to every even value
scala> deltaTable.update(condition=expr("id % 2 ==0"), set = Map("id"->expr("id+100")))
                                                                                
scala> spark.read.format("delta").load("/tmp/delta-table02").toDF.show()
+---+                                                                           
| id|
+---+
|106|
|  7|
|  5|
|108|
|  9|
+---+

// Delete every even value
scala> deltaTable.delete(condition = expr("id % 2 ==0"))
                                                                                
scala> spark.read.format("delta").load("/tmp/delta-table02").toDF.show()
+---+                                                                           
| id|
+---+
|  7|
|  5|
|  9|
+---+


scala> val newData = spark.range(0,20).toDF
newData: org.apache.spark.sql.DataFrame = [id: bigint]
// Merge (upsert) the new data into the existing table

scala> deltaTable.as("oldData").merge(newData.as("newData"),"oldData.id=newData.id").whenMatched.update(Map("id" -> col("newData.id"))).whenNotMatched.insert(Map("id" ->col("newData.id"))).execute()
[Stage 86:===================================>                 (135 + 51) / 200]21/06/09 23:51:11 WARN hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,014 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
... (the same MemoryManager warning repeats as the number of writers scales up to 20 and back down; omitted) ...
                                                                                
scala> deltaTable.toDF.show()
+---+                                                                           
| id|
+---+
|  0|
|  2|
|  6|
|  1|
| 10|
| 11|
| 15|
| 12|
|  4|
| 19|
| 14|
|  5|
|  9|
| 13|
|  8|
| 18|
| 16|
|  7|
|  3|
| 17|
+---+
scala>
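Beyond the batch operations above, the same table can also serve as a streaming source and sink (feature 5 in section 2.1). A minimal sketch; the checkpoint and target paths are illustrative:

// Read the Delta table as a stream and continuously copy it into another Delta table.
val stream = spark.readStream.format("delta").load("/tmp/delta-table02")

val query = stream.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta-stream-checkpoint")
  .start("/tmp/delta-table-stream-copy")

// Stop the query when done experimenting:
// query.stop()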
