Use GooseFS to accelerate big data computing services on Tencent Cloud EMR

Time:2022-5-7

GooseFS is a high-performance, highly available and elastic distributed cache system recently launched by the Tencent Cloud Object Storage team. Building on the cost advantage of Cloud Object Storage (COS) as the storage base of a data lake, GooseFS provides a unified data lake entry point for computing applications in the data lake ecosystem, and can accelerate workloads such as massive data analysis and machine learning that run on Tencent Cloud object storage. This article introduces how to use GooseFS to accelerate big data computing tasks on Tencent Cloud EMR.

GooseFS is a storage accelerator built by the Tencent Cloud Object Storage team for next-generation cloud-native data lake scenarios. It implements a Hadoop-compatible file system interface comparable to that of HDFS, and provides:

  • A highly reliable, elastic and scalable distributed read/write cache service;
  • Memory-level access performance through data locality;
  • Read/write cache policies at namespace granularity, and preheating at Hive table level;
  • A Ranger authorization mechanism consistent with HDFS;
  • AZ-level accelerated access to object storage and high-QPS metadata access;
  • Rapid, out-of-the-box deployment.

Based on Tencent Cloud EMR, this article shows how to quickly deploy GooseFS to accelerate big data analysis tasks on the cloud.

Accelerate Tencent Cloud EMR big data computing tasks

To use GooseFS to accelerate big data computing tasks on Tencent Cloud EMR, deploy and configure GooseFS in the EMR environment by following the official documentation (https://cloud.tencent.com/doc…). This turns on the cache acceleration capability of GooseFS. The following sections demonstrate that acceleration with data warehouse and iterative computing scenarios.

Accelerate data warehouse queries based on Hive, Spark SQL and Presto

The data warehouse workloads of many big data customers show a clear hot/cold cycle. For example, a customer may generate a report from the data warehouse every day, with the Hive table partitioned by date.

GooseFS integrates Hive table metadata management and can preheat data at Hive table and partition granularity. Users can configure workflow tasks that preheat (load) the relevant tables and partitions every day during idle hours. This reduces bandwidth consumption when queries peak, and serves the data with memory-level cache acceleration during the peak access period.

After a hot table or partition turns cold, use the free command to release it from the cache.
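
The daily load-then-free cycle described above could be scripted roughly as follows. This is a minimal sketch, not an official workflow: the database name sales_db, the table name daily_report, the partition column dt and the partition-spec format dt=YYYY-MM-DD are all assumptions made for illustration, and the helper names run and rotate_partitions are hypothetical. Only the load and free subcommands and their -p option come from the goosefs table usage text. By default the script prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of a daily cache rotation job (hypothetical names, see lead-in).
# Set GOOSEFS_DRY_RUN=0 to execute the commands; by default they are only printed.
GOOSEFS_DRY_RUN="${GOOSEFS_DRY_RUN:-1}"

# run <command...>: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$GOOSEFS_DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# rotate_partitions <db> <table> <hot_partition_spec> <cold_partition_spec>
# Preheats the partition that is about to be queried, then releases the one
# that has turned cold. The partition-spec format is an assumption.
rotate_partitions() {
    run goosefs table load "$1" "$2" -p "$3"
    run goosefs table free "$1" "$2" -p "$4"
}
```

A scheduler could then call, for example, rotate_partitions sales_db daily_report dt=2022-05-07 dt=2022-04-30 during idle hours, with the two dates chosen by whatever retention rule fits the reporting cycle.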

Next, the table management capabilities and preheating methods of GooseFS are introduced in detail.

GooseFS table & partition management and preheating

GooseFS table and partition management and preheating are provided through the table command of the goosefs command line:

$ goosefs table
Usage: goosefs table [generic options]
   [attachdb [-o|--option <key=value>] [--db <goosefs db name>] [--ignore-sync-errors] <udb type> <udb connection uri> <udb db name>]
   [detachdb <db name>]                                      
   [free <dbName> <tableName> [-p|--partition <partitionSpec>]]
   [help [<command>]]                                        
   [load <dbName> <tableName> [-g|--greedy] [--replication <num>] [-p|--partition <partitionSpec>]]
   [ls [<db name> [<table name>]]]                           
   [stat <dbName> <tableName>]                               
   [sync <db name>]                                          
   [transform <db name> <table name> [-d <definition>]]      
   [transformStatus [<job ID>]]

Among these, attachdb and detachdb bind and unbind a Hive DB, and load preheats a specified table or partition under that DB.

  1. Before preheating a table or partition of a Hive DB into GooseFS, attach (mount) the DB to GooseFS:
$ goosefs table attachdb --db test_db hive thrift://metastore_host:port goosefs_db_demo
response of attachdb
  2. After attaching, use the goosefs command line to view the table information in the DB:
$ goosefs table ls test_db web_page
OWNER: hadoop
DBNAME.TABLENAME: testdb.web_page (
wp_web_page_sk bigint,
wp_web_page_id string,
wp_rec_start_date string,
wp_rec_end_date string,
wp_creation_date_sk bigint,
wp_access_date_sk bigint,
wp_autogen_flag string,
wp_customer_sk bigint,
wp_url string,
wp_type string,
wp_char_count int,
wp_link_count int,
wp_image_count int,
wp_max_ad_count int,
)
PARTITIONED BY (
)
LOCATION (
gfs://metastore_host:port/myNamespace/3000/web_page
)
PARTITION LIST (
{
partitionName: web_page
location: gfs://metastore_host:port/myNamespace/3000/web_page
}
)
  3. Then preheat the specified table into GooseFS. The command reports the asynchronous preheating job it submits:
$ goosefs table load test_db web_page
Asynchronous job submitted successfully, jobId: 1615966078836
  4. After preheating completes, run query tasks as usual to benefit from the local cache acceleration of GooseFS.
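
The attach and preheat steps above can be wrapped in a small script. The sketch below is only illustrative: the function names attach_hive_db, extract_job_id and preheat_table are hypothetical, and the parsing assumes the load command prints a line in the same form as the sample output above ("... jobId: <number>"). The metastore address and database names are the placeholders used in this article.

```shell
#!/bin/sh
# Sketch: attach a Hive DB, preheat one table, and capture the job ID.

# attach_hive_db <goosefs_db> <metastore_uri> <hive_db>
# Mirrors the attachdb invocation shown in step 1.
attach_hive_db() {
    goosefs table attachdb --db "$1" hive "$2" "$3"
}

# extract_job_id: read `goosefs table load` output on stdin and print the
# numeric job ID. Assumes the "jobId: <number>" format shown in step 3.
extract_job_id() {
    sed -n 's/.*jobId: *\([0-9][0-9]*\).*/\1/p'
}

# preheat_table <goosefs_db> <table>
# Submits the asynchronous load and prints only the job ID.
preheat_table() {
    goosefs table load "$1" "$2" | extract_job_id
}
```

For example, attach_hive_db test_db thrift://metastore_host:port goosefs_db_demo followed by job_id=$(preheat_table test_db web_page) keeps the job ID in a variable so that a workflow task can log it or pass it on.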

GooseFS acceleration performance comparison

Here we run the standard TPC-DS benchmark in a Tencent Cloud EMR environment to compare GooseFS with local HDFS, measuring the total latency of the whole test run. GooseFS mounts COSN as its under file system (UFS), and the test data set is preheated in advance.

With the test data set held locally in both cases, GooseFS delivers better read performance than HDFS: the total latency is 28230 for GooseFS versus 29618 for HDFS. See the appendix for the latency of each SQL case.

COSN and CHDFS are the two big data file system implementations commonly used on Tencent Cloud, and both can serve as the under file system of GooseFS. The three file systems are therefore also compared here. As before, GooseFS mounts COSN as its UFS and the test data set is preheated in advance.

The test results show that, with preheated data, GooseFS significantly accelerates access to Tencent Cloud big data storage systems: the total latency is 30353 for GooseFS, 36820 for CHDFS and 41803 for COSN. See the appendix for the latency of each SQL case.
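
As a rough worked calculation, the relative reduction in total latency can be derived from the totals listed in the appendix (29618 for HDFS versus 28230 for GooseFS in the first test; 36820 for CHDFS and 41803 for COSN versus 30353 for GooseFS in the second). The helper name pct_reduction is illustrative and not part of GooseFS.

```shell
#!/bin/sh
# pct_reduction <baseline_total> <candidate_total>
# Prints by how many percent the candidate total is lower than the baseline,
# rounded to one decimal place.
pct_reduction() {
    awk -v base="$1" -v cand="$2" \
        'BEGIN { printf "%.1f\n", (base - cand) * 100 / base }'
}

pct_reduction 29618 28230   # GooseFS vs. local HDFS: 4.7
pct_reduction 36820 30353   # GooseFS vs. CHDFS:      17.6
pct_reduction 41803 30353   # GooseFS vs. COSN:       27.4
```

On these totals the gain over local HDFS is modest, while the gain over CHDFS and COSN accessed directly is considerably larger.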

Summary

As a new cloud-native big data storage accelerator launched by the Tencent Cloud Object Storage team, GooseFS addresses the lack of data locality in cloud storage such as COSN and CHDFS, and provides local, near-memory access performance.

GooseFS also provides Hive table and partition level preheating and cache policy management, which make it much easier for users to preheat data and accelerate access. In the future, GooseFS will further optimize metadata access performance, local short-circuit read performance and intelligent caching, in order to further accelerate applications on massive data lakes. For more information, go to: https://cloud.tencent.com/doc…

Appendix

TPC-DS test results of HDFS and GooseFS on local SATA disks (case100_D3_local-SATA_HDFS and case101_D3_local-SATA_GooseFS):

SQL-case case100_D3_local-SATA_HDFS case101_D3_local-SATA_GooseFS
Total 29618 28230
query1.sql 150 167
query2.sql 1392 1213
query3.sql 402 329
query8.sql 338 255
query12.sql 280 252
query13.sql 367 293
query15.sql 767 706
query19.sql 368 297
query20.sql 503 441
query21.sql 170 182
query22.sql 96 94
query26.sql 582 583
query31.sql 1211 854
query32.sql 929 670
query33.sql 673 450
query34.sql 345 253
query36.sql 444 404
query37.sql 473 396
query38.sql 811 603
query39.sql 498 510
query40.sql 953 905
query43.sql 328 252
query45.sql 453 426
query46.sql 361 332
query48.sql 431 382
query52.sql 345 239
query53.sql 806 777
query55.sql 341 237
query56.sql 675 459
query57.sql 2627 2559
query59.sql 1711 1618
query60.sql 687 465
query63.sql 805 776
query66.sql 433 430
query68.sql 352 320
query70.sql 1261 3961
query71.sql 677 475
query73.sql 339 237
query76.sql 662 378
query82.sql 758 688
query83.sql 309 320
query86.sql 186 152
query87.sql 792 613
query89.sql 809 776
query97.sql 880 712
query98.sql 838 789
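
Because the rows above are whitespace separated, individual cases can be compared with a short filter. The sketch below is one possible way to list the queries where the second value column (GooseFS) is larger than the first (HDFS); the function name slower_in_last_column is hypothetical, and rows whose last two fields are not plain integers, such as the header, are skipped.

```shell
#!/bin/sh
# slower_in_last_column: read rows "<name> <a> <b>" on stdin and print those
# where b is numerically larger than a. Non-numeric rows are ignored.
slower_in_last_column() {
    awk 'NF == 3 && $2 ~ /^[0-9]+$/ && $3 ~ /^[0-9]+$/ && ($3 + 0) > ($2 + 0) {
             printf "%s %d %d\n", $1, $2, $3
         }'
}
```

Fed with the table above, the filter should report query1, query21, query26, query39, query70 and query83. Query70 (1261 versus 3961) stands out as the only large regression.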

Comparison test results of GooseFS, CHDFS and COSN in an SSD cloud disk environment:

SQL-case case200_S5_SSD-cloud-disk_GooseFS case201_S5_SSD-cloud-disk_CHDFS case204_S5_SSD-cloud-disk_COSN
Total 30353 36820 41803
query1.sql 194 212 205
query2.sql 1377 1558 1921
query3.sql 463 457 570
query8.sql 294 394 509
query12.sql 287 307 347
query13.sql 307 668 814
query15.sql 837 867 1074
query19.sql 354 512 586
query20.sql 576 554 680
query21.sql 213 196 210
query22.sql 111 109 107
query26.sql 806 882 973
query31.sql 972 1328 1817
query32.sql 778 949 1453
query33.sql 524 779 1049
query34.sql 292 428 526
query36.sql 479 545 688
query37.sql 449 500 679
query38.sql 691 868 1210
query39.sql 695 565 654
query40.sql 1098 1082 1251
query43.sql 304 378 514
query45.sql 506 568 628
query46.sql 412 557 610
query48.sql 437 697 847
query52.sql 242 328 501
query53.sql 946 899 1058
query55.sql 244 351 485
query56.sql 520 704 925
query57.sql 3223 2914 3469
query59.sql 1965 1930 2302
query60.sql 539 696 905
query63.sql 935 934 1025
query66.sql 543 593 584
query68.sql 380 570 578
query70.sql 1430 4173 1608
query71.sql 536 780 951
query73.sql 282 384 547
query76.sql 368 648 981
query82.sql 796 828 972
query83.sql 369 353 378
query86.sql 163 184 219
query87.sql 712 896 1038
query89.sql 951 924 1050
query97.sql 801 871 1213
query98.sql 952 900 1092