Use Hive SQL to query the correct posture of the Iceberg table – “DEEPNOVA Developer Community”


Author: Wen Naisong

As a tabular management specification, Iceberg’s data is divided into metadata and table data. Metadata and table data are stored independently. Metadata currently supports storage in the local file system, HMS, Hadoop, JDBC database, AWS Glue, and custom storage. Table data supports local file systems, HDFS, S3, MinIO, OBS, OSS, etc. Metadata storage is widely based on HMS. In this article, table data storage is based on MinIO, and metadata storage is mainly based on HMS. In fact, the metadata stored based on HMS is only a small amount of information stored in HMS, and the main metadata is still stored in external systems, such as HDFS, MinIO, etc., so the storage method based on Hadoop has to be mentioned here.

Iceberg’s metadata management is specified by the Catalog type. We mainly focus on hive, hadoop and location_based_table.

What happens if you create a table directly without specifying any Catalog type?

add jar /Users/deepexi/.m2/repository/org/apache/iceberg/iceberg-hive-runtime/0.13.2/iceberg-hive-runtime-0.13.2.jar;
SET hive.vectorized.execution.enabled=false; 
CREATE TABLE default.sample_local_hive_table_1( id bigint, name string ) 
    PARTITIONED BY( dept string ) 
    STORED BY ''; 

Creating an Iceberg table needs to be implemented with the help of HiveIcebergStorageHandler. At this time, only metadata will be stored in the local HMS (configured in hive-site.xml). The metadata table table_params shows that the storage location of the current table uses the local HMS (hive-site.xml). xml configuration) file storage path:

Use Hive SQL to query the correct posture of the Iceberg table -

This corresponds to the normal action of creating a Hive table. Let’s see, what happens if you specify hive’s Catalog and configure a remote HMS?

Register the catalog of hive type

A remote HMS address is registered here:

SET iceberg.catalog.another_hive.type=hive; 
SET iceberg.catalog.another_hive.uri=thrift://; 
SET iceberg.catalog.another_hive.warehouse=s3a://faas-ethan/warehouse/; 
SET hive.vectorized.execution.enabled=false; 

Create a Hive Iceberg table in the current HiveCli, specify the location storage location, and use the hive type Catalog registered above:

 CREATE TABLE default.sample_hive_table_1( id bigint, name string ) 
     PARTITIONED BY( dept string ) 
     STORED BY '' location 's3a://faas-ethan/warehouse/default/sample_hive_table_1' 
     TBLPROPERTIES ('iceberg.catalog'='another_hive'); 

Confirm the storage parameters of the current table in the table_params metadata table of the local HMS (configured in hive-site.xml):

Use Hive SQL to query the correct posture of the Iceberg table -

Location information is stored in the SDS table. The table storage type is MANAGED_TABLE. Then go to the metadata table table_params of the remote HMS (specified by another_hive.uri) to see the storage content:

Use Hive SQL to query the correct posture of the Iceberg table -

In addition to the information stored in the local HMS, there are additional statistical information and metadata location information about Iceberg. The Location information is also copied synchronously and stored in the SDS table, but the storage type of the table is changed to EXTERNAL_TABLE. It can be seen that creating an Iceberg table in Hive will store a copy of metadata in both HMSs. Only in this way can the Iceberg table in the remote HMS be visible to the local HMS. Therefore, it is necessary to ensure that there is a corresponding database in the remote HMS. If you view the table creation statement in the current Hive CLI, you will get the following output:

show create table sample_hive_table_1; 
CREATE TABLE `sample_hive_table_1`( `id` bigint COMMENT 'from deserializer',    
                                   `name` string COMMENT 'from deserializer',   
                                   `dept` string COMMENT 'from deserializer') 
    ROW FORMAT SERDE    ''  
    STORED BY    ''  
    WITH SERDEPROPERTIES (    'serialization.format'='1') 
    LOCATION  's3a://faas-ethan/warehouse/default/sample_hive_table_1' 
 Time taken: 5.45 seconds, Fetched: 18 row(s) 

This means that the information returned by Hive Cli comes from the local HMS, similar to common Hive.

Question: If there is only the Iceberg table of the remote HMS, how to access it on the local HMS?

At this point, metadata can be generated on the local HMS by creating an external appearance as follows:

CREATE EXTERNAL TABLE default.sample_hive_table_1( id bigint, name string ) 
     PARTITIONED BY( dept string ) 
     STORED BY '' 
     location 's3a://faas-ethan/warehouse/default/sample_hive_table_1' 
     TBLPROPERTIES ('iceberg.catalog'='another_hive'); 

What does this do? Try the following statement:

select * from default.sample_local_hive_table_1,sample_hive_table_1; 

did you see? It implements federated queries across HMS through Hive SQL!

Register the catalog of the hadoop type

SET iceberg.catalog.hadoop_cat.type=hadoop; 
SET iceberg.catalog.hadoop_cat.warehouse=s3a://faas-ethan/warehouse; 

Create a hadoop Iceberg table with the registered hadoop type Catalog:

CREATE TABLE default.sample_hadoop_table_1( id bigint, name string ) 
     PARTITIONED BY ( dept string ) 
     STORED BY '' 
     LOCATION 's3a://faas-ethan/warehouse/default/sample_hadoop_table_1' 
     TBLPROPERTIES ('iceberg.catalog'='hadoop_cat'); 

If you look at the stored metadata, you will find that it is the same as the table created using the hive type Catalog, without the metadata_location attribute:

Use Hive SQL to query the correct posture of the Iceberg table -

At the same time, there are more version-hint.text metadata metadata files in MinIo:

Use Hive SQL to query the correct posture of the Iceberg table -

This is because the version mechanisms for managing the latest metadata are different. So you cannot create an Iceberg table of a Hive-type Catalog based on a Hadoop-type Catalog. The reverse is also not possible.

Question: The current table is created through Hive Cli, so the metadata will be stored in HMS. If the Iceberg table is created through the Iceberg API, and HMS has no relevant information at all, how can I access this hadoop Iceberg table?

At this time, specify the path to store the table in Hadoop through location, and then register the metadata in HMS through EXTERNAL, and Hive will automatically infer the table schema from it:

SET iceberg.catalog.hadoop_cat.type=hadoop; 
SET iceberg.catalog.hadoop_cat.warehouse=s3a://faas-ethan/warehouse; 
CREATE EXTERNAL TABLE default.sample_hadoop_table_1 
location 's3a://faas-ethan/warehouse/default/sample_hadoop_table_1' 
TBLPROPERTIES ('iceberg.catalog'='hadoop_cat'); 

At this time, the location must conform to the format: ${iceberg.catalog.hadoop_cat.warehouse}/${dbName}/${tableName}, otherwise the storage format cannot be recognized.

The above method of creating a table can also be implemented in the following way:

 CREATE EXTERNAL TABLE default.sample_hadoop_table_1 
 location 's3a://faas-ethan/warehouse/default/sample_hadoop_table_1' 
 TBLPROPERTIES ('iceberg.catalog'='location_based_table'); 


Use Hive SQL to create an Iceberg table. You can create an Iceberg table based on hive, hadoop, and location_based_table type catalogs, where location_based_table can be regarded as a simplified form of hadoop type, and register the tables in the external HMS with the current HMS in the form of external appearances, so that Realize the federated query of Hive, but the version management of hive and hadoop tables is slightly different, so the external table can only be mapped on the same type of catalog table. In addition, due to the large span of Hive versions, there are many compatibility issues between different versions of HMS, such as the inability of HMS 3.x to read the metadata of HMS 2.x.

Recommended Today

The Way of MatrixOne Chaos Testing

Author: Su Dong MO Test Engineer guide Migrating to cloud-native and adopting a distributed architecture/design approach to creating cloud-native applications is a major trend in recent years, and this trend is accelerating further. The most important driving factor is that it can greatly reduce application downtime, high elasticity, high resource utilization, etc., thereby adding more […]