Hive data import of nebula exchange tool

Time: 2021-01-25


Abstract: Contributed by xrfinbj, this article mainly introduces the process of importing data from a Hive data warehouse into Nebula Graph with the Exchange tool, along with the related precautions.

1 Background

Our company has a use case for a graph database, and after technology selection we settled on Nebula Graph. We also needed to verify its query performance in real business scenarios, so importing data into Nebula Graph for verification became urgent. In the process we found that the documentation for importing data from a Hive data warehouse into Nebula Graph with the Exchange tool is incomplete, so we recorded the pits we stepped in along the way and fed them back to the community, hoping to save others from the same detours.

This article is mainly based on two posts I previously published in the forum:

2 Environment information

  • Nebula Graph version: nebula:nightly
  • Deployment mode (distributed / stand-alone / Docker / DBaaS): Docker deployment on a Mac
  • Hardware information

    • Disk (SSD / HDD): Mac SSD
    • CPU and memory: 16 GB memory
  • Data warehouse environment (a local data warehouse built on a Mac)

    • Hive 3.1.2
    • Hadoop 3.2.1
  • Exchange tool: https://github.com/vesoft-inc/nebula-java/tree/v1.0/tools/exchange

Compile it to generate the jar package.

  • Spark

spark-2.4.7-bin-hadoop2.7 (place Hadoop 3.2.1's core-site.xml, hdfs-site.xml, and hive-site.xml in the conf directory, and configure spark-env.sh)
Scala code runner version 2.13.3 — Copyright 2002-2020, LAMP/EPFL and Lightbend, Inc.

3 Configuration

1 Nebula Graph DDL

CREATE SPACE test_hive(partition_num=10, replica_factor=1); -- create the graph space; one replica is enough for this example
USE test_hive; -- switch to graph space test_hive
CREATE TAG tagA(idInt int, idString string, tboolean bool, tdouble double); -- create tag tagA
CREATE TAG tagB(idInt int, idString string, tboolean bool, tdouble double); -- create tag tagB
CREATE EDGE edgeAB(idInt int, idString string, tboolean bool, tdouble double); -- create edge edgeAB

2 Hive DDL

CREATE TABLE `tagA`(
  `id` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into tagA select 1,1,'str1',true,11.11;
insert into tagA select 2,2,"str2",false,22.22;

CREATE TABLE `tagB`(
  `id` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into tagB select 3,3,"str 3",true,33.33;
insert into tagB select 4,4,"str 4",false,44.44;

CREATE TABLE `edgeAB`(
  `id_source` bigint,
  `id_dst` bigint,
  `idInt` int,
  `idString` string,
  `tboolean` boolean,
  `tdouble` double) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001' LINES TERMINATED BY '\n';
insert into edgeAB select 1,3,5,"edge 1",true,55.55;
insert into edgeAB select 2,4,6,"edge 2",false,66.66;
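A side note on the Hive DDL above: the tables use the \001 (Ctrl-A) byte as the field delimiter. A minimal Python sketch (the raw line below is hypothetical, constructed from the first tagA row) shows how such an on-disk line splits into fields:

```python
# Hypothetical raw line for the first tagA row, as Hive would store it:
# fields are separated by the \001 (Ctrl-A) byte declared in the DDL above.
raw_line = "1\x011\x01str1\x01true\x0111.11"

fields = raw_line.split("\x01")
print(fields)  # ['1', '1', 'str1', 'true', '11.11']
```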

My latest nebula_application.conf file

Pay attention to the mapping among exec, fields, nebula.fields, vertex, source, and target

{
  # Spark relation config
  spark: {
    app: {
      name: Spark Writer
    }

    driver: {
      cores: 1
      maxResultSize: 1G
    }

    cores {
      max: 4
    }
  }

  # Nebula Graph relation config
  nebula: {
    address:{
      graph: ["192.168.1.110:3699"]
      meta: ["192.168.1.110:45500"]
    }
    user: user
    pswd: password
    space: test_hive

    connection {
      timeout: 3000
      retry: 3
    }

    execution {
      retry: 3
    }

    error: {
      max: 32
      output: /tmp/error
    }
    rate: {
      limit: 1024
      timeout: 1000
    }
  }

  # Processing tags
  tags: [
    # Loading from Hive
    {
      name: tagA
      type: {
        source: hive
        sink: client
      }
      exec: "select id,idint,idstring,tboolean,tdouble from nebula.taga"
      fields: [id,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      vertex: id
      batch: 256
      partition: 10
    }
    {
      name: tagB
      type: {
        source: hive
        sink: client
      }
      exec: "select id,idint,idstring,tboolean,tdouble from nebula.tagb"
      fields: [id,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      vertex: id
      batch: 256
      partition: 10
    }
  ]

  # Processing edges
  edges: [
    # Loading from Hive
    {
      name: edgeAB
      type: {
        source: hive
        sink: client
      }
      exec: "select id_source,id_dst,idint,idstring,tboolean,tdouble from nebula.edgeab"
      fields: [id_source,idstring,tboolean,tdouble]
      nebula.fields: [idInt,idString,tboolean,tdouble]
      source: id_source
      target: id_dst
      batch: 256
      partition: 10
    }
  ]
}
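To make the mapping concrete: as far as I understand, fields and nebula.fields are matched positionally, so the two lists must have the same length. A small Python sketch (my own illustration, not Exchange's code) using the tagA section above:

```python
# Sketch of how fields in the Hive result set map positionally onto
# nebula.fields, taken from the tagA section of the config above.
hive_fields = ["id", "idstring", "tboolean", "tdouble"]       # fields
nebula_fields = ["idInt", "idString", "tboolean", "tdouble"]  # nebula.fields

# The lists are zipped pairwise, so their lengths must match.
assert len(hive_fields) == len(nebula_fields)
mapping = dict(zip(hive_fields, nebula_fields))
print(mapping)  # {'id': 'idInt', 'idstring': 'idString', 'tboolean': 'tboolean', 'tdouble': 'tdouble'}
```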

4 Performing the import

4.1 Make sure the Nebula Graph services are started

4.2 Make sure the Hive tables and data are ready

4.3 Run the spark-sql CLI to check that the Hive tables and data are normal, which also verifies that the Spark environment is OK


4.4 Once all the configuration work is ready, run the spark-submit command:

spark-submit --class com.vesoft.nebula.tools.importer.Exchange --master "local[4]" /xxx/exchange-1.0.1.jar -c /xxx/nebula_application.conf -h

4.5 After a successful import, you can use the db_dump tool to check the amount of imported data and verify correctness:

./db_dump --mode=stat --space=xxx --db_path=/home/xxx/data/storage0/nebula   --limit 20000000

5 Pitfalls and notes

  • The first pitfall: the spark-submit command did not include the -h parameter
  • Tag names in Nebula Graph are case sensitive; in the tags configuration, name must be the exact tag name in Nebula Graph
  • Hive's int and Nebula Graph's int are inconsistent: Hive's bigint is what corresponds to Nebula Graph's int
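A quick sketch of why bigint is the right choice, under the assumed widths that Hive's int is 32-bit while Hive's bigint and Nebula Graph v1.x's int are both 64-bit:

```python
# Assumed type widths (see the note above): Hive `int` is 32-bit,
# Hive `bigint` and Nebula Graph v1.x `int` are both signed 64-bit.
HIVE_INT_MAX = 2**31 - 1      # Hive `int`
HIVE_BIGINT_MAX = 2**63 - 1   # Hive `bigint`
NEBULA_INT_MAX = 2**63 - 1    # Nebula Graph `int` (v1.x)

# Hive bigint lines up with Nebula's int; plain Hive int does not.
assert HIVE_BIGINT_MAX == NEBULA_INT_MAX
assert HIVE_INT_MAX < NEBULA_INT_MAX
```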

Other notes:

  • Since Nebula Graph's underlying storage is KV, a repeated insert is effectively an overwrite, so replacing update operations with insert gives better performance
  • For now, the gaps in the documentation can only be filled by reading the source code and asking in the forum (it's not easy for the developers, who have to develop and answer user questions at the same time)
  • Data import, compaction, and operations suggestions: https://docs.nebula-graph.com.cn/manual-CN/3.build-develop-and-administration/5.storage-service-administration/compact/
  • I have verified the following two scenarios:

    • Importing data from Hive 2 (Hadoop 2) into Nebula Graph with Spark 2.4
    • Importing data from Hive 3 (Hadoop 3) into Nebula Graph with Spark 2.4

Note: Exchange does not yet support Spark 3; compiling against it reports errors, so the Spark 3 environment could not be verified.

There are still some open questions:

  • How should the batch and rate.limit parameters in nebula_application.conf be set? How to pick the values?
  • What is the principle behind Exchange's Hive data import?
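On the batch question, a back-of-the-envelope calculation (my own reasoning, not from the docs): batch controls how many rows go into one insert statement, so the number of write requests per tag or edge is roughly the row count divided by batch. The row count below is hypothetical:

```python
import math

# batch=256 comes from the config above; the row count is hypothetical.
total_rows = 1_000_000
batch = 256

# Each request carries up to `batch` rows, so the request count is the ceiling.
requests = math.ceil(total_rows / batch)
print(requests)  # 3907
```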

6 Debugging the Exchange source code

For the Spark debugging part, refer to this blog: https://dzone.com/articles/how-to-attach-a-debugger-to-apache-spark

By reading and debugging the Exchange source code, we can deepen our understanding of how Exchange works, and also find places where the documentation is unclear, such as "Import SST file" and "Download and Ingest". Only by combining the documentation with the source code did we discover that its descriptions are unclear and its logic is not rigorous.

Some simple parameter configuration problems can also be found through source code debugging.

To get to the point:

Step 1:

export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=4000

Step 2:

spark-submit --class com.vesoft.nebula.tools.importer.Exchange --master "local" /xxx/exchange-1.1.0.jar -c /xxx/nebula_application.conf -h
Listening for transport dt_socket at address: 4000

Step 3: configure IDEA


Step 4: click Debug in IDEA


7 Suggestions and thanks

Thanks to vesoft for providing Nebula Graph, "the most powerful graph database in the universe", which solves many practical problems in our business; the pain along the way is nothing by comparison (see the previous sharing for the real pain). Thanks again to the community for its timely feedback.

I'm looking forward to Exchange supporting Nebula Graph 2.0.
