Stack technology sharing: OTS data migration – we do not produce data, we are big data porters

Time: 2021-07-28

Data Stack is a cloud-native, one-stop data platform PaaS. We have an interesting open source project on GitHub and Gitee: FlinkX. FlinkX is a batch-stream unified data synchronization tool based on Flink that can collect both static data and real-time changing data; it is a data synchronization engine that integrates full-domain, heterogeneous, batch and streaming data synchronization. If you like it, please give us a star! star! star!

GitHub open source project:https://github.com/DTStack/fl…

Gitee open source project:https://gitee.com/dtstack_dev…

Tablestore (OTS) is a NoSQL data storage service: a distributed service, built on cloud computing technology, for storing and managing structured and semi-structured data.

The Tablestore data model is centered on the two-dimensional table.

Tables have rows and columns, but unlike tables in traditional databases, Tablestore tables are sparse: each row can have different columns, attribute columns can be added or removed dynamically, and there is no need to define a strict schema for the attribute columns when creating a table.
1、 Overview

For OTS data migration, dataX can be used to complete a full data migration. However, some data tables are so large that a full migration cannot be completed within the specified time window, and at present dataX can only query by primary key range; it does not support extracting data by attribute-column range.

Therefore, full + incremental data migration can be realized in the following two ways:

  • If the partition key contains range information (such as a timestamp or an auto-incrementing ID), migrate in batches, using the specified ranges as cut-off points (see the sketch below).
  • If the partition key contains no range information, switch the application side to double-write mode and migrate the data in batches into the same business table in the target environment. Using the primary key uniqueness of OTS, choose the overwrite-existing-row strategy for duplicate data to guarantee data uniqueness.
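For the first approach, the following is a minimal batching sketch, assuming the partition key encodes a date. The template file range_template.json, the __BEGIN__/__END__ placeholders, and the date cut-offs are all hypothetical; the template is an ordinary dataX job like the ones shown later in this article, with the placeholders sitting in the otsreader "range" section.

#!/bin/bash
# Hypothetical batching wrapper: run one dataX job per range cut-off.
# range_template.json is assumed to hold __BEGIN__/__END__ placeholders
# inside otsreader's "range" block.
set -e
ranges=("2021-01-01 2021-04-01" "2021-04-01 2021-07-01" "2021-07-01 2021-10-01")
for r in "${ranges[@]}"; do
    read -r begin end <<< "$r"
    sed -e "s/__BEGIN__/${begin}/" -e "s/__END__/${end}/" range_template.json > "batch_${begin}.json"
    python datax.py "batch_${begin}.json" --jvm="-Xms16G -Xmx16G" > "batch_${begin}.log" 2>&1
done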

In short, we do not produce data; at the moment, we are merely the porters of big data.

Next, taking the case where the application side is switched to double-write mode as an example, this article explains the OTS data migration and verification process in detail.

The OTS data migration process is shown in the figure below.

[Figure: OTS data migration process]

Preparation for OTS data migration covers two stages:

  • Pre-migration stage: full migration of the large tables in double-write mode
  • Formal migration stage: full migration of the incremental tables in double-write mode and of the other small tables

2、 Pre-migration phase

1. Preparatory work

To ensure data consistency between the new and old environments, the OTS data tables in the target environment need to be cleared before the data migration. The delete operation removes the data in the tables directly through the dataX tool, without re-creating the tables.

The specific operations are as follows:

1) Configure dataX tasks

Before using dataX to clear data, you need to configure the JSON file for each data table so that dataX can run the delete task. In the data-clearing configuration, both the reader and the writer are given the connection information of the target environment, and the write mode is set to DeleteRow. The details are as follows:

{

"job": {
    "setting": {
        "speed": {
            "channel": "5"
        }
    },
    "content": [{
        "reader": {
            "name": "otsreader",
            "parameter": {
                "endpoint": "http://xxx.vpc.ots.yyy.com/",
                "accessId": "dest_accessId",
                "accessKey": "dest_accessKey",
                "instanceName": " dest_instanceName",
                "table": " tablename ",
                "column": [{
                        "name": "xxxxx"
                    },
                    {
                        "name": "xxxxx"
                    }
                ],
                "range": {
                    "begin": [{
                        "type": "INF_MIN"
                    }],
                    "end": [{
                        "type": "INF_MAX"
                    }]
                }
            }
        },
        "writer": {
            "name": "otswriter",
            "parameter": {
                "endpoint": "http://xxx.vpc.ots.yun.yyy.com/",
                "accessId": "dest_accessId",
                "accessKey": "dest_accessKey",
                "instanceName": " dest_instanceName",
                "table": " tablename ",
                "primaryKey": [{
                    "name": "xxxxx",
                    "type": "string"
                }],
                "column": [{
                        "name": "xxxxx",
                        "type": "string"
                    },
                    {
                        "name": "xxxxx",
                        "type": "string"
                    }
                ],
                "writeMode": "DeleteRow"
            }
        }
    }]
}

}

2) Execute dataX task

  • Log in to the ECS where dataX is located and enter the path where dataX is located
  • Run the del_pre.sh script on the corresponding tool machine to start clearing the data of the corresponding tables in the target environment. The specific command is as follows:

sh del_pre.sh

  • The del_pre.sh script is as follows:

#!/bin/bash
nohup python datax.py del_table_1.json --jvm="-Xms16G -Xmx16G" > del_table_1.log &
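If several large tables need to be cleared, the same pattern can be extended to a simple loop; the JSON file names below are placeholders, each following the DeleteRow configuration shown above.

#!/bin/bash
# Hypothetical variant of del_pre.sh: clear several target tables one after another.
for job in del_table_1.json del_table_2.json del_table_3.json; do
    python datax.py "$job" --jvm="-Xms16G -Xmx16G" > "${job%.json}.log" 2>&1
done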

2. Data migration

Without stopping the service, migrate all the large data tables in the source environment to the corresponding data tables in the target environment.

1) Configure dataX tasks

Configure the corresponding JSON file for the data table in dataX. The specific contents of the migration configuration are as follows:

{

"job": {
    "setting": {
        "speed": {
            "channel": "5"
        }
    },
    "content": [{
        "reader": {
            "name": "otsreader",
            "parameter": {
                "endpoint": "http://xxx.vpc.ots.yyy.com/",
                "accessId": "src_accessId",
                "accessKey": "src_ accessKey ",
                "instanceName": " src_instanceName",
                "table": "tablename",
                "column": [{
                        "name": "xxxxx"
                    },
                    {
                        "name": "xxxxx"
                    }
                ],
                "range": {
                    "begin": [{
                        "type": "INF_MIN"
                    }],
                    "end": [{
                        "type": "INF_MAX"
                    }]
                }
            }
        },
        "writer": {
            "name": "otswriter",
            "parameter": {
                "endpoint": "http://xxx.vpc.ots.yun.zzz.com/",
                "accessId": "dest_accessId",
                "accessKey": "dest_accessKey",
                "instanceName": " dest_instanceName",
                "table": " tablename ",
                "primaryKey": [{
                    "name": "xxxxx",
                    "type": "string"
                }],
                "column": [{
                        "name": "xxxxx",
                        "type": "string"
                    },
                    {
                        "name": "xxxxx",
                        "type": "string"
                    }
                ],
                "writeMode": "PutRow"
            }
        }
    }]
}

}

Note that because OTS is a schema-free NoSQL system, all attribute columns must be listed in the migration configuration; otherwise the values of any omitted attribute columns will be lost.

2) Execute dataX task

  • Log in to the ECS where dataX is located and enter the path where dataX is located
  • Run the pre_transfer.sh script on the corresponding tool machine to start the data migration from the source VPC OTS to the target VPC OTS. The specific command is as follows:

sh pre_transfer.sh

  • The pre_transfer.sh script is as follows:

#!/bin/bash
nohup python datax.py table_1.json --jvm="-Xms16G -Xmx16G" >table_1.log &
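Because the job is started in the background with nohup, its progress can simply be followed from the log file written by the script above:

tail -f table_1.log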

Ah, at this point everything is ready, and the data is just waiting to be migrated!

Before migrating, let's keep the objective of the data migration firmly in mind: the new and old environments must hold exactly the same data.

Next, we enter the formal migration phase!
3、 Formal migration phase

1. OTS data quiescence

OTS data quiescence is determined mainly by observing whether the data in the corresponding tables is still changing. The verification methods are row count statistics and content statistics.

1) Row count

Because OTS itself does not provide a count interface, an OTS external table is created in Hive to read the OTS data and compute the row count of the corresponding data table. The specific operations are as follows:

  • Create the external table
  • Start Hive and create the external table corresponding to the data table above. To improve statistical efficiency, the external table reads only the primary key column of the OTS table. An example table creation statement is as follows:

CREATE EXTERNAL TABLE t_oilcard_expenses_old
(h_card_no string)
STORED BY 'com.aliyun.openservices.tablestore.hive.TableStoreStorageHandler'
WITH SERDEPROPERTIES(
"tablestore.columns.mapping"="card_no")
TBLPROPERTIES (
"tablestore.endpoint"="$endpoint",
"tablestore.instance"="instanceName",
"tablestore.access_key_id"="ak",
"tablestore.access_key_secret"="sk",
"tablestore.table.name"="tableName");
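Once the external table exists, an optional sanity check can confirm that Hive really reads from Tablestore before the full count is launched. This assumes the Tablestore storage handler JAR is already on Hive's classpath and that the table was created in the ots database used by pre_all_count.sh further below.

#!/bin/bash
# Optional sanity check: pull a handful of rows through the external table.
./bin/hive -e "use ots; select h_card_no from t_oilcard_expenses_old limit 10;"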

  • Enter the path where the script is located
  • Log in to the ECS of the Hadoop cluster master and enter the Hive directory
  • Execute the row count
  • Run the pre_all_count.sh script to start the row count of the corresponding OTS tables in the source environment

nohup sh pre_all_count.sh >pre_all_count.log &

  • The pre_all_count.sh script is as follows:

#!/bin/bash
./bin/hive -e "use ots;select count(h_card_no) from tableName;" >table.rs &

The row count is performed twice in succession. If the two results are identical, the data has become quiescent and data writing has stopped.
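A minimal sketch of that comparison, assuming the two runs wrote their results to table_count_1.rs and table_count_2.rs (file names hypothetical):

#!/bin/bash
# Compare two consecutive row-count results; identical files suggest the data is quiescent.
if diff -q table_count_1.rs table_count_2.rs > /dev/null; then
    echo "Counts match - data appears quiescent, writing has stopped."
else
    echo "Counts differ - data is still being written, check again later."
fi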

2) Content statistics

Because the partition key of some data tables takes only a single value, all of their data sits in the same partition, and counting rows with Hive would take too long. For such tables, dataX is used to export the OTS data to OSS and the content is compared instead. The specific operations are as follows:

  • Enter the path where the script is located
  • Log in to the ECS corresponding to the above table and enter the path where dataX is located;
  • Perform content verification
    a. Run the check_table.sh script to export the OTS data tables in the source environment to OSS;

sh check_table.sh

  • The check_table.sh script is as follows:

#!/bin/bash
nohup python datax.py table.json --jvm="-Xms8G -Xmx8G" >ots2oss01.log &

b. Get the ETag value of the exported OSS object and write it to the corresponding file table_check01.rs.
Perform the content statistics twice in succession and compare the ETag values of the two exported objects. If they are identical, the data has become quiescent and data writing has stopped.
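A hedged sketch of step b, assuming the ossutil command-line tool is installed and configured; the bucket and object names below are placeholders, and ossutil stat prints the object metadata, including its ETag.

#!/bin/bash
# Record the ETag of the exported object; repeat after the second export and
# diff the two .rs files to confirm the content has not changed.
ossutil stat oss://your-bucket/ots-export/tablename | grep -i etag > table_check01.rs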

2. OTS data migration

1) Preparatory work

To ensure that the data in the new and old environments is consistent after the migration, and to prevent dirty data left over from testing in the target environment, the remaining full tables of OTS in the target environment must be cleared before the data migration.

Data can be cleared in two ways, drop and delete. The differences between them are as follows:

a. Drop table operation

Log in to the tool machine where the OTS graphical client is located, connect the specified OTS instance with the following information, and drop the corresponding table;

AK: dest_accessId
SK: dest_accessKey
InstanceName: InstanceName
Endpoint: endpoint

After confirming the deletion, re-create the corresponding data table on the client.

b. Delete table operation

The delete operation deletes the data in the table directly through the dataX tool, without rebuilding the table. The required dataX configuration files are as shown in 2.1.1.

  • Log in to the ECS where dataX is located and enter the path where dataX is located
  • Execute the delete script on the corresponding tool machine to clear the data of the corresponding OTS tables in the target environment. The specific command is as follows:

sh del_table_01.sh

  • The del_table_01.sh script is as follows:

#!/bin/bash
nohup python datax.py del_table.json --jvm="-Xms16G -Xmx16G" >del_table.log &

2) Data migration

After the service of the source environment has been stopped, migrate the incremental tables in double-write mode and the other small tables to the corresponding data tables in the target environment.

The specific operations are as follows:

a. Configure dataX tasks

Configure the corresponding JSON file for each of the above data tables in dataX; refer to 2.2.1 for the specific migration configuration. All attribute columns need to be listed in the migration configuration.

b. Execute dataX task

  • Log in to the ECS where dataX is located and enter the path where dataX is located
  • Execute the transfer.sh script on the corresponding tool machine to start the data migration from the source VPC OTS to the target VPC OTS. The specific command is as follows:

sh transfer.sh

  • The script of transfer.sh is as follows:

#!/bin/bash
nohup python datax.py Table.json >Table.log &

3. OTS data verification

The OTS data in the new and old environments is verified by row count and content statistics, as follows:

1) Source environment data statistics

For the data volume of the OTS data tables in the source environment, the last statistical results obtained during the data quiescence check can be used directly.

2) Target environment data statistics

a. Row count

Because OTS itself does not provide a count interface, and ODPS (MaxCompute) in the target environment supports creating OTS external tables, an OTS external table is created in ODPS to read the OTS data and compute the row count of the corresponding data table. The specific operations are as follows, starting with a hedged sketch of what the external table DDL might look like.
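The sketch below wraps the DDL in an odpscmd call, mirroring the Hive table above. The external table name, instance, endpoint, and column mapping are placeholders, and the statement follows the documented MaxCompute Tablestore integration, in which primary key columns are prefixed with a colon in tablestore.columns.mapping.

#!/bin/bash
# Hypothetical DDL for the ODPS external table over the OTS table; only the
# primary key column is mapped, mirroring the Hive external table above.
./bin/odpscmd -e "
CREATE EXTERNAL TABLE IF NOT EXISTS t_oilcard_expenses_new (h_card_no STRING)
STORED BY 'com.aliyun.odps.TableStoreStorageHandler'
WITH SERDEPROPERTIES (
  'tablestore.columns.mapping' = ':card_no',
  'tablestore.table.name' = 'tableName'
)
LOCATION 'tablestore://instanceName.endpoint';"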

  • Create the external table
  • Log in to odpscmd and create the external table corresponding to the data table above (see the sketch above);
  • Enter the path where the script is located
  • Log in to the ECS where the odpscmd tool is located and enter the ODPS client directory;
  • Execute the row count
  • Run the newots_count.sh script to count the rows of the corresponding OTS tables in the target environment;

nohup sh newots_count.sh >newots_count.log &

  • The newots_count.sh script is as follows:

#!/bin/bash
./bin/odpscmd -e "select count(h_card_no) from tableName;" >table.rs &
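Finally, a minimal sketch of comparing the source and target row counts, assuming the Hive result from the quiescence check was kept as table.rs on the source side and the odpscmd result was copied over as newots_table.rs (file names hypothetical); both outputs are reduced to their last number before comparison.

#!/bin/bash
# Extract the numeric counts from the Hive (source) and odpscmd (target) outputs
# and compare them; equal values mean the row counts are consistent.
src=$(grep -Eo '[0-9]+' table.rs | tail -n 1)
dst=$(grep -Eo '[0-9]+' newots_table.rs | tail -n 1)
if [ "$src" = "$dst" ]; then
    echo "Row counts match: $src"
else
    echo "Row counts differ: source=$src target=$dst"
fi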

b. Content statistics

Since some data tables in the source environment use content statistics for data verification, the target environment also uses content statistics so that data consistency can be compared directly. For the specific operations, refer to 3.1.2.