Big data ETL processing tool kettle common input and output

Time:2022-1-1

Compared with the popular big data technology, you may think kettle has too few use scenarios, or there is no need to use such a thing. Check it outgithub kettleIt is found that there are some updates recently. In addition, for data users without programming experience, they can use a very simple kettle to realize what business they do through graphical interface design without writing code. They can do some experiments, such as capturing stock data and foreign exchange information on the website.

Kettle supports many input and output formats, including text files, data tables, and database engines. In short, kettle’s powerful input, output and conversion functions make it very convenient for you to operate data.

Common input steps

File input steps

Common text file input steps include: CSV file input, excel input, text file input, etc.

The HelloWorld level function “copy data from CSV file to excel file” has been introduced in the previous article. Please refer to the detailed steps.

You can select all files in the same directory, select a directory, and then configure files by wildcard symbols. You can also select whether to read files in subdirectories of the current directory, as shown in the following figure:

Big data ETL processing tool kettle common input and output

Select all files in the same directory

XML input steps

XML is an extensible markup language, which is mainly used to transmit and store data. In some traditional systems, this method is still used for data transmission and docking. With the help of the “get data from XML” input step, the data information in the XML file is obtained, and the location of some data in the XML document is determined by using XPath. XPath is based on the tree structure of XML, Provides the ability to find nodes in the data structure tree.

Example

Complete an example through the wizard to read the attribute configuration information in the POM file, as shown in the following figure:

Big data ETL processing tool kettle common input and output

XML data fragment
Big data ETL processing tool kettle common input and output

Parsed results

Common XPath expressions

expression explain
nodename Select all children of this node
/ Select from root node
// Selects nodes in the document from the current node that matches the selection, regardless of their location
. Select current node
Select the parent node of the current node
@ Select Properties
div Select all child nodes of the div element
/div Select root element div
div/p Select the child element P under the div element
//div Select all div elements
div//p Select all P elements under the div element
//@lang Select all properties named Lang

JSON input steps

Compared with XML, it is a lightweight data exchange format. JSON core concepts: array ([] data), object ({} data), attribute (K: V data)

To realize the function of “calling the restful interface to import JSON result storage”, whether calling the restful interface through Java or Python coding to store the results, it has a certain complexity. First, you need to load the third-party rest component dependencies, then connect to the database, write SQL statements, and finally insert them into the target database. But with kettle, we only need to use the graphical interfaceSpoonYou can easily complete the operation of interface call and warehousing.

Obtain the “Shanghai and Shenzhen IPO notice” information through a simple get request. The data and operations are as follows

sample data
Big data ETL processing tool kettle common input and output

Notice of new shares in Shanghai and Shenzhen
Interface information

https://xueqiu.com/service/v5/stock/preipo/cn/query?type=subscribe&order_by=onl_subbeg_date&order=asc&page=1&size=10&_=1626884306432

Conversion steps:

Big data ETL processing tool kettle common input and output

Capture the data flow in the figure above
Output results
Big data ETL processing tool kettle common input and output

Grab results
Specific configuration steps

Table input steps

Add database driver

If you want to operate the database, you must first download the database driver and then place it under / data integration / lib. By default, the following drivers are included:

(base)  ~/Documents/apps/data-integration/lib  ls -al | grep sql
[email protected]   1 Yezhiwei  staff   1473091 Jun 11  2019 hsqldb-2.3.2.jar
[email protected]   1 Yezhiwei  staff    825943 Jun 11  2019 postgresql-42.2.5.jar
[email protected]   1 Yezhiwei  staff   3201133 Jun 11  2019 sqlite-jdbc-3.7.2.jar

Add MySQL driver jar package mysql-connector-java-5.1.41 Jar to / data integration / lib:

Big data ETL processing tool kettle common input and output

image-20210722094311728
Configure database connection
Big data ETL processing tool kettle common input and output

Configure database connection
Get table SQL query statement, data preview

Select the database connection configuration, and then click Get SQL query statement

Big data ETL processing tool kettle common input and output

image-20210722094627047

Select a table, complete the setting according to the prompt, and automatically generate SQL. In preview, you can view the data.

Big data ETL processing tool kettle common input and output

image-20210722094818719
Big data ETL processing tool kettle common input and output

image-20210722095028211

Common output steps

Usually, after data processing, it needs to be saved to a place for subsequent use. Usually, it is output to excel or database. Excel output has been described in the previous example. The following focuses on the output steps of database.

Common database output steps include insert / update / delete. The example focuses on the “insert / update” and “delete” output steps

Table output

Output data from other input sources to a table

Big data ETL processing tool kettle common input and output

image-20210722100309221
Big data ETL processing tool kettle common input and output

image-20210722100343895

to update

Update is to compare the existing records in the database with the records in the data flow, and update them if they are different

Insert update

Insert update is to insert redundant data in the data stream on the basis of update

delete

Delete can be used with user-defined constant input to define a constant condition. All data that does not meet this condition will be deleted

Example
Data preparation
Big data ETL processing tool kettle common input and output

image-20210722102920952
Insert or update data from Excel to MySQL
Big data ETL processing tool kettle common input and output

image-20210722113824626
Big data ETL processing tool kettle common input and output

Set excel input files and fields
Big data ETL processing tool kettle common input and output

Database output settings
New data insertion results
Big data ETL processing tool kettle common input and output

New data insertion
Update results
Big data ETL processing tool kettle common input and output

image-20210722113626073

epilogue

In data warehouse technology, ETL is essential. As a classic tool of ETL, kettle realizes what business to do through graphical interface design without writing code. It is also very simple for data users without programming experience.

Of course, there are some problems, such as the amount of data processed is relatively small, the documentation is relatively small, and so on. Have the opportunity to summarize the problems encountered in the use process.
Welcome to the official account: HelloTech, get more information.

Recommended Today

Could not get a resource from the pool when the springboot project starts redis; nested exception is io. lettuce. core.

resolvent: Find your redis installation path: Start redis server Exe After successful startup: Restart project resolution. ———————————————————————->Here’s the point:<——————————————————————- Here, if you close the redis command window, the project console will report an error. If you restart the project, the same error will be reported at the beginning, The reason is: It is inconvenient to […]