Getting Started with Kettle, a Big Data ETL Processing Tool

Time:2022-8-9

Introduction to Kettle

ETL is short for Extract-Transform-Load: the process of extracting, transforming, and loading data. Data developers constantly deal with data processing, transformation, and migration, so it is essential to understand and master at least one ETL tool. The tool covered here is Kettle.

What is Kettle

Kettle is an open source ETL tool, developed outside China, that places no restrictions on commercial users. It is written in pure Java and runs on Windows, Linux, and Unix. It needs no installation, and its data extraction is efficient and stable. True to its name, Kettle lets you manage data from different databases by pouring all kinds of data into one pot and letting it flow out in a specified format. Kettle has two kinds of script files, Transformation and Job: a transformation performs the basic transformation of the data, and a job controls the overall workflow. You design what the process does through a graphical interface, and the Start entry of a job includes a timer that can trigger runs on a schedule such as daily or weekly.

Core Components of Kettle

Name      Function
Spoon     Graphical tool for designing the ETL transformation process (Transformation)
Pan       Command-line tool that runs transformations
Kitchen   Command-line tool that runs jobs
Carte     Lightweight web container for building a dedicated, remote ETL server

  • Jobs and transformations can be run in the GUI, but normally only during development, testing, and debugging. Once development is finished and the work is deployed to production, Spoon is rarely used; the Kitchen and Pan command-line tools take over.
  • In deployment and production, jobs and transformations are generally run from the command line: the commands go into shell scripts, and the scripts are scheduled to run regularly.
  • Kitchen and Pan are Kettle's command-line launchers. They are thin wrappers around the Kettle execution engine: they parse the command-line arguments and pass them to the engine.
  • Kitchen and Pan are very similar in concept and usage, and they take almost the same parameters. The only difference is that Kitchen runs jobs and Pan runs transformations.
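
The production pattern described above can be sketched as a small wrapper script. This is a minimal sketch under stated assumptions, not a definitive deployment: the install directory in PDI_DIR, the log directory, the file paths, and the run_kettle helper are hypothetical, and the script only prints the command it would run unless DRY_RUN=0 is set. The -file, -level, and -logfile options are standard Kitchen and Pan options.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own installation.
PDI_DIR="${PDI_DIR:-/opt/data-integration}"
LOG_DIR="${LOG_DIR:-/tmp/kettle-logs}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the command, 0 = really run it

# run_kettle TOOL FILE
#   TOOL is kitchen.sh (for .kjb jobs) or pan.sh (for .ktr transformations).
run_kettle() {
    tool="$1"
    file="$2"
    log="$LOG_DIR/$(basename "$file").log"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$PDI_DIR/$tool -file=$file -level=Basic -logfile=$log"
    else
        mkdir -p "$LOG_DIR"
        "$PDI_DIR/$tool" -file="$file" -level=Basic -logfile="$log"
    fi
}

# Kitchen runs a job; Pan runs a single transformation (hypothetical files).
run_kettle kitchen.sh /etl/jobs/daily_load.kjb
run_kettle pan.sh /etl/transformations/hello.ktr
```

To schedule such a script, a crontab entry like `0 2 * * * DRY_RUN=0 /etl/bin/run_daily.sh` (again a hypothetical path) would run it every day at 02:00.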

Kettle Concept Model

Kettle executes work at two levels: jobs (.kjb files) and transformations (.ktr files).

Simply put, a transformation is one ETL process, and a job is a collection of transformations and other jobs, in which they can be arranged in order, run on a schedule, and so on.

In practice, a single transformation should not be made too complicated. When data extraction requires many steps, split the work into several transformations, combine them in a job in the right order, and run the job.
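
Conceptually, a job that chains transformations behaves like the sketch below: each transformation runs in order, and the sequence stops at the first failure. This only illustrates the idea in shell and is not how Kettle implements jobs; in practice you would build the sequence as a .kjb job in Spoon and run it with Kitchen. The file names, the PDI_DIR location, and the run_in_sequence helper are hypothetical, and the sketch relies on Pan returning a non-zero exit code when a transformation fails.

```shell
#!/bin/sh
PDI_DIR="${PDI_DIR:-/opt/data-integration}"   # hypothetical install directory
RUNNER="${RUNNER:-$PDI_DIR/pan.sh}"           # command used to run one transformation

# run_in_sequence FILE...
#   Runs each transformation in order and stops at the first one that fails.
run_in_sequence() {
    for ktr in "$@"; do
        echo "running $ktr"
        if ! "$RUNNER" -file="$ktr"; then
            echo "failed at $ktr" >&2
            return 1
        fi
    done
    echo "all transformations finished"
}

# Example call (hypothetical files), commented out so the sketch has no side effects:
# run_in_sequence extract.ktr clean.ktr load.ktr
```

Keeping each transformation small, as recommended above, means a failure points at one short file rather than at one large, tangled process.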

Directory and file descriptions

Download and install

Official download page (all versions): https://sourceforge.net/projects/pentaho/files/Data%20Integration/
Kettle forum in China: https://www.kettle.net.cn/

Kettle is open source software written in pure Java, so you need to install a JDK and configure its environment variables first. After that, unpack the download and use it directly; no installation step is required.
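
Before launching Kettle it can help to confirm that Java is visible from the shell. The following is a minimal sketch: check_java is a hypothetical helper name, and checking JAVA_HOME first and then the PATH is a common convention rather than a description of Kettle's own startup logic.

```shell
#!/bin/sh
# check_java: report where a Java runtime was found, or fail if none is visible.
check_java() {
    if [ -n "$JAVA_HOME" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        echo "java found via JAVA_HOME: $JAVA_HOME/bin/java"
    elif command -v java >/dev/null 2>&1; then
        echo "java found on PATH: $(command -v java)"
    else
        echo "java not found: install a JDK and set JAVA_HOME or PATH" >&2
        return 1
    fi
}

check_java || echo "Kettle will not start without Java"
```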

You also need a database driver for each database you connect to. Place the driver JAR in Kettle's library folder (lib under the Kettle root directory in recent versions).

To open Kettle, run Spoon.bat (Windows) or spoon.sh (Linux / macOS); this launches the Spoon graphical tool.

Start Kettle

Run the ./spoon.sh command from the Kettle root directory.

Welcome page

HelloWorld

This example copies data from a CSV file to an Excel file.

CSV file input

Drag "CSV file input" onto the work area on the right and double-click it to edit. Browse to the prepared test file and click "Get Fields" to read the header information from the CSV file automatically. The input is now configured; the next step is to configure the output.
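
If you do not have a test file at hand, the sketch below creates a small one and prints the column names from its first line, which is roughly the information "Get Fields" reads from the header. The file name, columns, rows, and the list_fields helper are made up for illustration.

```shell
#!/bin/sh
# Write a small sample CSV (hypothetical data) to a temporary directory.
CSV_FILE="${TMPDIR:-/tmp}/kettle_hello.csv"
cat > "$CSV_FILE" <<'EOF'
id,name,city
1,Alice,Beijing
2,Bob,Shanghai
EOF

# list_fields FILE: print each header column on its own line.
list_fields() {
    head -n 1 "$1" | tr ',' '\n'
}

list_fields "$CSV_FILE"
```

Running it prints id, name, and city on separate lines; the same three fields should appear in the step's field list after you select this file.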

Excel output

Drag "Excel Output" onto the work area on the right and double-click it to edit. This step is simple: browse to the output directory and set the file name to complete the configuration.

Transformation file

Hold Shift and drag with the left mouse button from the CSV file input step to the Excel output step to create a connection (a hop) between them, then save the transformation.

Run the transformation

View Results

Summary

This article gave a preliminary look at Kettle's core components and how they are used:

  • Spoon is for designing, testing, and debugging jobs and transformations; in production, the Kitchen and Pan command-line tools are used instead.
  • In production, the command lines go into shell scripts, and the scripts are scheduled to run regularly.
  • Kitchen and Pan are thin wrappers around the Kettle execution engine and take almost the same parameters; Kitchen runs jobs and Pan runs transformations.

It also walked step by step through a HelloWorld transformation.

Follow the official account HelloTech for more content.