Taskctl is a domestic open source ETL tool, written in pure C, which can run on windows, Linux and UNIX.
To put it bluntly, it is necessary to understand the essential features and functions of general ETL tools, so as to better master the use of taskctl.
Today, we will first describe the general functions of ETL tools.
One of the functions of ETL tool: connection
Any ETL tool should have the ability to connect to a wide range of data sources and data formats. For the most commonly used relational database system, local connection mode (such as OCI for Oracle) should also be provided,ETL should provide the following basic functions:
Connect to common relational databases and get data, such as orcal, MS SQL server, IBM dB / 2, Ingres, MySQL and PostgreSQL. There are also many ways to get data from ASCII files with separators and fixed formats, from XML files, from popular office software, such as access database and Excel spreadsheet, using FTP, SFTP, SSH to get data (preferably without scripts), and from web services or RSS. If you need some data in ERP system, such as Oracle E-Business Suite, SAP / R3, PeopleSoft or JD / Edwards, ETL tool should also provide the connection to these systems. What else can we offer http://Salesforce.com And SAP / R3… But not in the suite, need additional installation. Other solutions are needed for data extraction of other ERP and financial systems. Of course, the most common method is to require these systems to export data in text format and use the text data as the data source.
The second function of ETL tool: platform independence
An ETL tool should be able to run on any platform or even a combination of different platforms. A 32-bit operating system may run well in the initial stage of development, but when the amount of data is increasing, a more powerful operating system is needed. On the other hand, development usually runs on windows or Mac. While the production environment is usually Linux system or cluster, your ETL solution should be able to switch between these systems seamlessly.
The third function of ETL tool: data scale
Generally, ETL can process big data in the following three ways.
- Concurrency: ETL process can process multiple data streams at the same time, so as to take advantage of modern multi-core hardware architecture.
- Partition: ETL can use specific partition mode to distribute data to concurrent data streams.
- Cluster: ETL process can be distributed on multiple machines.
Kettle is a Java based solution that can run on any computer installed with Java virtual machine (including windows, Linux and MAC). Each step in the transformation is executed in a concurrent way, and can be executed many times, which speeds up the processing speed.
When kettle runs the transformation, according to the user’s settings, it can send data to multiple data streams in different ways (there are two ways to send copies: distribution and replication). Distribution is similar to the distribution of playing cards. Each row of data is sent to only one data stream in turn. Replication is to send each row of data to all data streams.
In order to control the data more accurately, kettle also uses the partition mode, through which the data of the same feature can be sent to the same data stream. The partition here is only conceptually similar to a database partition.
Kettle has no function for database partitioning.
The fourth function of ETL tool: design flexibility
An ETL tool should leave enough freedom for developers to use, instead of limiting users’ creativity and design requirements in a fixed way. ETL tools can be divided into process based and mapping based.
The mapping based function only provides a set of fixed steps between the source data and the destination data, which seriously limits the freedom of design work. Mapping based tools are generally easy to use and can be used quickly, but for more complex tasks, process based tools are the best choice.
Using process based tools like kettle, you can create custom steps and transformations according to actual data and requirements.
The fifth function of ETL tool: reusability
It is very important that the designed ETL transformation can be reused. Copying and pasting the existing conversion steps is the most common kind of reuse, but it is not really reuse.
There is a mapping (sub transformation) step in taskctl, which can complete the reuse of transformation. In this step, one transformation can be used as a sub transformation of other transformations. In addition, the transformation can be used multiple times in multiple jobs, and the same job can be a sub job of other jobs.
The sixth function of ETL tool: extensibility
As we all know, almost all ETL tools provide scripts to solve the problems that can’t be solved by the tools themselves. In addition, there are a few ETL tools that can add components to the tool through API or other ways. Use script language to write functions, which can be called by other transformations or scripts.
Kettle provides all of the above functions. Java script step can be used to develop java script, save the script as a transformation, and then through the mapping (sub transformation) step, it can be turned into a standard reusable function. In fact, not limited to scripts, each transformation can be reused in this way of mapping (sub transformation), just like creating a component. Kettle is extensible in design and provides a plug-in platform. This plug-in architecture allows third parties to develop plug-ins for the kettle platform.
All plug-ins in kettle, even the components provided by default, are actually plug-ins. The only difference between built-in third-party plug-ins and pentaho plug-ins is technical support. Suppose you buy a third-party plug-in (such as a sugar CRM connection), and technical support is provided by a third party, not pentaho.
The seventh function of ETL tool: data conversion
A large part of ETL project is data conversion. Between input and output,Data should be checked, connected, separated, merged, transposed, sorted, merged, cloned, duplicated, filtered, deleted, replaced or other operations。
In different organizations, projects and solutions, the requirements of data conversion are very different, so it is difficult to say which conversion functions an ETL tool should provide at least.
However, common ETL tools (including taskctl)They all provide the following basic integration functions:
- Slowly changing dimensions
- Query value
- Row column conversion
- Conditional separation
- Sort, merge, connect
The eighth function of ETL tool: testing and debugging
Testing is usually divided into black box testing (also known as functional testing) and white box testing (structural testing).
Black box testing, ETL conversion is considered a black box, testers do not understand the function of the black box, only know the input and expected output.
White box testing requires the tester to know the internal working mechanism of the transformation and design test cases to check whether a specific transformation has a specific result.
Debugging is actually a part of white box testing, which allows developers or testers to run a transformation step by step and find out the problem.
The ninth function of ETL tool: pedigree analysis and impact analysis
Any ETL tool should have an important function: reading the metadata of transformation, which is to extract the information of data streams composed of different transformations.
Lineage analysis and impact analysis are two related features based on metadata.
Lineage is a kind of retrospective mechanism, which can see the source of data.
Impact analysis is another analysis method based on metadata, which can analyze the impact of source data on subsequent transformation and target table.
Function 10 of ETL tool: log and audit
The purpose of data warehouse is to provide an accurate information source, so the data in data warehouse should be reliable and trustworthy. In order to ensure the reliability of the array and record all data conversion operations, ETL tool should provide log and audit functions.
The log can record which steps were performed in the conversion process, including the start and end timestamps of each step.
Audit can trace all operations on data, including the number of rows read, converted and written.
Yes, our official account.“Agile scheduling task CTL”(ID:Gh_ 79ababc7910b) long term update of the latest Internet information, workplace anecdotes in the industry, as well as interesting and practical programming plug-ins and development framework knowledge sharing, etc,
If you are happy, you can pay attention to it! Looking forward to