ETL, short for extract, transform, load, describes the process of extracting data from a source, transforming it, and loading it into a destination. ETL is a key part of building a data warehouse: the user extracts the required data from the data sources, cleans it, and finally loads it into the warehouse according to a predefined warehouse model. It is also worth noting the industry shift from ETL toward ELT, which loads raw data first and transforms it inside the warehouse, and which is rapidly becoming the standard pattern in modern cloud data environments.
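The three stages can be sketched in a few lines of Python; the CSV string and the in-memory "warehouse" below are stand-ins for real source and target systems:

```python
import csv
import io

# Extract: read rows from a source (a CSV string stands in for a real system)
source = "id,name,amount\n1,alice,10\n2,bob,not_a_number\n3,carol,25\n"
rows = list(csv.DictReader(io.StringIO(source)))

# Transform: clean the data, discarding rows that fail validation
def clean(row):
    try:
        return {"id": int(row["id"]), "name": row["name"].title(),
                "amount": float(row["amount"])}
    except ValueError:
        return None  # malformed row, e.g. a non-numeric amount

cleaned = [r for r in (clean(row) for row in rows) if r is not None]

# Load: write into the "warehouse" (here, just a list matching the model)
warehouse = []
warehouse.extend(cleaned)
print(len(warehouse))  # 2 valid rows survive cleaning
```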
Below, I list nine free and well-known ETL and job-scheduling tools on the market, followed by several dimensions to consider when choosing among them.
Excellent ETL tools
1. Apache Camel
Apache Camel is a very powerful rule-based routing and mediation engine that provides a POJO-based implementation of the enterprise integration patterns. You configure its routing and mediation rules through a powerful yet easy-to-use API, effectively a Java domain-specific language (DSL). With this DSL you can write type-safe, IDE-friendly rule descriptions in plain Java code.
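Camel's actual DSL is Java (e.g. routes built inside a `RouteBuilder`). As a language-neutral illustration of the routing idea only, and explicitly not Camel's API, here is a toy fluent router:

```python
# Toy rule-based router illustrating Camel-style routing (NOT the Camel API):
# a message flows from a source endpoint through a filter to a target endpoint.
class Route:
    def __init__(self, source):
        self.source = source
        self.predicate = lambda m: True
        self.target = None

    def filter(self, predicate):      # keep only messages matching the rule
        self.predicate = predicate
        return self

    def to(self, target):             # set the destination endpoint
        self.target = target
        return self

    def run(self, messages):
        return [(self.target, m) for m in messages if self.predicate(m)]

# Route error messages from an "inbox" endpoint to an "errors" endpoint,
# loosely analogous to Camel's from("file:inbox").filter(...).to("file:errors")
route = Route("file:inbox").filter(lambda m: "error" in m).to("file:errors")
print(route.run(["ok: started", "error: disk full"]))
```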
2. Apache Kafka
Apache Kafka is an open source messaging system written in Scala and Java. The project provides a unified, high-throughput, low-latency platform for processing real-time data feeds. It has the following characteristics:
- Message persistence through an O(1) disk data structure (an append-only commit log), which maintains stable performance even with terabytes of stored messages.
- High throughput: even on commodity hardware, Kafka can handle hundreds of thousands of messages per second.
- Support for partitioning messages across Kafka servers and for parallel consumption by consumer groups.
- Support for parallel data loading into Hadoop.
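The "O(1) disk data structure" is an append-only commit log: every write goes to the tail of the log, and consumers read sequentially from an offset, so the cost of a write or read does not grow with the amount of stored data. A toy in-memory sketch of the idea (not Kafka's actual protocol or storage format):

```python
# Toy append-only log illustrating Kafka-style persistence (not Kafka itself).
class Log:
    def __init__(self):
        self._records = []

    def append(self, record):
        # O(1): always write at the tail, regardless of log size
        self._records.append(record)
        return len(self._records) - 1  # offset of the new record

    def read(self, offset, max_count=10):
        # consumers read sequentially starting from an offset they track
        return self._records[offset:offset + max_count]

log = Log()
for msg in ["a", "b", "c"]:
    log.append(msg)
print(log.read(1))  # ['b', 'c']
```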
3. Apatar
Apatar, written in Java, is an open source data extraction, transformation and loading (ETL) project with a modular architecture. It provides a visual job designer and mapping tools, supports all mainstream data sources, and offers flexible GUI-based, server, and embedded deployment options. It can be used to integrate data across teams, populate data warehouses and data marts, and needs little or no code to maintain when connecting to other systems.
4. Heka
Heka, from Mozilla, is a tool for collecting and collating data from multiple different sources. After collecting and collating the data, it sends result reports to different targets for further analysis.
5. Logstash
Logstash is a platform for transporting, processing, managing and searching application logs and events. You can use it to collect and manage application logs, and it provides a web interface for queries and statistics. Logstash is now part of the Elasticsearch (Elastic) family.
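A Logstash pipeline is defined as an input/filter/output configuration file. A minimal sketch (the file path, grok pattern, and Elasticsearch host are illustrative placeholders):

```conf
# Minimal Logstash pipeline sketch: tail app logs, parse them, index them.
input {
  file { path => "/var/log/app/*.log" }
}
filter {
  grok { match => { "message" => "%{COMBINEDAPACHELOG}" } }
}
output {
  elasticsearch { hosts => ["localhost:9200"] }
}
```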
6. Scriptella
Scriptella is an open source ETL (extract, transform, load) and script execution tool developed in Java. It supports ETL scripts that span databases and can work with multiple data sources in a single ETL file. Scriptella integrates with any JDBC/ODBC-compliant driver, provides interoperability interfaces for non-JDBC data sources and scripting languages, and can also integrate with Java EE, Spring, JMX, JNDI and JavaMail.
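A Scriptella ETL file is plain XML that pairs connections with queries and scripts. The sketch below copies rows between two databases; the H2 in-memory URLs and table names are illustrative assumptions, not part of any particular setup:

```xml
<!-- Minimal Scriptella ETL file sketch (connections and tables illustrative) -->
<!DOCTYPE etl SYSTEM "http://scriptella.javaforge.com/dtd/etl.dtd">
<etl>
  <connection id="in"  driver="h2" url="jdbc:h2:mem:src"/>
  <connection id="out" driver="h2" url="jdbc:h2:mem:dst"/>
  <query connection-id="in">
    SELECT id, name FROM users;
    <script connection-id="out">
      INSERT INTO users_copy (id, name) VALUES (?id, ?name);
    </script>
  </query>
</etl>
```

Each row returned by the outer query is fed to the nested script, with `?id` and `?name` bound to the row's columns.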
7. Talend
Talend was the first open source software provider of ETL (extract, transform, load) tools in the data integration market. With its dual focus on technology and business, Talend brought a new perspective to ETL services: it broke with the traditional closed, proprietary model and offers an open, innovative, powerful and flexible software solution for companies of all sizes. Thanks to Talend, data integration solutions are no longer the exclusive preserve of large companies.
8. Kettle
Kettle is an open source ETL tool written in pure Java; it is portable (no installation required) and performs efficient, stable data extraction and migration. Kettle has two kinds of script files: transformations, which perform the basic data transformations, and jobs, which control the overall workflow.
9. Taskctl web (free version)
Taskctl, the first ETL scheduling software in China to support scheduling at the 100,000-job scale, is independently developed by Chengdu Taskctl Technology Co., Ltd. Its latest release is the Web version, built on the earlier commercial Taskctl 6.0.
The Taskctl Web application is a portable, agile scheduling tool for batch-job automation. It provides a simple way to manage the scheduling and monitoring of all kinds of complex jobs. Compared with the previous v1.2 C/S application, its functionality is complete and part of the operation logic has been simplified. It is suitable for beginners who want to try Taskctl, and can also be used in production for small and medium-sized projects.
For detailed software specifications, see:
- Task CTL, a simple ETL job scheduling tool
- Taskctl free application edition: ETL scheduling software with a permanent license at no cost
To obtain the tool, go to the official account [taskctl] and reply "software".
Selection of ETL tools
How should you choose an ETL tool for data integration? Generally, the following aspects need to be considered:
- Platform support.
- Data source support.
- Extraction and loading performance: whether throughput is high and the impact on the source business systems is kept low.
- Whether the data transformation and processing capabilities are strong.
- Whether it provides management and scheduling functions.
- Whether it is well integrated and open.