Introduction: still struggling to pick a Kafka component for data cleaning and transformation? Try the Kafka ETL task feature!
No technologist is a stranger to data silos. As IT systems develop, enterprises inevitably build a variety of business systems. These systems operate independently, and the data they generate is closed off from one another, which makes data sharing and integration difficult and gives rise to "data silos".
Because the data is scattered across different databases and message queues, a computing platform that accesses it directly may run into availability, transmission-latency, and even throughput problems. At the business level, these scenarios come up all the time: aggregating business transaction data, migrating data from legacy systems to new ones, and integrating data from different systems. To integrate data in a more real-time and efficient way and to support these business scenarios, enterprises typically turn to ETL tools.
Accordingly, we see the variety of solutions enterprises have explored: custom scripts, enterprise service buses (ESB) and message queues (MQ), or enterprise application integration (EAI) to achieve seamless sharing and exchange of data across heterogeneous systems, applications, and data sources.
Although all of these approaches achieve some degree of real-time processing, they also force enterprises into a hard trade-off: real-time but not scalable, or scalable but batch-only. Meanwhile, as data technology and business requirements keep evolving, enterprises' demands on ETL are also rising:
- Beyond transactional data, it must handle increasingly rich data sources such as logs and metrics;
- Processing speed needs to improve beyond batch;
- The underlying technical architecture must support real-time processing and evolve toward being event-centric.
Stream-processing / real-time processing platforms are thus the cornerstone of event-driven interaction. They give enterprises a global data/event link, real-time data access, unified management of global data in a single system, and continuous indexing/query capabilities. For these technical and business requirements, Kafka offers a new approach:
- As a real-time, scalable message bus, it removes the need for enterprise application integration;
- It provides streaming data pipelines to all message-processing destinations;
- It serves as the basic building block of stateful stream-processing microservices.
Take data analysis for a shopping website as an example. To support fine-grained operations, the operations team and product managers need to aggregate many kinds of user behavior and business data, including but not limited to:
- User behavior data such as clicks, page views, add-to-cart actions, and logins;
- Basic log data;
- Data the app uploads proactively;
- Data from databases.
This data is collected into Kafka, and the data analysis tools then uniformly pull what they need from Kafka for analysis and computation. Because Kafka ingests many data sources in many formats, the data needs to be cleaned, for example filtered and formatted, before it enters the downstream analysis tools. Here the R&D team has two choices: (1) write code that consumes messages from Kafka, cleans them, and sends them to the target Kafka topic; (2) use a component for data cleaning and transformation, such as Logstash, Kafka Streams, Kafka Connect, or Flink.
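Option (1) might look like the minimal sketch below. The topic names, bootstrap address, and filtering rules are illustrative assumptions, not part of any real deployment; the Kafka client calls require the third-party kafka-python package and are kept inside a separate function so that the cleaning logic stays independently testable.

```python
import json

def clean_record(raw: bytes):
    """Filter and normalize one raw message; return None to drop it."""
    try:
        record = json.loads(raw)
    except (ValueError, UnicodeDecodeError):
        return None  # drop malformed messages
    # Keep only the behavior events the analysis tools care about (assumed rule).
    if record.get("event_type") not in {"click", "view", "cart", "login"}:
        return None
    # Re-emit a normalized subset of fields.
    return json.dumps({
        "user_id": record.get("user_id"),
        "event_type": record["event_type"],
        "ts": record.get("timestamp"),
    }).encode("utf-8")

def run_pipeline():
    # Requires kafka-python; server and topic names are assumptions.
    from kafka import KafkaConsumer, KafkaProducer
    consumer = KafkaConsumer("raw-events", bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    for msg in consumer:
        cleaned = clean_record(msg.value)
        if cleaned is not None:
            producer.send("clean-events", value=cleaned)
```

Even in this toy form, the drawback the article goes on to describe is visible: the team owns the consume loop, the error handling, and the deployment of this process for its whole lifetime.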
At this point you may well ask: Kafka Streams is a stream-processing library that exposes concrete classes for developers to call, and the application's runtime behavior is largely under the developer's control, which makes it convenient to use and debug. What could be the problem? While the approaches above solve the problem quickly, their drawbacks are just as obvious:
- The R&D team has to write its own code and keep maintaining it, which drives up operation and maintenance costs;
- For many lightweight or simple computation needs, the technical cost of introducing a new component is too high, and a technology-selection exercise is required;
- Once a component is selected, the team has to keep learning and maintaining it, incurring unplanned learning and maintenance costs.
To solve these problems, we provide a lighter-weight option: the Kafka ETL function.
With Kafka ETL, you simply configure the task in the Kafka console and write the cleaning code online to achieve ETL; the high-availability and maintenance concerns are left entirely to Kafka.
Next, let's walk through how to create a data ETL task in just three steps.
Step 1: create the task
Select the source Kafka instance and source topic, and the corresponding target Kafka instance and target topic. Then configure the initial message position, the failure-handling policy, and the resource-creation method.
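Conceptually, the task definition covers roughly the following information. The field names in this sketch are hypothetical, chosen only to mirror the console options listed above; they are not the console's actual configuration schema.

```
{
  "source": { "instance": "kafka-source-instance", "topic": "raw-events" },
  "target": { "instance": "kafka-target-instance", "topic": "clean-events" },
  "initialPosition": "latest",
  "failureHandling": "retry",
  "resourceCreation": "auto"
}
```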
Step 2: write the main ETL logic
Python 3 can be chosen as the function language. A variety of data-cleaning and data-transformation templates are also provided, covering common functions such as rule-based filtering, string replacement, and adding prefixes or suffixes.
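To make the template categories concrete, here is a plain-Python sketch of the three transformations named above, chained into one handler. The function names, the entry point, and the keyword/prefix values are illustrative assumptions, not the actual template code shipped by the console.

```python
def rule_filter(msg: str, keyword: str):
    """Rule-based filtering template: keep only messages containing keyword."""
    return msg if keyword in msg else None

def string_replace(msg: str, old: str, new: str) -> str:
    """String-replacement template."""
    return msg.replace(old, new)

def add_prefix(msg: str, prefix: str) -> str:
    """Add-prefix template (adding a suffix is symmetric)."""
    return prefix + msg

def etl_handle(msg: str):
    """Hypothetical main logic: normalize separators, filter, then tag."""
    msg = string_replace(msg, "\t", ",")      # tabs -> commas
    filtered = rule_filter(msg, "order")      # drop non-order messages
    if filtered is None:
        return None
    return add_prefix(filtered, "etl:")
```

A message like `"order\t123"` would come out as `"etl:order,123"`, while a message with no `order` keyword would be dropped.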
Step 3: configure the task's runtime and exception-handling parameters, then run it
As you can see, no extra component integration or complex configuration is needed. Lighter and cheaper, Kafka ETL starts an ETL task after just 3-5 visual configuration steps. For teams with relatively simple data ETL requirements, Kafka ETL is an excellent choice that lets them focus more on business development.
This article is original content from Alibaba Cloud and may not be reproduced without permission.