Real time data engine series (V): about SQL server and SQL Server CDC

Time:2021-9-23

Abstract: among enterprise customers, SQL Server still exudes lasting vitality in the traditional manufacturing industry. The CDC complexity of SQL server is lower than that of Oracle. Therefore, the standard official practice is to directly use this CDC interface for synchronization, but there are also people in the Jianghu who directly read database changes by naked parsing LDF files. This paper will discuss these two schools.

preface

Last sent aboutOracle CDCThere are several databases of this type. Let’s talk about the three closed source databases first: Oracle, SQL server and DB2.

Among enterprise customers, the use scope of SQL Server far exceeds my previous expectations. This database, which can hardly be seen by Internet users, still exudes lasting vitality in traditional manufacturing and enterprise customers. In the ranking of DB engine, SQL Server ranks third after Oracle and mysql, It is one of the databases that tapdata often meets when the customer scenario lands.

Real time data engine series (V): about SQL server and SQL Server CDC

Two sects

The CDC complexity of SQL server is lower than that of Oracle, and the official supports this function since version 08. Only before version 16, this setting is only available in the enterprise version. After version 16, the function can also be enabled in the community version. Therefore, the standard official practice is to directly use this CDC interface for synchronization.

After the CDC function is enabled, SQL server will synchronize the changed contents to a table with a primary key. The business party will get the new transaction changes by polling the table with a primary key.

This is one of the best solutions. There are several problems:

  1. Not supported before version 08.
  2. Before version 16, only enterprise version is supported.
  3. Relying on polling, the real-time performance is poor, and the CDC table itself has a delay. Under normal circumstances, the delay is often more than seconds.
  4. If you forget to take the data from the table, the table will expand indefinitely.
  5. The removed data will be cleaned up and the log cannot be played back.
  6. The configuration is complex. In the case of AlwaysOn, the trigger of CDC job recovery of available node switching needs to be considered.

SQL server will record operations in files ending with LDF when conducting transaction operations. Therefore, all kinds of reckless people in the Jianghu have aimed at this and developed their own data tracking routines.

Many manufacturers engaged in operation audit or data synchronization began to try to bypass CDC and directly read database changes by naked parsing LDF files. Their advantages are:

  1. The load is separated from the database, the performance can be optimized separately, the speed is very fast, and the bottleneck is Io.
  2. It can shield the differences brought by various deployment structures, and LDF files are eternal.
  3. Event triggering, compared with polling CDC table, the message delay is very low, which can be within tens of milliseconds.

The disadvantage is that it increases the deployment complexity. It is necessary to deploy a separate agent on the database machine to collect logs, and it is necessary to consider the problem of agent transmission logs during high availability switching.

Compared with the various versions of Oracle and the completely elusive binary format, the parsing of SQL server is relatively simple, and there are many commercial products implemented in the industry. However, at present, there is no version available under Linux or open source implementation.

Solution of tapdata

By default, tapdata uses the method of polling the CDC incremental table for data acquisition. This method is user-friendly and does not require additional deployment services. At the same time, it supports high availability features such as automatic switching of AlwaysOn and automatic recovery of CDC tasks, which is easy to use.

When the amount of data is very large, the performance of CDC table is insufficient, or customers have very high requirements for CDC delay, tapdata can obtain several times the performance of traditional methods and effectively reduce the delay in the form of agent transmission + raw log file parsing

A small problem

Oracle log parsing is difficult, there are few open source projects, and SQL server has a lot less content. Moreover, Google has a large number of commercial software doing some auditing, observation and synchronization based on bare log parsing. Why can’t you still see decent open source projects?

Further understandingTapdata real time data service platform, more technical articles can be found atTapdata technology blog。 Real time synchronization tool for heterogeneous database developed by tapdata——Tapdata CloudIt is now available to technology developers for free. At present, it supports data migration and synchronization between Oracle, mysql, PostgreSQL, SQL server, mongodb, elasticsearch, Dameng, Kafka and other mainstream libraries, that is, DB2, Sybase ASE, redis, gbase, gaussdb, etc.

Author: Xiao Beibei, technical partner of tapdata,Source address

Recommended Today

Seven Python code review tools recommended

althoughPythonLanguage is one of the most flexible development languages at present, but developers often abuse its flexibility and even violate relevant standards. So PythoncodeThe following common quality problems often occur: Some unused modules have been imported Function is missing arguments in various calls The appropriate format indentation is missing Missing appropriate spaces before and after […]