Brief analysis of JdbcSink

Time: 2021-09-06

1. JdbcSink

JdbcSink adds a JDBC sink output to a DataStream. It exposes two main factory methods: sink() and exactlyOnceSink(). exactlyOnceSink() is a transactional method newly added in Flink 1.13; this article focuses on the sink() method.

public static <T> SinkFunction<T> sink(
        String sql,
        JdbcStatementBuilder<T> statementBuilder,
        JdbcExecutionOptions executionOptions,
        JdbcConnectionOptions connectionOptions) {
    return new GenericJdbcSinkFunction<>(
            new JdbcBatchingOutputFormat<>(
                    new SimpleJdbcConnectionProvider(connectionOptions),
                    executionOptions,
                    context -> {
                        Preconditions.checkState(
                                !context.getExecutionConfig().isObjectReuseEnabled(),
                                "objects can not be reused with JDBC sink function");
                        return JdbcBatchStatementExecutor.simple(
                                sql, statementBuilder, Function.identity());
                    },
                    JdbcBatchingOutputFormat.RecordExtractor.identity()));
}

1.1 Parameters

The method takes four parameters; the third parameter, executionOptions, may be omitted, in which case default values are used. A usage sketch follows the parameter list below.

  • sql

    A String holding the SQL statement template, normally in PreparedStatement form, for example: insert into wordcount (word, cnt) values (?, ?)

  • statementBuilder

    A JdbcStatementBuilder that maps the fields of each stream record to the columns of the SQL statement, i.e. it fills in the placeholders of the PreparedStatement built from the previous parameter.

  • executionOptions

    The execution options for Flink's JDBC output, which mainly control when a flush is triggered. Three settings are exposed: the batch size threshold, the batch interval threshold, and the maximum number of retries. The batch size defaults to 5000 and the batch interval defaults to 0, i.e. the time-based trigger is disabled. Note that the thresholds should not be set too low, otherwise the frequent flushes may overload the database.

  • connectionOptions

    A JdbcConnectionOptions instance that sets the database connection properties, including the URL, driver name, username, password, etc.
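
Putting the four parameters together, a minimal usage sketch might look like the following. It assumes a DataStream<Tuple2<String, Integer>> named wordCounts and a table wordcount(word, cnt); the table name, column names, and connection values are illustrative, not taken from the Flink sources.

wordCounts.addSink(
        JdbcSink.sink(
                // sql: a PreparedStatement-style template
                "insert into wordcount (word, cnt) values (?, ?)",
                // statementBuilder: fill the two placeholders from the record
                (ps, t) -> {
                    ps.setString(1, t.f0);
                    ps.setInt(2, t.f1);
                },
                // executionOptions: flush every 200 records or every 2 seconds, retry up to 3 times
                JdbcExecutionOptions.builder()
                        .withBatchSize(200)
                        .withBatchIntervalMs(2000)
                        .withMaxRetries(3)
                        .build(),
                // connectionOptions: database connection properties
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:mysql://localhost:3306/test")
                        .withDriverName("com.mysql.cj.jdbc.Driver")
                        .withUsername("root")
                        .withPassword("root")
                        .build()));

JdbcSink, JdbcExecutionOptions, and JdbcConnectionOptions all live in the org.apache.flink.connector.jdbc package.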

1.2 Return value

The method returns a GenericJdbcSinkFunction, an implementation of SinkFunction whose core is its JdbcBatchingOutputFormat parameter.

The core methods of GenericJdbcSinkFunction are shown below; they all delegate to JdbcBatchingOutputFormat.

@Override
public void open(Configuration parameters) throws Exception {
    super.open(parameters);
    RuntimeContext ctx = getRuntimeContext();
    outputFormat.setRuntimeContext(ctx);
    outputFormat.open(ctx.getIndexOfThisSubtask(), ctx.getNumberOfParallelSubtasks());
}

@Override
public void invoke(T value, Context context) throws IOException {
    outputFormat.writeRecord(value);
}

@Override
public void snapshotState(FunctionSnapshotContext context) throws Exception {
    outputFormat.flush();
}

2. JdbcBatchingOutputFormat

JdbcBatchingOutputFormat is the implementation class that interacts with JDBC. Records are batched before being written out to the database.

2.1 Parameters

The class takes four constructor parameters:

  • JdbcConnectionProvider

    Provides the JDBC connection.

  • JdbcExecutionOptions

    The batch execution options described in section 1.1.

  • StatementExecutorFactory

    A factory for the statement executor, i.e. the component that maps stream records to database columns.

  • RecordExtractor

    Extracts the record to be written from each incoming element.

2.2 The open method

The open method initializes the database connection and does the other preparatory work. The call chain is:

Task.doRun()
    ->invokable.invoke()->DataSinkTask.invoke()
        ->format.open()->JdbcBatchingOutputFormat.open()

2.2.1 Connecting to the database

The first step of the open method is to connect to the database: it calls the open method of the parent class AbstractJdbcOutputFormat, which in turn calls getOrEstablishConnection() on SimpleJdbcConnectionProvider, the implementation of JdbcConnectionProvider, to establish the connection. getOrEstablishConnection() is shown below.

public Connection getOrEstablishConnection() throws SQLException, ClassNotFoundException {
    if (connection != null) {
        return connection;
    }
    if (jdbcOptions.getDriverName() == null) {
        connection =
                DriverManager.getConnection(
                        jdbcOptions.getDbURL(),
                        jdbcOptions.getUsername().orElse(null),
                        jdbcOptions.getPassword().orElse(null));
    } else {
        Driver driver = getLoadedDriver();
        Properties info = new Properties();
        jdbcOptions.getUsername().ifPresent(user -> info.setProperty("user", user));
        jdbcOptions.getPassword().ifPresent(password -> info.setProperty("password", password));
        connection = driver.connect(jdbcOptions.getDbURL(), info);
        if (connection == null) {
            // Throw same exception as DriverManager.getConnection when no driver found to match
            // caller expectation.
            throw new SQLException(
                    "No suitable driver found for " + jdbcOptions.getDbURL(), "08001");
        }
    }
    return connection;
}

There are two cases, depending on whether a driver name has been configured. If it has not, the driver is resolved automatically from the URL: java.sql.DriverManager manages the JDBC drivers found on the classpath and picks the one that matches the URL. If a driver name has been configured, that driver class is loaded and used directly to open the connection.
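
From the caller's side, the two cases simply correspond to whether withDriverName() is used when building JdbcConnectionOptions. A small sketch with illustrative connection values:

// Driver name omitted: DriverManager resolves a matching driver from the URL.
JdbcConnectionOptions resolvedByUrl =
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:mysql://localhost:3306/test")
                .withUsername("root")
                .withPassword("root")
                .build();

// Driver name set: the named driver class is loaded and used directly.
JdbcConnectionOptions explicitDriver =
        new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                .withUrl("jdbc:mysql://localhost:3306/test")
                .withDriverName("com.mysql.cj.jdbc.Driver")
                .withUsername("root")
                .withPassword("root")
                .build();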

2.2.2 JdbcExec

The JdbcExec statement executor is created from the StatementExecutorFactory. The implementation used here is SimpleBatchStatementExecutor, created via JdbcBatchStatementExecutor.simple() as configured in the sink() method above. The actual work of this step is to prepare the statement:

@Override
public void prepareStatements(Connection connection) throws SQLException {
    this.st = connection.prepareStatement(sql);
}

2.2.3 Scheduler

Database throughput is limited, so Flink usually writes to the database in batches. This step sets up the time-based scheduling; the parameters are described in section 1.1. Note the two special configuration values: if the batch interval is 0 or the batch size is 1, no scheduler is created.

if (executionOptions.getBatchIntervalMs() != 0 && executionOptions.getBatchSize() != 1) {

The scheduling thread pool created here contains only one thread

this.scheduler =
        Executors.newScheduledThreadPool(
                1, new ExecutorThreadFactory("jdbc-upsert-output-format"))

The task the scheduler runs is the heart of the whole class: flushing the buffered data to the database.

synchronized (JdbcBatchingOutputFormat.this) {
    if (!closed) {
        try {
            flush();
        } catch (Exception e) {
            flushException = e;
        }
    }
}
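
Putting the three fragments above together, the scheduling set up in open() looks roughly like this (slightly simplified from the Flink source; field names follow the snippets above):

if (executionOptions.getBatchIntervalMs() != 0 && executionOptions.getBatchSize() != 1) {
    this.scheduler =
            Executors.newScheduledThreadPool(
                    1, new ExecutorThreadFactory("jdbc-upsert-output-format"));
    this.scheduledFuture =
            this.scheduler.scheduleWithFixedDelay(
                    () -> {
                        // Periodically flush the batch unless the format has been closed.
                        synchronized (JdbcBatchingOutputFormat.this) {
                            if (!closed) {
                                try {
                                    flush();
                                } catch (Exception e) {
                                    flushException = e;
                                }
                            }
                        }
                    },
                    executionOptions.getBatchIntervalMs(),
                    executionOptions.getBatchIntervalMs(),
                    TimeUnit.MILLISECONDS);
}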

2.3 The writeRecord method

writeRecord is the core method of the class for writing data. It performs two operations: adding the record to the batch list and, when the threshold is reached, flushing the batch to the database.

try {
    addToBatch(record, jdbcRecordExtractor.apply(record));
    batchCount++;
    if (executionOptions.getBatchSize() > 0
            && batchCount >= executionOptions.getBatchSize()) {
        flush();
    }
} catch (Exception e) {
    throw new IOException("Writing records to JDBC failed.", e);
}

2.3.1 Caching data

The cached data is held in a plain ArrayList, defined in SimpleBatchStatementExecutor:

SimpleBatchStatementExecutor(
        String sql, JdbcStatementBuilder<V> statementBuilder, Function<T, V> valueTransformer) {
    this.sql = sql;
    this.parameterSetter = statementBuilder;
    this.valueTransformer = valueTransformer;
    this.batch = new ArrayList<>();
}

As mentioned above, batch holds the cached data. Adding a record looks like this:

@Override
public void addToBatch(T record) {
    batch.add(valueTransformer.apply(record));
}

valueTransformer simply returns its input; it is set to Function.identity() at the beginning of sink():

return JdbcBatchStatementExecutor.simple(
        sql, statementBuilder, Function.identity());
        
/**
 * Returns a function that always returns its input argument.
 *
 * @param <T> the type of the input and output objects to the function
 * @return a function that always returns its input argument
 */
static <T> Function<T, T> identity() {
    return t -> t;
}

2.3.2 Flush

flush() writes the cached data out to the database; it ultimately calls the executeBatch method of SimpleBatchStatementExecutor:

@Override
public void executeBatch() throws SQLException {
    if (!batch.isEmpty()) {
        for (V r : batch) {
            parameterSetter.accept(st, r);
            st.addBatch();
        }
        st.executeBatch();
        batch.clear();
    }
}
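
For completeness: executeBatch is invoked from JdbcBatchingOutputFormat.flush(), which wraps it in a retry loop driven by the maxRetries option from section 1.1. A simplified sketch of that wrapper (the real Flink method also revalidates the connection and sleeps between attempts):

@Override
public synchronized void flush() throws IOException {
    checkFlushException();
    for (int i = 0; i <= executionOptions.getMaxRetries(); i++) {
        try {
            // attemptFlush() delegates to jdbcStatementExecutor.executeBatch()
            attemptFlush();
            batchCount = 0;
            break;
        } catch (SQLException e) {
            // Simplified: the real code checks the connection, reestablishes it if
            // needed, and backs off before the next attempt.
            if (i >= executionOptions.getMaxRetries()) {
                throw new IOException(e);
            }
        }
    }
}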