Production practice | Short video production and consumption monitoring based on Flink

Time: 2021-10-21

This article describes in detail the data link and the technical schemes behind real-time monitoring indicators. Most real-time monitoring indicators can be implemented with one of the schemes presented here.

Short video production and consumption monitoring

Short video has created a new communication medium and program format. As small screens and fast pacing have become the industry trend, short video has also spawned new user consumption habits, brought income to creators and merchants, and, through its diversity, opened marketing opportunities for brands.

Among these applications, monitoring and analyzing the production and consumption hotspots of short videos within a vertical ecosystem has become a common real-time data processing scenario: for example, monitoring video production or consumption within a delineated vertical, generating optimization and recommendation strategies for hot videos, and promoting their production or consumption. This builds a closed loop over the whole production and consumption data link, improving both creator income and consumer retention.

This article analyzes the complete, end-to-end flow of vertical ecosystem short video production and consumption data, and presents several Flink-based scheme designs for monitoring that production and consumption. From this article you can learn about:

  • The closed loop of the vertical ecosystem short video production and consumption data link
  • Scheme designs for real-time monitoring of short video production and consumption
  • Code implementations for scenarios at different monitoring scales
  • Flink learning materials

Project introduction

The flow diagram of the vertical ecosystem short video production and consumption data link is shown below; the same data flow also applies to other scenarios:

[Figure: flow diagram of the production and consumption data link]

In the scenario above, users produce and consume short videos, and the client, server, and database generate corresponding behavior logs. These logs are extracted into a message queue by log-collection middleware; in our current setup the queue is Kafka. Flink then monitors the production or consumption of videos within the vertical ecosystem (content production is usually scoped by a delineated pool of vertical author IDs, content consumption by a pool of vertical video IDs) and outputs real-time aggregated data downstream. Downstream, the results can be presented as data services and real-time dashboards. Finally, operations staff or automated tools analyze the current production or consumption hotspots under the vertical category and generate recommendation strategies from them.

Conceptual design

[Figure: conceptual design of the monitoring scheme]

The data sources are as follows:

  • Kafka: logs of all content production and consumption.
  • RPC / HTTP / MySQL / configuration center / Redis / HBase: the pool of vertical ecosystem content IDs to monitor (an author ID pool for content production, a video ID pool for content consumption). These sources mainly let operations staff dynamically configure the ID range to be monitored. Flink can query them in real time to determine the range of indicators the operations staff want, along with the monitored indicators and their calculation methods, and then process and emit the data. The configuration can be changed at any time, and real-time results are computed against it at any time.

The data sinks, which receive the aggregated hotspot or event indicators of content production or consumption, are as follows:

  • Redis / HBase: mainly provide low-latency (roughly 5 ms p99 for Redis and 100 ms p99 for HBase; service capabilities differ between companies), high-QPS data services for low-latency queries from the server side or online users (a minimal sink sketch follows this list).
  • Druid / MySQL: can serve as OLAP engines, offering flexible roll-up and drill-down aggregation for BI analysis, which operations staff can use to configure visual charts.
  • Kafka: the results can be produced as a data stream for downstream continuous consumption or feature extraction.
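
To make the Redis data-service path concrete, below is a minimal sink sketch. It is an assumption-laden illustration, not the article's implementation: it assumes a plain Jedis client, a hypothetical key scheme hot:video:{id}, and a count field on CommonModel; a production version would use a pooled client and batching.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import redis.clients.jedis.Jedis;

public class RedisHotVideoSink extends RichSinkFunction<CommonModel> {

    private transient Jedis jedis;

    @Override
    public void open(Configuration parameters) {
        // assumption: a local single-node Redis; use a pooled client in production
        this.jedis = new Jedis("localhost", 6379);
    }

    @Override
    public void invoke(CommonModel model, Context context) {
        // write the aggregated indicator with a TTL so stale hotspots age out
        jedis.setex("hot:video:" + model.getKeyField(), 3600, String.valueOf(model.getCount()));
    }

    @Override
    public void close() {
        if (jedis != null) {
            jedis.close();
        }
    }
}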

Without further ado, let's go straight to the schemes and code. The schemes below are distinguished by the size of the monitored ID range; different orders of magnitude call for different schemes. The code examples use ProcessWindowFunction, which could equally be replaced by AggregateFunction; the core monitoring logic is the same. A sketch of that alternative follows.
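
For reference, a minimal sketch of what the AggregateFunction counterpart could look like, here simply counting events per monitored ID; the monitoring-pool check would sit in an upstream filter or process step:

AggregateFunction<CommonModel, Long, Long> agg = new AggregateFunction<CommonModel, Long, Long>() {

    @Override
    public Long createAccumulator() {
        return 0L;
    }

    @Override
    public Long add(CommonModel value, Long accumulator) {
        // one event per record; richer indicators would accumulate a struct instead
        return accumulator + 1;
    }

    @Override
    public Long getResult(Long accumulator) {
        return accumulator;
    }

    @Override
    public Long merge(Long a, Long b) {
        return a + b;
    }
};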

Option 1

Suitable for scenarios with a small number of monitored IDs (thousands of IDs). The implementation loads the ID pool to be monitored, or the ID pool from the dynamic configuration center, into memory when the Flink task initializes, and then simply checks whether each content production or consumption record falls inside the in-memory monitoring pool.

ProcessWindowFunction p = new ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow>() {
    
    // dynamic ID pool from the configuration center
    private Config<Set<Long>> needMonitoredIdsConfig;

    @Override
    public void open(Configuration parameters) throws Exception {
        this.needMonitoredIdsConfig = ConfigBuilder
                .buildSet("needMonitoredIds", Long.class);
    }

    @Override
    public void process(Long bucket, Context context, Iterable<CommonModel> iterable, Collector<CommonModel> collector) throws Exception {
        Set<Long> needMonitoredIds = needMonitoredIdsConfig.get();
        // determine whether the ID in the CommonModel is in the needMonitoredIds pool
    }
};

The monitored ID pool may be fixed or configurable, so it can be obtained in two ways. The first is to load all IDs into memory when the Flink task starts, which suits pools that never change; a sketch of this variant follows below. The second is to use a dynamic configuration center, reading the latest monitored ID pool from it on every access, which supports dynamically adding to or changing the pool; this implementation can usually sense configuration changes in real time, with almost no delay.
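
A minimal sketch of the first approach, loading a fixed pool once in open(); Rpc.get(...) is the same placeholder loader used elsewhere in this article:

ProcessWindowFunction p = new ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow>() {

    // fixed ID pool, loaded once at task start and never refreshed
    private Set<Long> needMonitoredIds;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.needMonitoredIds = Rpc.get(...);
    }

    @Override
    public void process(Long bucket, Context context, Iterable<CommonModel> iterable, Collector<CommonModel> collector) throws Exception {
        // determine whether the ID in the CommonModel is in the needMonitoredIds pool
    }
};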

Option 2

Suitable for a moderate volume of monitored IDs (hundreds of thousands of IDs) whose range changes from time to time. The implementation periodically calls the interface from inside the Flink operator to fetch the latest monitored ID pool, and thus the latest monitoring range.

ProcessWindowFunction p = new ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow>() {

    private long lastRefreshTimestamp;

    private Set<Long> needMonitoredIds;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.refreshNeedMonitoredIds(System.currentTimeMillis());
    }

    @Override
    public void process(Long bucket, Context context, Iterable<CommonModel> iterable, Collector<CommonModel> collector) throws Exception {
        long windowStart = context.window().getStart();
        this.refreshNeedMonitoredIds(windowStart);
        // determine whether the ID in the CommonModel is in the needMonitoredIds pool
    }

    public void refreshNeedMonitoredIds(long windowStart) {
        // call the interface at most once every 10 seconds
        if (windowStart - this.lastRefreshTimestamp >= 10000L) {
            this.lastRefreshTimestamp = windowStart;
            this.needMonitoredIds = Rpc.get(...);
        }
    }
};

With the implementation above, the ID pool is refreshed on a fixed time interval. The drawback is that changes to the monitored pool are not sensed in real time, so the refresh interval tends to be strongly coupled to the use case (if the pool is updated frequently, the interval must be shortened). Depending on requirements, you can instead refresh the pool before each window starts, ensuring the pool used within every window is always up to date; a sketch follows.
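
A sketch of the per-window variant mentioned above, replacing the refresh method in the example. Note that process() fires once per key and window, so call volume rises accordingly:

public void refreshNeedMonitoredIds(long windowStart) {
    // refresh at each new window instead of every 10 seconds; the timestamp
    // guard deduplicates calls across keys that share the same window start
    if (windowStart != this.lastRefreshTimestamp) {
        this.lastRefreshTimestamp = windowStart;
        this.needMonitoredIds = Rpc.get(...);
    }
}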

Option 3

Option 3 is an optimization of Option 2 (hundreds of thousands of IDs; the option we use most in our production environment). The implementation uses a broadcast operator inside Flink: a dedicated source periodically fetches the monitored ID pool and sends it to the downstream computing operators as a broadcast. The optimization is this: suppose the task parallelism is 500 and the pool is fetched every second. Under Option 2, the QPS against the monitoring-pool interface is 500; with the broadcast operator it drops to 1, greatly reducing calls to the interface and the pressure on it.

public class Example {

    @Slf4j
    static class NeedMonitorIdsSource implements SourceFunction<Map<Long, Set<Long>>> {

        private volatile boolean isCancel;

        @Override
        public void run(SourceContext<Map<Long, Set<Long>>> sourceContext) throws Exception {
            while (!this.isCancel) {
                try {
                    TimeUnit.SECONDS.sleep(1);
                    Set<Long> needMonitorIds = Rpc.get(...);
                    // optionally compare with the previously fetched data and emit only on change
                    if (CollectionUtils.isNotEmpty(needMonitorIds)) {
                        sourceContext.collect(new HashMap<Long, Set<Long>>() {{
                            put(0L, needMonitorIds);
                        }});
                    }
                } catch (Throwable e) {
                    // prevent interface failures from bringing down the Flink job
                    log.error("need monitor ids error", e);
                }
            }
        }

        @Override
        public void cancel() {
            this.isCancel = true;
        }
    }

    public static void main(String[] args) {
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        InputParams inputParams = new InputParams(parameterTool);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        final MapStateDescriptor<Long, Set<Long>> broadcastMapStateDescriptor = new MapStateDescriptor<>(
                "config-keywords",
                BasicTypeInfo.LONG_TYPE_INFO,
                TypeInformation.of(new TypeHint<Set<Long>>() {
                }));

        /********************* kafka source *********************/
        BroadcastStream<Map<Long, Set<Long>>> broadcastStream = env
                .addSource(new NeedMonitorIdsSource()) // broadcast of the monitored photo ID data
                .setParallelism(1)
                .broadcast(broadcastMapStateDescriptor);

        DataStream<CommonModel> logSourceDataStream = SourceFactory.getSourceDataStream(...);

        /********************* dag *********************/
        DataStream<CommonModel> resultDataStream = logSourceDataStream
                .keyBy(KeySelectorFactory.getStringKeySelector(CommonModel::getKeyField))
                .connect(broadcastStream)
                .process(new KeyedBroadcastProcessFunction<String, CommonModel, Map<Long, Set<Long>>, CommonModel>() {

                    private Set<Long> needMonitoredIds;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        this.needMonitoredIds = Rpc.get(...);
                    }

                    @Override
                    public void processElement(CommonModel commonModel, ReadOnlyContext readOnlyContext, Collector<CommonModel> collector) throws Exception {
                        // determine whether the ID in the CommonModel is in the needMonitoredIds pool
                    }

                    @Override
                    public void processBroadcastElement(Map<Long, Set<Long>> longSetMap, Context context, Collector<CommonModel> collector) throws Exception {
                        // the ID pool to be monitored
                        Set<Long> needMonitorIds = longSetMap.get(0L);
                        if (CollectionUtils.isNotEmpty(needMonitorIds)) {
                            this.needMonitoredIds = needMonitorIds;
                        }
                    }
                });

        /********************* kafka sink *********************/
        SinkFactory.setSinkDataStream(...);
        
        env.execute(inputParams.jobName);
    }

}
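
The example above keeps the broadcast pool in a plain operator field, which is simple but not checkpointed. The declared broadcastMapStateDescriptor could instead back Flink's broadcast state, so the pool survives restores from checkpoints. A minimal sketch of that variant, replacing the two methods of the anonymous function above:

@Override
public void processBroadcastElement(Map<Long, Set<Long>> longSetMap, Context context, Collector<CommonModel> collector) throws Exception {
    // store the latest pool in checkpointed broadcast state instead of an operator field
    context.getBroadcastState(broadcastMapStateDescriptor).put(0L, longSetMap.get(0L));
}

@Override
public void processElement(CommonModel commonModel, ReadOnlyContext readOnlyContext, Collector<CommonModel> collector) throws Exception {
    Set<Long> needMonitoredIds = readOnlyContext.getBroadcastState(broadcastMapStateDescriptor).get(0L);
    // determine whether the ID in the CommonModel is in the needMonitoredIds pool
}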

Option 4

Suitable for a large monitored range (millions of IDs; in our own production practice this has been scaled to 5 million). The principle: the monitoring-range interface partitions its IDs into buckets according to some fixed rule. After Flink consumes the log data, it keys (keyBy) the records into buckets using the same rule as the interface, so each downstream parallel operator instance can fetch, by bucket name, exactly the monitored IDs of its own bucket. Since every parallel operator requests only its own bucket's data, the request pressure drops dramatically.

public class Example {

    public static void main(String[] args) {
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        InputParams inputParams = new InputParams(parameterTool);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        /********************* kafka source *********************/

        DataStream<CommonModel> logSourceDataStream = SourceFactory.getSourceDataStream(...);

        /********************* dag *********************/
        DataStream<CommonModel> resultDataStream = logSourceDataStream
                .keyBy(KeySelectorFactory.getLongKeySelector(CommonModel::getKeyField))
                .timeWindow(Time.seconds(inputParams.accTimeWindowSeconds))
                .process(new ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow>() {

                    private long lastRefreshTimestamp;

                    private Set<Long> oneBucketNeedMonitoredIds;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                    }

                    @Override
                    public void process(Long bucket, Context context, Iterable<CommonModel> iterable, Collector<CommonModel> collector) throws Exception {
                        long windowStart = context.window().getStart();
                        this.refreshNeedMonitoredIds(windowStart, bucket);
                        // determine whether the ID in the CommonModel is in the oneBucketNeedMonitoredIds pool
                    }

                    public void refreshNeedMonitoredIds(long windowStart, long bucket) {
                        // call the interface for this bucket at most once every 10 seconds
                        if (windowStart - this.lastRefreshTimestamp >= 10000L) {
                            this.lastRefreshTimestamp = windowStart;
                            this.oneBucketNeedMonitoredIds = Rpc.get(bucket, ...);
                        }
                    }
                });

        /********************* kafka sink *********************/
        SinkFactory.setSinkDataStream(...);

        env.execute(inputParams.jobName);
    }
}
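
The crux of Option 4 is that the Flink job and the monitoring-range interface must agree on the bucketing rule. Below is a minimal sketch under stated assumptions: a hypothetical modulo rule, a numeric key field on CommonModel, and a BUCKET_COUNT that must match the interface's configuration. In the example above it would stand in for the selector returned by KeySelectorFactory.getLongKeySelector.

import org.apache.flink.api.java.functions.KeySelector;

public class BucketKeySelector implements KeySelector<CommonModel, Long> {

    // hypothetical bucket count; must be identical on the interface side
    private static final long BUCKET_COUNT = 1000L;

    @Override
    public Long getKey(CommonModel commonModel) {
        // records whose IDs share a bucket land on the same parallel subtask,
        // which then fetches only that bucket's monitored IDs
        return commonModel.getKeyField() % BUCKET_COUNT;
    }
}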

Summary

This article first introduced the full closed loop of the production and consumption data link in the short video domain, a closed loop that generally applies to other scenarios as well, and then presented the corresponding real-time monitoring scheme designs and code implementations for different scenarios, covering:

  • The closed loop of the vertical ecosystem short video production and consumption data link: the whole path of user behavior logs flowing through log upload, real-time computation, BI, and data services, finally feeding back as data-driven decisions
  • Real-time monitoring scheme design: the choice of data sources and data sinks in the monitoring real-time computation
  • Concrete code implementations of the monitored ID pool for different scenarios

Learning materials

Flink
