Production practice | Production and consumption monitoring of short video based on Flink

Time: 2020-12-06

This article describes in detail the data flow link and the technical schemes behind real-time monitoring indicators. Most real-time monitoring indicators can be implemented with one of the schemes presented here.

Production and consumption monitoring of short video

Short video has opened up a new field of communication and a new program format; small screens and a fast pace have become the industry trend. It has also given rise to new consumption habits among users, bringing income to creators and merchants, while the diversity of short video offers marketing opportunities for brands.

Among these, monitoring and analyzing production and consumption hotspots of short videos within a vertical ecology has become a very common real-time data processing scenario: monitor video production or consumption in a given vertical ecology, generate optimized recommendation strategies for hot videos to promote their production or consumption, and build out the entire production and consumption data chain, thereby improving creator income and consumer retention.

This article analyzes how vertical-ecology short video production and consumption data flows through the whole link, and presents several Flink-based designs for monitoring vertical video production and consumption. From this article you can learn:

  • The closed data-link loop of vertical-ecology short video production and consumption
  • Scheme designs for real-time monitoring of short video production and consumption
  • Code implementations at different monitoring magnitudes
  • Flink learning materials

Project introduction

The data link flow architecture of vertical-ecology short video production and consumption is shown below; it is generally applicable to other scenarios as well:

[Figure: data link flow architecture of short video production and consumption]

In the scenario above, users produce and consume short videos, and the client, server, and database generate the corresponding behavior logs. These logs are extracted into a message queue by log-collection middleware; in our case the message queue is Kafka. Flink then monitors the production or consumption of videos in the vertical ecology (content production is usually delimited by a pool of vertical-category author IDs, content consumption by a pool of vertical video IDs). Finally, the real-time aggregated data is emitted downstream, where it can be exposed as data services and real-time dashboards; operations staff or automated tools then analyze the current production or consumption hotspots in the vertical category and generate recommendation strategies.
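To make the link concrete, here is a minimal end-to-end sketch of such a job. It assumes each Kafka record is a plain video ID per line; the topic names, bootstrap server, and hard-coded ID pool are illustrative placeholders only, and schemes 1 to 4 below show realistic ways to load the pool.

// Minimal end-to-end sketch of the link above, under the assumptions stated in
// the text. Topic names and the hard-coded ID pool are placeholders.
import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import java.util.Set;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class LinkSketch {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");

        // Placeholder monitored ID pool; schemes 1-4 below show realistic ways to load it
        Set<Long> monitoredIds = new HashSet<>(Arrays.asList(1L, 2L, 3L));

        env.addSource(new FlinkKafkaConsumer<>("video-consume-log", new SimpleStringSchema(), props))
                .map(Long::valueOf)                           // each record is a plain video ID
                .filter(monitoredIds::contains)               // keep only monitored vertical IDs
                .map(id -> Tuple2.of(id, 1L))
                .returns(Types.TUPLE(Types.LONG, Types.LONG)) // type hint for the lambda
                .keyBy(t -> t.f0)
                .timeWindow(Time.minutes(1))
                .sum(1)                                       // per-video consumption count per minute
                .map(t -> t.f0 + "," + t.f1)                  // CSV line: videoId,count
                .addSink(new FlinkKafkaProducer<>("video-hot-metrics", new SimpleStringSchema(), props));

        env.execute("short-video-monitoring-sketch");
    }
}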

Conceptual design

[Figure: conceptual design, showing data sources, Flink processing, and data sinks]

The data sources are as follows:

  • Kafka: logs of all content production and consumption.
  • RPC / HTTP / MySQL / configuration center / Redis / HBase: the vertical-ecology content ID pool to be monitored (an author ID pool for content production, a video ID pool for content consumption). It mainly lets operations staff dynamically configure the ID range to monitor. Inside Flink it defines the range of monitoring indicators the operations staff want, as well as the indicators themselves and how they are calculated, so the configuration can be changed at any time and the real-time output recomputed at any time.

The data sinks are as follows; the output is aggregated into hot-topic or event indicators of content production or consumption:

  • Redis / HBase: mainly provide low-latency (roughly 5 ms p99 for Redis and 100 ms p99 for HBase; service capabilities differ between companies), high-QPS data services, giving the server or online users low-delay queries.
  • Druid / MySQL: can act as OLAP engines, providing flexible roll-up and drill-down aggregate analysis for BI, which operations staff can use to configure visual charts.
  • Kafka: the results can be produced as a stream for downstream consumption or feature extraction.

Without further ado, let's go straight to the schemes and code. The schemes below are distinguished by the size of the monitored ID range; different magnitudes call for different schemes. The code examples use ProcessWindowFunction, which could equally be an AggregateFunction; the core monitoring logic is the same.
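All of the examples below pass around a CommonModel, which this article does not define. A hypothetical minimal version, assuming it only carries the monitored ID and an aggregate count, might look like this:

// Hypothetical minimal CommonModel for the examples below; the real class is not
// shown in this article. Flink POJOs need a public no-arg constructor and accessors.
public class CommonModel {

    private Long keyField; // author ID (production) or video ID (consumption)
    private long count;    // aggregated production/consumption count

    public CommonModel() {
    }

    public Long getKeyField() {
        return keyField;
    }

    public void setKeyField(Long keyField) {
        this.keyField = keyField;
    }

    public long getCount() {
        return count;
    }

    public void setCount(long count) {
        this.count = count;
    }
}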

Option 1

Suitable for monitoring scenarios with a small number of IDs (thousands). The implementation loads the ID pool to be monitored, or the ID pool from a dynamic configuration center, into memory when the Flink task initializes; afterwards it only needs to check whether each content production or consumption record is in the monitoring pool.

ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow> p = new ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow>() {
    
    // ID pool from the dynamic configuration center
    private Config<Set<Long>> needMonitoredIdsConfig;

    @Override
    public void open(Configuration parameters) throws Exception {
        this.needMonitoredIdsConfig = ConfigBuilder
                .buildSet("needMonitoredIds", Long.class);
    }

    @Override
    public void process(Long bucket, Context context, Iterable<CommonModel> iterable, Collector<CommonModel> collector) throws Exception {
        Set<Long> needMonitoredIds = needMonitoredIdsConfig.get();
        // Determine whether the ID in the CommonModel is in the needMonitoredIds pool
    }
};

The monitored ID pool can be loaded in two ways. The first loads all IDs into memory once when the Flink task starts, which suits a monitoring ID pool that never changes. The second uses a dynamic configuration center and reads the latest monitoring ID pool from it on every access, which supports configuring or changing the ID pool on the fly; this implementation usually perceives configuration changes with little delay.
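For completeness, the first (static) way might look like the sketch below: the pool is fetched exactly once in open(), and the window body performs the membership check that the examples elide. Rpc.get stands in for whatever lookup is actually used.

// Sketch of the static variant of scheme 1, assuming the ID pool never changes
// while the job runs. Rpc.get is the article's placeholder lookup call.
ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow> staticPool =
        new ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow>() {

    private Set<Long> needMonitoredIds;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.needMonitoredIds = Rpc.get(...); // loaded exactly once, at task initialization
    }

    @Override
    public void process(Long bucket, Context context, Iterable<CommonModel> iterable,
                        Collector<CommonModel> collector) throws Exception {
        for (CommonModel model : iterable) {
            if (needMonitoredIds.contains(model.getKeyField())) {
                collector.collect(model); // emit only records whose ID is monitored
            }
        }
    }
};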

Option 2

Suitable for a moderate number of monitored IDs (hundreds of thousands) where the monitored range changes from time to time. The implementation periodically fetches the latest monitoring ID pool inside the Flink operator to obtain the latest monitoring range.

ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow> p = new ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow>() {

    private long lastRefreshTimestamp;

    private Set<Long> needMonitoredIds;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.refreshNeedMonitoredIds(System.currentTimeMillis());
    }

    @Override
    public void process(Long bucket, Context context, Iterable<CommonModel> iterable, Collector<CommonModel> collector) throws Exception {
        long windowStart = context.window().getStart();
        this.refreshNeedMonitoredIds(windowStart);
        // Determine whether the ID in the CommonModel is in the needMonitoredIds pool
    }

    public void refreshNeedMonitoredIds(long windowStart) {
        // Refresh at most once every 10 seconds
        if (windowStart - this.lastRefreshTimestamp >= 10000L) {
            this.lastRefreshTimestamp = windowStart;
            this.needMonitoredIds = Rpc.get(...);
        }
    }
};

In the implementation above, the ID pool is refreshed at a fixed interval. The drawback is that changes to the ID pool are not perceived in real time, so the refresh interval is strongly coupled to the scenario (if the ID pool is updated frequently, the interval must be shortened). Depending on requirements, the ID pool can instead be refreshed before each window starts, ensuring the pool used within every window is always up to date.
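The per-window variant mentioned above only changes the guard: drop the 10-second check and refresh whenever a window with a new start time fires, at the cost of one lookup per key/window pane, e.g.:

// Variant of scheme 2: refresh the pool whenever a new window start is seen,
// so every window works against the latest IDs (one RPC per key/window pane).
public void refreshNeedMonitoredIds(long windowStart) {
    if (windowStart != this.lastRefreshTimestamp) {
        this.lastRefreshTimestamp = windowStart;
        this.needMonitoredIds = Rpc.get(...);
    }
}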

Option 3

Scheme 3 is an optimization of scheme 2 (hundreds of thousands of IDs; the most commonly used in our production environment). The implementation uses a broadcast operator: a single-parallelism source periodically fetches the monitoring ID pool and sends it to the downstream operators as a broadcast stream. The optimization: if the task parallelism is 500, scheme 2 hits the monitoring ID pool interface at a QPS of 500, while with the broadcast operator the access QPS drops to 1, greatly reducing the load on the interface.

public class Example {

    @Slf4j
    static class NeedMonitorIdsSource implements SourceFunction<Map<Long, Set<Long>>> {

        private volatile boolean isCancel;

        @Override
        public void run(SourceContext<Map<Long, Set<Long>>> sourceContext) throws Exception {
            while (!this.isCancel) {
                try {
                    TimeUnit.SECONDS.sleep(1);
                    Set<Long> needMonitorIds = Rpc.get(...);
                    // Compare with the pool fetched last time; only emit when there is a change
                    if (CollectionUtils.isNotEmpty(needMonitorIds)) {
                        sourceContext.collect(new HashMap<Long, Set<Long>>() {{
                            put(0L, needMonitorIds);
                        }});
                    }
                } catch (Throwable e) {
                    // Prevent the whole job from failing because one interface access failed
                    log.error("need monitor ids error", e);
                }
            }
        }

        @Override
        public void cancel() {
            this.isCancel = true;
        }
    }

    public static void main(String[] args) {
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        InputParams inputParams = new InputParams(parameterTool);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        final MapStateDescriptor<Long, Set<Long>> broadcastMapStateDescriptor = new MapStateDescriptor<>(
                "config-keywords",
                BasicTypeInfo.LONG_TYPE_INFO,
                TypeInformation.of(new TypeHint<Set<Long>>() {
                }));

        /********************* kafka source *********************/
        BroadcastStream<Map<Long, Set<Long>>> broadcastStream = env
                .addSource(new NeedMonitorIdsSource()) // ID pool data (e.g. photo IDs from Redis)
                .setParallelism(1)                     // one instance only: access QPS drops to 1
                .broadcast(broadcastMapStateDescriptor);

        DataStream<CommonModel> logSourceDataStream = SourceFactory.getSourceDataStream(...);

        /********************* dag *********************/
        DataStream<CommonModel> resultDataStream = logSourceDataStream
                .keyBy(KeySelectorFactory.getStringKeySelector(CommonModel::getKeyField))
                .connect(broadcastStream)
                .process(new KeyedBroadcastProcessFunction<String, CommonModel, Map<Long, Set<Long>>, CommonModel>() {

                    private Set<Long> needMonitoredIds;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        this.needMonitoredIds = Rpc.get(...); // initial pool, before the first broadcast arrives
                    }

                    @Override
                    public void processElement(CommonModel commonModel, ReadOnlyContext readOnlyContext, Collector<CommonModel> collector) throws Exception {
                        // Determine whether the ID in the CommonModel is in the needMonitoredIds pool
                    }

                    @Override
                    public void processBroadcastElement(Map<Long, Set<Long>> longSetMap, Context context, Collector<CommonModel> collector) throws Exception {
                        // The latest set of IDs to be monitored
                        Set<Long> needMonitorIds = longSetMap.get(0L);
                        if (CollectionUtils.isNotEmpty(needMonitorIds)) {
                            this.needMonitoredIds = needMonitorIds;
                        }
                    }
                });

        /********************* kafka sink *********************/
        SinkFactory.setSinkDataStream(...);
        
        env.execute(inputParams.jobName);
    }

}

Option 4

Suitable for a very large monitoring range (in our own production practice we have scaled this to 5 million IDs). The principle: the monitoring-range interface splits the IDs into buckets according to some rule; after the Flink job consumes the log data, it assigns each ID to a bucket using the same rule, so every downstream parallel operator can fetch the monitoring ID data of its own bucket from the interface by bucket name. Each parallel operator in Flink thus only retrieves its own bucket's data, greatly reducing request pressure.
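The article does not show the bucketing rule itself; a simple assumption is modulo bucketing, agreed on by both the Flink job and the monitoring-range interface so that bucket i holds the same IDs on both sides. A sketch of such a key selector (the bucket count of 500 is an arbitrary assumption):

// Hypothetical bucketing key selector for scheme 4. The job and the ID-pool
// interface must share the same rule, so that Rpc.get(bucket, ...) returns
// exactly the IDs that keyBy routes to this operator instance.
public class BucketKeySelector implements KeySelector<CommonModel, Long> {

    private static final long NUM_BUCKETS = 500L; // assumed; must match the interface

    @Override
    public Long getKey(CommonModel model) {
        return model.getKeyField() % NUM_BUCKETS; // same rule on both sides
    }
}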

public class Example {

    public static void main(String[] args) {
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        InputParams inputParams = new InputParams(parameterTool);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

        final MapStateDescriptor<Long, Set<Long>> broadcastMapStateDescriptor = new MapStateDescriptor<>(
                "config-keywords",
                BasicTypeInfo.LONG_TYPE_INFO,
                TypeInformation.of(new TypeHint<Set<Long>>() {
                }));

        /********************* kafka source *********************/

        DataStream<CommonModel> logSourceDataStream = SourceFactory.getSourceDataStream(...);

        /********************* dag *********************/
        DataStream<CommonModel> resultDataStream = logSourceDataStream
                .keyBy(KeySelectorFactory.getLongKeySelector(CommonModel::getKeyField)) // key = the ID's bucket (see the bucketing sketch above)
                .timeWindow(Time.seconds(inputParams.accTimeWindowSeconds))
                .process(new ProcessWindowFunction<CommonModel, CommonModel, Long, TimeWindow>() {

                    private long lastRefreshTimestamp;

                    private Set<Long> oneBucketNeedMonitoredIds;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                    }

                    @Override
                    public void process(Long bucket, Context context, Iterable<CommonModel> iterable, Collector<CommonModel> collector) throws Exception {
                        long windowStart = context.window().getStart();
                        this.refreshNeedMonitoredIds(windowStart, bucket);
                        // Determine whether the ID in the CommonModel is in the oneBucketNeedMonitoredIds pool
                    }

                    public void refreshNeedMonitoredIds(long windowStart, long bucket) {
                        // Refresh this bucket's ID pool at most once every 10 seconds
                        if (windowStart - this.lastRefreshTimestamp >= 10000L) {
                            this.lastRefreshTimestamp = windowStart;
                            this.oneBucketNeedMonitoredIds = Rpc.get(bucket, ...);
                        }
                    }
                });

        /********************* kafka sink *********************/
        SinkFactory.setSinkDataStream(...);

        env.execute(inputParams.jobName);
    }
}

Summary

This article first introduced the closed data-link loop of short video production and consumption, which is broadly applicable to other scenarios as well, and then presented the corresponding real-time monitoring scheme designs and their code implementations for different scenarios, including:

  • The closed data-link loop of vertical-ecology short video production and consumption: user behavior logs, log upload, real-time computation, and delivery to BI and data services, finally enabling the business with data
  • Real-time monitoring scheme design: the choice of data sources and data sinks in the real-time monitoring computation
  • Implementations of the monitoring ID pool at different magnitudes

Learning materials

Flink
