Data Statistics Based on Time Window Using Spark Streaming SQL


1. Background introduction

Event-time-based processing is a common scenario in stream computing, used in detection, monitoring, time-based statistics, and similar systems. For example, each entry in an instrumentation (tracking) log records the time at which the operation occurred, or a business system records the time of each user operation; these timestamps can be used to count the frequency of various operations, or matched against rules for anomaly detection and monitoring alerts. The time data is carried inside the event itself, so the time field must be extracted and then aggregated or matched over some time range.
Spark Streaming SQL makes it convenient to work with the time field of event data: its time window functions aggregate event time over a specified interval.
This article shows how to manipulate event time with Spark Streaming SQL through a worked example: counting the number of clicks each user made in the past five seconds.

2. Grammatical Description of Time Window

Spark Streaming SQL supports two types of window operations: tumbling windows (TUMBLING) and sliding windows (HOPPING).

2.1 Tumbling Window

TUMBLING assigns each record to a fixed-size window according to its time field. The window advances with a step equal to the window size, so windows never overlap. For example, with a five-minute tumbling window, the data is divided into the windows [0:00, 0:05), [0:05, 0:10), [0:10, 0:15), and so on.

  • Syntax
GROUP BY TUMBLING ( colName, windowDuration ) 
  • Example

Window the inv_data_time time column of the inventory table and compute the average of inv_quantity_on_hand; the window size is 1 minute.

SELECT avg(inv_quantity_on_hand) qoh
FROM inventory
GROUP BY TUMBLING (inv_data_time, interval 1 minute)
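To make the bucketing concrete, here is a small pure-Python sketch (not Spark code; the function name and epoch anchor are illustrative assumptions) of how a timestamp maps to its tumbling window:

```python
from datetime import datetime, timedelta

def tumbling_window(event_time, window):
    """Return the [start, end) tumbling window containing event_time.

    Windows are anchored at the Unix epoch, so a 5-minute window
    yields boundaries 0:00, 0:05, 0:10, ... as in the text.
    """
    epoch = datetime(1970, 1, 1)
    n = (event_time - epoch) // window   # whole windows elapsed since the epoch
    start = epoch + n * window
    return start, start + window

# An event at 0:07 with a 5-minute window falls into [0:05, 0:10)
start, end = tumbling_window(datetime(2019, 1, 1, 0, 7), timedelta(minutes=5))
```

Because the step equals the window size, every event lands in exactly one window.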

2.2 Sliding Window

A sliding window (HOPPING) is also known as a hopping window. Unlike a tumbling window, a sliding window lets you set the step by which the window advances, so windows can overlap. A sliding window takes two parameters: windowDuration, the window size, and slideDuration, the step of each slide. When slideDuration < windowDuration, windows overlap and each element is assigned to multiple windows.
A tumbling window is thus simply the special case slideDuration = windowDuration.

  • Syntax
GROUP BY HOPPING ( colName, windowDuration, slideDuration ) 
  • Example

Window the inv_data_time column of the inventory table and compute the average of inv_quantity_on_hand; the window size is 1 minute and the slide step is 30 seconds.

SELECT avg(inv_quantity_on_hand) qoh
FROM inventory
GROUP BY HOPPING (inv_data_time, interval 1 minute, interval 30 second)
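The overlap can be sketched the same way in plain Python (again an illustrative sketch, not Spark's implementation): with slideDuration < windowDuration, one event belongs to several windows.

```python
from datetime import datetime, timedelta

def hopping_windows(event_time, window, slide):
    """Return every [start, end) sliding window that contains event_time."""
    epoch = datetime(1970, 1, 1)
    # The last slide boundary at or before the event
    start = epoch + ((event_time - epoch) // slide) * slide
    windows = []
    # Walk backwards one slide at a time while the window still covers the event
    while start + window > event_time:
        windows.append((start, start + window))
        start -= slide
    return list(reversed(windows))

# A 1-minute window sliding every 30 s: an event at 0:01:10 lands in two windows,
# [0:00:30, 0:01:30) and [0:01:00, 0:02:00)
wins = hopping_windows(datetime(2019, 1, 1, 0, 1, 10),
                       timedelta(minutes=1), timedelta(seconds=30))
```

Setting slide equal to window makes the loop run exactly once, recovering the tumbling case described above.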

3. System Architecture

[Figure: system architecture diagram]

Business logs are collected into Alibaba Cloud Log Service (SLS); Spark connects to SLS, processes the data with Streaming SQL, and writes the statistical results to HDFS. The walkthrough below focuses on the part where Spark Streaming SQL receives SLS data and writes it to HDFS. For log collection itself, refer to the Log Service documentation.

4. Operation process

4.1 Environmental preparation

  • Create a Hadoop cluster with E-MapReduce 3.21.0 or later.
  • Download and compile the E-MapReduce-SDK package
git clone git@github.com:aliyun/aliyun-emapreduce-sdk.git
cd aliyun-emapreduce-sdk
git checkout -b master-2.x origin/master-2.x
mvn clean package -DskipTests

After compilation, emr-datasources_shaded_${version}.jar is generated in the assembly/target directory, where ${version} is the SDK version.

4.2 Create tables

Start the spark-sql client from the command line:

spark-sql --master yarn-client --num-executors 2 --executor-memory 2g --executor-cores 2 --jars emr-datasources_shaded_2.11-${version}.jar --driver-class-path emr-datasources_shaded_2.11-${version}.jar

Create the SLS and HDFS tables:

spark-sql> USE default;

--Data Source Table
spark-sql> CREATE TABLE IF NOT EXISTS sls_user_log
USING loghub
OPTIONS (
sls.project = "${logProjectName}",
sls.store = "${logStoreName}",
access.key.id = "${accessKeyId}",
access.key.secret = "${accessKeySecret}",
endpoint = "${endpoint}");

--Result table
spark-sql> CREATE TABLE hdfs_user_click_count
USING org.apache.spark.sql.json
OPTIONS (path '${hdfsPath}');

4.3 Statistics User Clicks

spark-sql> SET spark.sql.streaming.checkpointLocation.user_click_count=hdfs:///tmp/spark/sql/streaming/test/user_click_count;
spark-sql> INSERT INTO hdfs_user_click_count
SELECT sum(cast(action_click AS int)) AS click, userId, window
FROM sls_user_log
WHERE delay(__time__) < "1 minute"
GROUP BY TUMBLING(__time__, interval 5 second), userId;

Here the built-in function delay() sets the watermark in Streaming SQL; a dedicated follow-up article will cover Streaming SQL watermarks.
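As a rough intuition for what the delay() predicate does, here is a pure-Python sketch of the idea (not Spark's watermark mechanism; the function name and variables are hypothetical):

```python
from datetime import datetime, timedelta

def within_delay(event_time, max_event_time_seen, max_delay):
    """Keep an event only if it lags the latest observed event time by less
    than max_delay (mirrors the WHERE delay(__time__) < "1 minute" filter)."""
    return max_event_time_seen - event_time < max_delay

latest = datetime(2019, 1, 1, 0, 5, 0)
# 30 seconds behind the latest event: within the 1-minute tolerance, kept
kept = within_delay(datetime(2019, 1, 1, 0, 4, 30), latest, timedelta(minutes=1))
# 2 minutes behind: too late, dropped
dropped = within_delay(datetime(2019, 1, 1, 0, 3, 0), latest, timedelta(minutes=1))
```

Events arriving later than the tolerance are excluded from the aggregation, which lets the engine finalize old windows instead of keeping them open forever.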

4.4 View the results

[Figure: query results showing the generated window column]

As shown, the results automatically include a window column containing each window's start and end times.
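Each output row is written to HDFS as a JSON object. Assuming a shape like the one below (the field values are made up for illustration), the window column can be read back as a nested struct:

```python
import json

# Hypothetical result row; actual values depend on your log data.
line = ('{"click": 3, "userId": "u_1001", '
        '"window": {"start": "2019-01-01 00:00:00", "end": "2019-01-01 00:00:05"}}')
row = json.loads(line)
window_start = row["window"]["start"]  # start of the 5-second window
window_end = row["window"]["end"]      # end of the 5-second window
```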

5. Concluding remarks

This article briefly introduced event-time-based processing in stream computing and the Spark Streaming SQL time window, demonstrating its use through a simple example. Follow-up articles will cover more of Spark Streaming SQL.

Author: Ligh-rain


This article is the original content of Yunqi Community, which can not be reproduced without permission.