Six Java examples of Alink online learning

Time: 2021-3-5

I published a series of articles on how to do Alink online learning with Python. Some readers said they needed a Java version. Although the two versions share the same algorithm principles, there are many differences in how they are used. To help readers quickly get started with Alink online learning in Java, this article reworks the series from the Java perspective, again with six examples. I hope it helps.



Example 1

Online learning is a model-training method in machine learning that adjusts the model in real time as online data changes. The model therefore tracks those changes, improving the accuracy of online prediction.

To better understand online learning, let us first introduce the contrasting concept: batch learning. In batch learning, we first fix a training set and train on all of its data, typically using an iterative process that passes over the data set repeatedly while adjusting the parameters. Online learning does not require the training set to be fixed in advance: training samples arrive one by one during training, and each sample updates the model according to the loss value, objective value and gradient it produces.
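The contrast can be illustrated with a minimal sketch in plain Java. This is an illustrative toy, not Alink's implementation: an online logistic-regression model that takes one gradient step per arriving sample instead of iterating over a fixed data set.

```java
// Illustrative online learner: the model is updated from each sample
// as it arrives, rather than by repeated passes over a fixed data set.
public class OnlineLogisticSketch {
    private final double[] w;
    private final double learningRate;

    public OnlineLogisticSketch(int dim, double learningRate) {
        this.w = new double[dim];
        this.learningRate = learningRate;
    }

    public double predict(double[] x) {
        double z = 0.0;
        for (int i = 0; i < w.length; i++) {
            z += w[i] * x[i];
        }
        return 1.0 / (1.0 + Math.exp(-z));  // sigmoid
    }

    // One gradient step on a single arriving sample (label 0 or 1).
    public void update(double[] x, int label) {
        double err = predict(x) - label;
        for (int i = 0; i < w.length; i++) {
            w[i] -= learningRate * err * x[i];
        }
    }
}
```

Batch learning would instead loop over a fixed training set many times; here every sample is seen once, when it arrives.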

Let's first look at the FTRL online training component and the FTRL streaming prediction component, shown in the figure below. They are linked by a model stream: FtrlTrain continuously produces new models, which are streamed to the FtrlPredict component. Each time FtrlPredict receives a complete model, it replaces the old model and switches to the new one. The online-learning component FtrlTrain needs two inputs: an initial model, to avoid a cold start of the system, and the streaming training data. Its output is the model stream. The FtrlPredict component also needs an initial model, so that it can score incoming data before FtrlTrain outputs its first model.

(figure: FtrlTrain and FtrlPredict linked by a model stream)
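The model-switching behaviour of the prediction component can be sketched with a simple hot-swap pattern. This is an illustrative sketch, not Alink's internal code: the predictor always scores with the latest complete model it has received, and a newly arrived model replaces the old one atomically.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative hot-swap predictor: scoring always uses the current
// model; the training side can publish a new model at any time.
public class HotSwapPredictor {
    private final AtomicReference<double[]> model;

    public HotSwapPredictor(double[] initialModel) {
        this.model = new AtomicReference<>(initialModel);
    }

    // Called whenever a complete new model arrives on the model stream.
    public void receiveModel(double[] newModel) {
        model.set(newModel);
    }

    // Score with whichever model is current at call time.
    public double score(double[] x) {
        double[] w = model.get();
        double z = 0.0;
        for (int i = 0; i < w.length; i++) z += w[i] * x[i];
        return z;
    }
}
```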

After introducing the two core components, let's see how the required initial model, training data stream and prediction data stream are prepared. As shown in the figure below, the initial model is obtained by traditional offline (batch) training on the batch training data.

(figure: obtaining the initial model by offline training on batch data)

The FTRL algorithm is a linear algorithm, and its input must be numerical. Since the original data contains both numerical and categorical columns, we need corresponding feature-engineering operations to transform the original feature data into vector form.

Specifically, we use feature-engineering components to transform the batch original training data into batch vector training data, the streaming original training data into streaming vector training data, and the streaming original prediction data into streaming vector prediction data.

(figure: feature engineering for batch and streaming data)

Example 2

First of all, we need a Java project with the relevant environment configured. The simplest way is to use Alink's example project: download the Alink code from its Git repository and open the project in a Java IDE, as shown in the figure below. You can see three examples that are already written: ALSExample, GBDTExample and KMeansExample.

(figure: the Alink example project opened in a Java IDE)

Here we create a new Java file under the com.alibaba.alink package:

package com.alibaba.alink;

public class FTRLExample {
  
  public static void main(String[] args) throws Exception {

  }
}

This article’s examples refer to alink’s Python Demo:

https://github.com/alibaba/Al…

In online advertising, click-through rate (CTR) is a very important metric for measuring advertising effectiveness, so click-prediction systems have important applications in sponsored search and real-time bidding. This demo uses the FTRL method to train a classification model in real time, and uses that model for real-time prediction and evaluation.

Here we use the CTR data from a Kaggle competition. The link is: https://www.kaggle.com/c/avaz…. Because it is compressed data, it needs to be downloaded locally; for convenience of demonstration, we directly use a data sample stored on OSS. Use TextSourceBatchOp to read and print part of the data. The script is as follows:

new TextSourceBatchOp()
    .setFilePath("http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-small.csv")
    .firstN(10)
    .print();

The running results are as follows
(figure: printed raw text rows)

We can see that each row contains multiple fields separated by commas. The data columns are defined as follows:

  • id: ad identifier
  • click: 0/1 for non-click/click
  • hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC.
  • C1 — anonymized categorical variable
  • banner_pos
  • site_id
  • site_domain
  • site_category
  • app_id
  • app_domain
  • app_category
  • device_id
  • device_ip
  • device_model
  • device_type
  • device_conn_type
  • C14-C21 — anonymized categorical variables
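As a small aside, the hour field's YYMMDDHH layout can be decoded with plain Java. AvazuHourParser is a hypothetical helper for illustration, not part of the demo:

```java
import java.time.LocalDateTime;

public class AvazuHourParser {
    // Decodes the Avazu "hour" field (YYMMDDHH, UTC), e.g. "14091123"
    // is 23:00 on Sept. 11, 2014.
    public static LocalDateTime parseHour(String field) {
        int year  = 2000 + Integer.parseInt(field.substring(0, 2));
        int month = Integer.parseInt(field.substring(2, 4));
        int day   = Integer.parseInt(field.substring(4, 6));
        int hour  = Integer.parseInt(field.substring(6, 8));
        return LocalDateTime.of(year, month, day, hour, 0);
    }
}
```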

According to these column definitions, we assemble schemaStr as follows:

String schemaStr
  = "id string, click string, dt string, C1 string, banner_pos int, site_id string, site_domain string, "
  + "site_category string, app_id string, app_domain string, app_category string, device_id string, "
  + "device_ip string, device_model string, device_type string, device_conn_type string, C14 int, C15 int, "
  + "C16 int, C17 int, C18 int, C19 int, C20 int, C21 int";

With the schema defined, we can read and display the data through CsvSourceBatchOp. The script is as follows:

CsvSourceBatchOp trainBatchData = new CsvSourceBatchOp()
  .setFilePath("http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-small.csv")
  .setSchemaStr(schemaStr);

trainBatchData.firstN(10).print();

The results are as follows

(figure: printed rows with column names)

Because there are many columns, it is not easy to match values with their column names. Here is a trick for getting a better view of the data: the printed text, including its separator line, is in markdown table format. You can copy and paste it into a markdown editor and see a neatly rendered table, as shown in the following figure:

(figure: the same rows rendered as a markdown table)

Example 3

The previous example displayed the data; here we will look at it more closely. From the column descriptions, we can tell which features are numerical and which are categorical, as captured in the following script:

String labelColName = "click";
String[] selectedColNames = new String[] {
  "C1", "banner_pos", "site_category", "app_domain",
  "app_category", "device_type", "device_conn_type",
  "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21",
  "site_id", "site_domain", "device_id", "device_model"};

String[] categoryColNames = new String[] {
  "C1", "banner_pos", "site_category", "app_domain",
  "app_category", "device_type", "device_conn_type",
  "site_id", "site_domain", "device_id", "device_model"};

String[] numericalColNames = new String[] {
  "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21"};

The "click" column indicates whether the ad was clicked; it is the label column of this classification problem. Numerical features differ greatly in value range, so they generally require standardization, normalization or similar operations. Categorical features cannot be fed to the FTRL model directly, so each enumeration value must be mapped to vector entries. Finally, the transformed columns are combined into a single vector, which becomes the feature vector for model training.
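The standardization step can be sketched as follows. This is an illustrative z-score scaler, not Alink's StandardScaler implementation: the statistics are fitted once on batch data and then reused for any later batch or streaming value.

```java
// Illustrative z-score standardization for one numerical column.
public class StandardScalerSketch {
    private double mean;
    private double std;

    // Fit mean and (population) standard deviation on batch data.
    public void fit(double[] values) {
        double sum = 0.0;
        for (double v : values) sum += v;
        mean = sum / values.length;
        double sq = 0.0;
        for (double v : values) sq += (v - mean) * (v - mean);
        std = Math.sqrt(sq / values.length);
    }

    // Apply the fitted statistics to any later value.
    public double transform(double v) {
        return (v - mean) / std;
    }
}
```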

In this example, we standardize the numerical features and use the FeatureHasher component. In its parameters, we specify the names of the columns to be processed and indicate which of them are categorical; the remaining columns are treated as numerical. FeatureHasher maps these features into a sparse vector by hashing; the vector dimension can be set to 30000. Each numerical column is hashed to a vector entry, and the column's value is assigned to that entry. Each distinct value of a categorical feature is likewise hashed to a vector entry, which is set to 1.

In effect, FeatureHasher both maps the categorical features and assembles everything into a feature vector. Because hashing is used, there is a risk that different inputs collide on the same entry. Still, because the component is so easy to use, FeatureHasher is well suited for examples, or as a first component in an experiment to quickly establish a baseline. The related script is as follows:

// result column name of feature engineering
String vecColName = "vec";
int numHashFeatures = 30000;

// setup feature engineering pipeline
Pipeline feature_pipeline = new Pipeline()
  .add(
    new StandardScaler()
      .setSelectedCols(numericalColNames)
  )
  .add(
    new FeatureHasher()
      .setSelectedCols(selectedColNames)
      .setCategoricalCols(categoryColNames)
      .setOutputCol(vecColName)
      .setNumFeatures(numHashFeatures)
  );
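The hashing idea itself can be sketched in a few lines. This is an illustrative toy using String.hashCode, not Alink's actual hash function: categorical columns hash "name=value" to an index that is set to 1, while numerical columns hash the column name and keep the numeric value.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative feature hashing into a fixed-dimension sparse vector.
public class FeatureHashSketch {
    private final int dim;

    public FeatureHashSketch(int dim) { this.dim = dim; }

    // Map an arbitrary string key to a vector index in [0, dim).
    public int index(String key) {
        return Math.floorMod(key.hashCode(), dim);
    }

    public Map<Integer, Double> transform(Map<String, String> categorical,
                                          Map<String, Double> numerical) {
        Map<Integer, Double> vec = new HashMap<>();
        for (Map.Entry<String, String> e : categorical.entrySet()) {
            // "name=value" -> index set to 1.0
            vec.put(index(e.getKey() + "=" + e.getValue()), 1.0);
        }
        for (Map.Entry<String, Double> e : numerical.entrySet()) {
            // column name -> index carrying the numeric value
            vec.put(index(e.getKey()), e.getValue());
        }
        return vec;
    }
}
```

With 30000 dimensions, collisions (different keys landing on the same index) are possible but rare, which is exactly the trade-off described above.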

We define a feature-engineering Pipeline containing the StandardScaler and FeatureHasher, run its fit method on the batch training data trainBatchData, and obtain a trained PipelineModel. The PipelineModel can be applied to batch or streaming data to generate feature vectors. Let's save the feature-engineering model locally, with the file path /Users/yangxu/alink/data/temp/feature_pipe_model.csv.

// fit and save feature pipeline model
String FEATURE_PIPELINE_MODEL_FILE =  "/Users/yangxu/alink/data/temp/feature_pipe_model.csv";
feature_pipeline.fit(trainBatchData).save(FEATURE_PIPELINE_MODEL_FILE);

BatchOperator.execute();

Example 4

In the previous example, we trained and saved the feature-engineering model. With it we can:

  • transform the batch original training data into batch vector training data;
  • transform the streaming original training data into streaming vector training data;
  • transform the streaming original prediction data into streaming vector prediction data.

(figure: the feature-engineering model applied to batch and streaming data)

The original batch training data are as follows:

CsvSourceBatchOp trainBatchData = new CsvSourceBatchOp()
  .setFilePath("http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-small.csv")
  .setSchemaStr(schemaStr);

We can define a streaming data source and split it in real time at a 1:1 ratio to obtain the streaming original training data and the streaming original prediction data.

// prepare stream train data
CsvSourceStreamOp data = new CsvSourceStreamOp()
  .setFilePath("http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-ctr-train-8M.csv")
  .setSchemaStr(schemaStr);

// split stream to train and eval data
SplitStreamOp spliter = new SplitStreamOp().setFraction(0.5).linkFrom(data);
StreamOperator train_stream_data = spliter;
StreamOperator test_stream_data = spliter.getSideOutput(0);
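Conceptually, a fraction-based split routes each arriving record independently. A minimal sketch, illustrative only and not SplitStreamOp's internals:

```java
import java.util.Random;

// Illustrative fraction-based stream splitter: each record goes to the
// "train" side with the given probability, otherwise to the other side.
public class StreamSplitter {
    private final double fraction;
    private final Random random;

    public StreamSplitter(double fraction, long seed) {
        this.fraction = fraction;
        this.random = new Random(seed);
    }

    // Decide the route for one arriving record.
    public boolean routeToTrain() {
        return random.nextDouble() < fraction;
    }
}
```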

Through the PipelineModel.load() method, we can load the previously saved feature-engineering model.

// load pipeline model
PipelineModel feature_pipelineModel = PipelineModel.load(FEATURE_PIPELINE_MODEL_FILE);

Alink's PipelineModel can transform not only batch data but also streaming data; in both cases, we call the transform method of the model instance.

Batch vector training data can be obtained by the following code:

feature_pipelineModel.transform(trainBatchData)

The flow vector training data can be obtained by the following code:

feature_pipelineModel.transform(train_stream_data)

The flow vector prediction data can be obtained by the following code:

feature_pipelineModel.transform(test_stream_data)

Furthermore, we can train a linear model on the batch vector training data to serve as the initial model for the online FTRL algorithm. As shown in the following script, we first define the logistic regression trainer lr, then link the batch vector training data to it; the output is the logistic regression model.

// train initial batch model
LogisticRegressionTrainBatchOp lr = new LogisticRegressionTrainBatchOp()
  .setVectorCol(vecColName)
  .setLabelCol(labelColName)
  .setWithIntercept(true)
  .setMaxIter(10);

BatchOperator initModel = feature_pipelineModel.transform(trainBatchData).link(lr);

Example 5

With the preparations above, we have the initial model, the streaming vector training data and the streaming vector prediction data, shown as the blue nodes below. Next comes the key part of this series: how to connect the FTRL online training module and the corresponding online prediction module.

(figure: initial model and data streams feeding the FTRL components)

The code for FTRL online model training is as follows: pass the initial model initModel to the constructor of FtrlTrainStreamOp, set the various parameters, and link in the streaming vector training data.

// ftrl train
FtrlTrainStreamOp model = new FtrlTrainStreamOp(initModel)
  .setVectorCol(vecColName)
  .setLabelCol(labelColName)
  .setWithIntercept(true)
  .setAlpha(0.1)
  .setBeta(0.1)
  .setL1(0.01)
  .setL2(0.01)
  .setTimeInterval(10)
  .setVectorSize(numHashFeatures)
  .linkFrom(feature_pipelineModel.transform(train_stream_data));
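For intuition about the parameters alpha, beta, l1 and l2, here is a minimal sketch of the per-coordinate FTRL-Proximal update for logistic loss. It follows the published FTRL-Proximal algorithm, not Alink's internal code:

```java
// Illustrative FTRL-Proximal learner for logistic loss.
public class FtrlSketch {
    private final double alpha, beta, l1, l2;
    private final double[] z, n, w;

    public FtrlSketch(int dim, double alpha, double beta, double l1, double l2) {
        this.alpha = alpha; this.beta = beta; this.l1 = l1; this.l2 = l2;
        this.z = new double[dim];
        this.n = new double[dim];
        this.w = new double[dim];
    }

    public double getWeight(int i) { return w[i]; }

    public double predict(double[] x) {
        double s = 0.0;
        for (int i = 0; i < w.length; i++) s += w[i] * x[i];
        return 1.0 / (1.0 + Math.exp(-s));
    }

    // One FTRL-Proximal step on a single sample (label 0 or 1).
    public void update(double[] x, int label) {
        double p = predict(x);
        for (int i = 0; i < w.length; i++) {
            double g = (p - label) * x[i];  // logistic-loss gradient
            double sigma = (Math.sqrt(n[i] + g * g) - Math.sqrt(n[i])) / alpha;
            z[i] += g - sigma * w[i];
            n[i] += g * g;
            if (Math.abs(z[i]) <= l1) {
                w[i] = 0.0;  // L1 keeps the weight exactly zero
            } else {
                w[i] = -(z[i] - Math.signum(z[i]) * l1)
                    / ((beta + Math.sqrt(n[i])) / alpha + l2);
            }
        }
    }
}
```

Here l1 controls sparsity: a coordinate's weight stays exactly zero until the accumulated evidence |z| exceeds l1, while alpha and beta shape the per-coordinate learning rate.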

The code for FTRL online prediction is as follows; it links both the model stream output by FTRL online training and the streaming vector prediction data.

// ftrl predict
FtrlPredictStreamOp predResult = new FtrlPredictStreamOp(initModel)
  .setVectorCol(vecColName)
  .setPredictionCol("pred")
  .setReservedCols(new String[] {labelColName})
  .setPredictionDetailCol("details")
  .linkFrom(model, feature_pipelineModel.transform(test_stream_data));

We can print the streaming results as follows. Because there is a lot of data, we sample the stream before printing. Note that for streaming tasks, the print() method does not trigger execution; the StreamOperator.execute() method must be called to start it.

predResult.sample(0.0001).print();

StreamOperator.execute();

During execution, the batch initial-model training runs first, and the streaming task starts after the batch task finishes. The results are as follows:

......
click|pred|details
-----|----|-------
......
collect model : 27
collect model : 27
collect model : 27
collect model : 27
collect model : 27
collect model : 27
collect model : 27
collect model : 27
6 load model : 27
0|0|{"0":"0.9955912589178986","1":"0.0044087410821014306"}
3 load model : 27
0 load model : 27
2 load model : 27
5 load model : 27
4 load model : 27
7 load model : 27
1 load model : 27
0|0|{"0":"0.8264317979578765","1":"0.17356820204212353"}
0|0|{"0":"0.9620885206519035","1":"0.037911479348096466"}
0|0|{"0":"0.7733924667279566","1":"0.22660753327204342"}
0|0|{"0":"0.8502672431715895","1":"0.14973275682841047"}
0|0|{"0":"0.9422313239589072","1":"0.057768676041092815"}
0|0|{"0":"0.8540319447494245","1":"0.14596805525057555"}
1|0|{"0":"0.7956910587819983","1":"0.2043089412180017"}
collect model : 28
collect model : 28
collect model : 28
1 load model : 28
collect model : 28
7 load model : 28
collect model : 28
collect model : 28
6 load model : 28
4 load model : 28
collect model : 28
collect model : 28
5 load model : 28
0 load model : 28
3 load model : 28
2 load model : 28
0|0|{"0":"0.794857507111827","1":"0.205142492888173"}
0|0|{"0":"0.7489915122615897","1":"0.25100848773841034"}
0|0|{"0":"0.9145883964932835","1":"0.0854116035067165"}
0|0|{"0":"0.9699130297461115","1":"0.030086970253888512"}
0|0|{"0":"0.8633425927307238","1":"0.13665740726927622"}
1|0|{"0":"0.5067251707884466","1":"0.4932748292115534"}
0|0|{"0":"0.9197477679857682","1":"0.08025223201423182"}
0|0|{"0":"0.8754429175320314","1":"0.1245570824679686"}
0|0|{"0":"0.9027103601565077","1":"0.09728963984349226"}
0|0|{"0":"0.9396522264624441","1":"0.06034777353755594"}
0|0|{"0":"0.7870435722294925","1":"0.2129564277705075"}
......

Two kinds of information are mixed in the printed text. One is each streaming prediction node loading a new model: for example, "5 load model : 27" means node 5 successfully loaded model 27 from the model stream. The other is the prediction result data, printed under this header:

click|pred|details
-----|----|-------

The first column is the original "click" value, the second is the prediction column, and the third is the prediction-detail column. A typical prediction row looks like this:

0|0|{"0":"0.7870435722294925","1":"0.2129564277705075"}
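To inspect such rows programmatically, one can split on | and pull the class-1 probability out of the details column. PredictionRow is a hypothetical helper using a regex shortcut for this demo's output format, not a general JSON parser:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative parser for printed rows like
//   1|0|{"0":"0.795...","1":"0.204..."}
public class PredictionRow {
    private static final Pattern PROB_OF_ONE =
        Pattern.compile("\"1\":\"([0-9.Ee+-]+)\"");

    // Returns the predicted probability of class 1 (a click).
    public static double probabilityOfClick(String row) {
        String details = row.split("\\|", 3)[2];
        Matcher m = PROB_OF_ONE.matcher(details);
        if (!m.find()) {
            throw new IllegalArgumentException("no class-1 probability: " + row);
        }
        return Double.parseDouble(m.group(1));
    }
}
```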

Finally, we link the prediction result stream predResult to EvalBinaryClassStreamOp, the streaming binary-classification evaluation component, and set the corresponding parameters. Since the evaluation results are given in JSON format each time, we also use the JSON extraction component JsonValueStreamOp for easier display. The code is as follows:

// ftrl eval
predResult
  .link(
    new EvalBinaryClassStreamOp()
      .setLabelCol(labelColName)
      .setPredictionCol("pred")
      .setPredictionDetailCol("details")
      .setTimeInterval(10)
  )
  .link(
    new JsonValueStreamOp()
      .setSelectedCol("Data")
      .setReservedCols(new String[] {"Statistics"})
      .setOutputCols(new String[] {"Accuracy", "AUC", "ConfusionMatrix"})
      .setJsonPath(new String[] {"$.Accuracy", "$.AUC", "$.ConfusionMatrix"})
  )
  .print();

StreamOperator.execute();

Note: after the streaming components are linked, you need to call the streaming execution command, StreamOperator.execute(), to start execution. The results are as follows:

......
Statistics|Accuracy|AUC|ConfusionMatrix
----------|--------|---|---------------
......
window|0.839781746031746|0.6196235914061319|[[140,174],[6609,35413]]
all|0.839781746031746|0.6196235914061319|[[140,174],[6609,35413]]
......
window|0.8396464236640808|0.6729843274019895|[[206,220],[14601,77400]]
all|0.8396889353902778|0.6559248735416315|[[346,394],[21210,112813]]
......
window|0.8389366017867161|0.7125709974947197|[[328,233],[14695,77428]]
all|0.8393823616051211|0.6792648375138227|[[674,627],[35905,190241]]
......

The Statistics column above takes two values, all and window. all represents the evaluation over all prediction data from the start until now; window represents the evaluation over the prediction data within the current time window (set here to 10 seconds).
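The difference between the two can be sketched with a pair of counters, illustrative only: the window counters are reset at each reporting interval, while the all counters never are. Accuracy is shown here; AUC and the confusion matrix follow the same windowed-versus-cumulative pattern.

```java
// Illustrative "window" vs "all" accuracy accounting.
public class WindowedAccuracy {
    private long windowCorrect, windowTotal;
    private long allCorrect, allTotal;

    // Record one prediction against its true label.
    public void add(int label, int prediction) {
        boolean correct = label == prediction;
        windowTotal++;
        allTotal++;
        if (correct) {
            windowCorrect++;
            allCorrect++;
        }
    }

    public double windowAccuracy() { return (double) windowCorrect / windowTotal; }
    public double allAccuracy()    { return (double) allCorrect / allTotal; }

    // Called at each reporting interval (e.g. every 10 seconds).
    public void resetWindow() {
        windowCorrect = 0;
        windowTotal = 0;
    }
}
```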

Example 6: complete code

Finally, here is the complete code; interested readers can run the experiment themselves.

Note that the example contains many print and execute calls that demonstrate intermediate results. These calls are commented out in the code below, so readers can uncomment parts of them to see the corresponding output.

package com.alibaba.alink;

import com.alibaba.alink.operator.batch.BatchOperator;
import com.alibaba.alink.operator.batch.classification.LogisticRegressionTrainBatchOp;
import com.alibaba.alink.operator.batch.source.CsvSourceBatchOp;
import com.alibaba.alink.operator.stream.StreamOperator;
import com.alibaba.alink.operator.stream.dataproc.JsonValueStreamOp;
import com.alibaba.alink.operator.stream.dataproc.SplitStreamOp;
import com.alibaba.alink.operator.stream.evaluation.EvalBinaryClassStreamOp;
import com.alibaba.alink.operator.stream.onlinelearning.FtrlPredictStreamOp;
import com.alibaba.alink.operator.stream.onlinelearning.FtrlTrainStreamOp;
import com.alibaba.alink.operator.stream.source.CsvSourceStreamOp;
import com.alibaba.alink.pipeline.Pipeline;
import com.alibaba.alink.pipeline.PipelineModel;
import com.alibaba.alink.pipeline.dataproc.StandardScaler;
import com.alibaba.alink.pipeline.feature.FeatureHasher;

public class FTRLExample {

  public static void main(String[] args) throws Exception {

    //new TextSourceBatchOp()
    //  .setFilePath("http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-small.csv")
    //  .firstN(10)
    //  .print();

    String schemaStr
      = "id string, click string, dt string, C1 string, banner_pos int, site_id string, site_domain string, "
      + "site_category string, app_id string, app_domain string, app_category string, device_id string, "
      + "device_ip string, device_model string, device_type string, device_conn_type string, C14 int, C15 int, "
      + "C16 int, C17 int, C18 int, C19 int, C20 int, C21 int";

    CsvSourceBatchOp trainBatchData = new CsvSourceBatchOp()
      .setFilePath("http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-small.csv")
      .setSchemaStr(schemaStr);

    //trainBatchData.firstN(10).print();

    String labelColName = "click";
    String[] selectedColNames = new String[] {
      "C1", "banner_pos", "site_category", "app_domain",
      "app_category", "device_type", "device_conn_type",
      "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21",
      "site_id", "site_domain", "device_id", "device_model"};

    String[] categoryColNames = new String[] {
      "C1", "banner_pos", "site_category", "app_domain",
      "app_category", "device_type", "device_conn_type",
      "site_id", "site_domain", "device_id", "device_model"};

    String[] numericalColNames = new String[] {
      "C14", "C15", "C16", "C17", "C18", "C19", "C20", "C21"};

    // result column name of feature engineering
    String vecColName = "vec";
    int numHashFeatures = 30000;

    // setup feature engineering pipeline
    Pipeline feature_pipeline = new Pipeline()
      .add(
        new StandardScaler()
          .setSelectedCols(numericalColNames)
      )
      .add(
        new FeatureHasher()
          .setSelectedCols(selectedColNames)
          .setCategoricalCols(categoryColNames)
          .setOutputCol(vecColName)
          .setNumFeatures(numHashFeatures)
      );

    // fit and save feature pipeline model
    String FEATURE_PIPELINE_MODEL_FILE = "/Users/yangxu/alink/data/temp/feature_pipe_model.csv";
    //feature_pipeline.fit(trainBatchData).save(FEATURE_PIPELINE_MODEL_FILE);
    //
    //BatchOperator.execute();

    // prepare stream train data
    CsvSourceStreamOp data = new CsvSourceStreamOp()
      .setFilePath("http://alink-release.oss-cn-beijing.aliyuncs.com/data-files/avazu-ctr-train-8M.csv")
      .setSchemaStr(schemaStr)
      .setIgnoreFirstLine(true);

    // split stream to train and eval data
    SplitStreamOp spliter = new SplitStreamOp().setFraction(0.5).linkFrom(data);
    StreamOperator train_stream_data = spliter;
    StreamOperator test_stream_data = spliter.getSideOutput(0);

    // load pipeline model
    PipelineModel feature_pipelineModel = PipelineModel.load(FEATURE_PIPELINE_MODEL_FILE);

    // train initial batch model
    LogisticRegressionTrainBatchOp lr = new LogisticRegressionTrainBatchOp()
      .setVectorCol(vecColName)
      .setLabelCol(labelColName)
      .setWithIntercept(true)
      .setMaxIter(10);

    BatchOperator initModel = feature_pipelineModel.transform(trainBatchData).link(lr);

    // ftrl train
    FtrlTrainStreamOp model = new FtrlTrainStreamOp(initModel)
      .setVectorCol(vecColName)
      .setLabelCol(labelColName)
      .setWithIntercept(true)
      .setAlpha(0.1)
      .setBeta(0.1)
      .setL1(0.01)
      .setL2(0.01)
      .setTimeInterval(10)
      .setVectorSize(numHashFeatures)
      .linkFrom(feature_pipelineModel.transform(train_stream_data));

    // ftrl predict
    FtrlPredictStreamOp predResult = new FtrlPredictStreamOp(initModel)
      .setVectorCol(vecColName)
      .setPredictionCol("pred")
      .setReservedCols(new String[] {labelColName})
      .setPredictionDetailCol("details")
      .linkFrom(model, feature_pipelineModel.transform(test_stream_data));

    //predResult.sample(0.0001).print();
    //
    //StreamOperator.execute();

    // ftrl eval
    predResult
      .link(
        new EvalBinaryClassStreamOp()
          .setLabelCol(labelColName)
          .setPredictionCol("pred")
          .setPredictionDetailCol("details")
          .setTimeInterval(10)
      )
      .link(
        new JsonValueStreamOp()
          .setSelectedCol("Data")
          .setReservedCols(new String[] {"Statistics"})
          .setOutputCols(new String[] {"Accuracy", "AUC", "ConfusionMatrix"})
          .setJsonPath(new String[] {"$.Accuracy", "$.AUC", "$.ConfusionMatrix"})
      )
      .print();

    //StreamOperator.execute();

  }
}