How does Alink read and write LibSVM data?

Time:2021-1-19

Alink is a machine learning algorithm platform based on Flink. Please visit Alink's GitHub for more information. This article shares one Alink usage tip: how to read and write LibSVM data. The LibSVM data format, which comes from LIBSVM (csie.ntu.edu.tw/~cjlin/), is a common data format in the field of machine learning. It is defined as follows:

<label> <index1>:<value1> <index2>:<value2> ...

The first field, <label>, is the target value of the training data. For classification problems, an integer identifies the category: for binary classification, use {0,1} or {-1,1}; for multiclass problems, consecutive integers are usually used, such as {1,2,3} for three categories. For regression problems, the target value is a real number. The label is followed by a number of <index>:<value> pairs, with a colon ":" separating the index from the value and a space separating the pairs. <index> is an integer starting from 1 and need not be consecutive; <value> is a real number.

Here are a few pieces of data in libsvm format.

1 1:-0.555556 2:0.5 3:-0.79661 4:-0.916667
1 1:-0.833333 3:-0.864407 4:-0.916667
1 1:-0.444444 2:0.416667 3:-0.830508 4:-0.916667
1 1:-0.611111 2:0.0833333 3:-0.864407 4:-0.916667
2 1:0.5 3:0.254237 4:0.0833333
2 1:0.166667 3:0.186441 4:0.166667
2 1:0.444444 2:-0.0833334 3:0.322034 4:0.166667
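
To make the format concrete, here is a minimal pure-Python sketch that splits one such line into a label and a sparse index-to-value mapping. The helper name parse_libsvm_line is hypothetical and is not part of Alink; it only illustrates the format described above.

```python
def parse_libsvm_line(line):
    """Parse one LibSVM-format line into (label, {index: value}).

    Hypothetical helper for illustration only; not part of Alink.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    # The label may be an integer class id or a real-valued regression target.
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)  # indices are 1-based
    return label, features


label, features = parse_libsvm_line("1 1:-0.833333 3:-0.864407 4:-0.916667")
print(label, features)
```

Only the indices that appear in the line end up in the mapping, which is why the format suits sparse data.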

Note this data:

2 1:0.5 3:0.254237 4:0.0833333

There is no item with index 2, which indicates that the value of the second feature is 0.
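
The rule that a missing index means zero can be sketched in plain Python by expanding the line into a dense list. The function name libsvm_line_to_dense and the parameter num_features are hypothetical, and the value 4 is an assumption based on the largest index in the sample rows; none of this is Alink code.

```python
def libsvm_line_to_dense(line, num_features):
    """Expand a LibSVM line into (label, dense list).

    Indices that do not appear in the line keep the default value 0.0.
    Hypothetical helper for illustration only; not part of Alink.
    """
    tokens = line.split()
    label = float(tokens[0])
    dense = [0.0] * num_features
    for token in tokens[1:]:
        index, value = token.split(":", 1)
        dense[int(index) - 1] = float(value)  # 1-based index -> 0-based position
    return label, dense


# num_features=4 is an assumption taken from the sample rows above.
label, dense = libsvm_line_to_dense("2 1:0.5 3:0.254237 4:0.0833333", num_features=4)
print(dense)  # [0.5, 0.0, 0.254237, 0.0833333]
```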

We download the data from csie.ntu.edu.tw/~cjlin/ to the local machine and name the file iris.scale.libsvm. To read the data with LibSvmSourceBatchOp, you only need to specify one parameter: the path of the file. We then take the first three rows and print them.

iris_libsvm = LibSvmSourceBatchOp()\
    .setFilePath("/Users/yangxu/alink/data/iris/iris.scale.libsvm")
iris_libsvm.firstN(3).print()

In the output, the row index of the printed data is on the left, followed by the label column (automatically named label) and then the feature data column (automatically named features).

Next, we sample 10 rows from the original data and use LibSvmSinkBatchOp to save the sampling result. Note that in addition to the save path, we also need to specify three parameters: the first two are the label column name and the feature data column name, and the last one, OverwriteSink, indicates whether to overwrite the target file if it already exists when saving. At the end of the script, call BatchOperator.execute() to run the task.

iris_libsvm \
    .sampleWithSize(10) \
    .link(
        LibSvmSinkBatchOp() \
            .setFilePath("/Users/yangxu/alink/data/iris/iris.scale.sample.libsvm") \
            .setLabelCol('label') \
            .setVectorCol('features') \
            .setOverwriteSink(True)
    )

BatchOperator.execute()
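
For intuition about what ends up in the saved file, the following pure-Python sketch formats a label and a dense feature list as one LibSVM line, omitting zero-valued features. The helper format_libsvm_line is hypothetical; this is not how LibSvmSinkBatchOp is implemented, and the exact number formatting in Alink's output may differ.

```python
def format_libsvm_line(label, dense_features):
    """Format a label and a dense feature list as one LibSVM line.

    Zero-valued features are skipped, so only non-zero entries are written.
    Hypothetical helper for illustration only; not part of Alink.
    """
    parts = [str(label)]
    for position, value in enumerate(dense_features):
        if value != 0:
            parts.append(f"{position + 1}:{value}")  # 0-based position -> 1-based index
    return " ".join(parts)


print(format_libsvm_line(2, [0.5, 0.0, 0.254237, 0.0833333]))
# 2 1:0.5 3:0.254237 4:0.0833333
```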

Finally, we verify the saved result file by reading iris.scale.sample.libsvm and printing it.

LibSvmSourceBatchOp().setFilePath("/Users/yangxu/alink/data/iris/iris.scale.sample.libsvm").print()

The output lists the sampled rows read back from the saved file.

That is all for this article. Alink is a machine learning algorithm platform based on Flink. Visit Alink's GitHub page for more information, and feel free to join the Alink open source user group to discuss.

Alink on GitHub:
https://github.com/alibaba/Alink
