2020-01-06 learning record

Time: 2020-01-14

In order to write nicer-looking posts, I read through the Markdown documentation today. I'm trying it out here first; I'll write something more substantial with it later.
Anyway, back to the point: first, a summary of what I learned today.

Learning summary:

1. Hands-on practice with PySpark

I have learned a fair amount of PySpark, so I found an example to work through. I won't post the full process yet, because it wasn't done locally and it isn't easy to record. Instead, here are the takeaways: a DataFrame can be registered as a table through the `createOrReplaceTempView` function or `sqlContext.registerDataFrameAsTable`, and then queried and manipulated through `sqlContext.sql("SQL expression")`.

`import findspark` is part of the initialization process:

```python
import findspark
findspark.init()   # locate the local Spark installation
import os
import pyspark
from time import time

data_file = "../data/kddcup.data_10_percent_corrected"
sc = pyspark.SparkContext(appName="test")
raw_data = sc.textFile(data_file).cache()   # load the KDD Cup data as an RDD of text lines
```
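
The DataFrame in the next step is built from a `row_data` RDD that these notes don't show. As a hedged sketch (not from the original run), it would typically come from parsing each CSV line into a `Row`, keeping the six columns that appear in the query output further down:

```python
from pyspark.sql import Row

csv_data = raw_data.map(lambda line: line.split(","))
# Keep only the first six KDD Cup fields, matching the columns queried below
row_data = csv_data.map(lambda p: Row(
    duration=int(p[0]),
    protocol_type=p[1],
    service=p[2],
    flag=p[3],
    src_bytes=int(p[4]),
    dst_bytes=int(p[5]),
))
```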

```python
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
interactions_df = sqlContext.createDataFrame(row_data)   # DF created
```

There are two ways to register the DataFrame as a table:

```python
# Option 1: create a temporary view named "interactions"
interactions_df.createOrReplaceTempView("interactions")

# Option 2: register the DataFrame as the table "interactions_df"
sqlContext.registerDataFrameAsTable(df=interactions_df, tableName='interactions_df')
```

Query data:

```python
sqlContext.sql("select * from interactions_df where protocol_type='tcp'").show(5)
```

out:

```
+---------+--------+----+-------------+-------+---------+
|dst_bytes|duration|flag|protocol_type|service|src_bytes|
+---------+--------+----+-------------+-------+---------+
|     5450|       0|  SF|          tcp|   http|      181|
|      486|       0|  SF|          tcp|   http|      239|
|     1337|       0|  SF|          tcp|   http|      235|
|     1337|       0|  SF|          tcp|   http|      219|
|     2032|       0|  SF|          tcp|   http|      217|
+---------+--------+----+-------------+-------+---------+
```

We can see that after building a table from the DF, we can query the data with ordinary SQL statements, including SQL features such as GROUP BY, ORDER BY, DISTINCT, COUNT, etc. This does require some knowledge of SQL.
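
As a hedged illustration (not from the original run), an aggregation over the same table might look like this:

```python
# Hypothetical query: count interactions per protocol_type, largest groups first
sqlContext.sql("""
    SELECT protocol_type, COUNT(*) AS n
    FROM interactions_df
    GROUP BY protocol_type
    ORDER BY n DESC
""").show()
```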

Then, from the table back to a DF:

```python
sqlContext.table('interactions_df')
```

out:

```
DataFrame[dst_bytes: bigint, duration: bigint, flag: string, protocol_type: string, service: string, src_bytes: bigint]
```

Converting back and forth is also very convenient.
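
A small sketch (not in the original notes) of using the object returned by `sqlContext.table` with the DataFrame API, and dropping down to the underlying RDD when needed:

```python
df = sqlContext.table('interactions_df')

# The DataFrame API works directly on the returned object
tcp_count = df.filter(df.protocol_type == 'tcp').count()

# The underlying RDD is still one attribute away
durations_rdd = df.rdd.map(lambda row: row.duration)
```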

But this leaves a question to think about:
we have RDDs, DFs and tables as tools for holding data, so how should we choose between them?


2. Some knowledge of ensemble algorithms

Today I mainly covered XGBoost and GBDT, which are ensemble algorithms. My notes on them are a bit messy, so I need to fill in more algorithm knowledge later. I won't explain them in detail here; I'll write a dedicated post on the algorithms later.