A Salted Fish (Xianyu) Turns Itself Around in 2020

It's 2020, and I have finally turned 30. I think it's time to work hard. I am currently studying machine learning, and I will record this year's progress here.

January 1, 2020: today I studied PySpark and learned some operations on RDDs and DataFrames. Here is a summary:

1. Import the packages and initialize first:
    from pyspark import SparkConf, SparkContext
    conf = SparkConf().setMaster('local').setAppName('CustomerAnalysis')
    sc = SparkContext(conf = conf)

I haven't looked into the parameters in detail, but they mean what their names say: setMaster('local') runs Spark locally, and setAppName sets the application name.

2. Read the corresponding file:
    rdd = sc.textFile('../data/xxxx.csv')

This creates an RDD object. What follows is a series of operations on that object.

3. A series of RDD operations:

take(n) returns the first n rows
    rdd.take(3)  # take the first three rows

map(f) applies the function f to each row
    rdd.map(f)
map is a very important function: it can drop or select columns directly. Many RDD methods only accept a single column or a (k, v) pair structure, and map is a convenient way to reshape rows into that form.
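To make the column selection concrete, here is a plain-Python sketch that runs without Spark. The sample lines and the column layout (customer, item, amount) are made up for illustration; in PySpark the same lambdas would be passed to rdd.map, as noted in the comments.

```python
# Hypothetical CSV lines, standing in for what sc.textFile would yield.
lines = [
    "alice,book,12",
    "bob,pen,3",
    "alice,lamp,30",
]

# Split each line into columns.
# PySpark equivalent: rdd.map(lambda line: line.split(','))
rows = list(map(lambda line: line.split(","), lines))

# Keep columns 0 and 2 as a (key, value) pair and drop column 1.
# PySpark equivalent: rows_rdd.map(lambda cols: (cols[0], int(cols[2])))
pairs = list(map(lambda cols: (cols[0], int(cols[2])), rows))

print(pairs)
```

The result is a list of (k, v) pairs, which is the shape that key-based RDD methods expect.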

reduceByKey(f) takes a function f of two inputs, which performs the reduce operation on values that share a key
    rdd.reduceByKey(lambda x, y: x + y)  # reduces the values per key, like df.groupby('key').sum()
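The per-key reduction can be sketched in plain Python, with no Spark required. The helper name reduce_by_key and the sample pairs are hypothetical; this only imitates what rdd.reduceByKey computes and is not how Spark implements it.

```python
def reduce_by_key(pairs, func):
    """Plain-Python stand-in for rdd.reduceByKey(func).

    Values that share a key are merged pairwise with func.
    """
    merged = {}
    for key, value in pairs:
        if key in merged:
            merged[key] = func(merged[key], value)
        else:
            merged[key] = value
    return list(merged.items())

# Made-up (customer, amount) pairs.
pairs = [("alice", 12), ("bob", 3), ("alice", 30)]

# PySpark equivalent: pairs_rdd.reduceByKey(lambda x, y: x + y)
totals = reduce_by_key(pairs, lambda x, y: x + y)
print(totals)
```

In real Spark, the order of the resulting keys is not guaranteed, so it is safer to sort the result or collect it into a dict before comparing.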

filter(f) keeps only the rows for which f returns True; it is used to drop or select rows

sortBy sorts the rows by a key function
    rdd.sortBy(lambda x: x[1], ascending=False)
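Filtering and sorting can be illustrated the same way. This is a plain-Python sketch on made-up (customer, total) pairs; the threshold of 10 is an arbitrary assumption, and the comments show the PySpark calls each step corresponds to.

```python
# Made-up (customer, total) pairs.
totals = [("alice", 42), ("bob", 3), ("carol", 17)]

# Keep only the rows whose total is above 10.
# PySpark equivalent: rdd.filter(lambda x: x[1] > 10)
big = list(filter(lambda x: x[1] > 10, totals))

# Sort by the second field, largest first.
# PySpark equivalent: rdd.sortBy(lambda x: x[1], ascending=False)
ranked = sorted(big, key=lambda x: x[1], reverse=True)

# Take the first row of the sorted result.
# PySpark equivalent: rdd.take(1)
top = ranked[:1]

print(ranked)
print(top)
```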

The count function counts the rows; pass in a single column

Pass in a pair RDD
