It's 2020, and I have finally turned 30. I think it's time to work hard. I am currently learning machine learning, and I will record my progress this year here.
On January 1, 2020, I studied PySpark and learned some basic operations on RDDs and DataFrames. A summary:
1. Import the packages and initialize first:

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster('local').setAppName('CustomerAnalysis')
sc = SparkContext(conf=conf)
I haven't checked the parameters in detail, but they are literally what they say: setMaster('local') runs Spark locally, and setAppName sets the application name.
2. Read the corresponding file:

rdd = sc.textFile('../data/xxxx.csv')
This creates an RDD object. Everything that follows is a series of operations on it.
3. A series of RDD operations:

rdd.take(3) # take out the first three lines
rdd.map(f) # apply the function f to each line; map is a very important function. It can drop and select columns directly, and because many RDD methods only accept a single column or a (k, v) structure, map works well for dropping and selecting.
rdd.reduceByKey(lambda x, y: x + y) # takes a function of two inputs; it reduces the values by key, like df.groupby('key').sum()
rdd.filter(f) # keep only the elements for which f returns True (dropping and selecting)
rdd.sortBy(lambda x: x, ascending=False) # sorting
rdd.count() # count the elements of a single-column RDD
rdd.countByKey() # count per key, for a pair RDD