4. RDD Operations

Date: 2022-05-24

1、 RDD Creation

Loading data from the local file system to create an RDD

  • sc: the SparkContext (created automatically by the pyspark shell)

  • Loading data from the local file system

    Spark uses the textFile() method to load data from a file system and create an RDD.

    This method takes the URI of a file as its parameter, which can be:

    • a path on the local file system,

    • an address on the distributed file system HDFS,

    • an Amazon S3 address, etc.
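
    A minimal sketch in the pyspark shell (where sc already exists); the local path and the file word.txt are hypothetical examples. Note the file:// scheme for local paths:

    >>> lines = sc.textFile("file:///usr/local/spark/word.txt")
    >>> lines.first()    # action: returns the first line of the file as a string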

Loading data from HDFS to create an RDD

  1. Start HDFS

  2. Upload the file to HDFS

  3. View the file on HDFS

  4. Load the file in Spark

    textFile() reads from HDFS by default, so the hdfs:// scheme can be omitted.

    If the file is in the current user's default HDFS directory, the first three statements in the sketch after this list are completely equivalent; you can use any of them.

    If the file is not in the default directory, you need to give the full path.

  5. Stop HDFS
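
  A minimal sketch of step 4, assuming word.txt was uploaded to the current user's default HDFS directory (here /user/hadoop) and that HDFS listens on localhost:9000; both details depend on your Hadoop configuration:

    >>> lines = sc.textFile("hdfs://localhost:9000/user/hadoop/word.txt")
    >>> lines = sc.textFile("/user/hadoop/word.txt")    # same file, hdfs:// scheme omitted
    >>> lines = sc.textFile("word.txt")                 # same file, relative to the default directory
    >>> # a file outside the default directory needs its full path, e.g. a hypothetical /input:
    >>> other = sc.textFile("/input/word.txt")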

Creating RDDs from parallelized collections (lists)

  • Pass in a Python list, a string, or a numpy array to generate an RDD (see the sketch below)
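
  A minimal sketch using sc.parallelize(); the sample values are arbitrary:

    >>> rdd = sc.parallelize([1, 2, 3, 4, 5])      # from a Python list
    >>> rdd.collect()
    [1, 2, 3, 4, 5]
    >>> sc.parallelize("abc").collect()            # a string is split into characters
    ['a', 'b', 'c']
    >>> import numpy as np
    >>> sc.parallelize(np.arange(5)).collect()     # from a numpy array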

2、 RDD Operations

Transformation operations

  • Each transformation takes an RDD and generates a new RDD for the next transformation.

  • Transformations are evaluated lazily: the transformation process only records the lineage (the trail of transformations), and no real computation happens. Real computation is triggered only when an action is encountered, at which point the physical transformations are executed from the beginning of the lineage.

operation          meaning
filter(func)       Returns a new dataset containing the elements for which func returns true
map(func)          Passes each element through the function func and returns the results as a new dataset
flatMap(func)      Similar to map(), but each input element can be mapped to 0 or more output results
groupByKey()       Applied to a dataset of (K, V) pairs, returns a new dataset of (K, Iterable) pairs
reduceByKey(func)  Applied to a dataset of (K, V) pairs, returns a new dataset of (K, V) pairs in which each value is the result of aggregating the values of each key with func

filter(func)

  • Explicitly defined function

    If the result is not obvious (for example, no lines match), try a different keyword.

  • Lambda function (both forms are sketched below)
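
  A minimal sketch of both forms, reusing the assumed word.txt and using "Spark" as an arbitrary keyword:

    >>> lines = sc.textFile("file:///usr/local/spark/word.txt")
    >>> def has_spark(line):          # explicitly defined function
    ...     return "Spark" in line
    ...
    >>> lines.filter(has_spark).collect()
    >>> lines.filter(lambda line: "Spark" in line).collect()   # equivalent lambda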

map(func)

  1. Splitting strings

    • Explicitly defined function

    • Lambda function

  2. Adding 100 to each number

    • Explicitly defined function

    • Lambda function

  3. Adding a fixed prefix to each string

    • Explicitly defined function

    • Lambda function (all three cases are sketched below)
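
  A minimal sketch of the three cases on arbitrary sample data; the explicit-function form is shown once, since the other cases translate the same way:

    >>> lines = sc.textFile("file:///usr/local/spark/word.txt")
    >>> lines.map(lambda line: line.split(" ")).collect()      # 1. split each line into a list of words
    >>> def split_line(line):                                  # equivalent explicitly defined function
    ...     return line.split(" ")
    ...
    >>> lines.map(split_line).collect()
    >>> nums = sc.parallelize([1, 2, 3, 4, 5])
    >>> nums.map(lambda x: x + 100).collect()                  # 2. [101, 102, 103, 104, 105]
    >>> words = sc.parallelize(["spark", "hadoop"])
    >>> words.map(lambda w: "prefix_" + w).collect()           # 3. ['prefix_spark', 'prefix_hadoop']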

flatMap(func)

  1. Word splitting

  2. Mapping words to key-value pairs (see the sketch below)
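
  A minimal sketch, reusing the assumed word.txt; unlike map(), flatMap() flattens the per-line word lists into one RDD of words:

    >>> lines = sc.textFile("file:///usr/local/spark/word.txt")
    >>> words = lines.flatMap(lambda line: line.split(" "))   # 1. one flat RDD of words
    >>> pairs = words.map(lambda w: (w, 1))                   # 2. each word becomes a (word, 1) pair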

reduceByKey()

  1. Word frequency statistics: accumulate the counts for each word

  2. The same pattern with multiplication instead of addition (both are sketched below)
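
  A minimal sketch on hand-built (word, 1) pairs (arbitrary sample data); the output order of collect() may vary:

    >>> pairs = sc.parallelize([("spark", 1), ("hadoop", 1), ("spark", 1)])
    >>> pairs.reduceByKey(lambda a, b: a + b).collect()   # 1. [('spark', 2), ('hadoop', 1)]
    >>> pairs.reduceByKey(lambda a, b: a * b).collect()   # 2. multiply values with the same key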

groupByKey()

  1. Group the pairs by word

  2. View the contents of each group

  3. Accumulate with map() after grouping (all three steps are sketched below)
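
  A minimal sketch on the same kind of (word, 1) pairs; groupByKey() yields (key, iterable) pairs, so the values must be materialized with list() to view them:

    >>> pairs = sc.parallelize([("spark", 1), ("hadoop", 1), ("spark", 1)])
    >>> grouped = pairs.groupByKey()                               # 1. (word, iterable of counts)
    >>> grouped.map(lambda kv: (kv[0], list(kv[1]))).collect()     # 2. e.g. [('spark', [1, 1]), ('hadoop', [1])]
    >>> grouped.map(lambda kv: (kv[0], sum(kv[1]))).collect()      # 3. same result as reduceByKey with addition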

sortByKey()

  1. Word frequency statistics sorted by word (see the sketch below)
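
  A minimal sketch on hand-built word counts (arbitrary data); sortByKey() sorts (K, V) pairs by key, ascending by default:

    >>> counts = sc.parallelize([("spark", 2), ("hadoop", 1), ("flink", 3)])
    >>> counts.sortByKey().collect()        # [('flink', 3), ('hadoop', 1), ('spark', 2)]
    >>> counts.sortByKey(False).collect()   # descending order of keys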

sortBy()

  1. Word frequency statistics sorted by word frequency (see the sketch below)
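
  A minimal sketch on the same kind of data; sortBy() takes a key function, so the pairs can be ordered by their value (the frequency) instead of the key:

    >>> counts = sc.parallelize([("spark", 2), ("hadoop", 1), ("flink", 3)])
    >>> counts.sortBy(lambda kv: kv[1], ascending=False).collect()   # [('flink', 3), ('spark', 2), ('hadoop', 1)]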

Action operations

Actions are where computation is actually triggered. A Spark program performs real computation only when it reaches an action: it loads the data from the file, carries out the transformations one after another, and finally runs the action to obtain the result.

operation      meaning
count()        Returns the number of elements in the dataset
collect()      Returns all the elements of the dataset as an array
first()        Returns the first element of the dataset
take(n)        Returns the first n elements of the dataset as an array
foreach(func)  Runs the function func on each element of the dataset
reduce(func)   Aggregates the elements of the dataset with the function func (which takes two arguments and returns one value)

foreach(func)

  • foreach(print)

  • foreach(lambda a: print(a.upper()))
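
  A minimal sketch on arbitrary sample data; note that foreach() runs func on the executors, so in local mode the output appears in the console, while on a cluster it goes to the executor logs:

    >>> words = sc.parallelize(["spark", "hadoop"])
    >>> words.foreach(print)
    >>> words.foreach(lambda a: print(a.upper()))   # SPARK, HADOOP (order may vary)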

collect()

count()

take(n)
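
  A minimal sketch covering count(), collect(), first(), and take(n) on an arbitrary numeric RDD:

    >>> rdd = sc.parallelize([1, 2, 3, 4, 5])
    >>> rdd.count()       # 5
    >>> rdd.collect()     # [1, 2, 3, 4, 5]
    >>> rdd.first()       # 1
    >>> rdd.take(3)       # [1, 2, 3]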

reduce()

  • Accumulating the elements of a numeric RDD

  • Different from reduceByKey()

    reduce(func) is an action: it aggregates all the elements of the RDD into a single value. reduceByKey(func), by contrast, is a transformation: applied to a dataset of (K, V) pairs, it returns a new dataset of (K, V) pairs in which each value is the result of aggregating the values of each key with func (both are sketched below).
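
  A minimal sketch of the contrast on arbitrary data: reduce() yields one value on the driver, while reduceByKey() yields a new RDD:

    >>> sc.parallelize([1, 2, 3, 4, 5]).reduce(lambda a, b: a + b)    # 15, a single value
    >>> sc.parallelize([("a", 1), ("a", 2), ("b", 3)]).reduceByKey(lambda a, b: a + b).collect()
    >>> # [('a', 3), ('b', 3)]  (an RDD of pairs; order may vary)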