Spark RDD transformations and actions

Time:2019-11-27

This article is excerpted from Learning Spark: Lightning-Fast Big Data Analysis.

Summary

  • RDDs support two kinds of operations: transformations and actions.
  • A transformation, such as map() or filter(), returns a new RDD.
  • An action, such as count() or first(), returns a result to the driver program or writes it to an external storage system, and it triggers the actual computation.
  • Spark treats transformations and actions very differently (transformations are evaluated lazily, and nothing runs until an action is called), so it is important to know which kind of operation you are calling.
  • If you are unsure whether a given function is a transformation or an action, look at its return type: transformations return an RDD, while actions return some other data type.
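The difference between the two kinds of operation can be sketched without Spark at all. The toy class below is a plain-Python model, not Spark's implementation: `LazyRDD`, `add_one` and `calls` are hypothetical names introduced only for illustration. Transformations record a step and return a new `LazyRDD`; actions replay the recorded steps and return an ordinary value.

```python
class LazyRDD:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # tuple of ("map" | "filter", function) pairs

    # Transformations: return a new LazyRDD and compute nothing.
    def map(self, f):
        return LazyRDD(self._source, self._steps + (("map", f),))

    def filter(self, pred):
        return LazyRDD(self._source, self._steps + (("filter", pred),))

    def _run(self):
        # Replay every recorded step over the source data.
        for x in self._source:
            keep = True
            for kind, f in self._steps:
                if kind == "map":
                    x = f(x)
                elif not f(x):  # kind == "filter" and the predicate failed
                    keep = False
                    break
            if keep:
                yield x

    # Actions: trigger the computation and return a non-RDD value.
    def count(self):
        return sum(1 for _ in self._run())

    def first(self):
        return next(self._run())


calls = []  # records every element that add_one actually processes

def add_one(x):
    calls.append(x)
    return x + 1

rdd = LazyRDD([1, 2, 3, 3])
mapped = rdd.map(add_one).filter(lambda x: x != 2)  # transformations only
calls_before_action = len(calls)                    # still 0: nothing has run
n = mapped.count()                                  # the action triggers the work
calls_after_action = len(calls)
```

Here `mapped` is still a `LazyRDD` (the return-type rule for transformations), while `count()` returns a plain integer (the rule for actions).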

RDD transformations

  • Table 3-2: basic transformations on an RDD containing {1, 2, 3, 3}
| Function name | Purpose | Example | Result |
| --- | --- | --- | --- |
| map() | Apply a function to each element in the RDD and return an RDD of the results | rdd.map(x => x + 1) | {2, 3, 4, 4} |
| flatMap() | Apply a function to each element in the RDD and return an RDD made of the contents of all the returned iterators. Often used to split text into words | rdd.flatMap(x => x.to(3)) | {1, 2, 3, 2, 3, 3, 3} |
| filter() | Return an RDD containing only the elements that satisfy the condition passed to filter() | rdd.filter(x => x != 1) | {2, 3, 3} |
| distinct() | Remove duplicates | rdd.distinct() | {1, 2, 3} |
| sample(withReplacement, fraction, [seed]) | Sample an RDD, with or without replacement | rdd.sample(false, 0.5) | Nondeterministic |
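The results in Table 3-2 can be reproduced with a plain Python list standing in for the RDD. This is only a model of the semantics, not Spark code. The sampling line assumes that sampling without replacement keeps each element independently with probability `fraction`, which would explain why the result size is not fixed; the seed value is arbitrary.

```python
import random

rdd = [1, 2, 3, 3]  # a plain list standing in for the RDD

# map(): exactly one output element per input element
mapped = [x + 1 for x in rdd]

# flatMap(): each element yields a sequence (here x..3 inclusive, like
# Scala's x.to(3)) and all sequences are flattened into one collection
flat_mapped = [y for x in rdd for y in range(x, 4)]

# filter(): keep only the elements that satisfy the predicate
filtered = [x for x in rdd if x != 1]

# distinct(): Spark guarantees no ordering; sorted here only for readability
distinct = sorted(set(rdd))

# sample(false, 0.5), modelled as an independent coin flip per element
rng = random.Random(42)  # arbitrary seed, for repeatability of this sketch
sampled = [x for x in rdd if rng.random() < 0.5]
```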
  • Table 3-3: transformations on two RDDs containing {1, 2, 3} and {3, 4, 5}
| Function name | Purpose | Example | Result |
| --- | --- | --- | --- |
| union() | Produce an RDD containing the elements of both RDDs (duplicates are kept) | rdd.union(other) | {1, 2, 3, 3, 4, 5} |
| intersection() | Produce an RDD containing only the elements found in both RDDs | rdd.intersection(other) | {3} |
| subtract() | Remove the elements that also appear in the other RDD | rdd.subtract(other) | {1, 2} |
| cartesian() | Cartesian product with the other RDD | rdd.cartesian(other) | {(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 3), (3, 4), (3, 5)} |
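The set-like operations in Table 3-3 map onto ordinary Python collections, which makes one difference easy to see: union() keeps duplicates, while intersection() does not. A minimal sketch with plain lists standing in for the two RDDs:

```python
from itertools import product

rdd = [1, 2, 3]
other = [3, 4, 5]

# union(): simple concatenation, so the shared element 3 appears twice
union = rdd + other

# intersection(): common elements only, with duplicates removed
intersection = sorted(set(rdd) & set(other))

# subtract(): elements of rdd that do not appear in other
other_set = set(other)
subtract = [x for x in rdd if x not in other_set]

# cartesian(): every (a, b) pair with a from rdd and b from other
cartesian = list(product(rdd, other))
```

The Cartesian product has len(rdd) * len(other) elements, so it grows quickly on large inputs.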

RDD actions

  • Table 3-4: basic actions on an RDD containing {1, 2, 3, 3}
| Function name | Purpose | Example | Result |
| --- | --- | --- | --- |
| collect() | Return all elements of the RDD | rdd.collect() | {1, 2, 3, 3} |
| count() | Number of elements in the RDD | rdd.count() | 4 |
| countByValue() | Number of times each element occurs in the RDD | rdd.countByValue() | {(1, 1), (2, 1), (3, 2)} |
| take(num) | Return num elements from the RDD | rdd.take(2) | {1, 2} |
| top(num) | Return the top num elements of the RDD | rdd.top(2) | {3, 3} |
| takeOrdered(num)(ordering) | Return num elements based on the provided ordering | rdd.takeOrdered(2)(myOrdering) | {3, 3} |
| takeSample(withReplacement, num, [seed]) | Return num elements chosen at random | rdd.takeSample(false, 1) | Nondeterministic |
| reduce(func) | Combine the elements of the RDD in parallel (e.g., to compute a sum) | rdd.reduce((x, y) => x + y) | 9 |
| fold(zero)(func) | Same as reduce(), but with a provided zero value | rdd.fold(0)((x, y) => x + y) | 9 |
| ★ aggregate(zeroValue)(seqOp, combOp) | Similar to reduce(), but used when the result type differs from the element type | rdd.aggregate((0, 0))((x, y) => (x._1 + y, x._2 + 1), (x, y) => (x._1 + y._1, x._2 + y._2)) | (9, 4) |
| foreach(func) | Apply the provided function to each element of the RDD | rdd.foreach(func) | Nothing |
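The aggregate() row is the hardest to read from the table alone, so here is a plain-Python model of how the two functions cooperate. It is a sketch, not Spark's implementation: the split of the data into two partitions is an assumption, and `aggregate`, `fold`, `seq_op` and `comb_op` are hypothetical helper names. The sequence function folds the elements of one partition into an accumulator; the combine function merges the accumulators of different partitions.

```python
from functools import reduce

# Hypothetical split of the RDD {1, 2, 3, 3} into two partitions.
partitions = [[1, 2], [3, 3]]

def aggregate(parts, zero, seq_op, comb_op):
    # Fold each partition into an accumulator, starting from zero.
    per_partition = [reduce(seq_op, part, zero) for part in parts]
    # Merge the per-partition accumulators, again starting from zero.
    return reduce(comb_op, per_partition, zero)

# Accumulator is a (sum, count) pair, a different type from the elements.
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

total, count = aggregate(partitions, (0, 0), seq_op, comb_op)
average = total / count

def fold(parts, zero, op):
    # fold() is the special case where both functions are the same.
    return aggregate(parts, zero, op, op)

fold_sum = fold(partitions, 0, lambda x, y: x + y)
# A zero value that is not an identity is counted once per partition
# and once more in the final merge: 9 + 3 * 1 in this model.
fold_bad = fold(partitions, 1, lambda x, y: x + y)
```

The last line shows why the zero value should be an identity for the operation (0 for addition, 1 for multiplication, an empty list for concatenation): otherwise the result depends on the number of partitions.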

Pair RDD transformations

  • Table 4-1: transformations on one pair RDD, using the key-value pairs {(1, 2), (3, 4), (3, 6)} as an example
| Function name | Purpose | Example | Result |
| --- | --- | --- | --- |
| reduceByKey(func) | Combine values with the same key | rdd.reduceByKey((x, y) => x + y) | {(1, 2), (3, 10)} |
| groupByKey() | Group values with the same key | rdd.groupByKey() | {(1, [2]), (3, [4, 6])} |
| ★ combineByKey(createCombiner, mergeValue, mergeCombiners, partitioner) | Combine values with the same key using a different result type | See Examples 4-12 to 4-14 | |
| mapValues(func) | Apply a function to each value of a pair RDD without changing the key | rdd.mapValues(x => x + 1) | {(1, 3), (3, 5), (3, 7)} |
| flatMapValues(func) | Apply a function that returns an iterator to each value of a pair RDD, and produce one key-value pair with the original key for each element returned. Often used for tokenization | rdd.flatMapValues(x => (x to 5)) | {(1, 2), (1, 3), (1, 4), (1, 5), (3, 4), (3, 5)} |
| keys() | Return an RDD containing only the keys | rdd.keys() | {1, 3, 3} |
| values() | Return an RDD containing only the values | rdd.values() | {2, 4, 6} |
| sortByKey() | Return an RDD sorted by key | rdd.sortByKey() | {(1, 2), (3, 4), (3, 6)} |
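Since the combineByKey() row carries no inline example, the following plain-Python model may help show what its three functions do. It is a sketch under stated assumptions, not Spark code: the partition layout is invented, and `combine_by_key` is a hypothetical helper. Within a partition, the first value seen for a key goes through createCombiner and later values through mergeValue; combiners for the same key from different partitions are then joined with mergeCombiners.

```python
# Hypothetical partitioning of {(1, 2), (3, 4), (3, 6)}; key 3 spans both.
partitions = [[(1, 2), (3, 4)], [(3, 6)]]

def combine_by_key(parts, create_combiner, merge_value, merge_combiners):
    per_partition = []
    for part in parts:
        combiners = {}
        for key, value in part:
            if key in combiners:
                # Key already seen in this partition: fold the value in.
                combiners[key] = merge_value(combiners[key], value)
            else:
                # First time the key is seen in this partition.
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    merged = {}
    for combiners in per_partition:
        for key, comb in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged

# Per-key (sum, count): the combiner type differs from the value type.
sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda acc, v: (acc[0] + v, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
averages = {k: s / n for k, (s, n) in sum_count.items()}

# In this model, reduceByKey(func) is the special case where the combiner
# is the value itself and both merge functions are func.
add = lambda x, y: x + y
reduced = combine_by_key(partitions, lambda v: v, add, add)
```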
  • Table 4-2: transformations on two pair RDDs (rdd = {(1, 2), (3, 4), (3, 6)}, other = {(3, 9)})
| Function name | Purpose | Example | Result |
| --- | --- | --- | --- |
| subtractByKey | Remove the elements of rdd whose key is present in other | rdd.subtractByKey(other) | {(1, 2)} |
| join | Inner join of the two RDDs | rdd.join(other) | {(3, (4, 9)), (3, (6, 9))} |
| ★ rightOuterJoin | Join the two RDDs, keeping every key of the second RDD (other); values missing from the first RDD become None (right outer join) | rdd.rightOuterJoin(other) | {(3, (Some(4), 9)), (3, (Some(6), 9))} |
| ★ leftOuterJoin | Join the two RDDs, keeping every key of the first RDD (rdd); values missing from the second RDD become None (left outer join) | rdd.leftOuterJoin(other) | {(1, (2, None)), (3, (4, Some(9))), (3, (6, Some(9)))} |
| cogroup | Group the data of both RDDs that share the same key | rdd.cogroup(other) | {(1, ([2], [])), (3, ([4, 6], [9]))} |
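Which side's keys survive an outer join is easy to mix up, so the sketch below models the joins of Table 4-2 in plain Python, with all of them built on a cogroup-style grouping. The helper names are hypothetical and this is not Spark code. Python has no Option type, so a present value is written as the value itself (where Scala shows Some(4)) and a missing one as None. `other_extra` is an invented second data set with a key that exists only on the right side.

```python
from collections import defaultdict

rdd = [(1, 2), (3, 4), (3, 6)]
other = [(3, 9)]
other_extra = [(3, 9), (7, 8)]  # hypothetical: key 7 exists only here

def group(pairs):
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return grouped

def cogroup(left, right):
    l, r = group(left), group(right)
    # One entry per key found on either side, with both value lists.
    return {k: (l.get(k, []), r.get(k, [])) for k in sorted(set(l) | set(r))}

def join(left, right):
    # Inner join: a key must have values on both sides.
    return [(k, (v, w))
            for k, (vs, ws) in cogroup(left, right).items()
            for v in vs for w in ws]

def left_outer_join(left, right):
    # Every key of `left` is kept; a missing right value becomes None.
    return [(k, (v, w))
            for k, (vs, ws) in cogroup(left, right).items()
            for v in vs for w in (ws or [None])]

def right_outer_join(left, right):
    # Every key of `right` is kept; a missing left value becomes None.
    return [(k, (v, w))
            for k, (vs, ws) in cogroup(left, right).items()
            for v in (vs or [None]) for w in ws]

grouped = cogroup(rdd, other)
inner = join(rdd, other)
left = left_outer_join(rdd, other)
right = right_outer_join(rdd, other)
right_extra = right_outer_join(rdd, other_extra)
```

With `other_extra`, the right outer join gains the entry (7, (None, 8)), while key 1, which exists only in `rdd`, is still dropped.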

Pair RDD actions

  • Table 4-3: actions on a pair RDD, using the key-value pairs {(1, 2), (3, 4), (3, 6)} as an example
| Function name | Purpose | Example | Result |
| --- | --- | --- | --- |
| countByKey() | Count the number of elements for each key | rdd.countByKey() | {(1, 1), (3, 2)} |
| collectAsMap() | Collect the result as a map for easy lookup (only one value is kept per key) | rdd.collectAsMap() | Map{(1, 2), (3, 6)} |
| lookup(key) | Return all values associated with the given key | rdd.lookup(3) | [4, 6] |
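The three pair actions in Table 4-3 can be modelled with plain Python collections. This is a sketch of the semantics only; `lookup` is a hypothetical helper. In this model a repeated key keeps its last value when the pairs are turned into a map, which matches the Map{(1, 2), (3, 6)} result in the table; which value survives in Spark itself should be treated as an assumption.

```python
from collections import Counter

rdd = [(1, 2), (3, 4), (3, 6)]  # a plain list standing in for the pair RDD

# countByKey(): number of elements per key
count_by_key = dict(Counter(k for k, _ in rdd))

# collectAsMap(): a map holds one value per key, so (3, 4) is replaced
# by the later pair (3, 6) in this model
as_map = dict(rdd)

# lookup(key): every value stored under the key, or an empty list
def lookup(pairs, key):
    return [v for k, v in pairs if k == key]

values_for_3 = lookup(rdd, 3)
values_for_5 = lookup(rdd, 5)
```

When every value for a key matters, lookup() or groupByKey() is the safer choice, because the map form silently drops the extra values.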

This article is from Walker snapshot