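RDD Transformation Operators

A transformation does not compute a result immediately. It returns a new RDD that lazily records the computation to apply to the original RDD; nothing runs until an action (such as collect()) is called.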
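
The laziness described above can be illustrated without a cluster. The following is a minimal plain-Python analogy, not Spark code: the built-in map stands in for a transformation and list stands in for an action. The names square and calls are made up for this sketch.

```python
# Plain-Python analogy for lazy transformations (this is not Spark code).
calls = []  # records every time the mapping function actually runs


def square(x):
    calls.append(x)
    return x * x


data = [1, 2, 3, 4]

# "Transformation": builds a lazy pipeline; square() has not run yet.
lazy_squares = map(square, data)
calls_before_action = len(calls)

# "Action": consuming the pipeline forces the computation.
result = list(lazy_squares)
calls_after_action = len(calls)

print(calls_before_action)  # 0
print(calls_after_action)   # 4
print(result)               # [1, 4, 9, 16]
```

In Spark the same pattern holds at a larger scale: rdd.map(...) only describes the work, and an action such as collect() triggers it.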

Common RDD Transformation Operators

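map(func)
- Applies the function func to every element of the RDD and returns a new RDD of the results.
- Example: square every number in a list.
```
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # the examples below reuse this sc

rdd = sc.parallelize([1, 2, 3, 4])
squared_rdd = rdd.map(lambda x: x * x)
print(squared_rdd.collect())  # Output: [1, 4, 9, 16]
```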

filter(func)
- Returns a new RDD containing only the elements of the original RDD for which func returns true.
- Example: keep only the even numbers in a list.
```
rdd = sc.parallelize([1, 2, 3, 4, 5])
even_rdd = rdd.filter(lambda x: x % 2 == 0)
print(even_rdd.collect())  # Output: [2, 4]
```

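flatMap(func)

Similar to map, but each input element can be mapped to zero or more output elements: func should return an iterable (such as a list), and the results are flattened into a single RDD.

Example: split sentences into words.

rdd = sc.parallelize(["hello world", "spark is fun"])
words_rdd = rdd.flatMap(lambda x: x.split(" "))
print(words_rdd.collect())  # Output: ['hello', 'world', 'spark', 'is', 'fun']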
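
To make the difference between map and flatMap concrete, here is a small plain-Python analogy (not Spark code). It assumes itertools.chain.from_iterable as a stand-in for the flattening step; split_words is a hypothetical helper name.

```python
from itertools import chain

sentences = ["hello world", "spark is fun"]


def split_words(sentence):
    return sentence.split(" ")


# map-like: exactly one output (a list of words) per input sentence.
mapped = [split_words(s) for s in sentences]

# flatMap-like: the per-sentence lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(split_words(s) for s in sentences))

print(mapped)       # [['hello', 'world'], ['spark', 'is', 'fun']]
print(flat_mapped)  # ['hello', 'world', 'spark', 'is', 'fun']
```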

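mapPartitions(func)

Similar to map, but func is applied once to each partition of the RDD rather than to each element. It receives an iterator over the partition's elements and should return (or yield) an iterable of results.

Example: add 10 to every element, one partition at a time.

def add_ten(iterator):
    for x in iterator:
        yield x + 10

rdd = sc.parallelize([1, 2, 3, 4], 2)  # 2 partitions
incremented_rdd = rdd.mapPartitions(add_ten)
print(incremented_rdd.collect())  # Output: [11, 12, 13, 14]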
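
One common reason to use mapPartitions is to pay a setup cost once per partition instead of once per element. The sketch below simulates this in plain Python under stated assumptions: partitions is a hand-built list of lists standing in for an RDD's partitions, and expensive_setup is a hypothetical placeholder for something like opening a connection.

```python
setup_log = []  # one entry per call to expensive_setup()


def expensive_setup():
    """Hypothetical stand-in for costly initialisation (e.g. opening a connection)."""
    setup_log.append("setup")
    return 10


def add_offset(iterator):
    offset = expensive_setup()  # runs once per partition
    for x in iterator:
        yield x + offset        # runs once per element


# Stand-in for an RDD with 2 partitions; Spark would pass each one as an iterator.
partitions = [[1, 2], [3, 4]]
result = [y for part in partitions for y in add_offset(iter(part))]

print(result)          # [11, 12, 13, 14]
print(len(setup_log))  # 2
```

With a per-element map, the same setup would run four times here instead of twice.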

mapPartitionsWithIndex(func)

Similar to mapPartitions, but func also receives the index of the partition as its first argument.

Example: tag every element with the index of its partition.

def add_index(index, iterator):
    for x in iterator:
        yield (index, x)

rdd = sc.parallelize([1, 2, 3, 4, 5, 6], 2)  # 2 partitions
indexed_rdd = rdd.mapPartitionsWithIndex(add_index)
print(indexed_rdd.collect())  # Output: [(0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (1, 6)]

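sample(withReplacement, fraction, seed=None)
- Randomly samples the RDD. fraction is the expected proportion of elements to keep, so the sample size is not guaranteed to be exact; withReplacement controls whether an element can be picked more than once.
- Example: sample roughly 50% of the elements of an RDD.
```
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
sampled_rdd = rdd.sample(False, 0.5, 1234)
print(sampled_rdd.collect())  # Possible output: [1, 2, 4, 5, 9]
```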
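
Sampling without replacement can be pictured as an independent coin flip per element, which is why fraction controls the expected size rather than the exact size. The following is a plain-Python sketch of that idea, not Spark's actual sampler, so its output will not match rdd.sample for the same seed. bernoulli_sample is a hypothetical helper name.

```python
import random


def bernoulli_sample(data, fraction, seed):
    """Hypothetical helper: keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]


data = list(range(1, 11))
first = bernoulli_sample(data, 0.5, 1234)
second = bernoulli_sample(data, 0.5, 1234)

print(first)            # some subset of data; its length need not be exactly 5
print(first == second)  # True: the same seed reproduces the same sample
```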

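union(otherRDD)

Returns an RDD containing the elements of both RDDs. Unlike a set union, duplicates are kept; call distinct() afterwards to remove them.

Example: combine two RDDs.

rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([4, 5, 6])
union_rdd = rdd1.union(rdd2)
print(union_rdd.collect())  # Output: [1, 2, 3, 4, 5, 6]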

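intersection(otherRDD)

Returns the intersection of the two RDDs: the elements present in both. The result contains no duplicates.

Example: find the elements common to two RDDs.

rdd1 = sc.parallelize([1, 2, 3, 4])
rdd2 = sc.parallelize([3, 4, 5, 6])
intersection_rdd = rdd1.intersection(rdd2)
print(intersection_rdd.collect())  # Output: [3, 4] (order may vary)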

distinct([numPartitions])

Returns a new RDD with duplicate elements removed.

Example: remove the duplicates from an RDD.

rdd = sc.parallelize([1, 2, 3, 2, 1, 4])
distinct_rdd = rdd.distinct()
print(distinct_rdd.collect())  # Output: [1, 2, 3, 4] (order may vary)

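groupByKey([numPartitions])

Groups the values of a key-value RDD by key. Returns a new RDD of (key, values) pairs, where values is an iterable over all the values the original RDD holds for that key.

Example: group values by key.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
grouped_rdd = rdd.groupByKey()
for key, values in grouped_rdd.collect():
    print(f"{key}: {list(values)}")
# Output (the order of keys may vary):
# a: [1, 2]
# b: [1]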

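reduceByKey(func, [numPartitions])

Merges the values of each key in a key-value RDD using the function func, which should be associative and commutative.

Example: sum the values for each key.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
reduced_rdd = rdd.reduceByKey(lambda x, y: x + y)
print(reduced_rdd.collect())  # Output: [('a', 3), ('b', 1)] (order may vary)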

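aggregateByKey(zeroValue, seqOp, combOp, [numPartitions])

Performs a more general aggregation over the values of each key in a key-value RDD. zeroValue is the initial accumulator for each key, seqOp merges a value into the accumulator within a partition, and combOp merges accumulators across partitions.

Example: collect the values of each key into a list and print their sum.

rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2), ("b", 3)])
def seqOp(values, v):
    return values + [v]

def combOp(values1, values2):
    return values1 + values2

aggregated_rdd = rdd.aggregateByKey([], seqOp, combOp)
for key, values in aggregated_rdd.collect():
    print(f"{key}: {values}, sum: {sum(values)}")
# Output (the order of keys may vary):
# a: [1, 2], sum: 3
# b: [1, 3], sum: 4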
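
The two-stage roles of seqOp and combOp can be traced by hand on ordinary lists. The sketch below is a plain-Python simulation, not Spark's implementation: aggregate_by_key is a hypothetical helper, partitions is a hand-built stand-in for an RDD split into two partitions, and the accumulator is assumed to be a (sum, count) tuple.

```python
def aggregate_by_key(partitions, zero_value, seq_op, comb_op):
    """Hypothetical helper mimicking aggregateByKey on plain lists."""
    # Stage 1: within each partition, fold values into a per-key accumulator.
    per_partition = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero_value), value)
        per_partition.append(acc)

    # Stage 2: merge the per-partition accumulators key by key.
    merged = {}
    for acc in per_partition:
        for key, partial in acc.items():
            merged[key] = comb_op(merged[key], partial) if key in merged else partial
    return merged


partitions = [[("a", 1), ("b", 1)], [("a", 2), ("b", 3)]]
sum_count = aggregate_by_key(
    partitions,
    (0, 0),                                     # zero value: (sum, count)
    lambda acc, v: (acc[0] + v, acc[1] + 1),    # seqOp: add one value
    lambda a, b: (a[0] + b[0], a[1] + b[1]),    # combOp: merge two accumulators
)

print(sum_count)  # {'a': (3, 2), 'b': (4, 2)}
```

Note that the accumulator type, a tuple here, differs from the value type, an int. That flexibility is the main difference from reduceByKey.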

sortBy(keyfunc, ascending=True, numPartitions=None)

Sorts the elements of the RDD by the key that keyfunc computes for each element.

Example: sort pairs by their numeric first field.

rdd = sc.parallelize([(3, "three"), (1, "one"), (2, "two")])
sorted_rdd = rdd.sortBy(lambda x: x[0])
print(sorted_rdd.collect())  # Output: [(1, 'one'), (2, 'two'), (3, 'three')]

sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: x)
- Sorts a key-value RDD by key.
- Example: sort by key.
```
rdd = sc.parallelize([("b", 2), ("a", 1), ("c", 3)])
sorted_rdd = rdd.sortByKey()
print(sorted_rdd.collect())  # Output: [('a', 1), ('b', 2), ('c', 3)]
```
