pyspark.RDD.partitionBy
- RDD.partitionBy(numPartitions, partitionFunc=<function portable_hash>)
- Return a copy of the RDD partitioned using the specified partitioner.
- New in version 0.7.0.
- Parameters
- numPartitions : int, optional
- the number of partitions in the new RDD
- partitionFunc : function, optional, default portable_hash
- function to compute the partition index from each key (a custom partitioner is sketched after the examples)
 
- Returns
- RDD
- a copy of the RDD partitioned using the specified partitioner
- Examples
>>> pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1]).map(lambda x: (x, x))
>>> sets = pairs.partitionBy(2).glom().collect()
>>> len(set(sets[0]).intersection(set(sets[1])))
0
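
A custom partitionFunc can replace the default portable_hash. The following is a minimal sketch (not part of the original page) that routes even keys to partition 0 and odd keys to partition 1; like the doctest above, it assumes an existing SparkContext named sc, and the sorted() calls are only there to make the per-partition output order deterministic:

>>> pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1]).map(lambda x: (x, x))
>>> by_parity = pairs.partitionBy(2, partitionFunc=lambda key: key % 2)  # partition index derived from key % 2
>>> sorted(by_parity.glom().collect()[0])  # even keys land in partition 0
[(2, 2), (2, 2), (4, 4), (4, 4)]
>>> sorted(by_parity.glom().collect()[1])  # odd keys land in partition 1
[(1, 1), (1, 1), (3, 3)]

Because partitionFunc returns an int that is taken modulo numPartitions, any key-to-int mapping works here; keys that map to the same value are guaranteed to end up in the same partition.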