pyspark.SparkContext.union#
- SparkContext.union(rdds)[source]#
- Build the union of a list of RDDs.

  This supports unions() of RDDs with different serialized formats, although this forces them to be reserialized using the default serializer.

  New in version 0.7.0.

  See also

  RDD.union

  Examples

  >>> import os
  >>> import tempfile
  >>> with tempfile.TemporaryDirectory(prefix="union") as d:
  ...     # generate a text RDD
  ...     with open(os.path.join(d, "union-text.txt"), "w") as f:
  ...         _ = f.write("Hello")
  ...     text_rdd = sc.textFile(d)
  ...
  ...     # generate another RDD
  ...     parallelized = sc.parallelize(["World!"])
  ...
  ...     unioned = sorted(sc.union([text_rdd, parallelized]).collect())

  >>> unioned
  ['Hello', 'World!']