pyspark.sql.streaming.DataStreamWriter.partitionBy
DataStreamWriter.partitionBy(*cols)
Partitions the output by the given columns on the file system.

If specified, the output is laid out on the file system similar to Hive’s partitioning scheme.

New in version 2.0.0.

Changed in version 3.5.0: Supports Spark Connect.

Parameters
cols : str or list
    name of columns
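Passing more than one column nests the partition directories in the order the columns are given, yielding Hive-style paths such as year=2024/month=5/. A minimal sketch of that usage (the derived year and month columns are assumptions for illustration, computed from the Rate source's timestamp):

>>> from pyspark.sql.functions import month, year
>>> df = spark.readStream.format("rate").load()  # Rate source provides a "timestamp" column
>>> (df.withColumn("year", year("timestamp"))
...     .withColumn("month", month("timestamp"))
...     .writeStream.partitionBy("year", "month"))
<...streaming.readwriter.DataStreamWriter object ...>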
 
Notes
This API is evolving.

Examples
>>> df = spark.readStream.format("rate").load()
>>> df.writeStream.partitionBy("value")
<...streaming.readwriter.DataStreamWriter object ...>

Partition by the timestamp column from the Rate source.

>>> import tempfile
>>> import time
>>> with tempfile.TemporaryDirectory(prefix="partitionBy1") as d:
...     with tempfile.TemporaryDirectory(prefix="partitionBy2") as cp:
...         df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
...         q = df.writeStream.partitionBy(
...             "timestamp").format("parquet").option("checkpointLocation", cp).start(d)
...         time.sleep(5)
...         q.stop()
...         spark.read.schema(df.schema).parquet(d).show()
+...---------+-----+
|...timestamp|value|
+...---------+-----+
...
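Because cols may be a str or a list, the partition columns can also be supplied as a single list argument, which is equivalent to the variadic form. A minimal sketch reusing the Rate source:

>>> df = spark.readStream.format("rate").load()
>>> df.writeStream.partitionBy(["timestamp", "value"])  # same as partitionBy("timestamp", "value")
<...streaming.readwriter.DataStreamWriter object ...>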