pyspark.sql.streaming.DataStreamWriter.partitionBy
DataStreamWriter.partitionBy(*cols)
Partitions the output by the given columns on the file system.

If specified, the output is laid out on the file system similar to Hive’s partitioning scheme.

New in version 2.0.0.

Changed in version 3.5.0: Supports Spark Connect.

Parameters
cols : str or list
    name of columns
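Passing more than one column nests the partition directories in the order the columns are given, yielding Hive-style paths such as year=2024/month=5/. A minimal sketch of that usage (the derived year and month columns are assumptions for illustration, computed from the Rate source's timestamp):

>>> from pyspark.sql.functions import month, year
>>> df = spark.readStream.format("rate").load()  # Rate source provides a "timestamp" column
>>> (df.withColumn("year", year("timestamp"))
...     .withColumn("month", month("timestamp"))
...     .writeStream.partitionBy("year", "month"))
<...streaming.readwriter.DataStreamWriter object ...>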
 
Notes
This API is evolving.

Examples
>>> df = spark.readStream.format("rate").load()
>>> df.writeStream.partitionBy("value")
<...streaming.readwriter.DataStreamWriter object ...>

Partition by the timestamp column from the Rate source.

>>> import tempfile
>>> import time
>>> with tempfile.TemporaryDirectory(prefix="partitionBy1") as d:
...     with tempfile.TemporaryDirectory(prefix="partitionBy2") as cp:
...         df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
...         q = df.writeStream.partitionBy(
...             "timestamp").format("parquet").option("checkpointLocation", cp).start(d)
...         time.sleep(5)
...         q.stop()
...         spark.read.schema(df.schema).parquet(d).show()
+...---------+-----+
|...timestamp|value|
+...---------+-----+
...
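Because cols may be a str or a list, the partition columns can also be supplied as a single list argument, which is equivalent to the variadic form. A minimal sketch reusing the Rate source:

>>> df = spark.readStream.format("rate").load()
>>> df.writeStream.partitionBy(["timestamp", "value"])  # same as partitionBy("timestamp", "value")
<...streaming.readwriter.DataStreamWriter object ...>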