pyspark.sql.DataFrame.groupBy
DataFrame.groupBy(*cols: ColumnOrName) → GroupedData

Groups the DataFrame using the specified columns, so we can run aggregation on them. See GroupedData for all the available aggregate functions.

groupby() is an alias for groupBy().

New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
    cols : list, str or Column
        Columns to group by. Each element should be a column name (string), a Column expression, or a list of them.

Returns
    GroupedData
        Grouped data by given columns.
Examples
>>> df = spark.createDataFrame([
...     (2, "Alice"), (2, "Bob"), (2, "Bob"), (5, "Bob")], schema=["age", "name"])
Calling groupBy() with no grouping columns triggers a global aggregation.
>>> df.groupBy().avg().show()
+--------+
|avg(age)|
+--------+
|    2.75|
+--------+
Group by ‘name’, passing a dictionary to compute the sum of ‘age’.
>>> df.groupBy("name").agg({"age": "sum"}).sort("name").show() +-----+--------+ | name|sum(age)| +-----+--------+ |Alice| 2| | Bob| 9| +-----+--------+
Group by ‘name’ and calculate the maximum of ‘age’.
>>> df.groupBy(df.name).max().sort("name").show()
+-----+--------+
| name|max(age)|
+-----+--------+
|Alice|       2|
|  Bob|       5|
+-----+--------+
Group by ‘name’ and ‘age’, and count the rows in each group.
>>> df.groupBy(["name", df.age]).count().sort("name", "age").show() +-----+---+-----+ | name|age|count| +-----+---+-----+ |Alice| 2| 1| | Bob| 2| 2| | Bob| 5| 1| +-----+---+-----+