pyspark.sql.functions.count_min_sketch
- pyspark.sql.functions.count_min_sketch(col, eps, confidence, seed=None)[source]
- Returns a count-min sketch of a column with the given eps, confidence and seed. The result is an array of bytes, which can be deserialized to a CountMinSketch before usage. Count-min sketch is a probabilistic data structure used for frequency estimation using sub-linear space.
- New in version 3.5.0.
- Parameters
- col : Column or column name
- target column to compute on.
- eps : Column or float
- relative error, must be positive
- Changed in version 4.0.0: eps now accepts float value.
- confidence : Column or float
- confidence, must be positive and less than 1.0
- Changed in version 4.0.0: confidence now accepts float value.
- seed : Column or int, optional
- random seed
- Changed in version 4.0.0: seed now accepts int value.
 
- Returns
- Column
- count-min sketch of the column 
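- Note: as a rough guide, eps drives the width of the sketch and confidence drives its depth (number of hash rows). The mapping below is an assumption based on the org.apache.spark.util.sketch.CountMinSketch implementation backing this function, not something this page specifies; it is shown only to help pick parameter values.

```python
import math

# Hypothetical helper illustrating how eps and confidence are commonly mapped
# to count-min sketch dimensions (assumed to match Spark's CountMinSketch):
#   width = ceil(2 / eps)                      -- smaller eps  => wider sketch
#   depth = ceil(-ln(1 - confidence) / ln(2))  -- higher confidence => more rows
def cms_dimensions(eps: float, confidence: float) -> tuple[int, int]:
    width = math.ceil(2.0 / eps)
    depth = math.ceil(-math.log(1.0 - confidence) / math.log(2.0))
    return depth, width

# For eps=3.0, confidence=0.1 this gives a 1x1 sketch, consistent with the
# depth/width fields visible in the serialized bytes of Example 1 below;
# eps=1.5, confidence=0.6 gives 2x2, consistent with Example 4.
print(cms_dimensions(3.0, 0.1))   # (1, 1)
print(cms_dimensions(1.5, 0.6))   # (2, 2)
```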
 
- Examples
- Example 1: Using columns as arguments
>>> from pyspark.sql import functions as sf
>>> spark.range(100).select(
...     sf.hex(sf.count_min_sketch(sf.col("id"), sf.lit(3.0), sf.lit(0.1), sf.lit(1)))
... ).show(truncate=False)
+------------------------------------------------------------------------+
|hex(count_min_sketch(id, 3.0, 0.1, 1))                                  |
+------------------------------------------------------------------------+
|0000000100000000000000640000000100000001000000005D8D6AB90000000000000064|
+------------------------------------------------------------------------+
- Example 2: Using numbers as arguments
>>> from pyspark.sql import functions as sf
>>> spark.range(100).select(
...     sf.hex(sf.count_min_sketch("id", 1.0, 0.3, 2))
... ).show(truncate=False)
+----------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.0, 0.3, 2))                                                  |
+----------------------------------------------------------------------------------------+
|0000000100000000000000640000000100000002000000005D96391C00000000000000320000000000000032|
+----------------------------------------------------------------------------------------+
- Example 3: Using a long seed
>>> from pyspark.sql import functions as sf
>>> spark.range(100).select(
...     sf.hex(sf.count_min_sketch("id", sf.lit(1.5), 0.2, 1111111111111111111))
... ).show(truncate=False)
+----------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.5, 0.2, 1111111111111111111))                                |
+----------------------------------------------------------------------------------------+
|00000001000000000000006400000001000000020000000044078BA100000000000000320000000000000032|
+----------------------------------------------------------------------------------------+
- Example 4: Using a random seed
>>> from pyspark.sql import functions as sf
>>> spark.range(100).select(
...     sf.hex(sf.count_min_sketch("id", sf.lit(1.5), 0.6))
... ).show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------+
|hex(count_min_sketch(id, 1.5, 0.6, 2120704260))                                                                                         |
+----------------------------------------------------------------------------------------------------------------------------------------+
|0000000100000000000000640000000200000002000000005ADECCEE00000000153EBE090000000000000033000000000000003100000000000000320000000000000032|
+----------------------------------------------------------------------------------------------------------------------------------------+
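- Usage note: the returned bytes can be deserialized to a CountMinSketch on the driver. The snippet below is a minimal sketch of doing that from Python, assuming access to the JVM class org.apache.spark.util.sketch.CountMinSketch through the py4j gateway (spark._jvm), which is not a stable public PySpark API; the sketch class is not exposed as a native Python class on this page.

```python
from pyspark.sql import SparkSession, functions as sf

spark = SparkSession.builder.getOrCreate()

# Build the sketch with the aggregate function and pull the bytes to the driver.
row = spark.range(100).select(
    sf.count_min_sketch("id", sf.lit(0.01), sf.lit(0.9), sf.lit(42)).alias("sketch")
).head()
sketch_bytes = bytes(row["sketch"])  # BinaryType collects as a bytearray

# Assumption: deserialize with the JVM class backing this function; py4j passes
# Python bytes as a Java byte[] to CountMinSketch.readFrom(byte[]).
jvm_cms = spark._jvm.org.apache.spark.util.sketch.CountMinSketch.readFrom(sketch_bytes)

print(jvm_cms.totalCount())      # 100 -- rows added to the sketch
print(jvm_cms.estimateCount(7))  # frequency estimate for id=7: at least 1;
                                 # count-min may overestimate but never undercounts
```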