
What happens when we repartition a PySpark DataFrame based on a column? For example:

dataframe.repartition('id')

Does this move the rows with the same 'id' to the same partition? And how does the spark.sql.shuffle.partitions value affect the repartition?

Nikhil Baby

1 Answer


The default value of spark.sql.shuffle.partitions is 200; it configures the number of partitions used when shuffling data for joins or aggregations.
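For reference, the setting can be changed per session before repartitioning. This is a sketch that assumes an existing SparkSession named spark and a DataFrame df; it is not runnable without a Spark installation:

```python
# Lower the shuffle partition count from the default 200 to 100
# for this session (assumes a live SparkSession named `spark`).
spark.conf.set("spark.sql.shuffle.partitions", "100")

# Subsequent column-based repartitions now produce 100 partitions.
df2 = df.repartition("id")
print(df2.rdd.getNumPartitions())
```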

dataframe.repartition('id') creates 200 partitions (the spark.sql.shuffle.partitions default), assigning each row to a partition by hashing the id column (a hash partitioner). DataFrame rows with the same id therefore always go to the same partition. If there is data skew on some id values, you'll end up with unevenly sized partitions.
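The mechanism can be illustrated with a minimal pure-Python sketch. Note that Spark actually uses Murmur3 hashing, so the concrete partition assignments differ; the point is only the invariant that equal keys always map to the same partition:

```python
NUM_PARTITIONS = 200  # mirrors the spark.sql.shuffle.partitions default

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Toy stand-in for Spark's hash partitioner: hash the key and take
    # it modulo the partition count. Same key -> same partition, always.
    return hash(key) % num_partitions

# 20 rows spread over 5 distinct ids (0..4).
rows = [{"id": i % 5, "value": i} for i in range(20)]

partitions = {}
for row in rows:
    partitions.setdefault(partition_for(row["id"]), []).append(row)

# Each partition now holds rows for exactly one id; a skewed id
# (many rows sharing one key) would make its partition oversized.
for pid, part in sorted(partitions.items()):
    print(pid, [r["value"] for r in part])
```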

tuomastik
Kiran