
What happens when we repartition a PySpark DataFrame based on a column? For example:

dataframe.repartition('id')

Does this move the rows with the same 'id' to the same partition? And how does the spark.sql.shuffle.partitions value affect the repartition?

Nikhil Baby

1 Answer


The default value of spark.sql.shuffle.partitions is 200; it configures the number of partitions used when shuffling data for joins or aggregations.
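For reference, the setting can be changed per session before repartitioning. This is a sketch that assumes an existing SparkSession named spark and a DataFrame df; it is not runnable without a Spark installation:

```python
# Lower the shuffle partition count from the default 200 to 100
# for this session (assumes a live SparkSession named `spark`).
spark.conf.set("spark.sql.shuffle.partitions", "100")

# Subsequent column-based repartitions now produce 100 partitions.
df2 = df.repartition("id")
print(df2.rdd.getNumPartitions())
```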

dataframe.repartition('id') creates 200 partitions (the spark.sql.shuffle.partitions default), assigning each row to a partition by hashing the id column (a hash partitioner). DataFrame rows with the same id therefore always go to the same partition. If there is data skew on some id values, you'll end up with unevenly sized partitions.
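The mechanism can be illustrated with a minimal pure-Python sketch. Note that Spark actually uses Murmur3 hashing, so the concrete partition assignments differ; the point is only the invariant that equal keys always map to the same partition:

```python
NUM_PARTITIONS = 200  # mirrors the spark.sql.shuffle.partitions default

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Toy stand-in for Spark's hash partitioner: hash the key and take
    # it modulo the partition count. Same key -> same partition, always.
    return hash(key) % num_partitions

# 20 rows spread over 5 distinct ids (0..4).
rows = [{"id": i % 5, "value": i} for i in range(20)]

partitions = {}
for row in rows:
    partitions.setdefault(partition_for(row["id"]), []).append(row)

# Each partition now holds rows for exactly one id; a skewed id
# (many rows sharing one key) would make its partition oversized.
for pid, part in sorted(partitions.items()):
    print(pid, [r["value"] for r in part])
```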

tuomastik
Kiran