Commit: Create rule S7194: PySpark broadcasting should be used when joining a small DataFrame to a larger DataFrame.
Showing 2 changed files with 50 additions and 24 deletions.
This rule raises an issue when a small DataFrame is joined to another DataFrame without the use of the broadcast operation.

== Why is this an issue?

In PySpark, shuffling refers to the process of transferring data between worker nodes within a cluster.
This operation, while necessary for tasks such as joins and aggregations on DataFrames, can be resource-intensive.
Although Spark handles shuffling automatically, there are strategies to minimize it and thereby improve the performance of these operations.
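
The shuffle itself can be pictured with a small plain-Python sketch. This is a mental model only, not Spark API; the cluster size, partitions, and rows below are invented for illustration. Each row is re-routed to the worker that owns its join key, and in a real cluster every such hop is network traffic:

```python
# Conceptual sketch of a shuffle: rows are re-sent so that all rows sharing
# a key end up on the same worker. Plain Python, not Spark API.

workers = 2  # hypothetical cluster size

# Rows as (key, value), initially spread arbitrarily across two workers.
before = [
    [(1, "Alice"), (2, "Bob")],      # partition on worker 0
    [(2, "Charlie"), (1, "Dan")],    # partition on worker 1
]

# Shuffle: route every row to the worker that owns its key (here key % workers).
after = [[] for _ in range(workers)]
for partition in before:
    for key, value in partition:
        # In a real cluster this append is a row crossing the network.
        after[key % workers].append((key, value))

print(after)  # all key-2 rows on worker 0, all key-1 rows on worker 1
```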

When performing join operations with multiple DataFrames in PySpark, it is crucial to consider the size of the DataFrames involved.
If a small DataFrame is joined to a larger one, using the `broadcast` function to distribute the small DataFrame to all worker nodes can be beneficial.
This approach significantly reduces the volume of data shuffled between nodes, thereby improving the efficiency of the join operation.
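
The effect of broadcasting can likewise be sketched in plain Python. This is a conceptual model, not PySpark API; the partitions, departments, and layout are invented. Once every worker holds a full copy of the small table, each partition of the large table joins locally, so no large-table row ever crosses the network; the only shipping cost is one copy of the small table per worker:

```python
# Conceptual sketch of a broadcast (map-side) join -- plain Python, not PySpark API.

# Hypothetical partitions of a large table, spread across workers.
large_partitions = [
    [(1, "Alice"), (2, "Bob")],      # partition on worker 0
    [(2, "Charlie"), (1, "Dan")],    # partition on worker 1
]

# Small lookup table, "broadcast" as a dict that every worker receives in full.
small_broadcast = {1: "HR", 2: "Finance"}

# Each worker joins its own partition against its local copy: no shuffle needed,
# because no row of the large table has to move to meet its matching small row.
joined = [
    (dept_id, name, small_broadcast.get(dept_id))
    for partition in large_partitions
    for dept_id, name in partition
]

print(joined)
```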
== How to fix it

To fix this issue, use the `broadcast` function on the small DataFrame before performing the join operation.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("myspark").getOrCreate()

data = [
    (1, "Alice"),
    (2, "Bob"),
    (2, "Charlie"),
    (1, "Dan"),
    (2, "Elsa"),
]
large_df = spark.createDataFrame(data, ["department_id", "name"])
small_df = spark.createDataFrame([(1, "HR"), (2, "Finance")], ["department_id", "department"])

joined_df = large_df.join(small_df, on="department_id", how="left")  # Noncompliant: the small DataFrame is not broadcast
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("myspark").getOrCreate()

data = [
    (1, "Alice"),
    (2, "Bob"),
    (2, "Charlie"),
    (1, "Dan"),
    (2, "Elsa"),
]
large_df = spark.createDataFrame(data, ["department_id", "name"])
small_df = spark.createDataFrame([(1, "HR"), (2, "Finance")], ["department_id", "department"])

joined_df = large_df.join(broadcast(small_df), on="department_id", how="left")  # Compliant
----

== Resources

=== Documentation

* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.broadcast.html[pyspark.sql.functions.broadcast]

=== Articles & blog posts

* Medium Article - https://aspinfo.medium.com/what-is-broadcast-join-how-to-perform-broadcast-in-pyspark-699aef2eff5a[What is broadcast join, how to perform broadcast in pyspark?]