Apache Spark Performance Tuning – Straggler Tasks

Overview

This is the last article of a four-part series about Apache Spark on YARN. Apache Spark carefully distinguishes two types of transformation operations: “narrow” and “wide”. This distinction matters because it strongly affects how transformations are evaluated and how their performance can be improved. Spark relies heavily on the key/value pair paradigm to define and parallelize operations, especially wide transformations, which require data to be redistributed between machines.

A few performance bottlenecks were identified in the SFO Fire Department call service dataset use case with the YARN cluster manager. To understand the use case and the bottlenecks identified, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks.

The resource planning bottleneck was addressed, and the notable performance improvements achieved in the use case Spark application were discussed, in our previous blog on Apache Spark on YARN – Resource Planning.

To learn about partition tuning in the use case Spark application, refer to our previous blog on Apache Spark Performance Tuning – Degree of Parallelism.

In this blog, let us discuss the shuffle and straggler task problems in order to improve the performance of the use case application.

The other articles of the four-part series are:

Part 1 – Apache Spark on YARN – Performance and Bottlenecks
Part 2 – Apache Spark on YARN – Resource Planning
Part 3 – Apache Spark Performance Tuning – Degree of Parallelism

Spark Shuffle Principles

Two primary techniques, “shuffle less” and “shuffle better”, help avoid the performance problems associated with shuffles:

Shuffle Less Often – To minimize the number of shuffles in a computation that requires several transformations, preserve partitioning across narrow transformations to avoid reshuffling data.
Shuffle Better – Sometimes a computation cannot be completed without a shuffle. Not all wide transformations and shuffles are equally expensive or prone to failure.
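
As a rough illustration of the “shuffle less” idea, the sketch below shows how a pair RDD keeps its partitioner across mapValues but loses it across map, even when the keys are untouched. The object name, the sample records, the partition count of 4, and the local[2] master are assumptions made for experimentation only; they are not taken from the use case application, which runs on YARN.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object PreservePartitioningSketch {
  // Returns (partitioner kept after mapValues, partitioner kept after map).
  def run(): (Boolean, Boolean) = {
    // Local session for experimentation only (assumption, not the article's setup).
    val spark = SparkSession.builder()
      .appName("PreservePartitioningSketch")
      .master("local[2]")
      .getOrCreate()
    try {
      val sc = spark.sparkContext

      // Hypothetical (callType, count) pairs standing in for the real dataset.
      val calls = sc.parallelize(Seq(
        ("Medical Incident", 1), ("Structure Fire", 1), ("Medical Incident", 1)))

      // One explicit shuffle: hash-partition the pairs by key.
      val partitioned = calls.partitionBy(new HashPartitioner(4))

      // mapValues cannot change the keys, so the partitioner is kept.
      val kept = partitioned.mapValues(_ * 10)

      // map could change the keys, so the partitioner is dropped.
      val dropped = partitioned.map { case (k, v) => (k, v * 10) }

      (kept.partitioner.isDefined, dropped.partitioner.isDefined)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (afterMapValues, afterMap) = run()
    println(s"partitioner kept after mapValues: $afterMapValues")
    println(s"partitioner kept after map: $afterMap")
  }
}
```

A key-based operation such as reduceByKey applied to the RDD that kept its partitioner can reuse the existing partitioning, whereas the same operation on the RDD that lost it has to shuffle the data again.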

Operations on the key/value pairs can cause:

Out-of-memory errors in the driver
Out-of-memory errors on the executor nodes
Shuffle failures
Straggler tasks or partitions that are especially slow to compute

The memory errors in the driver are mainly caused by actions. The last three performance issues (out-of-memory errors on the executors, shuffle failures, and straggler tasks) are almost always caused by shuffles associated with wide transformations.

Understanding Use Case Application Shuffle

Tuning the number of partitions based on the input dataset size is explained in our previous blog on Apache Spark Performance Tuning – Degree of Parallelism. The DataFrame API implementation of the application was submitted with the following configuration, and the resulting stages are shown in the screenshot below:

./bin/spark-submit --name FireServiceCallAnalysisDataFramePartitionTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.sql.shuffle.partitions=23 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.FireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysis.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

Partition23_23_ShuflleUnderstanding

The Shuffle Read and Shuffle Write columns show that the shuffled data is only in the range of bytes to kilobytes (KB) across all the stages, so the use case application already follows the “shuffle less” principle.

However, an input of ~849 MB is carried into all the shuffle stages.

The “Executors” tab in the Spark UI provides a summary of the input, shuffle read, and shuffle write, as shown in the diagram below:

ExecutorSummary23_23Partition

The overall input size is 5.9 GB, which includes the original input of 1.5 GB and the ~849 MB input read by the shuffle stages.

Detecting Straggler Tasks in Use Case

“Stragglers” are tasks within a stage that take much longer to execute than other tasks.
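
The definition above can be turned into a simple check: compare each task's duration with the median duration of its stage. The following is a minimal sketch in plain Scala, not code from the use case application; the object name, the 1.5 threshold factor, and the sample durations are all assumptions chosen for illustration.

```scala
object StragglerSketch {
  // Median of a non-empty sequence of task durations (in seconds).
  def median(xs: Seq[Double]): Double = {
    val sorted = xs.sorted
    val n = sorted.length
    if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }

  // Returns (taskIndex, duration) for every task that is slower than
  // `factor` times the median duration of the stage.
  // The default factor of 1.5 is an arbitrary illustrative threshold.
  def stragglers(durations: Seq[Double], factor: Double = 1.5): Seq[(Int, Double)] =
    if (durations.isEmpty) Seq.empty
    else {
      val m = median(durations)
      durations.zipWithIndex.collect {
        case (d, i) if d > factor * m => (i, d)
      }
    }

  def main(args: Array[String]): Unit = {
    // Made-up task durations for one stage; task 4 is the slow one.
    val stageDurations = Seq(2.0, 2.1, 1.9, 2.2, 9.5, 2.0)
    println(stragglers(stageDurations)) // prints List((4,9.5))
  }
}
```

In practice the per-task durations can be read from the task table of a stage in the Spark UI; the sketch only shows the comparison itself.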

The total time taken by the DataFrame API implementation is 1.3 minutes.

Looking at the stage-wise durations, Stage 0 and Stage 2 took 10 s and 46 s, respectively. Together they account for 56 seconds (~1 minute) of the total.

StragglerDeduction23_23Partition

Internally, Spark does the following:

Spark optimizers such as Catalyst and Tungsten optimize the code at run time.
The encoders of the high-level DataFrame and Dataset APIs reduce the input size by encoding the data.

Performance can be improved further by reducing the input size, that is, by filtering the input dataset down to the needed data in both the low-level and high-level API implementations.

Low-Level and High-Level API Implementation

Our input dataset has 34 columns, but only 3 of them are used in the computation to answer the use case scenario questions.

The updated RDD and DataFrame API implementation code below improves performance by selecting only the data needed for this use case scenario:

val filteredFireServiceCallRDD = filteredFireServiceCallWithoutHeaderRDD.map(x => Array(x(3), x(4), x(31)))

The above line is added at the beginning of the RDD API implementation. It keeps 3 columns and drops the other 31 columns from the RDD, which reduces the input size in all the shuffle stages.

The code below does the same in the DataFrame API implementation:

// SELECT ONLY THE COLUMNS NEEDED FOR THE USE CASE SCENARIOS
val fireServiceCallDF = fireServiceCallYearAddedDF.select("CallType", "NeighborhooodsDistrict", "CallDateTS", "CallYear")

The full code of both the RDD and DataFrame API implementations is given below:

// RDD API IMPLEMENTATION
// FILTER THE HEADER ROW AND SPLIT THE COLUMNS IN THE DATA FILE (IGNORE COMMAS WITHIN DOUBLE QUOTES)
val filteredFireServiceCallWithoutHeaderRDD = fireServiceCallRawRDD.filter(row => row != header).map(x => x.split(",(?=([^\"]*\"[^\"]*\")*[^\"]*$)"))

// KEEP ONLY THE 3 COLUMNS NEEDED FOR THE USE CASE SCENARIOS
val filteredFireServiceCallRDD = filteredFireServiceCallWithoutHeaderRDD.map(x => Array(x(3), x(4), x(31)))

// CACHE/PERSIST THE RDD
filteredFireServiceCallRDD.setName("FireServiceCallsRDD").persist().take(10)

// NUMBER OF RECORDS IN THE FILE
val totalRecords = filteredFireServiceCallRDD.count()
println(s"Number of records in the data file: $totalRecords")

// Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?
println(s"Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?")
val distinctTypesOfCallsRDD = filteredFireServiceCallRDD.map(x => x(0))
distinctTypesOfCallsRDD.distinct().collect().foreach(println)

// Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?
println(s"Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?")
val distinctTypesOfCallsSortedRDD = distinctTypesOfCallsRDD.map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
distinctTypesOfCallsSortedRDD.collect().foreach(println)

// Q3: HOW MANY YEARS OF FIRE SERVICE CALLS ARE IN THE DATA FILE AND HOW MANY INCIDENTS PER YEAR?
println(s"Q3: HOW MANY YEARS OF FIRE SERVICE CALLS ARE IN THE DATA FILE AND HOW MANY INCIDENTS PER YEAR?")
val fireServiceCallYearsRDD = filteredFireServiceCallRDD.map(convertToYear).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
fireServiceCallYearsRDD.take(20).foreach(println)

// Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?
println(s"Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?")
val last7DaysServiceCallRDD = filteredFireServiceCallRDD.map(convertToDate).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).sortByKey(false)
last7DaysServiceCallRDD.take(7).foreach(println)

// Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR? 
println(s"Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?")
val neighborhoodDistrictCallsRDD = filteredFireServiceCallRDD.filter(row => (convertToYear(row) == "2016")).map(x => x(2)).map(x => (x, 1)).reduceByKey((x, y) => (x + y)).map(x => (x._2, x._1)).sortByKey(false)
neighborhoodDistrictCallsRDD.collect().foreach(println)

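// DATAFRAME API IMPLEMENTATION
// desc() USED IN orderBy BELOW COMES FROM THIS IMPORT
import org.apache.spark.sql.functions.desc

// SELECT ONLY THE COLUMNS NEEDED FOR THE USE CASE SCENARIOS
val fireServiceCallDF = fireServiceCallYearAddedDF.select("CallType", "NeighborhooodsDistrict", "CallDateTS", "CallYear")

// CACHE THE DATAFRAME
fireServiceCallDF.cache().take(10)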

// PRINT SCHEMA 
fireServiceCallDF.printSchema()

// LOOK INTO TOP 20 ROWS IN THE DATA FILE
fireServiceCallDF.show()

// NUMBER OF RECORDS IN THE FILE
val totalRecords = fireServiceCallDF.count()
println(s"Number of records in the data file: $totalRecords")

// Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?
println(s"Q1: HOW MANY TYPES OF CALLS WERE MADE TO THE FIRE SERVICE DEPARTMENT?")
val distinctTypesOfCallsDF = fireServiceCallDF.select("CallType").distinct()
distinctTypesOfCallsDF.collect().foreach(println)

// Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?
println(s"Q2: HOW MANY INCIDENTS OF EACH CALL TYPE WERE THERE?")
val distinctTypesOfCallsSortedDF = fireServiceCallDF.select("CallType").groupBy("CallType").count().orderBy(desc("count"))
distinctTypesOfCallsSortedDF.collect().foreach(println)

// Q3: HOW MANY YEARS OF FIRE SERVICE CALLS ARE IN THE DATA FILE AND HOW MANY INCIDENTS PER YEAR?
println(s"Q3: HOW MANY YEARS OF FIRE SERVICE CALLS ARE IN THE DATA FILE AND HOW MANY INCIDENTS PER YEAR?")
val fireServiceCallYearsDF = fireServiceCallDF.select("CallYear").groupBy("CallYear").count().orderBy(desc("count"))
fireServiceCallYearsDF.show()

// Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?
println(s"Q4: HOW MANY SERVICE CALLS WERE LOGGED IN FOR THE PAST 7 DAYS?")
val last7DaysServiceCallDF = fireServiceCallDF.select("CallDateTS").groupBy("CallDateTS").count().orderBy(desc("CallDateTS"))
last7DaysServiceCallDF.show(7)

// Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?
println(s"Q5: WHICH NEIGHBORHOOD IN SF GENERATED THE MOST CALLS LAST YEAR?")
val neighborhoodDistrictCallsDF = fireServiceCallDF.filter("CallYear == 2016").select("NeighborhooodsDistrict").groupBy("NeighborhooodsDistrict").count().orderBy(desc("count"))
neighborhoodDistrictCallsDF.collect().foreach(println)

Submitting Spark Application in YARN

The spark-submit commands, with partition tuning, used to execute the RDD and DataFrame API implementations on YARN are as follows:

./bin/spark-submit --name FireServiceCallAnalysisRDDStragglerFixTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.SFOFireServiceCallAnalysis /data/SFFireServiceCall/SFFireServiceCallAnalysisPF.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

./bin/spark-submit --name FireServiceCallAnalysisDataFrameStragglerFixTest --master yarn --deploy-mode cluster --executor-memory 2g --executor-cores 2 --num-executors 2 --conf spark.sql.shuffle.partitions=23 --conf spark.default.parallelism=23 --class com.treselle.fscalls.analysis.SFOFireServiceCallAnalysisDF /data/SFFireServiceCall/SFFireServiceCallAnalysisPF.jar /user/tsldp/FireServiceCallDataSet/Fire_Department_Calls_for_Service.csv

The input, shuffle read, and shuffle write of the DataFrame API implementation are monitored in the Stages view. The diagram below shows that the input size of the shuffle stages is now ~17 MB, compared with ~849 MB previously. The shuffle read and write sizes have not changed much.

DataFrameStraggerFixStages

The “Executors” tab in the Spark UI provides a summary of the input, shuffle read, and shuffle write, as shown in the diagram below:

DataFrameStragglerFixExecutorsStats

The summary shows that the input size is now 1.5 GB, compared with 5.9 GB previously.

The durations of the RDD and DataFrame API implementations after reducing the input size are shown in the diagram below:

Straggler Fix Output

Understanding Use Case Performance

The durations of the different API implementations of the use case Spark application running on YARN, without any performance tuning, are shown in the diagram below:

SparkApplicationWithDefaultConfigurationPerformance

For more details, refer to our previous blog on Apache Spark on YARN – Performance and Bottlenecks.

We tuned the number of executors, cores, and memory for the RDD and DataFrame implementations of the use case Spark application. The diagram below shows the performance improvements after tuning the resources:

SparkApplicationAfterResourceTuningPerformance

For more details, refer to our previous blog on Apache Spark on YARN – Resource Planning.

We tuned the default parallelism and shuffle partitions of both the RDD and DataFrame implementations in our previous blog on Apache Spark Performance Tuning – Degree of Parallelism. This did not improve performance, but it reduced the scheduler overhead.

Finally, after identifying the straggler tasks and reducing the input size, we achieved a 2x performance improvement in the DataFrame implementation and a 4x improvement in the RDD implementation.

StragglerPerformanceBenchmark

Conclusion

In this blog, we discussed the shuffle principles, examined the shuffle behavior of the use case application, detected straggler tasks in the application, and reduced the input size to improve the performance of the different API implementations of the Spark application.

We achieved a 2x performance improvement in the DataFrame implementation and a 4x improvement in the RDD implementation compared with the results obtained after resource and partition tuning.

References

Apache Spark on YARN – Performance and Bottlenecks: http://www.treselle.com/blog/apache-spark-on-yarn-performance-and-bottlenecks
Apache Spark on YARN – Resource Planning: http://www.treselle.com/blog/apache-spark-on-yarn-resource-planning
Apache Spark Performance Tuning – Degree of Parallelism: http://www.treselle.com/blog/apache-spark-performance-tuning-degree-of-parallelism
The code examples are available on GitHub: https://github.com/treselle-systems/sfo_fire_service_call_analysis_using_spark
To understand Apache Spark jobs, stages, DAG, and executors from the Spark History Server UI, refer to our blog on Text Normalization with Spark – Part 2.
Introducing Apache Spark 2.0: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html
Apache Spark: RDD, DataFrame or Dataset?: http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html
High Performance Spark book (Chapter 5 and Chapter 6): https://www.safaribooksonline.com/library/view/high-performance-spark/9781491943199/ch01.html

