The total wall clock time of a Spark application can be divided into time spent in the driver and time spent in the executors. When a Spark application spends too much time in the driver, it wastes the executors' compute time. Executors can also waste compute time because of a lack of tasks or because of skew. Finally, critical path time is the minimum time that this application will take even if we give it infinite executors. Ideal application time is computed by assuming ideal partitioning (tasks == cores and no skew) of data in all stages of the application.
Driver vs Executor wallclock time
Critical and ideal application time
Core compute hours wastage by driver and executor
OCCH : One Core Compute Hours
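The quantities above can be illustrated with a minimal Python sketch. It is not Sparklens' actual implementation. It assumes that executors can only be busy while at least one job is running, that stages run one after another, and that all durations are in seconds. The function names and input shapes are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so shared time is counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def driver_executor_split(app_start, app_end, job_intervals):
    """Return (driver_time, executor_time).

    Assumption: time covered by any job counts as executor time,
    and the rest of the wall clock counts as driver time.
    """
    executor_time = sum(end - start for start, end in merge_intervals(job_intervals))
    return (app_end - app_start) - executor_time, executor_time


def critical_and_ideal_time(driver_time, stage_task_durations, total_cores):
    """Return (critical_path_time, ideal_time), assuming sequential stages.

    Critical path: with infinite executors, each stage still takes as long
    as its longest task. Ideal: the work of each stage is spread evenly
    over all cores (tasks == cores, no skew).
    """
    critical = driver_time + sum(max(tasks) for tasks in stage_task_durations)
    ideal = driver_time + sum(sum(tasks) for tasks in stage_task_durations) / total_cores
    return critical, ideal


def occh(total_cores, wallclock_seconds, stage_task_durations):
    """Return (available, used, wasted) in One Core Compute Hours."""
    available = total_cores * wallclock_seconds / 3600.0
    used = sum(sum(tasks) for tasks in stage_task_durations) / 3600.0
    return available, used, available - used
```

For example, an application that runs for 100 seconds with jobs covering seconds 10 to 60 spends 50 seconds in the driver and 50 seconds in the executors under these assumptions.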
Using the fine-grained task-level data and the relationships between stages, Sparklens can simulate how the application will behave when the number of executors is changed. Specifically, Sparklens will predict wall clock time and cluster utilization. Note that cluster utilization is not cluster CPU utilization. It only means that some task was scheduled on a core. The CPU utilization further depends on whether the task is CPU bound or IO bound.
Predicted wall clock time and cluster utilization with different executor counts
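The idea of replaying task-level data with a different executor count can be sketched as follows. This is a deliberately simplified model, not the Sparklens scheduler simulation: it assumes stages run strictly one after another, tasks are handed to whichever core frees up first, and driver time does not change with the executor count. All names are hypothetical.

```python
import heapq


def simulate_stage(task_durations, cores):
    """Return the stage duration when tasks are greedily placed on `cores` cores."""
    free_at = [0.0] * cores  # time at which each core becomes free
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)  # earliest available core
        heapq.heappush(free_at, start + duration)
    return max(free_at)


def predict(stages, driver_time, executor_count, cores_per_executor):
    """Return (wall_clock_time, cluster_utilization) for a given executor count.

    Utilization here is the fraction of available core time on which
    some task was scheduled. It is not CPU utilization.
    """
    cores = executor_count * cores_per_executor
    executor_time = sum(simulate_stage(tasks, cores) for tasks in stages)
    wall_clock = driver_time + executor_time
    task_time = sum(sum(tasks) for tasks in stages)
    return wall_clock, task_time / (cores * wall_clock)


if __name__ == "__main__":
    stages = [[10, 10, 10, 10], [20]]
    for executors in (1, 2, 4):
        wall_clock, utilization = predict(stages, 10, executors, 2)
        print(executors, wall_clock, round(utilization, 3))
```

In this toy input the second stage has a single task, so adding executors shortens only the first stage while utilization drops. That is the trade-off the chart is meant to expose.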
Not all stages are equally important. Start by looking at the stages which occupy most of the wall clock time. Specifically, look for stages with a low PRatio or a high TaskSkew and fix them accordingly.
Per stage metrics
PRatio : Number of tasks in the stage divided by the number of cores. Represents the degree of parallelism in the stage
TaskSkew : Duration of the largest task in the stage divided by the duration of the median task. Represents the degree of skew in the stage
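The two definitions above translate directly into code. The sketch below assumes task durations are positive numbers in a common unit. The function names are illustrative and are not taken from Sparklens.

```python
from statistics import median


def p_ratio(num_tasks, total_cores):
    """Number of tasks in the stage divided by the number of cores."""
    return num_tasks / total_cores


def task_skew(task_durations):
    """Duration of the largest task divided by the duration of the median task."""
    return max(task_durations) / median(task_durations)


if __name__ == "__main__":
    durations = [1, 2, 3, 4, 10]
    print(p_ratio(len(durations), 10))  # fewer tasks than cores, so below 1
    print(task_skew(durations))         # largest task is several times the median
```

A PRatio below 1 means some cores have no task to run in that stage, while a TaskSkew well above 1 means one task keeps the stage alive long after the others have finished.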
If autoscaling or dynamic allocation is enabled, we can see how many executors were available at any given time. This chart also plots the executors used by different Spark jobs within the application and the minimal number of executors (ideal) which could have finished the same work in the same amount of wall clock time.
Executors available and executors required over time
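One plausible way to estimate the ideal executor count for a single job is to divide the total task time by the core time one executor offers over the job's wall clock duration. This is an assumption made for illustration and may differ from how Sparklens derives the plotted value. The function name and parameters are hypothetical.

```python
import math


def ideal_executors(task_durations, job_wallclock, cores_per_executor):
    """Smallest executor count whose core time covers all task time.

    Assumes perfect packing of tasks onto cores within the job's
    wall clock duration, and never returns fewer than one executor.
    """
    total_task_time = sum(task_durations)
    core_time_per_executor = job_wallclock * cores_per_executor
    return max(1, math.ceil(total_task_time / core_time_per_executor))
```

For example, eight tasks of 100 seconds each in a job that lasted 100 seconds need two executors of four cores each under this estimate.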
The SparkListener framework reports this data for each task in the application. Here we show summary information such as min, max, sum and mean for each of these values over all the tasks across all the stages and jobs in the application.
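The aggregation described above can be sketched in a few lines. The sketch assumes each task is represented as a dictionary of numeric metrics. The metric names in the example are only illustrative and the function is hypothetical.

```python
def summarize(tasks, metric_names):
    """Return {metric: {"min", "max", "sum", "mean"}} over all tasks."""
    summary = {}
    for name in metric_names:
        values = [task[name] for task in tasks]
        total = sum(values)
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "sum": total,
            "mean": total / len(values),
        }
    return summary


if __name__ == "__main__":
    # Illustrative per-task records, not real listener output.
    tasks = [
        {"executorRunTime": 10, "shuffleReadBytes": 0},
        {"executorRunTime": 30, "shuffleReadBytes": 400},
    ]
    print(summarize(tasks, ["executorRunTime", "shuffleReadBytes"]))
```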