PySpark provides low-level status reporting APIs for monitoring job and stage progress. These APIs intentionally provide very weak consistency semantics, so consumers of these APIs should be prepared to handle stale or missing information.
A stage's id may be known to the status tracker even though no information about that stage is available any longer. In that scenario, getStageInfo returns None even for a valid stage id.
The configuration properties spark.ui.retainedJobs and spark.ui.retainedStages control how many jobs and stages the status tracker retains information about; older entries are garbage collected once these limits are exceeded.
The getActiveJobsIds() method returns a list containing the ids of all currently active jobs. The syntax is given below:
The getActiveStageIds() method returns a list containing the ids of all currently active stages. The syntax is given below:
The getJobIdsForGroup(jobGroup=None) method returns a list of all known job ids in a particular job group. If jobGroup is None, it returns all known jobs that are not associated with any job group.
The returned jobs may be running, failed, or completed, and the order of the returned list is not guaranteed. Consider the following code.
Because the status tracker keeps only a limited history, information about an older job may have been garbage collected. For a specific job, use getJobInfo(jobId), which returns a SparkJobInfo object, or None if the job information could not be found or has been garbage collected.
Similarly, stage information may be lost to garbage collection; in that case use getStageInfo(stageId), which returns a SparkStageInfo object, or None if the stage information could not be found or has been garbage collected.