PySpark provides low-level status reporting APIs for monitoring job and stage progress. These APIs intentionally provide very weak consistency semantics, so consumers of these APIs should be prepared to handle stale or missing information.
A stage's id may be known to the status tracker even though no information about that stage is available any longer. In that scenario, getStageInfo returns None even for a valid stage id.
The configuration properties spark.ui.retainedJobs and spark.ui.retainedStages control how many jobs and stages the status tracker retains information about; older entries are garbage collected once these limits are exceeded.
The getActiveJobsIds() method returns a list containing the ids of all currently active jobs. The syntax is given below:
The getActiveStageIds() method returns a list containing the ids of all currently active stages. The syntax is given below:
The getJobIdsForGroup(jobGroup=None) method returns a list of all known job ids in a particular job group. If jobGroup is None, it returns all known jobs that are not associated with any job group.
The returned jobs may be running, failed, or completed, and the order of the returned list is not guaranteed. Consider the following code.
Because the status tracker keeps only a limited history, information about an older job may have been garbage collected. For a specific job, use getJobInfo(jobId), which returns a SparkJobInfo object, or None if the job information could not be found or has been garbage collected.
Similarly, stage information may be lost to garbage collection; in that case use getStageInfo(stageId), which returns a SparkStageInfo object, or None if the stage information could not be found or has been garbage collected.