PySpark Profiler

PySpark supports custom profilers, so that a different profiler can be used and the profile can be written in formats other than the one BasicProfiler provides. A profiler collects performance statistics, such as call counts and time spent per function, for the Python code that Spark runs for each RDD. These profiles are a useful review tool for finding where a job spends its time.

A custom profiler has to define or inherit the following methods:

Add

The add method adds a profile to the existing accumulated profile. The profiler class is chosen when the SparkContext is created, through the profiler_cls argument.

from pyspark import SparkConf, SparkContext, BasicProfiler


class MyCustomProfiler(BasicProfiler):
    # Override show() to control how each RDD's profile is reported
    def show(self, id):
        print("My custom profiles for RDD:%s" % id)


# Profiling is off by default; enable it and pass the profiler class
conf = SparkConf().set("spark.python.profile", "true")
sc = SparkContext('local', 'test', conf=conf, profiler_cls=MyCustomProfiler)
sc.parallelize(range(1000)).map(lambda x: 2 * x).take(10)
sc.parallelize(range(1000)).count()
sc.show_profiles()
sc.stop()

Output:

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
1000
My custom profiles for RDD:1
My custom profiles for RDD:3

Profile

The profile method produces a profile of the function it is given.

Stats

This method returns the collected stats.

Dump

It dumps the profiles to a path.

dump(id, path)

This method dumps the profile to a file under path; id is the id of the RDD being profiled.

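import os

def dump(self, id, path):
    # Create the target directory if it does not exist yet
    if not os.path.exists(path):
        os.makedirs(path)
    stats = self.stats()
    if stats:
        # One file per RDD, named after the RDD id
        p = os.path.join(path, "rdd_%d.pstats" % id)
        stats.dump_stats(p)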
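
To make the dump step concrete without a Spark cluster, here is a minimal standard-library sketch of the same logic. The names work and dump_stats_for are hypothetical helpers, not part of PySpark; the sketch only assumes that the collected stats are a pstats.Stats object, as in the snippet above.

```python
import cProfile
import os
import pstats
import tempfile


def work():
    # Hypothetical workload, used only to generate some profile data
    return sum(i * i for i in range(10_000))


def dump_stats_for(stats, rdd_id, path):
    # Same steps as dump() above: ensure the directory exists,
    # then write the stats to rdd_<id>.pstats inside it
    os.makedirs(path, exist_ok=True)
    target = os.path.join(path, "rdd_%d.pstats" % rdd_id)
    stats.dump_stats(target)
    return target


profiler = cProfile.Profile()
profiler.runcall(work)
stats = pstats.Stats(profiler)

with tempfile.TemporaryDirectory() as tmp:
    dumped = dump_stats_for(stats, 1, os.path.join(tmp, "profiles"))
    dumped_name = os.path.basename(dumped)
    # A dump file can be loaded back by passing its path to pstats.Stats
    reloaded = pstats.Stats(dumped)
    reloaded_calls = reloaded.total_calls
```

A file written this way can later be inspected offline with pstats, independently of the job that produced it.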

profile(func)

It profiles the function func, which is passed as the argument.

def profile(self, func):
    # Subclasses must override this method
    raise NotImplementedError
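
A method that is meant to be overridden conventionally raises NotImplementedError, and a subclass supplies the real behaviour. The sketch below illustrates that pattern with two hypothetical classes, BaseProfiler and WallClockProfiler; neither is a PySpark class, and measuring wall-clock time is just one assumed way to implement profile.

```python
import time


class BaseProfiler:
    """Hypothetical stand-in for a profiler base class (not a PySpark class)."""

    def profile(self, func):
        # Subclasses are expected to override this method
        raise NotImplementedError


class WallClockProfiler(BaseProfiler):
    """Hypothetical profiler that records only elapsed wall-clock time."""

    def __init__(self):
        self.elapsed = []

    def profile(self, func):
        # Time a single call of func and keep the measurement
        start = time.perf_counter()
        func()
        self.elapsed.append(time.perf_counter() - start)


# Calling the base implementation directly fails loudly
try:
    BaseProfiler().profile(lambda: None)
    base_raised = False
except NotImplementedError:
    base_raised = True

timer = WallClockProfiler()
timer.profile(lambda: sum(range(1000)))
```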

show(id)

This function is used to print the profile stats to stdout. Here id is the RDD id.

def show(self, id):
    stats = self.stats()
    if stats:
        print("=" * 60)
        print("Profile of RDD<id=%d>" % id)
        print("=" * 60)
        stats.sort_stats("time", "cumulative").print_stats()
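
The sorting and printing calls used by show can be tried on their own with the standard library. This is a minimal sketch, assuming the stats are a pstats.Stats object; slow_square_sum is a hypothetical workload, and the report is captured in a StringIO buffer instead of stdout so that it can be inspected.

```python
import cProfile
import io
import pstats


def slow_square_sum(n):
    # Hypothetical workload, used only to generate some profile data
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.runcall(slow_square_sum, 5_000)

# Send the report to an in-memory buffer rather than stdout
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)

# Sort by internal time first, then cumulative time, as show() does
stats.strip_dirs().sort_stats("time", "cumulative").print_stats()
report = buffer.getvalue()
print(report)
```

Sorting by "time" puts the functions with the largest internal time at the top, which is usually where to look first.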

stats()

The stats() function returns the collected profiling stats.

def stats(self):
    return self._accumulator.value

class pyspark.BasicProfiler(ctx)

BasicProfiler is the default profiler. It is implemented on top of cProfile and Accumulator.

import cProfile
import pstats

def profile(self, func):
    # Run func under cProfile and convert the result to pstats.Stats
    pr = cProfile.Profile()
    pr.runcall(func)
    st = pstats.Stats(pr)
    st.stream = None  # make it picklable
    st.strip_dirs()
    # Add the new profile to the existing accumulated value
    self._accumulator.add(st)
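
The accumulation step can be illustrated without Spark. The sketch below is a standard-library approximation: profile_call, merge, and task are hypothetical names, and the merge rule (keep the first profile, fold later ones in with pstats.Stats.add) is an assumption about how the accumulated value is combined, not a copy of the PySpark implementation.

```python
import cProfile
import pstats


def profile_call(func, *args):
    # Same steps as profile() above, minus the accumulator
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    stats = pstats.Stats(profiler)
    stats.stream = None  # drop the stream so the object can be pickled
    stats.strip_dirs()
    return stats


def merge(accumulated, new_stats):
    # Assumed merge rule: the first profile becomes the accumulated value,
    # later profiles are folded in with Stats.add()
    if accumulated is None:
        return new_stats
    accumulated.add(new_stats)
    return accumulated


def task(n):
    # Hypothetical workload standing in for one Spark task
    return sorted(range(n, 0, -1))


accumulated = None
call_counts = []
for size in (100, 200, 300):
    accumulated = merge(accumulated, profile_call(task, size))
    call_counts.append(accumulated.total_calls)
```

After each merge the total call count grows, which is the behaviour the accumulated profile relies on: one combined report per RDD rather than one per task.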
