PySpark Broadcast and Accumulator
Apache Spark uses shared variables for parallel processing, which lets a task finish in less time. When the driver sends a task to the executors on the cluster, a copy of each shared variable is transferred to every node of the cluster so that it can be used while performing the task.
Apache Spark supports the following types of shared variables: broadcast variables and accumulators.
A broadcast variable is a shared variable that saves a copy of the data across all nodes. It allows the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with every task, for example, to give every node a copy of a large input dataset in an efficient manner. The following code shows the details of the Broadcast class of PySpark.
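As a sketch of the API (the exact default parameters may vary between PySpark releases), the Broadcast class has the following constructor signature; in practice you do not construct it directly but obtain an instance from SparkContext.broadcast():

```
class pyspark.Broadcast(sc=None, value=None, pickle_registry=None, path=None)
```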
A Broadcast variable has an attribute called value, which stores the data and is used to return the broadcast value.
It will give the following output:
Stored data -> ['scala', 'java', 'hadoop', 'spark', 'akka']
Printing a particular element in RDD -> hadoop
Accumulator variables are used to aggregate information through associative and commutative operations. For example, an accumulator can be used for a sum operation or for counters (as in MapReduce). The following code describes the Accumulator class in detail:
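As a sketch of the API (parameter names taken from the PySpark documentation), the Accumulator class has the following constructor signature; like Broadcast, it is normally obtained from SparkContext.accumulator() rather than constructed directly:

```
class pyspark.Accumulator(aid, value, accum_param)
```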
The following example describes how to use an Accumulator variable. Like a broadcast variable, it has an attribute named value that stores the data and returns the accumulated value; however, this attribute can only be read in the driver program.
In the following example, an accumulator variable is used by multiple nodes and returns an accumulated value.
The above program is saved as accumulator.py and gives the following output:
The accumulate value is 15