Javatpoint Logo
Javatpoint Logo

What is MapReduce?

A MapReduce is a data processing tool which is used to process the data parallelly in a distributed form. It was developed in 2004, on the basis of paper titled as "MapReduce: Simplified Data Processing on Large Clusters," published by Google.

The MapReduce is a paradigm which has two phases, the mapper phase, and the reducer phase. In the Mapper, the input is given in the form of a key-value pair. The output of the Mapper is fed to the reducer as input. The reducer runs only after the Mapper is over. The reducer too takes input in key-value format, and the output of reducer is the final output.

Steps in Map Reduce

  • The map takes data in the form of pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.
  • Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This sort and shuffle acts on these list of <key, value> pairs and sends out unique keys and a list of values associated with this unique key <key, list(values)>.
  • An output of sort and shuffle sent to the reducer phase. The reducer performs a defined function on a list of values for unique keys, and Final output <key, value> will be stored/displayed.
What is MapReduce What is MapReduce

Sort and Shuffle

The sort and shuffle occur on the output of Mapper and before the reducer. When the Mapper task is complete, the results are sorted by key, partitioned if there are multiple reducers, and then written to disk. Using the input from each Mapper <k2,v2>, we collect all the values for each unique key k2. This output from the shuffle phase in the form of <k2, list(v2)> is sent as input to reducer phase.

Uses of MapReduce

  • It can be used in various application like document clustering, distributed sorting, and web link-graph reversal.
  • It can be used for distributed pattern-based searching.
  • We can also use MapReduce in machine learning.
  • It was used by Google to regenerate Google's index of the World Wide Web.
  • It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environment.

Please Share

facebook twitter google plus pinterest

Learn Latest Tutorials


B.Tech / MCA