MapReduce is used to process huge amounts of data. To handle the
incoming data in a parallel and distributed manner, it has to flow
through several phases.
Phases of MapReduce data flow
Input reader: The input reader reads the incoming data
and splits it into data blocks of the appropriate size (128 MB by default).
Each data block is assigned to a Map function. Once the input reader has
read the data, it generates the corresponding key-value pairs. The input
files reside in HDFS.
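As a minimal Python sketch of this phase, the hypothetical helper `read_input` below turns raw text into key-value pairs the way Hadoop's default TextInputFormat does: each record is one line, keyed by its starting byte offset.

```python
def read_input(text):
    # Hypothetical input reader: emit (byte offset, line) pairs,
    # mimicking Hadoop's TextInputFormat record convention.
    pairs = []
    offset = 0
    for line in text.splitlines(keepends=True):
        pairs.append((offset, line.rstrip("\n")))
        offset += len(line.encode("utf-8"))
    return pairs

print(read_input("hello world\nfoo bar\n"))
# → [(0, 'hello world'), (12, 'foo bar')]
```

In a real cluster the framework performs this split across HDFS blocks; the sketch only illustrates the key-value convention.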
Map function: The map function processes the incoming
key-value pairs and generates the corresponding output key-value pairs.
The map input and output types may differ from each other.
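A small sketch of a map function, using the classic word-count example (the function name `map_fn` is illustrative): the input key is a byte offset and the value a line of text, while the output pairs have a different type, `(word, 1)`.

```python
def map_fn(key, value):
    # Input pair: (byte offset, line of text).
    # Output pairs: (word, 1) — note the types differ from the input.
    for word in value.split():
        yield (word, 1)

print(list(map_fn(0, "to be or not to be")))
# → [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```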
Partition function: The partition function assigns the
output of each Map function to the appropriate reducer. It takes the
key and value as input and returns the index of the target reducer.
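A sketch of such a partitioner, following the same idea as Hadoop's default HashPartitioner: hash the key, mask off the sign bit, and take the remainder modulo the number of reducers, so every occurrence of a key lands on the same reducer.

```python
def partition(key, num_reducers):
    # Hash the key, clear the sign bit, and map the result into
    # [0, num_reducers) — each key always goes to the same reducer.
    return (hash(key) & 0x7FFFFFFF) % num_reducers

idx = partition("apple", 4)
print(idx)  # some index in the range 0..3
```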
Shuffling and Sorting: The data is shuffled
between and within nodes so that it moves out of the map phase and is
ready to be processed by the reduce function. Shuffling the data can
sometimes take considerable computation time. A sorting operation is then
performed on the input to the Reduce function: the data is compared using
a comparison function and arranged in sorted order.
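The shuffle-and-sort step can be sketched in a few lines of Python (a simplification of what the framework does across the network): group every mapped value under its key, then sort the keys with an ordinary comparison.

```python
from collections import defaultdict

def shuffle_and_sort(mapped_pairs):
    # Group all values under their key, then sort the keys so the
    # reducer sees them in order — a single-process stand-in for
    # the framework's network shuffle.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

print(shuffle_and_sort([("b", 1), ("a", 1), ("b", 1)]))
# → [('a', [1]), ('b', [1, 1])]
```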
Reduce function: The Reduce function is called once for each
unique key, and the keys arrive in sorted order. The Reduce function
iterates over the values associated with each key and generates the
corresponding output.
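Continuing the word-count illustration, a reduce function (the name `reduce_fn` is illustrative) receives one unique key together with all of its values and emits a single aggregated pair:

```python
def reduce_fn(key, values):
    # Called once per unique key; iterate the values and emit one
    # aggregated output pair — here, a word-count sum.
    return (key, sum(values))

print(reduce_fn("be", [1, 1]))  # → ('be', 2)
```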
Output writer: Once the data has flowed through all the above
phases, the Output writer executes. Its role is to write the Reduce
output to stable storage.
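As a final sketch, an output writer can be modeled as serializing each reduced pair as one tab-separated line, in the spirit of Hadoop's TextOutputFormat (here writing to an in-memory stream instead of HDFS):

```python
import io

def write_output(reduced_pairs, stream):
    # Write one "key<TAB>value" line per reduced pair, in the
    # spirit of Hadoop's TextOutputFormat.
    for key, value in reduced_pairs:
        stream.write(f"{key}\t{value}\n")

buf = io.StringIO()
write_output([("be", 2), ("to", 2)], buf)
print(buf.getvalue(), end="")
```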