MapReduce - BunksAllowed


MapReduce is a programming model for processing large datasets in parallel across clusters of computers using simple programming constructs. MapReduce in Hadoop refers to the implementation of this programming model within the Apache Hadoop framework, which supplies the distributed storage and cluster management that the model runs on.
 
MapReduce is Hadoop's primary processing engine, enabling users to build distributed applications that process enormous volumes of data in parallel over a cluster.
 
The framework effectively manages fault tolerance and data locality by distributing data and computation over several cluster nodes. 
 
The two main stages of Hadoop's MapReduce process are the "map" phase, which splits input data into smaller pieces and processes them concurrently across several nodes, and the "reduce" phase, which gathers and processes the intermediate results produced by the map phase to create the final output. 
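The two phases can be illustrated with the classic word-count example. This is a Python sketch of the model, not the Hadoop API: the map function emits intermediate pairs, the framework's shuffle is simulated by grouping pairs by key, and the reduce function combines each key's values.

```python
# Illustrative sketch (not the Hadoop API) of the two MapReduce phases,
# applied to word counting.
from collections import defaultdict

def map_phase(line):
    """Map: emit a <word, 1> pair for every word in the input split."""
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    """Reduce: combine all intermediate values that share a key."""
    return (key, sum(values))

splits = ["big data big cluster", "data cluster data"]

# Map phase: each split is processed independently
# (on a real cluster, these run in parallel on different nodes).
intermediate = [pair for split in splits for pair in map_phase(split)]

# Shuffle: group intermediate values by key
# (in Hadoop this is done by the framework between the phases).
groups = defaultdict(list)
for key, value in intermediate:
    groups[key].append(value)

# Reduce phase: one call per unique key.
result = dict(reduce_phase(k, v) for k, v in sorted(groups.items()))
print(result)  # {'big': 2, 'cluster': 2, 'data': 3}
```

Because every map call sees only its own split and every reduce call sees only one key's values, both phases parallelize naturally across nodes.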
 
Hadoop's distributed processing approach makes it an essential part of big data analytics and processing pipelines, allowing large-scale datasets to be handled and analyzed efficiently.


JobTracker and TaskTracker

In Hadoop MapReduce, the JobTracker and TaskTracker play crucial roles in managing and executing MapReduce jobs within a Hadoop cluster.

The JobTracker is responsible for coordinating and managing MapReduce jobs submitted by users. It is typically a single master node in the Hadoop cluster. Its main functions include:
  • Job Scheduling: The JobTracker schedules MapReduce jobs for execution based on the available cluster resources and job priorities.
  • Task Assignment: It allots map and reduce tasks to the cluster's available TaskTrackers.
  • Monitoring: The JobTracker tracks the success and failure of individual tasks as well as the progress of MapReduce jobs.
  • Fault Tolerance: The JobTracker detects and handles TaskTracker or individual task failures to guarantee fault tolerance. If a TaskTracker fails, the JobTracker reassigns its tasks to other available TaskTrackers.

The TaskTrackers, on the other hand, are slave nodes in the Hadoop cluster that carry out the map and reduce tasks that the JobTracker assigns them. Among their principal roles are:
  • Task Execution: TaskTrackers carry out the tasks that the JobTracker assigns them, including mapping and reducing. They handle the actual data processing.
  • Heartbeat: To let the JobTracker know they are still alive and well, TaskTrackers periodically send heartbeat signals to it. This makes it possible for the JobTracker to identify errors fast.
  • Task Progress Reporting: TaskTrackers send information about the success or failure of a task to the JobTracker by reporting task progress.
  • Speculative Execution: When a task is running more slowly than anticipated on one TaskTracker, a duplicate copy of it can be launched speculatively on another TaskTracker. Whichever copy finishes first is used, which reduces the overall time it takes to complete the job.

Together, JobTracker and TaskTracker ensure scalability, fault tolerance, and optimal resource use by effectively managing and executing MapReduce jobs in a distributed Hadoop environment.
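The heartbeat and fault-tolerance mechanism described above can be sketched as follows. This is a toy simulation, not the real Hadoop implementation: the class name, timeout value, and round-robin reassignment are illustrative assumptions. The JobTracker treats a TaskTracker as failed when its last heartbeat is older than a timeout and reassigns that tracker's tasks to the remaining live trackers.

```python
# Toy sketch (NOT the real Hadoop code) of heartbeat-based failure
# detection: a tracker whose heartbeat is too old is declared failed,
# and its tasks are handed to live trackers.
HEARTBEAT_TIMEOUT = 10  # illustrative: seconds of silence before failure

class JobTrackerSketch:
    def __init__(self):
        self.last_heartbeat = {}  # tracker id -> time of last heartbeat
        self.assignments = {}     # tracker id -> list of assigned task ids

    def heartbeat(self, tracker, now):
        """Record a heartbeat; registers the tracker if it is new."""
        self.last_heartbeat[tracker] = now
        self.assignments.setdefault(tracker, [])

    def assign(self, tracker, task):
        self.assignments[tracker].append(task)

    def detect_failures(self, now):
        """Fail silent trackers and reassign their orphaned tasks."""
        failed = [t for t, ts in self.last_heartbeat.items()
                  if now - ts > HEARTBEAT_TIMEOUT]
        orphaned = []
        for t in failed:
            orphaned += self.assignments.pop(t, [])
            del self.last_heartbeat[t]
        live = list(self.assignments)
        for i, task in enumerate(orphaned):
            if not live:
                break  # no live trackers; tasks must wait for one
            self.assignments[live[i % len(live)]].append(task)
        return failed

jt = JobTrackerSketch()
jt.heartbeat("tracker-1", now=0); jt.assign("tracker-1", "map-0")
jt.heartbeat("tracker-2", now=0); jt.assign("tracker-2", "map-1")
jt.heartbeat("tracker-2", now=12)   # tracker-1 stays silent past the timeout
print(jt.detect_failures(now=12))   # ['tracker-1']
print(jt.assignments)               # {'tracker-2': ['map-1', 'map-0']}
```

The real JobTracker also factors in data locality and slot availability when reassigning; the sketch keeps only the detect-and-reassign core.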


Steps in MapReduce

  • The map phase takes input data in the form of <key, value> pairs and returns a list of <key, value> pairs. The keys in this list need not be unique. 
  • The Hadoop framework then applies sort and shuffle to the output of the map phase. Sort and shuffle acts on this list of <key, value> pairs and emits each unique key together with the list of values associated with it: <key, list(values)>. 
  • The output of sort and shuffle is sent to the reducer phase. The reducer applies a user-defined function to the list of values for each unique key, and the final <key, value> output is stored or displayed.
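The three steps can be traced on a small example. This is a Python sketch of the data flow, with single-letter keys chosen purely for illustration; in Hadoop, the sort-and-shuffle step is performed by the framework, not by user code.

```python
# Python sketch of the three MapReduce steps on illustrative data.
from itertools import groupby
from operator import itemgetter

# Step 1 - map: a list of <key, value> pairs; keys are not unique.
pairs = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", 1)]

# Step 2 - sort and shuffle: sort by key, then group so each unique
# key carries the list of its values, i.e. <key, list(values)>.
pairs.sort(key=itemgetter(0))
grouped = [(k, [v for _, v in g])
           for k, g in groupby(pairs, key=itemgetter(0))]
print(grouped)  # [('a', [1, 1]), ('b', [1, 1]), ('c', [1])]

# Step 3 - reduce: apply the user-defined function (here: sum) to each
# key's value list, producing the final <key, value> output.
final = [(k, sum(vs)) for k, vs in grouped]
print(final)    # [('a', 2), ('b', 2), ('c', 1)]
```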

Usage of MapReduce

  • It can be used in various applications like document clustering, distributed sorting, and web link-graph reversal.
  • It can be used for distributed pattern-based searching.
  • We can also use MapReduce in machine learning.
  • Google used it to regenerate its index of the World Wide Web.
  • It can be used in multiple computing environments such as multi-cluster, multi-core, and mobile environments.

Happy Exploring!
