Hadoop Streaming, a utility bundled with Hadoop since version 0.14.1, lets developers write MapReduce programs in languages such as Ruby, Perl, Python, and C++ instead of being confined to Java. It leverages UNIX standard streams: any executable that reads from standard input (stdin) and writes to standard output (stdout) can act as a mapper or reducer. This gives non-Java developers an accessible path to processing vast amounts of data with familiar tools and languages, adding flexibility to the Hadoop ecosystem.
How Hadoop Streaming Works
Hadoop Streaming works by launching the mapper and reducer executables as separate processes and connecting them to the framework through Unix pipes: each task writes its input records to the external program's stdin and reads the program's key-value output from its stdout, while Hadoop itself handles the shuffle and sort between the map and reduce stages. The six steps involved in a Hadoop Streaming job are:
- Step 1: The input data is automatically divided into splits, typically 64 MB or 128 MB (the HDFS block size), and each split is processed by a separate mapper.
- Step 2: The mapper reads its input records from standard input (stdin), applies the mapper logic, and writes intermediate key-value pairs to standard output (stdout), by default as tab-separated lines (see the mapper sketch after this list).
- Step 3: The intermediate key-value pairs are sorted and partitioned by key, ensuring that all values with the same key are directed to the same reducer.
- Step 4: The grouped key-value pairs are passed to the reducers for further processing, with each reducer receiving every pair that shares a given key.
- Step 5: The reducer function, implemented by the developer, performs the required computations or aggregations and writes the final output to standard output (stdout) (see the reducer sketch after this list).
- Step 6: The final output generated by the reducers is stored in the specified output location in HDFS.
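For concreteness, here is a minimal sketch of the mapper and reducer referenced in Steps 2 and 5, written in Python for a word-count job. The file names mapper.py and reducer.py are illustrative, and the code assumes the default tab-separated key-value format used by Hadoop Streaming.

#!/usr/bin/env python3
# mapper.py -- reads raw text lines from stdin and emits "word<TAB>1"
# for every word, one pair per line (Step 2).
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- reads sorted "word<TAB>count" pairs from stdin and sums
# the counts per word; the input arrives grouped by key because Hadoop
# sorts the intermediate pairs before the reduce phase (Step 3).
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.rstrip("\n").partition("\t")
    try:
        count = int(count)
    except ValueError:
        continue  # skip malformed lines
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

if current_word is not None:
    print(f"{current_word}\t{current_count}")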
The distributed nature of Hadoop
enables parallel execution of mappers and reducers across a cluster of
machines, providing scalability and fault tolerance. The data
processing is efficiently distributed across multiple nodes, allowing
for faster processing of large-scale datasets.
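Assuming the illustrative mapper.py and reducer.py scripts above have been marked executable, a streaming job could be submitted with a command along these lines; the exact path to the streaming jar and the input/output directories depend on your Hadoop installation:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.py,reducer.py \
    -mapper mapper.py \
    -reducer reducer.py \
    -input /user/data/input \
    -output /user/data/output

The -files option ships the scripts to every node in the cluster, and the framework then runs them as the mapper and reducer processes described in the steps above.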

