
HDFS File Read and Write Request Workflow


HDFS File Read Request Workflow


Step 1: The client opens the file it wants to read by calling open() on the FileSystem object, which for HDFS is an instance of DistributedFileSystem.

Step 2: To find the locations of the first few blocks in the file, the DistributedFileSystem (DFS) makes a remote procedure call (RPC) to the name node. For each block, the name node returns the addresses of the data nodes that hold a copy of it. The DFS returns an FSDataInputStream to the client to read data from; the FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.

Step 3: The client then calls read() on the stream. The DFSInputStream, which has stored the data node addresses for the first few blocks in the file, connects to the first (closest) data node holding the first block of the file.

Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.

Step 5: When a block ends, the DFSInputStream closes the connection to that data node and locates the best data node for the next block. This happens transparently to the client, which simply sees a continuous stream. As the client reads through the stream, blocks are read in order and the DFSInputStream opens new connections to data nodes as needed. When necessary, it also calls the name node to retrieve the data node locations for the next batch of blocks.

Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
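
From the client's point of view, the whole read workflow above boils down to a few calls on the FileSystem API. The sketch below is a minimal illustration, with a placeholder path and class name: it prints the block locations the name node reports (the metadata from Step 2) and then streams the file's contents to standard output (Steps 1 and 3 to 6).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {                                // hypothetical class name
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // reads core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);                 // a DistributedFileSystem when fs.defaultFS points at HDFS
        Path file = new Path("/user/demo/input.txt");         // placeholder path

        // Step 2, as visible to a client: which data nodes hold each block of the file
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }

        // Steps 1 and 3-5: open() returns an FSDataInputStream; reading pulls data from the data nodes
        FSDataInputStream in = fs.open(file);
        try {
            IOUtils.copyBytes(in, System.out, 4096, false);   // repeated read() calls handled by the helper
        } finally {
            IOUtils.closeStream(in);                          // Step 6: close the stream
        }
        fs.close();
    }
}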


HDFS File Write Request Workflow


Step 1: The client creates the file by calling create() on the DistributedFileSystem (DFS).

Step 2: The DFS makes an RPC call to the name node to create a new file in the file system namespace, with no blocks associated with it yet. The name node performs a number of checks to make sure the file does not already exist and that the client has the permissions required to create it. If these checks pass, the name node makes a record of the new file; otherwise, file creation fails and the client receives an error such as an IOException. The DFS returns an FSDataOutputStream for the client to start writing data to.

Step 3: As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue called the data queue. The data queue is consumed by the DataStreamer, which is responsible for asking the name node to allocate new blocks by picking a list of suitable data nodes to store the replicas. The list of data nodes forms a pipeline; here we assume a replication level of three, so there are three nodes in the pipeline. The DataStreamer streams the packets to the first data node in the pipeline, which stores each packet and forwards it to the second data node in the pipeline.

Step 4: In a similar manner, the packet is stored by the second data node and forwarded to the third (and final) data node in the pipeline.

Step 5: The DFSOutputStream also maintains an internal "ack queue" of packets that are waiting to be acknowledged by data nodes. A packet is removed from the ack queue only when it has been acknowledged by all the data nodes in the pipeline.

Step 6: When the client has finished writing data, it calls close() on the stream. This flushes all the remaining packets to the data node pipeline and waits for acknowledgments before contacting the name node to signal that the file is complete.
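
The client-side counterpart of the write workflow is equally small. The following sketch, again with a placeholder path and class name, creates a file (Steps 1 and 2), writes a few bytes through the FSDataOutputStream (Steps 3 to 5), and closes the stream (Step 6).

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {                               // hypothetical class name
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/output.txt");        // placeholder path

        // Steps 1-2: create() asks the name node to add the file to the namespace (no blocks yet)
        FSDataOutputStream out = fs.create(file);
        try {
            // Steps 3-5: data is split into packets and pushed down the data node pipeline
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            // Step 6: flush remaining packets, wait for acks, tell the name node the file is complete
            out.close();
        }
        fs.close();
    }
}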

HDFS follows the Write Once, Read Many model. Files already stored in HDFS cannot be edited in place, but new data can be appended to them by reopening the file. This design lets HDFS scale to a large number of concurrent clients, because data traffic is spread across all of the cluster's data nodes, which increases the system's throughput, scalability, and availability.
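
Appending is done through the same FileSystem API. A minimal sketch, assuming the file already exists and the cluster and Hadoop version permit appends (the path and class name are placeholders):

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {                              // hypothetical class name
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/output.txt");        // placeholder: an existing HDFS file

        // Existing blocks are never rewritten; new data is only added at the end of the file
        FSDataOutputStream out = fs.append(file);
        try {
            out.write("appended line\n".getBytes(StandardCharsets.UTF_8));
        } finally {
            out.close();
        }
        fs.close();
    }
}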


Happy Exploring!
