Components of Hadoop Distributed File System (HDFS) - BunksAllowed

Name Node

The Name Node serves as the master in the master-slave architecture of HDFS. It stores the file system metadata, which includes file names, permissions, and block locations. 
 
The Name Node does not store the actual data content of files; it only stores metadata and coordinates data access operations. 
 
Since it is aware of every file's state as well as its metadata, the Name Node serves as HDFS's controller and manager. Because the metadata is compact, it can be held in the Name Node's memory, which makes lookups fast. 
 
Additionally, numerous clients use the HDFS cluster continuously, and a single machine handles all of their metadata requests. The Name Node is responsible for carrying out file system namespace operations such as opening, closing, and renaming files and directories.
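
The division of labour described above can be illustrated with a minimal sketch. This is a toy model, not Hadoop code: the class `ToyNameNode`, its methods, and the block and node names are all hypothetical. It only shows the idea that the Name Node keeps names, permissions, and block locations in memory and never holds file contents.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    """Metadata for one file: permissions and an ordered list of block ids."""
    permissions: str
    blocks: list = field(default_factory=list)


class ToyNameNode:
    """Hypothetical in-memory namespace; it stores metadata only, no file data."""

    def __init__(self):
        self.files = {}            # path -> FileEntry
        self.block_locations = {}  # block id -> list of Data Node names

    def create(self, path, permissions="rw-r--r--"):
        self.files[path] = FileEntry(permissions)

    def add_block(self, path, block_id, data_nodes):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = list(data_nodes)

    def rename(self, src, dst):
        # A rename touches only the namespace; no block moves between nodes.
        self.files[dst] = self.files.pop(src)

    def locate(self, path):
        """Tell a client which Data Nodes hold each block of the file."""
        return [(b, self.block_locations[b]) for b in self.files[path].blocks]


name_node = ToyNameNode()
name_node.create("/logs/app.log")
name_node.add_block("/logs/app.log", "blk_1", ["dn1", "dn2", "dn3"])
name_node.add_block("/logs/app.log", "blk_2", ["dn2", "dn3", "dn4"])
name_node.rename("/logs/app.log", "/logs/app.old")
locations = name_node.locate("/logs/app.old")
print(locations)
```

A client would use the returned locations to read the blocks directly from the Data Nodes.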

Data Node

A Data Node stores and retrieves blocks as instructed by a client or the Name Node. Each Data Node periodically sends the Name Node a report listing the blocks it is storing. 
 
Data Nodes run on commodity hardware and, as instructed by the Name Node, also perform block creation, deletion, and replication.
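
As a rough illustration of these duties, the following sketch models a Data Node as a simple in-memory block store. The class `ToyDataNode` and its method names are assumptions made for this example and do not correspond to the real Hadoop API.

```python
class ToyDataNode:
    """Hypothetical Data Node: holds block contents keyed by block id."""

    def __init__(self, name):
        self.name = name
        self._blocks = {}  # block id -> bytes

    def write_block(self, block_id, data):
        self._blocks[block_id] = bytes(data)

    def read_block(self, block_id):
        return self._blocks[block_id]

    def delete_block(self, block_id):
        self._blocks.pop(block_id, None)

    def replicate_block(self, block_id, target):
        # Copy one of our blocks to another Data Node.
        target.write_block(block_id, self._blocks[block_id])

    def block_report(self):
        """The list of block ids that would be sent to the Name Node."""
        return sorted(self._blocks)


dn1 = ToyDataNode("dn1")
dn2 = ToyDataNode("dn2")
dn1.write_block("blk_1", b"hello")
dn1.write_block("blk_2", b"world")
dn1.replicate_block("blk_1", dn2)
dn1.delete_block("blk_2")
print(dn1.block_report(), dn2.block_report())  # ['blk_1'] ['blk_1']
```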

Block

A block is the smallest unit of data that HDFS can read or write. 
 
HDFS blocks have a default size of 128 MB, which is configurable. HDFS divides files into block-sized chunks that are stored as independent units. In contrast to a conventional file system, an HDFS file that is smaller than the block size does not occupy an entire block. 
 
Assume that a 5 MB file is stored in HDFS with a block size of 128 MB; it then requires only 5 MB of space. The large HDFS block size is intended to minimize the cost of seeks. As another example, take a file that is 700 MB in size. If the block size is 128 MB, HDFS divides the file into 6 blocks: five blocks of 128 MB and one block of 60 MB.
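
The arithmetic in these examples can be checked with a short sketch. The function `split_into_blocks` is a hypothetical helper written for this illustration; it only reproduces the size calculation, not how HDFS actually writes data.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file would be divided into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block holds only the leftover data, not a full block.
        sizes.append(remainder)
    return sizes


blocks_700 = split_into_blocks(700)
blocks_5 = split_into_blocks(5)
print(blocks_700)  # [128, 128, 128, 128, 128, 60]
print(blocks_5)    # [5]
```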

Replication Management

In order to offer fault tolerance, HDFS makes use of replication. It copies each block and stores the copies on different DataNodes. The number of copies of each block is determined by the replication factor. 
 
Although it can be set to any value, the replication factor is 3 by default. The NameNode gathers every DataNode's block reports in order to maintain the replication factor. It adds replicas when a block is under-replicated and removes replicas when a block is over-replicated.
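
The bookkeeping described above might look roughly like the following sketch. The function `replication_actions` and the report format are assumptions for illustration. A real NameNode also knows which blocks should exist from its own metadata, so it can detect blocks that no DataNode reports at all; this simplified version only sees blocks that appear in at least one report.

```python
from collections import Counter


def replication_actions(block_reports, replication_factor=3):
    """Compare reported replica counts with the target replication factor.

    block_reports maps a DataNode name to the block ids it reports.
    Returns {block_id: n}, where n > 0 means "add n replicas" and
    n < 0 means "remove -n replicas". Correctly replicated blocks are omitted.
    """
    counts = Counter()
    for blocks in block_reports.values():
        counts.update(set(blocks))  # a node counts once per block
    return {
        block: replication_factor - counts[block]
        for block in sorted(counts)
        if counts[block] != replication_factor
    }


reports = {
    "dn1": ["b1", "b2"],
    "dn2": ["b1", "b2"],
    "dn3": ["b1", "b2"],
    "dn4": ["b2", "b3"],
}
actions = replication_actions(reports)
print(actions)  # {'b2': -1, 'b3': 2}
```

Here `b1` has exactly three replicas and needs no action, `b2` is over-replicated by one, and `b3` is under-replicated by two.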

Rack Awareness

Many Data Node machines are housed in a rack, and a production cluster uses multiple such racks. HDFS uses a rack awareness mechanism to distribute the block replicas across them.  
 
This rack awareness mechanism provides fault tolerance and low latency. Assume that the configured replication factor is 3. 
 
The rack awareness algorithm places the first replica on the local rack. The other two replicas are placed on a different rack. If at all possible, it stores no more than two replicas of a block in a single rack. 
 
Rack awareness balances data locality with fault tolerance: keeping two replicas in the same rack limits inter-rack network traffic, while spreading replicas over more than one rack ensures that if an entire rack or network switch fails, the data remains accessible from replicas stored on other racks.
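
The placement rule for a replication factor of 3 can be sketched as follows. This is a simplified illustration, not the actual HDFS placement policy: the function `place_replicas`, the topology format, and the rack and node names are hypothetical, and the sketch assumes there is at least one other rack with two or more nodes.

```python
import random


def place_replicas(topology, writer_node, rng=random):
    """Pick three nodes for one block's replicas.

    topology maps a rack name to the list of node names in that rack.
    The first replica goes to the writer's node (the local rack); the other
    two go to two different nodes on one remote rack.
    """
    local_rack = next(r for r, nodes in topology.items() if writer_node in nodes)
    remote_racks = [
        r for r, nodes in topology.items() if r != local_rack and len(nodes) >= 2
    ]
    remote_rack = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote_rack], 2)
    return [writer_node, second, third]


topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8"],
}
placement = place_replicas(topology, "dn1", random.Random(0))
print(placement)
```

With this layout, losing any single rack still leaves at least one replica reachable, and no rack holds more than two replicas of the block.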

Secondary Name Node

In the Hadoop Distributed File System (HDFS), the Secondary Name Node is essential to maintaining the dependability and effectiveness of the file system. 
 
Despite what its name suggests, the Secondary Name Node is not a standby or backup for the primary Name Node; it supports the primary Name Node by acting as an assistant. 
 
Its main purpose is to create new file system images by regularly merging the edit log with the existing file system image (fsimage). By carrying out this checkpoint function, the Secondary Name Node shortens the Name Node's recovery time in the event of a failure, because fewer logged changes have to be replayed at startup. 
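
The checkpoint idea can be illustrated with a small sketch that replays a log of changes on top of a saved image. The data layout, the operation names, and the functions `apply_edit` and `checkpoint` are assumptions made for this example; the real fsimage and edit log formats are different.

```python
def apply_edit(namespace, edit):
    """Apply one logged namespace change to an image (dict: path -> block ids)."""
    op = edit[0]
    if op == "create":
        _, path = edit
        namespace[path] = []
    elif op == "add_block":
        _, path, block_id = edit
        namespace[path].append(block_id)
    elif op == "rename":
        _, src, dst = edit
        namespace[dst] = namespace.pop(src)
    elif op == "delete":
        _, path = edit
        del namespace[path]
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image and return the new image."""
    new_image = {path: list(blocks) for path, blocks in fsimage.items()}
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image


old_image = {"/a.txt": ["blk_1"]}
edit_log = [
    ("create", "/b.txt"),
    ("add_block", "/b.txt", "blk_2"),
    ("rename", "/a.txt", "/c.txt"),
    ("delete", "/b.txt"),
]
new_image = checkpoint(old_image, edit_log)
print(new_image)  # {'/c.txt': ['blk_1']}
```

After a checkpoint, the merged changes no longer need to be replayed, which is why a restart that begins from the new image is faster.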
 
The Name Node's computational load is partially relieved by this checkpoint procedure, which enhances system performance as a whole. 
 
Furthermore, the Secondary Name Node contributes to the general stability and dependability of HDFS, since its latest checkpoint provides a copy of the file system's metadata that can help in recovery if the Name Node's own copy is lost or damaged. 
 
The Secondary Name Node is an essential feature of HDFS that ensures dependable and seamless data management in distributed environments, even though it does not actively participate in real-time operations like the primary Name Node does.

Happy Exploring!
