Hadoop YARN - BunksAllowed



YARN stands for "Yet Another Resource Negotiator." It was introduced in Hadoop 2.0 to remove the Job Tracker bottleneck of Hadoop 1.0. At its inception, YARN was described as a "Redesigned Resource Manager," but it has since grown into a large-scale distributed operating system for Big Data processing.
 
The YARN architecture essentially separates the resource management layer from the processing layer. The responsibilities that the Job Tracker held in Hadoop 1.0 are now split between the Resource Manager and the Application Master.
 
Additionally, YARN lets various data processing engines, including batch, interactive, stream, and graph processing engines, run against data stored in HDFS (Hadoop Distributed File System), which greatly increases the system's efficiency. Through its components, it schedules application processing and allocates resources dynamically. Effective resource management is essential when processing enormous volumes of data, because it lets each application make the most of the resources available.


Features of YARN


The following characteristics helped make YARN popular:

  • Scalability: The Scheduler in YARN's Resource Manager lets Hadoop scale to clusters of thousands of nodes.
  • Compatibility: YARN is backward compatible with Hadoop 1.0, so existing MapReduce applications continue to run on it.
  • Cluster Utilization: YARN allocates cluster resources dynamically, which improves overall cluster utilization.
  • Multi-tenancy: Multiple processing engines can share one cluster, so an organization can run different kinds of workloads side by side.


YARN Architecture


The YARN architecture's principal elements are as follows:

Client: It submits applications, such as MapReduce jobs.
 
Resource Manager: It is YARN's master daemon and is in charge of allocating and managing resources across all applications. When it receives a processing request, it forwards the request to the appropriate Node Manager and allocates resources according to what the request needs.
 
It consists of two main parts: 
  1. Scheduler: It allocates cluster resources to applications based on their requirements and on the resources that are available. The scheduling policy is pluggable: the Capacity Scheduler and the Fair Scheduler are two plugins for dividing up the cluster's resources.
  2. Application Manager: It accepts submitted applications and negotiates the first container, in which the application's Application Master runs. If the Application Master container fails, it also restarts it.
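
To make the Scheduler's job more concrete, here is a minimal sketch of max-min fair sharing, the general idea behind fair scheduling. It is a toy model and not the Fair Scheduler's actual algorithm; the function name `fair_share` and the queue names in the usage note are made up for illustration.

```python
def fair_share(capacity_mb, demands_mb):
    """Split capacity_mb across queues using max-min fairness.

    Toy model only: queues that want no more than an equal share get
    what they asked for, and the leftover is re-divided among the rest.
    """
    allocation = {queue: 0 for queue in demands_mb}
    unsatisfied = dict(demands_mb)
    remaining = capacity_mb

    while unsatisfied and remaining > 0:
        equal_share = remaining / len(unsatisfied)
        # Queues whose demand fits inside an equal share of what is left.
        small = {q: d for q, d in unsatisfied.items() if d <= equal_share}
        if not small:
            # Every queue wants more than an equal share: split evenly and stop.
            for queue in unsatisfied:
                allocation[queue] += equal_share
            break
        for queue, demand in small.items():
            allocation[queue] += demand
            remaining -= demand
            del unsatisfied[queue]
    return allocation
```

For example, calling `fair_share(12000, {"etl": 2000, "adhoc": 8000, "ml": 8000})` gives the small `etl` queue everything it asked for and splits the remaining 10000 MB evenly between the two larger queues.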

Node Manager: It manages a single node in the Hadoop cluster, including the containers and applications running on it, and it acts on instructions from the Resource Manager. It registers with the Resource Manager and then sends periodic heartbeats that report the node's health. It also monitors resource usage, manages logs, and kills containers when the Resource Manager instructs it to. Finally, it starts a container process when an Application Master requests one.
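
The heartbeat mechanism can be sketched as follows. This is a toy model of how a master might decide which nodes are still usable, not YARN's implementation; the class name `HeartbeatTracker`, its methods, and the expiry value are all hypothetical.

```python
class HeartbeatTracker:
    """Toy model of a master tracking worker liveness via heartbeats."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.last_seen = {}  # node id -> (timestamp, healthy flag)

    def heartbeat(self, node_id, now, healthy=True):
        # Record the latest report from a node, including its health state.
        self.last_seen[node_id] = (now, healthy)

    def usable_nodes(self, now):
        # A node is usable if it reported itself healthy and its last
        # heartbeat arrived within the expiry window.
        return sorted(
            node_id
            for node_id, (seen, healthy) in self.last_seen.items()
            if healthy and now - seen <= self.expiry_seconds
        )
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to experiment with.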

  1. Application Master: An application is a single job submitted to a framework. Each application has its own Application Master, which negotiates resources with the Resource Manager and tracks the application's status and progress. To request a container launch, the Application Master sends the Node Manager a Container Launch Context (CLC), which describes everything the application needs in order to run. Once started, the Application Master periodically sends health reports to the Resource Manager.
  2. Container: A container is a bundle of physical resources on a single node, such as RAM, CPU cores, and disk. Containers are launched through a Container Launch Context (CLC), a record that holds information such as environment variables, security tokens, and dependencies.
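
The relationship between a container, its launch context, and the node that hosts it can be sketched in a few lines. This is a simplified illustration, not YARN's Java API: the names `Resource`, `LaunchContext`, and `Node` and their fields are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Resource:
    """Resources a container asks for, or a node offers."""
    memory_mb: int
    vcores: int


@dataclass
class LaunchContext:
    """Simplified stand-in for a Container Launch Context."""
    commands: list
    environment: dict = field(default_factory=dict)
    local_resources: dict = field(default_factory=dict)  # name -> file path
    tokens: bytes = b""


class Node:
    """Toy node that refuses containers it has no room for."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.running = {}  # container id -> (Resource, LaunchContext)

    def available(self):
        used_mem = sum(res.memory_mb for res, _ in self.running.values())
        used_cores = sum(res.vcores for res, _ in self.running.values())
        return Resource(self.capacity.memory_mb - used_mem,
                        self.capacity.vcores - used_cores)

    def launch(self, container_id, resource, context):
        free = self.available()
        if resource.memory_mb > free.memory_mb or resource.vcores > free.vcores:
            return False  # not enough memory or cores left on this node
        self.running[container_id] = (resource, context)
        return True
```

In this sketch, a node with 8192 MB and 4 cores accepts two containers of 4096 MB and 2 cores each, then rejects any further request until one of them finishes.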


Application workflow in Hadoop YARN


  1. The client submits an application
  2. The Resource Manager allocates a container to start the Application Master
  3. The Application Master registers itself with the Resource Manager
  4. The Application Master negotiates containers from the Resource Manager
  5. The Application Master notifies the Node Manager to launch containers
  6. Application code is executed in the container
  7. The client contacts the Resource Manager/Application Master to monitor the application's status
  8. Once the processing is complete, the Application Master un-registers with the Resource Manager
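
The eight steps can be traced with a small simulation. This is only a sketch under heavy simplifications: `ToyResourceManager` and `run_application` are hypothetical names, containers are reduced to a counter, and tasks finish instantly. It does not use or imitate YARN's real API.

```python
class ToyResourceManager:
    """Hypothetical stand-in for the Resource Manager: it only counts free containers."""

    def __init__(self, free_containers):
        self.free_containers = free_containers
        self.status = {}  # application id -> state string

    def allocate(self, wanted):
        granted = min(wanted, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count):
        self.free_containers += count


def run_application(rm, app_id, tasks):
    """Walk one application through the eight steps and return an event log."""
    log = [f"1 client submits {app_id}"]
    if rm.allocate(1) == 0:
        rm.status[app_id] = "FAILED"
        log.append("2 no container available for the Application Master")
        return log
    log.append("2 RM allocates a container for the Application Master")
    rm.status[app_id] = "RUNNING"
    log.append("3 Application Master registers with the RM")

    pending = list(tasks)
    while pending:
        granted = rm.allocate(len(pending))
        if granted == 0:
            # Toy simplification: a real cluster would wait for capacity.
            rm.status[app_id] = "FAILED"
            rm.release(1)  # give back the Application Master's container
            log.append("4 no containers available for tasks")
            return log
        log.append(f"4 Application Master is granted {granted} container(s)")
        batch, pending = pending[:granted], pending[granted:]
        for task in batch:
            log.append(f"5 Node Manager launches a container for {task}")
            log.append(f"6 {task} runs in its container")
        rm.release(granted)  # finished containers go back to the pool
        log.append(f"7 client polls status: {rm.status[app_id]}")

    rm.status[app_id] = "FINISHED"
    rm.release(1)  # the Application Master's own container
    log.append("8 Application Master unregisters from the RM")
    return log
```

Running `run_application(ToyResourceManager(3), "app_1", ["t1", "t2", "t3"])` shows the tasks executing in two rounds, because one of the three containers is occupied by the Application Master itself.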


Benefits of YARN:


  • Flexibility: YARN can run a variety of distributed processing systems, such as Apache Spark, Apache Flink, and Apache Storm, and lets several of them operate on a single Hadoop cluster at the same time.
  • Resource Management: The Hadoop cluster's resources can be effectively managed with YARN. It enables cluster managers to assign and keep track of the CPU, memory, and disk space needed by any application running in the cluster. 
  • Scalability: YARN can support thousands of nodes in a cluster and is built to be extremely scalable. Depending on what the cluster's applications require, it can scale up or down. 
  • Enhanced Performance: YARN's centralized resource management improves performance by using resources efficiently and scheduling applications so that they make full use of them.
  • Security: YARN supports security features such as Kerberos authentication and secure data transmission, which help protect the data processed and stored on the Hadoop cluster.


Drawbacks of YARN:


  • Complexity: The Hadoop environment is made more complex by YARN. More configurations and settings are needed, which can be challenging for users who are unfamiliar with YARN. 
  • Overhead: Scheduling applications and managing resources takes work of its own, and this overhead can reduce the Hadoop cluster's performance.
  • Latency: YARN adds latency to the Hadoop ecosystem. Application scheduling, resource allocation, and inter-component communication all contribute to it.
  • Single Point of Failure: The Resource Manager can be a single point of failure in the Hadoop cluster. If it fails, the whole cluster can stop processing work. To prevent this, administrators should configure Resource Manager high availability with a standby instance.
  • Minimal Support: For programming languages other than Java, YARN offers only limited support. While YARN is compatible with a variety of processing engines, its usability in some situations may be limited due to poor language support in some of these engines.


Happy Exploring!
