Benefits and Limitations of Hadoop

Benefits of Hadoop


Hadoop is affordable, scalable, and simple to use, and its advantages go well beyond that. Here we discuss the top 12 benefits of Hadoop, the advantages that contribute to its popularity:

  • Various Information Sources: Hadoop can handle a wide range of data. The data can be structured or unstructured and can originate from a variety of sources, including social media, email exchanges, and other interactions. Hadoop can derive valuable insights from this diverse data, and it accepts data in CSV, text, XML, and image formats.
  • Economical: Hadoop is cost-effective because it stores data on a cluster of commodity hardware. Since commodity hardware consists of inexpensive machines, adding nodes to the cluster costs very little. With erasure coding, Hadoop 3.0 has a storage overhead of only 50%, compared to 200% in Hadoop 2.x, where every block is replicated three times. Because far less redundant data is kept, fewer machines are needed to store the same data.
  • Performance: Hadoop processes enormous volumes of data quickly thanks to its distributed storage and distributed processing architecture. In 2008, Hadoop even outperformed the fastest supercomputer of the time. The input data is divided into blocks that are distributed across the nodes of the cluster, and the job a user submits is split into a number of smaller tasks that are assigned to the worker nodes holding the relevant data. These smaller tasks execute concurrently, which improves performance.
  • Fault Tolerant: Hadoop 3.0 uses erasure coding to provide fault tolerance with much less storage. For example, with the default Reed-Solomon policy, six data blocks produce three parity blocks, so HDFS stores nine blocks in total. If a node fails, the affected data block can be recovered from the remaining data blocks and the parity blocks. (A simple arithmetic sketch of the resulting storage overhead follows this list.)
  • Highly Available: The HDFS architecture in Hadoop 2.x features a single active NameNode and a single standby NameNode, giving us a fallback option if the active NameNode fails. Hadoop 3.0, however, supports multiple standby NameNodes, which increases availability even further: the system can keep operating even if two or more NameNodes crash.
  • Minimal Network Traffic: Each job that a user submits to Hadoop is divided into several independent subtasks, and each subtask is assigned to a DataNode. This moves a small amount of code to the data rather than a large amount of data to the code, which keeps network traffic low.
  • High Throughput: Throughput is the amount of work completed per unit of time. Because Hadoop stores data in a distributed fashion, it can also process it in a distributed fashion: a given job is split into smaller tasks that process small portions of the data concurrently, resulting in high throughput.
  • Open-Source: Hadoop is an open-source technology, meaning its source code is freely available. The source code can be modified to meet specific requirements.
  • Scalable: Hadoop follows the principle of horizontal scalability: instead of upgrading an individual machine's configuration by adding RAM, disks, or CPUs (vertical scaling), whole machines are added to the cluster. Nodes can be added at any time, so a Hadoop cluster scales easily.
  • Usability: MapReduce programmers do not need to worry about how distributed processing is accomplished; the Hadoop framework handles the parallel execution behind the scenes (see the WordCount sketch after this list).
  • Compatibility: Most newly developed Big Data technologies, such as Spark and Flink, are compatible with Hadoop. They can use HDFS as their storage layer and run their processing engines on top of a Hadoop cluster.
  • Multiple Languages Supported: On Hadoop, developers can write code in a variety of languages, including C, C++, Perl, Python, Ruby, and Groovy.
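
To make the usability point concrete, here is a sketch of the classic WordCount job written against the Hadoop MapReduce Java API, along the lines of the standard Apache Hadoop tutorial example. The developer only writes the map and reduce logic; input splitting, task scheduling, data locality, shuffling, and fault recovery are handled by the framework. The input and output paths are assumed to be HDFS directories passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: runs in parallel, one task per input split, close to where the block is stored.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    // Reducer: receives all counts for a given word and sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);     // emit (word, total count)
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation reduces network traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory (must not exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar and launched with hadoop jar, this runs map tasks on the nodes that hold the input blocks and reduce tasks that aggregate the counts, with no explicit parallel programming by the developer.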

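The storage saving that erasure coding brings (mentioned under Economical and Fault Tolerant above) can be verified with simple arithmetic. The sketch below only restates the block counts from those points; the figures are not measurements from a real cluster.

public class StorageOverhead {
    public static void main(String[] args) {
        // 3x replication (Hadoop 2.x default): each block is stored 3 times,
        // i.e. 2 redundant copies per original block.
        double replicationOverhead = 2.0 / 1.0 * 100.0;     // 200%

        // RS(6,3) erasure coding (Hadoop 3.x): 6 data blocks + 3 parity blocks
        // = 9 stored blocks, i.e. 3 redundant blocks per 6 original blocks.
        double erasureCodingOverhead = 3.0 / 6.0 * 100.0;   // 50%

        System.out.println("3x replication overhead  : " + replicationOverhead + "%");
        System.out.println("RS(6,3) erasure overhead : " + erasureCodingOverhead + "%");
    }
}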

Limitations of Hadoop


  • Problems with Small Files: Hadoop works well for applications that handle a modest number of very large files, but it struggles with a large number of small files. A small file here is simply one that is substantially smaller than the HDFS block size, which is 128 MB by default (often configured to 256 MB). Because the NameNode keeps the entire filesystem namespace in memory, a huge number of small files overloads it and degrades Hadoop's performance (a rough memory estimate follows this list).
  • Vulnerable By Nature: Hadoop is written in Java, one of the most widely used programming languages, so it is a well-understood target for attackers, leaving Hadoop exposed to security breaches.
  • Processing Overhead: When working with terabytes or petabytes of data, read/write operations become very costly in Hadoop because all data is read from and written to disk. Hadoop cannot perform its calculations in memory, and this disk I/O adds significant processing overhead.
  • Limited to Batch Processing: Hadoop's core engine is a batch processor, which makes it ineffective for stream processing. It cannot produce low-latency, real-time output; it can only work on data that has been collected and stored in files beforehand.
  • Iterative Processing: Hadoop cannot process data iteratively on its own. Machine learning and other iterative workloads need a cyclic data flow, whereas in Hadoop data flows through a chain of stages in which the output of one stage becomes the input of the next.
  • Security: Hadoop relies on Kerberos authentication, which is difficult to manage. The lack of encryption at the storage and network levels by default is a serious concern.
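
A rough back-of-the-envelope calculation illustrates the small-files problem described above. The sketch assumes the commonly quoted rule of thumb that every file object and block object costs on the order of 150 bytes of NameNode heap; the exact figure depends on the Hadoop version, so treat the result as an estimate only.

public class SmallFileEstimate {
    public static void main(String[] args) {
        final long BYTES_PER_OBJECT = 150L;   // assumed metadata cost per file/block object (rule of thumb)
        final long NUM_FILES = 10_000_000L;   // ten million small files
        final long OBJECTS_PER_FILE = 2L;     // 1 file object + 1 block object (file smaller than one block)

        long heapBytes = NUM_FILES * OBJECTS_PER_FILE * BYTES_PER_OBJECT;
        System.out.printf("Approximate NameNode heap needed: %.1f GB%n",
                heapBytes / (1024.0 * 1024.0 * 1024.0));
        // Roughly 2.8 GB of NameNode memory just for metadata, no matter how
        // little actual data the ten million files hold.
    }
}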




Happy Exploring!
