Limitations of Hadoop
Hadoop processes data only in batches and accesses it only sequentially, which means the entire dataset must be scanned even for the simplest of jobs. Processing one huge dataset produces another huge dataset that must again be read sequentially. A new solution is needed to access any point of data in a single unit of time, that is, random access.
Hadoop Random Access Databases
HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access that data in a random manner.
Introduction to HBase
HBase is an open-source, horizontally scalable, column-oriented distributed database built on top of the Hadoop file system. Modeled after Google's Bigtable, it is designed to provide quick random access to huge amounts of structured data, and it leverages the fault tolerance of the Hadoop File System (HDFS).
This component of the Hadoop ecosystem provides random, real-time read/write access to data stored in the Hadoop File System. Data can be stored in HDFS either directly or through HBase, and data consumers read and access that data at random using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
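For illustration, here is a minimal sketch of that random read/write path using the HBase Java client; the table name `employee`, the row key, and the `personal` column family are assumptions made up for this example, not part of any real deployment.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
    public static void main(String[] args) throws Exception {
        // Connection settings are picked up from hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("employee"))) { // hypothetical table

            // Write a single cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("row-1001"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read: fetch exactly one row by its key, without scanning the dataset.
            Result result = table.get(new Get(Bytes.toBytes("row-1001")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name)); // prints "Alice"
        }
    }
}
```

Because the read is keyed by row, HBase can answer it directly, which is exactly the random access that HDFS alone does not provide.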
HBase and HDFS
| HDFS | HBase |
| --- | --- |
| HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS. |
| HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables. |
| It provides high-latency batch processing. | It provides low-latency access to single rows from billions of records (random access). |
| It provides only sequential access to data. | HBase internally uses hash tables to provide random access, and it stores the data in indexed HDFS files for faster lookups. |
Storage Mechanism in HBase
HBase is column-oriented, and its tables are sorted by row key. The table schema defines only column families, which group key-value pairs. A table has multiple column families, and each column family can contain any number of columns. Column values within a family are stored consecutively on disk, and each cell value in a table carries a timestamp. To sum up, within an HBase:
- Table is a collection of rows.
- Row is a collection of column families.
- Column family is a collection of columns.
- Column is a collection of key-value pairs.
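As a concrete sketch of this model, the following uses the HBase 2.x Java admin API to create a table whose schema names only its column families; the table name `employee` and the family names are illustrative assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // The schema declares only the column families; the columns inside
            // a family come into existence as individual cells are written.
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("employee"))   // hypothetical table
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional"))
                .build());
        }
    }
}
```

Note that individual columns are never declared up front: a column appears the first time a Put writes a key-value pair under its family.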
Column Oriented and Row Oriented
Column-oriented databases are those that store data tables as sections of columns of data rather than as rows of data. In short, they have column families.
Row-Oriented Database vs. Column-Oriented Database

| Row-Oriented Database | Column-Oriented Database |
| --- | --- |
| It is suitable for Online Transaction Processing (OLTP). | It is suitable for Online Analytical Processing (OLAP). |
| Such databases are designed for a small number of rows and columns. | Column-oriented databases are designed for huge tables. |
[Image: column families in a column-oriented database]
HBase and RDBMS
| HBase | RDBMS |
| --- | --- |
| HBase is schema-less; it has no fixed-column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables. |
| It is built for wide tables and is horizontally scalable. | It is thin, built for small tables, and hard to scale. |
| No transactions are there in HBase. | RDBMS is transactional. |
| It has denormalized data. | It will have normalized data. |
| It is good for semi-structured as well as structured data. | It is good for structured data. |
Features of HBase
- HBase scales linearly.
- It offers automated support for failures.
- It offers reliable reading and writing.
- It integrates with Hadoop, both as a source and a destination.
- Its client Java API is simple to use (see the scan sketch after this list).
- Data replication between clusters is provided by HBase.
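As a sketch of that client API, the following scans a range of rows from the hypothetical `employee` table used above; the row-key bounds and family name are assumptions for illustration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // Scan only the "personal" family, restricted to a row-key range.
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("row-1000"))
                .withStopRow(Bytes.toBytes("row-2000"))
                .addFamily(Bytes.toBytes("personal"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```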
Usage of HBase
- Big Data is accessed randomly and in real time via Apache HBase.
- Large tables are hosted on top of clusters of commodity hardware.
- Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable runs on top of the Google File System, while Apache HBase runs on top of HDFS and Hadoop.
Applications of HBase
- It is employed for write-heavy applications.
- Every time we need to offer quick random access to the data, we use HBase.
- Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase - Architecture
The four main components of HBase are the Master, ZooKeeper, region servers, and regions. The HBase Master assigns regions and handles load balancing, while Apache ZooKeeper monitors the whole system. Region servers serve data for reads and writes; every machine in the Hadoop cluster runs a region server. A region server is made up of regions, an HLog (write-ahead log), stores, and a MemStore, together with a number of files, all of which sit on top of the HDFS file system.
HBase Architectural Component: Regions
HBase datasets are horizontally partitioned into "Regions" by row-key range. Regions are allocated to cluster nodes referred to as "Region Servers," which serve the data for both reads and writes.
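Because a region is simply a row-key range, a table can be created pre-split so that its regions are spread across region servers from the start. The sketch below is illustrative; the split points and table name are assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Each split key is the first row of a new region, so these three
            // keys yield four regions: (-inf,"g"), ["g","n"), ["n","t"), ["t",+inf).
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("employee"))  // hypothetical table
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                .build(), splitKeys);
        }
    }
}
```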
HBase Architectural Component: Region Server
The Region Server does the
following jobs.
- Interact with the client and manage tasks pertaining to data.
- Take care of read and write requests for each region that falls under it.
- Determine the size of the region by using the region size thresholds.
HBase Architectural Component: Master Server
The Master Server does the
following jobs.
- Assigns regions to the region servers, using Apache ZooKeeper to assist in this process
- Manages load balancing of regions across region servers: it unloads the busy servers and moves regions to less occupied ones.
- Maintains the state of the cluster by negotiating the load balancing.
- Is responsible for schema changes and other metadata operations, such as creating tables and column families (a sketch follows).
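As a sketch of such a schema change, the following asks the cluster to add a new column family to an existing table, an operation the Master coordinates. The `audit` family and `employee` table are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class AlterTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Adding a column family is a metadata operation handled by the
            // HBase Master; "audit" is a hypothetical new family.
            admin.addColumnFamily(TableName.valueOf("employee"),
                ColumnFamilyDescriptorBuilder.of("audit"));
        }
    }
}
```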
HBase Architectural Component: Zookeeper
The Zookeeper does the following
jobs.
- An open-source project called Zookeeper offers many services like naming, distributed synchronization, and configuration information maintenance.
- Ephemeral nodes in Zookeeper represent various region servers. These nodes are used by master servers to find servers that are available.
- The nodes are used to monitor server failures and network partitions in addition to availability.
- Clients use ZooKeeper to communicate with region servers (see the connection sketch after this list).
- In standalone and pseudo-distributed modes, HBase itself takes care of ZooKeeper.
- ZooKeeper receives a heartbeat from the active HMaster, indicating that it is operational.
- The inactive HMaster serves as a backup; if the active HMaster fails, the inactive one takes over.
- Region servers notify ZooKeeper when they are ready for read and write operations.
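As a sketch of that client-side role, the following connects to a cluster by pointing the client at a ZooKeeper quorum, through which region server locations are then discovered. The host names and port are placeholders, not a real deployment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZkConnectExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Clients bootstrap through ZooKeeper; these quorum hosts are placeholders.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```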



