Limitations of Hadoop
Hadoop processes data only in batches and accesses it only sequentially, which means the entire dataset must be scanned even for the simplest of jobs. Processing one huge dataset produces another huge dataset that must again be read sequentially. A new solution is needed to access any point of data in a single unit of time, that is, random access.
Hadoop Random Access Databases
HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access that data in a random manner.
Introduction to HBase
HBase is an open-source, horizontally scalable, column-oriented distributed database built on top of the Hadoop file system. Modeled after Google's Bigtable, it is designed to provide quick random access to huge amounts of structured data, and it leverages the fault tolerance of the Hadoop File System (HDFS).
This component of the Hadoop ecosystem provides random, real-time read/write access to data stored in the Hadoop File System. Data can be stored in HDFS either directly or through HBase, and data consumers read and access that data at random using HBase. HBase sits on top of the Hadoop File System and provides read and write access.
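For illustration, here is a minimal sketch of that random read/write path using the HBase Java client; the table name `employee`, the row key, and the `personal` column family are assumptions made up for this example, not part of any real deployment.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class RandomAccessExample {
    public static void main(String[] args) throws Exception {
        // Connection settings are picked up from hbase-site.xml on the classpath.
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("employee"))) { // hypothetical table

            // Write a single cell: row key -> column family:qualifier -> value.
            Put put = new Put(Bytes.toBytes("row-1001"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read: fetch exactly one row by its key, without scanning the dataset.
            Result result = table.get(new Get(Bytes.toBytes("row-1001")));
            byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name)); // prints "Alice"
        }
    }
}
```

Because the read is keyed by row, HBase can answer it directly, which is exactly the random access that HDFS alone does not provide.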
HBase and HDFS
| HDFS | HBase |
| --- | --- |
| HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS. |
| HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables. |
| It provides high-latency batch processing. | It provides low-latency access to single rows from billions of records (random access). |
| It provides only sequential access to data. | HBase internally uses hash tables to provide random access, and it stores the data in indexed HDFS files for faster lookups. |
Storage Mechanism in HBase
HBase is column-oriented, and its tables are sorted by row key. The table schema defines only column families, which group key-value pairs. A table has multiple column families, and each column family can contain any number of columns. Column values within a family are stored consecutively on disk, and each cell value in a table carries a timestamp. To sum up, within an HBase:
- Table is a collection of rows.
- Row is a collection of column families.
- Column family is a collection of columns.
- Column is a collection of key-value pairs.
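As a concrete sketch of this model, the following uses the HBase 2.x Java admin API to create a table whose schema names only its column families; the table name `employee` and the family names are illustrative assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // The schema declares only the column families; the columns inside
            // a family come into existence as individual cells are written.
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("employee"))   // hypothetical table
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("professional"))
                .build());
        }
    }
}
```

Note that individual columns are never declared up front: a column appears the first time a Put writes a key-value pair under its family.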
Column Oriented and Row Oriented
Column-oriented databases are those that store data tables as sections of columns of data rather than as rows of data. In short, they have column families.
Row-Oriented Database vs. Column-Oriented Database

| Row-Oriented Database | Column-Oriented Database |
| --- | --- |
| It is suitable for Online Transaction Processing (OLTP). | It is suitable for Online Analytical Processing (OLAP). |
| Such databases are designed for a small number of rows and columns. | Column-oriented databases are designed for huge tables. |
[Image: column families in a column-oriented database]
HBase and RDBMS
| HBase | RDBMS |
| --- | --- |
| HBase is schema-less; it has no fixed-column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables. |
| It is built for wide tables and is horizontally scalable. | It is thin, built for small tables, and hard to scale. |
| No transactions are there in HBase. | RDBMS is transactional. |
| It has denormalized data. | It will have normalized data. |
| It is good for semi-structured as well as structured data. | It is good for structured data. |
Features of HBase
- HBase scales linearly.
- It offers automated support for failures.
- It offers reliable reading and writing.
- It integrates with Hadoop, both as a source and a destination.
- Its client Java API is simple to use (see the scan sketch after this list).
- Data replication between clusters is provided by HBase.
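As a sketch of that client API, the following scans a range of rows from the hypothetical `employee` table used above; the row-key bounds and family name are assumptions for illustration.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("employee"))) {
            // Scan only the "personal" family, restricted to a row-key range.
            Scan scan = new Scan()
                .withStartRow(Bytes.toBytes("row-1000"))
                .withStopRow(Bytes.toBytes("row-2000"))
                .addFamily(Bytes.toBytes("personal"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(Bytes.toString(row.getRow()));
                }
            }
        }
    }
}
```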
Usage of HBase
- Big Data is accessed randomly and in real time via Apache HBase.
- Large tables are hosted on top of clusters of commodity hardware.
- Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable runs on top of the Google File System, while Apache HBase runs on top of HDFS and Hadoop.
Applications of HBase
- It is employed for write-heavy applications.
- Every time we need to offer quick random access to the data, we use HBase.
- Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
HBase - Architecture
The four main components of HBase are the Master, ZooKeeper, region servers, and regions. The HBase Master assigns regions and handles load balancing, while Apache ZooKeeper monitors the whole system. Region servers serve data for reads and writes; every machine in the Hadoop cluster runs a region server. A region server is made up of regions, an HLog (write-ahead log), stores, and a MemStore, together with a number of files, all of which sit on top of the HDFS file system.
HBase Architectural Component: Regions
HBase datasets are horizontally partitioned into "Regions" by row-key range. Regions are allocated to cluster nodes referred to as "Region Servers," which serve the data for both reads and writes.
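Because a region is simply a row-key range, a table can be created pre-split so that its regions are spread across region servers from the start. The sketch below is illustrative; the split points and table name are assumptions.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Each split key is the first row of a new region, so these three
            // keys yield four regions: (-inf,"g"), ["g","n"), ["n","t"), ["t",+inf).
            byte[][] splitKeys = {
                Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(TableDescriptorBuilder
                .newBuilder(TableName.valueOf("employee"))  // hypothetical table
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"))
                .build(), splitKeys);
        }
    }
}
```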
HBase Architectural Component: Region Server
The Region Server does the
following jobs.
- Interact with the client and manage tasks pertaining to data.
- Take care of read and write requests for each region that falls under it.
- Determine the size of the region by using the region size thresholds.
HBase Architectural Component: Master Server
The Master Server does the
following jobs.
- Assigns regions to the region servers, using Apache ZooKeeper to assist in this process
- Manages load balancing of regions across region servers: it unloads the busy servers and moves regions to less occupied ones.
- Maintains the state of the cluster by negotiating the load balancing.
- Is responsible for schema changes and other metadata operations, such as creating tables and column families (a sketch follows).
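As a sketch of such a schema change, the following asks the cluster to add a new column family to an existing table, an operation the Master coordinates. The `audit` family and `employee` table are hypothetical.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class AlterTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Adding a column family is a metadata operation handled by the
            // HBase Master; "audit" is a hypothetical new family.
            admin.addColumnFamily(TableName.valueOf("employee"),
                ColumnFamilyDescriptorBuilder.of("audit"));
        }
    }
}
```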
HBase Architectural Component: Zookeeper
The Zookeeper does the following
jobs.
- An open-source project called Zookeeper offers many services like naming, distributed synchronization, and configuration information maintenance.
- Ephemeral nodes in Zookeeper represent various region servers. These nodes are used by master servers to find servers that are available.
- The nodes are used to monitor server failures and network partitions in addition to availability.
- Clients use ZooKeeper to communicate with region servers (see the connection sketch after this list).
- In standalone and pseudo-distributed modes, HBase itself takes care of ZooKeeper.
- ZooKeeper receives a heartbeat from the active HMaster, indicating that it is operational.
- The inactive HMaster serves as a backup; if the active HMaster fails, the inactive one takes over.
- Region servers notify ZooKeeper when they are ready for read and write operations.
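As a sketch of that client-side role, the following connects to a cluster by pointing the client at a ZooKeeper quorum, through which region server locations are then discovered. The host names and port are placeholders, not a real deployment.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class ZkConnectExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Clients bootstrap through ZooKeeper; these quorum hosts are placeholders.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            System.out.println("Connected: " + !conn.isClosed());
        }
    }
}
```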



