NoSQL Database Called HBase
HBase is an open-source, non-relational, scalable, distributed database written in Java. It is developed as part of the Hadoop ecosystem and runs on top of HDFS. It provides random, real-time read and write access to the data it stores. Queries are issued through its APIs rather than through SQL. It is modelled on Google's Bigtable.
Features
Linear Scalability – It is a distributed database that runs on a cluster of commodity machines, and capacity grows roughly in proportion to the number of machines added.
High Throughput – Writes are appended to a write-ahead log and buffered in memory before being flushed to disk sequentially, which gives high write throughput.
Automatic Sharding – Tables are divided into regions. When a region grows beyond a configured size threshold, the system splits it automatically and distributes the resulting regions across region servers. Auto-sharding therefore means splitting regions and serving them from different servers.
Atomic Read and Write – Atomicity means that an operation either completes fully or does not happen at all. HBase provides this guarantee at the row level: a read never sees a partially applied write to a row.
Real-time and Random big data access – It accepts random reads and writes in real time and stores data internally using a log-structured merge tree. This storage scheme periodically merges smaller files into larger ones, which keeps the number of files small and reduces overall disk usage.
Built-in support of MapReduce – It provides fast and parallel processing of stored data using the built-in Hadoop MapReduce framework support.
API Support – It provides strong Java API support (client/server) for easy development and programming.
Shell Support – It provides a command-line tool to interact with HBase and perform simple operations like creating a table, adding data, etc.
Sparse & multidimensional database – It is a sparse, multidimensional, sorted map-based database that supports multiple versions of the same record.
Snapshot support – It allows you to take snapshots of a table's metadata so that the table can later be restored to an earlier, known-good state.
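The "sparse, multidimensional, sorted map" description in the feature list can be made concrete with a small model. The sketch below is plain Python, not the HBase client API. The class name `TinyTable`, its methods, and the `max_versions` parameter are hypothetical and exist only for illustration. It maps (row key, column, timestamp) to a value, keeps row keys in sorted order, and retains a limited number of versions per cell.

```python
import bisect
from collections import defaultdict


class TinyTable:
    """Toy model of the data model: (row, column, timestamp) -> value.

    Hypothetical illustration only; this is not HBase code.
    """

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        self._row_keys = []  # row keys kept in sorted order
        # row -> column -> list of (timestamp, value), newest first
        self._cells = defaultdict(dict)

    def put(self, row, column, value, ts):
        if row not in self._cells:
            bisect.insort(self._row_keys, row)
        versions = self._cells[row].setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]  # drop versions beyond the limit

    def get(self, row, column, ts=None):
        """Return the newest value at or before ts (latest if ts is None)."""
        for version_ts, value in self._cells.get(row, {}).get(column, []):
            if ts is None or version_ts <= ts:
                return value
        return None  # sparse: a missing cell simply does not exist

    def scan(self, start_row, stop_row):
        """Yield (row, latest columns) for start_row <= row < stop_row."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        for row in self._row_keys[lo:hi]:
            yield row, {c: v[0][1] for c, v in self._cells[row].items()}
```

Writing the same cell twice with different timestamps keeps both versions, so `get` can return either the latest value or the value as of an earlier timestamp. Rows that never received a given column store nothing for it, which is what "sparse" means here.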
HBase Architecture consists mainly of four components:
HMaster
HRegions
HRegionserver
ZooKeeper
In HBase, HMaster is the implementation of the Master server. It monitors all Region Server instances in the cluster and is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the NameNode host. The Master also runs a number of background threads. HMaster is responsible for the following functions in HBase.
- HMaster handles administrative operations and distributes services to region servers.
- HMaster assigns regions to region servers.
- It plays a critical role in cluster performance and node maintenance.
- HMaster controls load balancing and failover, distributing load across the cluster's nodes.
- HMaster handles every schema or metadata change that a client requests:
Table – create, delete, enable, disable
Column family – add, modify
Region – move, assign
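The assignment and failover duties listed above can be illustrated with a deliberately simple model. The following sketch is an assumption-laden toy, not HBase's balancer: the function names `assign_regions` and `reassign_on_failure` are hypothetical, and the only criterion used is the number of regions per server, whereas a real balancer weighs more factors. It also assumes at least one server survives a failure.

```python
import heapq


def assign_regions(regions, servers):
    """Give each region to the server currently holding the fewest regions."""
    heap = [(0, server) for server in sorted(servers)]
    heapq.heapify(heap)
    assignment = {server: [] for server in servers}
    for region in regions:
        load, server = heapq.heappop(heap)  # least-loaded server so far
        assignment[server].append(region)
        heapq.heappush(heap, (load + 1, server))
    return assignment


def reassign_on_failure(assignment, failed_server):
    """Move the failed server's regions onto the surviving servers."""
    orphaned = assignment.pop(failed_server)
    for region in orphaned:
        # Pick the survivor with the fewest regions; ties broken by name.
        target = min(assignment, key=lambda s: (len(assignment[s]), s))
        assignment[target].append(region)
    return assignment
```

With six regions and three servers, each server receives two regions. If one server then fails, its two regions are spread over the remaining two servers so that the load stays even.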
HRegionServer is the implementation of the Region Server. It serves and manages the regions, and therefore the data, assigned to it in a distributed cluster. Region servers run on the DataNodes of the Hadoop cluster. HMaster coordinates several HRegionServers, each of which performs the following tasks.
Region hosting and management
Automatic region splitting
Processing read and write requests
Direct communication with the client
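Automatic region splitting is easier to picture with a small model. The sketch below is illustrative Python, not HBase code. The `Region` class, the `split_if_needed` function, and the choice of the middle row key as the split point are all assumptions made for the example, and HBase's own split-point selection works differently. The `max_bytes` parameter stands in for the configured size threshold.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    start_key: str  # inclusive; "" means unbounded
    end_key: str    # exclusive; "" means unbounded
    rows: dict = field(default_factory=dict)  # row key -> size in bytes

    def size(self):
        return sum(self.rows.values())


def split_if_needed(region, max_bytes):
    """Return [region] unchanged, or two daughter regions if it is too large."""
    if region.size() <= max_bytes or len(region.rows) < 2:
        return [region]
    keys = sorted(region.rows)
    split_key = keys[len(keys) // 2]  # middle row key becomes the boundary
    left = Region(region.start_key, split_key,
                  {k: v for k, v in region.rows.items() if k < split_key})
    right = Region(split_key, region.end_key,
                   {k: v for k, v in region.rows.items() if k >= split_key})
    return [left, right]
```

Because the daughters cover adjacent, non-overlapping key ranges, every row still belongs to exactly one region after the split, and the two halves can then be served by different region servers.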
HRegions are the basic building blocks of an HBase cluster. Each region holds a portion of a table's rows and is made up of column families. A region has several stores, one per column family. Each store is primarily made up of two components:
MemStore – an in-memory write buffer
HFile – the on-disk file to which the MemStore is flushed
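How a store's two components cooperate can be shown with a toy model. This is a minimal Python sketch under stated assumptions, not HBase's implementation: the `Store` class and its `flush_threshold` (counted in entries rather than bytes) are hypothetical, files are plain sorted lists held in memory, and the write-ahead log is left out entirely.

```python
class Store:
    """Toy model of one column-family store: a MemStore plus immutable files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # max entries held in memory
        self.memstore = {}  # in-memory write buffer
        self.hfiles = []    # flushed files, oldest first; each a sorted list

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the MemStore out as one sorted, immutable file."""
        if self.memstore:
            self.hfiles.append(sorted(self.memstore.items()))
            self.memstore = {}

    def get(self, key):
        """Check the MemStore first, then files from newest to oldest."""
        if key in self.memstore:
            return self.memstore[key]
        for hfile in reversed(self.hfiles):
            for k, v in hfile:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all files into one, keeping the newest value per key."""
        merged = {}
        for hfile in self.hfiles:  # oldest first, so newer values overwrite
            merged.update(hfile)
        self.hfiles = [sorted(merged.items())] if merged else []
```

Writes only ever touch the in-memory buffer, and files are never modified once flushed. Reads may have to consult several files, which is why merging them during compaction matters.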
In HBase, ZooKeeper is a centralized coordination service that stores configuration data and provides distributed synchronization. Distributed synchronization means coordinating the distributed processes running across the cluster so that the nodes share a consistent view of its state.
Clients locate region servers through ZooKeeper. ZooKeeper is an open-source project that provides several essential services.
- Keeps configuration information.
- Establishes client communication with region servers through distributed synchronization.
- Provides ephemeral nodes that represent the individual region servers.
- Master servers can use ephemeral nodes to discover available servers in a cluster.
- Keeps track of server failures and network partitions.
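The role of ephemeral nodes in the list above can be sketched with a small stand-in. The code below is not the ZooKeeper API. `TinyCoordinator`, its methods, and the example paths used with it are hypothetical. It models only the one property that matters here: a node disappears when the session that created it ends, and anyone watching is told about it.

```python
class TinyCoordinator:
    """Toy stand-in for ephemeral nodes; not the ZooKeeper API."""

    def __init__(self):
        self._nodes = {}     # path -> id of the session that owns the node
        self._watchers = []  # callbacks invoked with each deleted path

    def create_ephemeral(self, path, session_id):
        self._nodes[path] = session_id

    def watch_deletions(self, callback):
        self._watchers.append(callback)

    def children(self, prefix):
        """List live nodes under prefix, i.e. the servers currently available."""
        return sorted(p for p in self._nodes if p.startswith(prefix + "/"))

    def expire_session(self, session_id):
        """Delete every node owned by the session and notify watchers."""
        dead = [p for p, owner in self._nodes.items() if owner == session_id]
        for path in dead:
            del self._nodes[path]
            for callback in self._watchers:
                callback(path)
```

In this model, each region server would register a node under a shared parent path when it starts. A master lists the children to discover available servers and registers a watcher to learn about failures, at which point it can reassign the failed server's regions.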
HBase finds its use in domains where –
- Massive amounts of non-relational data (petabytes) with a variable schema are the norm.
- The data needs fast random access.
- HDFS is available along with a large number of nodes.
Now, let us look at some important applications of HBase.
Sports: It is used in sports to store match history for improved analytics and prediction.
Medical: It is used in the medical industry to store genomic sequences and to record the disease history of people and regions.
E-commerce: It is used in e-commerce to capture and store consumer logs, including search history, for analytics and subsequent targeted advertising.
Web: It is used to maintain user history and preferences in order to improve consumer targeting.
Conclusion
We have briefly covered what HBase is, situations where it is used, and HBase features and architecture.
In conclusion, here are some of the key takeaways from the article –
- HBase has proved to be a powerful tool that runs on top of existing Hadoop environments.
- HBase is a popular NoSQL database with high throughput and low latency.
- Since its release, HBase has garnered developer support from many companies and has been adopted for production deployment.
- HBase now has strong developer and user communities.
- It is a top-level Apache project that has become a core infrastructure component, run at production scale in several large organizations worldwide such as Facebook, Twitter, Salesforce, Trend Micro, and Adobe.