Evolution of Region Assignment in the Apache HBase Architecture

The first part of this three-part series of blog posts provided an introduction to some of the important design aspects of Apache HBase. We introduced the concept of the AssignmentManager and the importance of its role in the HBase architecture. In this second post, we will cover how the redesigned AssignmentManager in HBase versions 2.x efficiently and reliably manages the process of region assignment.

Brief Introduction to the ProcedureV2 (Pv2) Framework

A Procedure represents an operation or a set of operations to be performed step-by-step, as a sequence of logged and restartable processes. The HBase Procedure framework, inspired by Apache Accumulo’s FATE framework, implements the supporting infrastructure, providing durability and reliability. Procedures are executed by worker threads allocated for ProcedureExecutor instances, which are managed by the active coordinator. A procedure can create one or more sub-procedures to achieve different goals, each themselves requiring durable state transitions, as part of the overall operation. Each ProcedureExecutor instance is associated with a ProcedureStore for durable persistence of procedure state. All procedures, and in turn their sub-procedures, can resume pending/in-progress execution whenever there might be a machine failure or service interruption. All these components are part of the Pv2 core framework, introduced in HBASE-13202.

HBase 2 provides two implementations of ProcedureStore:

A write-ahead-log based implementation, the WALProcedureStore.
A coordinator-local-region based implementation, the RegionProcedureStore, which is the default implementation since HBase 2.3.

When executing a Procedure, the ProcedureExecutor logs and tracks its current state with the ProcedureStore. A Procedure can be in exactly one of these states at any given time:

INITIALIZING — Procedure in construction, not yet submitted to the executor.
RUNNABLE — Procedure added to the executor and ready to be executed.
WAITING — The procedure is waiting for its children (sub-procedures) to complete.
WAITING_TIMEOUT — The procedure is waiting on a timeout or an external event.
ROLLEDBACK — The procedure is rolled back because the procedure or one of the sub-procedures has failed. The rollback step is expected to clean up resources created during the execution step. In case of failure and restart, rollback may be called multiple times; hence, the rollback step must be idempotent.
SUCCESS — The procedure has been completed successfully without any failures.
FAILED — The procedure has been executed at least once and has failed. The procedure may or may not have rolled back yet. Any procedure in the FAILED state will be eventually moved to the ROLLEDBACK state.

Both execution and rollback steps must be idempotent as either may be called multiple times in case of machine or process failure in the middle of an ongoing execution.
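To make the idempotency requirement concrete, here is a minimal sketch of a restartable procedure. It is not the HBase Pv2 API: the class `ProcedureSketch`, its in-memory `Store`, the step names, and the `crashAfter` parameter are all hypothetical stand-ins chosen for illustration. The sketch simulates a crash that happens after a step has run but before its progress is persisted, so the step is executed a second time on restart.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the HBase Pv2 API: a procedure whose progress is
// logged to a store and whose steps can safely run more than once.
final class ProcedureSketch {
    enum State { INITIALIZING, RUNNABLE, SUCCESS, ROLLEDBACK }

    // Stand-in for a durable ProcedureStore: remembers the next step to run.
    static final class Store {
        int nextStep = 0;
        State state = State.INITIALIZING;
    }

    // Resources created by the steps; adding to a Set is naturally idempotent.
    final Set<String> resources = new HashSet<>();
    final List<String> steps = List.of("prepare", "update-meta", "finish");

    // Resumes from the last persisted step. If crashAfter >= 0, execution stops
    // right after performing that step but before persisting progress, so the
    // same step is repeated on the next call.
    State execute(Store store, int crashAfter) {
        store.state = State.RUNNABLE;
        while (store.nextStep < steps.size()) {
            int i = store.nextStep;
            resources.add(steps.get(i));   // re-running this is a no-op
            if (i == crashAfter) {
                return store.state;        // simulated crash: progress not logged
            }
            store.nextStep = i + 1;        // persist progress
        }
        store.state = State.SUCCESS;
        return store.state;
    }

    // Rollback cleans up whatever exists; calling it twice gives the same result.
    State rollback(Store store) {
        resources.clear();
        store.nextStep = 0;
        store.state = State.ROLLEDBACK;
        return store.state;
    }
}
```

Because the repeated step only re-adds an entry that already exists, the interrupted run and the resumed run together leave exactly the same resources behind as a single uninterrupted run would.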

Region Transition Workflows In The AssignmentManagerV2

As we have covered in the previous post in this series, the AssignmentManagerV2, or AMv2, is the improved version of the AssignmentManager available in HBase versions 2.x. The AMv2 implements all of the region management and assignment state transition activity as Procedures. Now, let’s take a deeper look into how the AMv2 operates. Remember that only the active coordinator is in charge of executing the AMv2 procedures.

At a high level, region transition procedures consist of a series of steps and/or multiple sub-procedures to be executed. Such steps might include:

Region closing (CloseRegionProcedure)
Housekeeping of discarded regions after merge (GCMultipleMergedRegionsProcedure)
Housekeeping of discarded regions after split (GCRegionProcedure)
Region merging (MergeTableRegionsProcedure)
Region opening (OpenRegionProcedure)
Region splitting (SplitTableRegionProcedure)
Region assignment state transition (TransitRegionStateProcedure)

Any of these steps could be of type RemoteProcedure, which executes some action on a remote regionserver. For instance, OpenRegionProcedure and CloseRegionProcedure are both remote procedures that try to open (assign) and close (unassign) a region, respectively, on a remote regionserver. All remote procedures, when executed by the active coordinator, eventually use ExecuteProceduresRemoteCall to execute a Runnable action at the destination regionserver. Both OpenRegionProcedure and CloseRegionProcedure are used as sub-procedures by many region transition procedures because opening and closing a region is part of multiple workflows: split, merge, move, reopen, assign, and unassign.

The AMv2 uses Pv2-provided locks to prevent simultaneous execution of operations that might mutate table schema or region assignment state (more details in HBASE-16744). There are two types of locks: EXCLUSIVE and SHARED. An exclusive lock on an entity does not allow any simultaneous operation on that entity. In contrast, multiple shared locks can be held on a given entity as long as no simultaneous update of that entity is required (an update requires the exclusive lock). For instance, multiple regions of the same table can be assigned, unassigned, split, or merged at the same time as long as each operation works on different regions. The same region must never be assigned and split at the same time. Hence, an exclusive lock on a region ensures that only one procedure has access to that region, while multiple procedures may concurrently hold shared locks at the table and namespace level. If we then try to modify, enable, or disable the table, that operation requires an exclusive lock on the table, and an exclusive lock can be taken only if no other exclusive or shared locks are held by other procedures. The procedure that modifies the table will therefore wait to acquire the exclusive lock on the table until all region-level operations holding the shared lock on the table have completed. For the majority of region transition operations, procedures take exclusive locks.

Any procedure acquiring an exclusive lock on a schema resource also acquires a shared lock on the parent resources in the schema hierarchy, for example, the table (to which the region belongs) and the namespace (to which the table belongs). This prevents any concurrent changes to the containing table or namespace while a procedure is operating on region state. Operations on other tables and namespaces are allowed unless those are locked by another procedure. Similarly, any procedure acquiring an exclusive lock on a table also acquires a shared lock on the namespace to which the table belongs, and an exclusive lock on a namespace prevents concurrent operations on it or any subordinate resource. Collectively, this locking maintains the invariant that, for any given modification of schema or region state, no parent or subordinate resources are concurrently modified. The Pv2 framework handles the safe release of locks and cleanup after failed procedures in the event that server failure or other crash cases prevent orderly execution and completion under lock.
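The locking rules above can be sketched as a small model. This is an illustration under simplifying assumptions, not the HBase lock implementation: `SchemaLockSketch` and its methods are hypothetical, resources are identified by plain strings, and a lock attempt that cannot proceed simply returns false, whereas a real procedure would be suspended and resumed later.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the HBase lock implementation: shared/exclusive
// locks over a namespace -> table -> region hierarchy.
final class SchemaLockSketch {
    // A resource has either one exclusive holder or any number of shared holders.
    private static final class Counts { int shared; boolean exclusive; }

    private final Map<String, Counts> locks = new HashMap<>();

    private Counts counts(String resource) {
        return locks.computeIfAbsent(resource, r -> new Counts());
    }

    private boolean canShare(String r) { return !counts(r).exclusive; }

    private boolean canExclude(String r) {
        Counts c = counts(r);
        return !c.exclusive && c.shared == 0;
    }

    // Region operation: exclusive on the region, shared on its table and namespace.
    boolean tryLockRegion(String ns, String table, String region) {
        if (!canShare(ns) || !canShare(table) || !canExclude(region)) return false;
        counts(ns).shared++;
        counts(table).shared++;
        counts(region).exclusive = true;
        return true;
    }

    void unlockRegion(String ns, String table, String region) {
        counts(ns).shared--;
        counts(table).shared--;
        counts(region).exclusive = false;
    }

    // Table operation (modify/enable/disable): exclusive on the table,
    // shared on its namespace.
    boolean tryLockTable(String ns, String table) {
        if (!canShare(ns) || !canExclude(table)) return false;
        counts(ns).shared++;
        counts(table).exclusive = true;
        return true;
    }
}
```

In this model, two procedures can lock two different regions of the same table at once, while a table-level lock on that table is refused until both region locks have been released.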

Some of the steps in the region transition procedure diagrams mention executing coprocessor hooks. Coprocessors are a framework, loosely inspired by Google’s BigTable coprocessors, for flexible and generic extension of core implementation logic. We won’t cover Coprocessors here. For an introduction to coprocessors, please see https://blogs.apache.org/hbase/entry/coprocessor_introduction.

Region Assignment

If a region is ONLINE, it is available and can receive and process requests from clients; otherwise, it cannot. The act of region assignment means to transition a region ultimately to ONLINE state on a given regionserver. The procedure used to drive region assignment is TransitRegionStateProcedure (TRSP). TRSP is used for any of these operations:

ASSIGN — assign region i.e. bring region ONLINE on a server.
UNASSIGN — unassign region i.e. take the region OFFLINE on its current server.
MOVE — move a region from the server (or regionserver) A to another server B i.e. UNASSIGN on server A followed by ASSIGN on server B.
REOPEN — reopen a region on the same server i.e. UNASSIGN on server A followed by ASSIGN on server A.

The TRSP procedure takes an exclusive lock on the region that it is assigning/un-assigning/moving/reopening and a shared lock on the corresponding table and namespace.
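The four operation types listed above differ only in which close and open actions they combine. The following sketch spells that out. It is not the TRSP implementation: `TransitionPlanSketch`, the `plan` method, and the `close@`/`open@` action strings are hypothetical names used purely for illustration.

```java
import java.util.List;

// Hypothetical sketch, not the HBase TRSP implementation: expand each
// transition type into the close/open actions it implies.
final class TransitionPlanSketch {
    enum Type { ASSIGN, UNASSIGN, MOVE, REOPEN }

    // "current" is the server hosting the region (unused for ASSIGN);
    // "target" is the destination server (used by ASSIGN and MOVE only).
    static List<String> plan(Type type, String current, String target) {
        switch (type) {
            case ASSIGN:
                return List.of("open@" + target);
            case UNASSIGN:
                return List.of("close@" + current);
            case MOVE:
                return List.of("close@" + current, "open@" + target);
            case REOPEN:
                return List.of("close@" + current, "open@" + current);
            default:
                throw new IllegalArgumentException("unknown type " + type);
        }
    }
}
```

For example, `plan(Type.MOVE, "A", "B")` yields a close on server A followed by an open on server B, while `plan(Type.REOPEN, "A", null)` closes and reopens the region on server A.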

The region assignment procedure goes through a series of steps, starting with finding a suitable target regionserver and continuing with the child procedure OpenRegionProcedure, which tells that regionserver to open the region. If the regionserver fails to open the region, the request is retried until the region is successfully opened or until all retries are exhausted. If all retries are exhausted, the region is marked FAILED_OPEN and the procedure is marked FAILED; the AMv2 will retry the assignment at a later time. There are many runtime configuration options available for fine-tuning the details of this process.
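The retry behavior described above can be sketched roughly as follows. This is a simplified illustration, not HBase code: `AssignRetrySketch`, the `tryOpen` callback, the round-robin choice of target server, and the `maxAttempts` parameter are all assumptions standing in for the real target selection logic and configuration options.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch, not HBase code: retry opening a region on candidate
// servers until one succeeds or the attempt budget is exhausted.
final class AssignRetrySketch {
    enum Outcome { ONLINE, FAILED_OPEN }

    // tryOpen simulates asking a regionserver to open the region.
    // maxAttempts is an assumed tunable, not a real HBase setting name.
    static Outcome assign(List<String> servers, Predicate<String> tryOpen, int maxAttempts) {
        if (servers.isEmpty()) {
            return Outcome.FAILED_OPEN;       // no candidate server to try
        }
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            // Round-robin target selection, chosen here only for simplicity.
            String target = servers.get(attempt % servers.size());
            if (tryOpen.test(target)) {
                return Outcome.ONLINE;        // region opened on target
            }
        }
        return Outcome.FAILED_OPEN;           // all retries exhausted
    }
}
```

A caller that receives `FAILED_OPEN` would schedule another assignment attempt later, which mirrors the AMv2 retrying the failed procedure.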

Region Splitting

When a region grows too large as data is added, it is split in two at the midpoint of its keyspace, creating two roughly equal halves. This is a built-in automatic process, but a split can also be triggered manually by an administrator. The procedure responsible for managing a region split is known as SplitTableRegionProcedure (STRP). It consists of the steps needed to split a single region into two. The region being split is known as the parent region, and the two new regions created by the split are known as daughter regions. The procedure takes an exclusive lock on the parent region as well as on the two new daughter regions and holds the locks until the procedure is completed. If the split procedure fails to acquire exclusive locks on the parent and/or daughter regions, or if it fails to acquire the shared lock on the corresponding table and namespace, then the procedure is suspended and resumed later to avoid conflict with any other procedure holding the region-level exclusive lock.

Rollback is not supported after reaching the SPLIT_TABLE_REGION_UPDATE_META transition in the state diagram, as it becomes too complex to roll back once we reach this point. Instead, the AMv2 will roll forward when performing failure recovery. The majority of operations performed as part of this state involve updating in-memory region states of parent and daughter regions and META table entries. Any failures to update these entries result in subsequent auto-retries.
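The rule of rolling back before the meta update and rolling forward after it can be sketched as below. This is not HBase code: `SplitRecoverySketch` is hypothetical, and its step names are simplified placeholders, with `UPDATE_META` standing in for SPLIT_TABLE_REGION_UPDATE_META.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not HBase code: a split-like workflow that may roll
// back only before the meta update and must roll forward from there on.
final class SplitRecoverySketch {
    // Simplified placeholder steps, declared in execution order.
    enum Step { PREPARE, CLOSE_PARENT, CREATE_DAUGHTERS, UPDATE_META, OPEN_DAUGHTERS, DONE }

    enum Recovery { ROLL_BACK, ROLL_FORWARD }

    // Decide how to recover from a failure observed at the given step.
    static Recovery onFailure(Step failedAt) {
        return failedAt.compareTo(Step.UPDATE_META) < 0
            ? Recovery.ROLL_BACK
            : Recovery.ROLL_FORWARD;
    }

    // Steps that still have to be retried and completed when rolling forward.
    static List<Step> remainingSteps(Step from) {
        List<Step> out = new ArrayList<>();
        for (Step s : Step.values()) {
            if (s.compareTo(from) >= 0 && s != Step.DONE) {
                out.add(s);
            }
        }
        return out;
    }
}
```

In this model a failure while closing the parent leads to a rollback, whereas a failure during the meta update leaves only one option: keep retrying the meta update and then open the daughters.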

This procedure consists of multiple sub-procedures that unassign the parent region and assign the daughter regions using TransitRegionStateProcedure (for both ASSIGN and UNASSIGN). The parent region is transitioned to SPLITTING and both new daughter regions are transitioned to SPLITTING_NEW until the split is complete. Once the split is successful, both daughter regions will be ONLINE and ready to receive and process requests from clients. The parent region is brought OFFLINE during the initial stages and is eventually either archived or deleted, depending on relevant runtime configuration settings.

Region Merging

Region merging is the inverse activity to region splitting. Multiple lexicographically adjacent regions of a table can be merged to form a new, single, region. Region merging is a built-in automatic process, which, if enabled, helps to normalize the size of table regions. Splitting and merging are the two complementary parts of a global optimization process that auto-shards table data into roughly equal sections. Merging can also be triggered by an administrator manually. The procedure responsible for merging multiple regions into one is known as MergeTableRegionsProcedure (MTRP). The procedure takes exclusive locks on all merging (parent) regions and the new merged (child) region and holds the locks until the procedure is completed. If the merge procedure fails to acquire shared locks on the corresponding table and namespace, or if it fails to acquire exclusive locks on merging and merged regions, then the procedure is suspended and resumed later to avoid conflicts with any ongoing procedures holding the exclusive locks on these regions.

Rollback is not supported after reaching the state MERGE_TABLE_REGIONS_UPDATE_META mentioned in the state diagram, as it becomes too complex to roll back once we reach this point. Instead, the AMv2 will roll forward when executing failure recovery steps. The majority of operations performed as part of this state involve updating in-memory region states of merging and merged regions and META table entries. Any failures to update these entries result in subsequent auto-retries.

This procedure consists of multiple sub-procedures that unassign the merging regions and assign the newly merged region using TransitRegionStateProcedure (for both ASSIGN and UNASSIGN). All parent (or merging) regions are transitioned to MERGING and the new region is transitioned to MERGING_NEW until the merge is complete. Once the merge is successful, the newly merged region comes ONLINE and is ready to receive and process requests from clients. All previously merging regions are brought OFFLINE during the initial stages and are eventually either archived or deleted, depending on relevant runtime configuration settings.

Region Movement (Balancing) And Reopening

The active coordinator is responsible for balancing the placement of regions and load over the fleet of regionservers. As part of this process, the AMv2 might be asked to move regions from one regionserver to another. The process to do this consists of the Unassign Procedure to unassign the region on the source server, followed by the Assign Procedure to assign the same region to a different destination server.

Regions may also need to be reopened after a schema modification. When the schema for a table is altered, all regions of the table are very quickly closed and reopened, in order for the schema modifications to take effect. The process of region reopening follows the same sequence as region movement, with the only difference being that the Assign procedure’s destination server is the same as the source server.

Conclusion

In conclusion, the AssignmentManagerV2, or AMv2, available in HBase 2 releases is more robust, resilient, fault-tolerant, and scalable than the AM provided by previous HBase releases. Many of the shortcomings of the earlier design that shipped in HBase 1 have been addressed. With the improved retry mechanisms, the single-writer system, better state management, improved shared/exclusive locking, and durable and reliable logging of state transitions, HBase 2 achieves significantly improved availability and reliability in operation.

In the third part of the blog post series, we will cover the internal mechanism of two of the most complex and critical Procedures: ServerCrashProcedure (SCP) and TransitRegionStateProcedure (TRSP), and also cover how they coordinate to achieve reliable and robust region transition workflows during server shutdown events.

References

HBASE-14350 Procedure V2 Phase 2: Assignment Manager (including all sub-tasks)
HBASE-14614 Procedure v2: Core Assignment Manager
HBASE-24408 Introduce a general ‘local region’ to store data on the active coordinator
