
Evolution of Region Assignment in the Apache HBase Architecture — Part 3

Viraj Jasani
Oct 07 - 10 min read

Written by Viraj Jasani, Andrew Purtell & 张铎 (Duo Zhang)

In the second part of this blog post series, we provided an overview of how the redesigned AssignmentManager in HBase 2 efficiently and reliably manages the process of region assignment. In this third entry, we will focus on the complexity of ServerCrashProcedure (SCP) and TransitRegionStateProcedure (TRSP) — two of the most critical procedures in the HBase architecture — and how they have evolved over time to become more robust and scalable.

What is ServerCrashProcedure (SCP)?

The HBase coordinator tracks the status of all regionservers by setting watchers on znodes in ZooKeeper that are created ephemerally by regionservers when they start up. (See RegionServerTracker.) A regionserver can be stopped either gracefully or non-gracefully. A graceful stop means we first successfully move all regions from the given server to other servers in the cluster before stopping it. This minimizes the region-level availability impact, because by the time the server is finally stopped, all of its regions have already been moved to other servers. If we stop a server without first moving the regions it is hosting off to other servers, this is a non-graceful stop. When any server is stopped non-gracefully, or the server crashes due to process, hardware, or operating system failure, the regions it was hosting still have to be redistributed to other servers in the cluster. This task falls to the coordinator’s AssignmentManager.

Because the regionservers register themselves as available for serving with, among other things, the creation of an ephemeral znode, ZooKeeper handles liveness tracking of the regionserver fleet for us. A stopped or crashed server’s ephemeral znode is deleted once the server’s ZooKeeper session expires, after it misses its heartbeats. As we said earlier, the coordinator has registered a watcher to receive notification of such events. When that happens, the coordinator initiates a ServerCrashProcedure (SCP) for the dead server. If the dead server held any regions before being brought down (i.e., a non-graceful stop), SCP takes care of redistributing the regions to other live servers in the cluster.
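
To make the liveness-tracking mechanism concrete, here is a minimal sketch using the plain ZooKeeper client API. The parent znode path, class name, and onServerCrashed hook are illustrative stand-ins, not HBase’s actual RegionServerTracker code, but the watch-and-diff pattern is the same.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Minimal sketch of ephemeral-znode liveness tracking, in the spirit of
// RegionServerTracker. Paths and method names are illustrative, not HBase's.
public class LivenessTracker implements Watcher {
  private static final String RS_ZNODE = "/hbase/rs"; // hypothetical parent znode
  private final ZooKeeper zk;
  private Set<String> liveServers = new HashSet<>();

  public LivenessTracker(ZooKeeper zk) { this.zk = zk; }

  // Re-read the children of the parent znode and diff against the last view.
  // Passing `this` re-arms the one-shot ZooKeeper watch on every refresh.
  public synchronized void refresh() throws Exception {
    List<String> children = zk.getChildren(RS_ZNODE, this);
    Set<String> current = new HashSet<>(children);
    for (String server : liveServers) {
      if (!current.contains(server)) {
        onServerCrashed(server); // ephemeral znode gone: session expired or server stopped
      }
    }
    liveServers = current;
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Event.EventType.NodeChildrenChanged) {
      try {
        refresh();
      } catch (Exception e) {
        e.printStackTrace();
      }
    }
  }

  private void onServerCrashed(String serverName) {
    // In HBase, this is where the coordinator would schedule a ServerCrashProcedure.
    System.out.println("Scheduling SCP for " + serverName);
  }
}
```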

Let’s take a look at some of the key responsibilities of ServerCrashProcedure:

  1. A write-ahead log (WAL) file is maintained per server, but when applying edits from the WAL during recovery, they are applied at the region level. The WAL file must be split into per-region files for recovery. SCP manages the process of splitting the WALs of the dead server, producing region-specific recovered.edits files. The WAL splitting process is a distributed process where WAL split tasks are assigned to other live regionservers.
  2. If the dead server hosted the special meta region, SCP must open the meta region elsewhere before reassigning any other regions.
  3. SCP is responsible for updating the in-memory state of the coordinator. The coordinator tracks the state of all regionservers. Valid server states are ONLINE, CRASHED, SPLITTING_META, SPLITTING_META_DONE, SPLITTING, and OFFLINE (see the sketch after this list).
  4. If the dead server had a replication queue assigned to it, SCP must reassign the replication queues of the dead server to other live servers (to be released in versions 3.0.0 and 2.5.0; for more details, see HBASE-26029).
  5. SCP must coordinate with TRSP to ensure smooth transition of the regions that were previously present on the dead server.
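
For reference, the coordinator-side server states from item 3 can be pictured as a simple enum. The trailing comments are our paraphrase of each state’s place in the SCP lifecycle, not HBase’s exact documentation.

```java
// The coordinator-side server states from the list above, as an enum sketch.
enum ServerState {
  ONLINE,              // server is live and serving
  CRASHED,             // server death detected; SCP has been scheduled
  SPLITTING_META,      // SCP is splitting the dead server's meta WAL
  SPLITTING_META_DONE, // meta WAL splitting has finished
  SPLITTING,           // SCP is splitting the dead server's remaining WALs
  OFFLINE              // WAL splitting done; regions can be reassigned
}
```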

How does ServerCrashProcedure work?

SCP is implemented as a Procedure, just like the other AssignmentManager workflows we have discussed in the previous posts in this series (Part 1, Part 2). Its primary responsibility is to split the write-ahead logs (WALs) that belonged to the crashed server into region-specific files ready for replay when the region is opened again. WAL splitting activity can be categorized into two variants with separate implementations:

  1. ZooKeeper-coordinated WAL splitting
  2. Procedure-based WAL splitting, the default in HBase versions 2.4 and above.

ZooKeeper-coordinated WAL splitting is a legacy feature. We will focus on the Procedure-based WAL splitting process exclusively.

ServerCrashProcedure is executed by the active coordinator to initiate the necessary recovery actions after a server is stopped or has crashed. The special meta region must always be available, so this is the first order of business: if the dead server hosted the meta region, SCP first executes a slightly modified region assignment workflow to ensure the meta region is available (ONLINE) again. This must be done before we can split any WALs. Then SCP schedules child SplitWALProcedures, one for each WAL file. Each SplitWALProcedure internally schedules its own child procedure, SplitWALRemoteProcedure, to delegate the WAL splitting work to one of the remaining live regionservers, chosen at random. In the context of WAL splitting, we call the selected regionserver a worker, and each SplitWALProcedure selects its own worker. SplitWALManager tracks the workers for all SplitWALProcedures and provides utility logic for assigning and releasing workers as required. A SplitWALProcedure assigns a worker for its WAL splitting task, delegates the task to the worker, and, once the task completes, releases the worker so that it can be acquired by another procedure.

We limit the number of tasks that can be assigned to a given worker to avoid excessive resource consumption, which could lead to additional failures during recovery. This limit is configured by hbase.regionserver.wal.max.splitters; the default value is 2. The peak concurrency of WAL splitting activity is therefore (number of online regionservers) * (max splitters configured). If all live servers are already running the maximum number of splitting tasks, a SplitWALProcedure has to wait, and resumes only after a worker becomes available to take up further tasks. This strategy also ensures that WAL splitting tasks are distributed evenly among the available online servers: if a few regionservers are processing tasks more slowly than others, the slow servers will not become a bottleneck.
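
Here is a rough sketch of that worker accounting. The class and method names are ours, not the SplitWALManager API; it only illustrates the acquire-or-wait behavior described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative worker accounting for WAL splitting: at most
// maxSplittersPerServer concurrent tasks per live regionserver
// (cf. hbase.regionserver.wal.max.splitters, default 2).
public class SplitWorkerPoolSketch {
  private final int maxSplittersPerServer;
  private final Map<String, Integer> tasksPerServer = new HashMap<>();

  public SplitWorkerPoolSketch(int maxSplittersPerServer) {
    this.maxSplittersPerServer = maxSplittersPerServer;
  }

  // Try to acquire a worker with spare capacity. An empty result means all
  // workers are at capacity, so the procedure must suspend until release().
  public synchronized Optional<String> acquire(Iterable<String> liveServers) {
    for (String server : liveServers) {
      int busy = tasksPerServer.getOrDefault(server, 0);
      if (busy < maxSplittersPerServer) {
        tasksPerServer.put(server, busy + 1);
        return Optional.of(server);
      }
    }
    return Optional.empty();
  }

  // Release one task slot so a waiting SplitWALProcedure can acquire it.
  public synchronized void release(String server) {
    tasksPerServer.merge(server, -1, Integer::sum);
  }
}
```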

At the regionserver (recall that we call them workers in the context of WAL splitting), WALSplitter is responsible for splitting the WAL files. HBase affords a lot of flexibility in how low-level region-specific details may be structured and managed, via a plugin model, and so the WAL split process must also be flexible. OutputSink provides a common interface to support different ways of splitting a WAL file into region-specific replay files, depending on the implementation of low-level region details. By default, three writer threads are used for this operation; however, the number of writer threads can be adjusted with the configuration option hbase.regionserver.hlog.splitlog.writer.threads (some interesting work: HBASE-19358, HBASE-23286). Once the WAL splitting tasks for a given SplitWALProcedure have all successfully completed, SCP cleans up the dead regionserver’s WALs and completes its work by reassigning all of the regions that were previously hosted (ONLINE) by the dead server to other online servers in the cluster, using the normal region assignment procedure, TransitRegionStateProcedure (TRSP).
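
Conceptually, the split itself is a group-by-region pass over the dead server’s WAL entries. The sketch below uses a simplified WalEntry type of our own; the real WALSplitter/OutputSink machinery streams entries through buffers and writer threads rather than materializing maps, but the bucketing idea is the same.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for a WAL entry: the region it belongs to, its
// sequence number, and the serialized edit payload.
record WalEntry(String regionName, long sequenceId, byte[] edit) {}

public class WalSplitSketch {
  // Group a dead server's WAL entries by region. Each bucket corresponds to
  // the content of that region's recovered.edits file.
  public static Map<String, List<WalEntry>> splitByRegion(List<WalEntry> wal) {
    Map<String, List<WalEntry>> buckets = new HashMap<>();
    for (WalEntry entry : wal) {
      buckets.computeIfAbsent(entry.regionName(), r -> new ArrayList<>()).add(entry);
    }
    return buckets;
  }
}
```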

The per-region files (named recovered.edits) written by SCP for recovery are processed by the regionserver as part of the normal process of opening a region. When taking on writes, the region-level mutations are collected and sorted in memory in the memstore. When the memstore reaches a configured limit, the current snapshot of its contents is aggregated and written (or flushed) to disk as a new HFile. If a recovered.edits file exists for the region, the regionserver replays those edits into the memstore before marking the region available (ONLINE). As part of this replay process, all entries in the recovered.edits file are added to the region’s memstore and then a memstore flush is initiated.

We optimize for the case where edits in the replay file are known to already be persisted in existing HFiles. This is done by assigning every edit a sequence number and tracking the highest sequence number for each store. If we find any entry in recovered.edits with a lower sequence number than the highest sequence number tracked in the store metadata, that particular entry can be skipped, because we know it has already been persisted in the region. Region metadata also tracks the maximum flushed sequence number and ensures no edit is written into the region’s memstore that has already been previously processed and persisted.
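
The replay-and-skip logic reduces to a single comparison per edit. In this sketch, the RecoveredEdit record and MemStore interface are hypothetical stand-ins for HBase’s internal types.

```java
import java.util.List;

// Hypothetical stand-ins for HBase's internal types.
record RecoveredEdit(long sequenceId, byte[] cell) {}

interface MemStore {
  void add(RecoveredEdit edit);
  void flush();
}

public class ReplaySketch {
  // Replay recovered.edits into the memstore, skipping any edit whose
  // sequence number is at or below the store's highest flushed sequence
  // number (it is already persisted in an HFile), then flush once.
  static void replay(List<RecoveredEdit> edits, long maxFlushedSeqId, MemStore memstore) {
    for (RecoveredEdit e : edits) {
      if (e.sequenceId() > maxFlushedSeqId) {
        memstore.add(e);
      }
    }
    memstore.flush();
  }
}
```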

[Figure: Server Crash Procedure]

How Did TransitRegionStateProcedure (TRSP) Emerge?

As we discussed in the previous post, we used to have multiple procedures implementing the various phases of region assignment (assign, unassign, move, reopen), and now, in the most recent release of HBase, these are executed by only a single procedure: TransitRegionStateProcedure (TRSP).

Prior to HBase 2.2, we needed some way to coordinate between the running procedures and the new procedures executed to handle server termination, and some edge cases were not correctly handled because of the complexity inherent in concurrently executing stateful processes attempting to manage the same resource. For instance, if a server dies while some procedure is trying to assign a region on it, the running procedure should be canceled and a new procedure executed to assign the region elsewhere. Similarly, if a server crashes while a procedure is trying to unassign a region, we might then have two procedures trying to accomplish the same thing: first, the previously running unassign procedure, and second, the recovery actions initiated by the server crash procedure. This is more complicated than it might first appear. We might have been running the unassign procedure as part of administrative activity to take the region offline. When we offline a region, all edits present in memstores (and, of course, in the server’s WAL) for the region going offline must be flushed to HFiles and made persistent. Before completing the unassignment of a region, the system must therefore ensure that the region is hosted on a live, healthy server, because that server is responsible for persisting the region’s data before marking it closed (OFFLINE). Coordinating this activity is crucial for maintaining data integrity. The canceled procedure must carefully roll back whatever it was doing. Rollback is always challenging and invariably has edge cases. Meanwhile, the newly executing procedure(s) must be careful not to modify anything until that rollback has been completed, and this kind of distributed coordination is prone to accidents.

Region assignment, unassignment, or movement activity must carefully coordinate with server crash handling in order to avoid duplicate or faulty region state transitions. The best way to accomplish that was to unify all of the procedures that transitioned region state into a single procedure. The unified procedure, TransitRegionStateProcedure (TRSP), was introduced, and the previously implemented region transition procedures were deprecated and no longer used by the AssignmentManager. The result is an overall system that is more reliable, more robust, and simpler in its operation. The need for coordination was reduced to just two procedures: only SCP and TRSP handle all the region transition details now, and the corner cases that arise from stopping servers are limited to this pair.

Let’s try to understand how this happens in detail. At any point in time, the system must ensure that a given region is being managed by only one TRSP procedure; more than one TRSP operating on a given region could cause invalid transitions due to unexpected concurrency. The active coordinator maintains the latest region state for a region in a data structure named RegionStateNode. When any change happens to the region state, usually the corresponding RegionStateNode is updated in memory (in AssignmentManager) before the relevant entries are updated in the meta table. RegionStateNode contains a procedure field of type TRSP that indicates which procedure is currently updating the region state. When a region needs to go through an assign/unassign/move/reopen transition, the procedure field of the RegionStateNode is set to the corresponding TRSP only if the field was null (not assigned), and when the procedure completes, the field is set back to null. This way, the system guarantees that only one TRSP (whether assign, unassign, move, or reopen) can actively transition the given region.

When SCP needs to assign a region from the dead server to one of the live servers, it takes a lock on the region’s RegionStateNode and tries to assign it only if the procedure field of the RegionStateNode is null, i.e., the region is not already going through some transition. If the region is already in transition, SCP instead wakes up the suspended procedure, if any, to complete the ongoing transition (by invoking TRSP#serverCrashed). SCP finishes the in-progress region transition by forcing the region’s location onto a live server, which ensures the region gets assigned to one of the live servers only and does not get stuck on the now-dead server. The relevant work was accomplished as part of HBASE-20881 and HBASE-23035.
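
The invariant can be captured in a few lines. This sketch is ours: HBase’s real RegionStateNode guards the field with explicit locking rather than an AtomicReference, but the claim/release protocol and the SCP-side branch are the same in spirit.

```java
import java.util.concurrent.atomic.AtomicReference;

// Stand-in for a TRSP; only the crash-handling hook matters here.
class TrspSketch {
  void serverCrashed() { /* redirect the in-flight transition to a live server */ }
}

public class RegionStateNodeSketch {
  private final AtomicReference<TrspSketch> procedure = new AtomicReference<>();

  // A TRSP claims the region only if no other TRSP currently owns it.
  boolean setProcedure(TrspSketch trsp) {
    return procedure.compareAndSet(null, trsp);
  }

  // On completion, the owning TRSP releases the region for future transitions.
  void unsetProcedure(TrspSketch trsp) {
    procedure.compareAndSet(trsp, null);
  }

  // SCP-side handling for a region hosted on a dead server: start a fresh
  // assignment if the region is idle, otherwise wake the in-flight TRSP.
  void handleServerCrash(TrspSketch newAssign) {
    if (setProcedure(newAssign)) {
      return; // region was idle: newAssign now owns it and assigns elsewhere
    }
    TrspSketch inFlight = procedure.get();
    if (inFlight != null) {
      inFlight.serverCrashed(); // in the spirit of TRSP#serverCrashed
    }
  }
}
```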

As part of the evolution of TRSP, one critical edge case worth mentioning relates to HBASE-22060, where a regionserver could report a region state transition to a dead coordinator instead of the live, active coordinator, and the dead coordinator could still accept the request. Coordinator shutdown is asynchronous by nature, and until the coordinator has fully shut down, it can still accept RPC requests. In earlier versions (before HBASE-22074), a coordinator in the process of shutting down could still update the meta table while processing the report of a region transition from a regionserver. This issue has been fixed in TRSP as part of HBASE-22074. The fix is for the coordinator to update the procedure state based on the region state transition report received from the regionserver and persist the updated procedure state to the procedure store, then resume the TRSP procedure, and finally update the meta table entry. When a coordinator is elected into the active role, as part of the initialization process, we have proper validation in place to ensure that only the active coordinator stays in charge of updating the procedure store. Hence, only the active coordinator can successfully process region transition reports from regionservers.
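
Here is a hedged sketch of the ordering introduced by the HBASE-22074 fix as described above: persist the procedure state first, then resume the TRSP, which then updates meta. All types and method names are illustrative stand-ins, not HBase’s actual API.

```java
// Illustrative interfaces; the real procedure store and scheduler differ.
interface ProcedureStore { void update(TrspProc proc); }
interface Scheduler { void wake(TrspProc proc); }

class TrspProc {
  void applyReport(Object report) { /* fold the reported transition into procedure state */ }
}

public class ReportHandlingSketch {
  ProcedureStore procedureStore;
  Scheduler scheduler;
  volatile boolean activeCoordinator;

  void reportRegionStateTransition(Object report, TrspProc trsp) {
    if (!activeCoordinator) {
      // Only the active coordinator may process reports; a coordinator that
      // is shutting down (or was deposed) must reject them.
      throw new IllegalStateException("not the active coordinator");
    }
    trsp.applyReport(report);    // 1. update the procedure state from the report
    procedureStore.update(trsp); // 2. persist it to the procedure store
    scheduler.wake(trsp);        // 3. resume the suspended TRSP...
    // 4. ...and the resumed TRSP updates the meta table entry itself
  }
}
```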

As we mentioned in the previous post, the Split and Merge region procedures also use TRSP as a child procedure to assign or unassign regions while they are being split or merged. Hence, the coordination with SCP that avoids redundant concurrent execution of the region assignment workflows, achieved internally by TRSP, also benefits these more complex region transition workflows.

We are also planning to publish a series of blog posts covering the process of safe in-place rolling upgrades (with the option for rollback) from HBase versions 1.x to versions 2.x sometime soon. Watch this space!


Thanks to Eric Fox and Laura Lindeman for the additional reviews!
