In the first part of this blog, we described a solution for strongly consistent immutable secondary indexes. We extend this solution for mutable indexes here. By doing so, we provide a general solution for strongly consistent global secondary indexes.
The mutable tables allow existing rows to be updated many times, and updating an existing data table row may change a secondary key on this row. If this happens, we need to remove the old index row for this secondary key and insert a new index row with the new key. As the following illustration shows, changing the city for Alice from Seattle to Utah requires simply updating the the city column value for Alice’s row on the data table, however it requires multi-row updates on the index table.
Changing a secondary key column value on the data table makes the corresponding existing index row invalid. In the example, changing the city from Seattle to Utah for Alice makes the existing index row for Seattle invalid, and this row should not be in the result set of any query. This means we need to delete this row from the index table, and insert a new index row for Utah. This multi-row update should be done atomically so that for the queries on the index table we either return the Seattle or Utah row, but not both.
To detect a secondary key change because of a new update, we retrieve the current row from the data table and check for a changed secondary key. If found, we delete the index table row corresponding to the current data table row and insert the index row with the new secondary key. This means that we need to retrieve the existing data table row first before generating index table mutations.
As we concluded, we need to read the existing row state from the data table before updating the row. If the row exists, then we generate two sets of index mutations. The first set includes mutations setting the existing index row status to “unverified” and to insert the new index or updated index rows. Both the new index row and modified index row mutations are inserted with the “unverified” status. The second set of index mutations includes deleting the previous versions of index rows, if needed, and setting the status of the new versions of existing rows (or the newly inserted index rows) to “verified’. We apply the mutations in the first set to the index tables. Only after completing this step successfully do we update the data table row. Only if the data table update is also successful, do we apply the second set of index mutations.
For deleting an existing data table row, we use the same four-phase approach as for covered immutable indexes. Mutability of indexes does not require any change on this delete approach.
Since index table mutations are generated based on the current state of the data table row and the update (that is, mutation) on that row, we need to make sure that we generate correct index mutations even when the same row is going through multiple concurrent updates. Let us consider the following scenario. While one update on Alice’s row to change her city from Seattle to Utah is in progress, another update to change her city to Dallas arrives. If we let both updates proceed concurrently, we may conclude that the first update is to change the city from Seattle to Utah, and the second update is to change the city from Seattle to Dallas. This is because while processing the second update, the first update may not have been done on the data table yet. This results in deleting the Seattle index row twice and adding two separate index rows for Alice, one for the city Utah, and another for the city Dallas, as shown in the following illustration. After completing the processing of the concurrent updates, we should have only one (verified) index row for Alice regardless of the number of concurrent updates. This requires us to serialize the concurrent updates to a given row. We can achieve this using mutually exclusive row locks, which is explained in the next section.
Handling Concurrent Updates
To serialize the concurrent updates to a given row, we use row-level mutually exclusive locks. Ideally, we block a concurrent update until the previous update completes the data table row update. This means that on the server side (that is, on the HBase region servers), we take the following steps.
- Acquire the row lock for a given table row.
- Read the data table row state.
- Generate two sets of index table mutations (as explained in the previous section).
- Update the index tables with the first set of mutations.
- Update the data table.
- Release the lock.
- Update the index tables with the second set of mutations.
Since we use blocking RPCs in HBase, and data table and index table regions are independently distributed over HBase region servers, we need to update the index tables using the blocking RPC calls. Holding row locks over blocking RPCs can easily lead to cluster-wide deadlocks. To prevent this, we need to release the locks before updating the index tables, as this illustration shows.
Releasing the row lock before updating the index tables with the first set of mutations allows the subsequent concurrent update on the same data table row to acquire the row lock and start its own processing. However, we cannot read the data table row state from the data table directly (for this concurrent mutation) since the previous update on the same row may not be done. Instead of retrieving the data table row on disk directly, we need to use the data table row state prepared by the previous concurrent update on this row. To do this, we cache these row states in memory.
After releasing the lock and updating the index tables with the first sets of mutations, we acquire the table row lock again and check if we have updated the data table successfully for the previous concurrent update. If this update has not happened yet, then we release the row lock and wait for the previous update to complete or fail. If the previous update fails, we are required to abort the subsequent concurrent update(s) since the row state used to generate index mutations is no longer valid.
The solution presented here updates the index tables using a two-phase commit protocol. There is no guarantee that both phases will complete successfully. If the first phase fails or the second phase does not make it, we can have unverified index rows. As mentioned before, these rows are repaired from the data table if they are retrieved for serving queries. This read repair process simplifies handling write failures and concurrent updates, as we can simply abort the operations and leave index rows in the unverified state.
Monitoring Index Consistency In Production
A provably correct solution to keep index tables consistent with their data table has been presented. However, how do we ensure that this solution is correctly implemented, that the correctness will be preserved while the implementation evolves over time, or the assumptions made for the underlying services and systems are correct and never change? The answer is that we cannot and we need to constantly monitor the consistency level to discover issues.
Monitoring the index consistency level by itself is a challenge and can be costly. Ideally, we would like to compare every data table row with every index row, and make sure that they are consistent. Actually, this comparison has to be done in both directions: one from data table to index table, and another from index table to data table. For example, we start with the data table and make sure that for every data table row, the corresponding index rows exists. Then, we continue with the index table and ensure that for every index table row there is a data table row with the consistent content.
In addition to ensuring the completeness of the monitoring process, we also need to make sure that monitoring is done efficiently without impacting the other services in the cluster. The details of how we achieve this can be the subject of another discussion. However, it is worthwhile to mention that incremental verification to update the consistency metrics is the main idea for the efficient and scalable index consistency monitoring.
Thank you to Kevin Bradicich, Trish Fuzesy and Laura Lindeman for reviewing this blog post.