It is vital that a microservice striving for high availability have several options at its disposal for rolling back changes when unforeseen production issues occur. An O’Reilly article, Generic Mitigations, comes to mind: its theme is to restore service functionality FIRST and FAST, and THEN root-cause the problem and deliver the real fix as a follow-up. The top goal is to minimize user impact by using “pre-canned” rollback mitigations, ones that have been practiced and that the team can apply virtually blindfolded, and to employ them fast so that service is restored as soon as possible. Software code changes make up most new deployments, so the focus below is on dealing with errant code changes, typically around tech debt improvements.
The Problem
When a service rolls out changes, it usually combines a group of pull requests (PRs) into a release bundle, which is then deployed to production. When trouble starts, precious time is spent investigating performance data, logs, and error output, as well as reviewing code PRs to determine the issue, all while users are being affected. Following the Generic Mitigations pattern, good students of these teachings will invoke a ready mitigation (e.g., a code rollback) and restore the service. All good, right!? Sort of: the user is able to transact and the service is back, but more engineering time now has to be spent sorting out which PR did what in order to provide a follow-on fix. Let’s face it: as microservice patterns flourish and become more prevalent, the system becomes more integrated and complex, and it gets harder to roll out one flawless update after another.
A Solution To Reduce The Chaos
Typically, a given release includes a code change that fixes tech debt or updates existing service flows. When such a change regresses and breaks, the rollout goes bad, users and the service are affected, and the above problem repeats. One way to help mitigate these types of changes is a code switch: we wrap the existing code and the newly updated code in a simple IF test that is controlled by the switch. Example pseudo-code is offered below.
if (codeSwitch.isEnabled(metaServiceAuthN)) {
    // execute new code: create and sign token and send to meta API
} else {
    // execute existing code: get token from IDP and send to meta API
}
Here, one can see that if the code switch is enabled, the service uses its new AuthN mechanism to call an outside API. If the code switch is disabled, the existing AuthN, which has been working for months, is executed as if the new code update and deployment never happened. The choice is made at runtime, so the code switch can be changed on the fly without any code rollbacks, service restarts, or updates.
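To make the runtime branch concrete, here is a minimal, self-contained Java sketch. It is an illustration under assumptions, not the service's actual code: InMemoryCodeSwitch and MetaApiClient are hypothetical names, the switch lives in a map rather than a database, and the two AuthN paths are reduced to strings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class CodeSwitchBranchDemo {
    public static void main(String[] args) {
        InMemoryCodeSwitch codeSwitch = new InMemoryCodeSwitch();
        MetaApiClient client = new MetaApiClient(codeSwitch);

        System.out.println(client.authenticate());   // switch never set: existing path
        codeSwitch.set("meta_service_authn", true);
        System.out.println(client.authenticate());   // switch enabled: new path
        codeSwitch.set("meta_service_authn", false);
        System.out.println(client.authenticate());   // switch disabled again: existing path
    }
}

// Hypothetical in-memory stand-in for the switch store (the article uses a database).
final class InMemoryCodeSwitch {
    private final Map<String, Boolean> states = new ConcurrentHashMap<>();

    void set(String name, boolean enabled) {
        states.put(name, enabled);
    }

    // Unknown switches are treated as disabled, so the existing path stays the default.
    boolean isEnabled(String name) {
        return states.getOrDefault(name, false);
    }
}

// Hypothetical client; the real AuthN calls are replaced by descriptive strings.
final class MetaApiClient {
    private final InMemoryCodeSwitch codeSwitch;

    MetaApiClient(InMemoryCodeSwitch codeSwitch) {
        this.codeSwitch = codeSwitch;
    }

    // The branch is evaluated on every call, so a flipped switch applies without a restart.
    String authenticate() {
        if (codeSwitch.isEnabled("meta_service_authn")) {
            return "new: self-signed token";
        }
        return "existing: token from IDP";
    }
}
```

Because the check happens per call rather than at startup, reverting is a data change, not a deployment.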
Now, of course, there are numerous ways to implement this type of pattern, including ones where new functionality can be tried out on a limited set of customers: think A/B testing, Spring Profiles, Canary patterns, etc. Baeldung.com has an excellent write-up on this family of feature switch technologies. All of these are excellent and should be pursued, with their pros and cons considered. However, if you have a new service or limited developer resources, those robust capabilities are probably still on the drawing board. This is where the simplicity of code switches comes in. For a light lift (on the order of a few sprints), you can build a feature switch pattern that is tailored for cross-service deployment failure mitigation.
The Build Out
As we built this out in our environment (Amazon AWS, Kubernetes (K8s), Java with Spring Data JPA, and a single Relational Database Management System (RDBMS) centrally backing all the workloads), the database was the obvious place to store the code switch so that it could be shared across K8s workloads. As a side note, we also wanted to avoid service restarts to effect a change, which ruled out an app.properties approach.
code_switch Table
name - varchar(100)
enabled - boolean
info - varchar(100)
A simple enum keeps naming and storage consistent.
enum CodeSwitchEnum {
    NONE("none", "none", false),
    TEST("test", "this is a test", true),
    META_SERVICE_AUTHN("meta_service_authn", "none", false);

    final String name;
    final String info;
    final boolean initialEnableState;
    // constructor and getters omitted for brevity
}
And finally, a simple method to check isEnabled(), which does a DB lookup of the value.
boolean isEnabled(CodeSwitchEnum switchName) {
    // DB lookup of the switch's current value
    var codeSwitch = codeSwitchRepository.findById(switchName);
    return codeSwitch.getIsEnabled();
}
Immediately, we recognized the need to rely on the RDBMS cache for these values and specifically to avoid the JPA cache. If the JPA cache were part of the isEnabled() call and the database field were updated at the SQL prompt or via an Admin API call, we would need to flush the JPA cache in every workload, which would be challenging. By relying on the RDBMS cache directly, whenever the field changes in the DB, the next read from any workload flows all the way through to the database, gets the updated value, and refreshes the cache. This proved to be the simplest way to allow real-time database updates while forcing the app to re-read the new value. Another iteration could have worked Redis into the isEnabled() flow, which offers some improvement but is somewhat limited. The DB cache is good enough as a start!
Putting it all together, we can see how a different code path runs depending on the position of the code switch. When the switch is enabled, path two executes; when it is disabled, path one executes. On the first read, the first workload needs to go all the way to DB disk, but thereafter every other workload calling isEnabled() pulls the code switch boolean from the shared RDBMS cache. When the code switch state changes, the RDBMS flushes the cache, and all workloads get the newest value on the next read, which makes it a nice real-time switch.
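The reasoning for skipping an application-side cache can be illustrated with a small simulation. This is a sketch under assumptions: SharedSwitchTable is a hypothetical in-memory stand-in for the shared code_switch table, and LocallyCachedChecker only mimics what a per-workload cache would do; neither is JPA or the actual service code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ReadThroughDemo {
    public static void main(String[] args) {
        SharedSwitchTable table = new SharedSwitchTable();
        ReadThroughChecker direct = new ReadThroughChecker(table);
        LocallyCachedChecker cached = new LocallyCachedChecker(table);

        // Both workloads read the switch once while it is still disabled.
        direct.isEnabled("test");
        cached.isEnabled("test");

        // An operator flips the switch out of band, e.g. at the SQL prompt.
        table.write("test", true);

        System.out.println("read-through sees enabled: " + direct.isEnabled("test"));
        System.out.println("locally cached sees enabled: " + cached.isEnabled("test"));

        // The cached workload only catches up after an explicit flush.
        cached.evictAll();
        System.out.println("after flush, cached sees enabled: " + cached.isEnabled("test"));
    }
}

// Hypothetical stand-in for the shared code_switch table.
final class SharedSwitchTable {
    private final Map<String, Boolean> rows = new ConcurrentHashMap<>();

    void write(String name, boolean enabled) {
        rows.put(name, enabled);
    }

    boolean read(String name) {
        return rows.getOrDefault(name, false);
    }
}

// Goes to the shared table on every call: there is no per-workload state to flush.
final class ReadThroughChecker {
    private final SharedSwitchTable table;

    ReadThroughChecker(SharedSwitchTable table) {
        this.table = table;
    }

    boolean isEnabled(String name) {
        return table.read(name);
    }
}

// Keeps a local copy per workload: out-of-band updates stay invisible until eviction.
final class LocallyCachedChecker {
    private final SharedSwitchTable table;
    private final Map<String, Boolean> localCache = new ConcurrentHashMap<>();

    LocallyCachedChecker(SharedSwitchTable table) {
        this.table = table;
    }

    boolean isEnabled(String name) {
        return localCache.computeIfAbsent(name, table::read);
    }

    void evictAll() {
        localCache.clear();
    }
}
```

In this simulation the read-through checker picks up the change on its next call, while the locally cached one keeps returning the stale value until evictAll() runs, which is the per-workload flush the design avoids.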
Initialization … Bring out the Provisioner!!
We next considered what should happen the first time a new code switch is added: is it disabled or enabled by default? And how are the code switch and its initial state added to the database: via a database migration or during the service startup sequence? Had we chosen the DB migration path, each code switch would have required a new YAML description and a DB migration after the enum was added to the Java file. Adding the field initialEnableState to the enum was the most straightforward way, as it lets the developer create the code switch in one place.
enum CodeSwitchEnum {
    // ...
    boolean initialEnableState;
}
Next, we chose to provision the code switch at startup, using a simple @PostConstruct annotation in our Admin workload. Here the code switch provisioner loops through all the enums, doing a DB lookup for existence. If provisioned previously, the switch is already in the DB, so NEXT! If not, it is added to the DB with its initial state set. Throughout, isEnabled() returns false (disabled) in all conditions unless it actually reads the value from the DB (or its cache). This avoids any timing issue on first initialization if a different workload reaches the isEnabled() check before the switch is provisioned. Lastly, the Admin provisioner loop is built such that the first admin workload to perform the DB insert wins, and the others fail gracefully: no harm, no foul.
// ** will run once on startup
@PostConstruct
public void autoCreateCodeSwitches() {
    for (CodeSwitchEnum enumItem : CodeSwitchEnum.values()) {
        if (enumItem == CodeSwitchEnum.NONE) {
            continue;
        }
        if (Objects.nonNull(codeSwitchService.getCsInfo(enumItem))) {
            continue; // already in DB, NEXT!
        }
        // not found, create and insert in DB
        CodeSwitchInfo codeSwitchInfo = CodeSwitchInfo.builder()
                .name(enumItem.getValue())
                .metaval(enumItem.getMetaval())
                .isEnabled(enumItem.getInitialEnableState())
                .build();
        codeSwitchService.createCodeSwitch(codeSwitchInfo);
    }
    log.info("autoCreateCodeSwitches done.");
}
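The "first insert wins" and "disabled until provisioned" behaviors described above can be sketched without Spring or a database. Everything here is a hypothetical stand-in: DemoSwitch mirrors the shape of the enum, and DemoSwitchStore uses ConcurrentHashMap.putIfAbsent to imitate an insert guarded by a unique key on the switch name, which is an assumption about how the real table is constrained.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ProvisionerDemo {
    public static void main(String[] args) {
        DemoSwitchStore store = new DemoSwitchStore();

        // Before provisioning, every switch reads as disabled, even one whose initial state is true.
        System.out.println("before provisioning: " + store.isEnabled("test"));

        System.out.println("first admin workload created: " + DemoProvisioner.provision(store));
        System.out.println("second admin workload created: " + DemoProvisioner.provision(store));

        // An operator change survives a later restart of the provisioner.
        store.setEnabled("test", false);
        DemoProvisioner.provision(store);
        System.out.println("after re-provisioning: " + store.isEnabled("test"));
    }
}

// Hypothetical enum mirroring the article's name / initial-state idea.
enum DemoSwitch {
    TEST("test", true),
    META_SERVICE_AUTHN("meta_service_authn", false);

    final String key;
    final boolean initialEnableState;

    DemoSwitch(String key, boolean initialEnableState) {
        this.key = key;
        this.initialEnableState = initialEnableState;
    }
}

// Hypothetical store; putIfAbsent imitates an insert that fails quietly on a duplicate key.
final class DemoSwitchStore {
    private final Map<String, Boolean> rows = new ConcurrentHashMap<>();

    // Returns true only for the caller that actually created the row.
    boolean insertIfAbsent(String key, boolean enabled) {
        return rows.putIfAbsent(key, enabled) == null;
    }

    // Updates an existing row; does nothing if the switch was never provisioned.
    void setEnabled(String key, boolean enabled) {
        rows.replace(key, enabled);
    }

    // A missing row reads as disabled.
    boolean isEnabled(String key) {
        return rows.getOrDefault(key, false);
    }
}

final class DemoProvisioner {
    // Returns how many switches this run created; existing rows are left untouched.
    static int provision(DemoSwitchStore store) {
        int created = 0;
        for (DemoSwitch s : DemoSwitch.values()) {
            if (store.insertIfAbsent(s.key, s.initialEnableState)) {
                created++;
            }
        }
        return created;
    }
}
```

The second provisioning run creates nothing and overwrites nothing, so it is safe for several admin workloads to start at once or to restart after an operator has changed a switch.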
API Management
In addition to the workhorse isEnabled(), we crafted a group of methods and APIs to manage the code switches, all protected with API security best practices.
CodeSwitchInfo createCodeSwitch(CodeSwitchInfo codeSwitchInfo);
CodeSwitchInfo readCodeSwitchInfo(String name);
List<CodeSwitchInfo> readCodeSwitchInfos();
CodeSwitchInfo updateInfoval(CodeSwitchInfo codeSwitchInfo);
void setEnableState(String name, boolean state);
void deleteCodeSwitch(String name);
GET    /code-switch/{name}          read code switch
GET    /code-switch/switches        read list of ALL code switches
PUT    /code-switch                 create code switch (pre-req: enum exists in code)
PUT    /code-switch/{name}          update code switch infoval
DELETE /code-switch/{name}          delete code switch
GET    /code-switch/{name}/enable   enable code switch
GET    /code-switch/{name}/disable  disable code switch
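As a rough illustration of the rules behind these operations, the sketch below models a few of them in memory. It is an assumption-laden simplification: SwitchAdminSketch is a hypothetical class, only create, setEnableState, delete, and a read are shown, the set of known names stands in for the enum prerequisite, and no HTTP layer or security is included.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class SwitchAdminDemo {
    public static void main(String[] args) {
        SwitchAdminSketch admin = new SwitchAdminSketch(Set.of("test", "meta_service_authn"));

        admin.create("meta_service_authn", false);
        admin.setEnableState("meta_service_authn", true);
        System.out.println("enabled: " + admin.isEnabled("meta_service_authn"));

        try {
            admin.create("not_in_enum", true);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }

        admin.delete("meta_service_authn");
        System.out.println("after delete: " + admin.isEnabled("meta_service_authn"));
    }
}

// Hypothetical in-memory model of the management operations.
final class SwitchAdminSketch {
    private final Set<String> knownNames;                       // names declared in code
    private final Map<String, Boolean> rows = new ConcurrentHashMap<>();

    SwitchAdminSketch(Set<String> knownNames) {
        this.knownNames = knownNames;
    }

    // Create is only allowed for names that exist in code; an existing row is left as-is.
    void create(String name, boolean enabled) {
        if (!knownNames.contains(name)) {
            throw new IllegalArgumentException("unknown code switch: " + name);
        }
        rows.putIfAbsent(name, enabled);
    }

    // Enable/disable only touches switches that have already been created.
    void setEnableState(String name, boolean state) {
        if (rows.replace(name, state) == null) {
            throw new IllegalStateException("code switch not provisioned: " + name);
        }
    }

    void delete(String name) {
        rows.remove(name);
    }

    // A missing switch reads as disabled.
    boolean isEnabled(String name) {
        return rows.getOrDefault(name, false);
    }
}
```

Rejecting names that are not declared in code keeps the table from accumulating switches that no code path will ever consult.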
Usage and Outcome
For our service, we have used this technique several times in areas of tech debt transition where there is the potential for “service breakage.” We are careful not to use it as a means to enable or disable customer features, because that starts down a slippery slope of customers choosing which features they want. That would make the code a nightmare to maintain and would fragment customer feature sets. Again, the context for this technique is a new service, or a team looking to evolve its service or keep addressing tech debt, with a small group of developers that needs to be able to fall back fast.
We started with the default value of DISABLED for our first few code switches. Then, when the service owners were ready to enable the code switch, we did so, such that we had all of our monitoring instrumentation at the ready. Recall that I mentioned a good use of code switches is around areas that could break critical user flow functionality — so we wanted to be ready!
To change the code switch, we have an Admin API that only allows service operators (with the right security and credentials) to authenticate and change the enabled state. We immediately started to see different logging and monitoring metrics showing that the new code path (see path two above) was running. So far so good. After another 20 minutes, all appeared to be nominal, and we declared the production change a success, knowing all the while that, at any hint of concern, we could quickly revert to code path one. Which brings us back to the tenets of Generic Mitigations: minimize user impact and roll back fast!
In closing, clearly there are many more sophisticated patterns out there that could be the next level in a service that is following an iterative approach of continuous improvement. Code Switches as described here offer a relatively simple yet useful tool to go about tech debt service improvements in a lightweight way with the option to revert quickly.