Autonomous Monitoring and Healing Networks

Occasional failure is inevitable in any network system. The need of the hour is a robust, self-reliant automated monitoring tool that provides deep insight and requires less manual intervention. We need autonomous interventions that save us time and enhance system availability. What Salesforce Edge now offers is a lightweight and fast solution that helps minimize downtime and the amount of time we need to spend monitoring a Content Delivery Network (CDN).

Given the complexity of our systems, which include components like Stunnel for secure communication and Redis to store in-memory configuration for thousands of Salesforce customers on the Edge network, there is always some probability of downtime. Sometimes it’s a failure of one of our own features, or it could be downtime with a switch or a router that is managed by our network operations team. Of course our goal is 99.9999% uptime. We can further classify the errors into those that exist in the data path (including other components like switches, routers, or firewalls) and those caused by our own internal components like Redis or NGINX.

As part of our initiative to have a self-reliant and self-healing network, we propose two different monitoring schemes that provide autonomous resolution to save on downtime, enhance availability, and reduce human intervention. The first takes the instance out of the traffic path, and the second self-heals the malfunctioning component using predetermined recipes.

Network Monitoring and Instance Removal:
In a multi-tiered network setup, a failure may not constitute a failure in one team’s service setup. It takes time and causes a significant downtime to identify the root cause and provide a fix. To avoid such a system outage, we discuss a monitoring scheme to identify the failures and take proactive steps. This scheme is more effective when the failures are small enough not to be considered as global outages and thus do not come under global systems failure radar but still cause customer impact.
Component Monitoring and Instance Repair:
Software solutions catering to a number of customers and solving substantial use cases do bring in various design components that need to work effectively together. This brings in greater probability of malfunctions within our service and the need to detect these and take corrective action. We will discuss the recipes and the steps that could be taken to minimize any service impact while self correcting the problem.

For both the schemes, the underlying concept is to have autonomous network monitoring and self-healing procedures.

Network Monitoring and Instance Removal

In the first part of this article we will talk about how we detect and act upon errors noticed along the data path, especially in components that the team does not own or control. At Salesforce, data center-wide issues are usually owned by a dedicated team separate from individual service owners, such as the Edge team. In the event of a data center-wide failure, the best first mitigation for Edge customers is simply to remove the data center from the DNS rotation in the Global Server Load Balancer. A more contained scenario is when the outage is too small to be a data center-wide failure but still causes significant customer impact. For example, say stunnel running on one of our computing instances is not refreshing DNS for a DDNS upstream and continues to use a cached IP that has been taken down. Or one of our ~100 computing instances is using an expired certificate. In these scenarios, the safest bet is to take that instance out of traffic to minimize customer impact.

The scheme that we propose is to detect such failures by continuously monitoring traffic on each computing instance independently and to take action as appropriate. Monitoring runs as a background thread against a set of pre-configured test domains that it uses for a health check. Given that Edge is a proxy between the clients and core Salesforce, our main goal is to see traffic reach its destination with no hiccups. So the monitored domains are selected to cover as many use cases as we would encounter with real traffic, thus creating silos. Defining silos means listing all use cases in our service offering and coming up with a sanity domain to cover each of those possibilities and corner cases. At this time we focus on domains that perform a 1-way and 2-way TLS handshake to Salesforce core apps and domains outside of the Salesforce network.

This is a high-level diagram detailing Edge and its interaction with different Salesforce components:

This helps us understand the problem and react to it in an appropriate manner. Before we go into dealing with a failure, let us first define what that means in our context. As mentioned, Edge is a proxy between clients and Salesforce core; our primary job is to maintain warm connections between Edge and the origin to reduce TCP handshake cost and also to provide caching to the clients. A failure is defined as any unsuccessful TCP connection between us and Salesforce core or an outside Salesforce domain. So we would look at connection timeouts (504), gateway being unreachable (502), and missing or incorrectly onboarded domain in Redis (403), and ignore most other status codes.

Once we have the failures, it is important to define sensitivity: the percentage of domains across use cases (silos) that must fail before we take a reactive action. This is called the “negative quorum” and is configurable. The setting needs care in both directions. If it is too sensitive, we get false positives. If it is too lenient, we get false negatives; for instance, the failure of one upstream instance in a cluster of five behind a load balancer could be overlooked, because only about 20% of checks would hit the failed instance.
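To make the trade-off concrete, the sketch below estimates how often a single health cycle would reach the negative quorum when one of five upstream instances is down. It assumes, purely for illustration, that the load balancer spreads checks uniformly and independently, so each sanity-domain check fails with probability 0.2. The function name and the example numbers (10 domains, quorums of 50% and 20%) are hypothetical.

```python
from math import ceil, comb


def quorum_hit_probability(num_domains: int, fail_prob: float, negative_quorum: float) -> float:
    """Probability that failed checks reach negative_quorum (a percentage)
    in one cycle, assuming independent checks that each fail with fail_prob."""
    # Smallest number of failed domains whose percentage meets the quorum.
    needed = ceil(negative_quorum * num_domains / 100)
    return sum(
        comb(num_domains, k) * fail_prob**k * (1 - fail_prob) ** (num_domains - k)
        for k in range(needed, num_domains + 1)
    )


if __name__ == "__main__":
    for quorum in (50, 20):
        print(quorum, round(quorum_hit_probability(10, 0.2, quorum), 4))
```

Under these assumptions, a 50% quorum is reached in only about 3% of cycles, while a 20% quorum is reached in roughly 62% of them. This is why a lenient setting can hide a partial upstream failure.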

One option to minimize false positives would be to repeat the health checks multiple times before we take the instance out of traffic.

import time

# negative_quorum: configured percentage of failed domains at or above
#                  which the instance is considered unhealthy and taken
#                  out of traffic
# sanity_domains:  list of domains used for our health testing
# N:               configured pause between health cycles, in seconds
# execute_http_request, mark_instance_outof_LB and keep_instance_in_LB
# are helpers assumed to be defined elsewhere
while True:
    time.sleep(N)
    failure_count = 0
    for domain in sanity_domains:
        response_code, err = execute_http_request(domain)
        # Connection errors, ACL errors (403), and 5XX response codes
        # are counted as failed responses
        if err or response_code == 403 or 500 <= response_code <= 599:
            failure_count += 1

    failure_percentage = failure_count / len(sanity_domains) * 100
    if failure_percentage >= negative_quorum:
        mark_instance_outof_LB()
    else:
        keep_instance_in_LB()
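The repetition mentioned earlier can also be applied per domain, before a cycle's failures are counted. The sketch below is one hedged way to do that: a domain is counted as failed only if every one of a few attempts fails. `is_failed_response`, `check_with_retries`, and the `attempts` and `delay_seconds` parameters are hypothetical names, and `execute_http_request` is passed in as a callable standing in for the helper used above.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical signature for the HTTP helper: returns (status_code, error).
HttpCheck = Callable[[str], Tuple[Optional[int], Optional[Exception]]]


def is_failed_response(response_code: Optional[int], err: Optional[Exception]) -> bool:
    """Connection errors, 403 (ACL), and 5XX responses count as failures."""
    if err is not None or response_code is None:
        return True
    return response_code == 403 or 500 <= response_code <= 599


def check_with_retries(
    execute_http_request: HttpCheck,
    domain: str,
    attempts: int = 3,
    delay_seconds: float = 1.0,
) -> bool:
    """Return True if the domain is healthy on any of `attempts` tries."""
    for attempt in range(attempts):
        response_code, err = execute_http_request(domain)
        if not is_failed_response(response_code, err):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_seconds)  # brief pause before retrying
    return False
```

Retrying inside a cycle trades a slightly longer cycle for fewer removals caused by a single transient error.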

When performing end-to-end monitoring, it could be that the problem is not on our end; hence, we have a concept of raising alerts and notifying the team. We pipe alerts into the open source monitoring dashboard Refocus. This dashboard is monitored by Site Reliability and offers great features like paging the person who is on call.

It is important to note that this kind of monitoring is continuous and does not stop when a problem has been identified and an instance has been taken out of traffic. This is very useful for a long-running service in which some instances were taken out of rotation because of an external network issue and now need to be brought back into traffic (after the external team has resolved the issue). The same algorithm is applied, but this time the instance waits until a positive result across silos has been observed a threshold number of times. This avoids human intervention, and the failures can still be worked on later from the alerts that were created and registered with Refocus. Creating alerts provides traceability for an event that occurred, which is important in our context because the instance is brought back into traffic automatically.
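The re-entry rule described above can be sketched as a small state tracker. This is an illustration under assumptions, not the Edge implementation: `RotationController`, `healthy_cycles_required`, and the returned action strings are hypothetical, and the caller is expected to feed it the failure percentage computed in each monitoring cycle.

```python
from dataclasses import dataclass


@dataclass
class RotationController:
    negative_quorum: float            # percentage that marks a cycle as failed
    healthy_cycles_required: int = 3  # consecutive good cycles before re-entry
    in_rotation: bool = True
    healthy_streak: int = 0

    def observe(self, failure_percentage: float) -> str:
        """Record one monitoring cycle and return the action to take."""
        if failure_percentage >= self.negative_quorum:
            self.healthy_streak = 0
            if self.in_rotation:
                self.in_rotation = False
                return "remove_from_lb"  # also the point to raise an alert
            return "stay_out"
        if self.in_rotation:
            return "stay_in"
        # Out of rotation: count consecutive healthy cycles before returning.
        self.healthy_streak += 1
        if self.healthy_streak >= self.healthy_cycles_required:
            self.in_rotation = True
            self.healthy_streak = 0
            return "return_to_lb"
        return "stay_out"
```

Requiring several consecutive healthy cycles keeps an instance from flapping in and out of the load balancer while the external issue is still settling.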

Impact

To illustrate the impact of this new feature on the Edge team, consider a network error spike we observed in our test/User Acceptance Testing (UAT) environment. We noticed alerts raised by our entire fleet of instances in a datacenter using this feature. After looking into it, we could see that the negative quorum was at 100% and that the alerts were raised at the same time. Given that the alerts were flapping around 01:55 UTC across our gateways, it was safe to assume that the cause was not a software bug on our side but a network issue somewhere between us and the proxy we use to reach the internet.

The image above shows the number of aborted requests between 01:55–03:03 UTC. After reaching the above conclusion the investigation was handed over to the Network Operations team which analyzed logs on their side and found an issue wherein several interfaces (uplinks, inter-switch links and downstream links) flapped on a switch pair, thus causing this issue. Given the timeframe was small enough it didn’t come under the radar of Site Reliability, but Edge instances could safely stop taking the traffic (Salesforce internal test traffic) during that time and resume when data path was restored. Since alerts were raised in time for the team to identify that an issue had happened, by the time the team logged in to see the current status, the flapping had stopped. The issue with the flapping on switches was later taken up with the equipment vendor.

Component Monitoring and Instance Repair

Given the complexities of a multi-tenant, distributed network architecture, various solutions are used during system design. It is critical to monitor these for their health at runtime and to take self-healing actions when appropriate. This feature is an extension of the above discussion, in which our tool tries to detect failed components in the architecture and then corrects them. This requires continuous monitoring and component-specific fixes. The goal here is as little downtime as possible and zero operations time to fix the issue at hand. The fixes will be very specific to the design, and we do not provide a one-size-fits-all solution. The proposal is to have individual recipes prepared for use as and when a malfunction or an outage is observed.

The algorithm we use to identify such scenarios is very similar to SELF MONITORING, with two major differences:

Rather than using the sanity domains, we continuously monitor the health of each component at a defined interval.
Our responsibility is to fix the component problem instead of just issuing an alert.

These are the differences between the network and component monitoring.

Component solution adopted in Salesforce Edge

In our system design, we use two major components — NGINX and Redis (as shown earlier). These components run in different Docker containers on every instance. This brings a huge challenge for us to make sure that they remain in a healthy state. To implement this, we use an operating system level virtualization software that spawns containers wherein we run these two components. This brings in an added complexity to keep a note of the containers’ health. Further, since the containers that run this software are interdependent, the order in which they come up is extremely important. For us, Redis holds some configuration data that is used at runtime by NGINX. So this has brought a few challenges for us:

A dependent container (NGINX) should not start to take traffic unless the main container (Redis) is fully loaded and operational.
Also, if a container were to crash or restart while the service is live, it would lose all configuration information fed to it during the initial bootstrap. This scenario has to be identified, and the instance should be removed from live traffic until it is correctly bootstrapped again.
Another potential issue with the virtualization software we use is the possibility of two containers racing to come up during service bootstrap. The idea is for one to be ready before the second starts to load, to have an order and serialization. Docker is not perfect in avoiding such issues. With Docker, we can have a load sequence of containers based on the dependency order specified using the ‘depends_on’ option. While it does provide this load sequence, it puts the final onus of ordered service bootstrap on software developers. Hence, monitoring the state of these two containers becomes crucial.
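Because start order alone does not guarantee readiness, the dependent service can wait on an explicit readiness probe. The sketch below polls a caller-supplied probe until it succeeds or a deadline passes. `wait_until_ready`, its parameters, and the idea of gating NGINX on a Redis probe this way are assumptions for illustration, not a description of the Edge bootstrap code.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_seconds: float = 60.0,
    interval_seconds: float = 1.0,
) -> bool:
    """Poll `probe` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            if probe():
                return True
        except Exception:
            # A probe that raises (e.g. connection refused) means "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_seconds)
```

A bootstrap script could call `wait_until_ready(redis_is_loaded)` before letting NGINX accept traffic, where `redis_is_loaded` is a hypothetical probe for the main container.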

It is important to note that the definition of health for each component varies and is application specific. For instance, to solve a Redis health issue in Edge, we seed a bootstrap key which we poll to make sure of its existence every health cycle. In case of a problem, the actions we take, irrespective of the component, are the following:

Isolate the computing instance from taking any traffic to make sure there is no customer impact.
Use the component-specific self-healing recipe in a separate thread from SELF MONITORING.
Bring the instance back into traffic once the fix takes effect.

With this set of steps, we solve the problems of two dependent containers racing during bootstrap and of one component not being ready to serve traffic. Unless every component is marked healthy, the instance is not brought back to serve traffic.
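The three steps above can be sketched as follows. The names here (`ComponentRecipe`, `heal_component`, `bootstrap_key_present`, the key `edge:bootstrap`, and the `isolate`/`restore` callbacks) are hypothetical. The Redis probe only assumes a client object exposing an `exists(key)` method, as redis-py's client does.

```python
import threading
from dataclasses import dataclass
from typing import Callable


def bootstrap_key_present(client, key: str = "edge:bootstrap") -> bool:
    """Redis is treated as healthy only if the seeded bootstrap key exists."""
    return bool(client.exists(key))


@dataclass
class ComponentRecipe:
    name: str
    is_healthy: Callable[[], bool]  # component-specific health probe
    heal: Callable[[], None]        # component-specific repair steps


def heal_component(
    recipe: ComponentRecipe,
    isolate: Callable[[], None],
    restore: Callable[[], None],
) -> threading.Thread:
    """Isolate the instance, run the recipe in its own thread, then restore."""

    def run() -> None:
        isolate()                # 1. stop taking traffic
        recipe.heal()            # 2. apply the self-healing recipe
        if recipe.is_healthy():  # 3. return only once the fix took effect
            restore()

    worker = threading.Thread(target=run, name=f"heal-{recipe.name}", daemon=True)
    worker.start()
    return worker
```

Keeping the probe and the repair together in one recipe object lets each component define health in its own terms while the isolate, heal, and restore flow stays the same.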

Conclusion

With a large scale distributed system network, there are many chances for malfunctions. Highly scalable and available networks are the need of the hour and hence it is critical to invest in strategies that make our systems fault tolerant.

We discussed two ways of doing this. First is an end-to-end network monitoring that helps identify issues that may not directly involve one software architecture but a problem somewhere down the network. We proposed a scheme that would be a passive fix — removing the malfunctioning instance from Load Balancer in an automated manner to reduce chances of decreased availability. This works best when the problem is localized in one part of the system, rather than, say, DNS resolutions causing a datacenter wide failure. Secondly, we talked about various solutions that are used during system design and the importance of monitoring the sub-systems with the intent of having a self-healing mechanism. Both the schemes together are great tools to help build a distributed system that is highly available.
