When a product is a network of interconnected services with a rich collection of external dependencies, it’s really hard to identify bottlenecks or unhealthy services. A single model for the whole system is too complex to define and understand.
A good strategy here is to divide and conquer. Monitoring can be simplified dramatically by focusing on every service separately and tracking how others use it and how it uses the services it depends on.
My team develops a product that’s composed of many services with an even greater number of dependencies on external services. Practically every service and external dependency is a distributed system. In the context of this post, that means the same service can work fine for one client but be completely broken for another.
We, as service owners, want to know whether our product performs as expected, and we want to catch problems as early as possible to prevent downtime and avoid unpleasant experiences for our customers. To achieve this, we monitor our product from the inside and the outside. For the inside look, we do proactive monitoring. Every instance of every service emits metrics and logs, all of which flow into aggregated storage. Data from there is used to determine the stability of every service and of the product in general.
Our services are well instrumented for the purpose of proactive monitoring, so they all generate tons of data points every minute. The problem we faced here is how to use all this data to decide if every service is performing as it should and, if it isn’t, what the problem is. We set a constraint for ourselves: it should be possible to troubleshoot problems without SSH’ing to the servers where our software runs.
Keeping in mind that the system is pretty big, here’s how we applied the “divide and conquer” strategy.
We agreed to follow 3 simple rules:
- Every service should have its own Service Level Agreement (SLA).
- Every instance of every service monitors how others use it and how it responds.
- Every instance of every service monitors how it uses other services and how they respond.
Of course every instance of every service has a health check and produces metrics about its internal state to ease troubleshooting. In other words, every instance of every service is a white box for owners of the service but it’s a black box for everyone else.
Let’s focus on a single service and consider all others as external dependencies, even if they’re owned by the same team.
Clients of the service want to know what they can expect from it in terms of performance and availability: how many requests per second it can process, how long downtime during maintenance is expected to last, or how long it takes on average to process a request. Usually, the performance and availability of a service can be expressed using very few parameters, and in most cases the same list applies to other services as well.
These parameters are called Service Level Indicators (SLIs). If you need an explanation of SLIs, take a look at the Wikipedia page. There are a few popular combinations of SLIs, such as rate, errors, and duration.
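To make this concrete, here is a minimal Python sketch that derives three common SLIs (rate, error ratio, and 95th-percentile latency) from raw request records. The `Request` record and its fields are hypothetical, invented for this example rather than taken from our actual instrumentation:

```python
from dataclasses import dataclass
from statistics import quantiles

@dataclass
class Request:
    latency_ms: float  # time spent processing this request
    ok: bool           # whether the request succeeded

def compute_slis(requests, window_seconds):
    """Derive three common SLIs from raw request records for one time window."""
    rate = len(requests) / window_seconds                                # requests per second
    error_ratio = sum(1 for r in requests if not r.ok) / len(requests)  # fraction of failures
    p95_latency = quantiles([r.latency_ms for r in requests], n=20)[-1] # 95th percentile, ms
    return rate, error_ratio, p95_latency

# 100 requests over a one-minute window; every tenth request failed
reqs = [Request(latency_ms=50 + i, ok=(i % 10 != 0)) for i in range(100)]
rate, errors, p95 = compute_slis(reqs, window_seconds=60)
```

In practice these aggregates would be computed by a metrics library and shipped to aggregated storage, but the arithmetic is the same.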
Once we define all the indicators and collect metrics for them, we need to decide what is good and what is bad. To do this, we take two steps: baseline the metrics, then decide what range is acceptable for every metric and where that range ends.
With the numeric definition of the acceptable ranges, we define Service Level Objectives (SLOs). You can read more about SLOs on Wikipedia. Examples of SLOs: a service should have 99.9% availability over a year, or the 95th percentile of response latency should stay below 300ms over the course of a month. It’s always better to keep some buffer between the announced SLO and the zone where things start going really badly.
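As a rough illustration of working with such an objective, the sketch below computes how much of an availability error budget remains under a 99.9% SLO; the function name and the sample numbers are made up for the example:

```python
def error_budget_remaining(total_requests, failed_requests, slo=0.999):
    """Fraction of the error budget still unspent for an availability SLO."""
    allowed_failures = total_requests * (1 - slo)  # failures the SLO tolerates
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else 0.0
    return 1.0 - failed_requests / allowed_failures

# 1,000,000 requests this month with 400 failures under a 99.9% SLO:
# the budget allows 1,000 failures, so 60% of the budget remains.
remaining = error_budget_remaining(1_000_000, 400)
```

A shrinking budget is exactly the kind of buffer mentioned above: the service can still be within its SLO while the trend already signals trouble.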
With SLOs in hand we can move forward and define a Service Level Agreement (SLA). Check out Wikipedia for more information about SLAs.
With a stable API, SLA, and some performance testing to prove that our service can meet the SLA, the service is ready for prime time.
Most importantly, when the SLA for every service in a product is known, the SLA for the product as a whole can be defined more accurately.
Inbound monitoring tells us how others use our service and how our service responds.
An SLA is a two-way agreement: clients of our service agree to the conditions for using it, and we promise them that under those conditions the service will perform within certain boundaries.
We trust our clients, but sometimes unexpected things happen, so it’s better to be defensive here. Each service monitors its own SLIs. For example, let’s assume there are two SLIs: arrival rate (number of inbound messages per second) and latency (milliseconds to process each message). Every instance of our service records and aggregates these two parameters. The aggregated metrics can be used to make sure that both our service and its clients stick to their part of the agreement. If the arrival rate grows beyond the agreed level, we can start throttling or refusing requests (first communicating this action to the client, of course). If latency grows beyond the declared limits and there is no significant increase in the arrival rate, then we know that something is wrong on our side and it’s time for us to begin troubleshooting.
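A minimal sketch of this defensive behavior might look like the following; the `InboundGuard` class and its sliding-window approach are an illustration under simplified assumptions, not our actual implementation:

```python
import time
from collections import deque

class InboundGuard:
    """Tracks the arrival rate against the agreed limit and decides when to throttle."""
    def __init__(self, agreed_rps, window_s=1.0, clock=time.monotonic):
        self.agreed_rps = agreed_rps
        self.window_s = window_s
        self.clock = clock          # injectable clock makes the class testable
        self.arrivals = deque()     # timestamps of recent accepted requests

    def allow(self):
        now = self.clock()
        # drop arrivals that have fallen out of the sliding window
        while self.arrivals and now - self.arrivals[0] > self.window_s:
            self.arrivals.popleft()
        if len(self.arrivals) >= self.agreed_rps * self.window_s:
            return False            # client exceeded the agreed arrival rate: throttle
        self.arrivals.append(now)
        return True

# simulate a client bursting past an agreed 5 requests per second
fake_now = [0.0]
guard = InboundGuard(agreed_rps=5, window_s=1.0, clock=lambda: fake_now[0])
decisions = [guard.allow() for _ in range(7)]  # first five allowed, then throttled
```

In a real service the same counters would also feed the aggregated metrics, so a burst shows up on dashboards as well as in throttling decisions.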
Because metrics are recorded for incoming requests, we call it inbound monitoring.
Outbound monitoring tells us how our service uses other services and how they respond.
Services don’t live in a vacuum; normally, they depend on other services. Again, we typically trust these external services but we need to be prepared for the unexpected.
We have SLAs from all the services our service depends on. When an instance of our service issues an outbound request, we record the SLIs defined by the owners of the call’s destination service. Because metrics are recorded for outbound requests, we call it outbound monitoring.
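As an illustration, outbound recording can be done with a thin wrapper around every call to a dependency; the `OutboundMonitor` class here is a hypothetical sketch, not production code:

```python
import time
from collections import defaultdict

class OutboundMonitor:
    """Records per-dependency latency and error counts for outbound calls."""
    def __init__(self):
        self.latencies_ms = defaultdict(list)
        self.errors = defaultdict(int)

    def call(self, dependency, fn, *args, **kwargs):
        """Invoke fn on behalf of `dependency`, recording its SLIs either way."""
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        except Exception:
            self.errors[dependency] += 1
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.latencies_ms[dependency].append(elapsed_ms)

monitor = OutboundMonitor()
result = monitor.call("billing", lambda x: x * 2, 21)  # latency recorded transparently
```

The wrapper is deliberately oblivious to what the dependency is: it only records how the dependency responded, which mirrors the black-box view described earlier.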
Outbound monitoring may sound like overkill, but it’s very helpful. For example, when multiple instances of services A and B call multiple instances of service C, the combination of inbound monitoring from the instances of service C and outbound monitoring from the instances of services A and B gives us a clear picture. We can locate the major sources and destinations of traffic and identify noisy neighbors, routing problems, or slow instances.
The combination of SLAs and inbound and outbound monitoring helps us get a good understanding of data flows in the system and dramatically simplifies observability at the perimeter of a service. The model allows us to quickly isolate a service, or an instance of a service, that is causing instability in the system, because all the services that depend on it will signal SLA violations. The model also gives service owners a clear picture of how their service is performing and how the services around it are doing. Service owners don’t need to know the internals of a database their service depends on or what the metrics from a messaging service are called. They can just look at the numbers from their own service and identify which dependency is unhealthy.
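One way to turn those perimeter numbers into a health signal is to compare the latency a service observes for each dependency against that dependency’s declared SLO; the function and the sample figures below are invented for illustration:

```python
def unhealthy_dependencies(observed_p95_ms, slo_p95_ms):
    """Flag dependencies whose observed 95th-percentile latency exceeds their declared SLO."""
    return [dep for dep, p95 in observed_p95_ms.items()
            if p95 > slo_p95_ms[dep]]

# latencies measured by our own outbound monitoring, in milliseconds
observed = {"database": 180.0, "queue": 420.0, "cache": 4.0}
# the p95 latency each dependency promises in its SLA
slos = {"database": 300.0, "queue": 300.0, "cache": 10.0}
flagged = unhealthy_dependencies(observed, slos)
```

No knowledge of the dependencies’ internals is needed: the comparison uses only our own outbound numbers and the published SLOs.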
The model, as described, allows us to perform a statistical analysis of a system. To gain visibility on the level of individual messages, distributed tracing should be added.