At Salesforce, we operate thousands of services of various sizes: monoliths and microservices, both customer-facing and internal, across multiple substrates, i.e., first-party and public cloud infrastructure. In our earlier blog “READS: Service Health Metrics,” we talked about the Service Level Objective (SLO) framework called READS that we developed at Salesforce to standardize SLO tracking for Salesforce services using a minimal set of indicators. Note that, at Salesforce, we consider SLO tracking for features to be as critical to our customers and our success as SLO tracking for the services that serve those features. So, from here on, when we use the word “service,” we mean both a service and a feature. Some natural questions to ask in this context are: How do we manage the SLO onboarding process for services at scale? How do we simplify the Developer Experience (DX) for service owners? What are some key takeaways from our approaches?
Before we get to that, let us take a high-level look at the Service Ownership lifecycle at Salesforce. (We delve deeper into this topic in our recent post, “Transforming Service Reliability Through an SLOs-Driven Culture and Platform.”) Service ownership encompasses the following steps:
- Architect and define a service in a Service Registry
- Instrument the service to emit READS and other operational telemetry
- Define SLIs/SLOs and SLO alerts for the service
- Deploy services in production, monitor them, and visualize service health
- Analyze service health and incorporate learnings to achieve higher SLOs
Managing SLO Onboarding with GitOps
To enable the third step in the above lifecycle, defining SLIs/SLOs and alerts for a given service, we built a proprietary configuration-as-code tool (referred to as slo-cfg-cli) for service owners to onboard and manage SLOs, SLO templates, SLO/burn rate alerts, and other observability resource configurations centrally in Git. This is an extension of the GitOps model we follow at Salesforce to deploy and manage our services. Based on Git pull request (PR) workflows, the observability configs are published, via API, into the respective internal services that use the configs to power real-time SLI monitoring dashboards, SLO alert evaluations, and SLO analytics.
While there are alternatives to the GitOps approach, we preferred this model to manage SLO configs, to leverage the following benefits:
- It enables us to reuse configs across multiple SLO-driven use cases in a uniform and consistent way for all Salesforce services.
- It creates a single source of truth for observability configs, meaning there is no config drift in production due to manual edits.
- It gives us granular but continuous integration and delivery of changes, using Git workflows, in specific environments.
- It allows for fine-grained authorization, peer reviews, and approvals using Git pull requests.
- It provides audit tracking for changes in Git.
Before we talk about a typical Git PR workflow for SLO onboarding, let us go over the SLO entity model at a high level, along with brief examples of the SLO definition language. Note that, at Salesforce, we use YAML as the markup for service owners to onboard observability resource definitions, for YAML’s variable substitution, templating and other expressive capabilities. That said, the conceptual model can be implemented via any infrastructure as code (IaC) provider, such as Terraform.
SLI (Service Level Indicator) – a carefully defined quantitative measure of some aspect of the level of service that is provided, such as availability, latency, etc.
SLO (Service Level Objective) – a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound. An SLO definition encapsulates both the SLI and the corresponding thresholds in one file, as below.
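To make the "SLI plus thresholds in one definition" idea concrete, here is a minimal Python sketch. It is not the actual Salesforce YAML schema: the class name, field names, metric expression syntax, and threshold values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SloDefinition:
    """Hypothetical SLO definition: one SLI and its thresholds together."""
    service: str
    sli_type: str                        # e.g. "availability" or "latency"
    metric_expression: str               # assumed query syntax, for illustration only
    lower_bound: Optional[float] = None  # SLO is violated below this value
    upper_bound: Optional[float] = None  # SLO is violated above this value

    def is_met(self, sli_value: float) -> bool:
        """Check lower bound <= SLI <= upper bound, skipping unset bounds."""
        if self.lower_bound is not None and sli_value < self.lower_bound:
            return False
        if self.upper_bound is not None and sli_value > self.upper_bound:
            return False
        return True


# Availability SLO: the SLI (a percentage) must stay at or above 99.9.
availability_slo = SloDefinition(
    service="my-service",
    sli_type="availability",
    metric_expression="availability{service=my-service,region=$region,environment=$environment}",
    lower_bound=99.9,
)

# Latency SLO: the SLI (milliseconds) must stay at or below 300.
latency_slo = SloDefinition(
    service="my-service",
    sli_type="latency",
    metric_expression="latency_p95{service=my-service,region=$region,environment=$environment}",
    upper_bound=300.0,
)
```

Keeping the expression and its bounds in one object mirrors the single-file layout described above, so a reviewer sees the indicator and its target side by side in the same PR.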
SLO Alert – encapsulates the inputs necessary to evaluate an SLO violation for a given service and notify the right responders, such as the following. (Note that these inputs are applicable to any alert, in general.)
- General metadata, such as metric expression and cron frequency
- Trigger metadata that represents the SLO logic, for example, warning/critical trigger when warning/critical threshold criterion is satisfied
- Routing metadata: from simple rules, such as sending a PagerDuty alert to a given PagerDuty ID (PDID) related to a service for critical triggers, to complex rules, such as grouping multiple alerts to reduce alert fatigue and make alerts actionable
- Permission metadata, i.e. which users are allowed to manage the alert definition
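The four groups of alert inputs listed above can be sketched as a single structure. This is a hypothetical Python model, not the real alert schema: the field names, the cron string, the threshold values, and the routing targets are placeholders, and it assumes an availability-style SLI where lower values are worse.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SloAlert:
    # General metadata
    name: str
    metric_expression: str
    cron: str
    # Trigger metadata (assumes a lower SLI value is worse)
    warning_threshold: float
    critical_threshold: float
    # Routing metadata: severity -> notification target
    routing: Dict[str, str] = field(default_factory=dict)
    # Permission metadata: who may manage this definition
    owners: List[str] = field(default_factory=list)

    def evaluate(self, sli_value: float) -> Optional[str]:
        """Return the triggered severity, or None if the SLO is healthy."""
        if sli_value < self.critical_threshold:
            return "critical"
        if sli_value < self.warning_threshold:
            return "warning"
        return None

    def route(self, sli_value: float) -> Optional[str]:
        """Return the notification target for the triggered severity, if any."""
        severity = self.evaluate(sli_value)
        return self.routing.get(severity) if severity else None


alert = SloAlert(
    name="my-service-availability",
    metric_expression="availability{service=my-service}",
    cron="*/5 * * * *",  # evaluate every five minutes
    warning_threshold=99.95,
    critical_threshold=99.9,
    routing={
        "critical": "pagerduty:PDID-PLACEHOLDER",
        "warning": "slack:#my-service-alerts",
    },
    owners=["my-service-team"],
)
```

This sketch only covers the simple routing case, one target per severity. Grouping multiple alerts to reduce alert fatigue would need additional state that is out of scope here.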
Service Instance Dimensions – dimensions that uniquely identify a service instance (say, an individual cluster for a service). This is the granularity at which we report SLIs, apply SLOs, and set up the SLO/burn rate alerts for every service. In the example above, note the $region and $environment variables in the metric expression, which are the instance dimensions for my-service.
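As a rough illustration of how instance dimensions turn one expression into one series per service instance, the sketch below expands $region and $environment with Python's standard string.Template. The expression syntax and the dimension values are assumptions; the actual platform may resolve these variables quite differently.

```python
from itertools import product
from string import Template
from typing import Dict, List


def expand_instances(expression: str, dimensions: Dict[str, List[str]]) -> List[str]:
    """Return one concrete expression per combination of dimension values."""
    names = sorted(dimensions)  # fixed order so the output is deterministic
    template = Template(expression)
    value_lists = [dimensions[name] for name in names]
    return [
        template.substitute(dict(zip(names, combination)))
        for combination in product(*value_lists)
    ]


# Hypothetical expression and dimension values for my-service.
expression = "availability{service=my-service,region=$region,environment=$environment}"
instances = expand_instances(
    expression,
    {"region": ["us-east", "eu-west"], "environment": ["prod"]},
)
```

Each resulting string identifies one service instance, which is the unit an SLO and its alerts would be applied to.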
Make it Easy
While we enable service owners to onboard SLO configs using self-service workflows, given the scale of Salesforce services, it is critical that we simplify the DX for so many service teams by reducing onboarding time, repetitive effort, and complexity, while allowing them to leverage best practices and improve the accuracy of their definitions. One way we made onboarding easy is by introducing templates for SLIs and SLO alerts. These templates are managed by a team of internal SLO experts, and service owners use them to onboard SLOs and SLO alerts for their individual services.
An SLI template is a reusable template that can be leveraged by multiple services for an SLI type such as availability, provided the constituent metrics are emitted with standard naming conventions. For example, the expression from the SLO definition snippet above could be abstracted out as a template, using replacements such as serviceNamePlaceholder, autoGeneratedInstanceDimensions, and the standard metric name, as in the snippet below. These replacements are automatically resolved by the template engine, based on the service definition and instance dimensions, when a service owner sets up an SLO definition for their service based on the template. This way, service owners do not need to spend effort crafting metric expressions from scratch. Note that these templates also provide an extensible mechanism for service owners to configure other standard functionality, such as the method (aggregateFunction) used to compute the daily aggregate of the SLI for operational reviews.
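The replacement step can be sketched in a few lines of Python. The placeholder names serviceNamePlaceholder and autoGeneratedInstanceDimensions and the aggregateFunction key come from the description above; the dictionary layout, the expression syntax, and the resolve_sli_template helper are assumptions for illustration, not the real template engine.

```python
from typing import Dict, List

# Hypothetical SLI template, as a shared definition maintained by SLO experts.
SLI_TEMPLATE: Dict[str, str] = {
    "sliType": "availability",
    "expression": "availability{service=serviceNamePlaceholder,autoGeneratedInstanceDimensions}",
    "aggregateFunction": "avg",  # how the daily aggregate is computed
}


def resolve_sli_template(
    template: Dict[str, str],
    service_name: str,
    instance_dimensions: List[str],
) -> Dict[str, str]:
    """Fill in the service name and instance dimensions for one service."""
    # Each dimension becomes "name=$name", to be bound per instance later.
    dimension_clause = ",".join(f"{dim}=${dim}" for dim in instance_dimensions)
    expression = (
        template["expression"]
        .replace("serviceNamePlaceholder", service_name)
        .replace("autoGeneratedInstanceDimensions", dimension_clause)
    )
    # Return a new dict so the shared template is never mutated.
    return {**template, "expression": expression}


resolved = resolve_sli_template(SLI_TEMPLATE, "my-service", ["region", "environment"])
```

Returning a copy rather than editing the template in place matters here because one template is shared by many services.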
As you may have observed in the previous section, service owners already define the important metadata required for SLO alerts, except the alert routing metadata, as part of their SLO definition. Given this, it is imperative that the observability platform has a way to reuse this metadata to set up SLO violation alerts, so that service owners do not have to maintain common configurations in two places. While at it, the default alert routing metadata for every service can also be set up automatically in an integrated and consistent way, obviating the need for every service owner to manually interface with the multiple systems involved in the process.
We introduced the SLO Alert Templates framework to address these critical requirements. Service owners generate service-specific SLO alert definitions from these templates with just one command. By default, the templates cross-reference SLO inputs using special macros, as shown below. The SLO alert definitions generated from templates are kept up to date in response to events such as SLO definition changes and template changes. In particular, when new service instances are discovered or old instances are decommissioned, the definitions are updated automatically without service owner intervention.
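A minimal sketch of macro cross-referencing and per-instance generation follows. The {{slo.field}} macro syntax, the field names, and both helper functions are hypothetical; they only illustrate the idea that alert definitions are derived from the SLO definition rather than maintained separately.

```python
import re
from typing import Dict, List

# Hypothetical alert template: every value refers back to the SLO definition.
ALERT_TEMPLATE: Dict[str, str] = {
    "name": "{{slo.service}}-{{slo.sliType}}-slo-violation",
    "expression": "{{slo.expression}}",
    "criticalThreshold": "{{slo.threshold}}",
}

MACRO = re.compile(r"\{\{slo\.(\w+)\}\}")


def render_alert(template: Dict[str, str], slo: Dict[str, object]) -> Dict[str, str]:
    """Replace each {{slo.field}} macro with the matching SLO field value."""
    def fill(value: str) -> str:
        return MACRO.sub(lambda match: str(slo[match.group(1)]), value)
    return {key: fill(value) for key, value in template.items()}


def generate_alerts(
    template: Dict[str, str],
    slo: Dict[str, object],
    instances: List[str],
) -> Dict[str, Dict[str, str]]:
    """Build one alert definition per currently known service instance."""
    alerts = {}
    for instance in instances:
        alert = render_alert(template, slo)
        alert["instance"] = instance
        alerts[f"{alert['name']}-{instance}"] = alert
    return alerts


slo = {
    "service": "my-service",
    "sliType": "availability",
    "expression": "availability{service=my-service}",
    "threshold": 99.9,
}
alerts = generate_alerts(ALERT_TEMPLATE, slo, ["us-east-prod", "eu-west-prod"])
# Re-running with the current instance list drops a decommissioned instance.
alerts_after_decommission = generate_alerts(ALERT_TEMPLATE, slo, ["us-east-prod"])
```

Because the output is a pure function of the template, the SLO definition, and the instance list, regenerating on any of those change events is enough to keep the alert definitions current.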
Sample generated SLO alert definition files for a service look as below. SLO alert definitions are published to our observability platform automatically by the Git PR workflows outlined in the next section.
Git PR Workflow
The figure below shows how the end-to-end lifecycle of SLO and SLO alert onboarding works, from a service owner perspective. Key wins with an automated PR workflow are:
- improved developer productivity, by reducing CLI setup time (including authN/authZ) and interaction, and by integrating validations and persistence via Git workflows
- elimination of scenarios where developers unwittingly forget to commit configs that they pushed via the CLI
- verifiability, with dry run mode and draft versions of dashboards for validation
Apart from the functional solution described above, a few aspects need to be kept in mind while designing an SLO config management solution, such as
- schema + API first approach to manage every resource in a system-of-record
- intuitive folder structure to organize and navigate the resources, by service
- ability for service owners to define custom SLIs and SLO alerts, in a composable way
- seamless multi-substrate support, i.e. ability to reuse and override configs for a service across multiple substrates, per user needs
- admin approvals, specific to critical tier services, integrated within the PR process
- ability for service owners to customize escalation policies for their team/service
- a connected and intuitive workflow for alert triaging, integrated with run books, debugging dashboards, etc.
- extremely intuitive and accessible documentation built within the workflows, resource definition files, and central documentation hubs
- Slack notifications for PR workflow results, to keep service owners in the loop
This project could not have become a reality without a significant collaborative effort across multiple internal service teams within the observability space at Salesforce. Huge shoutout to the engineering and product teams that are part of this initiative 👏 👏 We continue to work together on refinements to ease SLO config management for service owners and enable them to further embrace service ownership.