Happy services are all alike.
In particular, happy services share a few common characteristics and success criteria. In Salesforce Security Engineering, our teams retrospected on these similarities and identified three positive “tells” for service health. Now, our engineering efforts explicitly invest in these measurable attributes:
- SLA attainment
- Engagement
- Code velocity
less > more: “If you have any more than three priorities, you don’t have any.”
Jim Collins
The right metrics help our service owners, our leaders, and our customers. Understanding what matters empowers service owners to act autonomously for results and to prioritize: “Does it move the metric? Do that first.” Alignment on metrics enables leadership to perceive service value and quality: “Are we meeting customer needs? Are we building the right product? Are we delivering value at a rate that makes sense?” Customers trust dependencies that are honest and lose faith in those that are not: “If I cannot use your service successfully, how can you be reporting green?!” If customer pain is not captured in the metric, then the metric is a lie. When we constrain ourselves to only three priorities, we are forced to identify keystone metrics. The above three attributes trigger cascading positive outcomes.
SLA Attainment: “Don’t mistake activity for achievement.”
John Wooden
Services promise value to customers, and that promise is captured in a service-level agreement (SLA). Consider the SLA for GCP Cloud KMS: “…[Cloud KMS] will provide a Monthly Uptime Percentage to Customers of 99.5% for Encrypt, Decrypt, and Sign operations…” The SLA has fine print, but the promise is clear. In our organization, we define SLA attainment as our ability to deliver on the promise: “In a given reporting period (typically 1 week), did your customers receive SLA-quality service?”
Customers depend upon the SLA — a clear, measurable promise — to inform their own decisions. For the above example, a customer may ask, “If KMS is expected to be down for a period of time, should I build retry logic into my clients? Should I consider a design with multiple keys, assuming that they do not fail at the same time?” When our engineering teams define an SLA, the structure of the promise matters, our ability to deliver the promise matters, and our ability to know that we broke the promise matters. A reasonable goal for a team could be to achieve SLA for 95% of customers in a week. If we hold ourselves to a high bar, we can expect to not be perfect — we leave room for progress. We can dive deep on the 5% that fell outside SLA and find ways to improve.
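To make attainment concrete, here is a minimal sketch of the weekly calculation, assuming per-customer request counts have already been aggregated; the schema, the tenant names, and the 99.5% availability threshold are illustrative, not our production implementation:

```python
# Hypothetical sketch of weekly SLA attainment: the fraction of customers who
# received SLA-quality service in the reporting period. Field names and the
# 99.5% availability threshold are illustrative, not a production schema.
from dataclasses import dataclass


@dataclass
class CustomerWeek:
    customer_id: str
    successful_requests: int
    total_requests: int

    @property
    def availability(self) -> float:
        if self.total_requests == 0:
            return 1.0  # no traffic: treat the customer as within SLA
        return self.successful_requests / self.total_requests


def sla_attainment(weeks: list[CustomerWeek], slo: float = 0.995) -> float:
    """Fraction of customers whose weekly availability met the SLO."""
    if not weeks:
        return 1.0
    met = sum(1 for week in weeks if week.availability >= slo)
    return met / len(weeks)


report = [
    CustomerWeek("tenant-a", successful_requests=99_900, total_requests=100_000),
    CustomerWeek("tenant-b", successful_requests=9_940, total_requests=10_000),
]
# Compare against the team goal, e.g. "95% of customers receive SLA-quality service."
print(f"SLA attainment this week: {sla_attainment(report):.1%}")  # 50.0%
```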
Reporting SLA attainment builds trust between service owners and customers, as well as service owners and leadership:
- Automated measurement implies automated detection and engagement before a customer notices. Transparency precedes trust. Trust in our ability to deliver grows when we proactively inform our customers of availability flickers.
- A shared definition of success held by the team, leadership, and customers clarifies the daily decisions and trade-offs we make as service owners.
- Teams can perform without traditional organizational paranoia (“Everything is great over here! Please don’t look too closely!”). An engineering leadership that expects and welcomes the service-level hormesis that follows incidents is a leadership that shares priorities with service teams. Measures exist in part to inform our planning and our investments to improve.
The specific SLA measurement depends on the service, its goals, and particular customer expectations. For example, an SLA for a PKI certificate issuance/renewal service could be, “99.9% of all certificates on healthy clients are renewed before expiration.”
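A minimal sketch of how that example SLA might be computed, assuming each certificate record carries a client-health flag, an expiration time, and an optional renewal time (all field names and dates are hypothetical):

```python
# Hypothetical sketch for the certificate-renewal SLA: what fraction of
# certificates on healthy clients were renewed before they expired?
from datetime import datetime

certs = [
    # (client_healthy, expires_at, renewed_at or None)
    (True, datetime(2023, 6, 1), datetime(2023, 5, 30)),  # renewed in time
    (True, datetime(2023, 6, 1), None),                   # never renewed
    (False, datetime(2023, 6, 1), None),                  # unhealthy client: excluded
]

eligible = [(expires, renewed) for healthy, expires, renewed in certs if healthy]
renewed_in_time = sum(1 for expires, renewed in eligible
                      if renewed is not None and renewed < expires)
attainment = renewed_in_time / len(eligible) if eligible else 1.0
print(f"Renewed before expiration: {attainment:.1%}")  # 50.0%, vs. a 99.9% target
```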
SLAs (and associated metrics) often advance as the service matures. After a period of time (e.g. a quarter or semester), if business needs demand, a service might be able to “deliver another nine.” A service could strive to increase the SLA attainment target — for example, from 95% to 98%. The example certificate issuance service could drastically reduce certificate lifetime — for example, from 7 days to 24 hours — a policy change that would impact SLA attainment without modifying the SLA statement. Some services will need more than one SLA.
Every service can and should measure SLA attainment. Our goal is to make sure that all our services define and measure SLA, and that those measurements accurately reflect the customer experience.
Engagement: “The winner is the first company to deliver the food the dogs want to eat.”
Andy Rachleff
When customers are adopting and benefiting from your service, we say that they are engaged. Service engagement is a function of customer trust, value delivery, and quality. Engagement should measure actual customer interaction with a service, rather than approximate customer intent. In our teams, example engagement metrics include “number of unique secrets accessed this week,” “number of unique workloads authenticating with a certificate today,” or “number of unique keys used for crypto operations in the last month.”
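As an illustration, a metric like “number of unique secrets accessed this week” reduces to a distinct count over a reporting window; the access-log schema below is a hypothetical sketch, not our production pipeline:

```python
# Hypothetical sketch: engagement as a distinct count over a reporting window,
# e.g. "number of unique secrets accessed this week". Log entries are illustrative.
from datetime import datetime, timedelta

access_log = [
    {"ts": datetime(2023, 5, 1, 9, 30), "secret_id": "db-password"},
    {"ts": datetime(2023, 5, 1, 9, 31), "secret_id": "db-password"},  # repeat access
    {"ts": datetime(2023, 5, 3, 14, 0), "secret_id": "api-token"},
    {"ts": datetime(2023, 4, 20, 8, 0), "secret_id": "old-key"},      # outside the window
]

week_start = datetime(2023, 5, 1)
week_end = week_start + timedelta(days=7)
unique_secrets = {entry["secret_id"] for entry in access_log
                  if week_start <= entry["ts"] < week_end}
print(f"Unique secrets accessed this week: {len(unique_secrets)}")  # 2
```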
Prefer measures of engagement over measures of adoption. Adoption can quickly become a vanity metric, risking incorrect conclusions about service value. For example, tracking the “count of objects stored” could mask usability issues if 99% of the objects are only ever accessed once and then abandoned by the customers.
Every service can and should measure engagement. If you measure engagement, you begin to know your market, your customers, the problem you are solving, and the value you provide. You know if you are spending your time wisely, or if you should focus elsewhere.
Velocity: “If you measure one thing in your organization, measure your commit to deploy latency.”
Adrian Cockcroft
Shipping frequently has many benefits. Shipping rapidly has an obvious impact on SLA: you get to improve faster. It also impacts engagement: you deliver value faster, meeting market needs sooner. If shipping compromises SLA, you have surfaced the architectural defects and can mitigate them. Smaller deliveries are less destabilizing, and by shipping frequently we can invest to make deployment boring. Shipping is a keystone behavior with cascading positive outcomes. Poor velocity is a frequent symptom of disease.
Every service can and should measure code velocity. Visualize service deployment as a controlled brownout: a fixed portion of service capacity goes offline and is then replaced by new instances of the service. If deploying is boring, it means that the service delivers its SLA despite this rolling degradation. Conversely, announcing scheduled maintenance is a negative tell of bad architecture. Velocity bounds time-to-iterate, speaks to an optimized developer experience, and catalyzes improvements to SLA attainment and engagement. Healthy velocity depends on upstream validation, meaning velocity is a first derivative of quality.
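As a sketch of the measure Cockcroft recommends, assuming each production deploy records the commit timestamp of the change it ships (the records below are invented), commit-to-deploy latency per reporting period is a simple aggregation:

```python
# Hypothetical sketch: commit-to-deploy latency per shipped change. In practice
# the timestamps come from the SCM and the CD pipeline; these are illustrative.
from datetime import datetime
from statistics import median

deploys = [
    # (commit_time, deploy_time)
    (datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 11, 30)),
    (datetime(2023, 5, 1, 16, 0), datetime(2023, 5, 2, 9, 0)),
    (datetime(2023, 5, 2, 12, 0), datetime(2023, 5, 2, 13, 0)),
]

latencies_hours = [(deployed - committed).total_seconds() / 3600
                   for committed, deployed in deploys]
print(f"Deploys this period: {len(deploys)}")
print(f"Median commit-to-deploy latency: {median(latencies_hours):.1f} h")  # 1.5 h
```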
To help focus on the metrics, recast the duties of a service owner in these measurable terms: “I continuously improve my service to keep the promises I’ve made to my engaged customers.” Measure the rate of improvement. Measure the rate of promises kept. Measure the customer engagement. Be a happy service owner.