7 Ways We Put Kubernetes to Work at Salesforce

Salesforce Principal Architect Steve Sandke has been fired up about Kubernetes ever since he attended the Kubernetes 1.0 launch at OSCON in 2015, where it was clear that a broad and very strong community was forming around it. On the plane ride home from that conference, he started drafting a proposal to his management chain to take a strong bet on Kubernetes, which jumpstarted our usage of the technology. We became an even bigger part of the community when we joined the Cloud Native Computing Foundation (CNCF) in 2017, and we are excited to be at KubeCon + CloudNativeCon North America this week as a Platinum Sponsor. Here are 7 things Kubernetes helps us do across multiple product lines and within our underlying infrastructure.

Deploy and Monitor Quickly and Reliably
Kubernetes allows us to deploy and monitor applications quickly and reliably, while treating the underlying infrastructure as ephemeral. When an underlying node fails, Kubernetes schedules around it. When we need to patch a node, Kubernetes can drain it cleanly. When a new application needs to be deployed, all we need to do is build a container using our continuous integration infrastructure and point some Kubernetes resources at it. Because Kubernetes is a fault-tolerant platform, using it allows us to host highly scalable applications and run upgrades and patches with zero downtime.
Debug More Easily
We got the idea for Sloop, our open source tool for visualizing Kubernetes history, from our experiences building 25+ production Kubernetes clusters and supporting large numbers of internal customers. Sloop lets you view Kubernetes resources over time, which makes it easy to debug historic incidents, including ones that involve entities that no longer exist. This greatly improves a number of common debugging tasks, including finding info and events related to pods that have been deleted (for example, pods replaced by newer pods from a deployment update or pods evicted from a lost node) and viewing rollout details, such as the timing of pod replacements and health signals. Sloop retains and displays Kubernetes events on resources for weeks, well beyond the ~1 hour that Kubernetes keeps them.
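To make the idea of retaining events for deleted resources concrete, here is a minimal Python sketch. It is not Sloop's implementation; the Event and EventHistory names, their fields, and the two-week retention window are all hypothetical, chosen only to illustrate keeping a per-resource timeline that outlives the resource itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; not Sloop's or Kubernetes' actual schema."""
    timestamp: float  # seconds since some epoch
    resource: str     # e.g. "pod/web-1"
    reason: str       # e.g. "Scheduled", "Killing"


class EventHistory:
    """Keeps events per resource for a fixed retention window."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self._events = defaultdict(list)

    def record(self, event):
        # Events are stored even if the resource is later deleted.
        self._events[event.resource].append(event)

    def prune(self, now):
        # Drop only events older than the retention window.
        cutoff = now - self.retention_seconds
        for resource in list(self._events):
            kept = [e for e in self._events[resource] if e.timestamp >= cutoff]
            if kept:
                self._events[resource] = kept
            else:
                del self._events[resource]

    def history(self, resource, start, end):
        # Return the resource's events in [start, end], oldest first.
        matches = [e for e in self._events.get(resource, [])
                   if start <= e.timestamp <= end]
        return sorted(matches, key=lambda e: e.timestamp)


TWO_WEEKS = 14 * 24 * 3600  # assumed retention window, in seconds

store = EventHistory(retention_seconds=TWO_WEEKS)
store.record(Event(0, "pod/web-1", "Scheduled"))
store.record(Event(100, "pod/web-1", "Killing"))  # pod is gone after this
```

Three hours after the pod was deleted, store.history("pod/web-1", 0, 200) still returns both events, which is the kind of lookup that a one-hour event window makes impossible.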
Allow Service Owners to Control Their Own Deployment Cadence
Our services run in data centers all over the world, in most cases on “bare metal” servers. We deploy and run Kubernetes directly atop bare metal, integrating it into the rest of the Salesforce infrastructure. We’ve been working with Kubernetes since 1.0, and making the bare-metal deployments reliable and consistent took considerable effort; this is certainly an area that would be easier if we were starting now. A key goal of this effort is that service owners are able to control their own deployment cadence, while still maintaining the high level of Trust our customers expect. Salesforce has internal management systems that provide things such as certificates and secrets to applications, whether they’re running in Kubernetes, on bare metal, or in other ways. We provide simple integrations with these systems for services running under Kubernetes, which frees our service owners from the nitty-gritty details normally required to consume them. The end result is a community of service owners who manage their own services and deploy them to data centers across the world at their own cadence. Introducing new services is relatively easy, which makes us much more agile. Read more in Adopting Kubernetes.
Support Multiple Languages on the Heroku PaaS
Buildpacks were conceived seven years ago as a way to make Heroku a polyglot platform supporting Ruby, Java, Node.js, and more. They helped us create a modular build system that was open to our users; anyone can write a buildpack and run it on the Heroku Platform-as-a-Service (PaaS). What’s more, any other PaaS can run buildpacks. Some platforms, such as Cloud Foundry, have adopted buildpacks as the primary mechanism in their deployment pipelines, while others, such as Knative, offer them as an option. The result is a widely adopted industry standard whose reach goes far beyond Heroku. Read more in Standardizing Heroku Buildpacks with CNCF.
Keep Pace with the Growing Demand for Storage
We use Ceph for block storage in Kubernetes. Ceph is a large-scale distributed storage system that provides efficient and durable storage for block images, among other things. Kubernetes has supported Ceph block images as Persistent Volumes (PVs) bound to block devices since 2016. To use Ceph images as volumes in Kubernetes, we rely on the RBD (RADOS Block Device) tool and the KRBD (Kernel RBD) module, where RADOS stands for Reliable Autonomic Distributed Object Store. We’re able to scale out storage software services, both in our first-party data centers and across other substrates, to keep pace with demand. Read more in Provisioning Kubernetes Local Persistent Volumes.
Process Telemetrics and Application Logs with Spark (at Scale!)
We wanted the ability to provide Spark clusters on demand, particularly for streaming use cases. We were looking at other uses for the engine as well. We wanted to create a separate Spark cluster for each of our internal users to maximize the isolation between them. So, we had some choices to make about what cluster manager to use and ultimately chose Kubernetes. We’re continually pushing the Spark-Kubernetes envelope. Right now, one of our challenges involves maximizing the performance of very high throughput batch jobs running through Spark. As we discover new issues running Spark at scale, we try to solve them and contribute those solutions to the Spark and Spark Operator projects. To learn more about our work with Spark on Kubernetes, watch the webinar.
Scale Infrastructure Dynamically in Response to Application Needs
Pardot previously used statically provisioned virtual servers for rule processing, which applies rule-based actions to event data such as prospect form submissions. Because event data can spike in volume, that static infrastructure had to be over-provisioned to handle the spikes while maintaining our SLAs. Kubernetes, on the other hand, gives us the ability to scale the underlying infrastructure dynamically in response to application needs. Using the Kubernetes Horizontal Pod Autoscaler, we are able to automatically add and remove pods based on CPU utilization, memory utilization, and custom application metrics. As a result, we are saving 50–60% on infrastructure costs and have the flexibility to adjust in the future to accommodate both short- and long-term changes in utilization.
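For readers curious how the autoscaler arrives at a pod count, the Kubernetes documentation describes its core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipping any change when that ratio is within a small tolerance of 1.0. The Python sketch below illustrates only that rule; it is not Pardot's configuration. The function name, the replica bounds, and the sample numbers are assumptions, and the real autoscaler also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Illustrative version of the documented HPA scaling rule.

    current_metric and target_metric share one unit, for example average
    CPU utilization in percent. The bounds are hypothetical defaults.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")

    ratio = current_metric / target_metric

    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current count.
        desired = current_replicas
    else:
        # Scale proportionally to how far the metric is from the target.
        desired = math.ceil(current_replicas * ratio)

    # Never go outside the configured replica bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 pods averaging 90% CPU against a 60% target, desired_replicas(4, 90, 60) returns 6. When a spike passes and 10 pods average 20% against the same target, desired_replicas(10, 20, 60, min_replicas=2) returns 4, which is how capacity added for a spike is released instead of sitting idle.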