The open source Kubernetes project has become an integral part of our data center and cloud infrastructure, and I wanted to learn more about the history behind this. I sat down with Steve Sandke, Principal Architect at Salesforce, to hear how his team is using Kubernetes, why they chose it, lessons they’ve learned, and more about our open source contributions to the project. Before we dive into his answers, a quick heads up that we’ve sponsored chapters 10 & 12 of the Kubernetes Cookbook, so you can get those for free!
How is your team using Kubernetes?
We, like many companies, are on a journey from a monolithic application architecture towards a more fine-grained microservices model. We’ve been refactoring functionality into microservices for quite a while now, as well as building new microservices. We found that doing this was stressing our underlying infrastructure, which was oriented more towards the needs of our earlier monolithic applications. We needed a mechanism that allowed us to scale our service teams’ ability to introduce services, iterate quickly, and deploy at scale. We decided to leverage Kubernetes as the substrate that would enable this.
Our services run in data centers all over the world, in most cases on “bare metal” servers. We deploy and run Kubernetes directly atop bare metal, integrating it into the rest of the Salesforce infrastructure. We’ve been working with Kubernetes since 1.0, and getting the bare-metal deployments reliable and consistent took considerable effort; this is certainly an area that would be easier if we were starting now.
A key goal of this effort is that service owners are able to control their own deployment cadence, while still maintaining the high level of Trust our customers expect. In addition, we want zero human involvement required to actually execute deployments. Our approach to this is known in the industry as “GitOps”. Service owners describe the desired state of their services (e.g. “version 2.4 of my service should be running in Paris and Frankfurt”) declaratively, using files checked into a git repo. We’ve implemented an approval process (leveraging git pull requests) which ensures that all necessary release approvals are in place. Once approved, deployments are completely automated: the relevant Docker images are promoted to the right places, and Kubernetes deploys them as requested.
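The interview doesn’t show the actual file format, so to make the idea concrete, here is a minimal hypothetical sketch in Go of a desired-state record like the “version 2.4 in Paris and Frankfurt” example, as deployment automation might parse it out of the git repo. The struct, field names, and service name are all illustrative assumptions, not Salesforce’s real schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ServiceRelease is a hypothetical desired-state record, as a service
// owner might check it into the GitOps repo. Field names are
// illustrative, not Salesforce's actual schema.
type ServiceRelease struct {
	Service     string   `json:"service"`
	Version     string   `json:"version"`
	Datacenters []string `json:"datacenters"`
}

func main() {
	// "Version 2.4 of my service should be running in Paris and Frankfurt."
	raw := []byte(`{
		"service": "quote-engine",
		"version": "2.4",
		"datacenters": ["paris", "frankfurt"]
	}`)

	var rel ServiceRelease
	if err := json.Unmarshal(raw, &rel); err != nil {
		panic(err)
	}

	// Once the pull request is approved, automation would promote the
	// matching image and roll out the service in each data center.
	for _, dc := range rel.Datacenters {
		fmt.Printf("deploy %s:%s to %s\n", rel.Service, rel.Version, dc)
	}
}
```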
Salesforce has internal management systems which provide things such as certificates and secrets to applications, whether they’re running in Kubernetes, on bare metal, or in other ways. We provide simple integrations with these systems for services running under Kubernetes, which frees our service owners from all the nitty-gritty details normally required to consume them.
The end result is that we have a community of service owners able to manage their own services, deploying them to data centers across the world at their own cadence. Introducing new services is relatively easy, enabling us to be much more agile.
What are the reasons you decided to use Kubernetes?
Back in 2015, we’d identified the need for a mechanism that would allow us to scale our service teams’ ability to introduce services, iterate quickly, and deploy at scale. We were researching how best to do this, and while we had team members who’d built systems like this for other companies, we knew there were open source offerings that held significant promise. We spent a few months researching options, including Kubernetes.
I attended the Kubernetes 1.0 launch at OSCON, and it was clear from the start that there was a broad and very strong community forming around it. Furthermore, talking to some of the key project members, the underlying approach was clearly sound, and built on proven strategies learned at Google. I was sold, and on the plane back I wrote a “modest proposal” to take a strong bet on Kubernetes. After a little bit of discussion, my management chain agreed, and we were off!
A key selling point of my proposal was the speed at which the project was moving. It was clear even then that if there were feature gaps, the community would probably address them before we needed them. (For example, the Kubernetes Deployment object appeared just as we were starting the design of what would have been the equivalent.) Also, given the strong community focus, we saw the opportunity to participate as needed. We’ve been able to contribute code (more on that below) and were also happy to support the community by joining the CNCF last year as a Gold member.
Today it is clear that our bet paid off: Kubernetes is the obvious choice for the future of our microservice infrastructure management.
What have you learned running Kubernetes in production?
We need monitoring everywhere. With distributed systems, you can always expect the unexpected to happen. Disks fill up, NICs go haywire, machines go down, and application software crashes. This is true at both the infrastructure and application levels. We provide Kubernetes as a service to our service owners; this means we need to monitor our infrastructure carefully. When things do go wrong, one of the first things we consider in our postmortems is “how could we have detected this (and ideally repaired it) automatically?” We’re continually improving here; the journey is far from over.
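The interview doesn’t name a monitoring stack, but as one illustration of the kind of infrastructure-level signal described above, here is a minimal sketch using the Prometheus Go client (an assumption for illustration, not necessarily what Salesforce runs) to expose a disk-usage gauge that alerting, or auto-repair, could watch:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// diskUsageRatio tracks one of the failure modes called out above:
// disks filling up. Metric and label names are illustrative.
var diskUsageRatio = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "node_disk_usage_ratio",
	Help: "Fraction of disk space in use, per device.",
}, []string{"device"})

func main() {
	// In a real exporter this value would be sampled from the filesystem.
	diskUsageRatio.WithLabelValues("sda1").Set(0.87)

	// Expose /metrics for scraping; alert rules can then page an
	// operator (or trigger automated repair) past a threshold.
	http.Handle("/metrics", promhttp.Handler())
	_ = http.ListenAndServe(":9100", nil)
}
```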
We provide a lightweight abstraction over Kubernetes to our services. In the beginning, the abstraction hid almost all of Kubernetes; over the years, we’ve found that our customers inexorably demand more and more of the power of “raw” Kubernetes. We still leverage the abstraction, but Kubernetes bleeds through a whole lot more now.
We chose early on to run our own control software in the Kubernetes cluster it’s controlling. This has proved quite useful: our control software gets the same great development and deployment experience as any other application, which is so much easier than the alternative.
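Running control software inside the cluster it manages is well supported by client-go: a pod can pick up its service-account credentials automatically, with no external kubeconfig. A minimal sketch of that standard pattern (using current client-go; the example list call is just a placeholder for real control logic):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// InClusterConfig reads the service-account token and CA that
	// Kubernetes mounts into every pod, so a controller deployed in
	// the cluster it manages needs no external credentials.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Placeholder for real control logic: list pods across namespaces.
	pods, err := clientset.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("cluster is running %d pods\n", len(pods.Items))
}
```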
We leverage the CRD/Controller pattern heavily in our software stack. We’re benefiting from all the features of the Informer framework to write fast and efficient controllers; client-go has greatly evolved since we started!
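As a concrete illustration of the Informer framework, here is a minimal client-go sketch (written against current client-go, which has indeed evolved a lot) that watches Pods through a shared, watch-driven cache instead of polling the API server; a real controller would enqueue reconcile work from these handlers:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// The shared informer factory maintains a local cache fed by
	// watches, which is what makes controllers fast and efficient.
	factory := informers.NewSharedInformerFactory(clientset, 0)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Printf("pod added: %s/%s\n", pod.Namespace, pod.Name)
		},
		DeleteFunc: func(obj interface{}) {
			// A real controller would enqueue a reconcile here.
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, podInformer.HasSynced)
	select {} // block forever; handlers fire as events arrive
}
```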
Back up your etcd data. (We learned this the hard way early on in one of our lab environments.)
Another big lesson is that the Kubernetes community is amazing! We’ve learned to rely more on the community to help us address our needs, and to work with them on solutions to shared problems. As I mentioned above, things like the Deployment object appeared right as we discovered the need for them.
It is clear that the network effects driven by the community around Kubernetes are a powerful force that we must work with. So we’ve been contributing more and more to Kubernetes and that community.
What open source contributions has your team made to Kubernetes?
Having team members contributing code to Kubernetes is a huge win-win. Our team gains familiarity with the code base, increasing our confidence. We give back to the product and community. And we benefit from building on a de facto industry standard instead of building something proprietary, which would make hiring and onboarding hard.
One of the potential attack vectors we identified with using Kubernetes was that we couldn’t easily control the group ID that the Docker subprocess ran as. Since the default was to run as root, we needed to fix that. We made a proposal, sent a PR with the API changes, made a variety of other related fixes, and got the change (under a feature flag) into the 1.10 release.
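This appears to correspond to the RunAsGroup securityContext field, which shipped as alpha behind a feature gate in Kubernetes 1.10. As a hedged sketch of how it is used, here is a pod spec built with the Go API types that pins the group ID alongside the user ID (the UID/GID values and image name are illustrative):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func int64Ptr(i int64) *int64 { return &i }

func main() {
	// With the (1.10-era, alpha) RunAsGroup feature gate enabled, the
	// container's primary group ID can be set explicitly instead of
	// defaulting to root. Values here are illustrative.
	spec := corev1.PodSpec{
		SecurityContext: &corev1.PodSecurityContext{
			RunAsUser:  int64Ptr(1000),
			RunAsGroup: int64Ptr(3000),
		},
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "example/app:2.4",
		}},
	}
	fmt.Printf("pod will run as uid=%d gid=%d\n",
		*spec.SecurityContext.RunAsUser, *spec.SecurityContext.RunAsGroup)
}
```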
In addition, we’ve contributed a variety of other improvements and bug fixes, and we’re working on more. It is great to be part of a friendly and helpful community!
Thanks, Steve, for your insightful details about our use of Kubernetes! If you’d like more details, check out the recording of Steve’s talk from KubeCon 2017. Also, don’t forget to grab our sponsored chapters of the Kubernetes Cookbook. Let us know how your KubeCooking goes.