How Salesforce uses Immutable Infrastructure in Hyperforce

Credits go to: Armin Bahramshahry, Software Engineering Principal Architect @ Salesforce & Shan Appajodu, VP, Software Engineering for Developer Productivity Experiences @ Salesforce.

To leverage the scale and agility of the world’s leading public cloud platforms, our Technology and Products team at Salesforce has worked together over the past few years to build a new generation of infrastructure platform for Salesforce — one that uses cloud-native tools, deployment patterns, security, and processes. We call this architecture Hyperforce, and it’s available across all of our product lines. Here is the overview blog on Hyperforce.

Software is not permanent; it changes continuously. At the same time, we have an obvious responsibility to keep our products up and running at all times and minimize the risk of these changes impacting our customers. Being comfortable with constant change requires consistent quality. Instead of stopping the changes or accumulating them for longer, we opt to release them safely in smaller increments. This makes it easier to validate and maintain high trust for our customers and agility for our developers.

In this series of blogs, we delve into how teams release changes safely, covering four topics: (1) immutable infrastructure, (2) Infrastructure as Code, (3) safe releases, and (4) the developer experience of building secure and compliant services on Hyperforce.

Since its inception in 1999, Salesforce has been running services on a static set of servers in our data centers. Changes on these hosts (like operating system updates and service updates) were managed by operators using tools such as Puppet and Apache Ambari. These tools were geared towards mutable infrastructure, that is, infrastructure that you modify “in place,” changing binaries and configuration on the hosts over time. Upgrade processes like these are difficult to reason about because a failure can leave hosts partially changed, which makes recovery very complicated. The mutable nature of such deployments also creates a constant temptation for engineers to apply manual fixes for urgent issues. Unfortunately, these fixes are often forgotten, resulting in lingering configuration drift.

Immutable infrastructure addresses these problems by making our deployment mechanisms more idempotent and robust, so that a failed change does not leave hosts partially modified. Hyperforce provides capabilities that enable us to roll out changes safely and immutably, such as:

Infrastructure as a Service: Software-driven, virtualized infrastructure, where every part of the infrastructure (compute, network, and storage) can be provisioned and managed dynamically via API calls.
Elasticity: The ability to elastically add or remove infrastructure, based on demand.

Combined with the above capabilities, Virtual Machines (VMs) and Containers allow us to embrace a new immutable form of deployment. If a change is made to the code or configuration, a new image for the entire VM or Container is built and deployed as a unit, replacing older VMs and Containers (rather than changing them in place).
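
As a rough illustration of the paragraph above, the sketch below models a deployment unit as a frozen value whose identity is derived from its contents. The `Image` class, the `bake` function, and the version and configuration values are hypothetical names chosen for this example; they are not part of Hyperforce or of any cloud provider's API.

```python
import hashlib
import json
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Image:
    """A hypothetical immutable deployment unit: code plus configuration."""

    code_version: str
    config: tuple  # sorted (key, value) pairs, so the contents cannot be edited

    @property
    def image_id(self) -> str:
        # Derive the ID from the contents: any change yields a different image.
        payload = json.dumps([self.code_version, self.config])
        return hashlib.sha256(payload.encode()).hexdigest()[:12]


def bake(code_version: str, config: dict) -> Image:
    """Build a new image; existing images are never modified."""
    return Image(code_version, tuple(sorted(config.items())))


v1 = bake("1.4.0", {"log_level": "info"})
v2 = bake("1.4.0", {"log_level": "debug"})  # a config change means a new image

in_place_change_rejected = False
try:
    v1.code_version = "1.4.1"  # attempting an in-place change
except FrozenInstanceError:
    in_place_change_rejected = True
```

In this sketch a configuration change produces a second image with a different ID, while the first image stays exactly as it was built and can still be redeployed if a rollback is needed.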

The phrase “immutable infrastructure” sometimes causes confusion among our customers, so to be perfectly clear: “immutable” doesn’t refer to the contents of our services. Obviously, the data you enter into Salesforce is highly mutable (you can change it whenever you want!).

Instead, “immutable” refers to the resources (servers, containers, services, networks, and their respective code or configuration), which never change after deployment. Once a resource is in place, any update replaces it wholesale with a new version; we do not patch or change it directly in our production environment. Immutable deployments are a way of managing infrastructure that moves the unit of update from an individual binary (or a set of binaries) to an entire compute unit.

Immutable deployments require that:

Setup and deployment for every part and layer of your infrastructure are automated. This is made possible with the public cloud Infrastructure-as-a-Service capabilities.
You make zero manual changes to any part of a system once it’s deployed. All changes to code or configuration are applied by deploying a new system and tearing down the old one. This is made possible by public cloud elasticity: the extra servers are needed only while a deployment is in progress, so you don’t pay for twice as many servers all the time.

Immutable Infrastructure and deployments have several significant advantages:

Replacing a system at the lowest level forces you to depend on automation at every step of your deployments. This enforces repeatability and ensures that environments can be managed with minimal human intervention.
Completely replacing, instead of updating, an existing part of your infrastructure makes deployments less complex. As the desired state of the world is known, edge cases are reduced.
Immutable deployments are safer. An immutable deployment unit can be entirely tested in test and staging environments and then gradually released to Salesforce customers.
Immutable deployments also make patching far easier. Patches are built into the base operating system image baking pipeline and rolled out with the same automation and safety measures used for code and configuration changes. In effect, immutable deployments completely replace in-place patching.
Immutable deployments result in a more secure environment. Because we rebuild the system for each deployment, we constantly erase any foothold an attacker may have gained, forcing them to try to regain it.

In Hyperforce, we rely heavily on all of these benefits through the following practices:

Infrastructure-as-Code (IaC): All aspects of the service are managed via IaC, from build to deployment of the service and its related resources. Terraform is used via Spinnaker pipelines, where the state of infrastructure is managed in the Terraform service for each Hyperforce Instance. Managing infrastructure as code enables the safety we require. (Stay tuned for a future blog post on Hyperforce IaC!)
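
To make the idea of managing infrastructure from declared state more concrete, here is a minimal sketch of comparing a desired state (declared in code) with the actual state and deriving a change plan. This is not Terraform's algorithm or API; the `plan` function and the resource names are assumptions made purely for illustration. Note that a changed resource is marked for replacement rather than in-place modification.

```python
def plan(desired: dict, actual: dict) -> dict:
    """Compare declared and actual resources and list the required actions."""
    shared = desired.keys() & actual.keys()
    return {
        "create": sorted(desired.keys() - actual.keys()),
        # A differing definition leads to a replacement, not an in-place edit.
        "replace": sorted(name for name in shared if desired[name] != actual[name]),
        "delete": sorted(actual.keys() - desired.keys()),
    }


# Hypothetical resource definitions, as they might be declared in code.
desired_state = {
    "lb": {"port": 443},
    "web": {"image": "image-v2", "count": 3},
    "cache": {"size": "small"},
}
actual_state = {
    "lb": {"port": 443},
    "web": {"image": "image-v1", "count": 3},
    "queue": {"size": "small"},
}

changes = plan(desired_state, actual_state)
# changes == {"create": ["cache"], "replace": ["web"], "delete": ["queue"]}
```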

VM Deployments: When any part of the software needs to be updated, a new machine image is baked with the changes. Instead of deploying an updated binary onto an existing EC2 instance, a new EC2 instance is started from the new machine image, and the load balancer is pointed to the new server. The old server is then removed. Patching of existing servers is never permitted.
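
The replacement flow described above can be sketched as a small in-memory simulation. It does not call any AWS API: `Instance`, `LoadBalancer`, `launch`, `replace_deploy`, and the `is_healthy` callback are hypothetical stand-ins for the real instance, load balancer, and health-check machinery, and the health check before the switch is an assumption added for the example.

```python
from dataclasses import dataclass, field
from itertools import count

_instance_numbers = count(1)


@dataclass(frozen=True)
class Instance:
    """A simulated server; frozen, so it cannot be changed after launch."""

    instance_id: str
    image_id: str


@dataclass
class LoadBalancer:
    targets: list = field(default_factory=list)


def launch(image_id: str) -> Instance:
    """Start a new simulated instance from a machine image."""
    return Instance(f"i-{next(_instance_numbers):04d}", image_id)


def replace_deploy(lb: LoadBalancer, new_image_id: str, is_healthy) -> list:
    """Launch replacements, switch traffic, and return the retired instances."""
    replacements = [launch(new_image_id) for _ in lb.targets]
    if not all(is_healthy(instance) for instance in replacements):
        # The old instances were never touched, so they simply keep serving.
        raise RuntimeError("new instances are unhealthy; keeping old instances")
    retired, lb.targets = lb.targets, replacements
    return retired  # the caller removes these once traffic has moved


lb = LoadBalancer([launch("image-v1"), launch("image-v1")])
retired = replace_deploy(lb, "image-v2", is_healthy=lambda instance: True)
```

Because the old instances are left untouched until the new ones are serving, a failed rollout in this sketch needs no recovery step beyond discarding the replacements.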

Container Deployments: Leveraging Kubernetes, container deployments are immutable by default. The continuous integration pipelines build new container images, which are then deployed to Kubernetes via Spinnaker pipelines. The Kubernetes nodes, being EC2 instances, are themselves replaced with new EC2 instances running updated versions of the base operating system image.

Infrastructure Configuration: Leveraging Terraform and Kubernetes, the infrastructure configuration is also maintained in code. Environment-specific configuration is declared in code and wired via Spinnaker pipelines.

Zero Downtime Deployments: In Hyperforce, teams adopt safe zero-downtime deployment practices such as blue/green deployments, which ensure that changes are tested on the new service instance before customer traffic is switched to it. More advanced services perform “canary” deployments, where a small percentage of customer traffic is routed to new service instances and observed for regressions before the rest of the traffic follows. This decreases the blast radius and impact of any breaking change. Enabling blue/green for stateless services might be trivial, but doing the same for stateful services (such as data stores) requires additional work, such as coordinating state changes and preserving the ability to roll back.
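
One way to picture canary routing is the sketch below, which assigns each customer to a stable bucket and sends the lowest buckets to the new version. Hashing the customer ID so that routing is sticky is an assumption of this example, not something the post specifies; the `route` function and the customer IDs are hypothetical.

```python
import hashlib


def route(customer_id: str, canary_percent: int) -> str:
    """Return which deployment serves this customer: "canary" or "stable"."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "stable"


customers = [f"customer-{n}" for n in range(1000)]

# Widen the rollout in steps; each step is observed before moving to the next.
at_5_percent = {c for c in customers if route(c, 5) == "canary"}
at_25_percent = {c for c in customers if route(c, 25) == "canary"}
```

With this scheme, raising the percentage only adds customers to the canary group, so nobody bounces between versions as the rollout widens, and setting the percentage back to zero returns all traffic to the stable version.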

Capacity Awareness: Planning for sufficient capacity when using strategies such as blue/green deployment is important. It may not seem like a problem when you’re doing this for a single service, but capacity planning is required when you’re doing it at scale across multiple services simultaneously. At Hyperforce scale, efficient capacity planning and reservation are critical for the cost-to-serve and availability of our services.
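
A small worked example shows why this matters at scale. It assumes, for simplicity, that a blue/green deployment temporarily runs a full-size second copy of the service; the service names and instance counts are invented for the illustration.

```python
def peak_instances(steady_state: dict, deploying: set) -> int:
    """Instances needed while the given services run blue and green side by side."""
    return sum(
        instances * (2 if service in deploying else 1)
        for service, instances in steady_state.items()
    )


# Hypothetical steady-state fleet sizes.
fleet = {"api": 40, "search": 25, "auth": 10}

idle = peak_instances(fleet, set())             # 75
one_service = peak_instances(fleet, {"api"})    # 115
everything = peak_instances(fleet, set(fleet))  # 150
```

Deploying one service at a time raises the peak modestly, while deploying every service at once doubles it, which is why capacity has to be planned and reserved across services rather than per service.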

Feature Flags: Enabling teams to release changes that are conditionally made available to customers provides another level of safety. This allows teams to release changes early without impacting customers. One might argue that feature flags are an “anti-pattern” for immutability, because they change behavior without a new deployment. However, when feature flags are documented, tested in all environments, and subject to proper change control processes, they prove to be very useful in releasing changes frequently (enabling early feedback) and safely, minimizing customer impact.
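
A minimal sketch of a feature flag follows, assuming the flag values are declared in code per environment so that they go through the same review and change control as any other change. The flag name, the environment names, and the `is_enabled` and `checkout` functions are hypothetical.

```python
# Defaults document every flag that exists; unknown flags are rejected.
FLAG_DEFAULTS = {"new_checkout_flow": False}

# Per-environment overrides, declared in code and reviewed like any change.
FLAG_OVERRIDES = {
    "test": {"new_checkout_flow": True},
    "staging": {"new_checkout_flow": True},
    "production": {},  # shipped dark: the code is deployed but not yet active
}


def is_enabled(flag: str, environment: str) -> bool:
    """Look up a flag, falling back to its documented default."""
    if flag not in FLAG_DEFAULTS:
        raise KeyError(f"undocumented feature flag: {flag}")
    return FLAG_OVERRIDES.get(environment, {}).get(flag, FLAG_DEFAULTS[flag])


def checkout(environment: str) -> str:
    """The same deployed code takes a different path depending on the flag."""
    if is_enabled("new_checkout_flow", environment):
        return "new flow"
    return "current flow"
```

Here the new code path ships to production in the same immutable unit as everything else, and turning it on for customers becomes a separate, reviewable change to the flag declaration.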

As you can see, the immutable infrastructure concept involves an immense amount of work. Still, it has greatly improved our ability to deliver secure, highly available software for our customers.

In the next part, we will discuss how Infrastructure-as-Code enables changes to our infrastructure to follow the same lifecycle as any other part of our software system — validation, peer review, automated testing, staging, and gradual rollout. Stay tuned.