Written by Anish Bhatt and Scott Nyberg.
When a legacy operating system (OS) approaches its end-of-support date, some organizations will upgrade their OS as fast as possible. Others may kick the can down the road, delaying any headaches they might encounter during the upgrade process.
Six years ago, Salesforce Engineering put the pedal to the metal, migrating to CentOS 7, an open-source operating system based on the Red Hat Enterprise Linux (RHEL) source code. Back then, the team’s biggest challenge for upgrading was the tedious and time-consuming manual labor involved, ranging from determining the health of machines to scheduling their OS upgrade.
As CentOS 7 races to the end of its operational runway, Salesforce Engineering tackles its new OS upgrade task head-on. This time around, the team faces a much bigger hurdle: migrating 200,000 machines from the current OS to RHEL 9, an advanced OS that delivers enhanced performance, boosts security, and drives integration of next-generation hardware.
Due to the sheer number of systems the team must migrate, a manual conversion is not tenable. Instead, the team will use automation, enabling them to eliminate downtime, ensure machine health, improve visibility, and power parallelization so that machines are ported to RHEL 9 faster and more reliably than ever.
Tyson Lutz, Sr. Vice President of Software Engineering (foreground), leads the RHEL 9 implementation team (background).
How does migrating to RHEL 9 deliver an improved OS capability?
From integrating cutting-edge processors to stopping bugs in their tracks to boosting security, upgrading Salesforce’s OS to RHEL 9 provides a durable enterprise-grade OS platform and unlocks many concrete benefits for Salesforce Engineering and our customers.
Enables essential technology. Salesforce engineers require the latest hardware to harness new software innovations for our customers. CentOS 7 cannot support highly advanced processors. RHEL 9, however, has first-class support for next-generation ARM-based architectures, delivering 20-30% in savings while providing the same level of performance.
Provides for every use case. Some Salesforce customers have highly specific workloads that require significant computing power. Other customers run workloads that require less processing power. RHEL 9 now backs both use cases, enabling customers to select the environment that best fits their needs.
Finds and fixes bugs faster. Historically, the team may have spent a week working to determine the root cause of a unique problem. Because such a bug does not reappear, the team cannot reuse what it learned while fixing it. Moving to RHEL 9 provides a new level of customer support, whereby Red Hat engineers can help Salesforce Engineering pinpoint issues in mere minutes, enabling rapid fixes.
Improves security posture. Outdated technology may compromise cybersecurity, potentially leading to ransomware or damaging malware attacks that require costly rebuilds. RHEL 9 takes security to the next level for Salesforce, leveraging technology that governments around the world use to ensure heightened levels of security and satisfy stringent security compliance requirements.
Under the hood: How does automation help drive the OS migration?
As they plan their transition to RHEL 9, the conversion team uses four key automation-driven tools:
- The first is their conversion playbook, which defines an automated schedule, detailing when machines should be converted.
- Next, the team’s graph database — a fleetwide management and control system — kicks off the migration process.
- Together, the conversion playbook and graph database inform the conversion orchestrator system, which determines which machines should move over and when. The orchestrator then scales the migration across Salesforce’s 200,000 systems at a measured rate. Because conversions are staggered at about 5,000 machines daily, the vast majority of machines remain active, ensuring a seamless and transparent experience for Salesforce customers.
- After each batch is converted, an automated configuration management system ensures that the machines satisfy their normal specifications on the new OS and performs upgrades as needed.
During off-peak system usage hours, automation kicks into high gear, rapidly converting systems to RHEL 9 until all hosts are migrated.
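To make the staggered rate concrete, here is a minimal sketch of how a fleet could be split into daily batches. It is not Salesforce's orchestrator: the `batches` helper, the host names, and the constants are hypothetical, chosen only to mirror the figures above (200,000 machines at about 5,000 per day).

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Figures taken from the article; everything else here is illustrative.
FLEET_SIZE = 200_000
DAILY_BATCH = 5_000


def batches(hosts: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `batch_size` hosts."""
    it = iter(hosts)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical host names standing in for the real fleet inventory.
fleet = (f"host-{i:06d}" for i in range(FLEET_SIZE))

# One batch per day: each entry is the set of hosts converted that day.
schedule = list(batches(fleet, DAILY_BATCH))
num_days = len(schedule)

print(f"{num_days} daily batches of up to {DAILY_BATCH} hosts")
```

At this rate the whole fleet works out to 40 daily batches, and on any given day only 2.5% of machines are being converted while the rest stay in service.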
What are automation’s biggest benefits?
In 2017, Salesforce Engineering migrated to CentOS manually, a challenging experience that required the conversion team to navigate numerous productivity roadblocks. Automation’s many benefits alleviate those issues, enabling the team to pave a much smoother path for a RHEL 9 migration ahead of CentOS 7’s end-of-support date, with time to spare.
- Eliminates downtime. During the previous OS conversion, teams of engineers needed to coordinate efforts across time zones to manually solve machine issues. Automation eliminates that productivity lag — delivering an always-on capability that instantaneously remediates issues — so machines can smoothly onboard to RHEL 9.
- Ensures system-wide health. During the manual conversion to CentOS 7, the team required significant communication and hands-on coordination to define the schedule for machine migration and move the hardware over. Any missteps could have disrupted machine health — potentially impacting Salesforce customers’ productivity.
Now, after confirming the machines’ health and readiness, the automation system converts them to the new OS on a scheduled basis. If a machine needs a software update, the system applies the fix itself; only a physical anomaly requires a human technician. Once repaired, the machine automatically migrates to the new OS, and the process repeats with the next 5,000 machines until all 200,000 machines successfully convert to RHEL 9.
- Enables visibility. Manual conversions have historically introduced visibility challenges, where the team lacked insight into the size of the machine fleet, which machines required migration, and whether machines were operational. Following the CentOS conversion, the team scrambled to find and fix machines to avoid any outages.
Now, using automation, real-time monitoring, and health metrics, the team has complete visibility into the machine fleet, from its size to its health to which machines may be converted.
- Powers parallelization. During a manual migration, a human technician can only perform one task at a time, such as repairing a system or migrating it to RHEL 9.
Conversely, automation allows engineers to “set and forget” the system, which then runs OS migration tasks in parallel at a scale that human engineers cannot match. For example, the system may be tasked with scheduling 50 machines for migration to RHEL 9. After scanning them, it could learn that 25 require repair. As it performs the fixes, it simultaneously converts the remaining 25 machines.
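The 50-machine example above can be sketched with a thread pool. This is an illustration under stated assumptions, not the team's actual tooling: the `Machine` class and the `repair`, `convert`, and `migrate` functions are hypothetical stand-ins, and hardware faults that need a human technician are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass
class Machine:
    """Hypothetical record for one host in the fleet."""
    name: str
    healthy: bool
    os: str = "centos7"


def repair(machine: Machine) -> Machine:
    # Placeholder for an automated software fix.
    machine.healthy = True
    return machine


def convert(machine: Machine) -> Machine:
    # Placeholder for the OS conversion; only healthy machines are converted.
    if not machine.healthy:
        raise ValueError(f"{machine.name} is not healthy")
    machine.os = "rhel9"
    return machine


def repair_then_convert(machine: Machine) -> Machine:
    return convert(repair(machine))


def migrate(machines: List[Machine], max_workers: int = 8) -> List[Machine]:
    """Convert healthy machines while unhealthy ones are repaired concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(convert if m.healthy else repair_then_convert, m)
            for m in machines
        ]
        # result() re-raises any exception from a worker task.
        return [f.result() for f in futures]


# 50 scheduled machines, half of which fail the health scan (assumed data).
machines = [Machine(f"host-{i:02d}", healthy=(i % 2 == 0)) for i in range(50)]
needs_repair = sum(not m.healthy for m in machines)
migrated = migrate(machines)

print(f"{needs_repair} repaired, {len(migrated)} converted")
```

Healthy machines go straight to conversion, so they never wait on the repairs running alongside them. This is the property the manual, one-task-at-a-time process lacks.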