
By Dheeraj Bansal and Ankit Jain.
In our “Engineering Energizers” Q&A series, we spotlight engineering leaders who tackle high-stakes challenges with precision and innovation. Today, we feature Dheeraj Bansal, a Principal Member of the Technical Staff at Salesforce. Dheeraj played a crucial role in a large-scale Kafka cluster migration designed to modernize and streamline critical technology stacks and infrastructure across the company.
Learn how his team successfully migrated production systems handling over 1 million messages per second, resolved OS and Kafka version compatibility issues, and ensured data integrity through checksum validation and rack-aware replication—all without any downtime.
What was your team’s mission for this Kafka migration and OS upgrade, and why was this project so critical for Salesforce?
The team’s mission was to migrate and unify Marketing Cloud’s Kafka technology stack and infrastructure with Salesforce’s centralized Ajna Kafka, creating a seamless, standardized system across the organization. Additionally, the team needed to upgrade the underlying OS from CentOS 7 to RHEL 9, a crucial step as CentOS 7 was reaching end of life. By combining these initiatives, the team aimed to minimize engineering overhead and prevent redundant efforts across two infrastructure layers.
Why was this critical? At Salesforce, the vision is to achieve a “Salesforce One” architecture — unified, with minimal tech debt, shared infrastructure, and standardized toolchains. This unification allows for the seamless reuse of Kafka feature sets across different clouds, streamlines operations, and accelerates delivery across engineering teams. The project also addresses organizational sprawl by eliminating siloed infrastructure, which previously required separate dedicated teams for management and maintenance. With the unified Kafka and control planes, the company has achieved operational autonomy, enhanced compliance, and long-term maintainability.
Ultimately, this wasn’t just a migration. This was a strategic investment in scale, performance, and long-term velocity, ensuring Salesforce remains at the forefront of innovation and efficiency.
What were the biggest technical hurdles in unifying Marketing Cloud’s Kafka with Salesforce’s Ajna Kafka?
The primary challenge was executing a live migration on production systems handling 1 million messages per second, spanning 12 Kafka clusters with over 700 Kafka nodes and 60+ Zookeeper nodes. These clusters processed approximately 15 terabytes of data daily, and the team had zero tolerance for downtime, data loss, or customer-facing errors.
The Marketing Cloud Kafka streaming solution plays a crucial role in handling real-time data streams, which is essential for the smooth operation of Marketing Cloud features such as Journey Builder and the Event Notification Service.
Journey Builder relies on real-time data to personalize customer journeys and trigger actions based on user activities such as ad clicks, email opens, app downloads, and cart abandonment. Similarly, the Event Notification Service needs to process and respond to events instantly to provide timely notifications and updates.
To support these functionalities, the Kafka streaming solution must be highly available, ensuring it can handle varying data volumes without interruption. Zero downtime, no data loss, and durability are critical to maintaining the consistency and reliability of customer interactions and notifications, which in turn keeps marketing campaigns and customer engagement efforts seamless.
Migrating live clusters required operating in mixed-mode for extended periods, with some nodes running the old CentOS 7 + Marketing Cloud Kafka stack and others on RHEL 9 + Ajna Kafka. These environments were not designed to coexist, leading to compatibility issues across OS versions, Kafka versions, authentication mechanisms, and supporting services like control planes.
To overcome these challenges, the team conducted months of comprehensive proof-of-concept (POC) testing, simulating failure scenarios, validating inter-node communication, and building a staggered orchestration pipeline. This resulted in a phased rollout strategy, enabling the team to upgrade multiple nodes at a time, validate behavior, and ensure the cluster remained healthy throughout the process. This staged migration allowed for confident scaling without compromising performance or stability.
How did you ensure zero downtime while migrating more than 760 physical nodes—during peak traffic of 1 million messages per second?
Ensuring zero downtime required a relentless focus on data integrity, orchestration resilience, and preemptive validation. The team developed a fully automated orchestrator pipeline to handle both the OS upgrade and Kafka migration simultaneously. This pipeline included pre-, post-, and in-flight validation checks to monitor everything from disk state to application health.
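The orchestrator itself is internal Salesforce tooling, but the staged, gated pattern described here can be sketched in a few lines of Python. Every helper function, hostname, and batch size below is a hypothetical placeholder, not the team's actual implementation:

```python
"""A minimal sketch of a staged, gated rollout loop. All helper functions,
hostnames, and batch sizes are hypothetical placeholders; the real
orchestrator pipeline is internal Salesforce tooling."""

import time


def pre_checks(node: str) -> bool:
    # Placeholder for pre-flight gates: disk health, replication lag, config audit.
    return True


def migrate_node(node: str) -> None:
    # Placeholder for the actual work: drain the broker, reimage to RHEL 9,
    # install the new Kafka stack, and rejoin the cluster.
    print(f"migrating {node}")


def post_checks(node: str) -> bool:
    # Placeholder for post-migration gates: broker re-registered, partitions
    # back in the ISR, data checksums matching the pre-migration baseline.
    return True


def cluster_healthy() -> bool:
    # Placeholder for in-flight checks: no under-replicated or offline partitions.
    return True


def staged_rollout(nodes: list[str], batch_size: int = 3) -> None:
    """Upgrade a few nodes at a time, validating before moving on."""
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        if not all(pre_checks(n) for n in batch):
            raise RuntimeError("pre-flight validation failed; halting rollout")
        for node in batch:
            migrate_node(node)
            if not post_checks(node):
                raise RuntimeError(f"post-migration validation failed on {node}")
        # Let the cluster settle and confirm health before touching the next batch.
        while not cluster_healthy():
            time.sleep(30)


if __name__ == "__main__":
    staged_rollout([f"kafka-node-{i:03d}" for i in range(1, 7)], batch_size=2)
```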
Disk-level checksum validation was used before and after each migration step to ensure no data corruption occurred. Synthetic data was generated and validated post-migration, and each node’s data signature was matched against the pre-migration baseline. This approach provided confidence that no hidden data degradation was taking place.
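The exact validation tooling isn't described here, but checksum validation of on-disk Kafka data along these lines can be sketched with standard-library hashing. The data directory, baseline file format, and file selection below are illustrative assumptions:

```python
"""Illustrative sketch of checksum validation for Kafka log segments.
The data directory, baseline format, and file selection are assumptions that
stand in for whatever the migration tooling actually recorded. The sketch
assumes it runs while the node's broker is stopped (or only against sealed
segments), since active segments keep changing under live traffic."""

import hashlib
import json
from pathlib import Path


def checksum_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large segments never load into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(data_dir: Path) -> dict[str, str]:
    """Record a checksum for every Kafka log segment under the data directory."""
    return {
        str(p.relative_to(data_dir)): checksum_file(p)
        for p in sorted(data_dir.rglob("*.log"))
    }


def compare(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return segments whose content signature changed or disappeared."""
    return [name for name, digest in before.items() if after.get(name) != digest]


if __name__ == "__main__":
    data_dir = Path("/var/kafka/data")       # illustrative path
    baseline_file = Path("baseline.json")    # written before the migration step

    if baseline_file.exists():
        before = json.loads(baseline_file.read_text())
        mismatched = compare(before, snapshot(data_dir))
        print("mismatched segments:", mismatched or "none")
    else:
        baseline_file.write_text(json.dumps(snapshot(data_dir), indent=2))
```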
Monitoring was crucial. During the rollout, the team maintained 24/7 eyes-on-glass monitoring through 15 custom dashboards, each designed to provide a comprehensive 360-degree view from cluster-level down to individual node metrics. Automated alerts and real-time health checks were also built to flag even the smallest anomalies before they could impact traffic or trigger a failure cascade.

High-level workflow for orchestrating safe and seamless Kafka cluster upgrades.
How did your team manage the risk of data loss or message duplication during the transition?
We managed the risk with a focus on data resilience from the start. Kafka clusters were configured with a replication factor of three, ensuring that even if two nodes failed, the third copy of the data remained intact and recoverable. This setup guaranteed fault tolerance even under partial node loss.
To further mitigate data loss, clusters were designed for uniform data distribution, avoiding storage hotspots or single-node pressure points. This meant that if a node failed or was mid-migration, no disproportionate volume of data would be at risk.
Rack-aware placement was enforced, ensuring that replicas of any data partition were physically located on separate racks. This meant that even in the case of an entire rack failure — a common hardware scenario in large data centers — data remained protected.
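For context, this is roughly how a replication factor of three and rack-aware placement are expressed in open-source Kafka. It is a generic illustration using the confluent-kafka Python client, not the team's actual configuration; broker addresses, topic names, and partition counts are made up:

```python
"""Illustrative Kafka settings for the durability guarantees described above,
shown with the confluent-kafka client. Broker addresses, topic names, and
partition counts are invented; this is standard open-source Kafka
configuration, not Salesforce's internal setup."""

from confluent_kafka.admin import AdminClient, NewTopic

# Rack awareness is a per-broker setting in server.properties, e.g.:
#   broker.rack=rack-a
# With it set, Kafka spreads the three replicas of each partition across
# different racks, so losing an entire rack costs at most one replica.

admin = AdminClient({"bootstrap.servers": "kafka-node-001:9092"})  # illustrative

topic = NewTopic(
    "journey-events",               # illustrative topic name
    num_partitions=48,
    replication_factor=3,           # data survives the loss of two replicas
    config={
        # Require two in-sync replicas before acknowledging acks=all writes,
        # so an acknowledged message is never lost while a single broker is
        # down or mid-migration.
        "min.insync.replicas": "2",
    },
)

for name, future in admin.create_topics([topic]).items():
    future.result()                 # raises if topic creation failed
    print(f"created {name}")
```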
Finally, checksum-based verification was used to validate that no data was lost or duplicated during the process. This combination of redundancy, distribution, and verification ensured that message delivery remained accurate, complete, and uncompromised.
What monitoring and troubleshooting strategies did you implement to maintain visibility during the migration?
Maintaining visibility during the migration came down to a robust monitoring and troubleshooting strategy centered on proactive observability and deep granularity. Fifteen dedicated dashboards monitored cluster-level health, node-level metrics, and rack-specific issues, with the ability to drill down into individual machines, identify anomalies in real time, and correlate issues across the entire stack.
Each alert was linked to a real-world failure mode that had been modeled during testing, including disk failures, network blips, version mismatches, and cluster imbalances. Alerts for unexpected behaviors, such as unbalanced partition leaders or spontaneous data movement across nodes, helped catch misconfigurations early.
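The team's dashboards and alerting pipeline are internal, but a check for one of the behaviors mentioned above, unbalanced partition leaders, can be sketched against Kafka's metadata API. The broker address and the imbalance threshold here are assumptions:

```python
"""Illustrative leader-balance check of the kind an alert could be built on.
The broker address and the 20% imbalance tolerance are assumptions; the
team's actual dashboards and alerting pipeline are internal tooling."""

from collections import Counter

from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "kafka-node-001:9092"})  # illustrative
metadata = admin.list_topics(timeout=10)

# Count how many partition leaders each broker currently holds.
leaders_per_broker = Counter()
for topic in metadata.topics.values():
    for partition in topic.partitions.values():
        if partition.leader >= 0:          # skip partitions with no current leader
            leaders_per_broker[partition.leader] += 1

total = sum(leaders_per_broker.values())
expected = total / max(len(metadata.brokers), 1)

# Flag brokers holding noticeably more or fewer leaders than an even spread;
# the 20% tolerance is an arbitrary illustrative threshold.
for broker_id, count in sorted(leaders_per_broker.items()):
    if abs(count - expected) > 0.2 * expected:
        print(f"ALERT: broker {broker_id} leads {count} partitions "
              f"(expected ~{expected:.0f})")
```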
Comprehensive pre-flight validation and post-migration checks were also implemented. These included configuration audits, health probes, and telemetry validation, ensuring that no regression went undetected.
Thanks to this system, no unexpected failures occurred in production. All issues had been identified and addressed during the QA and POC phases. The one minor hiccup—a flipped configuration causing internal data movement—was detected early through the dashboards and resolved before impacting performance.
What security challenges came up when consolidating Kafka clusters, and how did you address them?
During the consolidation of Kafka clusters, we faced several security challenges, primarily related to the differences in authentication mechanisms between the legacy Marketing Cloud Kafka setup and Salesforce’s Ajna Kafka stack. Ajna Kafka employs a unique authentication method that required specific certificates, which were not available in our existing environment.
This discrepancy affected inter-node communication, client access, and control plane integration, all of which depend on robust authentication protocols. To mitigate these risks, we identified the issues early during the proof-of-concept phase. We conducted a detailed analysis of the Ajna Kafka authentication flow and replicated it in a controlled test environment.
Once we validated the solutions in the testbed, we aligned the upgraded Kafka infrastructure to an authentication standard that was supported by, and compatible with, the Marketing Cloud first-party ecosystem. This alignment ensured seamless communication across the mixed-version clusters and maintained the necessary security boundaries throughout the migration.
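The specifics of Ajna Kafka's authentication are internal, but as a generic point of reference, certificate-based (mTLS) client authentication in open-source Kafka is configured roughly as follows. The certificate paths, listener port, topic name, and the mapping onto plain mTLS are all assumptions:

```python
"""Generic illustration of certificate-based (mTLS) client authentication in
Kafka, using confluent-kafka. The certificate paths and topic are placeholders,
and the assumption that the scheme maps onto plain mTLS is ours; the actual
mechanism and certificate provisioning are internal to Salesforce."""

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-node-001:9093",               # illustrative TLS listener
    "security.protocol": "SSL",
    "ssl.ca.location": "/etc/kafka/certs/ca.pem",              # trusted CA bundle
    "ssl.certificate.location": "/etc/kafka/certs/client.pem", # client identity
    "ssl.key.location": "/etc/kafka/certs/client.key",
    # When both sides present certificates signed by a CA the other trusts,
    # clients and brokers on either side of a mixed-mode cluster can keep
    # talking to each other throughout the migration.
})

producer.produce("journey-events", value=b"hello")  # illustrative topic
producer.flush()
```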
The key takeaway from this experience is the importance of conducting a deep dive early in the process, fully aligning authentication protocols, and never assuming that authentication will work seamlessly across different systems.
Learn more
- Stay connected — join our Talent Community!
- Check out our Technology and Product teams to learn how you can get involved.