Keeping a Mission-Critical Platform Running: How Automation Helped MCAE Engineers Reduce Complexity and Incidents

Charlie Curtis
Feb 26 - 5 min read

In our “Engineering Energizers” Q&A series, we highlight engineering leaders who tackle complex infrastructure challenges to build scalable, high-performance systems. Today, we feature Charlie Curtis, whose engineering team ensures that Marketing Cloud Account Engagement (MCAE) — a mission-critical marketing automation platform — remains resilient and efficient. As part of the Salesforce Marketing Cloud, MCAE demands a robust infrastructure to deliver seamless performance at scale.

Discover how Charlie’s team overcame challenges by streamlining MCAE’s architecture through a balanced approach to microservices and monolithic design, optimizing large-scale data processing for low latency, and leveraging automation to minimize incidents and operational overhead.

What is your team’s mission?

The mission is to ensure MCAE remains a highly scalable, reliable, and low-maintenance marketing automation platform for customers. As a critical system powering enterprise marketing workflows, the focus is on maintaining uptime, optimizing performance, and proactively addressing issues before they affect users.

To achieve this, incident prevention and architectural efficiency are prioritized. This involves reducing unnecessary complexity, eliminating bottlenecks, and leveraging automation to keep the system running smoothly. Instead of reacting to failures, the infrastructure is designed to self-heal and auto-scale.

A key aspect of the mission is identifying and resolving long-standing technical debt, which can lead to unnecessary maintenance and reliability issues. By prioritizing structural improvements over short-term fixes, MCAE scales efficiently. The goal is to create an infrastructure that allows engineers to focus on meaningful innovations rather than constantly firefighting system failures.

What led to the decision to migrate certain microservices back to a monolithic architecture, and how did this impact scalability and performance?

At one point, parts of MCAE were designed as microservices to enable independent scaling and modular development. However, for the specific use case, the microservices approach introduced unnecessary complexity without delivering proportional benefits. One of the main challenges was managing service dependencies and cross-service communication overhead. Microservices required additional infrastructure, deployment coordination, and operational oversight, making it difficult to maintain system stability. Debugging distributed issues also became time-consuming, as failures could cascade across multiple services.

By migrating key services back into the monolith, inter-service communication latency, infrastructure overhead, and deployment friction were reduced. Today, 90-95% of MCAE’s business logic operates within the monolith, while only 5-10% of features remain as microservices for specific cases where operational overhead is low. This transition resulted in a more streamlined, scalable architecture, significantly improving performance, maintainability, and incident resolution speed. Instead of managing dozens of independent services, the team now operates a more cohesive and efficient codebase, enabling faster iterations and more predictable behavior under load.

As part of the migration back into the monolith, several services significantly reduced their read operations per second by leveraging improved data caching and query batching.
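
As a rough illustration of that query batching, the sketch below collapses many per-record lookups into a handful of bulk reads. The table, column names, and placeholder style are assumptions for the example, not MCAE’s actual schema.

```python
# Hypothetical illustration of query batching: rather than one SELECT per
# prospect ID, IDs are grouped and fetched with a single IN (...) query per
# batch. Table/column names and the %s placeholder style are examples only.
from typing import Iterable, Iterator

def fetch_prospects_batched(conn, prospect_ids: Iterable[int],
                            batch_size: int = 500) -> Iterator[tuple]:
    """Yield prospect rows using one query per batch instead of one per ID."""
    ids = list(prospect_ids)
    for start in range(0, len(ids), batch_size):
        chunk = ids[start:start + batch_size]
        placeholders = ", ".join(["%s"] * len(chunk))
        cursor = conn.cursor()
        cursor.execute(
            f"SELECT id, email, score FROM prospects WHERE id IN ({placeholders})",
            chunk,
        )
        yield from cursor.fetchall()
```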

How does MCAE process large-scale data efficiently while maintaining low latency?

MCAE handles massive datasets across thousands of customer accounts, requiring high-throughput data processing while ensuring sub-second response times for key operations. To achieve this, several scalability optimizations are implemented at both the infrastructure and application levels.

First, database read replicas distribute query loads. Instead of overwhelming a single database instance, queries are balanced across multiple read-optimized nodes, reducing bottlenecks.
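
A minimal sketch of that read/write split, assuming a primary connection plus a pool of read replicas and a simple round-robin policy (the actual routing in MCAE may differ):

```python
# Illustrative read/write router: writes go to the primary, reads rotate
# across replicas. The round-robin policy and connection objects are
# assumptions, not MCAE's actual topology.
import itertools

class ReplicaRouter:
    def __init__(self, primary, replicas):
        self.primary = primary                       # write connection
        self._replicas = itertools.cycle(replicas)   # round-robin over read nodes

    def connection_for(self, sql: str):
        """Return a replica connection for SELECTs, the primary otherwise."""
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._replicas) if is_read else self.primary
```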

Second, batch processing is used for background tasks, grouping operations to minimize real-time compute overhead. This ensures that heavy workloads, such as email automation triggers and segmentation updates, do not degrade system responsiveness.
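
Conceptually, the batching loop looks something like the hypothetical sketch below, where pending work is drained from a queue and handed off in fixed-size chunks; the queue interface and handler are illustrative only.

```python
# Hypothetical batching loop: drain pending background tasks from a queue and
# hand them to a bulk handler in fixed-size chunks, so heavy work runs as a
# few large operations instead of thousands of small ones.
from queue import Queue
from typing import Callable, List

def process_in_batches(task_queue: Queue, handle_batch: Callable[[List], None],
                       batch_size: int = 1000) -> None:
    batch = []
    while not task_queue.empty():
        batch.append(task_queue.get())
        if len(batch) >= batch_size:
            handle_batch(batch)   # e.g. one bulk database write per chunk
            batch = []
    if batch:
        handle_batch(batch)       # flush the final partial batch
```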

Lastly, distributed caching with Redis reduces redundant queries by serving frequently accessed data from memory. This significantly decreases database contention, improving response times for high-traffic endpoints.

These optimizations enable MCAE to scale dynamically without performance degradation, ensuring customers experience a fast and responsive platform, even under heavy workloads.
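
The Redis caching described above follows a standard cache-aside pattern. Here is a minimal sketch using redis-py, with an invented key scheme, a five-minute TTL, and a load_from_db callback as assumptions:

```python
# Cache-aside sketch with redis-py: serve hot reads from Redis, fall back to
# the database on a miss, then warm the cache. Key scheme, TTL, and the
# load_from_db callback are assumptions for illustration.
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)

def get_account_settings(account_id: int, load_from_db) -> dict:
    key = f"account:{account_id}:settings"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                  # hit: no database round trip
    settings = load_from_db(account_id)            # miss: read from the database
    cache.setex(key, 300, json.dumps(settings))    # keep warm for five minutes
    return settings
```

A hit avoids the database entirely, while a miss pays one read and warms the cache for subsequent requests.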

What were the toughest reliability challenges in ensuring MCAE remains highly available and performant at scale?

Ensuring high availability in a platform as complex as MCAE requires addressing multiple points of failure across database constraints, infrastructure resilience, and real-time workload balancing. Meeting these objectives, especially under peak load, has been a welcome challenge. To tackle this, a proactive approach to failure detection and prevention was adopted, identifying recurring failure patterns and implementing self-healing mechanisms.

A critical change involved reworking how jobs are queued and distributed. Instead of processing tasks one at a time, a probabilistic worker distribution model was implemented that dynamically allocates compute resources based on queue depth and processing urgency. This prevents task starvation and backlog buildup, especially during high-traffic conditions.
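
One way to picture the probabilistic distribution model is as a weighted random pick over queues, where a queue’s weight grows with its depth and urgency. The queue names and numbers below are invented for illustration, not MCAE’s actual workloads.

```python
# Toy version of probabilistic worker dispatch: a worker picks its next queue
# at random, weighted by depth and urgency, so busy or urgent queues attract
# more workers without starving the rest. Queue names and numbers are invented.
import random
from typing import Dict, Tuple

def pick_queue(queues: Dict[str, Tuple[int, float]]) -> str:
    """queues maps name -> (depth, urgency); return the queue to pull from next."""
    names = list(queues)
    weights = [depth * urgency for depth, urgency in queues.values()]
    if sum(weights) == 0:
        return random.choice(names)   # nothing pending anywhere
    return random.choices(names, weights=weights, k=1)[0]

# The deep bulk-email queue gets most workers, but the small, urgent
# automation queue still draws a meaningful share.
next_queue = pick_queue({"bulk_email": (5000, 1.0), "automations": (200, 5.0)})
```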

Moving away from static scaling thresholds, Kubernetes-based auto-scaling was implemented. This allows the infrastructure to dynamically scale services up or down based on real-time traffic patterns, ensuring system responsiveness even during unexpected surges.
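
The scaling decision itself can be understood through the standard Kubernetes Horizontal Pod Autoscaler calculation, where the desired replica count scales with the ratio of the observed metric to its target. The CPU target and replica bounds below are illustrative, not MCAE’s actual configuration.

```python
# The scaling rule follows the documented Kubernetes HPA calculation:
# desired = ceil(current_replicas * current_metric / target_metric),
# clamped to configured bounds. The CPU target and bounds here are examples.
import math

def desired_replicas(current_replicas: int, current_cpu_pct: float,
                     target_cpu_pct: float = 60.0,
                     min_replicas: int = 2, max_replicas: int = 50) -> int:
    desired = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, desired))

# 10 pods running at 90% CPU against a 60% target -> scale out to 15 pods.
print(desired_replicas(10, 90.0))
```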

By focusing on automated failure recovery, dynamic workload balancing, and predictive monitoring, we are able to provide a high quality of service for all customers while maintaining high system stability and uptime.

What strategies did you implement to reduce incident frequency while maintaining system performance under heavy loads?

Ensuring system stability while minimizing downtime and incident alerts has been a top priority. To achieve this, focus has been placed on reducing noise in monitoring alerts, improving failure detection, and preemptively addressing known pain points.

A major shift involved handling real-time performance degradation. Instead of passively responding to slowdowns, automated recovery workflows were implemented to detect latency spikes and take preemptive action, such as scaling services or redirecting traffic, to prevent user impact.
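
A simplified version of such a recovery workflow might watch a rolling latency window and fire a scale-out hook when the p95 drifts well above its baseline. The window size, spike ratio, and scale_out callback here are assumptions for the sketch.

```python
# Simplified latency watchdog: track a rolling window of request latencies and
# trigger a recovery action when the window's p95 drifts well above baseline.
# Window size, spike ratio, and the scale_out callback are assumptions.
from collections import deque
from statistics import quantiles
from typing import Callable

class LatencyWatchdog:
    def __init__(self, scale_out: Callable[[], None], baseline_p95_ms: float,
                 window: int = 300, spike_ratio: float = 2.0):
        self.samples = deque(maxlen=window)     # most recent latencies (ms)
        self.scale_out = scale_out              # preemptive action, e.g. add capacity
        self.baseline_p95_ms = baseline_p95_ms
        self.spike_ratio = spike_ratio

    def record(self, latency_ms: float) -> None:
        self.samples.append(latency_ms)
        if len(self.samples) < 50:
            return                               # not enough data to judge yet
        p95 = quantiles(self.samples, n=20)[-1]  # 95th percentile of the window
        if p95 > self.spike_ratio * self.baseline_p95_ms:
            self.scale_out()                     # act before users feel the slowdown
```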

Known system issues were re-prioritized based on historical failure rates. By systematically addressing the top root causes of incidents, the volume of system alerts was significantly reduced, leading to smoother operations.

What are the key lessons your team learned from building automation that reduced operational overhead and improved system resilience?

The most significant lesson learned is that automation is most effective when it eliminates entire categories of recurring issues, rather than just speeding up response times. By focusing on proactive solutions — such as self-healing systems, anomaly detection, and infrastructure auto-tuning — the need for human intervention has been significantly reduced, resulting in a more predictable and stable platform.

Another key insight is that simpler architectures often scale better. While microservices offer flexibility, consolidating services into a monolith has improved efficiency, maintainability, and the speed of incident resolution.

Ultimately, the best strategy for scaling and resilience is continuous improvement. Every engineering decision should aim to reduce long-term operational friction and enhance system performance.
