Apache Kafka is becoming increasingly popular at Salesforce, with data volumes growing exponentially in recent years. To service this rapid growth we now operate large Kafka clusters in numerous data-centers around the globe. We also replicate data from every production cluster, across the internet, to a few centralized aggregate clusters. Reliable cross-datacenter replication between Kafka clusters, at our scale, has proven to be a challenging problem. Until recently we relied on Apache Mirror Maker for this, but as the volume and variety of the data we handled increased, it became clear we needed a new solution to maintain our high standards for service reliability. We decided to develop a new replication tool better suited to the dynamic, multi-cluster environment at Salesforce.
In this post we introduce Mirus, a tool we developed to make it easier to replicate data between multiple Kafka clusters. The word mirus is the Latin root of mirror, and means wonderful, amazing, or extraordinary. Mirus uses the Apache Kafka Connect framework, and is freely available as an open-source project at https://github.com/salesforce/mirus.
Why not Mirror Maker?
Mirror Maker is a useful tool, well suited to one-to-one replication between pairs of Kafka clusters. Mirror Maker is simple by design: each instance joins a consumer group and runs a single Kafka Consumer and Producer pair to replicate data between individual clusters. It can be configured to operate very efficiently, and generally works well when replicating from one cluster to another, particularly when topics are not being created or deleted frequently. At Salesforce, however, we operate in a changing, multi-cluster environment, and we found this pushes the limits of what Mirror Maker can handle in several ways.
First, Mirror Maker is a statically configured application, meaning even a small configuration change requires all Mirror Maker processes to be restarted. This is true even for minor changes, like updating the list of topics to mirror. For us, applying a small change across all data-centers was a heavyweight operation that required unnecessary downtime while Mirror Maker processes were restarted. Since we are moving to an API-driven, self-service model for our customers, we increasingly need lightweight, dynamic, API-driven configuration, which Mirror Maker’s static model cannot support.
Mirror Maker is also unable to handle multiple clusters efficiently. Each Mirror Maker process is intended to handle one-way replication from a single source cluster to a single destination cluster. Separate Mirror Maker processes, and consumer groups, are required for each source and destination cluster pair. In a multi-cluster environment, the need for additional Mirror Maker processes is inefficient and quickly becomes difficult to manage.
At Salesforce we operate in a “push” configuration for security reasons: we co-locate replication with the source Kafka cluster and push data over the internet to the destination. To achieve high throughput when pushing over the internet multiple producer instances are required (see our blog post Mirror Maker Performance Tuning), but Mirror Maker’s design means each process is limited to hosting a single Kafka Producer instance. This necessitated running multiple Mirror Maker processes on each host, even for one-to-one replication between individual clusters, further complicating the management of our clusters. Note that we recommend using a “pull” configuration wherever possible. Co-locating replication workers with the destination cluster has several benefits, including reducing the risk of duplicates, limiting round-trip time on producer batch acknowledgements, and reducing the need for large producer buffers.
Finally, we found that Mirror Maker often failed to handle error conditions gracefully. For example, Mirror Maker will throw an unhandled exception when a destination topic becomes unavailable during normal operation. While we were able to work around these issues with a combination of custom patches and careful management, Mirror Maker was not amenable to a clean solution.
To address these issues, we developed Mirus as an extension of Apache Kafka Connect. Kafka Connect has become an industry standard approach for copying data to and from Kafka clusters. Unlike Mirror Maker, Kafka Connect offers dynamic configuration via a REST API, and provides an extensible framework for distributing work across a cluster of instances. Mirus is essentially a custom Kafka Connect “Source Connector” specialized for reading data from multiple Kafka clusters.
- Dynamic API-driven configuration: Mirus uses the Kafka Connect REST API for dynamic configuration
- Precise replication: Update the set of replicated topics at run-time with the REST API. Supports a regex whitelist and an explicit topic whitelist
- Simple management for multiple Kafka clusters: Supports multiple source clusters with one Worker process
- Built for dynamic Kafka clusters: Able to handle topics and partitions being created and deleted in source and destination clusters
- Scalability: Creates a configurable set of worker tasks that are distributed across a Kafka Connect cluster for high performance, even when pushing data over the internet
- Fault tolerance: Includes a monitor thread that looks for task and connector failures and optionally auto-restarts
- Monitoring: Includes custom JMX metrics for production ready monitoring and alerting
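The “precise replication” behavior above can be sketched as follows. This is a simplified illustration of combining a regex whitelist with an explicit topic whitelist; Mirus’s actual matching logic lives in the source connector and may differ in detail:

```python
import re

def select_topics(all_topics, regex_whitelist, explicit_whitelist):
    """Return topics matching any whitelist regex, or listed explicitly.

    A simplified sketch -- not Mirus's actual filtering implementation.
    """
    patterns = [re.compile(p) for p in regex_whitelist]
    # Explicitly whitelisted topics are kept if they exist on the source.
    selected = set(explicit_whitelist) & set(all_topics)
    for topic in all_topics:
        if any(p.fullmatch(topic) for p in patterns):
            selected.add(topic)
    return sorted(selected)

topics = ["app.orders", "app.users", "internal.metrics", "audit.log"]
print(select_topics(topics, [r"app\..*"], ["audit.log"]))
# ['app.orders', 'app.users', 'audit.log']
```

Because the topic set is re-evaluated at run-time, changing either whitelist through the REST API takes effect without restarting the worker process.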
How does Mirus work?
The Mirus source code includes custom SourceConnector and SourceTask implementations. The MirusSourceConnector runs a KafkaMonitor thread, which monitors the partition metadata from the source and destination Kafka clusters. When changes are detected, it applies the topic whitelist, assigns each matching partition to a SourceTask, and provides the Kafka Connect framework with a set of MirusSourceTask configuration objects. Kafka Connect then spins up the appropriate task set across the cluster, and replication begins. Each MirusSourceTask instance runs an independent KafkaConsumer and KafkaProducer client pair, so we naturally support the “push” use case with multiple producer instances running in a single Worker process for improved throughput. To better understand how Mirus distributes work across a cluster of machines, please see the Kafka Connect documentation.
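The partition-to-task assignment step can be sketched with a simple round-robin scheme. This is an illustrative model of what a monitor thread produces, not Mirus’s actual assignment code, which may balance work differently:

```python
def assign_partitions(partitions, max_tasks):
    """Distribute topic-partitions across task configurations round-robin.

    A simplified sketch of a monitor thread's output: each inner list
    stands in for one source-task configuration object.
    """
    num_tasks = min(max_tasks, len(partitions))
    tasks = [[] for _ in range(num_tasks)]
    for i, tp in enumerate(partitions):
        tasks[i % num_tasks].append(tp)
    return tasks

# Five matching partitions spread across two tasks.
partitions = [("A", 0), ("A", 1), ("B", 0), ("B", 1), ("B", 2)]
print(assign_partitions(partitions, 2))
# [[('A', 0), ('B', 0), ('B', 2)], [('A', 1), ('B', 1)]]
```

When the monitor detects new partitions, it simply recomputes this assignment and hands the updated task configurations back to the framework.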
The diagram below shows a single Kafka Connect worker using the MirusSourceConnector to replicate topic “A”, containing two partitions. The source connector is created by submitting a configuration object to the Kafka Connect configuration API. Since the topic currently exists in both the source and the destination cluster, the Kafka Monitor thread submits two Source Task configuration objects to the Kafka Connect Herder: one for each partition. The Worker then starts two independent MirusSourceTask instances and replication begins.
To start replicating data from an additional source cluster, we simply need to submit an additional MirusSourceConnector configuration object to the configuration API.
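Submitting a connector configuration follows the standard Kafka Connect REST convention: POST a JSON object to the worker’s `/connectors` endpoint. The sketch below builds such a payload; the `connector.class` and the Mirus-specific property names shown are illustrative, so check the Mirus README for the exact keys your version expects:

```python
import json

# Hypothetical Mirus source connector configuration. The Mirus-specific
# property names (source cluster, topic regex) are illustrative only.
connector_config = {
    "name": "mirus-source-cluster-b",
    "config": {
        "connector.class": "com.salesforce.mirus.MirusSourceConnector",
        "tasks.max": "8",
        # Source cluster to replicate from (illustrative property name).
        "source.bootstrap.servers": "cluster-b-kafka:9092",
        # Regex whitelist of topics to replicate (illustrative property name).
        "topics.regex": "app\\..*",
    },
}

payload = json.dumps(connector_config, indent=2)
print(payload)

# Submit to a Connect worker, e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        --data "$PAYLOAD" http://localhost:8083/connectors
```

Because the worker persists connector configurations internally, the new source cluster starts replicating without restarting any processes.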
The Mirus README includes Quick Start instructions, and is available here: https://github.com/salesforce/mirus
While Mirus offers a custom entry point for convenience, it can also run within a standard Kafka Connect cluster. It shares all configuration options with Kafka Connect, and adds a few of its own. Since Mirus uses standard Kafka Consumer and Producer objects, the art of tuning Mirus for performance shares a lot with Mirror Maker tuning. For guidance on tuning Mirror Maker or Mirus I recommend reading our blog post: Mirror Maker Performance Tuning — Tuning Kafka for Cross Data Center Replication.
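As a starting point for that tuning, the sketch below collects producer overrides of the kind that matter most on high-latency, cross-datacenter links. The values are illustrative defaults to experiment from, not recommendations from the tuning post, and assume the standard Kafka Connect convention of prefixing producer properties with `producer.` in the worker configuration:

```python
# Hypothetical starting-point producer overrides for a push over a
# high-round-trip-time link; all values are illustrative, not tuned.
producer_overrides = {
    "producer.batch.size": str(512 * 1024),            # larger batches per request
    "producer.linger.ms": "100",                       # wait briefly to fill batches
    "producer.buffer.memory": str(256 * 1024 * 1024),  # absorb in-flight RTT delay
    "producer.compression.type": "lz4",                # fewer bytes on the wire
}

for key, value in sorted(producer_overrides.items()):
    print(f"{key}={value}")
```

As the source paragraph notes, these are the same knobs you would turn for Mirror Maker, since both tools ride on standard Kafka Producer clients.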
Mirus completely replaced Mirror Maker across all production data-centers at Salesforce in April 2018. Since then our data volumes have continued to grow. We plan to continue adding new functionality to Mirus. Potential new features include:
- Dynamic topic creation and resizing of source or destination topics
- Support for compression pass-through, to avoid the cost of expanding and recompressing messages
We are pleased to be able to contribute Mirus to the open-source community, and look forward to collaborating with outside contributors!