Expanding Visibility With Apache Kafka

Expanding Visibility With Apache Kafka featured image

Salesforce’s operations teams have been betting big on pub/sub with Apache Kafka

What would you do if you had terabytes of operational data being generated in production each day, and hundreds of engineering teams wanting to use that data to improve their services … but no way to connect the two?

This is the situation we found ourselves in a few years ago. In the early days of Salesforce, the operational engineering teams had written a fair amount of software that ran in our engineering environments (non-production) and routinely accessed production machines to pull out metrics — that is, non-customer-specific information about the performance of our systems. When any team needed to do this, they’d follow the example of those who had come before, and write some new code that did essentially the same thing, for their own special case.

This worked well enough … at first. But as we grew, the audience for this data expanded to include analysts, capacity planners, performance engineers, feature teams, etc. It became evident that tightly coupling our data producers and consumers wasn’t going to scale, and we’d end up with an extremely complex system to do what was really a fairly simple task.

So, we began a project to move towards a new type of architecture — a “publish-subscribe” pattern. This would allow multiple teams to consume metrics independently, and to evolve over time in an agile way. (If you’ve never heard of “pub/sub” systems, read this 5 minute intro we posted last week.)

The technology we picked to implement this was a relatively new (at the time) Open Source system called Apache Kafka, which had been developed at LinkedIn. Why Kafka? Because it’s architected for exactly this kind of a “publish and subscribe” usage pattern, and it had very promising performance characteristics. It decouples the producers of data (publishers) from the many potential customers (subscribers) and allows them to scale and iterate independently without interruption.

In our system, metrics are first ingested into a local Kafka cluster in the production data center, before being shipped (via a copy process called MirrorMaker) to an aggregation Kafka cluster in the engineering environment. At the time of implementation, the main consumer was Graphite. Since then, the consumer base has expanded to include other destinations, including a scalable time series database called Argus which we created and released as open source.

Spoiler alert: the project was a huge success, and the system (which we code-named “Ajna”, after the Chakra of vision and insight) quickly became an indispensable part of our infrastructure. As we used it more, we found it lent itself very well to other data transport scenarios as well. We’ve since added our full application logs, which means we’re now shipping multiple terabytes of data per day to a central aggregation facility in near-real-time.

A Broader Vision

This started us thinking bigger. There are a vast number of different components that make up Salesforce’s systems, generating massive amounts of data, and we need a robust way to collect, transport, process and transform all of these streams of information. In broad strokes, the architecture we’re evolving towards looks like this:

The key simplifying point in this picture for us is Apache Kafka. It allows us to use a unified, near-real-time transport for a wide variety of data types that we’re ingesting, including system metrics and state information, system logs, network flow data, and application logs.

As we delivered the initial implementations of our streaming systems, we started to realize that this was a concept that extended well beyond operational visibility data. We started conversations with other teams who wanted to use a similar architecture: decoupling the systems that produce and consume event data, where the pub / sub model is a natural fit. And we began to realize that this is actually a platform, rather than a solution to one specific problem.

One example of this, which you’ll be seeing more of soon, is our upcoming Enterprise Messaging feature, which will power newer versions of the Salesforce Streaming API. For this feature, our engineering team wanted to work with the concepts of pub/sub directly. Rather than setting up an entirely new infrastructure that closely resembles Ajna, we have been working with the team to help onboard them on Ajna directly.

Of course, when you shift your thinking from “feature” to “platform”, it suggests all kinds of avenues for expansion. Self-service management, more granular authorization, service protection, etc — all become more important when the set of use cases you are serving expands beyond your own team. We’ve also started to explore the other puzzle pieces we’ll need for an end-to-end platform: stream processing services (like data filtering and transformation), schema services (so disparate groups can collaborate on the same data with low communication overhead), and so on.

We’re very excited to see these pieces evolve over the coming months!

Kafka In Practice

What do our clusters look like? Salesforce has a global presence with multiple data centers in multiple locations. The overall system architecture is sharded (which is explained in more detail in Ian’s blog post, here). In every data center, we have multiple independent Kafka clusters for data ingestion, each with their own set of brokers, zookeeper hosts (for cluster management), and MirrorMaker hosts (for replication and aggregation).

Our aggregate volume of monitoring data on Kafka today is in the range of millions of events (hundreds of Megabytes) per second. This isn’t huge by industry standards (see LinkedIn’s recent post about their scale). But what’s striking is the growth rate: like every other part of Salesforce, our data processing needs have been growing exponentially for some time now. Our log data volume has doubled in the last year, and we expect that to continue. In aggregate, that results in many TBs of data every day, approaching PBs per month. And as we add new data sources (such as network flow data) we’re expecting an even bigger jump in volume.

What’s Next

The future for Kafka at Salesforce is bright. Many teams already use it for a wide variety of tasks (Editor’s note: we’ll be publishing more about these in the coming weeks). Everything from external data ingest (in Thunder) to release pipelines to push notifications — lots of systems benefit from this architectural design pattern. And we’ve got some exciting things to announce at Kafka Summit regarding new capabilities there, as well.

(Editor’s note: OK, the cat’s out of the bag … our exciting announcement is Heroku Kafka, Kafka as a service for Heroku apps!)

My team DVA, which stands for “Diagnostics, Visibility, and Analytics”, has always had a goal of providing real-time insights from our production systems to our engineering teams. That ranges all the way from low-level operational metrics, like CPU utilization and machine reachability, to high-level business metrics like capacity and usage, and the availability statistics on our trust.salesforce.com site. Kafka has been a key part of achieving this goal.

We’ve made some early forays into contributing to Kafka as well, and have plans for more. In the .8.2 series, we created a patch that added SSL support to Kafka, which we used in our own clusters for encrypting communication. (This patch isn’t part of the current Kafka main branch, because an alternate implementation was added for the 0.9 release which more broadly supported the scenarios other contributors were facing; all part of the process of an evolving open source project!)

Looking forward, we aim to increase our involvement and investment in Kafka. Our previously long list of contribution ideas actually got a bit shorter with the release of 0.9, because it added a lot of we had wanted! (For example, per-topic throttling was high on our wish list, and now it’s part of the main release.) We’ve also been exploring adding more automation around cluster management and elastic scaling, to aid us in working with the large number of independent clusters we have.

We’d love to connect with other teams who are running and managing large, multi-broker Kafka installations for a diverse set of customers. If you’d like to swap stories, talk big ideas, or find out about opportunities to join one of our teams, send an email to dvacareers@salesforce.com or if you are at Kafka Summit (April 26, 2016), come see us at lunch!