Building a Scalable Event Pipeline with Heroku and Salesforce
Have you heard of Chatbots? In Salesforce Service Cloud, a growing number of customers across the globe are finding Chatbots to be a useful addition to their Service offerings. Chatbots are an exciting new technology and an enabler for organizations and teams who embrace automation. By setting up conversational dialogs and rules, custom scripts, and a sprinkling of machine learning or artificial intelligence, a carefully configured chatbot can answer many targeted questions for your customers.
Making this work for our customers at scale took a lot of work behind the scenes! Enter: the Event Pipeline.
How is Chatbots related to an Event Pipeline?
When building an application at Chatbots scale, the ability to reliably share data, export logs, and collect metrics is critically important. This data is used to support the application, the users, and their businesses.
The Chatbot platform required an architecture to reliably export “event logs” or “debug logs” from our main Chatbot Runtime application. Customers themselves need these logs for key insights, such as:
- Auditing Chatbot’s logical flow — is it working and performing as intended?
- Pinpointing any exceptions that are occurring within the logic of their bot
- Pinpointing any exceptions occurring in custom code they created as part of an Invocable Action or Invocable Flow
Once we defined the attributes an event would contain, we needed a way to export these events in near real-time for each user session within an Organization. These events provide crucial visibility into the logical steps a bot takes when responding to a user.
What drove our architectural design?
The main design questions were:
- How could we ensure the events would be reliably delivered, without degrading the performance of the main app?
- How could we ensure resilient delivery of events to individual Salesforce Orgs, in a manner that would scale out as the volume of these events increased over time?
Answering these key design questions led to an architecture built on Salesforce’s own Heroku Platform, specifically Apache Kafka on Heroku. An event pipeline built on Kafka enabled us to design the pipeline’s characteristics to meet our scalability needs, targeting volumes of tens of millions of events per day.
The event pipeline built and used today transmits events from an asynchronous and non-blocking Producer with low latencies. Rather than Chatbot Runtime being overloaded with processing and uploading event data to external sources via HTTP posts, this design limits Chatbot Runtime’s responsibility to simply constructing and buffering these events into the pipeline. The responsibility of data upload via an HTTP post or other means is passed to a downstream Consumer. This results in a clean separation of concerns and direct paths to scaling these features, with less impact on Chatbot Runtime’s core responsibility of processing bot messages.
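To make the producing side more concrete, here is a minimal sketch of what an asynchronous, non-blocking emitter could look like with the standard Apache Kafka Java client. The `EventEmitter` class, topic name, and payload format are illustrative rather than the actual Chatbot Runtime code, and the SSL settings that Heroku Kafka connections require are omitted for brevity.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventEmitter {

    private final KafkaProducer<String, String> producer;

    public EventEmitter(String brokers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Batch and compress records client-side so each send() only appends to an in-memory buffer.
        props.put(ProducerConfig.LINGER_MS_CONFIG, 5);
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip");
        this.producer = new KafkaProducer<>(props);
    }

    /** Emits one event without blocking the caller; delivery happens on the producer's I/O thread. */
    public void emit(String sessionId, String eventJson) {
        // Keying by session ID keeps a session's events ordered within a single partition.
        ProducerRecord<String, String> record =
                new ProducerRecord<>("chatbot-events", sessionId, eventJson);
        producer.send(record, (metadata, exception) -> {
            if (exception != null) {
                // Log and move on -- the main app never waits on broker acknowledgement.
                System.err.println("Failed to enqueue event: " + exception.getMessage());
            }
        });
    }
}
```

Because `send()` only hands the record to the producer’s in-memory buffer, the call returns almost immediately, which is what insulates the main app’s latency from the pipeline.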
These events in transit are stored in a Kafka topic for up to 7 days, with a carefully selected number of partitions that allow us to scale out and deliver these events in near real-time. Consumers read from this topic, collect and batch events, and upsert bulk sets of data to downstream data stores used by a given organization.
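On the consuming side, the batch-and-upsert loop could look roughly like the sketch below, again using the standard Kafka Java client. The consumer group name, topic name, and `postDownstream` helper are hypothetical stand-ins for the real consumers and their bulk upsert logic.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EventBatchConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, System.getenv("KAFKA_URL"));
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "chatbot-event-uploaders");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Commit offsets only after a batch has actually been delivered downstream.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("chatbot-events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) {
                    continue;
                }
                // Fold the polled events into one bulk payload and upsert it downstream in a single call.
                StringBuilder bulkPayload = new StringBuilder("[");
                for (ConsumerRecord<String, String> record : records) {
                    if (bulkPayload.length() > 1) {
                        bulkPayload.append(',');
                    }
                    bulkPayload.append(record.value());
                }
                bulkPayload.append(']');
                postDownstream(bulkPayload.toString()); // hypothetical bulk HTTP upsert
                consumer.commitSync();
            }
        }
    }

    private static void postDownstream(String payload) {
        // Placeholder for the bulk upsert to the org's downstream data store.
    }
}
```

Committing offsets only after the bulk post succeeds gives at-least-once delivery; if a consumer crashes mid-batch, the events are simply re-read and re-posted.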
Benefits of this Architecture
This architecture allowed us to design the pipeline and consumers with three major benefits in mind:
- Defer high latency HTTP calls
- If every event translated into an HTTP request from the main app, performance would quickly degrade as event volume increased.
- Instead of posting events via HTTP from the main app, which can involve high latencies, events are emitted into a Kafka topic via an asynchronous and non-blocking Producer. Emitting an event takes about 1ms, so even at extremely high event volumes there is no meaningful impact on the main application’s performance.
- The consumers then read and batch events to efficiently deliver those batched payloads downstream.
- As an added optimization, the HTTP posts by the consumers happen asynchronously via a pool of executors, allowing multiple posts to happen in parallel.
- Tailor the size and configuration of our Kafka cluster and topics for the projected volume of events
- By carefully selecting the number of partitions per broker, you can balance the need to scale out the Consumers processing data against the performance of the cluster itself in terms of rebalancing and replication.
- The number of partitions per topic directly limits how many consumers can read from that topic in parallel. Too few partitions and you limit throughput; too many and you consume more resources and increase the time it takes to re-elect leaders when brokers recover from failure.
- By structuring events and consumers so that events can be processed in any order, you can parallelize consumers and evenly distribute events across each partition; if the order of events is vital, you can route a class of events to a given partition by assigning it a key (e.g. a session ID or an org ID).
- Other key reliability and performance considerations for your Kafka cluster are the number of brokers in each availability zone, the replication factor, the data retention policy, and event compression (see the topic configuration sketch after this list).
- Allow for efficient failover queues and reprocessing
- In the event a downstream datastore is not available at the time of publishing consumed data, you want your system to try to re-deliver this data downstream at a later time.
- Processes that rely on HTTP, such as HTTP bulk posts, are destined to have failure rates greater than 0. For these processes, retry topics are a great method of identifying failed deliveries and retrying.
- Retry topics in your cluster act as queues that other retry consumers can consume from and attempt to re-post at predefined intervals. By emitting failed bulk posts into retry topics, you can add a high degree of bulk post resiliency.
- By handling failures and retries via special retry topics, each queue can be processed efficiently and keep moving forward, without causing your pipeline to stagnate or back up (see the retry sketch after this list).
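To make the cluster and topic settings above concrete, here is a sketch of how partition count, replication factor, retention, and compression could be expressed with Kafka’s Java `AdminClient`. The specific numbers (32 partitions, replication factor 3, 7-day retention) are illustrative rather than our production values, and on Heroku Kafka topics are typically managed through the add-on tooling rather than created directly, but the same parameters apply.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class TopicSetup {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, System.getenv("KAFKA_URL"));

        try (AdminClient admin = AdminClient.create(props)) {
            // Partition count caps how many consumers can read in parallel;
            // replication factor 3 keeps a copy of each partition on three brokers.
            NewTopic eventsTopic = new NewTopic("chatbot-events", 32, (short) 3)
                    .configs(Map.of(
                            // Retain events for 7 days, matching the pipeline's retention window.
                            TopicConfig.RETENTION_MS_CONFIG, String.valueOf(7L * 24 * 60 * 60 * 1000),
                            // Compress stored events to reduce disk and network usage.
                            TopicConfig.COMPRESSION_TYPE_CONFIG, "gzip"));
            admin.createTopics(Set.of(eventsTopic)).all().get();
        }
    }
}
```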
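To round out the failover discussion, here is a minimal sketch of the retry-topic pattern: when a bulk post fails, the payload is parked on a retry topic, and a separate retry consumer re-attempts delivery on a fixed interval. The topic name, org-ID key, and `postDownstream` helper are hypothetical, and a real implementation would also cap the number of attempts or route exhausted payloads to a dead-letter topic.

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class RetryPipeline {

    private final KafkaProducer<String, String> retryProducer;
    private final KafkaConsumer<String, String> retryConsumer; // assumes enable.auto.commit=false

    public RetryPipeline(KafkaProducer<String, String> retryProducer,
                         KafkaConsumer<String, String> retryConsumer) {
        this.retryProducer = retryProducer;
        this.retryConsumer = retryConsumer;
    }

    /** Called by the main consumer when a bulk post fails: park the payload on the retry topic. */
    public void parkForRetry(String orgId, String bulkPayload) {
        retryProducer.send(new ProducerRecord<>("chatbot-events-retry", orgId, bulkPayload));
    }

    /** A dedicated retry consumer drains the retry topic on a fixed interval and re-attempts delivery. */
    public void drainRetries(Duration retryInterval) throws InterruptedException {
        retryConsumer.subscribe(List.of("chatbot-events-retry"));
        while (true) {
            ConsumerRecords<String, String> failed = retryConsumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : failed) {
                boolean delivered = postDownstream(record.value()); // hypothetical bulk re-post
                if (!delivered) {
                    // Still failing: re-enqueue so this queue keeps moving instead of backing up.
                    parkForRetry(record.key(), record.value());
                }
            }
            retryConsumer.commitSync();
            // Wait before the next pass so re-deliveries happen at a predefined interval.
            Thread.sleep(retryInterval.toMillis());
        }
    }

    private boolean postDownstream(String payload) {
        // Placeholder for the downstream bulk post; returns whether delivery succeeded.
        return false;
    }
}
```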
Figure 2: Example graphs of the Chatbots Event Pipeline in production, delivering millions of events per day for Organizations building bots. Note: these graphs do not show actual production numbers.
Summary
When building your application or service, you will likely need to publish events for a variety of use cases and business demands. An effective way to do this, and to decouple the performance of your main app from the publishing pipeline, is to defer HTTP publishing to downstream consumers. Apache Kafka, and especially a managed Kafka cluster such as the one offered by Heroku, is a battle-tested platform that provides this capability. With this platform and a few key design considerations, you can reliably grow your event pipeline without sacrificing the performance or scalability of your core services.