Chatbots can be a powerful force-multiplier for Service organizations. Giving Service Cloud customers the ability to develop and debug a custom-configured Chatbot effectively requires tools such as debug logs, which provide a step-by-step view of the Chatbot’s execution. Beyond debugging, it is helpful to see critical metrics that highlight which portions of the bot are used most often versus those throwing exceptions. Whether it’s a call to a custom Apex script, an external request to an Einstein.ai Natural Language Processing (NLP) endpoint, or the conversation flowing between two dialogs, customers need to observe that their bot is executing as intended.
To allow for these insights, the “engine” for Chatbots emits events containing data and metadata for each action taken, along with counts of critical metrics. A Chatbot administrator ultimately views these events and metrics. These streams of data flow asynchronously through the Chatbot Events Pipeline. It is the pipeline’s responsibility to deliver this data and metadata reliably.
“Failures are a given, and everything will eventually fail over time.”
Werner Vogels
Preventing faults before they occur is a goal we strive for as engineers. However, there will be times when not all goes as planned: services outside our control will be unavailable or will fail, a configuration external to our architecture might cause failures downstream, or a network-level failure might occur.
At any scale, your data pipeline or service needs to anticipate and mitigate the most fundamental and common faults.
- If you are required to send data downstream via HTTP, it is inevitable that those endpoints will at some point be unavailable or unresponsive.
- In the case of a data pipeline, it is likely you will be required to reprocess data from a point in time.
If your design does not handle these common conditions in a fault-tolerant manner, the result can not only degrade your service but also affect the health of services upstream and downstream of it.
What critical design choices can you make to help address these common faults and increase the reliability of the data pipeline? What steps can you take to deliver data while also limiting the risk of service degradation?
Presented here are five architecture and development choices made in the creation of a fault-tolerant event pipeline for Einstein Bots (Figure 2), along with explanations for each.
1. Choose a fault-tolerant event streaming platform such as Apache Kafka
“How many times do I have to tell ya…the right tool for the right job!”
Montgomery Scott
Apache Kafka is a durable and fault-tolerant publish-subscribe messaging system. While it is highly scalable, it is not just for high-throughput use cases. Kafka comes with configurable durability guarantees: topic data is split into partitions, and each partition is replicated across brokers in the cluster. Ideally, you should spread these replicas across brokers in different network Availability Zones.
Kafka Consumers can fail and pick up where they left off, or where another Consumer in the consumer group last processed events. Because Kafka retains messages (or events) for a configurable amount of time, your Consumers can reprocess data from a given topic offset.
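These durability and consumer-offset behaviors are driven by configuration. As a sketch (the values below are illustrative, not the pipeline’s actual settings), durability-leaning configuration might look like:

```properties
# Broker/topic config: replicate each partition across 3 brokers,
# and require at least 2 in-sync replicas to acknowledge a write.
default.replication.factor=3
min.insync.replicas=2

# Producer config: wait for all in-sync replicas to acknowledge each write.
acks=all

# Consumer config: commit offsets only after events are fully processed,
# so a restarted Consumer resumes from the last processed offset.
enable.auto.commit=false
```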
Downstream from Kafka, you’ll need to ensure that your event consumers follow fundamental fault-tolerant practices when making HTTP requests to downstream destinations.
2. Implement Timeouts and Retry Failed HTTP Posts
A solid start to protecting your service with HTTP fault tolerance is sensible connection timeouts (7a in Figure 3 above). High-latency responses from legitimate downstream endpoints are bound to occur. If you have not set a timeout as an upper limit on your HTTP connections, a slow endpoint can exhaust the connections in your connection pool, causing delays or even failures for subsequent posts. Use timing data from your logs to drive this timeout value: start by setting it slightly larger than the median latency of a request (e.g., 1.5 * medianLatency). You don’t want to set the timeout so low that you reject legitimate connections, but you do want to ensure HTTP requests always continue to flow by not exhausting your service’s available connections.
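As a minimal Python sketch of deriving that starting value (the function name and the sample latencies are illustrative, not from the pipeline):

```python
import statistics

def timeout_from_latencies(latencies_ms, factor=1.5):
    """Derive an HTTP timeout from observed request latencies.

    Starting slightly above the median cuts off pathologically slow
    requests without rejecting legitimate ones.
    """
    return factor * statistics.median(latencies_ms)

# Latency samples (in ms) pulled from your service's logs -- illustrative values.
observed = [120, 95, 110, 130, 105]
timeout_ms = timeout_from_latencies(observed)  # 1.5 * 110 = 165.0

# Apply the timeout to every outbound request so a slow endpoint
# cannot hold a pooled connection indefinitely, e.g. with urllib:
# urllib.request.urlopen(url, data=payload, timeout=timeout_ms / 1000)
```

The derived value is only a starting point; revisit it as your latency distribution shifts.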
Retry Failed Requests
Whether your downstream request fails due to a timeout or some other HTTP error response, you will want to retry that request (#10 in Figure 3 above) at some interval to attempt to deliver its payload. Be respectful of downstream services and do not proliferate retries beyond a reasonable upper limit; if a downstream service is attempting to recover, hitting it with batch after batch of retries from your application will limit its ability to do so. In the Chatbot Events Pipeline, failed requests are retried up to a maximum of six times over 16 hours; if the downstream endpoint becomes stable and responsive during that time, the Consumers successfully deliver the payload without adding an avalanche of requests.
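A compact sketch of bounded retries with exponential backoff (this is not the pipeline’s actual retry code; the six-attempt/16-hour schedule above would use far longer delays than the ones shown):

```python
import time

def post_with_retry(send, payload, max_attempts=6, base_delay=1.0, sleep=time.sleep):
    """Retry a failed delivery with exponential backoff.

    `send` is any callable that raises on failure (timeout or HTTP error).
    Backoff spaces retries out so a recovering endpoint is not flooded.
    """
    for attempt in range(max_attempts):
        try:
            return send(payload)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            sleep(base_delay * (2 ** attempt))  # delays of 1s, 2s, 4s, ...
```

Injecting the `sleep` function keeps the backoff schedule easy to test and to swap for a longer, jittered one.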
3. Design with Idempotence and Upsert Semantics
Life is much less stressful with idempotence. Most commonly, this will apply when you are intentionally reprocessing events from a specific Kafka topic offset, but there can be other situations where assurance of not creating duplicate records can save many headaches for you and your customers. The main design trait here is to avoid creating duplicate data downstream during reprocessing.
One related best practice is to ensure each of your events has a universally unique identifier (UUID) and to upsert into downstream datastores keyed on this value. Subsequent upserts can then find the existing row and update it rather than create a duplicate. Whether you run an UPSERT or check for duplicates before an INSERT is a decision for you and your team, but the key is to design for reprocessing such that you avoid generating duplicate data.
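A minimal sketch of UUID-keyed upserts, with SQLite standing in for whatever downstream datastore you use (table and column names are illustrative):

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def upsert_event(event_id, payload):
    """Insert an event, or update the existing row on a UUID collision.

    Reprocessing a topic offset replays the same event UUIDs, so the
    upsert overwrites the existing row instead of duplicating it.
    """
    conn.execute(
        "INSERT INTO events (event_id, payload) VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET payload = excluded.payload",
        (event_id, payload),
    )

event_id = str(uuid.uuid4())
upsert_event(event_id, "first delivery")
upsert_event(event_id, "reprocessed delivery")  # same UUID: update, not duplicate
```

Because the event carries its UUID from the moment it is emitted, every reprocessing pass converges on the same rows.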
4. Implement a CircuitBreaker per Downstream Endpoint
CircuitBreakers will significantly aid with protection for your service and each downstream endpoint (#8 in Figure 3 above). The high-level concept is simple: a buffer holds the health of each HTTP endpoint in memory; if the number of failures in that buffer surpasses a given threshold, the circuit is set to OPEN, and subsequent calls to that endpoint within a mandatory waitInterval are short-circuited as errors, providing two significant benefits:
- If posts to a downstream URL are failing consistently, your service does not wait an unnecessary amount of time (up to your timeout) for every potential request. By not wasting resources on an endpoint in a known-failed-state, your service can more readily service requests to other healthy endpoints.
- By not sending traffic to a struggling endpoint, you are not pelting that endpoint with requests, allowing it an increased opportunity to recover.
After the mandatory waitInterval has passed in an OPEN circuit, requests start to trickle to the target in a HALF_OPEN state. If the number of failures during this sample phase does not exceed your configured threshold, that endpoint has “healed” (the circuit is set back to the CLOSED state), and requests begin flowing again to that endpoint!
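A stripped-down Python sketch of those state transitions (illustrative only; thresholds, clock injection, and error types are assumptions, not the pipeline’s implementation):

```python
import time

CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

class CircuitBreaker:
    """Minimal per-endpoint circuit breaker sketch."""

    def __init__(self, failure_threshold=5, wait_interval=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.wait_interval = wait_interval
        self.clock = clock
        self.state = CLOSED
        self.failures = 0
        self.opened_at = None

    def call(self, request):
        if self.state == OPEN:
            if self.clock() - self.opened_at < self.wait_interval:
                raise RuntimeError("circuit open: short-circuited")  # fail fast
            self.state = HALF_OPEN  # waitInterval elapsed: allow a trial request
        try:
            result = request()
        except Exception:
            self.failures += 1
            if self.state == HALF_OPEN or self.failures >= self.failure_threshold:
                self.state = OPEN  # trip (or re-trip) the circuit
                self.opened_at = self.clock()
            raise
        self.failures = 0
        if self.state == HALF_OPEN:
            self.state = CLOSED  # trial succeeded: endpoint has "healed"
        return result
```

A production version would track failures in a sliding window per endpoint rather than a simple counter, which is exactly what the battle-tested open-source implementations provide.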
There are some battle-tested open-source CircuitBreaker implementations available and worth checking out.
5. Implement a Bulkhead per Downstream Endpoint
As you send data downstream, you should limit the number of concurrent requests to any given endpoint. The Bulkhead Pattern does precisely this (see #6 in Figure 3 above). As with the CircuitBreaker, this helps protect both your service and the downstream service you are posting to:
- If you implement a bulkhead, you limit the number of concurrent requests to that downstream endpoint, and you distribute requests from your service more equally across endpoints. In other words, bulkheads prevent the overuse of a pool by a given endpoint. Even with timeouts and CircuitBreakers, there is a risk that a given endpoint could consume a disproportionate number of threads, temporarily blocking your service’s ability to post elsewhere. For example, if your service is posting downstream to Endpoint A, Endpoint B, and Endpoint C, and requests to Endpoint C become unresponsive, then without protection from a bulkhead your service risks waiting in line for Endpoint C requests to complete or time out.
- By limiting the number of concurrent requests to a given endpoint, you are setting an upper limit at any given moment for the requests sent from your service to that endpoint. Even with fanned-out Executor threads performing HTTP posts, for instance, you can ensure all requests pass through a Bulkhead and are therefore limited.
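A bulkhead can be as simple as a per-endpoint semaphore gating every outbound request; here is a minimal Python sketch (class and endpoint names are illustrative):

```python
import threading

class Bulkhead:
    """Cap concurrent requests to one downstream endpoint."""

    def __init__(self, max_concurrent):
        self.semaphore = threading.BoundedSemaphore(max_concurrent)

    def call(self, request):
        # Block here when the endpoint's slots are exhausted (or use
        # acquire(blocking=False) to fail fast instead), so one slow
        # endpoint cannot monopolize the service's threads.
        with self.semaphore:
            return request()

# One bulkhead per endpoint keeps Endpoint C's stalls from starving A and B.
bulkheads = {
    "endpoint-a": Bulkhead(max_concurrent=4),
    "endpoint-b": Bulkhead(max_concurrent=4),
    "endpoint-c": Bulkhead(max_concurrent=4),
}
```

Even with fanned-out Executor threads performing the posts, routing every request through its endpoint’s bulkhead enforces the upper limit.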
I hope you found these ideas insightful, thought-provoking, and helpful towards your development efforts. Please share your opinion in the comments or feel free to clap your hands. Have a fault-tolerant year!