During Dreamforce 2017, in his talk “How Salesforce Leverages Machine Learning & Artificial Intelligence in the Infrastructure,” Kartikeya Chandrayana discussed strategies for getting the most out of, well, anything you need to optimize.
That bears repeating. At Salesforce, Trusted Customer Success is our number one value. We are committed to providing the fastest, most reliable, most consistent, and most secure experience to our customers. In Infrastructure (the organization within Salesforce that manages the physical data centers), delivering the highest standard in system availability, performance, and security is our top priority. To reach that standard, we use machine learning (ML) and artificial intelligence (AI) as basic tools to continuously improve performance.
During his talk, Kartik drew upon his extensive experience working with ML and AI to discuss four key things that Salesforce does to make the right decisions to optimize performance. These are:
- Make dynamic decisions
- Use multiple data points
- Use feedback-driven analysis
- Reevaluate the baseline
Static vs Dynamic Decisions
Static decision making happens when you react to an event. You are riding your bicycle, and your rear tire goes flat with a loud POP. That’s the event. How do you react? You walk your bike 23 blocks to the nearest bike shop.
Dynamic decisions are predictive. Every day you check the tread on your bike tires. One day you see a tear in the sidewall. That’s also an event. But this time, instead of walking your bike, you ride to the shop and get a new tire.
Reactive decisions need only one data point in time (the flat tire). Predictive decisions depend on a constant flow of data (frequent visual checks). When you discover a vulnerable tire, you replace it before there’s a crisis.
Use Multiple Data Points
More data, yes, but it doesn’t mean we look at just one thing more often. It also means we need to look at different things so we can paint a fuller picture. Kartik gave the example of his experience with an early iPhone on an LTE network. Download response was extremely sluggish, yet the phone’s signal strength meter displayed 4 bars. Strongest signal, right? Then why were downloads so slow? After some digging into the telephony network, he discovered that the network had very little bandwidth, resulting in speeds similar to dial-up on coffee shop Wi-Fi. The one data point (the signal strength meter) didn’t tell the whole story. When it comes to collecting data to analyze later, more is better, but within limits (we’ll talk about that later).
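To make the four-bars story concrete, here is a minimal Python sketch of a check that refuses to trust one reading on its own. The function name, the thresholds, and the measured-throughput reading are all assumptions made up for illustration; none of them come from Kartik’s talk.

```python
def diagnose_connection(signal_bars, throughput_mbps, min_bars=3, min_mbps=1.0):
    """Combine two independent readings instead of trusting either alone.

    The thresholds are illustrative assumptions, not real carrier values.
    """
    strong_signal = signal_bars >= min_bars
    fast_enough = throughput_mbps >= min_mbps
    if strong_signal and fast_enough:
        return "healthy"
    if strong_signal:
        # The case from the story: full bars, but no usable bandwidth.
        return "strong signal, congested network"
    if fast_enough:
        return "weak signal, but usable"
    return "weak signal"


print(diagnose_connection(signal_bars=4, throughput_mbps=0.05))
```

With only the signal meter, the phone in the story looks fine. Adding a second, different measurement is what exposes the congested network.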
Feedback Drives Analysis
Salesforce stores and processes an immense amount of data over many, many (did I mention many?) disk drives. Drives have longevity ratings, so on any given day you know some of them may fail. But how do you predict which ones? We started out by monitoring all facets of a drive in operation, used ML to collect and aggregate that information for every drive in the data center, and then used AI to review those stats against known failed drives. This gave us an initial set of danger signs to look for, so drives could be replaced before they were likely to fail.
But we didn’t stop there — if we had, that would have been a static decision. With the ML/AI loop in place, we can make constant adjustments to improve our predictions. Maybe we are too optimistic (allowing some drives to fail in place) or too pessimistic (replacing drives much too soon). Maybe we need to look more carefully at different drive properties. Further analysis loops help refine that.
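As a rough illustration of that adjustment loop, the sketch below nudges a replacement threshold after each review window. It is not how Salesforce’s system works internally: the `Outcome` record, the 0-to-1 risk score, the step size, and the idea that a replaced drive can be tested afterward to see whether it was really failing are all assumptions for the sake of the example.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    risk_score: float  # hypothetical model score between 0 and 1
    failed: bool       # did the drive fail in place, or test bad after removal?


def adjust_threshold(threshold, outcomes, step=0.05):
    """Return a new replacement threshold based on the last review window."""
    # Too optimistic: drives we left in place that went on to fail.
    missed = sum(1 for o in outcomes if o.failed and o.risk_score < threshold)
    # Too pessimistic: drives we flagged that turned out to be healthy.
    premature = sum(1 for o in outcomes if not o.failed and o.risk_score >= threshold)
    if missed > premature:
        threshold -= step  # flag drives sooner
    elif premature > missed:
        threshold += step  # flag drives later
    # Clamp to the valid score range; round to hide floating-point noise.
    return round(min(1.0, max(0.0, threshold)), 4)


last_window = [Outcome(0.60, True), Outcome(0.65, True), Outcome(0.90, False)]
print(adjust_threshold(0.70, last_window))
```

Here two drives failed below the 0.70 cutoff and only one healthy drive was pulled early, so the threshold drops to 0.65 for the next window. Running the same function after every window is what keeps the decision from going static.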
Set a Baseline, Repeatedly
Moore’s Law tells us that technology evolves at a rapid pace. One of the implications of this is that any system designed to make use of an analysis loop also needs to evolve.
You set up your baseline based on your environment: 100 500M SATA drives in a closet, a network of 300 iPhones on an LTE network, or four all-season tires on an SUV. The data you collect, and how you collect and look at that data, are tuned to that context.
Fast forward a week, and the world shifts a bit. Now you have 200 1T solid-state drives in your kitchen, a 5G network of 1500 Note 8s and 2500 iPhone Xs, and 18 wheels on a loaded Peterbilt 579. How you look at performance and the data you collect needs to change in response.
Earlier I mentioned that more data is better, within limits. When you start to analyze data for prediction, you need to make some coarse-grained decisions on what not to pay attention to as part of the baseline set. Because my drives are not in the closet anymore, I don’t need to worry about ambient temperature over 80°F. Because my network excludes 3G phones, I can ignore conditions relevant only to earlier OS versions. Because I’m driving a long-haul Peterbilt 579, I don’t need to worry about treadwear on the 16” wheels found on most SUVs.
The point is we can constantly refresh our ability to accurately predict what to do based on some big assumptions (the baseline) and on a pile of minutiae (data within the baseline environment).
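One way to picture the split between the baseline and the minutiae is a small Python sketch in which the baseline is simply the set of signals worth watching in the current environment. The baseline names and signal names below are hypothetical, chosen to echo the closet and kitchen examples.

```python
# Hypothetical baselines: each names the signals that matter in that environment.
BASELINES = {
    "closet_sata": {"ambient_temp_f", "reallocated_sectors", "read_errors"},
    "kitchen_ssd": {"wear_level", "read_errors"},
}


def signals_to_analyze(readings, baseline_name):
    """Keep only the readings the current baseline says are worth analyzing."""
    watched = BASELINES[baseline_name]
    return {name: value for name, value in readings.items() if name in watched}


readings = {"ambient_temp_f": 84, "wear_level": 0.12, "read_errors": 3}
print(signals_to_analyze(readings, "kitchen_ssd"))
```

Under the kitchen baseline the ambient temperature reading is dropped before any analysis happens. When the environment changes, you swap or edit the baseline rather than rewriting the analysis.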
It’s a Not-So-Vicious Circle
This dynamic feedback cycle gets you started. But as I noted above, it has to be constantly evolving. Making dynamic decisions driven by multiple data points, using feedback to determine what to look at, and readjusting the baseline are just the beginning. Just remember to keep the goal in mind.
Though a recording of the original talk is not available, you can review the slides used for the presentation for more information.