Machine Learning for a Secure, Available, and Performant Infrastructure

This blog post summarizes a Dreamforce 2017 session that was delivered on Tuesday, November 7. To watch that session, check out this recording!

On day 2 of Dreamforce 2017, a trio of experts explained how machine learning (ML) is being used to uphold the #1 value at Salesforce: Trust. Attendees of Dreamforce frequently hear about trust and what it means at Salesforce. For infrastructure, trust is the combination of providing customers with security, availability, and performance. But as the quantity of data generated by infrastructure grows in scale, the ability to process and analyze it using traditional processes breaks down. ML algorithms, however, can thrive in the environment of big data.

When it comes to security, analyzing events captured in logs is critical for spotting suspicious activity. But Salesforce generates tens of terabytes of log data per day. According to presenter Erik Bloch, Salesforce’s Director of Security Products, it takes two trained personnel to analyze just 100 GB of data. Not only does the manual approach fail to scale, but there also aren’t enough qualified candidates to hire even if it could.

Rules engines can handle big data, to a point. But even rules engines become unrealistically complex when they have to process, every day, an amount of data equivalent to the entire catalog of the United States Library of Congress. Rules engines are also fixed in their understanding of the data. They don’t learn. For example, Erik pointed to how potential intruders trying to brute-force passwords might find a way around a lockout mechanism that triggers after 5 failed login attempts: they might stop at 4 and make a change to avoid being locked out. A rules engine would fail to spot the pattern, but an ML algorithm would be able to associate the origin of the failed attempts with other dimensions, such as user behavior, and raise a flag over the anomalous pattern.
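To make the contrast concrete, here is a minimal Python sketch. It is not the approach described in the session: the event format, the IP addresses, and the outlier cutoff are all assumptions made for illustration, and a simple robust statistic (a modified z-score over failures per source) stands in for a trained ML model. The fixed rule looks at each account in isolation, while the second function aggregates failures by origin.

```python
from collections import defaultdict
from statistics import median

LOCKOUT_THRESHOLD = 5  # the 5-failed-attempts rule from the example above


def rule_engine_flags(events):
    """Fixed rule: flag any (source, account) pair that reaches the threshold."""
    return {(src, acct) for src, acct, failures in events
            if failures >= LOCKOUT_THRESHOLD}


def anomalous_sources(events, cutoff=3.5):
    """Flag sources whose total failed attempts are outliers.

    Uses a modified z-score (median and median absolute deviation) as a
    simple statistical stand-in for an ML model. The cutoff is an assumption.
    """
    totals = defaultdict(int)
    for src, _acct, failures in events:
        totals[src] += failures
    values = list(totals.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    mad = max(mad, 1.0)  # floor avoids division by zero when sources look identical
    return {src for src, total in totals.items()
            if 0.6745 * (total - med) / mad > cutoff}


# Hypothetical data: ten ordinary sources with a few failures each...
events = [(f"198.51.100.{i}", f"user{i}", i % 3) for i in range(10)]
# ...and one source that tries 20 accounts but always stops at 4 failures.
events += [("203.0.113.9", f"victim{i}", 4) for i in range(20)]

rule_hits = rule_engine_flags(events)
outliers = anomalous_sources(events)
```

With this data the fixed rule flags nothing, because no single account ever reaches 5 failures, while the per-source view singles out the one origin responsible for 80 failed attempts.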

One of the key factors affecting availability is component failure, and no hardware component fails as often as a hard drive. Salesforce generates billions of data points per day from many thousands of machines, including high-cardinality data from hard drives. So just feed that data to an ML algorithm, right? Not quite, said Yoni Michael, a Salesforce engineer involved with improving the way Salesforce predicts hard-drive failures. Because Salesforce uses hard drives from a number of different vendors, and each vendor’s drives report metrics and other information in their own way, the result is a highly heterogeneous dataset.

One truth about the output from ML is that it’s only as good as the input. Yoni’s team first has to scrub and standardize the data reported by hard drives, and only then feed it to ML algorithms that can correctly correlate failures with specific conditions. Once the algorithm is properly trained, it can start predicting which hard drives are likely to fail, and when. Proactive repair or replacement of those drives, which are a key part of Salesforce’s infrastructure, prevents the kinds of issues that can impact availability.
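The scrub-and-standardize step might look something like the following Python sketch. The vendor names, raw field names, units, and target schema are all hypothetical; the session did not describe the actual metrics or pipeline. The idea is simply to map each vendor's reporting format onto one common set of fields before any model sees the data.

```python
def fahrenheit_to_celsius(value):
    """Convert a Fahrenheit reading to Celsius."""
    return (float(value) - 32) * 5 / 9


# Hypothetical mapping: raw field name -> (standard field name, converter).
VENDOR_FIELD_MAPS = {
    "vendor_a": {
        "temp_c": ("temperature_c", float),
        "realloc_sectors": ("reallocated_sectors", int),
        "power_on_hours": ("power_on_hours", float),
    },
    "vendor_b": {
        "TemperatureF": ("temperature_c", fahrenheit_to_celsius),
        "ReallocCount": ("reallocated_sectors", int),
        "PowerOnDays": ("power_on_hours", lambda days: float(days) * 24),
    },
}


def standardize(record):
    """Translate one vendor-specific record into the common schema.

    Missing metrics become None so that a later step can decide how to
    handle them (drop, impute, and so on).
    """
    field_map = VENDOR_FIELD_MAPS[record["vendor"]]
    row = {"drive_id": record["drive_id"]}
    for raw_name, (std_name, convert) in field_map.items():
        raw = record.get(raw_name)
        row[std_name] = convert(raw) if raw is not None else None
    return row


raw_records = [
    {"vendor": "vendor_a", "drive_id": "a-1", "temp_c": "38.5",
     "realloc_sectors": 0, "power_on_hours": 1200},
    {"vendor": "vendor_b", "drive_id": "b-1", "TemperatureF": 104,
     "ReallocCount": "3", "PowerOnDays": 10},
]
clean_rows = [standardize(r) for r in raw_records]
```

After this step both records share the same keys and units, which is what lets a single model correlate failures with conditions across vendors.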

Mobile and web performance is so complex that there is simply no silver bullet for improving it. Ideally, optimization should happen as close to real time as possible, because network latency, bandwidth, and failures can affect millions of sessions at once. And somewhere in each of those sessions is a customer attempting to complete a task.

Not only is more data being moved over greater distances, but today’s networks and payloads are also more volatile than ever. A customer could be on a Wi-Fi, cellular, or wired connection, and the data might be encrypted. She could be using a cell phone, laptop, or other device. She might be in the same city, or across the globe. What she’s looking at may sit entirely on one server, or be assembled from resources located in two or more remote locations. Throw in different carriers, times of day, and quality-of-service requirements, and it’s a recipe for generating an ocean of data that must be understood very quickly.

Gabriel Tavridis, a senior director of product infrastructure at Salesforce, explained how Salesforce’s EDGE, an ML-based product geared to optimize network connections, solves this problem. EDGE consumes millions of data points per minute, then uses ML algorithms to process and analyze them. Finally, EDGE makes actionable recommendations to improve connections. But it doesn’t stop there.

Once decisions are made, EDGE can look back at their efficacy to better inform future recommendations. For example, routing a connection through a specific set of nodes might have made sense yesterday, but tomorrow that choice might have to be reevaluated for new conditions or a different context.
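A feedback loop of this kind can be sketched in a few lines of Python. This is not how EDGE works internally, which the session did not detail; the class name, route names, and parameters are hypothetical, and the technique shown is a basic epsilon-greedy selector over exponentially weighted latency averages. It is only meant to show how measured outcomes can displace yesterday's best choice.

```python
import random


class RouteSelector:
    """Pick the route with the lowest estimated latency, learning from outcomes."""

    def __init__(self, routes, alpha=0.3, epsilon=0.1, seed=None):
        self.alpha = alpha        # weight given to the newest measurement
        self.epsilon = epsilon    # fraction of choices spent re-testing other routes
        self.estimates = {route: None for route in routes}
        self.rng = random.Random(seed)

    def choose(self):
        """Return the route to use for the next connection."""
        untried = [r for r, est in self.estimates.items() if est is None]
        if untried:
            return untried[0]  # measure every route at least once
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.estimates))  # occasional exploration
        return min(self.estimates, key=self.estimates.get)

    def record(self, route, latency_ms):
        """Fold an observed latency into the running estimate for a route."""
        prev = self.estimates[route]
        if prev is None:
            self.estimates[route] = float(latency_ms)
        else:
            self.estimates[route] = (1 - self.alpha) * prev + self.alpha * latency_ms


# Exploration is switched off here so the example is deterministic.
selector = RouteSelector(["via_a", "via_b"], alpha=0.5, epsilon=0.0)
selector.record("via_a", 40)
selector.record("via_b", 60)
best_yesterday = selector.choose()

# Conditions change: via_a degrades, and the estimate follows the measurements.
selector.record("via_a", 120)
selector.record("via_a", 120)
best_today = selector.choose()
```

In practice a nonzero `epsilon` matters: without occasionally re-testing routes that currently look worse, the selector would never notice when one of them recovers.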

The Fourth Industrial Revolution has brought with it some amazing new technologies, but also a massive increase in the amount of data to analyze. Fortunately, it’s also brought new cutting-edge solutions that can keep up with that data. Salesforce is innovating with these solutions to create more effective security, prevent hardware problems before they occur, and make networks more intelligent.

Follow us on Twitter: @SalesforceEng
Want to work with us? Let us know!
