Written by Ram Sangireddy and Kartik Chandrayana, Product Management, Big Data Platform @ Salesforce, with contributions from our colleagues at Salesforce: Andrew Torson, William Earl, Vincent Poon, and Lars Hofhansl
The world has come a long way from the business needs around data, and the supporting technologies, of a couple of decades ago. Those were the days of small data, when data could be held in a single store and appliance-based data warehousing (DW) solutions would meet most business needs. As data volumes grew rapidly at large organizations to petabyte scale, the limitations of appliance-based DW solutions (such as the need for data consolidation, coupled storage and compute, maintenance costs, lack of data agility, etc.) quickly caught up with them and began stifling the effectiveness of interactive analytics solutions.
With that, the industry started building new models that separate storage and compute — enabling more effective ad hoc interactive analytics on Big Data — via efficient independent scaling of compute and storage depending on the workloads. As analytics involved querying increasingly multi-dimensional big data, columnar storage formats also became more prevalent.
These trends ushered in technology innovation delivering data warehouse-like capabilities without the constraints of appliance-based traditional data warehouse solutions, enabling a wider range of use cases — for example, querying increasingly distributed data sources, ad hoc interactive analytics instead of pre-canned reports and dashboards, and processing non-aggregated data at scale for finer, more granular insights. These use cases demand a massively parallel processing (MPP) technology with low-cost ad hoc query processing.
At Salesforce, data at massive scale is widely distributed. There are dozens of petabytes of data in a transactional store, over 5PB of log data accumulating every month in data centers, nearly 200PB accumulated so far in Hadoop storage, nearly 200 million metrics per minute flowing into time series databases, and enormous amounts of data in other properties. Moreover, much of this data is geo-distributed. At that scale and complexity, Salesforce is unmatched among B2B enterprises and is comparable with the largest B2C-scale companies. We need to support batch, stream, and interactive ad hoc query processing on these data sets; while Apache Spark is our preferred solution for a wide variety of Big Data Compute use cases, we needed to find a suitable solution for ad hoc interactive query processing.
Given all of the above factors, an MPP compute engine is a great fit, with its capability to deliver ad hoc, interactive, human-time analytics on highly distributed data sources while remaining agnostic to the hosting substrate. The engine should support analytics queries in SQL — a primary means of data exploration for analysts — and also support emerging users like citizen data scientists intending to discover insights directly on big data.
In a nutshell, our evaluation to find a technology suitable for our needs around interactive analytics via ad hoc querying encompassed a few key criteria:
- Human-time response latency for ad hoc queries on petabyte-scale data
- Mature support for standard SQL queries, with ability to join disparate data sets
- Connectivity to a wide range of data sources, with ease of adding new connectors
- Compute engine, with distributed processing of queries, scalable independently from storage
- Open Source software with an active community
We evaluated a number of technologies against the above criteria and want to share what we found, along with the solution that we ultimately decided worked for us.
- Drill: Apache Drill is a highly scalable open source application framework that includes a SQL query engine. It can fetch data from a variety of mainly non-relational data stores, such as NoSQL databases. It is based on a schema-less JSON document model for data, so it is more flexible, but slower, than engines based on schema-based columnar data formats.
- Impala: Apache Impala is a highly scalable, open source, distributed SQL query engine for big data, primarily oriented toward data on Hadoop clusters. It trades off fault tolerance for speed, keeping intermediate results in memory for high performance, and, by some metrics, is the fastest interactive query engine. It is optimized for the Parquet columnar data format, using files on the order of 256 MB. It can perform poorly with a large number of small files, for the same amount of data.
- Spark SQL: Apache Spark SQL is a highly scalable, open source, distributed SQL query engine for big data, with connectors to many data stores. It can deliver very high throughput for schema-based columnar data formats. For very large queries, running hours to days on many processors, it is a good choice, as it captures intermediate results in temporary files and can restart failed parts with a low time penalty. On the other hand, the minimum time for a very small query is relatively high; also, the resource usage and service time for small to medium queries are increased by the cost of saving intermediate results whenever the query plan cannot keep them in memory.
- Presto: Presto is an open source, distributed query engine for big data with mature SQL support. Presto bypasses MapReduce and uses SQL-specific distributed operations in memory. The architecture is designed such that all stages are pipelined, so there is no wait time, no need to write intermediate results to disk, no need to fit all data into memory, and no disk IO delays. Presto delegates some operations to the underlying data stores it queries, thus leveraging their inherent analytics capabilities. Further, Presto can perform cross-platform joins, thus providing centralized support for querying historical data across disparate data sources. Presto has advanced SQL support, with capabilities including dynamic filtering, dynamically resolved functions, and SQL-defined functions (CREATE FUNCTION). Like Impala, Presto sacrifices fault tolerance for speed.
- Druid: Druid is unique among related technologies: it is at once an OLAP database (comparable with Vertica, Redshift, and Snowflake), a distributed query processor, and a time-series DB, with stream processing features and a visualization UI that supports pivots. Druid is at its best with numerical and time-series data, and its ability to continuously ingest real-time business event streams makes it well suited for real-time queries. Long-running BI queries that need to touch a lot of historical data, however, must load that data from cold storage and will have longer processing times.
- Kylin: Apache Kylin is built to manage OLAP cubes in HBase to support fast SQL queries. OLAP cubes need to store many secondary indexes (one per dimension) and then use fast random access to retrieve records. Data warehouse files, by contrast, are oriented toward full table scans (“for each item in the haystack, add it to the result set if it looks like a needle”). Kylin is best suited to smaller-cardinality data and can be a much higher cost option for very large datasets.
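To make the cross-platform join capability noted for Presto above concrete, here is a sketch of a federated query. All catalog, schema, and table names (`hive`, `mysql`, `logs`, `crm`, etc.) are hypothetical and would be defined by the connectors configured on a given Presto cluster:

```sql
-- Join historical log data in a Hive/Parquet catalog against a live
-- relational store, in a single Presto query (names are illustrative).
SELECT a.account_name,
       count(*) AS api_calls
FROM hive.logs.api_requests AS r        -- columnar files on HDFS/S3
JOIN mysql.crm.accounts     AS a        -- operational relational DB
  ON r.account_id = a.account_id
WHERE r.event_date >= DATE '2020-01-01'
GROUP BY a.account_name
ORDER BY api_calls DESC
LIMIT 10;
```

Presto plans such a query across both catalogs, pushing work down to each underlying store where its connector allows, so no manual data movement is needed before the join.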
In summary, considering the use cases at Salesforce for human-time interactive ad hoc analytics on petabyte-scale data, across disparate data sources, with the ability to join these data sets, and with mature SQL support, we determined that Presto was the solution we needed. Presto is now an integral part of our Big Data Compute Platform, which supports data analysts, data scientists, and developers. With Presto, data analysts and developers can access data from multiple data sources, whether in the cloud or on premises, simultaneously, and can run their ad hoc queries via custom application interfaces or the BI tools of their choice. Data scientists can connect notebooks to Presto to perform exploration and ML modeling on data at scale, instead of working with a slice of data.
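As one illustration of the mature SQL support mentioned above, Presto allows functions to be defined in SQL itself via CREATE FUNCTION. The sketch below follows the shape documented for Presto's SQL-invoked functions; the `example.default` function namespace is an assumption and depends on a function namespace manager being configured on the cluster:

```sql
-- A SQL-defined scalar function (namespace is illustrative).
CREATE FUNCTION example.default.tan(x double)
RETURNS double
DETERMINISTIC
RETURNS NULL ON NULL INPUT
RETURN sin(x) / cos(x);
```

Once registered, such functions can be called from any query on the cluster, letting teams share common logic without writing and deploying Java plugins.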
As we continue to build the Big Data Compute Platform at Salesforce, we will share more engineering experiences and business outcomes. Stay tuned!