SQL or NoSQL

This post describes some experiences and lessons learned by the Twin Prime team regarding database selection. Today the team is integrating its data-driven acceleration technology with Salesforce’s mobile and desktop applications, and looking at further opportunities for applying its unique data approach inside Salesforce.

Our journey for finding the optimal database for our mobile performance optimization service began with a small amount of data and a requirement to ramp up fast. We already had event-logs coming from our server in MsgPack format, which is equivalent to having raw JSON data. We liked the idea of storing the event-logs “as is” without the need to define a schema. NoSQL seemed like the right way to go. We needed a database for multiple purposes: internal testing and debugging, performance analysis, customer-facing dashboard, and long-term storage. We assumed that one database should be the backend for all of those activities.

MongoDB seemed like the default choice: free, easy installation, large community, tight integration with scripting languages and a rich query language. We did not bother to benchmark its performance, because it had a good scaling reputation.

For a while things looked good:

As the volume of incoming data became more significant, we noticed that our aggregation queries grew slower. We heavily rely on aggregation for building performance reports and troubleshooting the root cause for poor performance. It turns out that MongoDB is not really optimized for this purpose. It does not have any specific indices for “group-by” queries. We realized that we had made the wrong choice of database, but it was not easy to get rid of MongoDB overnight. Mongo’s query language is proprietary and not compatible with other NoSQL databases, so migration seemed painful.

We spent some time building our own aggregation layer on top MongoDB. The basic idea was to reduce the database size by replacing chunks of raw records with “aggregated records”, that could represent them more efficiently. We realized there would be a loss of information involved in this process and came up with a progressive summary approach: the most recent records were kept in raw format for debugging purpose, while older ones were summarized on a daily basis, and later re-summarized on a monthly basis. This did not go well. We ended up feeling this was the kind of compromise where none of the database consumers were happy. Besides, the problem seemed generic enough to assume that someone must have already built tools for this. That was the point we started looking for a replacement for MongoDB.

This time, we defined a very simple benchmark test and applied it over all candidates. We loaded 100 million rows with the following format:

network-type (enum), device-type (enum), speed (int), size (int)

Then we ran aggregation queries like:

SELECT DEVICE_TYPE, COUNT(*), AVG(speed)
FROM logs
GROUP-BY DEVICE_TYPE;

We figured that our dashboard application would need such queries to run in a sub-second response time. MongoDB, though, needs more than 100 seconds to complete such a test.

We then explored other NoSQL options. We looked at ElasticSearch, which seemed 20x faster than MongoDB. It still did not meet the sub-second requirement. We were also reluctant about learning another proprietary query language like Mongo’s. We looked at New Relic, which overwhelmed us with their performance: we were able to build a dashboard that had 9 widgets, each one aggregated 100 million records differently, and the whole page was refreshed in sub-seconds. Alas, New Relic’s solution was a hosted one, when we needed to keep our data in-house. We looked at Splunk, which has a nice approach for processing machine-generated logs, but the performance was not there. We looked at a variety of options from the Hadoop/Cassandra world and got the impression that “aggregation” in this context is done in batch, while we were looking for real-time reporting. We also looked at the OLAP world and got the impression that these are mostly enterprise tools which have not adapted to cloud-scalability yet.

When our MongoDB became unusably slow, something interesting and unexpected happened. People still needed an ad-hoc solution for performance analysis. So they began deriving small portions of data from Mongo into temporary local SQLite databases. It was clearly a non-scalable approach, but it taught us some important lessons:

Everybody knows SQL. It’s not a proprietary query language like Mongo’s. SQL goes beyond the scope of the engineering department into almost everywhere in the organization. From an organization scaling perspective, it’s easier to hire new people who already know SQL, than looking for people who used MongoDB.
SQL has successfully been there for decades. Any competing language would need a significant amount of time to hash out all the conceptual challenges which encompass the life cycle of data applications. MongoDB along with its NoSQL counterparts is clearly not there yet.
The “no schema” concept of NoSQL is a smaller advantage than we had originally expected. The schema may quickly change at the product’s early days, but later on it stabilizes. It was not too hard to take our JSON format and normalize it into tables, columns, keys, etc.
So many databases support SQL/ODBC. If your application is written on top of SQL/ODBC, then it’s really not hard to switch between back-end databases. This was exactly our problem in getting rid of MongoDB.
Having multiple copies of the data may not be such a bad idea. If different tools are needed for accomplishing different jobs optimally, then why insist of having a one-size-fit-all tool?

So we took an interim decision that our next database would be SQL-compatible. It was still unclear how we’re going to meet the benchmark requirement. Postgres and MySQL are nowhere near the barrier we’ve set. And then, as it so often happens, we found the right Google search words: “analytic databases”. It turned out there’s a whole family of SQL databases which are built and optimized exactly for our kind of aggregation queries. One of the leaders in this category is Vertica, which was acquired by HP a few years ago. Vertica is internally a columnar database, but externally it features a full-blown SQL interface. Vertica offers a high-end enterprise product that certainly meets the performance requirements out-of-the-box. We found that its snappy response time can be leveraged to build new interactive tools that are otherwise not feasible. Ultimately, we decided on Redshift, which is another member of this columnar family since it better fit our overall business constraints. Thanks to SQL compatibility, we did not have to rewrite any of our data applications. Migration has finally became simple.

What lessons can be learned from our experience? First, upon choosing a data technology, it’s important to look beyond the “coolness factor” and the ease of ramp-up. Take an informed decision about the right database for your specific use case. Second, have an “exit strategy” from the database you choose, because your data requirements are likely to change.

Happy benchmarking!