A Peek at Datorama’s AWS S3 SQL Query Tool

Datorama Reports for Marketing Cloud enables you to generate, view, and share a detailed analysis of your Email, Push, and Journey campaign-level data. For that, Datorama extracts large volumes of data to its Data Lake solution for the Marketing Cloud marketing analytics, which is stored in a structured table compatible architecture.

Through our Datorama product, we digest large volumes of data from many possible input sources. One of those potential sources is an Amazon AWS S3 bucket that contains a customer’s data (for more info about S3 you can visit What is Amazon S3?). We have designed a process to Extract-Transform-Load (ETL) this data from their source to our own Data Lake for various purposes.

Why do we need SQL Query tool over AWS S3?

We sometimes need to analyze the data inside some S3 files in their original non-manipulated state. There is no out-of-the-box capability to do this on multiple files, which currently requires tedious manual work. One must first download all the files from S3 to a local storage, then merge them to a unified data file, and only then start to analyze the data. In addition, each file object in our case can include only a portion of a more extensive data set, so a preliminary step would be to download all the shards of that specific data set and merge them to get the complete data set. Lastly, we can have vast amounts of data entries (over 1M) in each file. Even analyzing only one file before merging it with other files can be a difficult task requiring some dedicated analytics tool.

For example, suppose we want to check whether our automatic ETL processes are operating as expected. In that case, we can use such capability to manually extract the data, or some insights on that data, directly from S3 and compare that to the results of the automatic data extraction results.

One file for some date might be of the form:

What exactly do we need?

We want to have the capability to execute SQL-like queries, in real time, over the current, most-up-to-date data that is stored in multiple files (objects) in some AWS bucket using dynamic parameters to specify the required set of files.

Some base assumptions

The files are in Hive-compatible format and all reside in the same AWS bucket.
The files are nested in a path of the following format: “{BUCKET_NAME}/{PATH_PREFIX}/YYYY/MM/DD/{PATH_SUFFIX}/{FILE_NAME}“.

How do we do it?

1. First, we receive from the user all and only the necessary parameters to calculate the S3 paths of the required files, i.e the date range, FILE_NAME, etc. We then calculate all the required S3 full paths.

2. Later, we use Trino with Hive Connector to create an External Hive Table that points to the base path of all the files, i.e. “{BUCKET_NAME}/{PATH_PREFIX}”.

3. Once we have that external table existing in Trino’s Hive Connector schema, we need to make it point to the actual files.

4. Since the files are stored in date-based partitions, we use a similar partitioning method to map those files in the external table. For each date path, we add a single partition in the table.

5. For example, in case we want to query data for the date range 17/03/2022 till 18/03/2022, we’ll create a table pointing to “s3a://PREFIX” and add two partitions to it, one for “2022/03/17/SUFFIX/my_file” and one for “2022/03/18/SUFFIX/my_file”.

6. Finally, in order to allow easy human-compatible use of the new tool, we integrated it in our existing Query Lake mechanism that accepts SQL queries, uses Trino to execute them over preconfigured table, gathers the results, and presents them back in our web application.

High level design:

Performance

To check the performance of the above tool, we executed a select all data queries using our Query Lake feature on some chosen file path patterns. Here are some run time examples:

Query “select * ” over –

First run –
- Date range of one day including 1 partition path containing 13 shard files with total of 10,092,906 rows.
- Execution time: 7-10 seconds
Second run –
- Date range of 3 months including 91 partition paths with total of 4,126,829 rows.
- Execution time: 4-6 seconds.
Third run –
- Date range of 3 months including 91 partition paths with total of 393,577,975 rows.
- Execution time: 23-30 seconds.

For comparison, before using the tool, one had to manually browse through all those 91 partition folders in AWS S3, download all the files in each of them, merge them locally and only then then try to analyze the data in them. Such work would have taken days and could include many mishaps. On top of this, trying to work with files that contains millions of rows is not an obvious task on its own.

Note: Choosing a “thinner“ query (such as select * limit 5) results in a much shorter execution time. The above times include the network traffic between our web and backend query service in addition to loading data into the web, hence the actual query time is even shorter. In addition, since we are querying S3 data in real time, the measured time is also very dependent on the round-trip network traffic to and from the S3 data server.

Security Concerns

The straightforward method of accessing S3 directly and downloading the files manually usually involves each analyzer needing to have credentials for S3 with the necessary permissions. Unfortunately, this requires a lot of configuration in S3 and spreading those credentials, which might cause credentials, and with them, access, to get into the wrong hands.

In our approach, the authentication with S3 is done using an IAM role assigned to the query server. This IAM role, in turn, uses Hadoop S3A connector’s capability to Assume Role to gain permissions to the bucket. This approach means that the entire authentication and authorization process is done aboveboard and without any human passwords involved, reducing this security risk to the very minimum. A significant additional benefit is the lack of PII exposure risk. In this approach, no files get downloaded locally, and as such, there is no concern about them leaking to a non-authorized party.

Conclusion

The new internal feature makes the lives of our support and software engineers much more manageable. It allows them to work faster, smarter, and with quicker response time on scenarios where they need to compare the source-of-truth data in S3 to the data in our ETL processes – the process is now shorter by days! On the support side, this capability removes the need to access files directly, know the exact file structure, and do a lot of manual work every time such data analysis is needed. Additionally, on the engineering side, the new tool allows us to quickly, in real-time and in a secure way, analyze huge amounts of data without bringing it into our Data Lake.

Read more about why we’ve chosen to use Trino and how we use it to ETL at scale.