Salesforce Research has been growing rapidly over the past few years. The needs of our machine learning team have shifted from maintaining a single standardized training model to experimenting with a diverse set of inputs. To support and accelerate this shift, we needed a standardized process and a unified platform.
As the platform engineers who support researchers performing deep learning training, our job is to provide a stable playground where researchers can launch training jobs with uploaded datasets and hyper-tunable parameters, the output of which can then be further consumed down the training/prediction food chain.
We already had an existing train server, but, unfortunately, it was built for specific use cases and tightly coupled to one version of modeling code, so it could handle only a limited number of concurrent training requests of each type. Additionally, the models it produced were cumbersome to track down and compare.
We realized that, to establish a platform supporting true end-to-end training and experimentation with iteration and scalability, we needed to construct a new framework. Its crux would be divorcing training logic from platform logic: users retain fine-grained control over training parameters while remaining blissfully ignorant of what happens under the hood.
Encapsulation Is Key
Specifically, our goal was to advertise the ability to run encapsulated, read-only, language-agnostic logic on customer data. Consistent usage patterns and a uniform interface were a must: users needed to be able to reproduce results and follow clear audit trails of application and system logs, and model outputs needed to be treated not as an afterthought, but as a keystone of the design, to make comparison easy. Safeguarding datasets and restricting access at each step of the training process were priorities, as was the time-honored question of scalability: how could we ensure a reasonable number of training requests could be run, monitored, and cleaned up concurrently?
The concept of containerizing an application is not new but has seen immense growth and adoption since the 2013 emergence of Docker. A container packages up code and all its dependencies into a lightweight, standalone executable unit of software that runs agnostic of environment. Unlike virtual machines (VMs), which include entire operating systems, containers only contain an application layer, making them far less memory and space intensive and much quicker to boot. Individual containers operate independently of others but can share the same resources, consuming minimal resources from the host. Dockerized jobs, built upon the premise of decoupling application code from environment, offered us the atomized control we were looking for.
In terms of infrastructure, our team had already been migrating its services off of Amazon EC2 and onto Amazon Elastic Kubernetes Service (EKS), which would handle the deploying and scaling of containers for us. With the aim of continuing on Kubernetes, we considered a few open source alternatives for training job execution environments:
- Workflow orchestrators: Services such as Apache Airflow, Azkaban, and Luigi, many of which have extensions for Kubernetes as an execution runtime. They serve use cases involving more complex pipelines of jobs as well as periodic, cron-style jobs, while our users would often be submitting jobs in an ad-hoc fashion.
- Kubeflow: This platform provides a set of services and frameworks dedicated to machine learning in Kubernetes environments, with bindings supporting TensorFlow, PyTorch, and many other popular frameworks. However, to introduce Kubeflow, we would have needed to migrate all of our existing services (data management, authentication service, prediction serving, etc.) to it. The scope of this project would have meant replacing our entire platform or splitting users between the new and existing platforms.
After careful consideration of our user needs, and accounting for team capacity, timelines, and the effort to integrate with the Salesforce ecosystem, we decided to build our own working solution.
Generic Hierarchy and Representation
The new platform retains the functionality of our existing server, allowing jobs to be created, retrieved, updated, and deactivated. Built around the Kubernetes Job object, which spawns a pod and ensures lifecycle completion, the Training and Experimentation (T&E) framework employs a deliberate design based on a generic Project Hierarchy and an SDK library. In this Hierarchy, a Project logically groups together related Tasks, and a Job represents an attempted execution of a Task. A Project serves a specific purpose, and its object hierarchy lends itself naturally to the iterative nature of the experimental trials that serve that purpose.
Consider the following scenario, one of the earliest examples motivating our T&E design. A data scientist has developed a training algorithm and is keen to evaluate its performance on designated customer datasets. She packages her training algorithm as a Docker image and, using our self-service API, creates a Project specifying the purpose and the customer datasets of interest. She obtains approval and then initializes a Task (a trial run of her algorithm on that customer data), which spawns a Job that retries upon failure. The Job itself goes through multiple epochs, emitting metrics and producing checkpoint models, both of which are later retrievable through the same API. If additional trial Tasks are warranted, say, if the scientist wishes to iterate on her training algorithm, she can package her enhanced algorithm as another Docker image and create a new Task to run again on the same customer data. After all desired trial runs are concluded, she can easily query and compare the results of all the Tasks in her Project, or even run inference with any checkpoint model.
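The Project > Task > Job hierarchy above can be sketched as plain objects. This is only an illustration of the relationships described in this section; the field and method names are our invention, not the actual T&E schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Job:
    """One attempted execution of a Task."""
    job_id: str
    status: str = "PENDING"  # e.g. PENDING -> RUNNING -> SUCCEEDED/FAILED
    metrics: dict = field(default_factory=dict)

@dataclass
class Task:
    """One trial run: a packaged algorithm applied to the Project's data."""
    task_id: str
    docker_image: str
    jobs: List[Job] = field(default_factory=list)

    def launch(self) -> Job:
        # Spawn a Job; retries would create further Jobs under the same Task.
        job = Job(job_id=f"{self.task_id}-job-{len(self.jobs)}")
        self.jobs.append(job)
        return job

@dataclass
class Project:
    """Logically groups related Tasks that serve one purpose."""
    name: str
    purpose: str
    dataset_ids: List[str]
    tasks: List[Task] = field(default_factory=list)

    def add_task(self, docker_image: str) -> Task:
        task = Task(task_id=f"{self.name}-task-{len(self.tasks)}",
                    docker_image=docker_image)
        self.tasks.append(task)
        return task

# Iterating on an algorithm = new image, new Task, same Project and data.
project = Project("sentiment-v2", "evaluate new tokenizer", ["dataset-123"])
task = project.add_task("registry.example.com/trainer:1.0")
job = task.launch()
```

Comparing trial runs then amounts to walking one Project's Tasks, which is what makes the hierarchy a natural fit for experimentation.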
T&E’s generic Hierarchy is also amenable to one-time Tasks. A model-training request falls into this category, as does an ad-hoc request to validate the syntax of an uploaded dataset. Possibilities abound in the data science realm, as well as in the end-to-end Deep Learning Pipeline.
Platform Implementation and Design
At the heart of T&E are the Training Service and Training Monitor. The former hosts the API endpoints; the latter launches Jobs in Kubernetes, monitors them for timeouts, and cleans them up upon completion. They share an RDS database.
A typical workflow is initiated with a call to the E.ai Api-Server. Once the caller is authenticated, the E.ai Api-Server forwards the request to Training Service. If the request is authorized, Training Monitor launches a Job for it on a Kubernetes pod by combining generic templates, which contain pod specs for Kubernetes Deployments, Services, Jobs, and relevant ConfigMaps, with user-defined parameters. Jobs are launched sequentially by time of creation, contingent on resource capacity and availability (a check that also forms the basis for the smarter scheduling and paid-plan prioritization we hope to implement down the line). Training Service works with our in-house data management service to fetch designated customer data and to persist intermediate and final Job output. Lastly, when the Job completes, Training Monitor saves the Job’s logs and reclaims its Kubernetes resources. All components except the E.ai Api-Server reside in an E.ai Virtual Private Cloud (VPC).
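The template-plus-parameters step might look roughly like the following sketch: a generic Job manifest with per-job fields left blank, filled in from user-supplied values. The template fields and parameter names here are our own illustration, not T&E’s actual manifests.

```python
import copy

# Generic template: pod spec with per-job fields left unset.
JOB_TEMPLATE = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": None, "labels": {"platform": "t-and-e"}},
    "spec": {
        "backoffLimit": 2,  # retry upon failure
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    "image": None,
                    "env": [],
                    "resources": {"limits": {}},
                }],
            }
        },
    },
}

def render_job(job_id: str, image: str, params: dict, gpu: int = 0) -> dict:
    """Merge the generic template with user-defined parameters."""
    manifest = copy.deepcopy(JOB_TEMPLATE)  # never mutate the shared template
    manifest["metadata"]["name"] = job_id
    container = manifest["spec"]["template"]["spec"]["containers"][0]
    container["image"] = image
    # Hyper-tunable parameters are passed to the trainer as env vars here.
    container["env"] = [{"name": k, "value": str(v)} for k, v in params.items()]
    if gpu:
        container["resources"]["limits"]["nvidia.com/gpu"] = gpu
    return manifest

manifest = render_job("job-42", "registry.example.com/trainer:1.1",
                      {"LEARNING_RATE": 0.001, "EPOCHS": 10}, gpu=1)
```

The rendered manifest would then be submitted to the cluster (e.g. via the Kubernetes API), with analogous templates for the Deployments, Services, and ConfigMaps a job needs.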
The aforementioned SDK library serves as the primary channel of communication between container and platform and lets users integrate their new code easily with ours. The container calls functions made available by the SDK to fetch data (global resources, datasets, embeddings), handles data processing itself, and then uploads the processed data by calling back into the SDK. The platform remains agnostic of the format of the final compressed dataset file; the container is responsible for ensuring the resulting data split is consumable by the training code. Other functions the SDK offers include creating checkpoints with model weights and metrics per epoch, updating job metadata with learning rate and total epochs executed, and updating job status (success or failure, with job logs).
Introducing the SDK guarantees a necessary degree of abstraction — as long as the function signatures remain the same, we can do as we wish on the platform side of things. Major changes involving updates to the SDK are clearly denoted in release version notes; data scientists and platform engineers can rely on this separation to work more efficiently and autonomously.
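As a rough illustration of this contract, here is a hypothetical, in-memory stand-in for an SDK of this shape. The function names and signatures are invented for the example, not the real T&E SDK; the point is that trainer code only ever talks to this interface, so the platform behind it can change freely as long as the signatures stay stable.

```python
class TrainingSDK:
    """Hypothetical container-side interface to the platform."""

    def __init__(self, job_id: str):
        self.job_id = job_id
        self.checkpoints = []
        self.metadata = {}
        self.status = None

    def fetch_dataset(self, dataset_id: str) -> str:
        """Download a designated dataset; return a local path for the trainer."""
        return f"/data/{dataset_id}"  # stand-in for the real fetch

    def create_checkpoint(self, epoch: int, weights_path: str, metrics: dict):
        """Persist model weights and per-epoch metrics back to the platform."""
        self.checkpoints.append(
            {"epoch": epoch, "weights": weights_path, "metrics": metrics})

    def update_metadata(self, **fields):
        """e.g. learning rate, total epochs executed."""
        self.metadata.update(fields)

    def report_status(self, status: str, logs: str = ""):
        """Final success/failure report, with job logs."""
        self.status = (status, logs)

# A trainer's lifecycle, entirely through the interface:
sdk = TrainingSDK("job-42")
data_path = sdk.fetch_dataset("dataset-123")
sdk.create_checkpoint(epoch=1, weights_path="/tmp/ckpt-1.pt",
                      metrics={"loss": 0.42})
sdk.update_metadata(learning_rate=0.001, epochs_completed=1)
sdk.report_status("SUCCEEDED")
```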
Allowing users to provide their own Docker images for experiments means there is no guarantee of the trainer’s behavior, so we needed a way to monitor the launched jobs. Detecting errors or unexpected behavior early is crucial for users to get feedback quickly and for Training Service to save valuable computing resources. The following were a few scenarios we needed to watch out for, and our solutions for each:
- Failure to start within the launch time limit. How do we check to see that a job has begun? We require the trainer process to report that it has started running, rather than looking at the pod status in Kubernetes. In this way, the user can perform their own job initialization steps, and fail early if any precondition is not met. The monitoring system picks up any job that fails to report a running status within the launch time limit.
- Failure to submit result of each epoch within the epoch time limit. The trainer process can submit the result of each epoch to Training Service, and these epoch results are crucial information to further monitor and analyze the training algorithms. We enforce a time limit on each epoch and detect a timeout when the next epoch’s result isn’t received within a specified duration since the last epoch’s result was reported.
- Failure to complete within the time-to-live (TTL) of the task. Even if a training process starts running successfully and each epoch’s result is reported in a timely fashion, the process might hang and hold computing resources. We enforce an overall time limit on each job to address this.
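The three checks above can be sketched as a single pure function over a job’s timestamps. The time limits and field names below are illustrative defaults, not T&E’s actual values.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobState:
    created_at: float                     # epoch seconds at launch
    started_at: Optional[float] = None    # set when the trainer reports "running"
    last_epoch_at: Optional[float] = None # set on each reported epoch result
    launch_limit: float = 600.0           # seconds to report a running status
    epoch_limit: float = 3600.0           # seconds between epoch results
    ttl: float = 86400.0                  # overall time-to-live

def timeout_reason(job: JobState, now: float) -> Optional[str]:
    """Return why a job should be reaped, or None if it is healthy."""
    # 1. Trainer never reported that it started running.
    if job.started_at is None:
        if now - job.created_at > job.launch_limit:
            return "failed to start within launch time limit"
        return None
    # 2. Running but past its overall TTL (e.g. hung while holding resources).
    if now - job.created_at > job.ttl:
        return "exceeded time-to-live"
    # 3. No epoch result since the last one (or since startup) in too long.
    last_progress = job.last_epoch_at or job.started_at
    if now - last_progress > job.epoch_limit:
        return "no epoch result within epoch time limit"
    return None
```

A monitor loop would periodically call `timeout_reason` for every active job and clean up any job that returns a non-None reason.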
Completion, whether failure or success, triggers an update in the database, a retrieval of container logs, and a cleanup of the Kubernetes resources. The cleanup is trivial, since Kubernetes Deployments and Jobs own the pods they spawn and trigger cascading deletes.
T&E Infrastructure — Security and Data Access Control
The T&E infrastructure is secure in two respects. First, Mutual TLS authentication (mTLS) is in place between the various E.ai services, as well as between Training Service and the Kubernetes pods. Second, stringent network policies are enforced. Notably, the Kubernetes pods run in an air-gapped environment without any Internet connectivity, except authorized S3 access to download designated customer data by way of pre-signed S3 URLs.
How is access to customer data controlled? Authentication and authorization of public training requests are handled primarily by the E.ai Authentication Service, which ensures that a customer can only train a model on her own data. Furthermore, to oversee Salesforce-internal operations that access customer data (such as those performed by the E.ai Data Scientist Team), the Unified Approval Process (UAP) was introduced in T&E. The UAP governs who can do what on which customer data, in what capacity, and for what duration. For instance, under the UAP, a Project must have a TTL, and permissions have to be obtained and may be revoked even before expiration. Training Service, aided by its RDS, enforces the UAP, and any permission — current, active, expired, or revoked — is queryable through the T&E Admin API and thus traceable.
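A minimal sketch of such a who/what/capacity/duration check, assuming hypothetical field names (the actual UAP model is richer):

```python
from dataclasses import dataclass

@dataclass
class Permission:
    grantee: str           # who
    dataset_id: str        # on which customer data
    actions: frozenset     # in what capacity, e.g. {"read", "train"}
    expires_at: float      # for what duration (epoch seconds)
    revoked: bool = False  # may be revoked even before expiration

def is_authorized(perm: Permission, grantee: str, dataset_id: str,
                  action: str, now: float) -> bool:
    """All conditions must hold for the operation to proceed."""
    return (not perm.revoked
            and perm.grantee == grantee
            and perm.dataset_id == dataset_id
            and action in perm.actions
            and now < perm.expires_at)

perm = Permission("ds-team", "dataset-123",
                  frozenset({"read", "train"}), expires_at=1000.0)
```

Because every permission record persists after expiry or revocation, the same records that gate access also serve as the audit trail queried through the Admin API.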
To sum it up, our platform marries encapsulated, specific logic with a generic, code-agnostic platform to empower users to harness the wealth of data captured in the Salesforce ecosystem.
By constructing our own platform, we’re able to reap benefits that would otherwise have been inaccessible. We led with a security-first mindset and embedded our own custom UAP throughout the platform. We were able to integrate easily with our existing services, extending the lifecycle through model serving as well. The optionality of functions in the SDK supports jobs that require different sets of parameters, deployment objects, and behavioral conditions, including Notebooking and Labeling jobs, and it also lends itself nicely to exciting expeditions into territories such as distributed training (a venture already begun by some teammates in an internal hackathon).
In this way, we’ve created a solution not only to the pressing need to run reliable sequences of both iteratively scheduled and ad-hoc jobs, but one that is flexible and ready for future needs and improvements.
We would like to thank: Savithru Lokanath for provisioning and tirelessly supporting the T&E infrastructure among the many moving parts, Robert Xue for driving the design of the Generic Hierarchy representation, Arpeet Kale for encouraging the use of an SDK, and Amit Zohar and Vivian Hsiang-Yun Lee for their code contribution to T&E. We are immensely grateful to Ivo Mihov for sharing his initial vision of a Kubernetes-based generic training and data science platform and motivating us on the E.ai Platform team to make it a reality.