Some Background:
The Edge team at Salesforce operates an application that provides machine learning-driven network optimization policies for mobile traffic. The app runs periodically and creates a training set from raw data stored in a relational DB, trains on this data, and writes optimal network policies to be consumed by the downstream system.
We were using Ansible to provision our VMs in the cloud, running a base playbook that deployed common resources across all our VMs, and then a specific playbook to provision the app code. Using the same Ansible playbooks, we would deploy locally on a Vagrant box and on the cloud for production.
While this approach worked, a couple of issues made development slow and maintenance painful.
- Ansible started to get out of hand; we got to a point where members of the team had private notes for making the playbooks actually run, and local environments were constantly breaking.
- Our deployment was mutable, and changes were hard to track, monitor, and revert. This made our local environments diverge from the production environment, causing development issues.
- Deployment almost always needed human involvement.
These issues were mostly caused by configuration-driven deployment of mutable components. Changing our approach to immutable deployment proved to be the right solution in our case.
This was also a good excuse to make some nice changes that were needed anyway:
- As part of our new security compliance requirements, we had to switch from Ubuntu to CentOS.
- Our secrets were stored on disk and in environment variables, so we needed to rethink the way we handled them.
- This app never had a continuous integration (CI) solution set up for it, so now was a good opportunity to configure one.
Solution:
Docker containers seemed to solve some of the issues we were trying to address by providing the local/remote environment consistency and the ease of use we needed. While we plan to move to an immutable deployment across the board, our machine learning (ML) application was an obvious place to start tackling this project as it is relatively isolated.
To solve the above issues we introduced the following changes:
- Created a Docker setup for our application, making it an independent service that would run periodically.
- Created multiple layers for our Docker environment: a base layer that would ideally not change very often and has all of the large libraries we use (NumPy, pandas, etc.), and another layer with the app code. This setup makes rebuilding the final image much faster, as the base layer can be cached.
- Migrated secrets into an encrypted data store managed by our public cloud provider and started managing these secrets using an open-source project called Credstash. When deployed, the container authenticates using the cloud provider's native mechanisms and fetches the secrets through the Credstash API.
- Created two separate docker-compose files to support two environments built on top of the base layer. Both environments mount the local credentials and use them to gain access to secrets:
  - Dev environment: builds with the specific app requirements, but mounts all of the app code for easy code changes. Additionally, the entry point for the container is bash, giving an SSH-like experience when running the container. This approach removes the need to rebuild the container after each code change, letting the developer quickly iterate and test new code.
  - Test environment: builds with the specific app requirements, and is as consistent with the final image as possible. This means that it copies all of the code into the container image and includes all of the tests. The entry point for the container is set to run our tests, and logs are written to a mounted logging volume.
- Set up a CI pipeline for the new deployment solution. Each time a pull request is opened or merged, the pipeline builds the production image and pushes it to the internal Artifactory, from which it can be deployed.
With these changes, the final Docker image can now be pushed and deployed using public cloud managed solutions.
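As a rough illustration of the secret-fetching step described above, here is a minimal Python sketch. It assumes the open-source `credstash` package and its `getSecret` function; the helper names (`load_secrets`, `credstash_fetcher`) and any secret names, region, or table passed in are hypothetical and not taken from our actual setup. The fetcher is injected so that the loading logic can be exercised locally without cloud access.

```python
from typing import Callable, Dict, Iterable


def load_secrets(names: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Fetch each named secret and keep the values only in process memory."""
    return {name: fetch(name) for name in names}


def credstash_fetcher(region: str, table: str) -> Callable[[str], str]:
    """Build a fetcher backed by Credstash.

    The import is done lazily so this module can be loaded (and tested)
    on a machine that has neither the package nor cloud credentials.
    """
    import credstash  # assumption: the `credstash` package is installed in the image

    def fetch(name: str) -> str:
        # getSecret looks up and decrypts the named secret
        return credstash.getSecret(name, region=region, table=table)

    return fetch
```

At container start, the app could then call something like `load_secrets(["db_password"], credstash_fetcher(region, table))`, where `db_password`, `region`, and `table` stand in for whatever the deployment actually uses. Nothing is written to disk or exported to environment variables.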
Impact:
- Developing our application is faster and more developer-friendly. Code changes can be tested locally with minimal setup and easily deployed to production.
- Build time is greatly reduced as the heavy base layer is almost never changed. This layer can also be used for other projects with similar core dependencies.
- Docker images are tagged at build time and are immutable. Each change is documented and can be rolled back simply by changing the image tag to a previous working one.
- The configuration is now part of the built container, so configuration changes are tracked and versioned.
- Secrets are now versioned, encrypted, and stored only in memory. These secrets are centralized in the cloud and can be rotated without redeploying any code.
- Local environments are consistent across developers and across development/production and local/cloud environments.
- The app can now be deployed using any container orchestration tool, which gives us flexibility in the future. Swapping out the secret management tool would also let us deploy on any public or private cloud provider.
- A CI pipeline was set up and is used to push images into production, making it easy to roll back if necessary.
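The tagging and rollback points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the registry path, the tag scheme (build number plus short commit hash), and the function names are hypothetical, not our actual convention. The only real commands referenced are `docker build -t` and `docker push`.

```python
from typing import List, Optional

# Hypothetical registry path; the real Artifactory location is not shown here.
REGISTRY = "artifactory.example.com/ml-app"


def image_ref(commit_sha: str, build_number: int) -> str:
    """Return a unique image reference so a pushed tag is never overwritten."""
    return f"{REGISTRY}:{build_number}-{commit_sha[:7]}"


def ci_commands(ref: str) -> List[List[str]]:
    """Commands a CI job could run to build and publish the production image."""
    return [
        ["docker", "build", "-t", ref, "."],
        ["docker", "push", ref],
    ]


def rollback_target(history: List[str], current: str) -> Optional[str]:
    """Return the reference deployed just before `current`, or None if it was the first."""
    position = history.index(current)  # raises ValueError for an unknown reference
    return history[position - 1] if position > 0 else None
```

Because every build gets its own tag, rolling back amounts to pointing the deployment at the reference returned by `rollback_target`; no image is rebuilt or modified.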
Conclusion:
Moving to immutable deployment proved to be exactly what we needed, and the benefits were noticeable immediately. Code and configuration changes are now easy to make, track, and manage in a production environment. This application is only one part of our system, and many other components could benefit from a similar setup.