At Salesforce, we use and contribute heavily to Apache HBase™. Now we’ve started building HBase in the public cloud to align with our international infrastructure expansion. The whole HBase stack includes multiple open source BigData services such as Apache Zookeeper, Hadoop HDFS (NameNode and DataNode), HBase (HMaster and RegionServer), and Phoenix. This blog post, in two parts, shares our experience provisioning infrastructure for BigData deployment in the public cloud by both mutable and immutable means.
Mutable vs. Immutable Deployment
First, let’s explain what “mutable” and “immutable” deployment means in this context:
- Mutable. In this world, servers are updated and modified in place. When deploying new OS patches, new application binaries, and/or configuration settings, a running server is “mutated” by applying those changes; that is, a server’s state can change during its lifetime. In the public cloud, a service deployed by a mutable approach usually runs in virtual machines (VMs), mounting local ephemeral or network-attached persistent disks. To better “mutate” such BigData servers in an HBase stack, we can use management tools like Apache Ambari for automation.
- Immutable. In this world, a BigData server is started from a pre-defined image, binaries, and config settings, which cannot be changed during its lifetime. Any manual state change is ephemeral: at the next restart, the server loses it. If a server needs to be modified or updated, a new server is first built from an image containing the changes, and it completely replaces the old server. In the public cloud, a service deployed by an immutable approach usually runs in a container, claiming persistent volumes if needed. At Salesforce, Kubernetes is the cloud-native platform we use to orchestrate immutable containers.
After exploring both options, we chose container-based immutable deployments over Ambari+VM mutable deployments. Although a VM-based deployment could alternatively be made as immutable as possible, container engines in public cloud environments enable us to achieve faster deployment at scale, independent of cloud-specific requirements.
Stateful vs. Stateless Services
Stateful services are services that need to persist data that can be retrieved or reconstructed after a service restarts. If everything in our BigData stack were stateless, either a mutable or an immutable deployment approach would work just fine. However, most BigData services in the HBase stack are stateful: each server has state maintained somewhere (by us or by the application itself), which makes servers non-interchangeable from a client’s point of view. For example, if a server mounts persistent disks in its availability zone (AZ), we maintain a mapping from the server to those disks in order to reuse them. This is true for the Zookeeper, HDFS NameNode, HDFS DataNode, and Ambari nodes. Meanwhile, the binary version and config values are also considered “state” to maintain, since data formats might be incompatible during a rolling upgrade. Another example is application-maintained state. In the case of HBase RegionServers, each RegionServer serves a partition of the table space, and all clients reading or writing that partition have to connect to that same RegionServer. The server-to-partition mapping is state to maintain, and the HBase HMaster updates it automatically when a RegionServer leaves or joins.
Across the whole BigData stack, HDFS is largely stateful, in both its NameNode and DataNode services. HDFS stores data on multiple disks, replicates the same data block across three different availability zones, serves requests with topology awareness, and automatically repairs data when it is under- or mis-replicated. For DataNode disks in particular, we have two choices:
- Local ephemeral disks — Here, each time we would create a new DataNode server instance with blank disks attached to it. The server hydrates data from the other two replicas (using HDFS internal block replication). This makes DataNode servers less stateful and easier to deploy; however, it takes significant time and network resources to replicate all the data to that DataNode server.
- Network-attached disks — In this case, even if a DataNode server instance is restarted, the associated data persists on the network-attached disks. This improves the data recovery time after an instance restarts.
We chose the second option for better data availability and lower recovery cost. So to us, every HDFS server is a “pet,” not readily replaceable “cattle.” When a server restarts due to operations such as reconfiguration or rolling upgrades, we need to keep its state as-is after the restart. Meanwhile, when operating on HDFS servers in batches, each batch should be limited to a single availability zone, so the replicas in the other two zones stay available. HBase is not as stateful as HDFS from the deployment point of view: for RegionServers, all the state is in memory and can be recreated from data stored on the underlying HDFS disks. In principle, HBase deployment is simpler than HDFS deployment.
VM vs. Container Infrastructure
For both mutable and immutable deployments of the BigData stack in the public cloud, we need to codify our infrastructure so it can be versioned, reviewed, and audited. At Salesforce, Terraform is widely used to safely and predictably create, change, and improve cloud resources. We chose it because it is open source, uses declarative configuration files, and supports multiple public cloud platforms.
- For mutable VM-based deployment, we created a Terraform BigData module which provisions VM instances in groups from templates, along with persistent disks, object stores, and network resources. The VM instances and disks for a cluster are spread across three AZs for fault tolerance. Notably, the module provisions the cloud infrastructure separately for each BigData service (e.g., NameNode, DataNode, RegionServer, Zookeeper), so each of them can be operated and scaled independently. After the infrastructure has been created, we can use the same Terraform module to change it, for example to scale out more instances for a service.
- For immutable container-based deployment, Terraform is used to provision Kubernetes clusters and related resources. After the container infrastructure is provisioned, we apply BigData manifests to Kubernetes, defining Pods and higher-level abstractions over Pods (mostly Deployments and StatefulSets). BigData services use persistent disks via Kubernetes primitives such as PersistentVolume (PV) and PersistentVolumeClaim (PVC). Managing those disk allocations on demand across multiple AZs is built-in functionality in Kubernetes.
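To make the two approaches concrete, here are two hedged sketches. The first is a hypothetical invocation of a Terraform BigData module; the module path, variable names, and values are illustrative, not our actual module:

```hcl
# Illustrative only: module source, variables, and zones are hypothetical.
module "hbase_cluster" {
  source = "./modules/bigdata"

  cluster_name       = "hbase-prod-1"
  availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # Each service gets its own instance group so it can scale independently.
  zookeeper_count    = 3
  namenode_count     = 2
  datanode_count     = 12
  regionserver_count = 12

  # Network-attached persistent disks for the stateful services.
  datanode_disk_size_gb = 1024
}
```

The second sketches the container path: a Kubernetes StatefulSet whose `volumeClaimTemplates` give each DataNode Pod its own PVC, so a restarted Pod reattaches the same zonal disk. The image name, paths, and sizes are hypothetical:

```yaml
# Minimal sketch, not a production manifest.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: hdfs-datanode
spec:
  serviceName: hdfs-datanode
  replicas: 3
  selector:
    matchLabels:
      app: hdfs-datanode
  template:
    metadata:
      labels:
        app: hdfs-datanode
    spec:
      containers:
        - name: datanode
          image: example.com/hdfs-datanode:3.1.1  # hypothetical image
          volumeMounts:
            - name: data
              mountPath: /hadoop/dfs/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Ti
```

With `ReadWriteOnce` zonal volumes, the Kubernetes scheduler keeps each Pod in the AZ where its volume lives, which matches the “pet” semantics described above.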
Ambari vs. Kubernetes Management
After the infrastructure is provisioned, a management tool is used to deploy stateful services by coordinating the servers’ state. As in our first-party (1P) data centers, for VM-based mutable deployment we use Ambari to manage BigData clusters for operations such as deploying a fresh cluster, managing configurations, commissioning and decommissioning servers, issuing and watching rolling upgrades and rollbacks, and tracking monitoring and metrics, across all layers of the stack. With our initial efforts in the public cloud, we have successfully brought up an Ambari-managed cluster. We first apply the latest BigData Terraform module to provision the infrastructure (VMs, disks, network settings, IAM roles/policies, storage buckets, permissions, etc.), and then deploy BigData services on top via the Ambari API server.
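Driving deployment “via the Ambari API server” means issuing REST calls against Ambari’s v1 API. The sketch below builds such a request for starting a service; the host, cluster name, and credentials are placeholders, and the actual HTTP call is left as a comment:

```python
import json

AMBARI_HOST = "ambari.example.com"  # hypothetical Ambari server
CLUSTER = "hbase-prod-1"            # hypothetical cluster name


def start_service_request(service: str):
    """Build the URL and JSON body to start one service via Ambari's REST API."""
    url = f"http://{AMBARI_HOST}:8080/api/v1/clusters/{CLUSTER}/services/{service}"
    body = {
        "RequestInfo": {"context": f"Start {service} via REST"},
        "Body": {"ServiceInfo": {"state": "STARTED"}},
    }
    # Send with any HTTP client, e.g.:
    #   requests.put(url, data=json.dumps(body),
    #                auth=("admin", "admin"),
    #                headers={"X-Requested-By": "ambari"})
    return url, json.dumps(body)


url, payload = start_service_request("HDFS")
print(url)
```

The same PUT-a-desired-state pattern covers stop (`"INSTALLED"`) and maintenance-mode operations, which is what makes scripted cluster bring-up on top of Terraform-provisioned VMs practical.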
For container-based immutable deployment, Ambari is no longer needed to manage software and configuration, since a BigData service container ships with all the required binaries and libraries. With this, we lose the readily available, open-source Hadoop-specific config management, process management mechanisms, and service-specific rolling-upgrade logic. These can be built into Kubernetes too, but we would have to implement them ourselves. We believe the effort of immutable deployment with Kubernetes is worth it in the long run.
The next part of this blog post covers in more detail how we use Terraform, Spinnaker, and Kubernetes to provision infrastructure for HBase deployment in the public cloud.