Over my 16 years at Salesforce, one engineering principle has often stood out for me. Engineering for scale is a balancing act between “not too big” and “too big”.
When I started at Salesforce in 2000 as the sole release engineer, we had a relatively small footprint in our data center. I personally knew the IP address of each and every server I released to. My servers were so familiar to me that I treated them like pets. I knew Aruba liked to take its time restarting applications. I knew Bigmama would get my updates done faster than any other server. I knew Bahama liked to deal with smaller datasets and would need that special script to release. (Yes, I had a Beach Boys theme going in those days. How was I to know that years later they’d actually play at our annual company user conference, Dreamforce? That was one of the highlights of my career here.) The point is, all these personal details about the servers were in my head. I didn’t need no stinkin’ automation. I could release a new service to all of production in a couple of hours. It was “not too big” at all!
And then I decided to take a vacation. Yes, I know, how could I, right? Didn’t I know this was a startup? Everyone was working their behinds off. Vacation, what was that? But Paris called, and I had to go. And I realized that the engineer covering for me didn’t love my pets the same way I did; in fact, he was downright allergic to them. He did not want to memorize their idiosyncrasies, he did not want to consult long feeding instructions, and he did not want to croon them awake after each release. He needed stinkin’ automation. And suddenly a system that had been working for me just didn’t work any more. It got “too big”.
The “too big” needed a new solution. I built some automation and a UI for point-and-click deployments to make releases easier for people who weren’t quite as steeped in tribal knowledge. My backup engineer didn’t have to get to know my pets personally any more; the software took care of all those details for him. Kokomo had to make do with an electronic babysitter. The automation and processes let us support the new level of scale. It wasn’t “too big” any more.
Of course, it didn’t stop there; Salesforce kept growing! We added servers, we added services, we added data centers, we expanded. And the automation and processes expanded right along with us. But Salesforce was still “not too big”. Everything grew together. There were a few blips along the way, but everything continued to chug along merrily.
Fast forward a few more years: we kept growing like crazy. Our tools grew more sophisticated, too. We added Splunk for log analysis, Graphite for visualization, Nagios for infrastructure monitoring, and Cacti for network monitoring, to name a few. We built several custom monitoring solutions for bespoke applications. We continued to add more automation. It was still “not too big”; everything continued to grow together.
But it was a false sense of security. The processes in place started creaking at the seams as we grew and grew. The data we were collecting from across our systems reached a tipping point, and suddenly there was too much data. There were too many processes and solutions to keep track of. There was too much going on for us to keep up with. A system that had been working suddenly didn’t work any more. It all got “too big”! The scale grew beyond our solutions’ capacity.
So, again, we iterated. We added more elegant, robust solutions. We removed redundant systems. We established 100% level SLAs for a smaller set of services. We started focusing on end-to-end solutions, instead of point solutions for each individual area. We started building extensible platforms instead of standalone services.
The solutions were scaled up to meet the need of the day. The ecosystem started working again, and the current resources and processes were able to support the current scale. It wasn’t “too big” any more.
How are we doing today, you ask? We’re still in the same cycle, and we always will be. The important realization about engineering for scale is that there are only two measures of scale. Either the solution meets the current need, or it doesn’t. It works, or it doesn’t work. It’s “not too big” or “too big”.
The tricky bit is to recognize the approach of the tipping point early enough, so that you never reach it.
You can’t predict all your scaling requirements in the months and years to come, but you can come pretty darn close. If you take too much time upfront to build a perfect, distributed, large scale system, you may miss your window to market. Solve for the problems of the day, and make sure you’ll be able to recognize in time when the tipping point is approaching! Be prepared for the tipping point, and if you do it right, you’ll never fall over the edge.
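To make "recognize the tipping point in time" a little more concrete, here is a minimal sketch of one possible early-warning check. It is not a description of any Salesforce tooling. The function name `periods_until_limit`, the sample numbers, and the assumption that growth is roughly linear are all hypothetical, chosen only for illustration. The idea is to fit a straight line to recent usage samples and estimate how many sampling periods remain before usage reaches a known limit.

```python
from typing import Optional, Sequence


def periods_until_limit(samples: Sequence[float], limit: float) -> Optional[float]:
    """Estimate how many sampling periods remain before usage reaches `limit`.

    Hypothetical helper for illustration: fits a least-squares line to
    equally spaced samples (oldest first) and extrapolates it forward.
    Returns 0.0 if the latest sample is already at or over the limit,
    and None if the fitted trend is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    if samples[-1] >= limit:
        return 0.0

    # Ordinary least-squares fit of y = intercept + slope * x, x = 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    if slope <= 0:
        return None  # no growth trend, so no projected crossing

    intercept = mean_y - slope * mean_x
    crossing_x = (limit - intercept) / slope
    # Measure the distance from the most recent sample, never negative.
    return max(crossing_x - (n - 1), 0.0)


if __name__ == "__main__":
    # Made-up weekly disk usage in GB against a made-up 100 GB limit.
    weekly_usage = [10, 20, 30, 40]
    remaining = periods_until_limit(weekly_usage, limit=100)
    if remaining is not None and remaining < 8:
        print(f"Projected to hit the limit in about {remaining:.1f} weeks")
```

A straight-line fit is the simplest choice and will underestimate anything that grows exponentially, so treat the result as a prompt to go look, not as a forecast. The useful part is the habit: pick a limit, watch the trend, and start the next iteration while the answer is still a comfortable number.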
What is Salesforce doing to avoid tumbling tail over teakettle over the scale edge? My team, DVA (Diagnostics, Visibility and Analytics), is hard at work building systems that keep things from getting “too big”. We’ll talk more about data collection, streaming, and transformation and visualization services in a future post. If you’re interested in helping us tackle these challenges, send an email to firstname.lastname@example.org.
What are you doing to keep everything from getting “too big”? Tell us in the comments.