So, in the last episode, I talked about how databases are magic. But, I saved one of the most magical topics for last: transactions! That’s what I’m going to talk about today. And I’m going to do it with a little help from my buddy Pat Helland, a Principal Architect here at Salesforce.
You probably have a rough idea of what the word “transaction” means; think about going to the bank. When you stick a $100 bill into the slot on the ATM, you have exactly one result in mind at the end of that interaction: you have $100 less in your pocket, and $100 more in your account.
That’s meant to function as an atomic unit: “atomic” in the ancient Greek sense of “uncuttable” (ἄτομος). You want those two events to be inseparable. If you trick the ATM into thinking you gave it $100, but it was really a gum wrapper, you’re now up $100 (party time!). Or, conversely, if the ATM achieves self-consciousness and takes your hundred dollar bill on a gambling spree in Atlantic City instead of crediting your account … well, you’re down $100 (boo).
Neither of those is correct behavior, obviously.
In the messy real world, there are no guarantees. There are a zillion things that could go wrong between any two moments (including, but not limited to, ATMs becoming conscious). Life is fragile and unpredictable; eat dessert first.
In the world of software, on the other hand, it turns out you can actually do a little better than that. Guarantees of correct behavior are not only possible, they’re critical; if you’re going to build complex distributed systems, you need these “primitives” — building blocks with which to assemble larger systems, so you don’t have to second guess yourself at every turn.
That’s where you start getting hooked on ACID.
ACID? Yeah, A.C.I.D. It’s an acronym (those are like catnip for programmers). It stands for “Atomic, Consistent, Isolated and Durable”:
- Atomic means that when you make a change, you get a guarantee from the database that this entire change either happened, or it didn’t. It’ll never end up half-completed.
- Consistent means that the database obeys rules about what states are valid. When you successfully commit a change to the database, the next time anyone looks at that data, they see the latest version you put in there (not an earlier one).
- Isolated means that if more than one operation is going on at the same time, those operations only interact in predictable ways. For instance, that might mean that nobody else can see the data you’re writing until you’ve successfully committed it.
- Durable means that if you say some data is written, it’s safe. You won’t lose it, even if the power goes out the very next nanosecond.
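To make the “Isolated” bullet concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. The `accounts` table, the `alice` row, and the two-connection setup are assumptions made up for illustration (they are not anything Salesforce-specific): one connection writes a row without committing, and a second connection checks what it can see.

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file stand in for two clients.
path = os.path.join(tempfile.mkdtemp(), "bank.db")
writer = sqlite3.connect(path)
reader = sqlite3.connect(path)

writer.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
writer.commit()

# The writer starts a change but has not committed it yet.
writer.execute("INSERT INTO accounts VALUES ('alice', 100)")

# Isolation: the reader does not see the in-flight row.
seen_before_commit = reader.execute("SELECT COUNT(*) FROM accounts").fetchall()[0][0]

writer.commit()

# Once the writer commits, the row is visible to the reader.
seen_after_commit = reader.execute("SELECT COUNT(*) FROM accounts").fetchall()[0][0]

print(seen_before_commit, seen_after_commit)

writer.close()
reader.close()
```

How strict this isolation is varies by database and by configured isolation level; this only shows the simplest case of “nobody sees my write until I commit”.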
None of these properties seem very weird — I mean, this is how you’d expect a system to behave, right? Not doing any of these things would seem pretty broken.
Well, it turns out that before relational databases came along, most storage systems didn’t guarantee this. They mostly behaved this way, but not 100% of the time. If you wanted a guarantee for those properties, you had to build that into your application on top of the database.
And, that turns out to be pretty hard. Think of what a pain-in-the-acid it would be to consider every permutation of how things could go wrong: if you have a 10 step process, what do you do if the server or network failed after … step 1? Or step 2? And so forth. ACID transactions are a huge aid in simplifying applications, by making sure the storage behaves the way it seems like it should.
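Here is a small sketch of what that simplification looks like, again using Python’s built-in `sqlite3` module. The `accounts` table, the `transfer` helper, and the `fail_midway` flag are all hypothetical names invented for this example. The process here has only two steps, but the idea is the same for ten: if anything fails partway through, the database undoes the steps that already ran.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('pocket', 100), ('bank', 0)")
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move money between two rows as one all-or-nothing unit."""
    # `with conn` commits if the block succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        if fail_midway:
            raise RuntimeError("simulated failure between step 1 and step 2")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# A failure after step 1 leaves no trace: the debit is rolled back.
try:
    transfer(conn, "pocket", "bank", 100, fail_midway=True)
except RuntimeError:
    pass
after_failure = balances(conn)

# A clean run applies both steps together.
transfer(conn, "pocket", "bank", 100)
after_success = balances(conn)

print(after_failure, after_success)
```

The application code never has to ask “which steps already happened?”; it only has to decide whether the whole unit succeeded.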
And, unlike the fickle real world, a running database can actually guarantee this to 100%, by being a bit pessimistic: a transaction is flipped into “committed” state only once all the work is done and is shown to be correct; anything else is rolled back, as if it never happened. Different databases do this in different ways, but mostly it’s by having a special log (usually called a “write-ahead” or “undo” log) into which it writes all the operations as they happen, so that in the event of a crash, you can automatically replay the recent actions forward or backward to get to a consistent point. You simply can’t have a healthy, running database that somehow voids these guarantees; if it’s running, it’s right.
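As a rough illustration of the log idea, here is a toy “redo” log in Python. It is a deliberately simplified sketch, not how any particular database implements recovery: the tuple-based entry format and the `recover` function are assumptions made up for this example. Each write is logged with a transaction ID, a transaction only counts once its commit marker is in the log, and recovery replays only the committed work.

```python
def recover(log):
    """Rebuild state from a log, applying only committed transactions."""
    # A transaction counts only if its commit marker made it into the log.
    committed = {entry[1] for entry in log if entry[0] == "commit"}
    state = {}
    for entry in log:
        if entry[0] == "set" and entry[1] in committed:
            _, _txid, key, value = entry
            state[key] = value  # replay the write in original log order
    return state


# Hypothetical log contents at the moment of a crash.
log = [
    ("set", 1, "pocket", 0),
    ("set", 1, "bank", 100),
    ("commit", 1),
    ("set", 2, "bank", 250),  # transaction 2 never reached its commit marker
]

recovered = recover(log)
print(recovered)
```

Transaction 2’s half-finished write simply disappears on recovery, which is the “as if it never happened” behavior described above.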
(Very technical readers may note that there are a lot of pretty fine hairs to split with things like isolation levels in ACID. If you want to geek out on that, this is a good place to start.)
Relational databases were designed to give you this kind of guarantee. You can write a program that updates thousands of records, in hundreds of different tables, and commit or rollback the whole thing. This is beyond useful — it’s PFM. (Pure Freaking Magic.)
At Salesforce, many of our customer-enabling products are powered by a relational database, and transactions are a key part of that. As our customers modify data, or use administration tools to modify metadata and settings, they get that ACID guarantee, so things will never end up in a half-committed state. The same goes for when they create more complex business logic (for example, using Apex code). Customers don’t need complex “middleware” systems; this is just how it works right out of the box.
This behavior is partly what allows our customers to use Salesforce as a system of record for their most critical data, despite the fact that it’s a platform “as a service”. You’re getting the benefits of database transactional behavior, by virtue of the fact that under the covers is a real live relational database, with transactional guarantees that go all the way down. The default behavior is that operations you perform in a single request get the ACID treatment.
Now, to be clear, explicit transactions like this don’t cover all of the operations within Salesforce. For one thing, when you’re working with extremely large data sets, it simply isn’t practical to do this in bounded time, which is why we also support patterns like primary key chunking and BigObjects. But more importantly, Salesforce doesn’t do everything in a single database. That’s where we come to data on the outside.
As cool as they are, ACID transactions can only reliably operate on the inside, within a single running service. What do I mean by “inside”? Rather than explain it fully, I’d direct you to Pat Helland’s landmark paper, “Data on the Outside versus Data on the Inside”.
(If you’re in a rush, The Morning Paper just did a great summary of the paper this week, here.)
In a nutshell: it’s only when data lives in a single database that it can deliver on the transactional promises we’ve been talking about. It lives in the “now” — a fully consistent and ordered series of changes, mediated by that good old magical database.
But as soon as data steps outside those boundaries, it becomes subject to “Einsteinian” physics (that is, relativity): information takes time to travel from service to service, and you can only ever know about its state as of some time in the past. And, importantly, you lose the ability to do the kind of guaranteed ACID transactions we talked about above: they’re too brittle and slow.
Why is that? Because when messages have to travel between different systems (instead of using, for example, shared memory), it becomes much more difficult to make sure everyone is on the same page. Transactions that span machines — “distributed transactions”, as they’re called — are technically possible, but they’re really hard to get right at scale. (Pat’s analogy for this is: imagine you’re getting married over the phone. You say “I Do”, and then the call drops. Are you allowed to date other people??)
Thus, many of our systems operate in a distributed mode, outside the safe walls of the relational database. And, there’s really no other way it could be.
As an example, think about search. We run a process to continually index all the data in Salesforce (otherwise, if you were to do a search for the word “banana”, the system would have to inspect every record individually at the time of your search, which would take forever!). Every time a piece of data is updated, a copy of that new data is delivered to this search system for indexing, so searches are blazing fast.
But technically speaking, that indexing is always operating on a slightly out-of-date reality: the “past”. In between the instant you updated the value, and the moment when the search indexer gets it (usually a few seconds later), it might have been changed again (which would, of course, prompt another message to the indexer in the future). The search indexer has to assume it doesn’t really know the current truth about the data, only a slightly delayed series of telegrams about it.
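One way to picture an indexer living with delayed telegrams is the sketch below. To be clear, this is a hypothetical toy and not Salesforce’s actual indexer: the `SearchIndex` class, its `apply` method, and the idea that every update carries a version number are all assumptions for illustration. The point is that the indexer never claims to know “now”; it just keeps the newest version it has heard about and ignores older news that arrives late.

```python
class SearchIndex:
    """Toy index that keeps the newest known version of each record."""

    def __init__(self):
        self._docs = {}  # record_id -> (version, text)

    def apply(self, record_id, version, text):
        """Apply an update message; return False if it is stale."""
        current = self._docs.get(record_id)
        if current is not None and current[0] >= version:
            return False  # we already know something at least this new
        self._docs[record_id] = (version, text)
        return True

    def search(self, word):
        """Return the IDs of records whose latest known text contains word."""
        return sorted(
            rid for rid, (_, text) in self._docs.items() if word in text.split()
        )


index = SearchIndex()
applied_new = index.apply("rec1", 2, "ripe banana")  # version 2 arrives first
applied_old = index.apply("rec1", 1, "green apple")  # version 1 shows up late

banana_hits = index.search("banana")
apple_hits = index.search("apple")
print(applied_new, applied_old, banana_hits, apple_hits)
```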
Why? Because the search indexer system is “on the outside” — it’s a separate service from the database (in our case, implemented as a cluster running Apache Solr). As powerful and magical as databases are, they can’t be expected to surround every aspect of a large distributed system like Salesforce’s service.
The same distributed nature is true for many other parts of the overall system, too: mobile caches, analytics services, machine learning loops, geographic replication, and more. There’s no practical way to force all of these many (high scale!) systems into one giant box, even in theory. Data relativity is just a fact of life when you build high scale systems.
So how do we keep things from descending into the kind of weird ATM-gone-wild scenarios we talked about above? The answer is to know when to use which.
For the modifications encapsulated in a single customer request, the focal point of our transactional updates is the relational database. To ensure we have enough scale and capacity, we heavily shard our infrastructure into what we call “instances” (which, you’ll recall, I talked about way back in Episode 2; now you know why it’s so important!). At the heart of our system, updates by a single customer to their standard record-oriented data must all stream through this one point, kept in line by the relational database.
For the rest of the disparate parts of our big distributed system (search, analytics, etc.), we follow the rules of engagement that Pat proposes in DOTOvDOTI: data must be immutable, identifiable, and versioned. And where the data conforms to a schema, that schema should also be versioned, so those messages that flow between services make sense no matter how much things have changed since they were sent.
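Here is a minimal sketch of what a message following those rules might look like in Python. Everything in it is hypothetical: the `RecordMessage` class, its field names, and the two made-up schema versions are assumptions for illustration, not a real Salesforce message format. The message is frozen (immutable), carries a record ID (identifiable), and has both a data version and a schema version, so a receiver can make sense of it even if the schema has moved on since it was sent.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class RecordMessage:
    record_id: str       # identifiable: which record this is about
    version: int         # versioned: which revision of the record
    schema_version: int  # which schema the payload was written with
    payload: tuple       # (field, value) pairs, kept as an immutable tuple


def decode(msg):
    """Read a payload according to the schema version it was sent with."""
    fields = dict(msg.payload)
    if msg.schema_version == 1:
        # Hypothetical v1 schema: a single "name" field.
        return {"full_name": fields["name"]}
    if msg.schema_version == 2:
        # Hypothetical v2 schema: the name split into two fields.
        return {"full_name": f'{fields["first_name"]} {fields["last_name"]}'}
    raise ValueError(f"unknown schema version {msg.schema_version}")


old = RecordMessage("rec1", 7, 1, (("name", "Pat Helland"),))
new = RecordMessage("rec1", 8, 2, (("first_name", "Pat"), ("last_name", "Helland")))

# Immutable: once sent, a message cannot be edited in place.
try:
    old.version = 99
    mutated = True
except FrozenInstanceError:
    mutated = False

print(decode(old), decode(new), mutated)
```

A change to the record is never an edit to an old message; it is a new message with a higher version.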
These rules of thumb are intended to preserve the conceptual correctness of data. If you can simultaneously change data in two different systems at once, neither of those systems can be trusted to be correct. Instead, the path of data mutations has to be a DAG (directed, acyclic graph), and for most data, the relational database is the starting point of that graph.
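To sketch the DAG idea, the example below uses `graphlib.TopologicalSorter` from Python’s standard library (Python 3.9 and later) to check that a data flow has no cycles. The service names and the shape of the graph are invented for illustration; they only loosely mirror the systems mentioned in this post.

```python
from graphlib import CycleError, TopologicalSorter

# Each service maps to the set of services it receives its data from.
flows = {
    "search": {"database"},
    "analytics": {"database"},
    "ml_loop": {"analytics"},
}

# A valid DAG has an order in which every source comes before its consumers.
order = list(TopologicalSorter(flows).static_order())

# If search could also write back into the database, the flow would loop.
bad_flows = dict(flows, database={"search"})
try:
    list(TopologicalSorter(bad_flows).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True

print(order, has_cycle)
```

With the database as the single starting point, there is always a clear answer to “who had the truth first?”. With a cycle, there is not.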
Hopefully this has given you a little insight into what transactional guarantees are all about, and the way we think about using both transactional and non-transactional behavior in coherent ways. In future episodes, we’ll talk more about these streams of data that flow between services, and how one might reason about those to do cool stuff, and to be smarter and more predictive about every customer.
In the meantime, me and my ATM friends are heading to Vegas.
Want moar architecture? Head yourself on over to episode #6.