OK EVENT LOG

Ian Varley
Apr 21 - 5 min read

When I was just out of college, in the late ’90s, living in San Francisco, one of my musician friends from back east called me up one day.

“Dude,” he said. “You’ve got to check out this record. Seriously.”

The subject of his admiration was the album OK Computer by Radiohead. And boy, he wasn’t kidding. It completely took over my CD Walkman and became the soundtrack for those first years in SoMa, working the graveyard shift at dotcoms, riding the boom-and-bust cycle. It was a damn fine record.

It was in much the same way that I heard about that famous blog post by Jay Kreps.

What, you’re not hip to Jay’s post? Well, here it is, and you’re welcome to pause now and go read it (though, fair warning, it’s LONG). I got it from one of our engineers in Tampa, and I don’t know where he got it. I passed it along dutifully to my peeps, and we all started to figure things out together.

If that sounds like a strange analogy to make, well … OK, it is. But bear with me, perhaps it’ll make sense.

Event Logs, Explained: No Surprises

The TL;DR (“Too Long; Didn’t Read”) of Jay’s original post is that by applying an old idea (publish/subscribe messaging) in a new way (distributed commit log, as a service), we can simplify all these giant complex systems we’re building. They did it at LinkedIn, and it worked, and they open sourced the result (Apache Kafka).

When I encountered Jay’s piece, I’d been working on distributed systems for years. Reading it was like hearing those first ringing guitar notes of “Airbag”. No doubt in my mind: this was important. If you’re not already a distributed systems thinker, it takes just a little bit of explaining to give you the gist.

In a super-small nutshell (seriously, read Jay’s post for the full deal), here’s how it works. First, a couple technical terms (not too many, I promise):

  • Let’s use the word event to mean “a record of something that happened”. That could be anything, really; a person logging in to a site, taking a picture, sending a text message, whatever. The important thing about an event is that it’s “immutable”: you can’t change or edit it later. It’s a fact, it happened.
  • Let’s also use the word log to mean “a time-ordered list of events”. Each event tells you what happened, and the ordering in the log tells you when they happened, with respect to one another. Logs are append-only; you can never go back and wedge something new into the middle.
  • When you put something onto the log, you’re a publisher, and when you read something from the log, you’re a subscriber. (That’s why systems like this are often called “pub/sub”, for “publish/subscribe”.)
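To make those three terms concrete, here is a minimal in-memory sketch in Python. It is only an illustration, not how Kafka or any real system is built, and the names Event, EventLog, publish and read are made up for this example.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import List, Tuple


@dataclass(frozen=True)
class Event:
    """A record of something that happened; frozen, so it can't be edited."""
    kind: str
    who: str


class EventLog:
    """An append-only, ordered list of events (hypothetical toy class)."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def publish(self, event: Event) -> int:
        """Append an event and return its offset (its fixed place in the log)."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, start: int = 0) -> Tuple[Event, ...]:
        """Return every event from offset `start` onward, in log order."""
        return tuple(self._events[start:])


log = EventLog()
first = log.publish(Event("login", "alice"))
second = log.publish(Event("photo_taken", "bob"))

# Trying to rewrite history fails, because events are immutable.
try:
    log.read()[0].who = "mallory"
    edited = True
except FrozenInstanceError:
    edited = False
```

Notice that the class has no update or insert-in-the-middle method at all. The only way to change the log is to add to the end of it.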

Here’s the punchline: pub/sub systems are awesome because they provide one simple guarantee. When a publisher puts a new event into the log, and the system acknowledges it, that event’s place in the log (its order) is fixed forever. That means you can have any number of subscribers, and they are all guaranteed to see the same sequence of events, in the same order, without any further coordination between them.

If this doesn’t sound revolutionary to you, consider the alternative. In a database, anybody can (theoretically) change a value at any time. If I ask “What’s this person’s email address?” I’ll get one answer … but if I ask again a split second later, I could get a totally different answer (because they could have just changed their email address). How do I even know to ask again? I don’t.
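Here is the email example as a small Python sketch, with a plain dict standing in for the database and a list standing in for the log. The addresses and the current_email helper are invented for illustration.

```python
# A mutable store: asking twice can give two different answers,
# and nothing tells the reader that a change happened in between.
emails = {"pat": "pat@old.example"}
first_answer = emails["pat"]
emails["pat"] = "pat@new.example"   # someone else updates the value
second_answer = emails["pat"]

# The same history as an append-only log of (user, new_email) events.
email_log = []
email_log.append(("pat", "pat@old.example"))
email_log.append(("pat", "pat@new.example"))


def current_email(log, user, upto):
    """Replay the first `upto` events and return the user's email at that point."""
    result = None
    for who, email in log[:upto]:
        if who == user:
            result = email
    return result
```

With the dict, the old answer is simply gone. With the log, "what was Pat’s address as of event 1?" and "as of event 2?" both have stable answers, no matter when you ask.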

The underlying issue here is physics. Information (bits, electrons, etc.) takes time to move from one place to another. Mutable systems (like databases) make keeping track of that your problem, as a software engineer, whenever you’re working with more than one system at a time (which you almost always are).

Pub/sub systems, on the other hand, take all the mystery out of it, because the system is centered around an immutable event log that all the participants have the same view of. Everybody sees history unfold in the same way, and they can react to it at their own pace, without worrying too much about different interleavings of timing. If you’re a subscriber, and you have read up to a given point in that log, you’ve seen everything that happened, in the order it happened.
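The "at their own pace" part is easy to sketch. In this toy Python version, each subscriber just remembers an offset into a shared list. The Subscriber class and its poll method are hypothetical names for this example, not any real client API.

```python
class Subscriber:
    """Reads a shared log at its own pace, remembering how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0      # position of the next unread event
        self.seen = []       # everything this subscriber has read so far

    def poll(self, max_events):
        """Read up to `max_events` new events and advance the offset."""
        batch = self.log[self.offset:self.offset + max_events]
        self.seen.extend(batch)
        self.offset += len(batch)
        return batch


shared_log = ["login:alice", "photo:bob", "text:carol", "logout:alice"]

fast = Subscriber(shared_log)
slow = Subscriber(shared_log)

fast.poll(4)                      # reads everything at once
slow.poll(1)                      # reads one event, then falls behind
shared_log.append("login:dave")   # a new event arrives at the end
slow.poll(10)                     # catches up later
fast.poll(10)
```

The two subscribers never talk to each other, and they read at different times, yet both end up with the identical sequence. The only state a subscriber needs is a single number: how far it has read.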

Now, don’t get me wrong; I loves me some databases. The point isn’t that they’re bad, the point is that they’re a bad metaphor for communication in a distributed system. If multiple parties need to collaborate on some information, a shared immutable log is a radically better building block than point-to-point message passing.

Which brings us to today. Distributed event log systems (like Kafka, BookKeeper, CDAP, etc.) take this relatively old idea, and bring it forward into the web-scale future as open source projects. They do the hard work of letting you scale that central log up to handle millions of messages per second, by sharing the work among multiple computers and providing fault tolerance for when some of those computers (inevitably) crash. That means that instead of being some niche component of your architecture, they can actually start to be the core of your architecture: the circulatory system, to use Jay’s metaphor.
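One common way to share the work among multiple computers is to split the log into partitions and route each event by a key. The Python sketch below shows only that routing idea, with lists standing in for machines. The partition count, the CRC32 hash and the function names are assumptions made for illustration, and it leaves out replication and fault tolerance entirely.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical; each partition stands in for one machine

partitions = [[] for _ in range(NUM_PARTITIONS)]


def partition_for(key: str) -> int:
    """Pick a partition deterministically from the event's key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def publish(key: str, payload: str) -> int:
    """Append to the key's partition and return the partition index."""
    index = partition_for(key)
    partitions[index].append((key, payload))
    return index


for key, payload in [("alice", "login"), ("bob", "login"),
                     ("alice", "photo"), ("alice", "logout")]:
    publish(key, payload)

# All of one key's events land in the same partition, in publish order.
alice_events = [p for k, p in partitions[partition_for("alice")] if k == "alice"]
```

The trade-off in a scheme like this is that ordering is only fixed within a partition, so events that must be seen in order relative to each other need to share a key.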

That’s it! That’s the magic. OK, perhaps it’s not as universal in its appeal as Radiohead, but for software systems folks, this is as good as it gets. (We don’t get out much.)

Salesforce Gets More Productive

It turns out, I wasn’t the only one to get excited about this. Lots of folks within Salesforce engineering also saw the potential for transformation in the way we architect our systems. And they started to build things a little differently. In fact, it was such a common shared phenomenon that we started a weekly “tribe” get-together where people from different teams would share how they were working on incorporating this idea into their systems. From monitoring to data synchronization to product features, it started taking on a life of its own.

So, over the coming months, we’ll tell a few of the stories of what we’ve done, here on Medium. The first one, about how we transformed our Monitoring data transport infrastructure, is here.

Last year, Jay was kind enough to come give a talk about all of this to our engineering teams in San Francisco. It turns out that in addition to being a visionary, he’s a real nice guy, too. And now his new company, Confluent, is hosting the Kafka Summit next week in San Francisco. We’ll be there in force, so if you’re around, come see us.

Want to read more? See you over at episode #4.
