Thoughts on Immutable Databases

March 6, 2015
databases

I came across this talk on Apache Samza this past week:

https://www.youtube.com/watch?v=fU9hR3kiOK0

The author, Martin Kleppmann, also wrote up an excellent post on the same subject, which is well worth your time.

In both, he argues that an stream of immutable facts (a log, essentially) is a better datastructure for databases, rather than the existing model of mutating state in place.

No fact in this stream of facts is ever altered, the stream is only appended to. Database views are then built on top of this stream of facts, and these views automatically update when a new fact is appended to the stream. This allows for the same kind of querying that we all love, but eliminates some of the following headaches:

Migrations. Just create a new view for your log, and when it’s ready, move your app over to the new view.
Testing Features. You can create a whole new set of views into your data for a feature, give only certain users access to them, and if the feature doesn’t pan out, delete them all with no negative consequences.
Real-time updates. Since the foundational datastructure is really an event stream, these events can be propogated to subscribers quite easily. Your app can be notified whenever any of the views it depends on is updated.
Scaling. Since the data is stored as a list of facts, it’s easier to distribute across multiple machines, especially since these facts are never changed.

With that introduction out of the way, (and you really should watch the talk or read the post) I’ve got a few thoughts.

First, this sounds very attractive to me. If I had that much flexibility with my views, I could move much faster and be a lot more ambitious with my changes. After all, since nothing in the underlying data ever changes, we can always go back.

I do have a few concerns though.

The log would get huge. Imagine storing every fact generated by your web app in one or more logs. Those logs would get very, very large! They could be compressed, but this might slow the database down, depending on how often it compresses the data. It’d be like GC.
You’d trade migrations for long waits creating new views. In order to generate a new view, the database would have to scan all the relevant logs, which is to say, terabytes of data. While you’d no longer have to worry about breaking your production web app during the migration, you might have to sit and wait for a long time before your new view could be used.
Validations are essential. Since the facts written to the database are immutable, you must prevent facts from being recorded in an invalid format. Otherwise, you’ll be dealing with that malformed data forever. It’s also worth pointing out that there seem to be no databases currently available that use this architecture. However, there is a project, PipelineDB which seems to be doing just that. Details are sparse, but I’ve applied for the beta program, and if I get in, I’ll post about it.

Despite those concerns, this seems to be a promising new direction for a sector of tech that’s stayed relatively constant for a long time, databases. It’ll be interesting to see how this approach turns out once it starts hitting production!

Read more