How did we get here?
If you’ve been around for a while, or worked in a start-up that didn’t think it had to scale to the bazillions from the get-go, you’ve probably started out with one database and one monolith. Things were smooth, things were fast. You would care about squeezing the last drop of juice out of that SQL query with some late row binding black magic, just because you felt it was your biggest pain at the moment. In which module does this functionality go? What service do I need to call? All RPC, just without the Aarrrrrr. If things had grown you might have played around with an app-internal event bus or observers, and fought lateral calls, layers, and onions.
We then had to evolve, and no longer thought about which module to put behaviour in, but which service. Which domain does this belong to? Do we need to call it alongside anything else, like, 99% of the time? If you were unlucky you scaled out to several services but without distributing ownership of the data. Your services would wildly read and write "to and fro" one DB and any table, perhaps all with the same DB user just because, and soon you’d find yourself in a schema gridlock where it was an insane effort to change anything in your DB schema because you didn’t even know who was using it, let alone what all of it even meant.
Perhaps you were lucky and one kind of data had exactly one owner.
Now you had these twenty CRUD services devoid of any real meaning. Cheap to write, yet somehow still full of bugs and hard to change. One thing happening in your business would cause eleven calls and a cascade through the circuit breakers, reasonable timeouts, and retry strategies you had in place, right? Right?
Alright, this world is kinda boring to live in and despite its simplicity somehow hard to comprehend. We can do better. What if we cared about business processes again? What if we spoke the same language as the business? What if things … happened and we reacted to them? Remember the event bus we fiddled around with earlier? Let’s make the difficulties of complex adaptive distributed systems explicit. Let’s do EDA (event-driven architecture) with ECST (event-carried state transfer).
Now this is a nice place to live in. Yeah, we have to care about order and eventual consistency and choreographies and idempotency and what goes into this event actually? But the things happening in our system are the things happening in the business. Ain’t that neat?
Yet still, you have all these more or less moving parts over and over again. If you fully rely on ECST and your service needs nine different aggregates that’s (hopefully) nine streams or topics to listen to, nine projections, nine evolving schemas to comply with, perhaps nine inboxes (and outboxes on the other side), DLQs, alerting, probably revisions to track, and perhaps dangling references you need to wait for to be resolvable.
Square one?
So, six paragraphs into this piece – what if we went back to square one? It’s 2024 and we have globally distributed, highly scalable databases. What if we got back to essentially having just The Database™? I am not talking about an event store (which, too, is basically the database). Each table would have its owner (writer) and could be a shard or instance. Every reader could read from some replica. Your service would still have its own storage, invisible to others. But alongside it, probably in a system-level namespace, you’d also have an integration table. Its schema is the contract you adhere to. If you need to change it you can create a new v2 table, write to both, and drop v1 after a while. If you needed to keep all the history you’d be free to store it – just add a revision column to your PK.
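A minimal sketch of what such an integration table could look like in Postgres – the schema, table, and column names here are all made up for illustration:

```sql
-- A dedicated namespace for the contracts; only the owning service writes here.
CREATE SCHEMA IF NOT EXISTS integration;

-- v1 of a hypothetical "orders" contract.
-- The revision column in the PK is what keeps the full history.
CREATE TABLE integration.orders_v1 (
    order_id    uuid        NOT NULL,
    revision    bigint      NOT NULL,
    customer_id uuid        NOT NULL,
    status      text        NOT NULL,
    updated_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (order_id, revision)
);
```

A v2 would simply be a second table next to it, dual-written until every reader has migrated.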
What would we gain?
No need for outboxes, inboxes, buses, async schemas, queues, topics, and the like. Writes are also guaranteed in-order (so are reads). No dangling references. Instead: actual referential integrity. No translation to a different technology and back again. Your projections could be "actual" projections – just views on the tables you’re interested in. No replays. We would pretty much not have to bother with the async nature of a distributed system. Sure, there’s replication lag – but if you require the exact version of something you have to reference that anyway, and it’s rather rarely the case that you do. Real DBs are as fast as it gets (not that messaging was slow, even if done via polling and HTTP).
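For instance, assuming an integration table like integration.orders_v1(order_id, revision, status, …) that keeps full history, a projection really is just a view – a sketch, not a prescription:

```sql
-- Latest revision of each order (DISTINCT ON picks the first row per
-- order_id according to the ORDER BY, i.e. the highest revision).
CREATE VIEW latest_orders AS
SELECT DISTINCT ON (order_id) order_id, revision, status
FROM   integration.orders_v1
ORDER  BY order_id, revision DESC;

-- The "projection" a consuming service actually cares about.
CREATE VIEW open_orders AS
SELECT order_id, status
FROM   latest_orders
WHERE  status = 'open';
```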
Simple use cases could stay very simple. You read from and write to some tables. Perhaps we’d even implement the translation from and into our private tables as triggers, so that we really don’t need to care about the integration. It’ll happen behind the scenes. We’d be living in single-DB monolithic wonderland again.
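Sketched in Postgres, such a trigger might mirror every write on a private table into the integration table – all table and column names here are assumptions:

```sql
-- Copy each private write into the public contract behind the scenes.
CREATE FUNCTION publish_order() RETURNS trigger AS $$
BEGIN
    INSERT INTO integration.orders_v1 (order_id, revision, customer_id, status)
    VALUES (NEW.id, NEW.revision, NEW.customer_id, NEW.status);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_publish
AFTER INSERT OR UPDATE ON orders   -- the service's private table
FOR EACH ROW EXECUTE FUNCTION publish_order();
```

This assumes the private table bumps its revision on every update, so the insert never collides with an existing (order_id, revision) pair.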
What would we lose?
Kinda one of the biggest benefits – the actual events. The "what relevant thing has happened in the business?" We would have to emulate this. If we kept history we could have an event column right next to the revision (or whatever clock we use). We would then have to poll and scan from our current cursor in time – just as with streams. Or we write to a separate feed/journal.
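Such a feed could be as simple as an append-only table with a monotonically increasing sequence, polled from each consumer’s own cursor (names are illustrative):

```sql
-- An append-only journal next to the state table.
CREATE TABLE integration.orders_feed (
    seq      bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    order_id uuid   NOT NULL,
    revision bigint NOT NULL,
    event    text   NOT NULL   -- e.g. 'OrderPlaced', 'OrderShipped'
);

-- Consumer side: scan forward from the last seq you processed.
SELECT *
FROM   integration.orders_feed
WHERE  seq > 42                -- the cursor each consumer persists itself
ORDER  BY seq
LIMIT  100;
```

One caveat: under concurrent writers, sequence values can become visible out of commit order, so a real implementation would have to deal with gaps.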
What about push-based delivery? We could probably build something with CDC, or with triggers and webhooks, basically.
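In Postgres, one way to sketch this is LISTEN/NOTIFY fired from a trigger – assuming an append-only feed table like the hypothetical integration.orders_feed(seq, …) exists:

```sql
-- Notify listeners whenever a new feed entry lands.
CREATE FUNCTION notify_order_event() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('orders_feed', NEW.seq::text);  -- payload: the new position
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_feed_notify
AFTER INSERT ON integration.orders_feed
FOR EACH ROW EXECUTE FUNCTION notify_order_event();

-- Each consumer connection then simply:
LISTEN orders_feed;
```

A small sidecar could translate those notifications into webhooks for consumers living outside the database.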
What about commands? Or anything else we’d send as a message? We could use the same thing. Add a command inbox. And if we need a response? The issuer could have a public inbox itself we write to.
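A command inbox could be just another contract table, with an issuer-generated ID making retries idempotent – a sketch with made-up names:

```sql
CREATE TABLE integration.orders_commands (
    command_id   uuid  PRIMARY KEY,   -- issuer-generated, so retries are idempotent
    command      text  NOT NULL,      -- e.g. 'CancelOrder'
    payload      jsonb NOT NULL,
    issuer       text  NOT NULL,      -- tells us whose inbox gets the response
    processed_at timestamptz          -- NULL = still pending
);

-- Owner side: claim pending commands without stepping on concurrent workers.
SELECT *
FROM   integration.orders_commands
WHERE  processed_at IS NULL
FOR UPDATE SKIP LOCKED
LIMIT  10;
```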
What else?
We’d force everyone to use the DB technology the ether provides for integration. Every service could still use whatever storage fits its use case best internally, but for most services it would probably make sense to just use the very same thing. This could be both an advantage and a danger.
Every piece of integration-relevant data we’d have at least twice. But that’s no different from actual EDAs.
RDBMSs would provide us with fine-grained access control – something a lot of EDAs lack.
If you don’t need a projection, a delete is an actual delete (whether it’s soft or not). Great when you face GDPR. We think of The Log as a good thing, and it is, but sometimes you just have to delete stuff. At the same time, schema migrations would be rather easy. Create your v2 table as a copy of v1 and make your changes. Or run an explicit migration if it’s more complex. Here, you can rewrite history (of course you need to check how this affects your feed or history, if you have one).
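In Postgres such a v1-to-v2 rollout could look roughly like this – the renamed column is a made-up example of a breaking change:

```sql
-- Create v2 as a structural copy of v1, then apply the breaking change.
CREATE TABLE integration.orders_v2 (LIKE integration.orders_v1 INCLUDING ALL);
ALTER TABLE integration.orders_v2 RENAME COLUMN status TO order_status;

-- Backfill (or run a more complex, explicit migration instead).
INSERT INTO integration.orders_v2 SELECT * FROM integration.orders_v1;

-- Dual-write from here on; once every reader has moved:
-- DROP TABLE integration.orders_v1;
```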
Without any queues in place there’s no buffer. But we don’t actually need one: as in stream-based EDAs, a consumer just continues reading The Log from its own cursor. So consumers can catch up and don’t get flooded.
The DB would be the central single point of failure and bottleneck – just like the bus or your Kafka or what have you.
Conclusion
In my head this plays out quite nicely. You would probably poll a couple of tables that are relevant to you, or their "feeds". Or perhaps there’s just one feed and you just SELECT the ones you need. Perhaps your service has its own instance, replicating from the other instances’ WALs (The Log ;)). So you would turn the messaging into some sort of federated WAL-based replication. Or you scale your DBs according to your overall load. But then a heavy writer would impact all other services. So we’d probably want to write to our own instance and have others read from the WAL. Having just one global feed would then, sadly, be difficult again at certain volumes.
How could we get there?
To me, without having invested too much thought into this, the most reasonable approach seems to be having one instance per service. It’s the authority for your tables and replicates from the WALs of the stuff you’re interested in. Of course, you’d want more than just one instance as a fail-over for outages or maintenance. Perhaps even a global cluster. It’d be neat if you could just have some extra instances in your global pool that could dynamically be claimed for such scenarios.
You’d probably want to be able to GRANT READ on some_table* so you don’t have to change permissions each time you push a new schema version.
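Postgres doesn’t support wildcards in GRANT targets, but schema-level grants plus default privileges get close – assuming a reader_role and an integration schema as in the earlier sketches:

```sql
GRANT USAGE  ON SCHEMA integration TO reader_role;
GRANT SELECT ON ALL TABLES IN SCHEMA integration TO reader_role;

-- Cover tables that don't exist yet, e.g. the next orders_v2:
ALTER DEFAULT PRIVILEGES IN SCHEMA integration
    GRANT SELECT ON TABLES TO reader_role;
```

Note that ALTER DEFAULT PRIVILEGES only applies to tables created by the role issuing it (or the one named via FOR ROLE).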
In conclusion, someone would probably have to sit down and tweak Postgres to do those things :)