Do you deal with data problems at work, or with CPU problems? It sounds like an absurd question. But not long ago, engineers would raise the most hands at the second question, not the first.
Over the past twenty years, a new kind of company has been founded, one for which raw CPU power is no longer the constraint, as tools such as Python exemplify. For the engineers working at these companies, data is the bigger challenge: its amount, its complexity, and its speed of change.
Nowadays, engineers become experts not by mastering algorithms, but by mastering the selection of the tools and approaches most suitable for the task.
However, most engineers approach their everyday challenges with designs that make it hard to change or combine data systems. Most software products end up relying on the same data systems they started with. They have only a hammer at their disposal, and are puzzled to find that not every problem is a nail.
Let us consider the following example: Twitter. Now that every company is doing it, take some time to think about how you would implement a User class for a social media app.
…
It would probably look a lot like this:
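Here is a minimal sketch using Django's ORM; the fields are illustrative, not Twitter's actual schema:

```python
# A minimal Active Record-style User, sketched with Django's ORM.
# The class *is* the table: every attribute maps to a column.
from django.db import models


class User(models.Model):
    username = models.CharField(max_length=50, unique=True)
    bio = models.TextField(blank=True)
    # An asymmetric follower relationship: following someone does not
    # imply being followed back.
    followers = models.ManyToManyField(
        "self", symmetrical=False, related_name="following"
    )

    def post_tweet(self, text: str) -> "Tweet":
        # The object writes itself to the database.
        return Tweet.objects.create(author=self, text=text)


class Tweet(models.Model):
    author = models.ForeignKey(User, on_delete=models.CASCADE)
    text = models.CharField(max_length=280)
    created_at = models.DateTimeField(auto_now_add=True)
```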
This design pattern is called Active Record. Most engineers are familiar with it, because it pervades Web development, whether you use Django, SQLAlchemy, ponyORM, Tortoise, peewee or any other Python ORM.
Engineers aren’t necessarily wrong when they choose Active Record for problems that aren’t too complex, because it’s easy to build and understand. Yet, it works well only if the object corresponds directly to the database table.
Even though tools like Django or Flask are now the de facto standard for Web development, they became so because they made design choices that emphasized speed of development. They succeeded because they’re quickstart frameworks for developers.
But past a certain level of complexity, most engineers working with these frameworks, even experts, find themselves unable to make progress. They choose "boring" technology to avoid the unknown unknowns, but they keep running into challenges that are all too familiar.
The problem with decision making under uncertainty is that it is dominated by information you don't have yet. What frequently counts is what hasn't happened yet. Steve Jobs called it "connecting the dots": you can only do it in retrospect.
Raffi Krikorian, who was VP of Platform Engineering at Twitter, has a very interesting talk called Timelines at Scale, where he notes that Twitter's true scalability challenge is, surprisingly, not the posting of new tweets, but the load that comes from showing the home page to many users who each follow many other users. He called that fan-out.
There are at least two ways to design Twitter's architecture. One is the one we've just come up with, which makes it easy to write into the database.
Another would make it easy to read from it, by maintaining a cache of each user's home page: a mailbox of tweets for each recipient user.
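To make the tradeoff concrete, here is a rough sketch of both read and write paths, assuming a SQLite-style connection. The SQL is schematic, and followers_of, push_to_mailbox and read_mailbox are made-up helpers:

```python
# Approach 1: cheap writes, expensive reads.
# Posting is a single INSERT; the home page is computed with a join at read time.
def post_tweet_v1(db, author_id: int, text: str) -> None:
    db.execute("INSERT INTO tweets (author_id, text) VALUES (?, ?)", (author_id, text))


def home_timeline_v1(db, user_id: int) -> list:
    return db.execute(
        """
        SELECT t.* FROM tweets t
        JOIN follows f ON f.followee_id = t.author_id
        WHERE f.follower_id = ?
        ORDER BY t.created_at DESC LIMIT 100
        """,
        (user_id,),
    ).fetchall()


# Approach 2: expensive writes, cheap reads.
# Posting fans the tweet out to every follower's mailbox; reading is a lookup.
def post_tweet_v2(db, cache, author_id: int, text: str) -> None:
    tweet_id = db.execute(
        "INSERT INTO tweets (author_id, text) VALUES (?, ?)", (author_id, text)
    ).lastrowid
    for follower_id in followers_of(db, author_id):
        cache.push_to_mailbox(follower_id, tweet_id)


def home_timeline_v2(cache, user_id: int) -> list:
    return cache.read_mailbox(user_id, limit=100)
```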
It is the ratio of writes to reads that determines which architecture is sound: how many tweets are posted versus how many are read per unit of time. In Twitter's case, scale was achieved by controlling more granularly how data is written and read.
Solving scalability challenges usually comes down to shifting architectures, so I ask you: How would we shift from the first design to the second?
Active Record makes it impossible to “make it work, then make it right, then make it fast” without a Great Rewrite. In the meantime, the success of the social media app will be choked by the engineers’ inability to move away from an architecture that was a good idea in the beginning, but makes it unfeasible to incorporate new information about how the system is used. We’re bound to The Fail Whale.
This helps us see more clearly the root of the problem: different use cases have different requirements, and in the presence of Active Record, experts can’t shift from one data system to a better one when the use case changes.
Engineers that choose Active Record are optimizing for quick wins, at the cost of rigidity in data storage. They get to develop a simple relational data model fast. But they’re stuck with it.
It is altogether fitting and proper that we choose the Active Record for simple projects.
But, in a larger sense, we cannot dedicate ourselves to solving the impedance mismatch of databases and objects.
We cannot consecrate the mediocre state of a project and avoid touching it, for fear of getting burned.
We cannot hallow the Great Rewrite when the project freezes due to an inconvenient storage decision made at the beginning.
I solemnly declare, that these systems are, and of Right ought to be Free and Independent from all allegiance to early database decisions.
Rather than depending on a design pattern that couples the object and the database design from the beginning, experts aspire to engineer applications that are modular.
They evaluate technical platforms by their migration costs rather than by their capabilities.
Decisions about how data is persisted can be changed later, and the system as a whole works as if data resides in memory. That is called persistence ignorance.
In a world where use cases are continually redefined in search of product-market fit, the tools used to persist and load data must be redefined just as easily.
My name is Alvaro Duran, and this is Working in Units. My goal is to give expert engineers a blueprint to make systems more modular at the data level.
Working in Units is a way of doing things that makes getting started less quick, but addresses a very specific need for expert engineers: the flexibility to change data systems.
In order to create systems that achieve persistence ignorance, we’re going to do something controversial with the relationship between the model and the database.
Rather than declaring what we want the database to look like, as Active Record does, we're going to specify the mapping process between the database and the objects.
With this imperative approach, we admit a bit more complexity because of the extra mapping layer, but we keep the database and the objects relatively independent of each other.
We are shifting from one design pattern to another, called Data Mapper. Only experts seem to be aware of this lesser-known pattern, because it relies entirely on our ability to transfer data correctly between two layers.
In exchange, we no longer have to treat model classes as tables in a database, and we can do all sorts of object-oriented things to them (composition, inheritance, polymorphism, things you thought twice about doing with Active Record), with complete control over how they end up represented in the database.
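As a sketch of what that looks like, here is the Data Mapper approach using SQLAlchemy's imperative mapping; the table layout is an assumption made for this example:

```python
from sqlalchemy import Column, ForeignKey, Integer, MetaData, String, Table
from sqlalchemy.orm import registry, relationship

mapper_registry = registry()
metadata = MetaData()

# The tables are declared on their own...
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("username", String(50), unique=True),
)
tweets = Table(
    "tweets", metadata,
    Column("id", Integer, primary_key=True),
    Column("author_id", ForeignKey("users.id")),
    Column("text", String(280)),
)


# ...and the classes are plain Python objects, free to use composition,
# inheritance and polymorphism.
class User:
    def __init__(self, username: str):
        self.username = username


class Tweet:
    def __init__(self, author: User, text: str):
        self.author = author
        self.text = text


# The mapping process is specified explicitly, keeping objects and tables
# relatively independent of each other.
mapper_registry.map_imperatively(User, users)
mapper_registry.map_imperatively(
    Tweet, tweets, properties={"author": relationship(User)}
)
```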
This mapping defines the relationship. In order to make use of it, we need to create a class for it. We're going to call that Repository, and it's going to be responsible for representing objects as a conceptual set. That means this class will have methods to add objects to, and get them from, the database. Whichever ORM we choose will be reflected in this class only, rather than scattered across the codebase.
We will end up with a very SQL-specific class that centralizes all your data persistence concerns in a single place. This is helpful because your performance optimization efforts get centralized too.
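A minimal sketch, assuming the SQLAlchemy session and mapped classes from the previous example (for_author is an illustrative query, not something the pattern prescribes):

```python
class TweetRepository:
    """Represents the set of tweets; the ORM shows up only here."""

    def __init__(self, session):
        self.session = session  # a SQLAlchemy Session in this sketch

    def add(self, tweet: Tweet) -> None:
        self.session.add(tweet)

    def get(self, tweet_id: int) -> Tweet:
        return self.session.get(Tweet, tweet_id)

    def for_author(self, author: User) -> list:
        # SQL-specific concerns are centralized here, so performance
        # optimization efforts are centralized too.
        return self.session.query(Tweet).filter_by(author=author).all()
```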
This Repository pattern on its own is not enough. Because of this layer between objects and the database, we would constantly find ourselves struggling to keep things in sync, deciding at every turn whether to reconcile discrepancies with the database right there. Figure skating on the ice of concurrency.
To tie everything together, we can use the Unit of Work pattern. This is a context manager whose responsibility is deciding whether to COMMIT or to ROLLBACK. It handles opening the connection, mitigates concurrency issues, and writes changes efficiently.
But more importantly, the rest of the application never deals with database updates explicitly.
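Here is a sketch of such a Unit of Work as a context manager, again assuming SQLAlchemy; session_factory and the tweets attribute are naming choices for this example:

```python
class UnitOfWork:
    def __init__(self, session_factory):
        self.session_factory = session_factory  # e.g. a sessionmaker()

    def __enter__(self) -> "UnitOfWork":
        self.session = self.session_factory()      # opens the connection
        self.tweets = TweetRepository(self.session)
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        if exc_type is None:
            self.session.commit()    # write all changes in one efficient flush
        else:
            self.session.rollback()  # any error undoes the whole unit
        self.session.close()
```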
When a User posts a tweet, the operation enters a unit of work that exposes TweetRepository and its post method while handling concurrency. This method adds the tweet to the mailbox for each follower, and the unit of work is responsible for closing the connection.
Eventually, when one of their followers goes to the home page, the endpoint enters another unit of work that exposes the TweetRepository and its get method. This method should have the necessary information to figure out which data system to look in to retrieve the tweets, and the unit of work will close the connection afterwards.
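Put together, the endpoints could look roughly like this; uow_factory is a hypothetical callable that builds a fresh unit of work, and post and get are the repository methods described above:

```python
def post_tweet(uow_factory, author_id: int, text: str) -> None:
    with uow_factory() as uow:
        # post() adds the tweet to each follower's mailbox; the unit of work
        # handles concurrency and closes the connection on exit.
        uow.tweets.post(author_id, text)


def home_timeline(uow_factory, user_id: int) -> list:
    with uow_factory() as uow:
        # get() knows which data system to read the tweets from.
        return uow.tweets.get(user_id)
```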
Because that's the final twist: once approach 2 was implemented, Twitter's engineers moved to a hybrid of both approaches. Most users' tweets continue to be fanned out to their followers' mailboxes at the time they are posted.
But a small number of users are excepted from this: celebrities.
There are people followed by hundreds of millions, like Elon Musk, Barack Obama or Justin Bieber, and there are millions of others like me.
Tweets from these widely followed people are instead fetched from the database and merged into the user's home page at read time, because it is very resource-intensive to add their tweets to all of their followers' mailboxes; loading them just in time is much more efficient.
This hybrid approach is implemented entirely within the boundaries of the Repository. No other part of the code is aware of this change.
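As a sketch of how that hybrid could live inside the Repository (celebrity_ids, read_mailbox, followees and fetch_recent_by_authors are illustrative names, not Twitter's actual implementation):

```python
class HybridTweetRepository:
    def __init__(self, mailbox_cache, database, celebrity_ids: set):
        self.mailbox_cache = mailbox_cache
        self.database = database
        self.celebrity_ids = celebrity_ids

    def get(self, user_id: int, limit: int = 100) -> list:
        # Fan-out-on-write path: tweets already sitting in the user's mailbox.
        mailbox = self.mailbox_cache.read_mailbox(user_id, limit=limit)
        # Fan-out-on-read path: celebrities the user follows, fetched just in time.
        celebrities_followed = self.celebrity_ids & self.database.followees(user_id)
        celebrity_tweets = self.database.fetch_recent_by_authors(
            celebrities_followed, limit=limit
        )
        # Merge the two streams; callers never know the difference.
        merged = sorted(
            mailbox + celebrity_tweets, key=lambda t: t.created_at, reverse=True
        )
        return merged[:limit]
```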
This helps show that this design's core advantage is the ability to change your mind with regard to data storage without changing too much code.
If you are interested in learning more about this, I would highly recommend three sources:
- Architecture Patterns with Python, where these ideas and more are implemented in the context of Domain-Driven Design.
- Designing Data-Intensive Applications, which will help you navigate the diverse and fast-changing landscape of technologies for processing and storing data.
- Patterns of Enterprise Application Architecture, where the design patterns used in this talk, and more, are discussed in detail.