Fishing for Abstraction
Zach Wade

When building large systems, there is an inherent tension between complexity and simplicity. Trend too far in the "simple" direction and you may find yourself writing business logic in Basic-84. Dive too far into complexity and you may find yourself unable to construct effective models, ultimately overwhelming yourself and others.

Yet we could also argue that complexity and simplicity are not so much in opposition as two sides of the same coin. In particular, we use the concept of "abstraction" to take complexity and present it as simplicity. Let's take some time to understand what we mean by abstraction here, and then explore the practical relationship between complexity and simplicity as it pertains to systems and software.

A Platonic Ideal

Abstraction is the backbone of software. Whether we use the term to describe "physical" systems that impose constraints (think, for instance, of an abstract class), or whether we instead refer to so-called zero-cost abstractions, these models allow us to reason about code in an effective, portable manner.

A good abstraction is a bit like the allegory of the cave — the true system exists just out of sight while an immaterial light casts a projection of it for us to see. The shadows on the wall don't represent the full reality, but they hold a glimmer of truth as seen from one particular perspective. At its core, abstraction is the process of finding the right angle to shine the light such that the resulting shadows are useful.

Please don't make me write Python

To avoid waxing even more poetic, let's look at this in practice. File management lies at the core of nearly every application. Even if a program never reads or writes files on its own, the operating system must at least read its executable in order to run it.

Additionally, file systems house a tremendous amount of complexity. Not only are there many file-system formats, each with its own data structures and on-disk layout, but even within the kernel itself there are several ways of managing file descriptors.

Going back to our allegory, there is never a single canonical abstraction. Instead, an abstraction takes the complexity of the system and views it from a particular angle to make it appear simple. Let's explore some of the angles from which we might view our file system.

The simplest of these might be the kinds of file system APIs you see in high-level interpreted languages such as Python. These APIs tend to look a bit like:

def open(path: str, mode: str) -> File: ...

class File:
    def read(self, n: int) -> bytes: ...
    def write(self, data: bytes) -> int: ...
    def close(self): ...

Even if you've never written a word of Python, I would imagine you can understand the nature of this API. This is for good reason: the abstractions chosen for file-system operations in Python are very simple! They take the broad complexity of the file system and package it into something highly digestible.

Does this make the abstraction inherently good? For many use cases, perhaps, but there are plenty still for which such an abstraction is poorly suited. Let's turn our eyes slightly closer to the flames and see file systems from a different perspective.

What's a u-ring?

When Python goes to implement the above abstraction, it finds itself faced with a different one: the abstraction provided by the kernel. The kernel has its own notion of files and operations, each of which is handled via a dedicated syscall:

int open(const char *path, int flags, mode_t mode);
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
int close(int fd);

If you squint, you'll see that these two abstractions bear a lot of similarity. One would even be forgiven for thinking that the Python variant offers little benefit over the raw kernel operations. However, consider for a moment the alternative set of APIs offered by the io_uring functionality introduced in Linux 5.1, shown here through its liburing helpers. These APIs look very different:

int io_uring_queue_init(unsigned entries, struct io_uring *ring, unsigned flags);
struct io_uring_sqe *io_uring_get_sqe(struct io_uring *ring);
void io_uring_prep_read(struct io_uring_sqe *sqe, int fd, void *buf,
    unsigned nbytes, __u64 offset);
int io_uring_submit(struct io_uring *ring);
int io_uring_wait_cqe(struct io_uring *ring, struct io_uring_cqe **cqe_ptr);

This abstraction looks so wildly different that it must clearly serve a different purpose than our initial Python abstraction, right? And yet — these APIs were designed for exactly the kind of consumer Python is: a runtime that wants to perform high-throughput, asynchronous file operations. As a consumer of Python you never need to know about the use of a u-ring, but a Python maintainer would. Neither abstraction is better or worse than the other; they simply serve two fundamentally different purposes.

Designing abstractions

The above example may seem overly contrived. After all, of course file system operations should look different in Python and in C. Yet there's no reason the Python abstraction has to look the way it does. Rather, it was designed that way from the beginning because the developers thought it would be most effective for their users.

A refrain I repeat often is that "code is communication." Certainly it's communication with the computer (i.e. "here's what I want you to do"), but it is just as much communication with your teammates and with yourself. When designing any abstraction, whether expensive or cheap, you should first determine what it is that you are trying to communicate.

Knowing what to withhold

At Compass, we rely somewhat heavily on zero- or low-cost abstractions. These can be thought of as purely communicative: they make use of no actual runtime functionality, but they use various forms of static analysis (such as a type system) to impose a specific perspective on their consumers.

Consider the following TypeScript code:

class PgBackedFs {
    constructor(connectionPool: Pool, table: string);

    getConnection(): PgConnection;

    writeToBuffer(bufferId: number, offset: number, data: Buffer): void;
    readFromBuffer(bufferId: number, offset: number, size: number): Buffer;
    syncBuffer(bufferId: number): Promise<void>;

    open(path: string): Promise<number>;
    read(fd: number, size: number): Promise<Buffer>;
    write(fd: number, offset: number, data: Buffer): Promise<number>;
}

interface FsImpl {
    open(path: string): Promise<number>;
    read(fd: number, size: number): Promise<Buffer>;
    write(fd: number, offset: number, data: Buffer): Promise<number>;
}

function newSftpServer(port: number, fileSystem: FsImpl);

We have code much like this in our codebase in order to create a virtualized SFTP server for syncing customer data. When we initialize the server we pass it an instance of PgBackedFs. Before we do that, however, we widen its type to FsImpl.

I mention this because doing so is unnecessary! We could just as easily have typed fileSystem as PgBackedFs. That would be no less correct, and the resulting code would be no less functional. However, at some point I or another engineer will be editing the SFTP server and will need to add or change a file-system operation. By offering them the explicit abstraction of FsImpl, I've preemptively communicated how they should be using it, not just the raw functionality it offers.
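
Concretely, the wiring looks something like the sketch below. The connection string, table name, and port are illustrative placeholders rather than our actual configuration:

import { Pool } from "pg";

// Illustrative values only; the connection string, table name, and port are made up.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const pgFs = new PgBackedFs(pool, "customer_files");

// Widen to the abstraction we want the SFTP server to see: from here on, only
// open/read/write are visible, and the buffer/connection plumbing disappears.
const fileSystem: FsImpl = pgFs;

newSftpServer(2222, fileSystem);

Because TypeScript types are structural and erased at runtime, the widening costs nothing when the code runs; it only changes what the type checker lets downstream code reach for.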

As a principle, consider what the "happy path" for your users looks like, and offer them an abstraction that makes it difficult to do anything else.
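
To make that concrete, imagine a handler inside the SFTP server like the following (sftpWrite is an invented name, not part of our actual server):

// Because the parameter is typed as FsImpl, reaching for the Postgres internals
// simply doesn't compile, while the intended path stays easy.
function sftpWrite(fs: FsImpl, fd: number, offset: number, data: Buffer): Promise<number> {
    // fs.writeToBuffer(0, offset, data);  // type error: writeToBuffer is not on FsImpl
    return fs.write(fd, offset, data);
}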

Knowing what to offer

The other half of abstraction design is understanding what to provide your consumers. In the above example, we primarily think of the FsImpl interface as the abstraction over PgBackedFs. However, PgBackedFs is itself an abstraction over Postgres. Furthermore, it's a costly abstraction, one with actual code behind it (omitted here) in order to make the resulting API work.
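
To make "actual code" concrete, here is a purely hypothetical sketch of what one such method might do. The table layout, column names, and the idea of reusing a row id as a file descriptor are all invented for illustration; this is not our implementation:

class PgBackedFs {
    constructor(private connectionPool: Pool, private table: string) {}

    // Sketch: resolve a path to a row id and hand that id back as the "file descriptor".
    async open(path: string): Promise<number> {
        const result = await this.connectionPool.query(
            `SELECT id FROM ${this.table} WHERE path = $1`,
            [path],
        );
        return result.rows[0].id;
    }

    // ...the remaining methods, omitted as above
}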

How do we choose which methods to include in our final abstraction? We could easily have added half a dozen others. For instance, why not a syncAllBuffers method that flushes every buffer in the cache? Likewise, why not bulkRead or bulkWrite methods?

Many people are quick to claim that premature abstraction is the root of all evil. In more extreme cases, this becomes an argument against abstraction in general. But abstraction is merely a tool, like any other piece of software, and a tool should be applied when there is a problem for it to solve. This leads to the golden rule of abstraction design: don't build an abstraction until you need it!

In the above case, we didn't implement those methods because they're not needed for the API we had in mind. Similarly, we didn't extend the FsImpl API with additional functionality because the SFTP server doesn't need it. Build what you need, when you need it, and use good abstractions to do it well.

Final Thoughts

Abstractions are neither inherently evil nor inherently good. System design will always be about tradeoffs, and finding the right abstraction is a large piece of that.

At CompassRx, we tackle complex regulatory and operational problems by designing systems whose technical abstractions closely mirror our real-world constraints. If this kind of abstraction design sounds interesting to you, send us an email and let's chat!