Creating a Memory Allocator for Fun


Lately I’ve been reading the well known Operating Systems: Three Easy Pieces book. It is one of the best operating systems books available, so it goes without saying that the book is very much worth reading – and it’s even available for free, though you should probably buy it if you can.

A nice part of the book is that it has a list of projects you can work on to deepen your understanding of the concepts. One such project is about writing a memory allocator, i.e. a malloc() and free() replacement. I already had an idea how malloc() worked, having been exposed to its internals when working on some CTF challenges, as well as the book itself. However, I felt very motivated to write my own memory allocator, so I took up the challenge. You can find the code on GitHub if you’re interested, or continue reading below to learn how it works.

First things first

In case you’re not familiar at all with how memory allocation works, in very shallow terms, there’s the stack and the heap. The stack is used for things like local variables, function call arguments, and generally things with a fixed size. The stack size itself is pretty limited – in my macOS device the stack size seems to be about 8MB, and the max stack size is about 64MB (using the command launchctl limit stack). For example, if you have a program with very deep recursion (i.e. too many function calls in the call stack), you’ll get the dreaded stack overflow error.

The heap doesn’t have such problems. You’d have no problem allocating 4GB of memory on the heap, and the OS wouldn’t even flinch (assuming you have plenty of free memory). In a gist, the stack is short-lived memory that is managed for you, whereas the heap is long-lived memory you have to manage by yourself, by using calls such as malloc() and free() which are part of a memory allocation library. Calling malloc(50) allocates 50 bytes on the heap, and gives you a pointer to that allocated memory. Calling free(pointer) will release the occupied memory, and make it available for future requests.

A memory allocation library is basically the heap middleman between you and the OS; it “buys” memory in bulk from the OS, and “sells” it to you in as small amounts as you desire. You can also cut out the middleman and build your own little memory allocation library, which I will be describing next.

A short demo

Before we start diving into its internals, the memory allocation library – which I’ve creatively named TAlloc – offers a very simple API. There’s TAlloc_malloc(), and TAlloc_free(). You would use these two just as you’d use malloc() and free(). Here’s an example:

char *very_large_string = (char *) TAlloc_malloc(10000);
for (int i = 0; i < 10000; ++i) {
  very_large_string[i] = '\0';
}
// do something here...
// finally free the allocated string
TAlloc_free(very_large_string);

As you can see, it’s pretty much the same thing, only with slightly more cumbersome function names.

Cutting out the middleman

Modern operating systems use units called pages for memory management. This is the smallest unit of memory the OS can give out, no matter how small your memory needs. In order to ask the OS directly for memory, we can use the mmap() function, which will make a system call behind the scenes.

TAlloc manages memory in what are called arenas. An arena is basically a large-ish block of memory. Any time memory is requested, a slice is taken from an arena that can accommodate that request.

typedef struct __talloc_state_t {
	talloc_arena_t *arena_head; // the head of the arena linked list
	talloc_arena_t *arena_tail; // the tail of the arena linked list
	size_t minallocsize, pagesize; // the page size
	char initialized; // has the first arena been allocated?
} talloc_state_t;

Above you’ll see the definition of talloc_state_t, which essentially stores the global state of our memory allocator. minallocsize represents the smallest size an arena can have, whereas pagesize simply stores the page size of the system, to avoid multiple system calls. Any time a new arena is created, its size will be at least minallocsize, and it will also be a multiple of pagesize.

You might notice arenas are stored as a linked list. Precisely, they’re stored as a doubly linked list: each arena points to both the previous as well as the next arena in the list.

typedef struct __talloc_arena_t {
	size_t allocated; // total space taken by the arena including space needed for metadata
	size_t max_free_space; // space of the largest free chunk available
	talloc_chunk_t *free_list; // free chunks linked list
	struct __talloc_arena_t *next; // next arena in the list
	struct __talloc_arena_t *prev; // previous arena in the list
} talloc_arena_t;

As I said, any time memory is requested, we take up the requested amount of space from an arena that has enough free space to accommodate the request. Above you can see the talloc_arena_t struct listed, which represents a single arena. prev and next are basically the previous and the next arena in the linked list, which I have already mentioned.

The relationship between the allocator’s state and the arenas

Of special interest in the talloc_arena_t struct are the first three fields. allocated basically tells us what the total space the arena is occupying is. This is more than the usable memory, and includes the space needed for metadata, such as the struct itself. Next we have max_free_space, which tells us the largest chunk of memory we can allocate in this arena. So, if for an arena X, max_free_space is 100, and we want to allocate 80 bytes of memory, then X can fulfill our request. If we want more than 100 bytes, we need to look for another arena, or create a new one if none of the existing arenas has that much free space.

Next, there’s free_list, which, as you might imagine, is yet another linked list. This is a pointer to the head of a linked list which stores free chunks of memory in an arena. When an arena is first initialized, free_list will contain a single chunk which represents all of the available space in the arena. As memory of the arena is allocated and freed, the available space of the arena will become fragmented, and free_list will probably contain multiple chunks instead of a single one.

typedef struct __talloc_chunk_t {
	size_t size; // available size in the chunk
	struct __talloc_chunk_t *next; // next free chunk
} talloc_chunk_t;

The talloc_chunk_t struct is rather simple, containing the size of the chunk in bytes, as well as a pointer to the next free chunk in the list.

An example of a newly initialized arena with a total of 1000 bytes. 56 bytes are subtracted from the total: 40 for the size of talloc_arena_t, and 16 for the size of talloc_chunk_t.

Now perhaps you might be wondering, what about the actual free space? Where is it? Well, the actual free space for each chunk is right after the chunk itself, as shown in the little diagram below:

Simple pointer arithmetic gives us the pointer to the free space

By now we should have a fuzzy picture of how the pieces fall together. We have the state, which holds a list of arenas. For each arena, we have a list of chunks of free memory, with varying sizes. Anytime we receive a request to allocate N bytes of memory, we traverse the arenas linked list until we find an arena that has a max_free_space greater than or equal to N. If there is no such arena, we then attempt to allocate another arena, returning NULL if that’s not possible. Otherwise, if we do find an arena with enough free space, we traverse the list of free chunks, until we find the first chunk that has a size that is no less than N. Once we find that, we remove the chunk from the free chunks list, and return a pointer to the allocated memory back to the requester.

This approach is called first-fit memory allocation. The first chunk that fits the space requirements is the one we use.

Deeper into the rabbit hole

You might have noticed that I haven’t mentioned anything about keeping track of allocated chunks of memory. This is because we don’t.

typedef struct __talloc_header_t {
	size_t size; // size of the allocated memory
	uintptr_t magic; // the magic field which should be equal to TALLOC_MAGIC
} talloc_header_t;

When a chunk gets allocated, we “replace” the talloc_chunk_t header with talloc_header_t. I put replace in quotes because there’s no actual replacement occurring, rather a simple recasting, since both structs have the same memory footprint. We then put a known constant into the magic field for some primitive integrity checking.

An example of an arena with three free chunks

For example, take the arena above. For simplicity, let’s assume this is the only arena we have. It has three chunks in its free chunks list, with the largest chunk having 50 bytes of available space, hence max_free_space‘s value. Now, suppose we want to allocate some bytes for a string:

char *my_string = TAlloc_malloc(30);
// do something with the string here...

Obviously the best choice here would be the third chunk, since it has exactly 30 bytes of free space ready for us to use. However, since we’re using the first fit approach in TAlloc, we’ll actually end up using the second chunk instead. The following diagram shows what happens to the arena after the allocation.

Our example arena after 30 bytes were allocated via TAlloc_malloc()

An interesting thing here is that the second chunk was not used completely. Rather, we simply split the chunk into two, only taking the 30 bytes we needed, leaving behind the excess space.

Notice how the max_free_space is now 30. Also, the second chunk now has 12 bytes of free space, instead of the expected 20. The reason for this, is that we need 8 bytes for the talloc_chunk_t. To the right of the diagram, you will see that my_string points to the free space, and right before it there’s the talloc_header_t header. Once we call TAlloc_free(my_string), we’ll use that header to decide whether this is a valid pointer we’re freeing by verifying the magic. If the magic number checks out, we will then return this now free space to the arena. So let’s suppose we just freed my_string.

Normally, we could just return the freed chunk to the free list and be done with it. But this would lead to fragmentation in a way that would effectively render our arenas unusable. Instead of the chunk with 50 bytes of free memory we originally had, after freeing my_string, we’d end up with two chunks, one 12 bytes, the other 30 bytes. This is where colaescing comes into play.

After my_string was freed, we have two contiguous free chunks

As you can see in the diagram above, we are in a situation where we have two chunks which are contiguous in memory. For reference, I’ve added the location to where my_string was pointing. Since we have two free chunks one after another, with no space in between, we can safely merge them into a single chunk, i.e. we coalesce the two chunks.

talloc_chunk_t *insert_after = arena->free_list;
while (insert_after->next && insert_after->next < chunk) {
	insert_after = insert_after->next;
}
chunk->next = insert_after->next;
insert_after->next = chunk;
TAlloc_coalesce(chunk);
TAlloc_adjust_space_for_new_chunk(arena, chunk);
TAlloc_coalesce(insert_after);
TAlloc_adjust_space_for_new_chunk(arena, insert_after);

To achieve this, we need to keep the linked list sorted. The code above shows only part of the TAlloc_free() code. This is the generic case where the chunk needs to be inserted somewhere in the middle of the linked list.

The chunks are sorted based on their address, with the lowest one being first. This ensures that we can easily determine if coalescing is possible, thus keeping memory fragmentation low. After inserting the chunk, we then take care of colaescing any chunks that are next to one another. We first try to colaesce the inserted chunk with the one that comes right after, and then colaesce the chunk right before the inserted chunk with the inserted one.

TAlloc_coalesce() is a very simple function, which simply checks whether two chunks are next to one another, and then merges them together if so. On the other hand, TAlloc_adjust_space_for_new_chunk() adjusts the arena’s max_free_space if necessary. Both are pretty simple functions; you can consult with the source code if you’d like to see the implementation.

TAlloc_malloc() should be relatively easy to understand now that you’ve read how TAlloc works. Last but not least, I feel like TAlloc_debug_print() deserves a shout-out. Calling it will print out to standard output the current layout of the heap that TAlloc is managing. Here’s an example output:

Arena at 0x102544000, 16384000 bytes, 40 reserved
  Allocated chunk at 0x102544028, 10 bytes, 16 reserved
  Allocated chunk at 0x102544042, 50 bytes, 16 reserved
  Allocated chunk at 0x102544084, 2000 bytes, 16 reserved
  Free chunk at 0x102544864, 16381836 bytes, 16 reserved
Arena at 0x1034e4000, 20004864 bytes, 40 reserved
  Allocated chunk at 0x1034e4028, 20000000 bytes, 16 reserved
  Free chunk at 0x1047f6d38, 4792 bytes, 16 reserved

For each arena, it will print out allocated and free chunks, the sizes, as well as reserved space, which is basically the overhead for any structs. For arenas the total bytes include reserved space; for allocated and free chunks the total bytes represent only the usable space. You might be thinking: “but I thought we don’t store allocated chunks anywhere?”. And you’re right. But the function makes some educated guesses. For a particular address in memory, it ensures that the value where the magic value would be stored is equal to TALLOC_MAGIC. If so, it prints out the details of the allocated chunk. Otherwise, it treats it as a free chunk instead.

Finally…

It goes without saying that TAlloc is merely an experimental implementation meant for educational purposes. It’s probably not safe to use in production, and it’s not really that efficient. It’s also not thread-safe. And I also feel like I should mention that I don’t code in C for a living, so don’t expect TAlloc to be written in idiomatic C code. I don’t know about you, but to me that’s a lot of negatives.

That said, I think TAlloc is a good example to learn the basics of how memory allocators work. And I am truly biased in saying that, because I haven’t really read any other implementations before I rolled my own. Though I do hope you learned something interesting from this post.