Context
In this post I will mostly be dealing with the two implementations available in newlib for ARM platforms with no MMU or OS: newlib/libc/stdlib/mallocr.c and newlib/libc/stdlib/nano-mallocr.c. These are selected by the nosys.specs and nano.specs linker options respectively.
A Quick Comparison
| malloc | nano-malloc |
|---|---|
| best-fit | first-fit |
| lower fragmentation potential | higher fragmentation potential |
| 128 bins (1 kByte+ RAM overhead) | 1 bin (free_list) |
| 16 byte minimum chunk size | 8 byte minimum chunk size (for alignment) |
| ~2.4 kByte program space | < 128 bytes program space |
What is malloc?
Simply stated, malloc is the routine in the standard C library (libc) to allocate and manage the heap (or “free store”).
Why malloc?
malloc (of one type or another) is required in order to support dynamic memory allocation. In the embedded world, static allocation is typically the memory that is used most heavily. When it is possible to use stack memory, use it. However, there are instances when object sizes are not known at compile time, object sizes may change after instantiation, or the lifetime of a dynamic object must be independent from the scope in which it is instantiated. For this type of object, dynamic memory allocation is needed.
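For a concrete (and entirely made-up) example of the last case: the function below allocates a buffer whose size is only known at runtime, something neither static nor stack allocation handles cleanly. The read_samples name and its behavior are hypothetical.

```c
#include <stdlib.h>

/* The number of samples is only known at runtime, so the buffer cannot be
 * sized at compile time; malloc() provides the storage on demand. */
int *read_samples(size_t count)
{
    int *samples = malloc(count * sizeof *samples);
    if (samples == NULL) {
        return NULL;            /* heap exhausted: caller must handle this */
    }
    for (size_t i = 0; i < count; i++) {
        samples[i] = 0;         /* placeholder for real data acquisition */
    }
    return samples;             /* caller owns the buffer and must free() it */
}
```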
Why Care About Different Implementations of malloc?
Embedded systems are “resource constrained” meaning they don’t have the speed, data, or program space that a “desktop” system has. One of the most important implications of this fact is the lack of virtual memory. When a desktop system runs out of RAM, it can simply swap some of the data currently held in RAM to some other memory (the hard drive, for example). Embedded systems do not have this capability. Therefore, when an embedded system runs out of RAM, bad things happen. The end result is that the implementation of malloc must be fast, efficient, and control fragmentation.
What is the “Standard” Implementation of malloc?
The newlib malloc implementation is based (very closely) on the memory allocator proposed by Doug Lea. The high-level categorization of this allocator is “best-fit with coalescing”.
The glibc version appears to be a bit more complex at first glance while still originating with Doug Lea’s design. It appears to be much more optimized for speed at the expense of complexity and efficiency.
Commonalities
If there are no free chunks large enough to satisfy the memory request, sbrk() is called. This is a system call from the Unix universe whose purpose is to grow the available heap out of system memory. Both of the malloc implementations discussed here use the same sbrk(), and I was shocked to see that it has no check for the end of the heap memory section. This feels like a very serious, obvious, and easily fixed oversight, but I will be looking for some feedback from the community concerning this.
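To make the concern concrete, here is a minimal sketch of what a bounds-checked _sbrk() could look like. This is not the newlib/libgloss code: the signature varies between ports (often caddr_t _sbrk(int)), and the end and __heap_limit__ symbols are assumed to come from the linker script (the names are placeholders).

```c
#include <errno.h>
#include <stddef.h>

/* Symbols assumed to be provided by the linker script; the names
 * "end" and "__heap_limit__" are placeholders and vary by toolchain. */
extern char end;            /* first address after statically allocated RAM */
extern char __heap_limit__; /* highest address the heap is allowed to reach */

void *_sbrk(ptrdiff_t incr)
{
    static char *heap_end = &end;
    char *prev_heap_end = heap_end;

    /* The stock stub omits this bounds check; adding it turns silent
     * heap overflow into an allocation failure the application can handle. */
    if (heap_end + incr > &__heap_limit__) {
        errno = ENOMEM;
        return (void *)-1;
    }

    heap_end += incr;
    return prev_heap_end;
}
```

With a check like this in place, exhausting the heap region should surface as a NULL return from malloc() rather than the heap silently growing into whatever memory sits above it.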
What is Different about nano-malloc?
The high-level categorization of the nano malloc is “first-fit with coalescing”. In general, first-fit is fast and small (in code size), but has the drawbacks of being less RAM efficient and more likely to fragment the heap. Additionally, the free list is implemented as a singly-linked list rather than the doubly-linked lists used by the standard implementation.
malloc Analysis
Overhead
- Uses bins.
- These bins allow for a faster best-fit algorithm at the expense of RAM.
- Each of the 128 bins is a doubly-linked list of the chunks of a specific size. This requires malloc to hold an array of pointer pairs (generally 4 bytes per pointer) to the head of each bin (see the sketch below).
- RAM usage: 1024 bytes
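As a rough sketch of where that kilobyte goes (this is not the actual newlib data structure, just an illustration of the bookkeeping):

```c
#define NBINS 128

/* Illustrative only: each bin holds the head of a doubly-linked list of
 * free chunks in that size class, i.e. a forward and a back pointer. */
struct bin_head {
    void *fd;   /* first free chunk in this bin */
    void *bk;   /* last free chunk in this bin  */
};

static struct bin_head bins[NBINS];

/* On a 32-bit ARM target: 128 bins * 2 pointers * 4 bytes = 1024 bytes. */
```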
malloc Routine
A simple algorithm can be used to determine which bin should be searched first (the “target bin”) for an appropriately-sized free chunk. In the case of bins for sizes less than 512 bytes, all the chunks within each bin are exactly the same size. Therefore, as long as there is a free chunk in the target bin, the first chunk is used immediately. If no free chunk is available in the target bin, the search progresses to the next bin and so on until a free chunk is found; that chunk is then split if it is large enough for the remainder to support a minimum-sized chunk. Alternatively, in the case of bins of sizes >= 512 bytes, each bin’s chunks are sorted from smallest to largest. This makes the first chunk found that satisfies the memory request the best-fit option.
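The search can be sketched roughly as follows. This is not the newlib code: the chunk layout is simplified, the bins array and bin_index are assumed to come from bookkeeping like the sketch above, and the split and unlink steps are only outlined in comments.

```c
#include <stddef.h>

#define MIN_CHUNK 16u            /* minimum chunk size from the table above */

struct chunk {
    size_t        size;          /* usable size of this free chunk */
    struct chunk *fd;            /* next chunk in the same bin */
    struct chunk *bk;            /* previous chunk in the same bin */
};

/* Scan the target bin and then successively larger bins for the first
 * chunk that satisfies the request. In the sorted (>= 512 byte) bins,
 * the first fit found is also the best fit. */
struct chunk *find_chunk(struct chunk *bins[], size_t nbins,
                         size_t bin_index, size_t req)
{
    for (size_t b = bin_index; b < nbins; b++) {
        for (struct chunk *c = bins[b]; c != NULL; c = c->fd) {
            if (c->size < req)
                continue;        /* only possible in the ranged, sorted bins */
            if (c->size - req >= MIN_CHUNK) {
                /* Split: the remainder at (char *)c + req becomes a new
                 * free chunk and is returned to its own bin. */
            }
            return c;            /* unlink from its bin and hand it out */
        }
    }
    return NULL;                 /* nothing fits: fall back to sbrk() */
}
```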
free Routine
First, the memory neighboring the to-be-freed chunk is checked for already-free chunks that can be coalesced. The resulting chunk (either the original to-be-freed chunk or a larger, coalesced chunk) is inserted into the appropriate-size bin (in order of size in the case of bin size >= 512 bytes).
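A rough outline of that flow (again, not the newlib code: real chunks locate their physical neighbors via size fields and boundary tags rather than explicit pointers, and bin_insert()/bin_remove() are placeholders for the bin bookkeeping):

```c
#include <stdbool.h>
#include <stddef.h>

struct chunk {
    size_t        size;
    bool          in_use;
    struct chunk *phys_prev;  /* physically adjacent neighbors in memory */
    struct chunk *phys_next;
};

/* Placeholder bin bookkeeping; the >= 512 byte bins insert in size order. */
static void bin_remove(struct chunk *c) { (void)c; }
static void bin_insert(struct chunk *c) { (void)c; }

void free_chunk(struct chunk *c)
{
    c->in_use = false;

    /* Merge with the lower neighbor if it is already free. */
    if (c->phys_prev != NULL && !c->phys_prev->in_use) {
        bin_remove(c->phys_prev);
        c->phys_prev->size += c->size;
        c->phys_prev->phys_next = c->phys_next;
        if (c->phys_next != NULL)
            c->phys_next->phys_prev = c->phys_prev;
        c = c->phys_prev;
    }

    /* Merge with the upper neighbor if it is already free. */
    if (c->phys_next != NULL && !c->phys_next->in_use) {
        struct chunk *n = c->phys_next;
        bin_remove(n);
        c->size += n->size;
        c->phys_next = n->phys_next;
        if (n->phys_next != NULL)
            n->phys_next->phys_prev = c;
    }

    bin_insert(c);   /* the result goes into the bin matching its new size */
}
```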
nano-malloc Analysis
nano-malloc Routine
The free_list is searched from the beginning for the first block that is large enough. If the block found is large enough to be split, with the remainder still large enough to hold a valid, minimum-sized block, the split is performed. Otherwise, the entire block is removed from the free_list.
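A compact sketch of that first-fit walk (illustrative only, not the actual nano_malloc() code: the block layout and names are invented, the direction of the split is simplified, and alignment handling is omitted):

```c
#include <stddef.h>

#define MALLOC_MINSIZE 8u          /* minimum chunk size from the table above */

/* Simplified free block: a size and a pointer to the next free block.
 * The real nano-malloc chunk header differs. */
typedef struct block {
    size_t        size;            /* usable bytes following the header */
    struct block *next;
} block_t;

static block_t *free_list;

void *nano_malloc_sketch(size_t req)
{
    block_t **link = &free_list;

    for (block_t *b = free_list; b != NULL; link = &b->next, b = b->next) {
        if (b->size < req)
            continue;                              /* first fit: keep walking */

        if (b->size - req >= sizeof(block_t) + MALLOC_MINSIZE) {
            /* Split: the tail stays on the free_list as a smaller block. */
            block_t *rest = (block_t *)((char *)b + sizeof(block_t) + req);
            rest->size = b->size - req - sizeof(block_t);
            rest->next = b->next;
            *link      = rest;
            b->size    = req;
        } else {
            *link = b->next;                       /* take the whole block */
        }
        return (char *)b + sizeof(block_t);        /* payload follows header */
    }
    return NULL;                                   /* no fit: would call sbrk() */
}
```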
One can see how this might lead to additional fragmentation (over a best-fit algorithm) due to the potentially higher likelihood of splitting larger blocks into smaller ones.
nano-free Routine
The free_list is searched from the beginning to find the insertion point for the to-be-freed block (free_list is sorted by address). This leads to the question, “Since free_list is sorted anyway, why not sort it by size instead of address to implement best-fit rather than first-fit?” I imagine one of the biggest reasons is malloc time, but I personally feel that it would be worth it to reduce the potential for fragmentation (I see fragmentation as the single most influential reason to avoid or eliminate heap use in embedded systems). The worst-case malloc would still be an iteration through the entire free_list, but with best-fit the length of the search also depends on the size of the request (unless bins are used, as in the standard implementation).
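For completeness, a matching sketch of the address-ordered insert with coalescing (illustrative, reusing the same simplified block layout as the malloc sketch above). Note that the address ordering is what lets a singly-linked list find and merge physically adjacent neighbors cheaply.

```c
#include <stddef.h>

typedef struct block {             /* same simplified layout as above */
    size_t        size;
    struct block *next;
} block_t;

static block_t *free_list;

void nano_free_sketch(block_t *b)
{
    block_t **link = &free_list;
    block_t  *prev = NULL;

    /* free_list is sorted by address: walk to the insertion point. */
    while (*link != NULL && *link < b) {
        prev = *link;
        link = &(*link)->next;
    }
    b->next = *link;
    *link   = b;

    /* Merge with the following block if the two are physically adjacent. */
    if (b->next != NULL &&
        (char *)b + sizeof(block_t) + b->size == (char *)b->next) {
        b->size += sizeof(block_t) + b->next->size;
        b->next  = b->next->next;
    }

    /* Merge with the preceding block in the same way. */
    if (prev != NULL &&
        (char *)prev + sizeof(block_t) + prev->size == (char *)b) {
        prev->size += sizeof(block_t) + b->size;
        prev->next  = b->next;
    }
}
```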