RMalloc is a malloc implementation, i.e. a dynamic memory allocator.
One reason for writing this version of malloc was a problem I encountered with one of my other programs when allocating several gigabytes. Due to the algorithms used, most of this memory (held in relatively small chunks) was repeatedly deallocated and allocated. This behaviour confused all other malloc implementations I tested and led to unacceptable program performance. So I decided to do it myself and wrote this library, which performed MUCH better.
The first version was a simple segregated-storage allocator without splitting and coalescing. Although it was much faster and scaled very well with respect to the number of processors, it suffered from severe fragmentation (as is known for this kind of allocator) when used with a special variant of one of my other projects (see libHmatrix). It performs very well, however, if the application only allocates blocks of a limited number of different sizes.
To reduce this fragmentation I completely rewrote the allocator and added splitting and coalescing. It now works similarly to LKmalloc, using private heaps with ownership for each thread: each memory chunk is put back into the heap it was allocated from.
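The splitting and coalescing idea can be sketched roughly as follows. This is a minimal illustration only; the block layout, the minimum-remainder constant, and the function names are my assumptions, not RMalloc's actual code:

```c
#include <stddef.h>
#include <stdbool.h>

/* Illustrative sketch: each block carries a small header with its payload
 * size and a free-flag; the next physical block starts directly after the
 * payload. Splitting carves a free remainder off an oversized block,
 * coalescing merges a block with its free physical successor. */

typedef struct block {
    size_t size;   /* payload size in bytes */
    bool   free;
} block_t;

/* the block physically following b in the same segment */
static block_t *next_block(block_t *b)
{
    return (block_t *)((char *)(b + 1) + b->size);
}

/* shrink b's payload to `want` and turn the remainder into a new free
 * block, provided the remainder can hold a header plus some payload */
static void split_block(block_t *b, size_t want)
{
    if (b->size >= want + sizeof(block_t) + 16) {
        block_t *rest = (block_t *)((char *)(b + 1) + want);
        rest->size = b->size - want - sizeof(block_t);
        rest->free = true;
        b->size = want;
    }
}

/* merge b with its physical successor if both are free */
static void coalesce(block_t *b)
{
    block_t *n = next_block(b);
    if (b->free && n->free)
        b->size += sizeof(block_t) + n->size;
}
```

A real allocator additionally needs boundary tags (or similar) to coalesce with the *preceding* block and must know where the segment ends; both are omitted here for brevity.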
The main differences between RMalloc and LKmalloc are the handling of small blocks, the allocation of memory from the operating system, and the mapping of heaps to threads.
For performance reasons, small chunks are only allocated from containers, which hold several equally sized chunks. Especially when many small blocks are allocated, this improves the overall performance of the allocator. It also avoids splitting and coalescing these small blocks (which might reduce fragmentation).
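The container idea can be sketched as follows. This is a hypothetical illustration under my own assumptions (the names `container_t`, `container_new`, etc. and the container capacity are invented); it only shows why allocation from a container is cheap: a push/pop on a per-size free-list, with no splitting or coalescing involved:

```c
#include <stddef.h>
#include <stdlib.h>

/* Illustrative sketch: a container holds a fixed number of equally sized
 * chunks and a free-list threaded through the free chunks themselves.
 * chunk_size must be at least sizeof(void *). */

#define CHUNKS_PER_CONTAINER 64

typedef struct container {
    size_t chunk_size;    /* all chunks in this container share one size */
    void  *free_list;     /* singly linked list threaded through free chunks */
    size_t used;          /* number of chunks currently handed out */
    struct container *next;
} container_t;

static container_t *container_new(size_t chunk_size)
{
    container_t *c = malloc(sizeof *c + CHUNKS_PER_CONTAINER * chunk_size);
    if (c == NULL)
        return NULL;
    c->chunk_size = chunk_size;
    c->used = 0;
    c->next = NULL;
    /* thread the free-list through the payload area */
    char *base = (char *)(c + 1);
    c->free_list = base;
    for (size_t i = 0; i + 1 < CHUNKS_PER_CONTAINER; i++)
        *(void **)(base + i * chunk_size) = base + (i + 1) * chunk_size;
    *(void **)(base + (CHUNKS_PER_CONTAINER - 1) * chunk_size) = NULL;
    return c;
}

static void *container_alloc(container_t *c)
{
    if (c->free_list == NULL)
        return NULL;                 /* container full */
    void *p = c->free_list;
    c->free_list = *(void **)p;      /* pop from the free-list */
    c->used++;
    return p;
}

static void container_free(container_t *c, void *p)
{
    *(void **)p = c->free_list;      /* push back onto the free-list */
    c->free_list = p;
    c->used--;
}
```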
Memory from the operating system is not allocated in equally sized stripes (as in LKmalloc) but in dynamically growing chunks via mmap (or sbrk). This reduces the number of calls to these system functions. The size of each chunk is adjusted according to the memory in use.
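A growth policy of this kind might look like the following sketch. The constants, the clamping bounds, and the name `next_segment_size` are illustrative assumptions, not RMalloc's actual parameters; the point is only that segments grow in proportion to the memory in use, so fewer mmap/sbrk calls are needed as the program grows:

```c
#include <stddef.h>

/* Illustrative growth policy: each new segment requested from the OS is a
 * fraction of the memory currently in use, clamped to sane bounds, and
 * always large enough to serve the request that triggered the growth. */

#define MIN_SEGMENT     ((size_t)1 << 20)   /* 1 MB lower bound  */
#define MAX_SEGMENT     ((size_t)1 << 28)   /* 256 MB upper bound */
#define GROWTH_FRACTION 4                   /* grow by mem_used / 4 */

static size_t next_segment_size(size_t mem_used, size_t request)
{
    size_t sz = mem_used / GROWTH_FRACTION;
    if (sz < MIN_SEGMENT) sz = MIN_SEGMENT;
    if (sz > MAX_SEGMENT) sz = MAX_SEGMENT;
    if (sz < request)     sz = request;     /* always cover the request */
    return sz;
}
```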
Heaps are associated with threads by using thread-private data (via the PThread functions). Thereby the number of heaps is automatically adjusted to the number of concurrent threads (LKmalloc needs some initialisation for this).
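The heap-to-thread mapping via thread-private data can be sketched with the standard pthreads key functions. The names (`heap_t`, `my_heap`) and the empty heap structure are illustrative assumptions; the mechanism (a lazily created per-thread heap, with a key destructor that can recycle the heap when the thread exits) is the relevant part:

```c
#include <pthread.h>
#include <stdlib.h>

/* Illustrative sketch of per-thread heaps via pthread thread-specific data. */

typedef struct heap {
    int id;
    /* ... free lists, containers, statistics ... */
} heap_t;

static pthread_key_t  heap_key;
static pthread_once_t heap_once = PTHREAD_ONCE_INIT;

/* runs automatically when a thread with a heap exits; RMalloc would mark
 * the heap as free for reuse by a new thread, here we simply discard it */
static void heap_release(void *h)
{
    free(h);
}

static void heap_key_init(void)
{
    pthread_key_create(&heap_key, heap_release);
}

/* return the calling thread's heap, creating it on first use */
static heap_t *my_heap(void)
{
    pthread_once(&heap_once, heap_key_init);
    heap_t *h = pthread_getspecific(heap_key);
    if (h == NULL) {                       /* first allocation in this thread */
        h = calloc(1, sizeof *h);
        pthread_setspecific(heap_key, h);
    }
    return h;
}
```

Because the key is created on demand via `pthread_once`, no explicit initialisation is needed before the first allocation, which matches the behaviour described above.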
RMalloc is free software where free is defined by the GNU Lesser General Public License. So share and enjoy ;).
- implemented deallocation of empty data segments if enough free memory is available
- non-empty wilderness blocks are now handled by the middle-size management, ensuring good-fit allocation on the wilderness (greatly reduces internal fragmentation in certain situations)
- support for 16 byte alignment added (for SSE applications)
- fixed bug when setting size of split blocks
- modified message and error output
- changed Rmalloc tracing to write more data
- removed STAT_FRAGMENTATION (was not used anymore)
- guarded munmap by mutex in mt-mode (seems not to be mt-safe)
- non-handled (large) blocks are now part of statistics
- changed license to LGPL
- ported to Win32
- guarded special functions by defines (mmap, getpagesize, etc.)
- moved trace.fd to trace.file (file IO instead of read/write)
- C++ operators now really throw std::bad_alloc
- handling of NULL pointers in allocation improved; no more ABORT but NULL-returns
- in check_rmalloc: global heap was not checked
- consistency check of wilderness, full/non-empty containers
- alignment of size for heap-allocation
- removed bug in r_realloc (wrong size for given block)
- NO_OF_CLASSES and NO_OF_SMALL must be even; changed accordingly
- reduced internal fragmentation for small chunks by using a special type (sblock_t)
- minor bugfixes for non-pthread mode
- added USE_THR_HACKS to simulate thread-specific-data
- restructured container-management; the empty-list not needed anymore
- removed needless data-segment-structure
- removed needless empty-lists for containers
- another major rewrite: added coalescing similar to LKmalloc
- maintained container-management for small blocks and coalescing strategy for middle-sized blocks
- removed rounding up of size-requests
- no longer memory-transfers between heaps
- major rewrite with containers which can be reused by other size-classes ⇒ led to massive fragmentation
- various code-cleanups
- removed size-class generation (now in extra file)
- replaced “real” thread-private-data by simulated one
- active fragmentation statistics with STAT_FRAGMENTATION (prints “real” consumption)
- moved top_pad into data-segment-structure
- used remaining blocks of each data-segment to serve future requests, thereby reducing fragmentation (left-over-blocks)
- small bugfixes
- top_pad is now heap-specific and adjusted dynamically (1/TOP_PAD_FRACTION of mem_used)
- changed size-field in nodes to slot; avoided too many calls to size_to_class
- moved handling of sizes larger than biggest size-class to operating-system (via malloc or mmap)
- fixed error-handling if no memory is available from the OS (sends kill to own process instead of exit(1))
- added another trace-method: trace by allocation
- rewritten chunk-handling: heap -> block -> chunk
- exchange of blocks between thread-heaps and global-heaps to reduce overall memory-consumption
- round sizes up to the next size-class to guarantee that there is a free chunk in the list (avoids searching)
- increased number of size-classes; changed size_to_class to use a binary search instead of linear
- ok, maybe 10 MB for DEFAULT_TOP_PAD is better?
- implemented trace-functionality for thread-heaps
- rearranged code for system-allocation (mmap and malloc had too much in common to be separated)
- removed most messages when mutexes were locked
- set MMAP to be default, even when malloc is not overwritten
- changed “r_mallinfo” to “rmallinfo” and added declaration in rmalloc.h
- changed heap handling, now each thread has a private heap; if the thread terminates, this heap is then marked as free and can be used by a new thread
- removed creation of first heap (besides global-heap), this will now be done on demand (in r_malloc)
- DEFAULT_TOP_PAD reduced to 5 MB (should also be enough)
- heap-list is no longer a circular list (just a linear one)
- heaps can now be aligned to the cache-line-size (ALIGN_TO_CACHE_SIZE); off by default
- added wrapper for mallinfo which calls internal r_mallinfo (r_mallinfo uses mallinfo-struct with ulong fields)
- replaced simulated thread-specific-data by real one
- fixed bug/feature when using malloc as system-alloc: only requested chunks of needed size were allocated, not plus DEFAULT_TOP_PAD
- initial public release
These benchmarks are similar to the benchmarks used to show the properties of Hoard.
This benchmark allocates a lot of small, equally sized chunks in each thread and tests parallel efficiency.
The following table shows the time for the different allocators when running sequentially.
| 132.2 sec | 219.7 sec | 131.7 sec | 119.8 sec | 115.7 sec |
The speedup wrt. the number of processors is shown in the following figure.
This benchmark simulates the allocation behaviour of a typical server application with multiple threads.
The number of allocations per time on one processor is shown in the following table.
The speedup of the different allocators is shown in the next figure.