* memory allocators, cont.

1. kmalloc/etc.
   - can cause a lot of fragmentation
   - searching for a free contiguous unit for a new allocation can be
     costly, as much as O(n).

2. page-based allocators give you N pages, non-contiguous, and always in
   multiples of 4KB (PAGE_SIZE)
   - useful for process mem allocation, b/c page protection, the MMU, etc.
     all work in pages.
   - works well for caching disk/filesystem data.  Historically, disk
     sector sizes were 512B.  Newer storage devices can read/write in 4KB
     "sector sizes" (some support both 512B and 4KB).  Retrieved 4KB
     sectors fit perfectly into 4KB pages in memory (page/buffer cache).
   - but: only works if your data set can be broken into neat 4KB units.

* custom memory allocators

Try to minimize/eliminate fragmentation, and speed up alloc/free of units.

Idea: if you need to manage many objects of size X (meaning many
alloc/free calls for such objects), then create a custom allocator for
size X:

1. get a few whole pages of memory
2. break those pages into packed units of size X

E.g., say X is 130 bytes long.  Alloc a 4KB page and treat it as an
array of 130B units.  How many fit?  4096/130 = 31 units, with 66 bytes
left over.  We need 31 bits to record which of the 31 units are free or
used, rounded up to 4 bytes.  So use some of the 66 free bytes in this
page as a "header" recording the free list as a bitmap.  Assume that a
'1' bit means allocated and '0' means free, and that this free-list
header comes first (first 4 bytes).  The 4KB page will be:

1. first 4 bytes: header (free-list bitmap and anything else you need)
2. next 31*130 = 4030 bytes: the "data" units to be allocated
3. leftover 62 bytes (wasted)

Create a custom API for allocating X, call it alloc_x() and free_x().

alloc_x(void): returns the start addr of a "free" unit of X
1. search the bitmap for a '0' bit: shift bits to the left, counting
   until you find a '0' bit.  Say the count was 15, meaning unit 15 of
   the 31 units is free.
2.
   calculate the starting addr as
       ret = start_addr_of_4kb_page + 4 (header size) + 130*15
   Note: this counts units from 0; if counting from 1 (the "15th" unit),
   the math becomes 130*(15-1).
   (fast to calculate, O(1))

free_x(addr of X previously given):
1. find the offset within the 4KB page that this address belongs to.
   The offset is the "position" within the bitmap that corresponds to
   the Nth (e.g., 15th) unit of X out of the 31 available:
       offset = (addr - start_addr_of_4kb_page - 4) / 130
2. turn off bit number "offset" in the 4-byte header (set it from 1 to 0)
   (freeing is also fast, O(1))

Q: should we reset the memory on free?  (or on alloc?)
A: not by default, but yes if you enable some mem debugging.

Q: what if all units in my 4KB page are in use, and no more are free?
A1: simple -- return NULL.
A2: friendlier -- alloc another 4KB page and break it into another 31
    units of X.  And so on.  But now we need some sort of global list or
    array of the start addrs of all pages used by this custom allocator.
    It also means we have to search in O(N), where N is the number of
    pages allocated for this custom allocator (so the cost of alloc_x
    increases slightly).  Can also keep a per-page counter of how many X
    units are free/used.  Can also return whole pages back to the kernel
    when none of their X units are used.  Finally, can also cap how many
    pages or X units are available, and throttle heavy users of alloc_x
    (put them to sleep, a la kmalloc).

Sometimes when you init an allocator for X, you pre-reserve N pages of
memory in anticipation of needing that many allocations of X.  But be
careful not to over-reserve and wind up not using much of it.

Custom allocators are used for popular objects like struct inode and
struct dentry.  Linux has APIs for creating a pool of N objects of size
X (most file systems use them).

Q: what if the normal malloc allocator takes mem out of this custom
   allocator's memory?
A: don't let malloc or other allocators in, else it defeats the purpose
   of this custom allocator.

Q: what if we need to alloc units larger than 4KB?!
A: then custom allocators are not a suitable method; you'll have to
   break the data over multiple 4KB pages (an order-N allocator), or
   rethink your design (e.g., distributed memory systems).

* freeing memory

When the kernel is under mem pressure, it suspends other callers
(sleep/wait) and goes to clean memory.  Then kthreads wake up to clean
up:

1. go over the page cache: look for readonly pages, and release them.
   (most effective, b/c we free 4KB units)
2. go over other caches, looking for objects with refcount 0: release
   them.
3. then go over "dirty" objects that need to be flushed/synced with the
   underlying I/O media (e.g., dirty pages of a file need to be sync'ed
   to the file, then the page can be freed; see address_space_operations
   ->writepage/s).

In Linux, one or more kthreads wake up to clean, called kflushd,
pdflush, bdflush (different names over time).  The basic technique uses
two thresholds: a low and a high watermark (LW and HW), designated as
percentages of the whole page cache that is dirty.  E.g., LW=30% and
HW=70% (reconfigurable).  When the kthread wakes up, it checks what
fraction of pages are dirty now:

1. If fraction < LW: don't need to do much; pick some number of dirty
   pages (default 32) and ask the f/s or other subsystem to flush them
   (e.g., ->writepage) asynchronously (note: flags passed to ->writepage
   say whether a/sync is requested).
2. If fraction >= LW but < HW: pick N pages (e.g., 32) and ask the
   system to write them synchronously (e.g., blocking new heavy
   writers).
3. If fraction > HW: may have to take more drastic options.  Block all
   I/O activity for a time.  Count how many consecutive times we've been
   over HW: if above some threshold, and going higher, take drastic
   action: invoke the OOMK (Out-of-Memory Killer).  Pick the process w/
   the largest mem footprint and KILL it.  You can disable the OOMK on a
   per-process basis, but you risk that the whole OS will hang!
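The three-way watermark policy above can be sketched as a small decision
function.  This is only illustrative: the watermark values, batch size,
and the names flusher_policy/FLUSH_ASYNC/etc. are made up here, not
actual Linux internals.

```c
#include <assert.h>

/* Illustrative sketch of the LW/HW flushing policy from the notes.
 * All names and constants here are hypothetical, not kernel APIs. */

#define LW 30       /* low watermark:  % of page cache that is dirty */
#define HW 70       /* high watermark (both reconfigurable in the notes) */
#define BATCH 32    /* default number of dirty pages flushed per pass */

enum action {
    FLUSH_ASYNC,    /* < LW: flush ~BATCH pages asynchronously */
    FLUSH_SYNC,     /* >= LW, < HW: flush ~BATCH pages synchronously */
    BLOCK_IO        /* >= HW: block I/O; repeated trips may invoke OOMK */
};

/* Decide what a flusher kthread should do given the dirty percentage. */
enum action flusher_policy(int dirty_pct)
{
    if (dirty_pct < LW)
        return FLUSH_ASYNC;
    if (dirty_pct < HW)
        return FLUSH_SYNC;
    return BLOCK_IO;
}
```

The point of the two thresholds is hysteresis: cheap asynchronous
writeback when pressure is low, throttling writers when it is moderate,
and drastic action only when the cache stays above HW.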
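Going back to the custom allocator section: the single-page alloc_x()/
free_x() scheme (130B units, 31 per page, 4-byte bitmap header) can be
sketched in user space.  This is a minimal sketch under assumptions: the
"page" is a static buffer rather than memory from the kernel, and only
one page is managed (no global page list, no throttling).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define UNIT_SIZE 130
#define NUM_UNITS 31    /* 4096 / 130 = 31 units, 66 bytes left over */
#define HDR_SIZE  4     /* 31 bits of free-list bitmap, rounded up   */

/* Stand-in for one 4KB page; first 4 bytes are the bitmap header.
 * Bit i == 1 means unit i is allocated, 0 means free. */
static unsigned char page[PAGE_SIZE];

static uint32_t *bitmap(void) { return (uint32_t *)page; }

/* Return the start addr of a free 130B unit, or NULL if page is full. */
void *alloc_x(void)
{
    uint32_t map = *bitmap();
    for (int i = 0; i < NUM_UNITS; i++) {
        if (!(map & (1u << i))) {           /* found a '0' bit: unit i free */
            *bitmap() = map | (1u << i);    /* mark it allocated */
            return page + HDR_SIZE + (size_t)i * UNIT_SIZE;
        }
    }
    return NULL;    /* all 31 units in use (the "simple" answer A1) */
}

/* O(1): recover the bitmap position from the address, clear the bit. */
void free_x(void *addr)
{
    size_t off = ((unsigned char *)addr - page - HDR_SIZE) / UNIT_SIZE;
    assert(off < NUM_UNITS);
    *bitmap() &= ~(1u << off);
}
```

Note the address math matches the notes: unit 15 lives at
page + 4 + 130*15, and free_x() inverts that with
(addr - page - 4) / 130.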