* memory allocators, cont.

1. kmalloc/etc.
   - can cause a lot of fragmentation
   - searching for a free contiguous unit for a new allocation can be
     costly, as much as O(n).

2. page-based allocators give you N pages, non-contiguous, and always in
   multiples of 4KB (PAGE_SIZE)
   - useful for process mem allocation, b/c page protection, the MMU, etc.
     all work in pages.
   - works well for caching disk/filesystem data.  Historically, disk
     sector sizes were 512B.  Newer storage devices can read/write in 4KB
     "sector sizes" (some support both 512B and 4KB).  Retrieved 4KB
     sectors fit perfectly into 4KB pages in memory (page/buffer cache).
   - but: only works if your data set can be broken into neat 4KB units.

* custom memory allocators

Try to minimize/eliminate fragmentation, and speed up alloc/free of units.

Idea: if you need to manage many objects of size X (meaning many
alloc/free calls for such objects), then create a custom allocator for
size X:

1. get a few whole pages of memory
2. break those pages into packed units of size X

E.g., say X is 130 bytes long.  Alloc a 4KB page and treat it as an
array of 130B units.  How many fit?  4096/130 = 31 units, with 66 bytes
left over.  We need 31 bits to record which of the 31 units are free or
used, rounded up to 4 bytes.  So use some of the 66 free bytes in this
page as a "header" recording the free list as a bitmap.  Assume that a
'1' bit means allocated and '0' means free, and that this free-list
header comes first (first 4 bytes).  The 4KB page will be:

1. first 4 bytes: header (free-list bitmap and anything else you need)
2. next 31*130 = 4030 bytes: the "data" units to be allocated
3. leftover 62 bytes (wasted)

Create a custom API for allocating X, call it alloc_x() and free_x().

alloc_x(void): returns the start addr of a "free" unit of X
1. search the bitmap for a '0' bit: shift bits to the left, counting
   until you find a '0' bit.  Say the count was 15, meaning unit 15 of
   the 31 units is free.
2.
   calculate the starting addr as
       ret = start_addr_of_4kb_page + 4 (header size) + 130*15
   Note: this counts units from 0; if counting from 1 (the "15th" unit),
   the math becomes 130*(15-1).
   (fast to calculate, O(1))

free_x(addr of X previously given):
1. find the offset within the 4KB page that this address belongs to.
   The offset is the "position" within the bitmap that corresponds to
   the Nth (e.g., 15th) unit of X out of the 31 available:
       offset = (addr - start_addr_of_4kb_page - 4) / 130
2. turn off bit number "offset" in the 4-byte header (set it from 1 to 0)
   (freeing is also fast, O(1))

Q: should we reset the memory on free?  (or on alloc?)
A: not by default, but yes if you enable some mem debugging.

Q: what if all units in my 4KB page are in use, and no more are free?
A1: simple -- return NULL.
A2: friendlier -- alloc another 4KB page and break it into another 31
    units of X.  And so on.  But now we need some sort of global list or
    array of the start addrs of all pages used by this custom allocator.
    It also means we have to search in O(N), where N is the number of
    pages allocated for this custom allocator (so the cost of alloc_x
    increases slightly).  Can also keep a per-page counter of how many X
    units are free/used.  Can also return whole pages back to the kernel
    when none of their X units are used.  Finally, can also cap how many
    pages or X units are available, and throttle heavy users of alloc_x
    (put them to sleep, a la kmalloc).

Sometimes when you init an allocator for X, you pre-reserve N pages of
memory in anticipation of needing that many allocations of X.  But be
careful not to over-reserve and wind up not using much of it.

Custom allocators are used for popular objects like struct inode and
struct dentry.  Linux has APIs for creating a pool of N objects of size
X (most file systems use them).

Q: what if the normal malloc allocator takes mem out of this custom
   allocator's memory?
A: don't let malloc or other allocators in, else it defeats the purpose
   of this custom allocator.

Q: what if we need to alloc units larger than 4KB?!
A: then custom allocators are not a suitable method; you'll have to
   break the data over multiple 4KB pages (an order-N allocator), or
   rethink your design (e.g., distributed memory systems).

* freeing memory

When the kernel is under mem pressure, it suspends other callers
(sleep/wait) and goes to clean memory.  Then kthreads wake up to clean
up:

1. go over the page cache: look for readonly pages, and release them.
   (most effective, b/c we free 4KB units)
2. go over other caches, looking for objects with refcount 0: release
   them.
3. then go over "dirty" objects that need to be flushed/synced with the
   underlying I/O media (e.g., dirty pages of a file need to be sync'ed
   to the file, then the page can be freed; see address_space_operations
   ->writepage/s).

In Linux, one or more kthreads wake up to clean, called kflushd,
pdflush, bdflush (different names over time).  The basic technique uses
two thresholds: a low and a high watermark (LW and HW), designated as
percentages of the whole page cache that is dirty.  E.g., LW=30% and
HW=70% (reconfigurable).  When the kthread wakes up, it checks what
fraction of pages are dirty now:

1. If fraction < LW: don't need to do much; pick some number of dirty
   pages (default 32) and ask the f/s or other subsystem to flush them
   (e.g., ->writepage) asynchronously (note: flags passed to ->writepage
   say whether a/sync is requested).
2. If fraction >= LW but < HW: pick N pages (e.g., 32) and ask the
   system to write them synchronously (e.g., blocking new heavy
   writers).
3. If fraction > HW: may have to take more drastic options.  Block all
   I/O activity for a time.  Count how many consecutive times we've been
   over HW: if above some threshold, and going higher, take drastic
   action: invoke the OOMK (Out-of-Memory Killer).  Pick the process w/
   the largest mem footprint and KILL it.  You can disable the OOMK on a
   per-process basis, but you risk that the whole OS will hang!
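The three-way watermark policy above can be sketched as a small decision
function.  This is only illustrative: the watermark values, batch size,
and the names flusher_policy/FLUSH_ASYNC/etc. are made up here, not
actual Linux internals.

```c
#include <assert.h>

/* Illustrative sketch of the LW/HW flushing policy from the notes.
 * All names and constants here are hypothetical, not kernel APIs. */

#define LW 30       /* low watermark:  % of page cache that is dirty */
#define HW 70       /* high watermark (both reconfigurable in the notes) */
#define BATCH 32    /* default number of dirty pages flushed per pass */

enum action {
    FLUSH_ASYNC,    /* < LW: flush ~BATCH pages asynchronously */
    FLUSH_SYNC,     /* >= LW, < HW: flush ~BATCH pages synchronously */
    BLOCK_IO        /* >= HW: block I/O; repeated trips may invoke OOMK */
};

/* Decide what a flusher kthread should do given the dirty percentage. */
enum action flusher_policy(int dirty_pct)
{
    if (dirty_pct < LW)
        return FLUSH_ASYNC;
    if (dirty_pct < HW)
        return FLUSH_SYNC;
    return BLOCK_IO;
}
```

The point of the two thresholds is hysteresis: cheap asynchronous
writeback when pressure is low, throttling writers when it is moderate,
and drastic action only when the cache stays above HW.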
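Going back to the custom allocator section: the single-page alloc_x()/
free_x() scheme (130B units, 31 per page, 4-byte bitmap header) can be
sketched in user space.  This is a minimal sketch under assumptions: the
"page" is a static buffer rather than memory from the kernel, and only
one page is managed (no global page list, no throttling).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define UNIT_SIZE 130
#define NUM_UNITS 31    /* 4096 / 130 = 31 units, 66 bytes left over */
#define HDR_SIZE  4     /* 31 bits of free-list bitmap, rounded up   */

/* Stand-in for one 4KB page; first 4 bytes are the bitmap header.
 * Bit i == 1 means unit i is allocated, 0 means free. */
static unsigned char page[PAGE_SIZE];

static uint32_t *bitmap(void) { return (uint32_t *)page; }

/* Return the start addr of a free 130B unit, or NULL if page is full. */
void *alloc_x(void)
{
    uint32_t map = *bitmap();
    for (int i = 0; i < NUM_UNITS; i++) {
        if (!(map & (1u << i))) {           /* found a '0' bit: unit i free */
            *bitmap() = map | (1u << i);    /* mark it allocated */
            return page + HDR_SIZE + (size_t)i * UNIT_SIZE;
        }
    }
    return NULL;    /* all 31 units in use (the "simple" answer A1) */
}

/* O(1): recover the bitmap position from the address, clear the bit. */
void free_x(void *addr)
{
    size_t off = ((unsigned char *)addr - page - HDR_SIZE) / UNIT_SIZE;
    assert(off < NUM_UNITS);
    *bitmap() &= ~(1u << off);
}
```

Note the address math matches the notes: unit 15 lives at
page + 4 + 130*15, and free_x() inverts that with
(addr - page - 4) / 130.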