The layers are coming, the layers are coming
(an abstract)
Erez Zadok
Appeared in the 2008 Linux Storage and Filesystem Workshop, co-located
with USENIX FAST (February 2008)
There is one layered (stackable) file system in mainline: eCryptfs.
Another file system, Unionfs, is in -mm, with many users and several
distros using it. Also, there are over a dozen other stackable file
systems developed for Linux which have been getting some use over the
past years: gzipfs, tracefs/replayfs, antivirusfs, n/cryptfs,
versionfs, i3fs, cachefs, and more. See www.filesystems.org.
With the increase in both the number and the popularity of layered
file systems, several shortcomings in the Linux VFS/MM API have become
apparent. Furthermore, there is a significant overlap in
functionality among the existing layered file systems. This presents
an opportunity to consolidate much of the layered file system code
into the VFS/MM API.
Our position is that the VFS and MM should be enhanced to better
support layered file systems. Erez Zadok and Mike Halcrow met in
person in October 2007 and outlined a few such problems and possible
solutions (some of those solutions were tested briefly). We believe
that these solutions can reach mainline within a 1-2 year time
frame. We would like to discuss those problems and
the possible solutions at LSF'08 and hopefully reach a consensus among
the attendees as to which approaches are the most suitable. Along
with the description of each problem, we plan on presenting actual
sample code snippets.
Examples of existing deficiencies include the following:
- Cache coherency between layers is challenging, and has many
smaller issues to solve. A few of them that we believe can be
solved within a couple of years are:
- mmap operations have to keep data pages at the upper and lower
file system layers, wasting memory. A struct page cannot be
cleanly passed from an upper layer to a lower layer, because the
page->mapping->host points to one inode structure, and changing
this pointer can be risky or racy; a clean API may be useful to
allow pages to point to several inodes (either directly from one
mapping->host, or as a linked chain of sorts). We suggest that
access to the mapping->host be regulated by a generic wrapper
which is able to invoke a chain/stack of callbacks so that
everyone in the stack can be informed.
- Processes can modify lower inodes, but the upper file system
isn't notified of such changes. This is similar to the situation in
NFS (v2/v3), which handles cache coherency by checking the mtime of a
file or parent directory to determine if the attribute cache is valid,
and invalidating dentries accordingly. Unionfs uses a similar
mtime-based method to detect changes to lower objects. Here too,
we would like to propose that every access to the inode's
m/a/ctime go through a vfs_* wrapper which can inform those at
upper layers who wish to know as soon as a lower object has been
modified.
Note: similar wrappers may be useful to regulate access to other data
structure fields. Although this would require changing every mainline
file system as well as the VFS, the changes are simple enough that we
believe they won't affect stability at all, as they are the least
intrusive changes that would allow stackable file systems to perform
better cache coherency. We believe that most of the stack-awareness
modifications can be made to the common VFS code (esp. fs/stack.c).
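The wrapper idea above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: every name in it (demo_inode, demo_watch, demo_set_mtime, and so on) is hypothetical and does not correspond to any existing kernel interface. It shows a single setter for mtime that walks a chain of callbacks so that each interested upper layer hears about the change.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical user-space model: none of these names exist in the kernel. */
struct demo_inode;

/* Callback an upper layer registers to hear about lower-inode changes. */
typedef void (*demo_notify_fn)(struct demo_inode *lower, void *upper_data);

struct demo_watcher {
    demo_notify_fn fn;
    void *upper_data;
    struct demo_watcher *next;
};

struct demo_inode {
    time_t mtime;
    int attrs_stale;               /* set on an upper inode whose cache is out of date */
    struct demo_watcher *watchers; /* chain of upper layers interested in this inode */
};

/* Add a watcher to the head of the lower inode's chain. */
void demo_watch(struct demo_inode *lower, struct demo_watcher *w,
                demo_notify_fn fn, void *upper_data)
{
    assert(lower && w && fn);
    w->fn = fn;
    w->upper_data = upper_data;
    w->next = lower->watchers;
    lower->watchers = w;
}

/* Stand-in for the proposed vfs_* wrapper: all mtime updates go through here. */
void demo_set_mtime(struct demo_inode *inode, time_t when)
{
    assert(inode);
    inode->mtime = when;
    for (struct demo_watcher *w = inode->watchers; w; w = w->next)
        w->fn(inode, w->upper_data);
}

/* Example callback: an upper layer marks its cached attributes stale. */
void demo_mark_stale(struct demo_inode *lower, void *upper_data)
{
    (void)lower;
    ((struct demo_inode *)upper_data)->attrs_stale = 1;
}

/* Usage: returns 1 if the upper inode was told about the lower change. */
int demo_run(void)
{
    struct demo_inode lower = {0}, upper = {0};
    struct demo_watcher w;

    demo_watch(&lower, &w, demo_mark_stale, &upper);
    demo_set_mtime(&lower, (time_t)1200000000);
    return upper.attrs_stale;
}
```

In this model a non-stacked file system pays only for walking an empty chain, which is one reason such a wrapper might be considered minimally intrusive.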
To solve the cache coherency problem cleanly, most likely there would have to
be a way to map lower objects to upper objects. For example, a lower
inode can be mapped to an upper one if each struct inode had an extra
field to point to an upper inode, if any; or a list could be used if
the mapping is not one-to-one. Alternatively, the superblock
structure could house a whole hash-table of inode mappings. Either
way, the VFS will need at least a couple of generic functions that
allow a file system to register (or de-register) a stacked object
(this is called "interpose" in ecryptfs/unionfs).
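As a rough illustration of the list-based variant described above, the following user-space sketch keeps, on each lower inode, a list of the upper inodes stacked on it, with one function to register a mapping and one to de-register it. All structure and function names (map_inode, map_interpose, map_deinterpose) are hypothetical; this is not the interpose code from eCryptfs or Unionfs, and it ignores locking and reference counting entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not kernel structures or functions. */
struct map_inode;

struct map_link {
    struct map_inode *upper;
    struct map_link *next;
};

struct map_inode {
    unsigned long ino;
    struct map_link *uppers;  /* upper inodes stacked on this one; NULL if none */
};

/* Register: record that 'upper' is stacked on 'lower'. Caller owns 'link'. */
void map_interpose(struct map_inode *lower, struct map_inode *upper,
                   struct map_link *link)
{
    assert(lower && upper && link);
    link->upper = upper;
    link->next = lower->uppers;
    lower->uppers = link;
}

/* De-register: returns 1 if the mapping existed and was removed, else 0. */
int map_deinterpose(struct map_inode *lower, struct map_inode *upper)
{
    assert(lower && upper);
    for (struct map_link **pp = &lower->uppers; *pp; pp = &(*pp)->next) {
        if ((*pp)->upper == upper) {
            *pp = (*pp)->next;  /* unlink the matching entry */
            return 1;
        }
    }
    return 0;
}

/* Count the upper inodes currently stacked on 'lower'. */
int map_count_uppers(const struct map_inode *lower)
{
    int n = 0;

    assert(lower);
    for (const struct map_link *l = lower->uppers; l; l = l->next)
        n++;
    return n;
}
```

A one-to-one mapping is the special case in which the list never holds more than one entry; a per-superblock hash table would trade the extra inode field for a lookup on each access.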
- Unionfs adds the notion of a file revalidation operation. It is
needed because the lower file could be modified directly by users,
but also due to branch-management (adding/removing directories
to/from the union). This is also related to cache coherency. We
would like to propose adding an optional file->revalidate method
which would be called by the VFS only if the file system defines
file->revalidate, and right before the VFS is about to access
other file methods such as ->read, ->write, etc.
- eCryptfs saves a struct file inside its inode, so that
ecryptfs_writepage could call vfs_write. Unionfs calls the lower
->writepage directly, but then it has to deal with
AOP_WRITEPAGE_ACTIVATE. One suggestion is to get rid of
AOP_WRITEPAGE_ACTIVATE entirely (already proposed by Hugh
Dickins). Both Unionfs and eCryptfs use vfs_read for ->readpage
and vfs_write for ->commit_write. We would like to propose a set
of generic wrappers that stackable file systems can use by default
for their address_space ops, thus eliminating duplicate efforts
(which would also be applicable for the newer write_begin/end
API). Another proposal is that ->writepage(s) should also be
passed a dentry and/or struct file: this would allow that last
address_space operation to be used with vfs_read/write as well.
- Stackable file systems use lookup_one_len to get the dentries of
lower files, because it is simple. But lookup_one_len doesn't
pass a nameidata structure, which can cause problems such as not
being able to cross filesystems (and bind mounts) on the lower
file system. An alternative is to call vfs_path_lookup, which has
a more complex interface, requiring the file system to keep track
of vfsmounts and nameidata structures. We would like to propose a
simple way to look up lower files such that the VFS, not the file
system, would keep track of the vfsmounts and nameidata structures.
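The optional revalidate method proposed in the list above could behave roughly as in the following user-space sketch. It is an assumption-laden model, not kernel code: rv_file, rv_file_ops, rv_vfs_read, and the generation counter used to detect lower-level changes are all hypothetical names introduced for illustration. The point it shows is that the generic read path calls the hook only when the file system defines it, and does so before invoking ->read.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of an optional ->revalidate file method. */
struct rv_file;

struct rv_file_ops {
    int (*revalidate)(struct rv_file *f);                    /* optional; may be NULL */
    long (*read)(struct rv_file *f, char *buf, size_t len);  /* required */
};

struct rv_file {
    const struct rv_file_ops *ops;
    const int *lower_generation; /* bumped elsewhere when the lower state changes */
    int seen_generation;         /* generation observed at the last revalidation */
    int refreshes;               /* how many times cached state was refreshed */
};

/* Stand-in for the VFS read path: call ->revalidate first, only if defined. */
long rv_vfs_read(struct rv_file *f, char *buf, size_t len)
{
    assert(f && f->ops && f->ops->read);
    if (f->ops->revalidate) {
        int err = f->ops->revalidate(f);
        if (err)
            return err;  /* revalidation failed: ->read is not attempted */
    }
    return f->ops->read(f, buf, len);
}

/* Example stacked file system: refresh cached state if the lower side moved. */
int rv_stack_revalidate(struct rv_file *f)
{
    if (f->seen_generation != *f->lower_generation) {
        f->seen_generation = *f->lower_generation;
        f->refreshes++;
    }
    return 0;
}

/* Example ->read that reports end-of-file; real data transfer is omitted. */
long rv_stack_read(struct rv_file *f, char *buf, size_t len)
{
    (void)f;
    (void)buf;
    (void)len;
    return 0;
}
```

File systems that leave the revalidate pointer NULL see no change in behavior in this model, which matches the proposal that the method be optional.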
In this talk we discuss some issues that relate to developing stackable
file systems in the Linux kernel, and propose solutions for them.