The layers are coming, the layers are coming
(an abstract)

Erez Zadok

Appeared in the 2008 Linux Storage and Filesystem Workshop, co-located with USENIX FAST (February 2008)

There is one layered (stackable) file system in mainline: eCryptfs. Another, Unionfs, is in -mm, with many users and several distros using it. In addition, over a dozen other stackable file systems have been developed for Linux and have seen some use over the past years: gzipfs, tracefs/replayfs, antivirusfs, n/crypfs, versionfs, i3fs, cachefs, and more (see www.filesystems.org). With the increase in both the number and the popularity of layered file systems, several shortcomings in the Linux VFS/MM API have become apparent. Furthermore, there is a significant overlap in functionality among the existing layered file systems. This presents an opportunity to consolidate much of the layered file system code into the VFS/MM API.

Our position is that the VFS and MM should be enhanced to better support layered file systems. Erez Zadok and Mike Halcrow met in person in October 2007 and outlined a few such problems and possible solutions (some of those solutions were tested briefly). We believe that these solutions can reach mainline within 1-2 years. We would like to discuss those problems and the possible solutions at LSF'08 and hopefully reach a consensus among the attendees as to which approaches are the most suitable. Along with the description of each problem, we plan on presenting actual sample code snippets.

Examples of existing deficiencies include the following:

Cache coherency between layers is challenging, and has many smaller issues to solve. A few of them that we believe can be solved within a couple of years are:
- mmap operations have to keep data pages at both the upper and lower file system layers, wasting memory. A struct page cannot be cleanly passed from an upper layer to a lower layer, because page->mapping->host points to a single inode structure, and changing this pointer can be risky or racy; a clean API may be useful to allow pages to point to several inodes (either directly from one mapping->host, or as a linked chain of sorts). We suggest that access to mapping->host be regulated by a generic wrapper which is able to invoke a chain/stack of callbacks so that everyone in the stack can be informed.
- Processes can modify lower inodes, but the upper file system isn't notified of such changes. This is similar to how NFS (v2/v3) handles cache coherency: checking the mtime of a file or parent directory to determine if the attribute cache is valid, and invalidating dentries accordingly. Unionfs uses a similar mtime-based method to detect changes to lower objects. Here too, we would like to propose that every access to the inode's m/a/ctime go through a vfs_* wrapper which can inform those at upper layers who wish to know as soon as a lower object has been modified.
Note: similar wrappers may be useful to regulate access to other data structure fields. Although this would require changing every mainline file system as well as the VFS, the changes are simple enough that we believe they won't affect stability at all, as they are the least intrusive changes that would allow stackable file systems to perform better cache coherency. We believe that most of the stack-awareness modifications can be made to the common VFS code (esp. fs/stack.c).
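The wrapper idea for m/a/ctime can be illustrated with a small user-space model. This is only a sketch: none of these names exist in the kernel, and struct sketch_inode, struct sketch_watcher, sketch_watch, and sketch_vfs_set_mtime are all hypothetical stand-ins. It assumes that every mtime update goes through a single wrapper, which then walks a chain of callbacks registered by upper layers.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* User-space model only; these types merely mimic kernel structures. */
struct sketch_inode;

/* Hypothetical callback record that an upper layer hangs on a lower inode. */
struct sketch_watcher {
    void (*mtime_changed)(struct sketch_watcher *w, struct sketch_inode *lower);
    struct sketch_watcher *next;   /* next interested layer in the chain */
    int stale;                     /* set when cached state needs revalidation */
};

struct sketch_inode {
    time_t mtime;
    struct sketch_watcher *watchers;   /* chain of interested upper layers */
};

/* Register an upper-layer watcher on a lower inode (push on the chain). */
static void sketch_watch(struct sketch_inode *lower, struct sketch_watcher *w)
{
    assert(lower && w && w->mtime_changed);
    w->next = lower->watchers;
    lower->watchers = w;
}

/* Hypothetical vfs_*-style wrapper: assumed to be the only writer of mtime. */
static void sketch_vfs_set_mtime(struct sketch_inode *inode, time_t t)
{
    struct sketch_watcher *w;

    assert(inode);
    inode->mtime = t;
    /* Inform every layer stacked above this inode. */
    for (w = inode->watchers; w; w = w->next)
        w->mtime_changed(w, inode);
}

/* Example callback: an upper layer marks its cached attributes stale. */
static void sketch_mark_stale(struct sketch_watcher *w, struct sketch_inode *lower)
{
    (void)lower;
    w->stale = 1;
}
```

In this model the upper layer learns of a lower change at the moment it happens, rather than discovering it later by comparing mtimes. Locking and callback lifetime, which a kernel version would have to address, are deliberately left out.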
To solve cache coherency cleanly, most likely there would have to be a way to map lower objects to upper objects. For example, a lower inode can be mapped to an upper one if each struct inode had an extra field to point to an upper inode, if any; or a list could be used if the mapping is not one-to-one. Alternatively, the superblock structure could house a whole hash table of inode mappings. Either way, the VFS will need at least a couple of generic functions that allow a file system to register (or de-register) a stacked object (this is called "interpose" in ecryptfs/unionfs).
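The first mapping option (an extra per-inode pointer) might look roughly like the following user-space sketch. The names struct map_inode, map_interpose, map_deinterpose, and map_upper_of are hypothetical and are not the functions used by ecryptfs or unionfs; the sketch covers the one-to-one case only.

```c
#include <assert.h>
#include <stddef.h>

/* User-space model of an inode carrying the proposed extra field. */
struct map_inode {
    unsigned long ino;
    struct map_inode *upper;   /* assumed new field: NULL when nothing is stacked */
};

/*
 * Register `upper` as the object stacked on `lower`.
 * Returns 0 on success, -1 if `lower` already has an upper object
 * (a list or hash table would be needed to lift that restriction).
 */
static int map_interpose(struct map_inode *lower, struct map_inode *upper)
{
    assert(lower && upper);
    if (lower->upper)
        return -1;
    lower->upper = upper;
    return 0;
}

/* De-register whatever is stacked on `lower`. */
static void map_deinterpose(struct map_inode *lower)
{
    assert(lower);
    lower->upper = NULL;
}

/* Look up the upper object for a lower inode, or NULL if there is none. */
static struct map_inode *map_upper_of(const struct map_inode *lower)
{
    assert(lower);
    return lower->upper;
}
```

A single pointer keeps lookup constant-time and costs one word per inode. The superblock hash table alternative would avoid growing struct inode, at the price of a lookup on every lower-to-upper translation.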
Unionfs adds the notion of a file revalidation operation. It is needed because the lower file could be modified directly by users, but also due to branch-management (adding/removing directories to/from the union). This is also related to cache coherency. We would like to propose adding an optional file->revalidate method which would be called by the VFS only if the file system defines file->revalidate, and right before the VFS is about to access other file methods such as ->read, ->write, etc.
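As a rough illustration of the optional revalidate hook, the user-space model below calls it only when the file system defines it, immediately before dispatching to ->read. All names (struct sketch_file, struct sketch_file_ops, sketch_vfs_read) are hypothetical, and the generation counter is just one assumed way a file system might detect that the union changed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct sketch_file;

/* Model of a file operations table with the proposed optional method. */
struct sketch_file_ops {
    int  (*revalidate)(struct sketch_file *f);   /* optional, may be NULL */
    long (*read)(struct sketch_file *f, char *buf, size_t len);
};

struct sketch_file {
    const struct sketch_file_ops *ops;
    int generation;       /* generation this file was last validated against */
    int *sb_generation;   /* current generation of the (modeled) union */
};

/* Model of the VFS read path: revalidate first, but only if defined. */
static long sketch_vfs_read(struct sketch_file *f, char *buf, size_t len)
{
    assert(f && f->ops && f->ops->read);
    if (f->ops->revalidate) {
        int err = f->ops->revalidate(f);
        if (err)
            return err;   /* propagate a negative error, skip the read */
    }
    return f->ops->read(f, buf, len);
}

/* Example revalidate: catch up with the union's current generation. */
static int sketch_revalidate(struct sketch_file *f)
{
    if (f->generation != *f->sb_generation)
        f->generation = *f->sb_generation;   /* a real fs would refresh lower files here */
    return 0;
}

/* Example read method that returns two fixed bytes. */
static long sketch_read(struct sketch_file *f, char *buf, size_t len)
{
    (void)f;
    if (len < 2)
        return 0;
    memcpy(buf, "ok", 2);
    return 2;
}
```

File systems that leave revalidate as NULL pay only a pointer test per call in this model, which is the property that makes the hook optional.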
eCryptfs saves a struct file inside its inode, so that ecryptfs_writepage can call vfs_write. Unionfs calls the lower ->writepage directly, but then it has to deal with AOP_WRITEPAGE_ACTIVATE. One suggestion is to get rid of AOP_WRITEPAGE_ACTIVATE entirely (already proposed by Hugh Dickins). Both Unionfs and eCryptfs use vfs_read for ->readpage and vfs_write for ->commit_write. We would like to propose a set of generic wrappers that stackable file systems can use by default for their address_space ops, thus eliminating duplicate efforts (which would also be applicable for the newer write_begin/end API). Another proposal is that ->writepage(s) should also be passed a dentry and/or struct file: this would allow that last address_space operation to be used with vfs_read/write as well.
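A generic ->readpage wrapper of the kind proposed could follow the shape below. This is a user-space sketch under stated assumptions: sketch_lower_read stands in for vfs_read on the lower file, the lower file is modeled as a byte buffer, and sketch_stack_readpage and SKETCH_PAGE_SIZE are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u

/* Model of a lower file: just a read-only byte buffer. */
struct sketch_lower_file {
    const unsigned char *data;
    size_t size;
};

/* Stand-in for vfs_read: copy up to `len` bytes from *pos, advance *pos. */
static long sketch_lower_read(struct sketch_lower_file *f, unsigned char *buf,
                              size_t len, size_t *pos)
{
    size_t n;

    assert(f && buf && pos);
    if (*pos >= f->size)
        return 0;                 /* at or past end of file */
    n = f->size - *pos;
    if (n > len)
        n = len;
    memcpy(buf, f->data + *pos, n);
    *pos += n;
    return (long)n;
}

/*
 * Hypothetical generic readpage helper: fill the upper page at `index`
 * by reading the matching range of the lower file, then zero the tail.
 */
static int sketch_stack_readpage(struct sketch_lower_file *lower, size_t index,
                                 unsigned char *page)
{
    size_t pos = index * SKETCH_PAGE_SIZE;
    long n = sketch_lower_read(lower, page, SKETCH_PAGE_SIZE, &pos);

    if (n < 0)
        return (int)n;
    memset(page + n, 0, SKETCH_PAGE_SIZE - (size_t)n);
    return 0;
}
```

A file system that transforms data (encryption, compression) would add its transformation between the lower read and the page fill; a pass-through file system could use a helper like this unchanged, which is where the duplicated code would be saved.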
Stackable file systems use lookup_one_len to get the dentries of lower files, because it is simple. But lookup_one_len doesn't pass a nameidata structure, which can cause problems such as not being able to cross file systems (and bind mounts) on the lower file system. An alternative is to call vfs_path_lookup, which has a more complex interface, requiring the file system to keep track of vfsmounts and nameidata structures. We would like to propose a simple way to look up lower files such that the VFS, not the file system, would keep track of the vfsmounts and nameidata structures.

In this talk we discuss some issues that relate to developing stackable file systems in the Linux kernel, and propose solutions for them.
