The implementation of Cryptfs proceeded based on the design with one notable exception: memory mapped operations complicated the code further and exposed some general problems with the vnode interface; these problems are discussed in detail in Section 3.4.
Our first implementation concentrated on the Solaris 2.5.1 operating system for several reasons:
Our next two implementations were for the Linux 2.0 and FreeBSD 3.0 operating systems. We chose these two for the following reasons:
By implementing Cryptfs for these three systems we hoped to prove that practical non-trivial stackable file systems are portable to sufficiently different Unix operating systems. The discussion in the rest of this section concentrates mostly on Solaris, unless otherwise indicated. In Section 3.5 we specifically discuss the differences in implementation between Linux and Solaris; Section 3.6 discusses the differences for the FreeBSD port.
We began the implementation by creating a ``wrapper'' file system called ``wrapfs'' that was initially very similar to the Solaris loopback file system (lofs). Lofs passes all Vnode/VFS operations to the lower layer, but it only interposes on directory vnodes. Wrapfs interposes on every vnode, and makes identical copies of data blocks and pages in its own layer. The reason for this data copy was to make Wrapfs identical to Cryptfs with the exception of the actual encryption of bytes; this allowed us to measure the cost of full stacking separately from the cost of encryption.
We used a reference implementation of the Blowfish encryption algorithm1 and ported it to the kernel. Porting Blowfish to the kernel was easy. Most encryption code (written in C) uses simple arithmetic manipulations and uses very few system calls or C library functions. That made the Blowfish code highly portable.
Next we applied the encryption algorithm to the read and write vnode operations. As per the design, we perform encryption on whole blocks of size matching the Ultra-SPARC native page size (8192 bytes.) Whenever a read for a range of bytes is requested, we compute the extended range of bytes up to the next page boundary, and apply the operation to the interposed file system using the extended range. Upon successful completion, the exact number of bytes requested are returned to the caller of the vnode operation. Writing a range of bytes is more complicated than reading. Within one page, bytes depend on previous bytes, so we have to read and decode parts of pages before writing other parts of them.
Throughout the rest of this paper we will refer to the interposing (wrapping) vnode as V, and to the interposed (hidden or wrapped) vnode as V'. We use F to represent a file at the interposer's level and F' at the interposed one; P and P' refer to memory mapped pages at these two levels, respectively. The following example2, depicted in Figure 2, shows what happens when a process asks to write bytes of an existing file from byte 9000 until byte 25000. Let us also assume that the file in question has a total of 4 pages (32768) worth of bytes in it.
Files opened for appending only (using O_APPEND) do not provide the vnode interface write function any information regarding the real size of the file and where writing begins. If the size of the file before an append attempt is requested is not an exact multiple of a page size, data corruption will occur, since we will begin a new encryption sequence not on a page boundary.
The way we solve this problem is by detecting when a file is opened with an append flag on, turn off that flag before the open operation is passed on to V', and replace it with flags that indicate to V' that the file was opened for normal reading and writing. We save the initial state of the file opened, so that any other operation on V would be able to tell that this file was originally opened only for appending.
Whenever we write bytes to a file that was opened in append-only mode, we first apply the vnode getattr operation to find the true size of the file, and add that to the file offsets being written to. The interposed layer's vnode V' does not know that a file has been opened for append-only at the layer above it. For example, an append operation of 430 bytes to a file with an existing size of 260 bytes is converted into a write operation of bytes 261-690 of the whole file, and the procedure described in Section 3.2 continues unchanged.
Every operation that deals with file names (for example ``lookup'') was modified such that the file name F is encrypted and uuencoded first. The modified file name F' is passed on to V'.
A complication we faced here was the ``readdir'' vnode operation. Readdir is implemented in the kernel as a restartable function. A user process calls the readdir C library call, which is translated to repeated calls to the getdents(2) system call, passing it a buffer of a given size. The buffer is filled by the kernel with enough bytes representing files in a directory being read, but no more. If the kernel has more bytes to offer the process (i.e. the directory has not been completely read) it will set a special EOF flag to false. As long as the C library call sees that the flag is false, it must call getdents(2) again. Each time it does so, it will read more bytes starting at the file offset of the opened directory as was left off during the last read.
The important issue with respect to directory reading is how to continue reading the directory from exactly the offset it was left off the last time. This is accomplished by recording the last position and ensuring that it is returned to us upon the next invocation. The way we implemented readdir was as follows:
The caller of readdir asks to read at most N bytes. When we uudecode and decrypt the file names we reduce the size of the file name -- converting every 4 byte sequence into 3. On average, file names are shortened by 25% from the encrypted and uuencoded forms. This means that the total number of bytes we pass back to the caller is less than N. That is all right, because the specifications for directory reading operations call for reading at most N bytes.
The only side effect of reducing the number of bytes returned to the caller is a small inefficiency in the overall number of times readdir must be called. Since we are returning less than N bytes, it is possible that the buffer supplied by the user process has enough space to fill in a few more directory entries. When taking into account large directories, it is possible that several getdents(2) system call invocations could be saved. With fewer context switches, overall performance could be improved. We did not deem this loss of performance significant since most Unix directories are not very large and the additional code to completely fill N bytes worth of data was going to complicate Cryptfs, increase the overhead, and inevitably lengthen the time required to develop it. (See Section 4.1 for an evaluation of the performance of Cryptfs.)
Next we added support for multi-user keys as described in Section 2.1. Keys are set by a new ioctl(2) added to Cryptfs, and may be reset or removed by an authorized user within the same session. The user is prompted for a passphrase and the tool converts it into keys passed to Cryptfs.
We noticed a problem with listing directories containing file names that were encrypted using different keys. Decryption of the directory data occurs not using the original key but the key for the real UID of that session. When a string S is encrypted using key K1 but decrypted using a different key K2, the result is a completely different string that may contain any number of characters that are illegal in file names such as a forward-slash ``/'' or nulls. The latter confuse string operations that expect their input to be null terminated. We solved this problem by adding a 2-byte checksum to the encrypted strings and one more byte indicating original unencrypted length, before we uuencode them. After adding these 3 bytes, we uuencode the sequence using our special uuencoding algorithm. When strings are read from the interposed-upon file system during directory reading operations, we first uudecode them, then decrypt them, and validate the checksum. If it does not match, then we know this file name was encrypted using a different key and we skip the listing of that file altogether.
This solution not only avoids the possible data mishandling of file names, but also has a side effect of a slightly increased security ``through obscurity.'' When a user lists a directory they only get to see files that they encrypted with the proper key. All other files are invisible and inaccessible to that user.
To be able to execute binaries we had to implement memory-mapping vnode functions. We had to decide if and how to cache memory mapped pages. In order to improve performance, we decided that Cryptfs shall have its own copy of cached decrypted pages in memory, and that we would leave the encrypted pages in the interposed-upon file system untouched.
When a page fault occurs, the kernel calls the vnode operation getpage. This function retrieves one or more pages from a file. We followed other implementations in Solaris and created a function that retrieves a single page -- getapage; this function gets called repeatedly by one of the Paged Vnodes interface functions: pvn_getpages. The Paged Vnodes interface functions are not an explicit part of the vnode interface, but are there to assist it when dealing with memory mapped files. This unfortunate but necessary use of operating system specific interfaces exposed portability problems in our stackable file system.
The implementation of getapage appeared simple:
Similarly, putpage was written using the Paged Vnodes interface functions. In practice we also had to carefully handle two additional details, to avoid deadlocks and data corruption. First, pages contain several types of locks, and these locks had to be held and released in the right order and at the right time. Secondly, the MMU keeps mode bits indicating status of pages in hardware, especially the referenced and modified bits. We had to update and synchronize the hardware version of these bits with their software version kept in the pages' flags. For a file system to have to know and handle all of these low-level details further blurred the distinction between the file system and the VM system.
To save space, some Unix file systems support files with ``holes'' -- pages that are not allocated on disk because they contain all zeros. When such a page is read into memory, the VM system fills it with zeros; only when it is modified does the page get physical disk space allocated. When we perform getapage on V', we expect to get a page that was previously encrypted with our algorithm and key. When we get a zero-filled page, we have no way of knowing that it was zero-filled and therefore should not be decrypted. (For a given key, there is a valid sequence of bytes that is not all-zeros, which when encrypted, will result in a sequence of all-zero bytes of the same length.) Our solution was to avoid holes in files. The only way to create a file with holes is to open it for write/append, lseek(2) past the end of the file, and write something there. We detected this condition in write and forced a write of all zeros (after encrypting them.)
When we began the Solaris work we referred to the implementation of other file systems such as lofs. Linux did not have one as part of standard distributions, but we were able to locate a prototype one and used it in our port. Also, the Linux Vnode/VFS interface contains a different set of functions and data structures than Solaris, but it operates in a similar fashion.
In Linux, much of the common file system code has been extracted and moved to a generic (higher) level. Many generic file system functions exist that can be used by default if the file system does not define its own version thereof. This leaves the file system developer to deal with only the core issues of the file system. For example, the way Solaris moves data between user and kernel space is via structures called User I/Os (UIO). These structures contain various fields that must be updated carefully and consistently. Linux simplifies data movement by passing vnode functions such as read and write a simple allocated (char *) buffer and an integer describing how many bytes to read into or write out of the buffer passed.
Memory mapped operations are also easier in Linux. The vnode interface in Solaris includes functions that must be able to manipulate one or more pages. In Linux, the file system handles one page at a time, leaving page clustering and multiple-page operations to the higher and more generic code. The Linux 2.0 kernel, however, does not include a putpage vnode operation (version 2.1 does.) We had to implement it using write. The write vnode operation uses different and somewhat incompatible arguments than putpage. We had to create the missing information and pass it to the write() operation. Since we store information vital for stacking in arguments passed to us, and we had to ``fake'' some of it here, this limited us to a single level of stacking file systems.
Linux's way of copying between user and kernel memory is a bit outmoded. Some operations default to manipulating user memory and some manipulate kernel memory. Operations that need to manipulate a different context have to set it before and restore it to its previous state afterwards. Solaris simply passes flags to these functions to tell them if to operate on kernel or user memory.
Directory reading was surprisingly simpler in Linux. In Solaris, we had to read a number of raw bytes from the interposed-upon file system, and parse them into chunks of sizeof(struct dirent), set the proper fields in this structure, and append the file name bytes to the end of the structure. In Linux, we provided the kernel with a callback function for iterating over the entries in a directory. This function was called by higher level code and asked us to simply process one file name at a time.
There were only two caveats to the portability of the Linux code. First, Linux keeps a list of exported kernel symbols (in kernel/ksyms.c) available to loadable modules. In order to make Cryptfs a loadable module, we had to export additional symbols to the rest of the kernel, for functions mostly related to memory mapping. Secondly, most of the structures used in the file system (inode, super_block, and file) include a private field into which file system specific opaque data could be placed, which we used to store information pertinent for stacking. We had to add a private field to only one structure which was missing it, the vm_area_struct, which represents custom per-process virtual memory manager page-fault handlers. Since Cryptfs is the first fully stackable file system for Linux, we feel that these changes are small and acceptable, given that more stackable file systems are likely to be developed.
FreeBSD 3.0 is based on BSD-4.4Lite. We chose it as the third port because it represents another major section of Unix operating systems -- the BSD ones. FreeBSD's vnode interface is very similar to Solaris' and the port was straightforward. FreeBSD's version of the loopback file system is called ``nullfs'' -- a useful template for writing stackable file systems. Two major deficiencies (bugs) in nullfs required attention. First, writing large files resulted in some data pages getting zero-filled on disk; this forced us to perform all writes synchronously. Secondly, memory mapping through nullfs panics the kernel, so we had to implement MMAP functions ourselves. We implemented getpages and putpages using read and write, respectively, because calling the lower-level's page functions resulted in a UFS pager error. (When we chose the latest snapshot of FreeBSD 3.0, we knew we were dealing with unstable code, and expected to face bugs.)