3 Design

The key issues we see in this work are:

1. Is a switching mechanism really needed? Why not use the same file systems no matter where you are?
2. When and how to switch from one replica to another.
3. How to ensure that the new file system is an acceptable replacement for the old one.
4. How to ensure consistency if updates are applied across different replicas.
5. Fault tolerance: how to protect a client from server unavailability.
6. Security: NFS is designed for a local ``workgroup'' environment in which the space of user IDs is centrally controlled.

These issues are addressed below.

3.1 Demonstrating the Need

  We contend that adaptive client-server matchups are desirable because running file system operations over many network hops degrades service in three ways: increased latency, more frequent failures, and decreased scalability. It is hard to ascertain exact failure rates and load on shared resources without undertaking a full-scale network study; however, we were able to gather some key data to support our claim. We performed a simple study to measure how latency increases with distance.

First, we used the traceroute program[*] to gather <hop-count, latency> data points measured between a host at Columbia and several other hosts around the campus, city, region, and continent. Latencies were measured by a Columbia host, which is a Sun-4/75 equipped with a microsecond resolution clock. The cost of entering the kernel and reading the clock is negligible, and so the measurements are accurate to a small fraction of a millisecond.

Next, we mounted NFS file systems that are exported Internet-wide by certain hosts. We measured the time needed to copy 1MB from these hosts using a 1KB block size. A typical result is plotted in Figure 1. Latency jumps by almost two orders of magnitude at the tenth hop, which represents the first host outside Columbia.


  
Figure 1: NFS Read Latency vs. Network Hop Count
[figure: latency-vs-hops.ps]
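The timing itself requires nothing more than a read loop wrapped in clock reads; a minimal user-level sketch (the pathname is hypothetical and error handling is abbreviated) follows:

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

/* Sketch: time a 1MB copy in 1KB blocks from an NFS-mounted file.
 * The pathname is hypothetical; error handling is abbreviated. */
int main(void)
{
    char buf[1024];
    struct timeval start, end;
    long usec;
    int i, fd;

    fd = open("/n/remote/testfile", O_RDONLY);
    if (fd < 0)
        return 1;
    gettimeofday(&start, NULL);
    for (i = 0; i < 1024; i++)              /* 1024 x 1KB = 1MB */
        if (read(fd, buf, sizeof(buf)) <= 0)
            break;
    gettimeofday(&end, NULL);
    usec = (end.tv_sec - start.tv_sec) * 1000000L +
           (end.tv_usec - start.tv_usec);
    printf("elapsed: %ld usec\n", usec);
    close(fd);
    return 0;
}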

3.2 When to Switch

  We have modified the kernel so that rfscall() measures the latency of every NFS lookup and maintains a per-filesystem data structure storing a number of recently measured latencies.

We chose to time the lookup operation rather than any other operation or mixture of operations for two reasons. The first is that lookup is the most frequently invoked NFS operation. We felt other calls would not generate enough data points to accurately characterize latency. The second reason is that lookup exhibits the least performance variability of the common NFS operations. Limiting variability of measured server latencies is important in our work, since we want to distinguish transient changes in server performance from long-term changes.

(At the outset of our work, we measured variances in the latency of the most common NFS operations and discovered huge swings, shown in Figure 2, even in an extended LAN environment that has been engineered to be uniform and not to have obvious bottlenecks. The measured standard deviations were 1027 msec for all NFS operations, 2547 msec for read, and 596 msec for lookup.)


  
Figure 2: Variability and Latency of NFS operations
[figure: nfs-op-variability.ps]

After each newly measured lookup latency is added, the median latency is computed over both the last 30 and the last 300 calls. We compute medians because they are relatively insensitive to outliers. We take a data point no more than once per second, so during busy times these sampling intervals correspond to 30 seconds and 5 minutes, respectively. This policy provides insurance against anomalies like ping-pong switching between a pair of file systems: a file system can be replaced no more often than every 5 minutes.

The signal to switch is given whenever the short-term median latency exceeds the long-term median latency by a factor of 2. Looking for a factor-of-two difference between short-term and long-term medians is our attempt to detect a change in performance that is substantial and ``sudden,'' yet not transient. The lengths of the short-term and long-term windows, as well as the ratio used to signal a switch, are heuristics chosen after experimentation in our environment. All these parameters can be changed from user level through a debugging system call that we have added.
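As a concrete illustration of this heuristic, the following user-level sketch keeps a circular buffer of recent lookup latencies and signals a switch when the median of the last 30 samples exceeds twice the median of the last 300. The names, buffer handling, and details are ours, not the kernel implementation:

#include <stdlib.h>

/* Sketch of the switch-detection heuristic; window sizes and the 2x ratio
 * mirror the text, everything else is illustrative. */
#define SHORT_WIN 30
#define LONG_WIN  300

static long samples[LONG_WIN];     /* most recent lookup latencies */
static int nsamples, next;

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

static long median_of_last(int n)
{
    long copy[LONG_WIN];
    int i, have = nsamples < n ? nsamples : n;

    for (i = 0; i < have; i++)     /* walk backwards from the newest sample */
        copy[i] = samples[(next - 1 - i + LONG_WIN) % LONG_WIN];
    qsort(copy, have, sizeof(long), cmp_long);
    return copy[have / 2];
}

/* Record one latency (in usec); return 1 if a switch should be triggered. */
int record_lookup_latency(long usec)
{
    samples[next] = usec;
    next = (next + 1) % LONG_WIN;
    if (nsamples < LONG_WIN) {
        nsamples++;
        return 0;                  /* wait until the long window is full */
    }
    return median_of_last(SHORT_WIN) > 2 * median_of_last(LONG_WIN);
}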

3.3 Locating a Replacement

  When a switch is triggered, rfscall() starts a non-blocking RPC out to our user-level process that performs replacement, nfsmgrd.[*] The call names the guilty file server, the root of the file system being sought, the kernel architecture, and any mount options affecting the file system. Nfsmgrd uses these pieces of information to compose and broadcast an RLP request. The file system name keys the search, while the server name is a filter: the search must not return the same file server that is already in use.

The RLP message is received by the nfsmgrd at other sites on the same broadcast subnet. To formulate a proper response, an nfsmgrd must have a view of mountable file systems stored at its site and also mounted file systems that it is using -- either type could be what is being searched for. Both pieces of information are trivially accessible through /etc/fstab, /etc/exports, and /etc/mtab.
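A responding nfsmgrd therefore only has to check whether the requested file system name appears in those files; a simplified user-level sketch (the parsing is deliberately naive and the function names are ours) follows:

#include <stdio.h>
#include <string.h>

/* Sketch: decide whether this site can serve the requested file system,
 * by scanning /etc/exports (file systems we export) and /etc/mtab
 * (file systems we have mounted). */
static int file_mentions(const char *path, const char *fsname)
{
    char line[1024];
    FILE *fp = fopen(path, "r");
    int found = 0;

    if (fp == NULL)
        return 0;
    while (fgets(line, sizeof(line), fp) != NULL) {
        if (line[0] == '#')
            continue;                 /* skip comments */
        if (strstr(line, fsname) != NULL) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}

int can_answer_rlp_request(const char *fsname)
{
    return file_mentions("/etc/exports", fsname) ||
           file_mentions("/etc/mtab", fsname);
}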

The nfsmgrd at the site that originated the search uses the first response it gets; we suppose that the speed with which a server responds to the RLP request gives a hint about its future performance. (The Sun Automounter [2] makes the same assumption about replicated file servers.) If a read-only replacement file system is available, nfsmgrd instructs Amd to mount it and terminates the out-of-kernel RPC, telling the kernel the names of the replacement server and file system. The flow of control is depicted in Figure 3.


  
Figure 3: Flow of Control During a Switch
[figure: flow-ctl-switch.ps]

3.4 Using the Replacement

  Once a replacement file system has been located and mounted, all future attempts to open files on the replaced file system will be routed to the replacement whenever possible. Likewise, whenever possible, files already open on the replaced file system will be switched to their equivalents on the replacement. We describe these two cases in Sections 3.4.2 and 3.4.3, respectively.

3.4.1 Relevant Changes to Kernel Data Structures

In order to accommodate file system replacement, we have added some fields to three important kernel data structures: struct vfs, which describes mounted file systems; struct vnode, which describes open files; and struct file, which describes file descriptors.

The fields added to struct vfs, excluding debugging fields, record the recently measured lookup latencies for the file system, the pathname on which it is mounted, the identity of its replacement file system (if one has been mounted), and the DFT that tracks the results of per-file comparisons (Section 3.4.2).

The only field added to struct vnode is v_last_used, which contains the last time that rfscall() made a remote call on behalf of this vnode. This information is used in ``hot replacement,'' as described in Section 3.4.3.

The only field added to struct file is f_path, which contains the relative pathname from the mount point to the file for which the descriptor was opened. Different entries may have different pathnames for the same file if several hard links point to the file.
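For concreteness, the additions can be pictured roughly as follows; apart from v_last_used and f_path, which are named in the text, the field names here are illustrative placeholders rather than the actual declarations:

/* Illustrative sketch of the added fields.  Only v_last_used and f_path
 * are named in the text; the struct vfs members shown are placeholders
 * for the fields described above, not actual declarations. */
struct vfs;                          /* mounted file system (existing) */
struct dft;                          /* per-filesystem comparison table */

struct vfs_additions {
    char       *vfs_mnt_path;        /* pathname this file system is mounted on */
    struct vfs *vfs_replacement;     /* replacement file system, if any */
    struct dft *vfs_dft;             /* results of per-file comparisons */
    long        vfs_latencies[300];  /* recently measured lookup latencies */
};

/* struct vnode gains:  time_t v_last_used;  last rfscall() for this vnode */
/* struct file  gains:  char  *f_path;       pathname relative to mount point */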

3.4.2 After Replacement: Handling New Opens

  When Amd mounts a file system it makes a symlink from the desired location of the file system to the mount point. For example, /u/foo would be a symlink pointing to the real mount point of /n/bar/u/foo; by our local convention, this would indicate that server bar exports /u/foo. Users and application programs know only the name /u/foo.

The information that bar exports a proper version of /u/foo is placed in Amd's mount-maps by system administrators who presumably ensure that the file system bar:/u/foo is a good version of whatever /u/foo should be. Therefore, we regard the information in the client's Amd mount-maps as authoritative, and consider any file system that the client might mount and place at /u/foo as a correct and complete copy of the file system. We call this file system the master copy, and use it for comparison against the replacement file systems that our mechanism locates and mounts.

The new open algorithm is shown in Figure 4. After a replacement has been mounted, whenever name resolution must be performed for any file on the replaced file system, the file system's DFT -- a per-file-system table recording the outcome of comparisons between individual files -- is first searched for the relative pathname. If the DFT indicates that the replacement file system has an equivalent copy of the file, then that file is used.


  
Figure 4: New Open Algorithm

open() {
 examine vfs_rep... to see whether a replacement file system is mounted;
 search the DFT for the file's relative pathname;
 if (entry exists and comparison succeeded)
  continue name resolution on replacement file system;
 else
  continue name resolution on master copy;
}

If the DFT contains an entry for the pathname, then the file on the replacement file system has already been compared to its counterpart on the master copy, and a field in the entry records whether the comparison succeeded. If it failed, the rest of the pathname must be resolved on the master copy. If it succeeded, the file on the replacement file system is used, and name resolution continues at the root of the replacement file system.

If the DFT contains no entry for the pathname, then it is unknown whether the file on the replacement file system is equivalent to the corresponding file on the master copy.

To test equivalence, au_lookuppn() calls out of the kernel to nfsmgrd, passing it the two host names, the name of the file system, and the relative pathname to be compared. A partial DFT entry is constructed, and a flag in it is turned on to indicate that there is a comparison in progress and that no other process should initiate the same comparison.[*]
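A DFT entry therefore records at least the relative pathname, the outcome of the comparison, and an in-progress flag. The routing decision made during name resolution can be sketched as follows (the entry layout and function name are ours, not the kernel's):

#include <string.h>

/* Sketch of a DFT entry and the routing decision made during name
 * resolution. */
enum dft_state { DFT_IN_PROGRESS, DFT_EQUIVALENT, DFT_DIFFERENT };

struct dft_entry {
    char             *path;      /* pathname relative to the mount point */
    enum dft_state    state;     /* comparison outcome, or still in progress */
    struct dft_entry *next;
};

/* Returns 1 to resolve on the replacement, 0 to resolve on the master copy,
 * -1 if a comparison must first be initiated for this pathname. */
int dft_route(struct dft_entry *head, const char *path)
{
    struct dft_entry *e;

    for (e = head; e != NULL; e = e->next) {
        if (strcmp(e->path, path) != 0)
            continue;
        if (e->state == DFT_EQUIVALENT)
            return 1;            /* equivalent copy exists on the replacement */
        return 0;                /* comparison failed or pending: use master */
    }
    return -1;                   /* no entry: equivalence is unknown */
}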


  
Figure 5: Flow of Control During File Comparison
[figure: flow-ctl-compare.ps]

Nfsmgrd then applies, at user level, whatever tests might be appropriate to determine whether the two files are equivalent. This flow of control is depicted in Figure 5. Presently, we are performing file checksum comparison: nfsmgrd calls a checksumd daemon on each of the file servers, requesting the checksum of the file being compared. Checksumd, which we have written for this work, computes MD4 [18] file checksums on demand and then stores them for later use; checksums can also be pre-computed and stored.

Nfsmgrd collects the two checksums, compares them, and responds to the kernel, telling au_lookuppn() which pathname to use, always indicating the file on the replacement file system if possible. Au_lookuppn() completes the construction of the DFT entry, unlocks it, and marks which vfs is the proper one to use whenever the same pathname is resolved again.
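The comparison reduces to fetching two digests and comparing them byte for byte. A sketch of nfsmgrd's side follows; fetch_checksum() is a hypothetical stand-in for the RPC that asks a remote checksumd for the MD4 digest of one file:

#include <string.h>

#define MD4_DIGEST_LEN 16

/* Hypothetical stand-in for the RPC to a remote checksumd. */
extern int fetch_checksum(const char *host, const char *path,
                          unsigned char digest[MD4_DIGEST_LEN]);

/* Return 1 if the two copies of `path' appear equivalent, 0 otherwise. */
int files_equivalent(const char *host1, const char *host2, const char *path)
{
    unsigned char d1[MD4_DIGEST_LEN], d2[MD4_DIGEST_LEN];

    if (fetch_checksum(host1, path, d1) != 0 ||
        fetch_checksum(host2, path, d2) != 0)
        return 0;                /* treat any failure as "not equivalent" */
    return memcmp(d1, d2, MD4_DIGEST_LEN) == 0;
}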

In this fashion, all new pathname resolutions are re-directed to the replacement file system whenever possible.

Note that the master copy could be unmounted (e.g., Amd by default unmounts a file system after a few minutes of inactivity), and this would not affect our mechanism. The next use of a file in that file system would cause some master copy to be automounted, before any of our code is encountered.

3.4.3 After Replacement: Handling Files Already Open

  When a file system is replaced, some files may be open on the replaced file system at the moment the replacement is mounted. Were the processes with these open files to continue using the replaced file system, several negative consequences might ensue. First, since the replacement is presumed to provide faster response, processes using files open on the replaced file system would experience worse service. Second, since the total number of mounted file systems grows as replacements happen, the probability rises that some file system eventually becomes unavailable and causes processes to block. Further, the incremental effect of each successive replacement is reduced, since files that are open long-term do not benefit from it. Finally, kernel data structures grow larger as the number of mounted file systems climbs. For these reasons, we decided to switch open files from the replaced file system to the replacement whenever the file on the replacement file system is equivalent to that on the master copy.

Although this idea might at first seem preposterous, it is not, since we restrict ourselves to read-only file systems. We assume that files on read-only file systems[*] change very infrequently and/or are updated with care to guard against inconsistent reads.[*] Whether operating conditions uphold this assumption or not, the problem of a file being updated[*] while being read exists independently of our work, and our work does not increase the danger.

We allow for a replacement file system to be itself replaced. This raises the possibility of creating a ``chain'' of replacement file systems. Switching vnodes from the old file system to its replacement limits this chain to length two (the master copy and the current replacement) in steady state.

The ``hot replacement'' code scans through the global open file table, keying on entries by vfs. Once an entry is found that uses the file system being replaced, a secondary scan locates all other entries using the same vnode. In a single entry into the kernel (i.e., ``atomically''), all file descriptors pointing to that vnode are switched, thereby avoiding complex questions of locking and reference counting.

Hot replacement requires knowing pathnames. Thanks to our changes, the vfs structure records the pathname it is mounted on and identifies the replacement file system; also, the relative pathname of the file is stored in the file table entry. This information is extracted, combined with the host names, and passed out to nfsmgrd to perform comparison, as described above. If the comparison is successful, the pathname on the replacement file system is looked up, yielding a vnode on the replacement file system. This vnode simply replaces the previous vnode in all entries in the open file table. This results in a switch the next time a process uses an open file descriptor.
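In outline, the sweep over the open file table looks like the following sketch; the structure layouts and the helper functions are simplified stand-ins for the kernel mechanisms described above:

#include <stddef.h>

/* Simplified stand-ins for the kernel structures and helpers. */
struct vfs;
struct vnode { struct vfs *v_vfsp; };
struct file  { struct vnode *f_vnode; char *f_path; };

extern int compare_on_servers(struct vfs *old_vfs, struct vfs *new_vfs,
                              const char *relpath);
extern struct vnode *lookup_on_replacement(struct vfs *new_vfs,
                                           const char *relpath);

void hot_replace(struct vfs *old_vfs, struct vfs *new_vfs,
                 struct file *file_table, int nfiles)
{
    struct vnode *vp, *newvp;
    int i, j;

    for (i = 0; i < nfiles; i++) {
        vp = file_table[i].f_vnode;
        if (vp == NULL || vp->v_vfsp != old_vfs)
            continue;                 /* not open on the replaced file system */
        if (!compare_on_servers(old_vfs, new_vfs, file_table[i].f_path))
            continue;                 /* copies differ: keep using the old vnode */
        newvp = lookup_on_replacement(new_vfs, file_table[i].f_path);
        if (newvp == NULL)
            continue;
        /* Secondary scan: switch every table entry sharing this vnode. */
        for (j = 0; j < nfiles; j++)
            if (file_table[j].f_vnode == vp)
                file_table[j].f_vnode = newvp;
    }
}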

Hot replacement is enabled by the statelessness of NFS and by the vfs/vnode interfaces within the kernel. Since the replaced server keeps no state about the client, and since the open file table knows only a pointer to a vnode, switching this pointer in every file table entry suffices to do hot replacement.

An interesting issue is at which time to perform the hot replacement of vnodes. Since each file requires a comparison to determine equivalence, switching vnodes of all the open files of a given file system could be a lengthy process. The four options we considered are:

1. Switch as soon as a replacement file system is mounted (the early approach).
2. Switch only if/when an RPC for that vnode hangs (the late approach).
3. Switch if/when the vnode is next used (the ``on-demand'' approach).
4. Switch whenever a daemon instructs it to (the ``flexible'' approach).

The decision to switch earlier or later involves a tradeoff: early switching moves files onto the faster file system sooner and improves fault tolerance by reducing the number of file systems in use, but it may waste effort, since switching a vnode that will never be used again accomplishes nothing. Early switching also has the disadvantage of placing the entire delay of switching onto the single file reference that is unlucky enough to be the next one.

We chose the ``flexible'' approach of having a daemon make a system call into the kernel which then sweeps through the open file table and replaces some of the vnodes which can be replaced. We made this choice for three reasons. First, we lacked data indicating how long a vnode lingers after its final use. Second, we suspected that such data, if obtained, would not conclusively decide the question in favor of an early or late approach. Third, the daemon solution affords much more flexibility, including the possibility of more ``intelligent'' decisions such as making the switch during an idle period.

We emphasize that the system call into the kernel switches ``some'' of the vnodes, since it may be preferable to bound the delay imposed on the system by one of these calls. Two such bounding policies that we have investigated are, first, switching only N vnodes per call, and, second, switching only vnodes that have been accessed in the past M time units. Assuming that file access is bursty (a contention supported by statistics [14]), the latter policy reduces the amount of time wasted switching vnodes that will never be used again. We are currently using this policy of switching only recently used vnodes; this policy makes use of the v_last_used field that we added to the vnode structure.
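The ``recently used'' bound amounts to a simple filter applied to each candidate vnode during the sweep; in sketch form (the constants and names are illustrative):

#include <time.h>

/* Sketch of the bounding policy: within one daemon-initiated call, switch
 * at most MAX_PER_CALL vnodes, and only those used within the last
 * RECENT_SECS seconds.  The constants and names are illustrative. */
#define MAX_PER_CALL 32
#define RECENT_SECS  60

/* Returns nonzero if this vnode should be switched in the current call. */
int should_switch(time_t v_last_used, int switched_so_far, time_t now)
{
    if (switched_so_far >= MAX_PER_CALL)
        return 0;                /* bound the delay imposed by one call */
    if (now - v_last_used > RECENT_SECS)
        return 0;                /* not recently used; may never be used again */
    return 1;
}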

3.5 Security

The NFS security model is the simple uid/gid borrowed from UNIX, and is appropriate only in a ``workgroup'' situation where there is a central administrative authority. Transporting a portable computer from one NFS user ID domain to another presents a security threat, since processes assigned user ID X in one domain can access exported files owned by user ID X in the second domain.

Accordingly, we have altered rfscall() so that every call to a replacement file system has its user ID and group ID both mapped to ``nobody'' (i.e., value -2). Therefore, only world-readable files on replacement file systems can be accessed.
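The change amounts to overriding the credentials attached to a call whenever its target is a replacement file system; a sketch follows (the structure and helper are simplified illustrations, not the actual SunOS credential code):

/* Sketch of the credential squashing applied to calls bound for a
 * replacement file system. */
#define UID_NOBODY (-2)
#define GID_NOBODY (-2)

struct cred_sketch { int cr_uid; int cr_gid; };

void squash_credentials_if_replacement(struct cred_sketch *cr,
                                       int to_replacement_fs)
{
    if (!to_replacement_fs)
        return;                  /* local workgroup servers: leave IDs as-is */
    cr->cr_uid = UID_NOBODY;     /* map both IDs to "nobody" */
    cr->cr_gid = GID_NOBODY;
}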

3.6 Code Size

  Counting blank lines, comments, and debugging support, we have written close to 11,000 lines of C. More than half is for user-level utilities: 1200 lines for the RLP library and daemon, 3200 for nfsmgrd, 700 for checksumd, and 1200 for a control utility (called nfsmgr_ctl). New kernel code totals 4000 lines, of which 800 are changes to SunOS, mostly in the NFS module. The remaining 3200 lines comprise the four modules we have added: 880 lines for storing and computing medians; 780 lines of ``core NFS management'' code, which performs file system switching, pathname storage and replacement, and out-of-kernel RPC; 540 lines to manage the DFT; and 1000 lines to support the nfsmgr_ctl system call.

The nfsmgr_ctl system call allows query and control over almost all data structures and parameters of the added facility. We chose a system call over a kmem program for security. This facility was used heavily during debugging; however, it is meant also for system administrators and other interested users who would like to change these ``magic'' variables to values more suitable for their circumstances.
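As an illustration of the kind of interface involved -- the command names and layout here are hypothetical, not the actual nfsmgr_ctl definition -- such a call can be modeled as a get/set dispatcher over the tunable parameters:

/* Hypothetical model of a control interface over the tunables mentioned
 * above (window lengths, switch ratio); not the actual nfsmgr_ctl ABI. */
enum nfsmgr_ctl_cmd {
    NFSMGR_GET_SHORT_WIN, NFSMGR_SET_SHORT_WIN,
    NFSMGR_GET_LONG_WIN,  NFSMGR_SET_LONG_WIN,
    NFSMGR_GET_RATIO,     NFSMGR_SET_RATIO
};

static int short_win = 30, long_win = 300, switch_ratio = 2;

int nfsmgr_ctl_model(enum nfsmgr_ctl_cmd cmd, int *value)
{
    switch (cmd) {
    case NFSMGR_GET_SHORT_WIN: *value = short_win;    return 0;
    case NFSMGR_SET_SHORT_WIN: short_win = *value;    return 0;
    case NFSMGR_GET_LONG_WIN:  *value = long_win;     return 0;
    case NFSMGR_SET_LONG_WIN:  long_win = *value;     return 0;
    case NFSMGR_GET_RATIO:     *value = switch_ratio; return 0;
    case NFSMGR_SET_RATIO:     switch_ratio = *value; return 0;
    }
    return -1;                   /* unknown command */
}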

