The FSL-er's Guide to File System and Storage Benchmarking

File systems and Storage Laboratory (FSL)

Version 2 (May 4, 2007)

This page provides a set of guidelines to consider when evaluating the performance of a file system. This information was collected by Avishay Traeger, Nikolai Joukov, Charles P. Wright, and Erez Zadok. Our motivation is to improve the quality of performance evaluations presented in papers. If you are interested in more information, see the File Systems Benchmarking Project home page. If you have any comments on this page, please send them to Erez Zadok.

The two underlying themes are:

  1. Explain what you did in as much detail as possible:

    For example, if you decided to create your own benchmark, please detail what you have done. If you are replaying traces, describe where they came from, how they were captured, and how you are replaying them (what tool? what speed?). This can help others understand and validate your results.

  2. In addition to saying what you did, say why you did it that way:

    For example, while it is important to note that you are using ext2 as a baseline for your analysis, it is just as important (or perhaps even more important) to discuss why it is a fair comparison. Similarly, it is useful for the reader to know why you ran that random-read benchmark so that they know what conclusions to draw from the results.

1. Choosing The Benchmark Configurations

1.1 Pose questions that will reveal the performance characteristics of the system

Some examples are "How does my system compare to current, similar systems?", "How does my system behave under its expected workload?", and "What are the causes of my performance improvements or overheads?"

1.2 Decide on what baseline systems, system configurations, and benchmarks should be used to best answer the questions posed

This will produce a set of <system, configuration, benchmark> tuples that will need to be run. It is desirable for the researcher to have a rough idea of what the expected results might be for each configuration at this point; if the actual results differ from these expectations, then the causes of the deviations should be investigated.
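
The set of tuples can be enumerated mechanically, which also makes it harder to accidentally omit a combination. The following Python sketch uses hypothetical system, configuration, and benchmark names; not every combination need be meaningful, so the list would be filtered as appropriate before running:

    from itertools import product

    # Hypothetical placeholders for the systems, configurations, and
    # benchmarks chosen in steps 1.1 and 1.2.
    systems = ["ext2", "my-crypt-fs"]
    configurations = ["default", "null-cipher"]
    benchmarks = ["postmark", "random-read", "sequential-read"]

    for system, config, benchmark in product(systems, configurations, benchmarks):
        print("run: system=%s, configuration=%s, benchmark=%s"
              % (system, config, benchmark))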

Since a system's performance is generally more meaningful when compared to the performance of existing technology, one should find existing systems that provide fair and interesting comparisons. For example, when benchmarking an encrypted storage device, it would be useful to compare its performance to that of other encrypted storage devices, a traditional (unencrypted) device, and perhaps some alternative implementations (user-space, file system, etc.).

The system under test may have several configurations that will need to be evaluated in turn. In addition, one may create artificial configurations where a component of the system is removed to determine its overhead. For example, in an encryption file or storage system, you can use a null cipher (one that only copies the data) rather than a real cipher, to isolate the overhead of encryption. Determining the cause of overheads may also be done using profiling techniques. Showing this incremental breakdown of performance numbers helps the reader to better understand a system's behavior.
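
As an illustration of such a breakdown, the Python sketch below uses made-up elapsed times for a hypothetical ext2 baseline, a null-cipher configuration, and a full-encryption configuration, and attributes the total overhead to the framework and to the cipher:

    # Hypothetical elapsed times (seconds); replace with measured values.
    baseline    = 100.0   # e.g., ext2
    null_cipher = 112.0   # system under test with a null (copy-only) cipher
    full_cipher = 130.0   # system under test with real encryption

    framework_overhead = (null_cipher - baseline) / baseline * 100.0
    cipher_overhead    = (full_cipher - null_cipher) / baseline * 100.0
    total_overhead     = (full_cipher - baseline) / baseline * 100.0

    print("framework overhead:  %.1f%%" % framework_overhead)
    print("encryption overhead: %.1f%%" % cipher_overhead)
    print("total overhead:      %.1f%%" % total_overhead)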

1.3 Choose the benchmarks

There are three main types of benchmarks: macro-benchmarks, micro-benchmarks, and trace replays. Useful file system benchmarks should highlight high-level as well as low-level performance. Therefore, we recommend using at least one macro-benchmark or trace replay to show a high-level view of performance, along with several micro-benchmarks to highlight more focused views. In addition, several properties of each workload should be considered and reported, so that readers know what the workload does and does not exercise.

2. Choosing The Benchmarking Environment

The state of the system during the benchmark runs can have a significant effect on the results. After determining an appropriate state, it should be set up accurately and reported along with the results.
  • The state of the system's caches can affect the code paths that are exercised and thus affect benchmark results. It is not always clear whether benchmarks should be run with "warm" or "cold" caches. On one hand, real systems do not generally run with completely cold caches. On the other hand, a benchmark that accesses too much cached data may be unrealistic as well: because requests are mainly serviced from memory, the file or storage system is not adequately exercised. Further, not bringing the cache back to a consistent state between runs can cause timing inconsistencies. If cold-cache results are desired, caches should be cleared before each run. This can be done by allocating and freeing large amounts of memory, remounting the file system, reloading the storage driver, or rebooting; we have found that rebooting is more effective than the other methods [4] (a sketch of remounting between runs appears after this list). When working in an environment with multiple machines, the caches on all necessary machines must be cleared. This helps create identical runs, thus ensuring more stable results. If, however, warm-cache results are desired, they can be obtained by running the experiment N+1 times and discarding the first run's result.
  • Most modern disks use Zoned Constant Angular Velocity (ZCAV) to store data. In this design, the cylinders are divided into zones, and the number of sectors per cylinder increases with the distance from the center of the disk. Because of this, the transfer rate varies from zone to zone [2] (a sketch for measuring this variation appears after this list). It has been recommended to minimize ZCAV effects by creating a partition of the smallest possible size on the outside of the disk [1]. However, this makes results less realistic and may not be appropriate for all benchmarks (for example, long seeks may be necessary to show the effectiveness of the system). We recommend simply specifying the location of the test partition in the paper, to help reproducibility.
  • Most file system and storage benchmarks are run on an empty system, which can make the results differ from those in a real-world setting. A system may be aged by running a workload based on system snapshots [3]. Other ways to age a system before running a benchmark are to run a long-term workload, copy an existing raw image, or replay a trace (a crude aging sketch appears after this list). It should be noted that for some systems and benchmarks, aging is not a concern. For example, aging has no effect when replaying a block-level trace on a traditional storage device, since the benchmark behaves identically regardless of the disk's contents.
  • To ensure the reproducibility of results, all non-essential services and processes should be stopped before running the benchmark. Such processes can cause anomalous results (outliers) or higher-than-normal standard deviations across a set of runs. However, processes such as cron will coexist with the system when it is used in the real world, so it must be understood that these results are measured in a sterile environment. Ideally, we would be able to demonstrate performance with the interactions of other processes present, but this is difficult because the set of processes is specific to a machine's configuration. Instead, we recommend using multi-threaded workloads, because they more accurately depict a real system, which normally has several active processes. In addition, we recommend ensuring that no users log in to the test machines during a benchmark run, and that no other traffic is consuming your network bandwidth while running benchmarks that involve the network.
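
The sketch below illustrates one of the cold-cache methods mentioned above (remounting the file system between runs), driven from Python; the device, mount point, and benchmark command are hypothetical, and rebooting or reloading the storage driver would require a different mechanism:

    import subprocess

    DEVICE = "/dev/sdb1"        # hypothetical test device
    MOUNT_POINT = "/mnt/test"   # hypothetical mount point
    RUNS = 10

    def remount():
        # Unmount and remount the test file system to discard cached data.
        subprocess.check_call(["umount", MOUNT_POINT])
        subprocess.check_call(["mount", DEVICE, MOUNT_POINT])

    for run in range(RUNS):
        remount()   # start each run with a cold cache
        subprocess.check_call(["./run-benchmark.sh", MOUNT_POINT])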
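
The next sketch shows one way to observe the ZCAV effect by timing large sequential reads near the beginning and near the end of a raw disk; the device name is hypothetical, the script needs sufficient privileges, and page-cache and readahead effects are ignored for simplicity:

    import os
    import time

    DEVICE = "/dev/sdb"            # hypothetical raw device (whole disk)
    CHUNK = 1024 * 1024            # 1 MB per read()
    SAMPLE = 64 * 1024 * 1024      # read 64 MB per sample

    def throughput(offset):
        # Return the observed sequential read rate (MB/s) starting at offset.
        fd = os.open(DEVICE, os.O_RDONLY)
        os.lseek(fd, offset, os.SEEK_SET)
        start = time.time()
        remaining = SAMPLE
        while remaining > 0:
            data = os.read(fd, CHUNK)
            if not data:
                break
            remaining -= len(data)
        elapsed = time.time() - start
        os.close(fd)
        return (SAMPLE - remaining) / (1024.0 * 1024.0) / elapsed

    fd = os.open(DEVICE, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)   # total device size in bytes
    os.close(fd)

    print("start of disk: %.1f MB/s" % throughput(0))
    print("end of disk:   %.1f MB/s" % throughput(size - SAMPLE))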
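
Finally, a very crude way to age a fresh file system before a benchmark is to create and delete many files of varying sizes, as in the sketch below; the path and sizes are hypothetical, and workloads derived from real snapshots [3] or traces are far more faithful:

    import os
    import random

    TEST_DIR = "/mnt/test/aging"   # hypothetical directory on the test file system
    FILES = 10000

    os.makedirs(TEST_DIR, exist_ok=True)
    random.seed(42)                # fixed seed so every benchmark run ages identically
    for i in range(FILES):
        path = os.path.join(TEST_DIR, "f%06d" % i)
        with open(path, "wb") as f:
            f.write(b"\0" * random.randint(1024, 64 * 1024))
        if random.random() < 0.5:  # delete about half of the files to create gaps
            os.remove(path)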

3. Running The Benchmarks

We recommend four important guidelines for running benchmarks properly. First, one should ensure that every benchmark run is identical. Second, each test should be run several times to ensure accuracy, and standard deviations or confidence intervals should be computed to determine the appropriate number of runs. Third, tests should be run for a period long enough that the system reaches steady state for the majority of the run. Fourth, the benchmarking process should preferably be automated using scripts or available tools such as Auto-pilot [4] to minimize the mistakes associated with repetitive manual tasks.
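
As a small illustration of the second and fourth guidelines, the Python sketch below runs a hypothetical benchmark script several times and reports the mean, the sample standard deviation, and an approximate 95% confidence half-width; in practice, a tool such as Auto-pilot [4] automates this, including deciding when enough runs have been collected:

    import math
    import subprocess
    import time

    RUNS = 10
    times = []
    for run in range(RUNS):
        start = time.time()
        subprocess.check_call(["./run-benchmark.sh"])   # hypothetical benchmark script
        times.append(time.time() - start)

    mean = sum(times) / len(times)
    variance = sum((t - mean) ** 2 for t in times) / (len(times) - 1)
    stddev = math.sqrt(variance)
    # Normal approximation; a Student's t value is more appropriate for few runs.
    half_width = 1.96 * stddev / math.sqrt(len(times))

    print("mean %.2f s, stddev %.2f s, 95%% CI +/- %.2f s"
          % (mean, stddev, half_width))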

4. Presenting The Results

Once results are obtained, they should be presented appropriately so that accurate conclusions may be drawn from them. Aside from the data itself, the benchmark configurations and environment should be described accurately. Proper graphs should be displayed, with error bars where applicable.
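
For example, a simple bar graph with error bars (here, one standard deviation) can be produced with matplotlib; the labels and numbers below are made up:

    import matplotlib.pyplot as plt

    labels = ["ext2", "null cipher", "full cipher"]   # hypothetical configurations
    means = [100.0, 112.0, 130.0]                     # mean elapsed time (seconds)
    stddevs = [1.2, 1.5, 2.1]                         # standard deviation across runs

    positions = range(len(labels))
    plt.bar(positions, means, yerr=stddevs, capsize=4)
    plt.xticks(positions, labels)
    plt.ylabel("Elapsed time (seconds)")
    plt.savefig("results.png")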

5. Validating Results

Other researchers may wish to benchmark your software for two main reasons:

  1. to reproduce or confirm your results, or
  2. to compare their system to yours.
First, it is considered good scientific practice to provide enough information for others to validate your results. This includes detailed hardware and software specifications for the testbeds. Although it is usually not practical to include such a large amount of information in a conference paper, it can be published in an online appendix. While it can be difficult for researchers to accurately validate another's results without the exact testbed, it is still possible to see whether the results generally correlate.

Second, there may be a case where a researcher creates a system with properties similar to yours (e.g., both are encryption file systems). It would be logical for the researcher to compare the two systems. However, if your paper showed an X% overhead over ext2, and the new file system has a Y% overhead over ext2, no claim can be made about which of the two file systems is better, because the benchmarking environments are different. The researcher should benchmark both research file systems using a setup that is as similar as possible to that of the original benchmark. This way, both file systems are tested under the same conditions. Moreover, since the researcher is running the benchmark in the same way that you did, no claim can be made that they chose a specific case in which their file system might perform better.

To help address both issues, provide detailed descriptions of your benchmarking configurations and environment (for example, in an online appendix), and make your software available to other researchers where possible.

References

[1] Ellard, D. and Seltzer, M. 2003. NFS Tricks and Benchmarking Traps. In Proceedings of the Annual USENIX Technical Conference. USENIX Association, San Antonio, TX, 101–114.

[2] Meter, R. V. 1997. Observing the Effects of Multi-Zone Disks. In Proceedings of the Annual USENIX Technical Conference. USENIX Association, Anaheim, CA, 19–30.

[3] Smith, K. A. and Seltzer, M. I. 1997. File System Aging — Increasing the Relevance of File System Benchmarks. In Proceedings of the 1997 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM SIGOPS, Seattle, WA, 203–213.

[4] Wright, C. P., Joukov, N., Kulkarni, D., Miretskiy, Y., and Zadok, E. 2005. Auto-pilot: A Platform for System Software Benchmarking. In Proceedings of the Annual USENIX Technical Conference, FREENIX Track. USENIX Association, Anaheim, CA, 175–187.