We have found that some of the most commonly used benchmarks are flawed, and that many research papers do not provide a clear enough picture of file system performance. We believe that a good performance evaluation should use micro-benchmarks to highlight both the good and bad qualities of a file system, as well as general-purpose benchmarks or traces to indicate how it would perform under expected, realistic workloads. Care should nevertheless be taken to ensure that general-purpose benchmarks accurately reflect real-world workloads. In addition, benchmarks should scale well, and results should be reproducible and comparable across papers.
In this project, we survey file system benchmarks used in many recent research papers. We found that no single benchmark adequately measures file system performance. We show how some commonly accepted and widely used benchmarks and benchmarking techniques can easily conceal overheads, unfairly over-emphasize them, or more generally emphasize or de-emphasize many of a file system's properties. We offer suggestions on how to create and conduct benchmarks so that they provide a fairer and more accurate picture of file system performance.
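To illustrate how a benchmark can conceal or over-emphasize an overhead, the following minimal sketch computes the slowdown a benchmark would report when only the file system portion of its run time gets slower. The function name, the two-part time model, and the example fractions are assumptions made for illustration; they are not measurements or methods from the papers listed on this page.

```python
def apparent_overhead(fs_fraction: float, fs_slowdown: float) -> float:
    """Relative increase in total elapsed time of a benchmark run.

    Hypothetical model: baseline elapsed time is normalized to 1 and split
    into a file system share (fs_fraction) and everything else (CPU, etc.).
    Only the file system share is slowed down, by fs_slowdown
    (0.5 means the file system operations take 50% longer).
    """
    if not 0.0 <= fs_fraction <= 1.0:
        raise ValueError("fs_fraction must be between 0 and 1")
    if fs_slowdown < 0.0:
        raise ValueError("fs_slowdown must be non-negative")
    # New total = (1 - f) + f * (1 + s) = 1 + f * s, so the overhead is f * s.
    return fs_fraction * fs_slowdown


if __name__ == "__main__":
    slowdown = 0.5  # assumed: file system operations take 50% longer
    # Assumed workload mixes, chosen only to show the two extremes.
    for label, fraction in [("CPU-bound workload", 0.05),
                            ("I/O-bound micro-benchmark", 0.95)]:
        print(f"{label}: {apparent_overhead(fraction, slowdown):.1%} reported overhead")
```

Under these assumptions, the same file system slowdown looks negligible in a workload that spends little time in the file system and large in one that spends almost all of its time there, which is why a single benchmark can give a misleading picture.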
In this project, we primarily describe our views on the future of file system benchmarking. To that end, we have been developing several technologies: fine-grained file system tracing, efficient file system replaying, automated file system benchmarking tools, and low-overhead tools for detailed visualization of file system behavior.

| # | Title | Formats | Published In | Date | Comments |
|---|---|---|---|---|---|
| 1 | Unifying Biological Image Formats with HDF5 | BibTeX | Communications of the ACM | Oct 2009 | |
| 2 | Notes on a Nine Year Study of File System and Storage Benchmarking | BibTeX | Byte and Switch | Jul 2009 | |
| 3 | A Nine Year Study of File System and Storage Benchmarking | PS PDF BibTeX | ACM Transactions on Storage (TOS) | May 2008 | Online data appendix |

| # | Title | Formats | Published In | Date | Comments |
|---|---|---|---|---|---|
| 1 | A Context Aware Block Layer: The Case for Block Layer Deduplication | PDF BibTeX | Stony Brook U. CS TechReport FSL-12-04 | May 2012 | M.S. Thesis |
| 2 | A Nine Year Study of File System and Storage Benchmarking | PS PDF BibTeX | Stony Brook U. CS TechReport FSL-07-01 | May 2007 | Online data appendix |
| 3 | Versatile File System Tracing with Tracefs | PS PDF BibTeX | Stony Brook U. CS TechReport FSL-04-05 | Aug 2004 | M.S. Thesis |

| # | Name | Program | Member Since |
|---|---|---|---|
| 1 | Vasily Tarasov | PhD | Jan 2008 |
| 2 | Deepak Jain | MS | Sep 2012 |
| 3 | Karthikeyani Palanisami | MS | May 2012 |
| 4 | Sagar Trehan | MS | Sep 2012 |

| # | Name | Program | Period | Current Location |
|---|---|---|---|---|
| 1 | Nikolai Joukov | PhD | Jan 2004 - Dec 2006 | Research Staff Member, Storage and Data Services Research group, IBM T. J. Watson Research Center (Hawthorne, NY) |
| 2 | Avishay Traeger | PhD | Sep 2003 - Aug 2008 | Research Staff Member, Storage Systems and Performance Management group, IBM Tel Aviv Research Lab (Tel-Aviv, Israel) |
| 3 | Charles P. Wright | PhD | May 2003 - May 2006 | Software Developer, Eladian Partners, LLC (New York, NY) |
| 4 | Akshat Aranya | MS | May 2003 - Aug 2004 | Associate Research Staff Member, NEC Labs America (Princeton, New Jersey) |
| 5 | Sujay Godbole | MS | Sep 2008 - Dec 2009 | Member of Technical Staff, Core Storage Group (ESX), VMware Inc. (Cambridge, MA) |
| 6 | Koundinya Santhosh Kumar | MS | Sep 2010 - Dec 2011 | Software Engineer, Fusion-IO (Alviso, CA) |
| 7 | Amar Mudrankit | MS | Jan 2011 - May 2012 | Software Engineer, Advanced Development Group at Fusion-IO (San Jose, CA) |
| 8 | Gyumin Sim | MS | Jan 2010 - Dec 2010 | Software Engineer, Data Center Power Team, Google (Mountain View, CA) |
| 9 | Tim Wong | BS | Dec 2004 - Jun 2005 | Associate, Volatility Arbitrage, Global Asset Allocation, Applied Quantitative Research (Greenwich, CT) |

| # | Sponsor | Amount | Period | Type | Title |
|---|---|---|---|---|---|
| 1 | NetApp Advanced Technology Group | $40,000 | 2011 | Sole PI | Dedup Workload Modeling, Synthetic Datasets, and Scalable Benchmarking |
| 2 | NSF HECURA | $760,253 | 2006-2009 | PI | File System Tracing, Replaying, Profiling, and Analysis on HEC Systems |
| 3 | NSF Trusted Computing (TC) | $400,000 | 2003-2006 | Sole PI | A Layered Approach to Securing Network File Systems |