File System Benchmarking Tools and Techniques

Benchmarking is critical when evaluating performance, but it is especially difficult for file and storage systems. Complex interactions between I/O devices, caches, kernel daemons, and other OS components result in behavior that is difficult to analyze. Moreover, systems have different features and optimizations, so no single benchmark is always suitable. The large variety of workloads that these systems experience in the real world also adds to this difficulty.

We have found that some of the most commonly used benchmarks are flawed, and many research papers do not provide a clear enough picture of file system performance. We believe that a good performance evaluation should use micro-benchmarks to highlight both the good and bad qualities of a file system, as well as general-purpose benchmarks or traces to indicate how it would perform under expected and realistic workloads. Care should be taken, however, to ensure that general-purpose benchmarks accurately reflect real-world workloads. In addition, benchmarks should scale well, and results should be reproducible and comparable across papers.

In this project, we survey file system benchmarks used in many recent research papers. We found that no single benchmark adequately measures file system performance. We show how some commonly accepted and widely used benchmarks and benchmarking techniques can easily conceal overheads, unfairly over-emphasize overheads, or in general emphasize or de-emphasize many of a file system's properties. We offer suggestions on how to create and conduct benchmarks so that they provide a fairer and more accurate picture of file system performance.
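As a small illustration of the reproducibility point, the sketch below repeats a write micro-benchmark several times and reports the spread of the results rather than a single number. It is not one of the project's tools: the function names (`time_sequential_write`, `run_benchmark`), the 1 MiB file size, the 4 KiB block size, and the choice of five runs are all assumptions made for this example.

```python
import os
import statistics
import tempfile
import time


def time_sequential_write(path, total_bytes=1 << 20, block_size=4096):
    """Time one sequential write of total_bytes to path, including fsync."""
    block = b"\0" * block_size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_bytes // block_size):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # count the cost of flushing to the device
    return time.perf_counter() - start


def run_benchmark(runs=5, total_bytes=1 << 20, block_size=4096):
    """Repeat the micro-benchmark and summarize the elapsed times."""
    times = []
    # Hypothetical setup: a fresh temporary directory per benchmark session.
    with tempfile.TemporaryDirectory() as tmpdir:
        for i in range(runs):
            path = os.path.join(tmpdir, f"bench_{i}.dat")
            times.append(time_sequential_write(path, total_bytes, block_size))
    return {
        "runs": runs,
        "mean_s": statistics.mean(times),
        "stdev_s": statistics.stdev(times) if runs > 1 else 0.0,
        "min_s": min(times),
        "max_s": max(times),
    }


if __name__ == "__main__":
    print(run_benchmark())
```

Reporting the standard deviation and range alongside the mean makes it easier to tell whether a difference between two configurations exceeds run-to-run noise. A real evaluation would also need to control cache state and background activity between runs, which this sketch does not attempt.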

The primary aim of this project is to describe our views on the future of file system benchmarking. To that end, we have been developing several technologies: fine-grained file system tracing, efficient file system replaying, automated file system benchmarking tools, and low-overhead tools for detailed visualization of file system behavior.

Journal Articles:

# Title Formats Published In Date Comments
1 Unifying Biological Image Formats with HDF5 BibTeX Communications of the ACM Oct 2009  
2 Notes on a Nine Year Study of File System and Storage Benchmarking BibTeX Byte and Switch Jul 2009  
3 A Nine Year Study of File System and Storage Benchmarking PS PDF BibTeX ACM Transactions on Storage (TOS) May 2008 Online data appendix

Conference and Workshop Papers:

# Title Formats Published In Date Comments
1 Virtual Machine Workloads: The Case for New Benchmarks for NAS PDF BibTeX 11th USENIX Conference on File and Storage Technologies (FAST 2013) Feb 2013  
2 Generating Realistic Datasets for Deduplication Analysis PS PDF BibTeX 2012 USENIX Annual Technical Conference (ATC 2012) Jun 2012  
3 Extracting Flexible, Replayable Models from Large Block Traces PS PDF BibTeX Tenth USENIX Conference on File and Storage Technologies (FAST 2012) Feb 2012  
4 Benchmarking File System Benchmarking: It *IS* Rocket Science PS PDF BibTeX 13th USENIX Workshop on Hot Topics in Operating Systems (HotOS XIII) May 2011  
5 Accurate and Efficient Replaying of File System Traces PS PDF BibTeX Fourth USENIX Conference on File and Storage Technologies (FAST 2005) Dec 2005  
6 Auto-pilot: A Platform for System Software Benchmarking PS PDF BibTeX USENIX Technical Conference, FREENIX Track Apr 2005  
7 Tracefs: A File System to Trace Them All PS PDF BibTeX Third USENIX Conference on File and Storage Technologies (FAST 2004) Apr 2004  

Technical Reports:

# Title Formats Published In Date Comments
1 A Context Aware Block Layer: The Case for Block Layer Deduplication PDF BibTeX Stony Brook U. CS TechReport FSL-12-04 May 2012 M.S. Thesis
2 A Nine Year Study of File System and Storage Benchmarking PS PDF BibTeX Stony Brook U. CS TechReport FSL-07-01 May 2007 Online data appendix
3 Versatile File System Tracing with Tracefs PS PDF BibTeX Stony Brook U. CS TechReport FSL-04-05 Aug 2004 M.S. Thesis

Current Students:

# Name Program Member Since
1 Sonam Mandal MS Jun 2013

Past Students:

# Name Program Period Current Location
1 Nikolai Joukov PhD Jan 2004 - Dec 2006 Research Staff Member, Storage and Data Services Research group, IBM T. J. Watson Research Center (Hawthorne, NY)
2 Vasily Tarasov PhD Jan 2008 - Nov 2013 Research Staff Member, Scale-out Storage Software, IBM Research - Almaden (San Jose, USA)
3 Avishay Traeger PhD Sep 2003 - Aug 2008 R&D, Stratoscale (Herzeliya, Israel)
4 Charles P. Wright PhD May 2003 - May 2006 Application Software Developer, Walleye Software (New York, NY)
5 Akshat Aranya MS May 2003 - Aug 2004 Associate Research Staff Member, NEC Labs America (Princeton, New Jersey)
6 Sujay Godbole MS Sep 2008 - Dec 2009 Member of Technical Staff, Core Storage Group (ESX), VMware Inc. (Cambridge, MA)
7 Deepak Jain MS Sep 2012 - Dec 2013 Member of Technical Staff, Project FVP - Engineering, Pernixdata Inc (San Jose, USA)
8 Koundinya Santhosh Kumar MS Sep 2010 - Dec 2011 Software Engineer, Fusion-IO (Alviso, CA)
9 Amar Mudrankit MS Jan 2011 - May 2012 Software Engineer, Advanced Development Group at Fusion-IO (San Jose, CA)
10 Karthikeyani Palanisami MS May 2012 - Jun 2013 Member of Technical Staff, Project MARS - Engineering, NetApp Inc (Sunnyvale, USA)
11 Gyumin Sim MS Jan 2010 - Dec 2010 Software Engineer, Data Center Power Team Google (Mountain View, CA)
12 Sagar Trehan MS Sep 2012 - Dec 2013 Member of Technical Staff, CASL Performance Group - Engineering, Nimble Storage Inc (San Jose, USA)
13 Tim Wong BS Dec 2004 - Jun 2005 Associate, Volatility Arbitrage, Global Asset Allocation, Applied Quantitative Research (Greenwich, CT)

Sponsors:

# Sponsor Amount Period Type Title
1 NSF Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) $444,267 2013-2016 Lead-PI BIGDATA: Small: DCM: Collaborative Research: An efficient, versatile, scalable, and portable storage system for scientific data containers
2 NetApp Advanced Technology Group $40,000 2011 Sole PI Dedup Workload Modeling, Synthetic Datasets, and Scalable Benchmarking
3 NSF HECURA $760,253 2006-2009 PI File System Tracing, Replaying, Profiling, and Analysis on HEC Systems
4 NSF Trusted Computing (TC) $400,000 2003-2006 Sole PI A Layered Approach to Securing Network File Systems


(Last updated: Tue Apr 1 16:53:19 EDT 2014)