File System and Storage Benchmarking Tools and Techniques

Benchmarking is critical when evaluating performance, but it is especially difficult for file and storage systems. Complex interactions between I/O devices, caches, kernel daemons, and other OS components result in behavior that is difficult to analyze. Moreover, systems have different features and optimizations, so no single benchmark is always suitable. The large variety of workloads that these systems experience in the real world also adds to this difficulty.

We have found that some of the most commonly used benchmarks are flawed, and many research papers do not provide a clear enough picture of file system performance. We believe that a good performance evaluation should use micro-benchmarks to highlight both the good and bad qualities of a file system, as well as general-purpose benchmarks or traces to give an idea about how it would perform under expected and realistic workloads. Nevertheless, care should be taken to ensure that general-purpose benchmarks indeed accurately reflect the real-world workloads. In addition, benchmarks should scale well, and results should be reproducible and comparable across papers.

In this project, we survey file system benchmarks used in many recent research papers. We found that no single benchmark adequately measures file system performance. We show how some commonly accepted and widely used benchmarks and benchmarking techniques can easily conceal overheads, unfairly over-emphasize overheads, or more generally emphasize or de-emphasize many of a file system's properties. We offer suggestions on how to create and conduct benchmarks so that they provide a fairer and more accurate picture of file system performance.
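As one illustration of what reproducible and comparable results can involve, the sketch below repeats a small micro-benchmark several times and reports the mean, standard deviation, and a confidence interval instead of a single number. It is a minimal sketch under stated assumptions, not the project's own tooling: the function names `time_write_fsync` and `summarize`, the 1 MiB write size, the run count, and the use of a normal approximation for the interval are all hypothetical choices made for this example.

```python
import math
import os
import statistics
import tempfile
import time


def time_write_fsync(directory, size_bytes=1 << 20):
    """Time one write of size_bytes to a new file in directory, including fsync.

    Hypothetical micro-benchmark: it measures only a single sequential
    write plus fsync, so it says nothing about reads, metadata, or caching.
    """
    payload = os.urandom(size_bytes)
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force data to stable storage before stopping the clock
            elapsed = time.perf_counter() - start
    finally:
        os.unlink(path)
    return elapsed


def summarize(samples, confidence=0.95):
    """Return (mean, sample standard deviation, confidence-interval half-width).

    Assumption: uses a normal approximation; with few runs a Student's t
    quantile would give a wider, more conservative interval.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two runs to estimate variation")
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev / math.sqrt(n)
    return mean, stdev, half_width


if __name__ == "__main__":
    runs = 10  # hypothetical run count
    with tempfile.TemporaryDirectory() as workdir:
        samples = [time_write_fsync(workdir) for _ in range(runs)]
    mean, stdev, half_width = summarize(samples)
    print(f"runs={runs} mean={mean:.6f}s stdev={stdev:.6f}s "
          f"95% CI=+/-{half_width:.6f}s")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between two systems is larger than run-to-run noise. A sketch like this covers only one narrow operation and would need to be complemented by general-purpose benchmarks or traces.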

We describe our views on the future of file system benchmarking. To that end, we have been developing several technologies: fine-grained file system tracing, efficient file system replaying, automated file system benchmarking tools, and low-overhead detailed file system behavior visualization tools.

This project now extends to evaluating new multi-dimensional optimization techniques for storage systems, with Big Data data sets as one key scientific case study; we often use large HDF5-based data sets (many gigabytes to a few terabytes).

Journal Articles:

# Title Formats Published In Date Comments
1 Performance and Resource Utilization of FUSE User-Space File Systems PDF BibTeX ACM Transactions on Storage (TOS) May 2019 FUSE Article Online Appendices
2 Cluster and Single-Node Analysis of Long-Term Deduplication Patterns PDF BibTeX ACM Transactions on Storage (TOS) May 2018  
3 vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O PDF BibTeX ACM Transactions on Storage (TOS) Sep 2017  
4 Filebench: A Flexible Framework for File System Benchmarking BibTeX ;login: The USENIX Magazine Mar 2016  
5 Unifying Biological Image Formats with HDF5 BibTeX Communications of the ACM Oct 2009  
6 Notes on a Nine Year Study of File System and Storage Benchmarking BibTeX Byte and Switch Jul 2009  
7 A Nine Year Study of File System and Storage Benchmarking PS PDF BibTeX ACM Transactions on Storage (TOS) May 2008 Online data appendix

Conference and Workshop Papers:

# Title Formats Published In Date Comments
1 Towards Better Understanding of Black-box Auto-Tuning: A Comparative Analysis for Storage Systems PDF BibTeX 2018 USENIX Annual Technical Conference (ATC 2018) Jul 2018 Data Set released as part of this paper.
2 On the Performance Variation in Modern Storage Stacks PDF BibTeX 15th USENIX Conference on File and Storage Technologies (FAST 2017) Feb 2017  
3 To FUSE or Not to FUSE: Performance of User-Space File Systems PDF BibTeX 15th USENIX Conference on File and Storage Technologies (FAST 2017) Feb 2017 See sources related to paper
4 vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O PDF BibTeX 15th USENIX Conference on File and Storage Technologies (FAST 2017) Feb 2017 Nominated for best paper award
5 A Long-Term User-Centric Analysis of Deduplication Patterns PDF BibTeX 32nd IEEE Conference on Mass Storage Systems and Technologies (MSST 2016) May 2016  
6 Using Hints to Improve Inline Block-Layer Deduplication PDF BibTeX 14th USENIX Conference on File and Storage Technologies (FAST 2016) Feb 2016  
7 Parametric Optimization of Storage Systems PDF BibTeX 7th USENIX Workshop in Hot Topics in Storage and File Systems (HotStorage 2015) Jul 2015  
8 Newer Is Sometimes Better: An Evaluation of NFSv4.1 PDF BibTeX International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2015) Jun 2015  
9 Dmdedup: Device-mapper Deduplication Target PDF BibTeX 2014 Ottawa Linux Symposium Jul 2014  
10 Virtual Machine Workloads: The Case for New Benchmarks for NAS PDF BibTeX 11th USENIX Conference on File and Storage Technologies (FAST 2013) Feb 2013  
11 Generating Realistic Datasets for Deduplication Analysis PS PDF BibTeX 2012 USENIX Annual Technical Conference (ATC 2012) Jun 2012  
12 Extracting Flexible, Replayable Models from Large Block Traces PS PDF BibTeX Tenth USENIX Conference on File and Storage Technologies (FAST 2012) Feb 2012  
13 Benchmarking File System Benchmarking: It *IS* Rocket Science PS PDF BibTeX 13th USENIX Workshop in Hot Topics in Operating Systems (HotOS XIII) May 2011  
14 Accurate and Efficient Replaying of File System Traces PS PDF BibTeX Fourth USENIX Conference on File and Storage Technologies (FAST 2005) Dec 2005  
15 Auto-pilot: A Platform for System Software Benchmarking PS PDF BibTeX USENIX Technical Conference, FREENIX Track Apr 2005  
16 Tracefs: A File System to Trace Them All PS PDF BibTeX Third USENIX Conference on File and Storage Technologies (FAST 2004) Apr 2004  

Technical Reports:

# Title Formats Published In Date Comments
1 A Practical Auto-Tuning Framework for Storage Systems PDF BibTeX Stony Brook U. CS TechReport FSL-19-01 Jan 2019 Ph.D. Dissertation Defense
2 A Practical, Real-Time Auto-Tuning Framework for Storage Systems PDF BibTeX Stony Brook U. CS TechReport FSL-18-01 Apr 2018 Ph.D. Dissertation Proposal
3 Parametric Optimization of Storage Systems PDF BibTeX Stony Brook U. CS TechReport FSL-16-01 Jan 2016 Ph.D. Research Proficiency Exam (RPE)
4 Design and Implementation of an Open-Source Deduplication Platform for Research PDF BibTeX Stony Brook U. CS TechReport FSL-15-03 Dec 2015 Ph.D. Research Proficiency Exam (RPE)
5 A Context Aware Block Layer: The Case for Block Layer Deduplication PDF BibTeX Stony Brook U. CS TechReport FSL-12-04 May 2012 M.S. Thesis
6 A Nine Year Study of File System and Storage Benchmarking PS PDF BibTeX Stony Brook U. CS TechReport FSL-07-01 May 2007 Online data appendix
7 Versatile File System Tracing with Tracefs PS PDF BibTeX Stony Brook U. CS TechReport FSL-04-05 Aug 2004 M.S. Thesis

Current Students:

# Name Program Member Since
1 Umit Ibrahim Akgun PhD Sep 2017
2 Tyler Estro PhD May 2018

Past Students:

# Name Program Period Current Location
1 Ming Chen PhD May 2012 - Apr 2017 Software Engineer, Datadog (New York, NY)
2 Nikolai Joukov PhD Jan 2004 - Dec 2006 Research Staff Member, Storage and Data Services Research group, IBM T. J. Watson Research Center (Hawthorne, NY)
3 Sonam Mandal PhD Jun 2013 - Dec 2016 Staff Software Engineer, LinkedIn (Sunnyvale, CA)
4 Wei Su PhD May 2019 - Dec 2021 Performance and Capacity Engineer, Facebook (Menlo Park, CA)
5 Vasily Tarasov PhD Jan 2008 - Nov 2013 Research Staff Member, Scale-out Storage Software, IBM Research - Almaden (San Jose, USA)
6 Avishay Traeger PhD Sep 2003 - Aug 2008 Senior Principal Software Engineer, Red Hat (Raanana, Israel)
7 Charles P. Wright PhD May 2003 - May 2006 Partner, Senior Software Architect, Illumon (New York, NY)
8 Prafful Agarwal MS Jan 2018 - Dec 2019  
9 Akshat Aranya MS May 2003 - Aug 2004 Software Development Engineer III, AWS Elemental, Elemental Technologies (Portland, OR)
10 Akshay Aurora MS Jan 2019 - Dec 2019 Software Engineer, Databricks (San Francisco, CA)
11 Geetika Babu Bangera MS Jan 2017 - Dec 2017 Member Technical Staff, Software, NetApp, Inc. (Sunnyvale, CA)
12 Arvind Chaudhary MS Sep 2014 - Dec 2015 Member of Technical Staff, CNA group, VMware Inc. (Palo Alto, CA)
13 Udit Kaushik Chitalia MS May 2014 - May 2015 Software Engineer, Twitter (San Francisco, CA)
14 Tyler Estro MS Sep 2017 - Apr 2018 PhD candidate, Computer Science Department, Stony Brook University
15 Sujay Godbole MS Sep 2008 - Dec 2009 Member of Technical Staff, Core Storage Group (ESX), VMware Inc. (Cambridge, MA)
16 Darshan Godhia MS Jan 2017 - Dec 2017 Software Engineer, YouTube Engineering, Google (San Bruno, CA)
17 Shivanshu Goswami MS Aug 2016 - Dec 2017 TBA
18 Mayur Jadhav MS Jun 2017 - May 2018 GPGPU Machine Learning Engineer, Intel (Folsom, CA)
19 Pragesh Jagnani MS Jan 2019 - Dec 2019 Software Development Engineer, Amazon Selection and Catalog Systems, Amazon (Seattle, WA)
20 Deepak Jain MS Sep 2012 - Dec 2013 Member of Technical Staff, Project FVP - Engineering, Pernixdata Inc. (San Jose, USA)
21 Mehul Jain MS Jan 2019 - Dec 2019 Software Engineer 2, Twitter (San Francisco, CA)
22 Farhaan Jalia MS Jan 2017 - Dec 2017 Member of Technical Staff II, Cloud Native Group, VMware Inc. (Bellevue, WA)
23 Sagar Jeevan MS Jun 2019 - Dec 2019 Software Engineer II, Dell Technologies (Isilon) (Seattle, WA)
24 Aneesh Joshi MS Aug 2019 - May 2020 Member of Technical Staff, Core Data Path, Nutanix, Inc. (San Jose, CA)
25 Shobhit Khandelwal MS Jan 2019 - Dec 2019 Member of Technical Staff, Pure Storage (Mountain View, CA)
26 Koundinya Santhosh Kumar MS Sep 2010 - Dec 2011 Senior Development Software Engineer, Advanced Software Development and Performance, SanDisk (Milpitas, CA)
27 Noopur Anil Maheshwari MS Aug 2017 - Dec 2018 Software Engineer, HPE (Nimble Storage) (Sunnyvale, CA)
28 Manu Mathew MS Jan 2018 - Dec 2018 Software Engineer, SolidFire Element OS, NetApp (Raleigh, NC)
29 Amar Mudrankit MS Jan 2011 - May 2012 Software Engineer, Advanced Development Group at Fusion-IO (San Jose, CA)
30 Ritika Nevatia MS Sep 2018 - Dec 2019 Software Engineer, iCloud, Apple Inc. (Seattle, WA)
31 Dongju Ok MS Sep 2014 - May 2016 Software Engineer, Application Team, Commvault Systems Inc. (Tinton Falls, NJ)
32 Karthikeyani Palanisami MS May 2012 - Jun 2013 Member of Technical Staff, Project MARS - Engineering, NetApp Inc (Sunnyvale, USA)
33 Nidhi Panpalia MS Jan 2017 - Dec 2017 Development Engineer, AWS Lambda, Amazon (Seattle, WA)
34 Dhanashri Patil MS Jan 2018 - Dec 2018 Senior Software Engineer, Dell Technologies (Isilon) (Seattle, WA)
35 Dhivahar Perumal MS Sep 2018 - May 2019 Software Engineer, Data Services Team (CASL), Nimble Storage - HPE (San Jose, USA)
36 Vinothkumar Raja MS Sep 2016 - Dec 2017 Software Engineer, Pure Storage Inc. (Mountain View, CA)
37 Venkatakrishnan Rajagopalan MS Jan 2016 - Dec 2016 Member of the Technical Staff, VMware Inc. (Palo Alto, CA)
38 Vineeth Ramesh MS Jan 2018 - Dec 2018 Software Engineer, Dialpad (San Francisco, CA)
39 Rahul Rane MS Jan 2018 - Dec 2018 Software Engineer, HPE (Nimble Storage) (Sunnyvale, CA)
40 Shubhi Rani MS Sep 2015 - Dec 2016 Member of Technical Staff, VMware Inc. (Palo Alto, CA)
41 Nehil Shah MS Jan 2017 - Dec 2017 Software Development Engineer, Amazon AWS Infrastructure - Enterprise Networking, Amazon (Seattle, WA)
42 Krapi Ravindra Shah MS Jan 2019 - Dec 2019 Assistant VP, Data Platforms, Tradeweb Markets LLC. (Jersey City, NJ)
43 Rushabh Shah MS Jan 2017 - Dec 2017 Software Engineer, Facebook Inc. (Menlo Park, CA)
44 Mukul Sharma MS Aug 2016 - Dec 2017 Member of the Technical Staff, Core Data Path, Nutanix (San Jose, CA)
45 Varun Shastry MS Sep 2014 - Dec 2015 Member of Technical Staff, Disaster Recovery Team, Nutanix Inc. (San Jose, CA)
46 Siddesh Shinde MS Jan 2018 - Dec 2018 Member of Technical Staff, Core Data Path, Cohesity Inc (San Jose, CA)
47 Gyumin Sim MS Jan 2010 - Dec 2010 Software Engineer, Data Center Power Team, Google (Mountain View, CA)
48 Nilesh Somani MS May 2018 - Dec 2019 Senior Software Engineer, Storage Team, Robin Systems Inc. (San Jose, CA)
49 Jatin Sood MS Jan 2019 - Dec 2019  
50 Aayush Sureka MS Sep 2018 - Dec 2019 Principal Member of Technical Staff, Transactions group, Oracle Database (Redwood Shores, CA)
51 Sagar Trehan MS Sep 2012 - Dec 2013 Member of Technical Staff, CASL Performance Group - Engineering, Nimble Storage Inc (San Jose, USA)
52 Maryia Maskaliova BS/MS May 2017 - May 2018 Software Engineer, Android Performance
53 Leixiang Wu BS/MS Feb 2015 - May 2017 Software Development Engineer, Amazon Prime Video, Amazon (Seattle, WA)
54 Amrith Arunachalam BS May 2018 - Dec 2018  
55 Abraham Spitalny BS Jul 2019 - Dec 2019 Software Development Engineer I, Ads, Amazon (New York, NY)
56 Kevin Sun BS May 2017 - May 2018  
57 Tim Wong BS Dec 2004 - Jun 2005 Associate, Volatility Arbitrage, Global Asset Allocation, Applied Quantitative Research (Greenwich, CT)
58 Yinuo Zhang BS Aug 2019 - May 2020  
59 Henry Nelson HS Sep 2015 - Aug 2017 CS undergraduate at CMU

Sponsors:

# Sponsor Amount Period Type Title
1 NSF Computer and Network Systems (CNS) Core (Medium) $823,142 2019-2023 PI CNS Core: III: Medium: Collaborative Research: Optimizing and Understanding Large Parameter Spaces in Storage Systems
2 NSF Formal Methods in the Field (FMitF) $748,300 2019-2022 PI FMitF: Track I: NLP-Assisted Formal Verification of the NFS Distributed File System Protocol
3 NSF CISE Research Infrastructure (CRI) $129,867 2017-2020 PI Collaborative Research: CI-SUSTAIN: National File System Trace Repository
4 Microsoft Corporation $20,000 2016-2017 Sole-PI Microsoft Azure Cloud Credits
5 NSF Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA) $444,267 2013-2016 Lead-PI BIGDATA: Small: DCM: Collaborative Research: An efficient, versatile, scalable, and portable storage system for scientific data containers
6 NetApp Advanced Technology Group $40,000 2011 Sole PI Dedup Workload Modeling, Synthetic Datasets, and Scalable Benchmarking
7 NSF HECURA $760,253 2006-2009 Lead-PI File System Tracing, Replaying, Profiling, and Analysis on HEC Systems
8 NSF Trusted Computing (TC) $400,000 2003-2006 Sole PI A Layered Approach to Securing Network File Systems