File System and Storage Benchmarking Tools and Techniques

Benchmarking is critical when evaluating performance, but it is especially difficult for file and storage systems. Complex interactions between I/O devices, caches, kernel daemons, and other OS components result in behavior that is difficult to analyze. Moreover, systems have different features and optimizations, so no single benchmark is always suitable. The large variety of workloads that these systems experience in the real world also adds to this difficulty.

We have found that some of the most commonly used benchmarks are flawed, and many research papers do not provide a clear enough picture of file system performance. We believe that a good performance evaluation should use micro-benchmarks to highlight both the good and bad qualities of a file system, as well as general-purpose benchmarks or traces to indicate how it would perform under expected and realistic workloads. Nevertheless, care should be taken to ensure that general-purpose benchmarks accurately reflect real-world workloads. In addition, benchmarks should scale well, and results should be reproducible and comparable across papers.

In this project, we survey file system benchmarks used in many recent research papers. We found that no single benchmark adequately measures file system performance. We show how some commonly accepted and widely used benchmarks and benchmarking techniques can easily conceal overheads, unfairly over-emphasize overheads, or more generally emphasize or de-emphasize many of a file system's properties. We offer suggestions on how to create and conduct benchmarks so that they provide a fairer and more accurate picture of file system performance.
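
As a rough illustration of the kind of repeatable micro-benchmark discussed above, the following Python sketch times a sequential write several times and reports the spread across runs rather than a single number. It is a minimal sketch under stated assumptions, not one of the project's tools: the function names (time_sequential_write, run_trials), the default sizes, and the choice to include fsync in the timed region are all hypothetical choices made for this example.

```python
import os
import statistics
import tempfile
import time


def time_sequential_write(path, total_bytes=4 * 1024 * 1024, block_size=4096):
    """Write total_bytes to path in block_size chunks, fsync, return elapsed seconds.

    Hypothetical helper for illustration; sizes are arbitrary defaults.
    """
    block = b"\0" * block_size
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_bytes // block_size):
            f.write(block)
        f.flush()
        # Include the cost of reaching stable storage; without this, the
        # page cache can hide most of the write overhead.
        os.fsync(f.fileno())
    return time.perf_counter() - start


def run_trials(trials=5, **kwargs):
    """Repeat the micro-benchmark and summarize the elapsed times."""
    times = []
    with tempfile.TemporaryDirectory() as workdir:
        for i in range(trials):
            path = os.path.join(workdir, f"trial-{i}.dat")
            times.append(time_sequential_write(path, **kwargs))
            os.remove(path)  # start each trial from a fresh file
    mean = statistics.mean(times)
    stdev = statistics.stdev(times) if len(times) > 1 else 0.0
    return {
        "trials": trials,
        "mean_s": mean,
        "stdev_s": stdev,
        # Relative standard deviation makes runs easier to compare.
        "rel_stdev_pct": 100.0 * stdev / mean if mean else 0.0,
        "raw_s": times,
    }


if __name__ == "__main__":
    print(run_trials(trials=5))
```

Reporting the raw times and the standard deviation alongside the mean is one simple way to make results easier to reproduce and compare; a real evaluation would also need to control cache state, document the hardware and file system configuration, and run long enough to reach steady state.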

We describe our views on the future of file system benchmarking. To that end, we have been developing several technologies: fine-grained file system tracing, efficient file system replaying, automated file system benchmarking tools, and low-overhead detailed file system behavior visualization tools.

This project now expands into evaluating new multi-dimensional optimization techniques for storage systems, with Big Data sets as one key scientific case study; we often use large HDF5-based data sets (many gigabytes to a few terabytes).

Journal Articles:

#	Title	Formats	Published In	Date	Comments
1	Performance and Resource Utilization of FUSE User-Space File Systems	PDF BibTeX	ACM Transactions on Storage (TOS)	May 2019	FUSE Article Online Appendices
2	Cluster and Single-Node Analysis of Long-Term Deduplication Patterns	PDF BibTeX	ACM Transactions on Storage (TOS)	May 2018
3	vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O	PDF BibTeX	ACM Transactions on Storage (TOS)	Sep 2017
4	Filebench: A Flexible Framework for File System Benchmarking	BibTeX	;login: The USENIX Magazine	Mar 2016
5	Unifying Biological Image Formats with HDF5	BibTeX	Communications of the ACM	Oct 2009
6	Notes on a Nine Year Study of File System and Storage Benchmarking	BibTeX	Byte and Switch	Jul 2009
7	A Nine Year Study of File System and Storage Benchmarking	PS PDF BibTeX	ACM Transactions on Storage (TOS)	May 2008	Online data appendix

Conference and Workshop Papers:

#	Title	Formats	Published In	Date	Comments
1	Towards Better Understanding of Black-box Auto-Tuning: A Comparative Analysis for Storage Systems	PDF BibTeX	2018 USENIX Annual Technical Conference (ATC 2018)	Jul 2018	Data Set released as part of this paper.
2	On the Performance Variation in Modern Storage Stacks	PDF BibTeX	15th USENIX Conference on File and Storage Technologies (FAST 2017)	Feb 2017
3	To FUSE or Not to FUSE: Performance of User-Space File Systems	PDF BibTeX	15th USENIX Conference on File and Storage Technologies (FAST 2017)	Feb 2017	See sources related to paper
4	vNFS: Maximizing NFS Performance with Compounds and Vectorized I/O	PDF BibTeX	15th USENIX Conference on File and Storage Technologies (FAST 2017)	Feb 2017	Nominated for best paper award
5	A Long-Term User-Centric Analysis of Deduplication Patterns	PDF BibTeX	32nd IEEE Conference on Mass Storage Systems and Technologies (MSST 2016)	May 2016
6	Using Hints to Improve Inline Block-Layer Deduplication	PDF BibTeX	14th USENIX Conference on File and Storage Technologies (FAST 2016)	Feb 2016
7	Parametric Optimization of Storage Systems	PDF BibTeX	7th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 2015)	Jul 2015
8	Newer Is Sometimes Better: An Evaluation of NFSv4.1	PDF BibTeX	International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS 2015)	Jun 2015
9	Dmdedup: Device-mapper Deduplication Target	PDF BibTeX	2014 Ottawa Linux Symposium	Jul 2014
10	Virtual Machine Workloads: The Case for New Benchmarks for NAS	PDF BibTeX	11th USENIX Conference on File and Storage Technologies (FAST 2013)	Feb 2013
11	Generating Realistic Datasets for Deduplication Analysis	PS PDF BibTeX	2012 USENIX Annual Technical Conference (ATC 2012)	Jun 2012
12	Extracting Flexible, Replayable Models from Large Block Traces	PS PDF BibTeX	Tenth USENIX Conference on File and Storage Technologies (FAST 2012)	Feb 2012
13	Benchmarking File System Benchmarking: It IS Rocket Science	PS PDF BibTeX	13th USENIX Workshop on Hot Topics in Operating Systems (HotOS XIII)	May 2011
14	Accurate and Efficient Replaying of File System Traces	PS PDF BibTeX	Fourth USENIX Conference on File and Storage Technologies (FAST 2005)	Dec 2005
15	Auto-pilot: A Platform for System Software Benchmarking	PS PDF BibTeX	USENIX Technical Conference, FREENIX Track	Apr 2005
16	Tracefs: A File System to Trace Them All	PS PDF BibTeX	Third USENIX Conference on File and Storage Technologies (FAST 2004)	Apr 2004

Technical Reports:

#	Title	Formats	Published In	Date	Comments
1	A Practical Auto-Tuning Framework for Storage Systems	PDF BibTeX	Stony Brook U. CS TechReport FSL-19-01	Jan 2019	Ph.D. Dissertation Defense
2	A Practical, Real-Time Auto-Tuning Framework for Storage Systems	PDF BibTeX	Stony Brook U. CS TechReport FSL-18-01	Apr 2018	Ph.D. Dissertation Proposal
3	Parametric Optimization of Storage Systems	PDF BibTeX	Stony Brook U. CS TechReport FSL-16-01	Jan 2016	Ph.D. Research Proficiency Exam (RPE)
4	Design and Implementation of an Open-Source Deduplication Platform for Research	PDF BibTeX	Stony Brook U. CS TechReport FSL-15-03	Dec 2015	Ph.D. Research Proficiency Exam (RPE)
5	A Context Aware Block Layer: The Case for Block Layer Deduplication	PDF BibTeX	Stony Brook U. CS TechReport FSL-12-04	May 2012	M.S. Thesis
6	A Nine Year Study of File System and Storage Benchmarking	PS PDF BibTeX	Stony Brook U. CS TechReport FSL-07-01	May 2007	Online data appendix
7	Versatile File System Tracing with Tracefs	PS PDF BibTeX	Stony Brook U. CS TechReport FSL-04-05	Aug 2004	M.S. Thesis

Current Students:

#	Name	Program	Member Since
1	Tyler Estro	PhD	May 2018

Past Students:

#	Name	Program	Period	Current Location
1	Umit Ibrahim Akgun	PhD	Sep 2017 - Dec 2022
2	Ming Chen	PhD	May 2012 - Apr 2017	Software Engineer, Datadog (New York, NY)
3	Nikolai Joukov	PhD	Jan 2004 - Dec 2006	Research Staff Member, Storage and Data Services Research group, IBM T. J. Watson Research Center (Hawthorne, NY)
4	Sonam Mandal	PhD	Jun 2013 - Dec 2016	Staff Software Engineer, LinkedIn (Sunnyvale, CA)
5	Wei Su	PhD	May 2019 - Dec 2021	Performance and Capacity Engineer, Facebook (Menlo Park, CA)
6	Vasily Tarasov	PhD	Jan 2008 - Nov 2013	Research Staff Member, Scale-out Storage Software, IBM Research - Almaden (San Jose, USA)
7	Avishay Traeger	PhD	Sep 2003 - Aug 2008	Senior Principal Software Engineer, Red Hat (Raanana, Israel)
8	Charles P. Wright	PhD	May 2003 - May 2006	Partner, Senior Software Architect, Illumon (New York, NY)
9	Prafful Agarwal	MS	Jan 2018 - Dec 2019
10	Akshat Aranya	MS	May 2003 - Aug 2004	Software Development Engineer III, AWS Elemental, Elemental Technologies (Portland, OR)
11	Akshay Aurora	MS	Jan 2019 - Dec 2019	Software Engineer, Databricks (San Francisco, CA)
12	Geetika Babu Bangera	MS	Jan 2017 - Dec 2017	Member Technical Staff, Software, NetApp, Inc. (Sunnyvale, CA)
13	Arvind Chaudhary	MS	Sep 2014 - Dec 2015	Member of Technical Staff, CNA group, VMware Inc. (Palo Alto, CA)
14	Udit Kaushik Chitalia	MS	May 2014 - May 2015	Software Engineer, Twitter (San Francisco, CA)
15	Tyler Estro	MS	Sep 2017 - Apr 2018	PhD candidate, Stony Brook University, Computer Science Department
16	Sujay Godbole	MS	Sep 2008 - Dec 2009	Member of Technical Staff, Core Storage Group (ESX), VMware Inc. (Cambridge, MA)
17	Darshan Godhia	MS	Jan 2017 - Dec 2017	Software Engineer, YouTube Engineering, Google (San Bruno, CA)
18	Shivanshu Goswami	MS	Aug 2016 - Dec 2017	TBA
19	Mayur Jadhav	MS	Jun 2017 - May 2018	GPGPU Machine Learning Engineer, Intel (Folsom, CA)
20	Pragesh Jagnani	MS	Jan 2019 - Dec 2019	Software Development Engineer, Amazon Selection and Catalog Systems, Amazon (Seattle, WA)
21	Deepak Jain	MS	Sep 2012 - Dec 2013	Member of Technical Staff, Project FVP - Engineering, Pernixdata Inc. (San Jose, USA)
22	Mehul Jain	MS	Jan 2019 - Dec 2019	Software Engineer 2, Twitter (San Francisco, CA)
23	Farhaan Jalia	MS	Jan 2017 - Dec 2017	Member of Technical Staff II, Cloud Native Group, VMware Inc. (Bellevue, WA)
24	Sagar Jeevan	MS	Jun 2019 - Dec 2019	Software Engineer II, Dell Technologies (Isilon) (Seattle, WA)
25	Aneesh Joshi	MS	Aug 2019 - May 2020	Member of Technical Staff, Core Data Path, Nutanix, Inc. (San Jose, CA)
26	Shobhit Khandelwal	MS	Jan 2019 - Dec 2019	Member of Technical Staff, Pure Storage (Mountain View, CA)
27	Koundinya Santhosh Kumar	MS	Sep 2010 - Dec 2011	Senior Development Software Engineer, Advanced Software Development and Performance, SanDisk (Milpitas, CA)
28	Noopur Anil Maheshwari	MS	Aug 2017 - Dec 2018	Software Engineer, HPE (Nimble Storage) (Sunnyvale, CA)
29	Manu Mathew	MS	Jan 2018 - Dec 2018	Software Engineer, SolidFire Element OS, NetApp (Raleigh, NC)
30	Amar Mudrankit	MS	Jan 2011 - May 2012	Software Engineer, Advanced Development Group at Fusion-IO (San Jose, CA)
31	Ritika Nevatia	MS	Sep 2018 - Dec 2019	Software Engineer, iCloud, Apple Inc. (Seattle, WA)
32	Dongju Ok	MS	Sep 2014 - May 2016	Software Engineer, Application Team, Commvault Systems Inc. (Tinton Falls, NJ)
33	Karthikeyani Palanisami	MS	May 2012 - Jun 2013	Member of Technical Staff, Project MARS - Engineering, NetApp Inc (Sunnyvale, USA)
34	Nidhi Panpalia	MS	Jan 2017 - Dec 2017	Development Engineer, AWS Lambda, Amazon (Seattle, WA)
35	Dhanashri Patil	MS	Jan 2018 - Dec 2018	Senior Software Engineer, Dell Technologies (Isilon) (Seattle, WA)
36	Dhivahar Perumal	MS	Sep 2018 - May 2019	Software Engineer, Data Services Team (CASL), Nimble Storage - HPE (San Jose, USA)
37	Vinothkumar Raja	MS	Sep 2016 - Dec 2017	Software Engineer, Pure Storage Inc. (Mountain View, CA)
38	Venkatakrishnan Rajagopalan	MS	Jan 2016 - Dec 2016	Member of the Technical Staff, VMware Inc. (Palo Alto, CA)
39	Vineeth Ramesh	MS	Jan 2018 - Dec 2018	Software Engineer, Dialpad (San Francisco, CA)
40	Rahul Rane	MS	Jan 2018 - Dec 2018	Software Engineer, HPE (Nimble Storage) (Sunnyvale, CA)
41	Shubhi Rani	MS	Sep 2015 - Dec 2016	Member of Technical Staff, VMware Inc. (Palo Alto, CA)
42	Nehil Shah	MS	Jan 2017 - Dec 2017	Software Development Engineer, Amazon AWS Infrastructure - Enterprise Networking, Amazon (Seattle, WA)
43	Krapi Ravindra Shah	MS	Jan 2019 - Dec 2019	Assistant VP, Data Platforms, Tradeweb Markets LLC. (Jersey City, NJ)
44	Rushabh Shah	MS	Jan 2017 - Dec 2017	Software Engineer, Facebook Inc. (Menlo Park, CA)
45	Mukul Sharma	MS	Aug 2016 - Dec 2017	Member of the Technical Staff, Core Data Path, Nutanix (San Jose, CA)
46	Varun Shastry	MS	Sep 2014 - Dec 2015	Member of Technical Staff, Disaster Recovery Team, Nutanix Inc. (San Jose, CA)
47	Siddesh Shinde	MS	Jan 2018 - Dec 2018	Member of Technical Staff, Core Data Path, Cohesity Inc (San Jose, CA)
48	Gyumin Sim	MS	Jan 2010 - Dec 2010	Software Engineer, Data Center Power Team, Google (Mountain View, CA)
49	Nilesh Somani	MS	May 2018 - Dec 2019	Senior Software Engineer, Storage Team, Robin Systems Inc. (San Jose, CA)
50	Jatin Sood	MS	Jan 2019 - Dec 2019
51	Aayush Sureka	MS	Sep 2018 - Dec 2019	Principal Member of Technical Staff, Transactions group, Oracle Database (Redwood Shores, CA)
52	Sagar Trehan	MS	Sep 2012 - Dec 2013	Member of Technical Staff, CASL Performance Group - Engineering, Nimble Storage Inc (San Jose, USA)
53	Maryia Maskaliova	BS/MS	May 2017 - May 2018	Software Engineer, Android Performance
54	Leixiang Wu	BS/MS	Feb 2015 - May 2017	Software Development Engineer, Amazon Prime Video, Amazon (Seattle, WA)
55	Amrith Arunachalam	BS	May 2018 - Dec 2018
56	Abraham Spitalny	BS	Jul 2019 - Dec 2019	Software Development Engineer I, Ads, Amazon (New York, NY)
57	Kevin Sun	BS	May 2017 - May 2018
58	Tim Wong	BS	Dec 2004 - Jun 2005	Associate, Volatility Arbitrage, Global Asset Allocation, Applied Quantitative Research (Greenwich, CT)
59	Yinuo Zhang	BS	Aug 2019 - May 2020
60	Henry Nelson	HS	Sep 2015 - Aug 2017	CS undergraduate at CMU

Sponsors:

#	Sponsor	Amount	Period	Type	Title
1	NSF Computer and Network Systems (CNS) Core (Medium)	$823,142	2019-2025	Lead PI	CNS Core: III: Medium: Collaborative Research: Optimizing and Understanding Large Parameter Spaces in Storage Systems
2	NSF Formal Methods in the Field (FMitF)	$748,300	2019-2025	Lead PI	FMitF: Track I: NLP-Assisted Formal Verification of the NFS Distributed File System Protocol
3	NSF CISE Research Infrastructure (CRI)	$129,867	2017-2020	Lead PI	Collaborative Research: CI-SUSTAIN: National File System Trace Repository
4	Microsoft Corporation	$20,000	2016-2017	Sole PI	Microsoft Azure Cloud Credits
5	NSF Core Techniques and Technologies for Advancing Big Data Science & Engineering (BIGDATA)	$444,267	2013-2016	Lead PI	BIGDATA: Small: DCM: Collaborative Research: An efficient, versatile, scalable, and portable storage system for scientific data containers
6	NetApp Advanced Technology Group	$40,000	2011	Sole PI	Dedup Workload Modeling, Synthetic Datasets, and Scalable Benchmarking
7	NSF HECURA	$760,253	2006-2009	Lead PI	File System Tracing, Replaying, Profiling, and Analysis on HEC Systems
8	NSF Trusted Computing (TC)	$400,000	2003-2006	Sole PI	A Layered Approach to Securing Network File Systems