Future of File Systems

The future of file systems is almost here. Object stores, deduplication engines, and industry-proven advances in snapshotting, filing, and versioning are all pointing toward the next advance in how we think about file system data. We are approaching the point where the differences between the IP protocols and the various FS protocols will become less distinct. Currently there are two major ways of referring to file system data: (1) by segment name, and (2) by object/file name. Standards will be drafted to refer to data in these two ways, with various specializations, and the next era of file system storage will begin.

The IFSP (Inter-File-System Protocol) will be a variation on the IP packet, with an extended header, much along the same lines as TCP/IP and UDP. There will be two major kinds of IFSP packets: (1) object UID to segment UID mappings, and (2) segment UID to segment data mappings. These simple mapping relations will provide the fundamental building blocks for constructing a graph relationship between objects of indefinite size. Such objects are the basic building blocks of an object store, which in turn is the basic building block of almost all storage systems in use today, from FSes to source repositories, to deduplicating filers, to wide-area-network streaming compressors. The new standard (IFSP) will also be known, jokingly, as NFSv5 among those in the know. It will of course support a host of commonly used addressing mechanisms for data, such as (object UID, page offset) and (object UID, segment offset), to name just two.
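The two mapping relations above can be sketched as a tiny in-memory object store. This is only an illustration under stated assumptions, not a definition of IFSP: the fixed segment size, the use of a SHA-256 content hash as the segment UID, and the names ObjectStore, put, get, and read_segment are all hypothetical choices made for the example.

```python
import hashlib
from dataclasses import dataclass, field

SEGMENT_SIZE = 4  # assumption: tiny fixed-size segments, for illustration only


def segment_uid(data: bytes) -> str:
    """Assumption: a segment's UID is the SHA-256 hash of its contents."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class ObjectStore:
    # Mapping kind (1): object UID -> ordered list of segment UIDs
    object_map: dict = field(default_factory=dict)
    # Mapping kind (2): segment UID -> segment data
    segment_map: dict = field(default_factory=dict)

    def put(self, object_uid: str, data: bytes) -> None:
        """Split data into segments and record both mappings."""
        uids = []
        for start in range(0, len(data), SEGMENT_SIZE):
            chunk = data[start:start + SEGMENT_SIZE]
            uid = segment_uid(chunk)
            # Identical segments share one entry, which is deduplication.
            self.segment_map.setdefault(uid, chunk)
            uids.append(uid)
        self.object_map[object_uid] = uids

    def get(self, object_uid: str) -> bytes:
        """Reassemble an object by following both mappings."""
        return b"".join(self.segment_map[u] for u in self.object_map[object_uid])

    def read_segment(self, object_uid: str, segment_offset: int) -> bytes:
        """Address data by (object UID, segment offset)."""
        return self.segment_map[self.object_map[object_uid][segment_offset]]


store = ObjectStore()
store.put("report", b"abcdabcdxyz")
print(len(store.object_map["report"]), "segment references,",
      len(store.segment_map), "segments stored")
```

Because segments are keyed by content in this sketch, the repeated "abcd" segment is stored once even though the object refers to it twice.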

The new protocol will be adopted slowly as existing storage and data transfer mechanisms start to plug IFSP support into their applications. Some of the early adopters will include BitTorrent, Linux's NFS client/server, most hand-held synchronization protocols, git, and, surprisingly, Apple, which had been waiting for a standard way to store deduplicated data blocks in its Time Machine in the OS X Puma release (2013). Later adopters will include on-line backup websites like Mozy On-Line and Carbon Copy. Next will come Amazon S4 (successor to S3) and Microsoft's cloud storage products. The Linux community will roll out cxt, the caching-enabled extended file system, which will support only traditional FS performance and power workload curves but will be able to fetch known but uncached files on demand. Cxt3 will give users the ability to specify different kinds of storage policies and block formats for caching IFSP objects depending on the type of workload (web server, file server, mail server).

Eventually users will get used to taking leases on objects to keep them cached locally, rather than "saving files", which by then (2019) will seem as outdated as sending jobs to the computing facility by truck seems to us. Data and meta-data will be sitting cached on user nodes across the internet, and P2P services will tap those caches to achieve fast distributed downloads of common objects (files) and segments. Users will no longer have L1, L2, RAM, and Disk, but L1, L2, RAM, Disk, near pool, less near pool, far pool, very far pool, and cloud back-up. Interconnected services like Google's Gmail, Calendar, Latitude, and more will intercommunicate with objects cached on the user's local node using IFSP instead of elaborate AJAX protocols, browser plug-ins, and Gears. Most of these users will have their local Google caches (along with the caches of other remote services) automatically included in their streaming back-up subscription, with these cached IFSP objects transmitted to their back-up provider (not necessarily Google).
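The idea of taking a lease on an object to keep it cached locally can be sketched as follows. This is a minimal illustration under stated assumptions: the names LeaseCache, acquire, read, and evict_expired are hypothetical, the fetch callback stands in for whatever remote pool or service actually holds the object, and time is passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


class LeaseExpired(Exception):
    """Raised when an object is read without a valid lease."""


@dataclass
class Lease:
    object_uid: str
    expires_at: float


class LeaseCache:
    """Hypothetical local cache that holds objects only while leased."""

    def __init__(self, fetch):
        self._fetch = fetch   # callback: object UID -> bytes from a remote pool
        self._data = {}       # object UID -> locally cached bytes
        self._leases = {}     # object UID -> current Lease

    def acquire(self, object_uid, now, duration):
        """Take or renew a lease; fetch the object only if it is not cached."""
        if object_uid not in self._data:
            self._data[object_uid] = self._fetch(object_uid)
        lease = Lease(object_uid, now + duration)
        self._leases[object_uid] = lease
        return lease

    def read(self, object_uid, now):
        """Return the cached bytes if the lease is still valid."""
        lease = self._leases.get(object_uid)
        if lease is None or lease.expires_at <= now:
            raise LeaseExpired(object_uid)
        return self._data[object_uid]

    def evict_expired(self, now):
        """Drop objects whose leases have lapsed; return their UIDs."""
        expired = [uid for uid, l in self._leases.items() if l.expires_at <= now]
        for uid in expired:
            del self._leases[uid]
            self._data.pop(uid, None)
        return expired


cache = LeaseCache(fetch=lambda uid: b"contents of " + uid.encode())
cache.acquire("calendar", now=0.0, duration=60.0)
print(cache.read("calendar", now=30.0))
```

Renewing a lease before it lapses keeps the object local without another fetch; letting it lapse makes the space reclaimable, which is the contrast with explicitly saving a file.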

Users will expect different levels of data integrity, and existing logging and journaling mechanisms will not go away. Different local-node caches, as well as remote services, will provide different failure recovery features. Common work flows (data integrity, max performance, caching only, security, etc.) will be codified as storage configuration procedures that users can run to ensure that all links in the storage hierarchy, from RAM down to the back-up provider and all caches in between, will support the desired work flow using two-phase commit, different levels of isolation, consistency, and parity.
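As a rough sketch of how two-phase commit could span the links in such a hierarchy, the example below writes one object to several tiers and commits only if every tier first agrees to prepare. The Tier class, its healthy flag, and the function two_phase_write are hypothetical names invented for this illustration; a real system would also need durable logs and recovery after a coordinator failure, which are omitted here.

```python
class Tier:
    """Hypothetical storage tier (RAM cache, disk, back-up provider, ...)."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy   # assumption: an unhealthy tier votes "no"
        self.committed = {}      # key -> value visible to readers
        self._staged = {}        # transaction id -> (key, value) awaiting commit

    def prepare(self, txid, key, value):
        """Phase 1: stage the write and vote on whether it can be committed."""
        if not self.healthy:
            return False
        self._staged[txid] = (key, value)
        return True

    def commit(self, txid):
        """Phase 2: make the staged write visible."""
        key, value = self._staged.pop(txid)
        self.committed[key] = value

    def abort(self, txid):
        """Discard a staged write, if any."""
        self._staged.pop(txid, None)


def two_phase_write(tiers, txid, key, value):
    """Write to every tier or to none of them."""
    prepared = []
    for tier in tiers:
        if tier.prepare(txid, key, value):
            prepared.append(tier)
        else:
            # One "no" vote aborts the whole transaction.
            for p in prepared:
                p.abort(txid)
            return False
    for tier in tiers:
        tier.commit(txid)
    return True


tiers = [Tier("ram"), Tier("disk"), Tier("backup")]
print(two_phase_write(tiers, "tx1", "notes", b"v1"))
```

If any tier cannot prepare, none of the tiers exposes the new value, so the hierarchy never ends up with the RAM copy ahead of the back-up copy for a work flow that demands integrity.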

In summary, the future of file systems lies in standardizing the thing that all storage mechanisms have in common: the basic operators used for updating and accessing object stores over a bus (or network). Users will then install different client-side object store caches (FSes) and subscribe to different storage service providers for different workloads and work flows. Deduplication and wide-area-network compression will become standard, and everyone will be managing their data as if it were a distributed repository, with different local nodes being in sync, or needing to be synced. There will be strong incentives for existing similar systems (git, Gmail, Google Docs, NFS, dedup, backup, provenance logs, etc.) to get on the bandwagon and all speak IFSP. This will be demanded by users who want the increased interoperability, which will open the door to new work flows, decreased IT costs, decreased development costs, and new kinds of markets and service niches.

Last updated: Mon Oct 5 15:31:40 EDT 2009