In this project we are developing tools and techniques to produce realistic, scalable, dedupable data sets that take actual workloads into account. We are analyzing the deduplication properties of several data sets to which we have access, and we are developing and releasing tools that let anyone analyze their own data sets without violating privacy. We are building models that capture the important inherent properties of those data sets, and we can generate synthetic data that follows these models, producing data sets far larger than the originals while faithfully modeling the original data.
Our preliminary prototype work is promising: we are developing tools that chunk and hash backup and online data sets and then extract key properties, such as the distributions of duplicate hashes. We are also building Markov models to represent the dedupability of backup data sets over time.
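To illustrate the chunk-and-hash analysis described above, the following is a minimal sketch, not the project's actual tooling: it uses fixed-size chunking (real deduplication tools often use content-defined chunking) and SHA-1 digests, then summarizes how many chunk hashes occur once, twice, and so on. The chunk size and sample data are illustrative assumptions.

```python
import hashlib
from collections import Counter

CHUNK_SIZE = 4096  # hypothetical fixed chunk size; real systems often chunk on content boundaries

def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and return each chunk's SHA-1 digest."""
    return [hashlib.sha1(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def duplicate_distribution(hashes):
    """Count occurrences of each distinct chunk hash, then summarize
    how many distinct hashes appear once, twice, three times, etc."""
    per_hash_counts = Counter(hashes)
    return Counter(per_hash_counts.values())

# Toy data with heavy internal repetition: the first two 4096-byte chunks
# are identical, while the final (short) chunk is unique.
data = b"abcd" * 3000  # 12000 bytes
dist = duplicate_distribution(chunk_hashes(data))
# dist maps occurrence-count -> number of distinct hashes with that count
```

On this toy input, one distinct hash occurs twice and one occurs once, so roughly a third of the chunk volume is redundant; the same distribution, computed over real backup streams, is one of the properties a synthetic generator would need to reproduce.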