Frequency based chunking for data de-duplication

Guanlin Lu, Yu Jin, David H.C. Du

Research output: Chapter in Book/Report/Conference proceedingConference contribution

51 Scopus citations

Abstract

A predominant portion of Internet services, like content delivery networks, news broadcasting, blogs sharing and social networks, etc., is data centric. A significant amount of new data is generated by these services each day. To efficiently store and maintain backups for such data is a challenging task for current data storage systems. Chunking based deduplication (dedup) methods are widely used to eliminate redundant data and hence reduce the required total storage space. In this paper, we propose a novel Frequency Based Chunking (FBC) algorithm. Unlike the most popular Content-Defined Chunking (CDC) algorithm which divides the data stream randomly according to the content, FBC explicitly utilizes the chunk frequency information in the data stream to enhance the data deduplication gain especially when the metadata overhead is taken into consideration. The FBC algorith consists of two components, a statistical chunk frequency estimation algorithm for identifying the globally appeared frequent chunks, and a two-stage chunking algorithm which uses these chunk frequencies to obtain a better chunking result. To evaluate the effectiveness of the proposed FBC algorithm, we conducted extensive experiments on heterogeneous datasets. In all experiments, the FBC algorithm persistently outperforms the CDC algorithm in terms of achieving a better dedup gain or producing much less number of chunks. Particularly, our experiments show that FBC produces 2.5 ∼ 4 times less number of chunks than that of a baseline CDC which achieving the same Duplicate Elimination Ratio (DER). Another benefit of FBC over CDC is that the FBC with average chunk size greater than or equal to that of CDC achieves up to 50% higher DER than that of a CDC algorithm.

Original languageEnglish (US)
Title of host publicationProceedings - 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2010
Pages287-296
Number of pages10
DOIs
StatePublished - 2010
Event18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2010 - Miami Beach, FL, United States
Duration: Aug 17 2010Aug 19 2010

Publication series

NameProceedings - 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2010

Other

Other18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2010
CountryUnited States
CityMiami Beach, FL
Period8/17/108/19/10

Keywords

  • Bloom filter
  • Content-defined Chunking
  • Data Deduplication
  • Frequency based Chunking
  • Statistical chunk frequency estimation

Fingerprint Dive into the research topics of 'Frequency based chunking for data de-duplication'. Together they form a unique fingerprint.

Cite this