PBCCF: Accelerated Deduplication by Prefetching Backup Content Correlated Fingerprints

Yaobin Qin, Xianbo Zhang, David J. Lilja

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Deduplication accelerates large-scale storage systems, particularly backup systems, by eliminating redundancy in streaming data. Given the extraordinary growth of data, modern deduplicating backup systems must identify duplicates effectively and efficiently with limited memory for fingerprint indexing. In our observation of an enterprise backup system, a newly created client has no historical backups, so the prefetching algorithm has no reference basis for effective fingerprint prefetching. Generic prefetching approaches such as Progressive Sampling require a large amount of memory to maintain prefetching performance. Studying a real-world dataset, we found that backup content is correlated among the backups of some different clients. We propose a fingerprint prefetching algorithm, prefetching backup content correlated fingerprints (PBCCF), which improves prefetching performance by applying lightweight machine learning and statistical techniques to discover backup patterns and generalize their features using only high-level metadata. The experimental results show that PBCCF identifies highly correlated backups and fingerprints, maintaining a good deduplication rate while saving significant memory compared to Progressive Sampling.

Original language: English (US)
Title of host publication: Proceedings - 2020 IEEE 38th International Conference on Computer Design, ICCD 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 146-154
Number of pages: 9
ISBN (Electronic): 9781728197104
State: Published - Oct 2020
Event: 38th IEEE International Conference on Computer Design, ICCD 2020 - Hartford, United States
Duration: Oct 18 2020 → Oct 21 2020

Publication series

Name: Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors
Volume: 2020-October
ISSN (Print): 1063-6404

Conference

Conference: 38th IEEE International Conference on Computer Design, ICCD 2020
Country/Territory: United States
City: Hartford
Period: 10/18/20 → 10/21/20

Bibliographical note

Publisher Copyright:
© 2020 IEEE.

Keywords

  • Backup systems
  • Deduplication
  • Fingerprint prefetching
  • Machine learning
  • Partial indexing
