PBCCF: Accelerated Deduplication by Prefetching Backup Content Correlated Fingerprints

Yaobin Qin, Xianbo Zhang, David J. Lilja

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Deduplication provides significant benefits for accelerating large-scale storage systems, particularly backup systems, by eliminating redundancy in streaming data. Given the extraordinary growth of data, modern deduplication backup systems must identify duplicate data effectively and efficiently while having only limited memory for fingerprint indexing. Based on our observations of an enterprise backup system, a newly created client has no historical backups, so a prefetching algorithm has no reference basis for effective fingerprint prefetching. A generic prefetching approach such as Progressive Sampling requires a large amount of memory to maintain prefetching performance. Through a study of a real-world dataset, we discovered that backup content correlation exists among the backups from some different clients. We propose a fingerprint prefetching algorithm, prefetching backup content correlated fingerprints (PBCCF), that improves prefetching performance by applying lightweight machine learning and statistical techniques to discover backup patterns and generalize their features using only high-level metadata. The experimental results show that PBCCF succeeds at identifying highly correlated backups and fingerprints, maintaining a good deduplication rate while saving significant memory compared to Progressive Sampling.
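The general idea of prefetching fingerprints from a correlated backup can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the learning or statistical techniques used, so a simple Jaccard overlap on metadata tags stands in for them here. The function names, the tag format, and the toy data are all hypothetical.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Return a content fingerprint for a chunk (SHA-256 is an assumption)."""
    return hashlib.sha256(chunk).hexdigest()


def jaccard(a, b) -> float:
    """Jaccard similarity between two collections of metadata tags."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_prefetch_source(new_meta, past_backups):
    """Pick the past backup whose metadata is most similar to the new one.

    Stand-in for the paper's correlation discovery, which is not described
    in the abstract.
    """
    return max(past_backups, key=lambda b: jaccard(new_meta, b["meta"]))


def dedup_with_prefetch(chunks, prefetched):
    """Count chunks recognized as duplicates using an in-memory cache.

    The cache starts with the prefetched fingerprints and grows as new
    chunks are seen.
    """
    cache = set(prefetched)
    duplicates = 0
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp in cache:
            duplicates += 1
        else:
            cache.add(fp)
    return duplicates


# Toy data: two past backups from other clients (hypothetical tags).
past_backups = [
    {"meta": ["os:linux", "app:mysql"],
     "fps": [fingerprint(c) for c in (b"a", b"b", b"c")]},
    {"meta": ["os:windows", "app:exchange"],
     "fps": [fingerprint(c) for c in (b"x", b"y")]},
]

# A new client with no history of its own.
new_meta = ["os:linux", "app:mysql", "site:east"]
new_chunks = [b"a", b"b", b"d", b"d"]

source = choose_prefetch_source(new_meta, past_backups)
with_prefetch = dedup_with_prefetch(new_chunks, source["fps"])
without_prefetch = dedup_with_prefetch(new_chunks, [])
```

In this toy run the new client has no history, so without prefetching only the repeated chunk inside its own stream is caught, while prefetching from the most similar past backup also catches the chunks it shares with that backup.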

Original language: English (US)
Title of host publication: Proceedings - 2020 IEEE 38th International Conference on Computer Design, ICCD 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 146-154
Number of pages: 9
ISBN (Electronic): 9781728197104
State: Published - Oct 2020
Event: 38th IEEE International Conference on Computer Design, ICCD 2020 - Hartford, United States
Duration: Oct 18 2020 to Oct 21 2020

Publication series

Name: Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors
Volume: 2020-October
ISSN (Print): 1063-6404

Conference

Conference: 38th IEEE International Conference on Computer Design, ICCD 2020
Country: United States
City: Hartford
Period: 10/18/20 to 10/21/20

Bibliographical note

Funding Information:
This work was supported in part by the Center for Research in Intelligent Storage (CRIS), which is supported by National Science Foundation grant no. IIP-1439622 and member companies. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. The project would not be possible without the invaluable assistance of Mark DuChene and Raymond Gilson in collecting the data. Finally, we would like to thank all other members of Veritas’ scrum teams for their significant feedback to the project.

Publisher Copyright:
© 2020 IEEE.

Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.

Keywords

  • Backup systems
  • Deduplication
  • Fingerprint prefetching
  • Machine learning
  • Partial indexing
