Deduplication provides significant benefits for accelerating large-scale storage systems, particularly backup systems, by eliminating redundancy in the streaming data. Given the extraordinary growth of data, modern deduplication backup systems must identify duplicate data effectively and efficiently while having only limited memory for fingerprint indexing. Our observations of an enterprise backup system show that a newly created client has no historical backups, so the prefetching algorithm has no reference basis for effective fingerprint prefetching. A generic prefetching approach such as Progressive Sampling requires a large amount of memory to maintain prefetching performance. In this paper, based on a study of a real-world dataset, we find that backup content correlation exists among backups from some different clients. We propose a fingerprint prefetching algorithm, prefetching backup content correlated fingerprint (PBCCF), which improves prefetching performance by applying lightweight machine learning and statistical techniques to discover backup patterns and generalize their features using only high-level metadata. The experimental results show that PBCCF succeeds at identifying highly correlated backups and fingerprints, maintaining a good deduplication rate while saving significant memory compared to Progressive Sampling.
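To make the idea of fingerprint prefetching concrete, the following is a minimal sketch, not the PBCCF algorithm itself. It assumes SHA-1 chunk fingerprints, a small LRU-bounded in-memory cache, and a correlated backup that has already been chosen by some other means; the names `FingerprintCache`, `fingerprint`, and `dedup_ratio` are hypothetical and do not come from the paper. It only illustrates why warming the cache with fingerprints from a correlated client's backup can help a new client that has no history of its own.

```python
import hashlib
from collections import OrderedDict
from typing import Iterable, List


def fingerprint(chunk: bytes) -> str:
    """Return the SHA-1 hex digest of a chunk (assumed fingerprint scheme)."""
    return hashlib.sha1(chunk).hexdigest()


class FingerprintCache:
    """Hypothetical bounded in-memory fingerprint cache with LRU eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[str, None]" = OrderedDict()

    def add(self, fp: str) -> None:
        """Insert a fingerprint, evicting the least recently used if full."""
        self._entries[fp] = None
        self._entries.move_to_end(fp)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)

    def prefetch(self, fingerprints: Iterable[str]) -> None:
        """Warm the cache with fingerprints from a correlated backup."""
        for fp in fingerprints:
            self.add(fp)

    def lookup(self, fp: str) -> bool:
        """Return True on a hit and mark the entry as recently used."""
        if fp in self._entries:
            self._entries.move_to_end(fp)
            return True
        return False


def dedup_ratio(chunks: List[bytes], cache: FingerprintCache) -> float:
    """Fraction of chunks identified as duplicates through the cache."""
    if not chunks:
        return 0.0
    hits = 0
    for chunk in chunks:
        fp = fingerprint(chunk)
        if cache.lookup(fp):
            hits += 1
        else:
            cache.add(fp)  # new chunk: remember its fingerprint
    return hits / len(chunks)


# Toy data: the new client shares two chunks with a correlated client.
correlated_backup = [b"os-image", b"shared-lib", b"config"]
new_client_backup = [b"os-image", b"shared-lib", b"user-data", b"user-data"]

cold = FingerprintCache(capacity=8)   # no history, nothing prefetched
warm = FingerprintCache(capacity=8)   # prefetched from the correlated backup
warm.prefetch(fingerprint(c) for c in correlated_backup)

cold_ratio = dedup_ratio(new_client_backup, cold)
warm_ratio = dedup_ratio(new_client_backup, warm)
print(cold_ratio, warm_ratio)
```

In this toy setup the cold cache only detects the repeated chunk within the new backup, while the warmed cache also detects the chunks shared with the correlated client. How correlated backups are actually selected from metadata is the subject of the paper and is not modeled here.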
|Original language||English (US)|
|Title of host publication||Proceedings - 2020 IEEE 38th International Conference on Computer Design, ICCD 2020|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||9|
|State||Published - Oct 2020|
|Event||38th IEEE International Conference on Computer Design, ICCD 2020 - Hartford, United States (Duration: Oct 18 2020 → Oct 21 2020)|
|Name||Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors|
|Conference||38th IEEE International Conference on Computer Design, ICCD 2020|
|Period||10/18/20 → 10/21/20|
Bibliographical note
Funding Information:
This work was supported in part by the Center for Research in Intelligent Storage (CRIS), which is supported by National Science Foundation grant no. IIP-1439622 and member companies. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. The project would not be possible without the invaluable assistance of Mark DuChene and Raymond Gilson in collecting the data. Finally, we would like to thank all other members of Veritas’ scrum teams for their significant feedback to the project.
© 2020 IEEE.
Keywords
- Backup systems
- Fingerprint prefetching
- Machine learning
- Partial indexing