Semantic data de-duplication for archival storage systems

Liu Chuanyi, Ju Dapeng, Gu Yu, Zhang Youhui, Wang Dongsheng, David H Du

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

8 Scopus citations

Abstract

In archival storage systems, there is a huge amount of duplicate or redundant data, which occupies significant extra equipment and power, substantially lowers resource utilization (such as network bandwidth and storage), and imposes an extra management burden as the scale increases. Data de-duplication, the goal of which is to minimize duplicate data at the inter-file level, has therefore received broad attention in both academia and industry in recent years. In this paper, Semantic Data De-duplication (SDD) is proposed, which uses the semantic information available in the I/O path of archival files (such as file type, file format, application hints, and filesystem metadata) to direct the division of a file into Semantic Chunks (SCs). While the main goal of SDD is to maximally reduce inter-file duplication, directly storing variable-sized SCs on disk would cause heavy fragmentation and a high percentage of random disk accesses, which is very inefficient. An efficient data storage scheme is therefore also designed and implemented: SCs are further packaged into fixed-size Objects, which are the actual storage units on the storage devices, to speed up I/O performance and ease data management. Preliminary experiments demonstrate that SDD further reduces storage space compared with current methods (by 20% to nearly 50%, depending on the dataset) and substantially improves write performance (by about 50%-70% on average).
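The general idea in the abstract (split a file at semantically meaningful boundaries, store each distinct chunk once, and pack chunks into fixed-size objects) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the abstract does not specify the chunking rules, the fingerprint function, or the object layout. The names `semantic_chunks`, `ObjectStore`, and `OBJECT_SIZE`, the use of SHA-256, and the assumption that chunk boundaries arrive as byte offsets from some format-aware parser are all hypothetical choices made for illustration.

```python
import hashlib

OBJECT_SIZE = 64  # hypothetical object capacity in bytes (tiny, for illustration)


def semantic_chunks(data, boundaries):
    """Split data at byte offsets assumed to come from a format-aware parser."""
    inner = sorted(b for b in set(boundaries) if 0 < b < len(data))
    cuts = [0] + inner + [len(data)]
    return [data[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]


class ObjectStore:
    """Stores each distinct chunk once, packed into bounded-size objects."""

    def __init__(self, object_size=OBJECT_SIZE):
        self.object_size = object_size
        self.objects = [bytearray()]  # containers of at most object_size bytes
        self.index = {}               # fingerprint -> (object_id, offset, length)

    def put_chunk(self, chunk):
        fp = hashlib.sha256(chunk).hexdigest()
        if fp in self.index:
            return fp  # duplicate chunk: nothing new is written
        if len(chunk) > self.object_size:
            # Assumption: oversized chunks are simply rejected in this sketch.
            raise ValueError("chunk larger than object size")
        if len(self.objects[-1]) + len(chunk) > self.object_size:
            self.objects.append(bytearray())  # current object is full
        obj = self.objects[-1]
        self.index[fp] = (len(self.objects) - 1, len(obj), len(chunk))
        obj += chunk  # append sequentially inside the object
        return fp

    def store_file(self, data, boundaries):
        """Return the file's recipe: the ordered list of chunk fingerprints."""
        return [self.put_chunk(c) for c in semantic_chunks(data, boundaries)]

    def read_file(self, recipe):
        out = bytearray()
        for fp in recipe:
            obj_id, off, length = self.index[fp]
            out += self.objects[obj_id][off:off + length]
        return bytes(out)
```

In this sketch, two files that share a header and trailer but differ in their payload would share two of their three chunks, and new chunks are always appended to the current object, so writes stay sequential rather than scattering small variable-sized chunks across the disk. Padding objects to an exact fixed size is omitted here.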

Original language: English (US)
Title of host publication: 13th IEEE Asia-Pacific Computer Systems Architecture Conference, ACSAC 2008
State: Published - 2008
Event: 13th IEEE Asia-Pacific Computer Systems Architecture Conference, ACSAC 2008 - Hsinchu, Taiwan, Province of China
Duration: Aug 4 2008 to Aug 6 2008

