Data warehousing is gaining in popularity as organizations realize the benefits of being able to perform sophisticated analyses of their data. Recent years have seen the introduction of a number of data-warehousing engines, from both established database vendors as well as new players. The engines themselves are relatively easy to use and come with a good set of end-user tools. However, there is one key stumbling block to the rapid development of data warehouses, namely that of warehouse population. Specifically, problems arise in populating a warehouse with existing data since it has various types of heterogeneity. Given the lack of good tools, this task has generally been performed by various system integrators, e.g., software consulting organizations which have developed in-house tools and processes for the task. The general conclusion is that the task has proven to be labor-intensive, error-prone, and generally frustrating, leading a number of warehousing projects to be abandoned mid-way through development. However, the picture is not as grim as it appears. The problems that are being encountered in warehouse creation are very similar to those encountered in data integration, and they have been studied for about two decades. However, not all problems relevant to warehouse creation have been solved, and a number of research issues remain. The principal goal of this paper is to identify the common issues in data integration and data-warehouse creation. We hope this will lead: 1) developers of warehouse creation tools to examine and, where appropriate, incorporate the techniques developed for data integration, and 2) researchers in both the data integration and the data warehousing communities to address the open research issues in this important area.
|Number of pages
|IEEE Transactions on Knowledge and Data Engineering
|Published - 1999
Bibliographical noteFunding Information:
This work is supported, in part, by the U.S. Department of Transportation through Grant No. USDOT/DTRS93-G-0017 to the University of Minnesota.
- Attribute value conflict
- Data integration
- Data mining
- Data warehouse
- Entity identification