Learning to Match the Schemas of Data Sources: A Multistrategy Approach

Authors: 
Doan, A.; Domingos, P.; Halevy, A.
Year: 
2003
Venue: 
VLDB 2003
URL: 
http://citeseer.ist.psu.edu/doan03learning.html
Citations: 
253
Citations range: 
100 - 499

The problem of integrating data from multiple data sources, either on the Internet or within enterprises, has received much attention in the database and AI communities. The focus has been on building data integration systems that provide a uniform query interface to the sources. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the query interface and the source schemas. Examples of mappings are "element location maps to address" and "price maps to listed-price". We propose a multistrategy learning approach to automatically find such mappings. The approach applies multiple learner modules, where each module exploits a different type of information either in the schemas of the sources or in their data, then combines the predictions of the modules using a meta-learner. Learner modules employ a variety of techniques, ranging from Naive Bayes and nearest-neighbor classification to entity recognition and information retrieval. We describe the LSD system, which employs this approach to find semantic mappings. To further improve matching accuracy, LSD exploits domain integrity constraints, user feedback, and nested structures in XML data. We test LSD experimentally on several real-world domains. The experiments validate the utility of multistrategy learning for data integration and show that LSD proposes semantic mappings with a high degree of accuracy.
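
The multistrategy idea in the abstract can be illustrated with a minimal sketch. It is not LSD's implementation: the class and function names (`NameLearner`, `ContentLearner`, `combine`, `match`), the toy training data, and the weights are all hypothetical. One learner looks only at element names, the other applies Naive Bayes to data values, and a fixed weighted average stands in for the meta-learner, which a real system would presumably learn from training data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Split text into lowercase words, digit runs, and '$' signs."""
    return re.findall(r"[a-z]+|\d+|\$", text.lower())


def normalize(scores):
    """Scale scores to sum to 1; fall back to uniform if all are zero."""
    total = sum(scores.values())
    if total == 0:
        return {k: 1.0 / len(scores) for k in scores}
    return {k: v / total for k, v in scores.items()}


class NameLearner:
    """Scores labels by Jaccard overlap between element-name tokens."""

    def fit(self, examples):
        self.vocab = defaultdict(set)
        for name, _values, label in examples:
            self.vocab[label].update(tokenize(name))
        return self

    def predict(self, name, _values):
        toks = set(tokenize(name))
        scores = {}
        for label, seen in self.vocab.items():
            union = toks | seen
            scores[label] = len(toks & seen) / len(union) if union else 0.0
        return normalize(scores)


class ContentLearner:
    """Naive Bayes over data-value tokens, with add-one smoothing."""

    def fit(self, examples):
        self.counts = defaultdict(Counter)
        self.docs = Counter()
        for _name, values, label in examples:
            self.docs[label] += 1
            for value in values:
                self.counts[label].update(tokenize(value))
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def predict(self, _name, values):
        toks = [t for value in values for t in tokenize(value)]
        n_docs = sum(self.docs.values())
        logp = {}
        for label, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            lp = math.log(self.docs[label] / n_docs)
            for t in toks:
                lp += math.log((counts[t] + 1) / denom)
            logp[label] = lp
        top = max(logp.values())
        return normalize({k: math.exp(v - top) for k, v in logp.items()})


def combine(predictions, weights):
    """Stand-in meta-learner: weighted average of learner confidences."""
    labels = predictions[0].keys()
    return normalize(
        {l: sum(w * p[l] for w, p in zip(weights, predictions)) for l in labels}
    )


def match(name, values, learners, weights):
    """Return the mediated-schema label with the highest combined score."""
    combined = combine([lr.predict(name, values) for lr in learners], weights)
    return max(combined, key=combined.get)


# Hypothetical training data: (source element name, sample values, label).
TRAINING = [
    ("location", ["Seattle, WA", "Portland, OR"], "address"),
    ("price", ["$250,000", "$189,500"], "listed-price"),
]
LEARNERS = [NameLearner().fit(TRAINING), ContentLearner().fit(TRAINING)]
WEIGHTS = [0.5, 0.5]  # assumed values; a real meta-learner would learn these
```

Calling `match("area", ["Tacoma, WA"], LEARNERS, WEIGHTS)` shows why combining learners helps: the name `area` shares no tokens with any training name, so the name learner is indifferent, and the content learner's evidence from the data values decides the mapping.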