Schema Matching using Duplicates

Authors: 
Bilke, A.; Naumann, F.
Year: 
2005
Venue: 
ICDE, 2005
URL: 
http://www.dit.unitn.it/~p2p/RelatedWork/Matching/dublicatesICDE05.pdf
Citations: 
154
Citations range: 
100 - 499
Attachment: 
Bilke2005SchemaMatchingusing.pdf (139.12 KB)

Most data integration applications require a matching
between the schemas of the respective data sets. We show
how the existence of duplicates within these data sets can be
exploited to automatically identify matching attributes. We
describe an algorithm that first discovers duplicates among
data sets with unaligned schemas and then uses these duplicates
to perform schema matching between schemas with
opaque column names.

Discovering duplicates among data sets with unaligned
schemas is more difficult than in the usual setting, because
it is not clear which fields in one object should be compared
with which fields in the other. We have developed a new
algorithm that efficiently finds the most likely duplicates in
such a setting. Our schema matching algorithm then identifies
corresponding attributes by comparing data values
within those duplicate records. An experimental study
on real-world data shows the effectiveness of this approach.
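
The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the similarity measure (Python's difflib), the brute-force pairwise comparison, the greedy column assignment, and all function names and sample records below are assumptions chosen only to make the idea concrete. Step one scores record pairs without knowing which fields correspond; step two compares values column by column within the pairs found.

```python
from difflib import SequenceMatcher
from itertools import product


def sim(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for any field-level measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tuple_similarity(r, s):
    """Schema-agnostic record similarity (illustrative assumption):
    pair each field of r with its best-matching field of s, then average."""
    return sum(max(sim(x, y) for y in s) for x in r) / len(r)


def find_duplicates(table_a, table_b, k=2):
    """Return the k most similar (i, j) record pairs across the two tables.
    Brute force over all pairs; fine for a toy example only."""
    scored = [
        (tuple_similarity(r, s), i, j)
        for (i, r), (j, s) in product(enumerate(table_a), enumerate(table_b))
    ]
    scored.sort(reverse=True)
    return [(i, j) for _, i, j in scored[:k]]


def match_attributes(table_a, table_b, duplicates):
    """For each column of table_a, pick the column of table_b whose values
    agree best on average over the duplicate pairs (greedy, not one-to-one)."""
    n_a, n_b = len(table_a[0]), len(table_b[0])
    matching = {}
    for ca in range(n_a):
        avg = [
            sum(sim(table_a[i][ca], table_b[j][cb]) for i, j in duplicates)
            / len(duplicates)
            for cb in range(n_b)
        ]
        matching[ca] = max(range(n_b), key=avg.__getitem__)
    return matching


# Hypothetical sample data: the same people appear in both tables,
# but the columns are ordered and formatted differently.
table_a = [
    ("John Smith", "Berlin", "555-1234"),
    ("Mary Jones", "Hamburg", "555-9876"),
    ("Peter Brown", "Munich", "555-4444"),
]
table_b = [
    ("555-9876", "Jones, Mary", "Hamburg"),
    ("555-1234", "Smith, John", "Berlin"),
    ("555-0000", "Alice Green", "Cologne"),
]

duplicates = find_duplicates(table_a, table_b, k=2)
matching = match_attributes(table_a, table_b, duplicates)
print(duplicates, matching)
```

In this toy setting the column names play no role at all: the mapping from columns of table_a to columns of table_b is derived purely from the values shared by the duplicate records, which is the situation the abstract calls opaque column names. A practical system would need a scalable duplicate search and a globally consistent (for example one-to-one) assignment rather than the greedy choice used here.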