TY - GEN
T1 - A heterogeneous field matching method for record linkage
AU - Minton, Steven N.
AU - Nanjo, Claude
AU - Knoblock, Craig A.
AU - Michalowski, Martin
AU - Michelson, Matthew
PY - 2005
Y1 - 2005
N2 - Record linkage is the process of determining that two records refer to the same entity. A key subprocess is evaluating how well the individual fields, or attributes, of the records match each other. One approach to matching fields is to use hand-written domain-specific rules. This "expert systems" approach may result in good performance for specific applications, but it is not scalable. This paper describes a new machine learning approach that creates expert-like rules for field matching. In our approach, the relationship between two field values is described by a set of heterogeneous transformations. Previous machine learning methods used simple models to evaluate the distance between two fields. However, our approach enables more sophisticated relationships to be modeled, which better capture the complex, domain-specific, common-sense phenomena that humans use to judge similarity. We compare our approach to methods that rely on simpler homogeneous models in several domains. By modeling more complex relationships, we produce more accurate results.
AB - Record linkage is the process of determining that two records refer to the same entity. A key subprocess is evaluating how well the individual fields, or attributes, of the records match each other. One approach to matching fields is to use hand-written domain-specific rules. This "expert systems" approach may result in good performance for specific applications, but it is not scalable. This paper describes a new machine learning approach that creates expert-like rules for field matching. In our approach, the relationship between two field values is described by a set of heterogeneous transformations. Previous machine learning methods used simple models to evaluate the distance between two fields. However, our approach enables more sophisticated relationships to be modeled, which better capture the complex, domain-specific, common-sense phenomena that humans use to judge similarity. We compare our approach to methods that rely on simpler homogeneous models in several domains. By modeling more complex relationships, we produce more accurate results.
UR - http://www.scopus.com/inward/record.url?scp=33750728576&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33750728576&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2005.7
DO - 10.1109/ICDM.2005.7
M3 - Conference contribution
AN - SCOPUS:33750728576
SN - 0769522785
SN - 9780769522784
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 314
EP - 321
BT - Proceedings - Fifth IEEE International Conference on Data Mining, ICDM 2005
T2 - 5th IEEE International Conference on Data Mining, ICDM 2005
Y2 - 27 November 2005 through 30 November 2005
ER -