Object Identification by Statistical Methods
published: Feb. 25, 2007, recorded: October 2004, views: 6139
Numerical data fusion, or the merging of overlapping data files, becomes a hard problem when the corresponding data sets share no globally unique identifying key. Typical examples are the linkage of address files supplied from different sources for commercial purposes (a money-making area), the merging of special offers in various media (cf. duplicate detection), and an administrative record census (ARC) as planned in Germany, where several autonomous, heterogeneous registers are to be merged. We present a three-step procedure: conversion of attributes, comparison of the values of a pair of objects, and classification (the 'matching problem') of each pair as either "same" ("matched") or "not same" ("not matched"). We pay special attention to the quality and the efficiency of the methodology. We briefly discuss questions such as correctness and completeness, as well as pre-selection techniques such as 'blocking' that reduce the computational complexity of pairwise comparisons. The approach is illustrated on carefully composed benchmark data sets. We assume some basic knowledge of computer science and classification (supervised learning).
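As a rough illustration of the three steps and of blocking, the following minimal Python sketch may help. It is not the method presented in the lecture: the field names (name, street, zip), the use of the postal code as blocking key, the string similarity from the standard difflib module, and the fixed threshold of 0.85 (a stand-in for a trained classifier) are all assumptions made for this example.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def convert(record):
    """Step 1 (conversion): lowercase every attribute and collapse whitespace."""
    return {key: " ".join(str(value).lower().split()) for key, value in record.items()}


def block_key(record):
    """Blocking: only records sharing this key are compared (assumed: postal code)."""
    return record["zip"]


def compare(a, b, fields=("name", "street")):
    """Step 2 (comparison): one similarity score in [0, 1] per attribute."""
    return [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]


def classify(scores, threshold=0.85):
    """Step 3 (classification): fixed-threshold rule, a stand-in for a trained classifier."""
    mean_score = sum(scores) / len(scores)
    return "matched" if mean_score >= threshold else "not matched"


def link(records):
    """Run all three steps, comparing pairs only within each block."""
    blocks = defaultdict(list)
    for index, record in enumerate(records):
        converted = convert(record)
        blocks[block_key(converted)].append((index, converted))

    decisions = {}
    for members in blocks.values():
        for (i, a), (j, b) in combinations(members, 2):
            decisions[(i, j)] = classify(compare(a, b))
    return decisions


# Hypothetical toy records, invented for this example.
records = [
    {"name": "Hans  Meier", "street": "Hauptstrasse 5", "zip": "10115"},
    {"name": "hans meier", "street": "Hauptstrasse 5", "zip": "10115"},
    {"name": "Petra Schulz", "street": "Gartenweg 12", "zip": "10115"},
    {"name": "Hans Meier", "street": "Hauptstrasse 5", "zip": "80331"},
]

decisions = link(records)
for pair, label in sorted(decisions.items()):
    print(pair, label)
```

With these toy records, blocking reduces the six possible pairs to the three pairs that share a postal code. The fourth record is never compared with the first, although the two look alike apart from the postal code, which hints at how blocking can buy efficiency at the cost of completeness.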