http://lattes.cnpq.br/2027297399918127; ARAÚJO, Diego Fernandes de.
Abstract:
Several Entity Resolution (ER) methods have been developed in both academia and industry over the years, with the aim of identifying duplicate entities (e.g., records) in datasets. To evaluate the efficacy of such methods, their results must be compared against a ground truth, a document containing all known duplicate record pairs in a dataset. In general, ground truths for real datasets are generated manually by inspecting all combinations of record pairs in a dataset. However, this process is error-prone and has quadratic complexity with respect to the size(s) of the dataset(s), so it takes a long time to perform. In this context, some works present (semi-)automatic approaches for generating ground truths for the ER task. However, such approaches are either not applicable to several domains or still require considerable manual effort. In this work, we propose GTGenERAL, a semi-automatic approach that combines the results of multiple ER algorithms with Active Learning to generate ground truths with reduced manual effort. Experiments using real datasets show that GTGenERAL is able to generate ground truths close to those generated by the state-of-the-art approach, while substantially reducing the manual effort undertaken in the process.