Araújo, T. B.; http://lattes.cnpq.br/1503278831971137; ARAÚJO, Tiago Brasileiro.
Abstract:
The Entity Resolution (ER) task emerges as a fundamental step to integrate multiple knowl- edge bases or identify similarities between data (entities). To avoid the quadratic cost of the Entity Resolution task, blocking (or indexing) techniques are widely applied as a prepro- cessing step. In this context, semistructured data and large data sources (Big Data) emerge as the major challenges faced by blocking techniques. Regarding semistructured data, the challenge is related to the fact that such data do not share the same scheme, difficulting the application of traditional blocking techniques. In this context, schema-agnostic blocking techniques are applied. For Big Data scenarios, blocking techniques and distributed com- puting should be applied to improve the efficiency of the RE task. In this sense, this work proposes a distributed execution model for blocking semistructured data in the context of large data sources, capable of dealing with different needs of application profiles faced by the ER task. These application profiles are related to the needs and characteristics inherent to each application, such as how the data are managed (i.e., batch or streaming), data quality and prioritization of effectiveness/efficiency. Furthermore, the present work also proposes new blocking techniques that can be integrated into the proposed model. Such blocking techniques address open challenges in the literature, such as parallel blocking techniques, incremental processing, and streaming data blocking. The blocking techniques proposed in this work were evaluated experimentally with the objective of measuring efficiency and effectiveness against the state-of-the-art ones, using real data sources. Based on the experi- mental results, it is possible to highlight that the novel blocking techniques presented better results when compared to the state-of-the-art blocking techniques. Therefore, the proposed techniques can be hosted to the proposed execution model, so that they can address different necessities inherent to the application profiles.