MESTRE, D. G.; http://lattes.cnpq.br/7207365006541712; MESTRE, Demetrio Gomes.
Abstract:
Entity Matching (EM), i.e., the task of identifying all entities that refer to the same real-world object, is an important and difficult task for the integration and cleansing of data sources. A major obstacle to performing this task in the Big Data era is its quadratic complexity, since in principle every pair of entities must be compared.
To minimize the workload and still maintain high levels of matching
quality, for either single or multiple data sources, indexing (blocking) methods were
proposed. Such methods work by partitioning the input data into blocks of similar entities,
according to an entity attribute or a combination of attributes, commonly called the “blocking key”,
and restricting the EM process to entities that share the same blocking key (i.e., belong to
the same block). Although they considerably reduce the number of comparisons executed, indexing methods can still generate a large number of comparisons, depending on the size of the data sources involved and/or the number of entities per index (or block). Thus, to further reduce the execution time, the EM task can be performed in parallel using programming models such as MapReduce and Spark. However, the effectiveness and scalability of MapReduce- and Spark-based implementations of data-intensive tasks depend on how data is assigned from map to reduce tasks, in the case of MapReduce, and between transformation operations, in the case of Spark. A robust assignment strategy is crucial for handling skewed data (large sets of data can cause memory bottlenecks) and for distributing the workload evenly among all nodes of the distributed infrastructure. According to the literature, the efficient execution of adaptive indexing EM methods in parallel computing environments, in batch or real-time mode, remains an open problem. This work therefore proposes a set of parallel approaches that efficiently execute adaptive indexing EM methods using MapReduce and Spark in batch or real-time mode. The proposed approaches are compared with state-of-the-art ones in terms of performance, using real cluster infrastructures and data sources. The results obtained so far provide evidence that the proposed approaches significantly improve performance, reducing the overall runtime while preserving the quality of the detection of similar entities.
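
The blocking idea described above can be illustrated with a minimal, single-machine sketch. It is not the parallel approach proposed in this work: the key function (`blocking_key`, the first three letters of a `name` field), the helper names, and the sample records are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(entity):
    # Hypothetical blocking key: first three letters of the name, lowercased.
    return entity["name"][:3].lower()


def build_blocks(entities):
    # Partition the input into blocks of entities that share a blocking key.
    blocks = defaultdict(list)
    for entity in entities:
        blocks[blocking_key(entity)].append(entity)
    return blocks


def candidate_pairs(blocks):
    # Only entities inside the same block are compared with each other.
    for members in blocks.values():
        yield from combinations(members, 2)


# Illustrative sample data (not taken from the work's data sources).
entities = [
    {"id": 1, "name": "Samsung Galaxy S3"},
    {"id": 2, "name": "samsung galaxy s III"},
    {"id": 3, "name": "Apple iPhone 5"},
    {"id": 4, "name": "apple iphone five"},
    {"id": 5, "name": "Nokia Lumia 920"},
]

blocks = build_blocks(entities)
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]

# Number of comparisons without blocking: n * (n - 1) / 2.
all_pairs = len(entities) * (len(entities) - 1) // 2

print(f"comparisons with blocking: {len(pairs)} of {all_pairs}")
```

In this sketch a block with k entities still produces k * (k - 1) / 2 comparisons, so a single oversized block can dominate the workload. That is the skew that a parallel assignment strategy has to handle.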