NASCIMENTO FILHO, D. C.; http://lattes.cnpq.br/3151296501932443; NASCIMENTO FILHO, Dimas Cassimiro do.
Resumo:
In the era of Big Data, in which the scale of the data provides many challenges for classical
algorithms, the task of assessing the quality of datasets may become costly and complex.
For this reason, business managers may opt to outsource the data quality monitoring for a
specific cloud service for this purpose. In this context, this work proposes approaches for
reducing the costs generated from solutions for the data deduplication problem, which aims
to detect duplicate entities in datasets, in the context of a service for data quality monitoring. This work investigates the deduplication task due to its importance in a variety of contexts and its high complexity. We propose a high-level architecture of a service for data quality monitoring, which employs provisioning algorithms that use heuristics and machine learning techniques. Furthermore, we propose approaches for the adoption of incremental data quality algorithms and heuristics for controlling the size of the blocks produced in the indexing phase of the investigated problem. Four different experiments have been conducted to evaluate the effectiveness of the proposed provisioning algorithms, the heuristics for incremental record linkage and the heuristics to control block sizes for entity resolution. The results of the experiments show a range of options covering different tradeoffs, which involves: infrastructure costs of the service and the amount of SLA violations over time. In turn, the empirical evaluation of the proposed heuristics for incremental record linkage also presented a number of patterns in the results, which involves tradeoffs between the runtime of the heuristics and the obtained efficacy results. Lastly, the evaluation of the heuristics proposed to control block sizes have presented a large number of tradeoffs regarding execution time, amount of pruning approaches and the obtained efficacy results. Besides, the efficiency results of these heuristics may vary significantly, depending of the adopted pruning strategy. The results from the conducted experiments support the fact that different approaches (associated with cloud computing provisioning and the employed data quality algorithms) adopted by a data quality service may produce significant influence over the generated service costs, and thus, the final costs forwarded to the service customers.