SARMENTO NETO, G. A.; http://lattes.cnpq.br/1872447954071124; SARMENTO NETO, Geraldo Abrantes.
Abstract:
The generation of large amounts of data, also known as Big Data, is becoming very common
in both academic and enterprise environments. In that context, it is essential that applications
responsible for processing Big Data exploit the high-performance distributed infrastructures
(such as clusters) commonly present in those environments, by deploying such applications on
data-intensive scalable computing (DISC) systems such as the popular Hadoop. Regarding the
configuration of that platform, there is a considerable number of parameters to be adjusted,
and users often do not know how to set them, resulting in a poorly configured Hadoop that
performs below its real potential.
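
To make the configuration problem concrete, the sketch below lists a small, hypothetical
subspace of real Hadoop/MapReduce parameters; the candidate values are arbitrary illustrations,
not settings taken from this work, and the dissertation's actual parameter selection may differ.

```python
from itertools import product

# Minimal sketch (not taken from the dissertation): a tiny subspace of real
# Hadoop/MapReduce parameters with candidate values chosen arbitrarily for
# illustration, not as tuning recommendations.
parameter_subspace = {
    "mapreduce.map.memory.mb":       [1024, 2048],
    "mapreduce.reduce.memory.mb":    [2048, 4096],
    "mapreduce.task.io.sort.mb":     [100, 200],
    "mapreduce.task.io.sort.factor": [10, 50],
    "dfs.blocksize":                 [134217728, 268435456],  # 128 MB vs 256 MB
    "mapreduce.map.output.compress": [False, True],
}

# Even this small, two-level subspace yields 2**6 = 64 full configurations,
# which illustrates why exhaustive manual tuning quickly becomes impractical.
configurations = [dict(zip(parameter_subspace, values))
                  for values in product(*parameter_subspace.values())]
print(len(configurations))  # 64
```
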
This work proposes a process to support the efficient configuration of Hadoop by using
empirical techniques to analyze subspaces of the platform's parameters and by applying
statistical methods to verify the relevance of those parameters, obtaining optimized values
for the parameter subspace considered. To instantiate the process, we performed a case study
aimed at obtaining settings with a positive impact on the response time of a representative
application in this context. The validation was performed by comparing the proposed process
with existing solutions, in which we observed that the former had a significant advantage for
the same environment and workload used in the instantiation stage. Although the average
completion time of the process was higher than that of the other solutions, we present
scenarios in which the use of the proposed process is more advantageous (and feasible) than
the use of the other solutions. This is due to its flexibility, since it imposes no constraints
on the subspace of selected parameters or on the metrics that can be analyzed.
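
As a purely illustrative complement (not the dissertation's actual statistical procedure), the
sketch below shows one common way of checking parameter relevance from measured response
times: for each two-level parameter, compare the runs observed at its two levels with Welch's
t-test and keep only the parameters whose effect is statistically significant.

```python
from scipy import stats

# Minimal sketch, not the dissertation's procedure: decide which two-level
# parameters significantly affect the measured metric (e.g., response time)
# using Welch's t-test on the runs observed at each level.
def relevant_parameters(runs, alpha=0.05):
    """runs: list of (config_dict, response_time_in_seconds) pairs."""
    relevant = []
    for param in runs[0][0]:
        levels = sorted({cfg[param] for cfg, _ in runs}, key=repr)
        if len(levels) != 2:
            continue  # this simple check only handles two-level factors
        low  = [t for cfg, t in runs if cfg[param] == levels[0]]
        high = [t for cfg, t in runs if cfg[param] == levels[1]]
        _, p_value = stats.ttest_ind(low, high, equal_var=False)
        if p_value < alpha:
            relevant.append((param, p_value))
    return relevant
```

Under a check of this kind, parameters not flagged as relevant could keep their defaults and
only the relevant ones would have their values optimized; since the test depends only on the
chosen parameter subspace and on whatever metric was measured, it is consistent with the
flexibility argued for above.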