DUARTE, A. N.; http://lattes.cnpq.br/1982919735990024; DUARTE, Alexandre Nóbrega.
Abstract:
A lot of research effort has been devoted to finding better mechanisms for fault treatment in
grid computing, aiming to improve the reliability of such infrastructures. Ideally, a grid
user should be able to submit a set of tasks for remote execution, wait until the execution is
concluded, and then retrieve the results in the very same way she would if
using a single high-performance machine. In practice, however, this is not what users
of the larger grid infrastructures available today experience. It is not rare to observe high
failure rates for tasks submitted for execution in a grid infrastructure. Grid users see their
tasks fail and receive no feedback from the grid middleware that could help
them figure out why their tasks failed. Most of the time the user is not even able to tell
whether the task failed due to a problem inside the user application or due to some faulty service
located somewhere in the grid. This thesis proposes and evaluates a mechanism based on the
use of automatic software tests to detect failures and to diagnose their causes during
the execution of applications on this kind of infrastructure. Experimental results showed a
success rate of 93.99% ± 5.63% at a 95% confidence level, or 93.99% ± 7.52% at
a 99% confidence level, for the diagnoses produced by a tool implemented according to the proposed
mechanism.