BATALHA, M. S. G.; http://lattes.cnpq.br/2274155965398680; BATALHA, Marcela Santana Guimarães.
Abstract:
In distributed systems, process or communication channel failures that are not properly handled
may completely interrupt applications and, consequently, the services that depend on them.
In general, handling failures, whether by masking faults or by recovering from them, first
requires diagnosing the faulty components, that is, identifying them and reporting them to the
operational processes in a consistent manner. This thesis presents the development of a Fault-Tolerant Diagnosis Service based on CORBA (SDF) for distributed systems in which timeouts cannot be relied on to indicate or diagnose component failures precisely. The SDF service is distributed and fault tolerant, and it integrates concepts of management, diagnosis, and failure detection in asynchronous systems. To establish a diagnosis, we use adaptive timeouts, operating system calls, and a variation of the two-phase commit protocol that guarantees a coherent view of the diagnosis among the managed processes. The Diagnosis Service includes a visualisation tool that presents the states of the monitored processes and the history of the machines on which they run. It was implemented and tested in a Java/CORBA environment on the LaSiD/UFBA computer network.