OLIVEIRA, E. W. A.; http://lattes.cnpq.br/2894993072804612; OLIVEIRA, Ely Wagner Aguiar de.
Abstract:
To extend their services, several sectors of society, such as banks, hospitals, and industries, increasingly depend on the correct functioning of distributed systems. At the same time, like any other computational environment, distributed systems are exposed to faults. If not properly treated, these faults can prevent a distributed system from completing its tasks. This challenges designers and developers, who must meet a growing demand for dependability in systems that are ever more exposed to faults. Since faults cannot be totally prevented, systems must employ fault tolerance mechanisms, which allow faults to be detected and treated without interrupting system operation. Unreliable failure detectors are an important abstraction for supporting the implementation of fault-tolerant protocols in asynchronous distributed systems. Several classes of failure detectors with varying semantics have been proposed, among them the class of Perfect Failure Detectors, which is the strongest. This work presents the design and implementation of Delphus, a perfect failure detection service with quality of service. It is a new operating system service, implemented as a module for generic Linux versions. Access to the service is provided through APIs implemented in both C and Java. An extra communication channel is used solely to convey service messages. Delphus is presented as an inexpensive and important tool for supporting the implementation of fault tolerance mechanisms without imposing strong restrictions on the system.
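
To make the notion of a failure detector concrete, the following is a minimal sketch of a generic timeout-based detector. It is not the Delphus API or its implementation: the class `HeartbeatDetector`, its methods, the process names, and the timeout value are all hypothetical, chosen only for illustration. The sketch uses an explicit, simulated clock. Such a detector behaves as a perfect one only under the assumption that the gap between heartbeats from a live process (message delay included) never exceeds the timeout.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatDetector:
    """Hypothetical timeout-based failure detector over an explicit clock."""

    timeout: float  # assumed upper bound on the gap between heartbeats
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, process_id: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from process_id.
        self.last_seen[process_id] = now

    def suspects(self, now: float) -> set:
        # A process is suspected once its silence exceeds the timeout.
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}


detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=2.0)  # p1 keeps sending; p2 goes silent

suspected_early = detector.suspects(now=2.5)  # nobody has exceeded the timeout
suspected_late = detector.suspects(now=4.0)   # only p2 has been silent too long
```

If the timing assumption does not hold, a slow but live process could be wrongly suspected, which is why a timeout alone is not enough for perfect detection in a purely asynchronous system.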