Abstract
Future Exascale computing systems with a large number of
processors, memory elements and interconnection links, are expected
to experience multiple, complex faults, which affect both applications
and operating-runtime systems. A variety of algorithms, frameworks and
tools are being proposed to realize and verify resilient computations that
are guaranteed to produce correct results on failure-prone computing
systems. We analytically show that certain classes of resilient computation
problems in presence of a general class of faults are undecidable,
that is, no algorithms exist for solving them. We first show that verification
of resilient computations in general is an undecidable problem.
We then illustrate a specific class of faults that can create infinite loops
or non-halting conditions in otherwise terminating computations, whose
detection in general is undecidable. We then show certain resilient computation
problems to be undecidable by using reductions from the loop
detection and halting problems under two formulations, namely, an abstract
programming language and Turing machines, respectively. These
two reductions highlight different failure effects: the former represents
program and data corruption, and the latter illustrates incorrect program
execution. These results call for broad-based, well-characterized
resilience approaches that complement purely computational solutions,
by integrating additional methods such as hardware monitors, co-designs,
and system- and application-specific diagnosis codes.