Skip to main content
SHARE
Publication

On Undecidability Aspects of Resilient Computations and Implications to Exascale...

by Nageswara S Rao
Publication Type
Conference Paper
Book Title
Proceedings of Resilience 2014
Publication Date
Conference Name
Seventh Workshop on Resiliency in High Performance Computing with Clouds, Grids, and Clusters
Conference Location
Porto, Portugal
Conference Sponsor
Euro-Par 2014
Conference Date

Future Exascale computing systems with a large number of
processors, memory elements and interconnection links, are expected
to experience multiple, complex faults, which affect both applications
and operating-runtime systems. A variety of algorithms, frameworks and
tools are being proposed to realize and verify resilient computations that
are guaranteed to produce correct results on failure-prone computing
systems. We analytically show that certain classes of resilient computation
problems in presence of a general class of faults are undecidable,
that is, no algorithms exist for solving them. We first show that verification
of resilient computations in general is an undecidable problem.
We then illustrate a specific class of faults that can create infinite loops
or non-halting conditions in otherwise terminating computations, whose
detection in general is undecidable. We then show certain resilient computation
problems to be undecidable by using reductions from the loop
detection and halting problems under two formulations, namely, an abstract
programming language and Turing machines, respectively. These
two reductions highlight different failure effects: the former represents
program and data corruption, and the latter illustrates incorrect program
execution. These results call for broad-based, well-characterized
resilience approaches that complement purely computational solutions,
by integrating additional methods such as hardware monitors, co-designs,
and system- and application-specific diagnosis codes.