Publication

On Undecidability Aspects of Resilient Computations and Implications to Exascale...

by Nageswara S Rao
Publication Type
Conference Paper
Book Title
Euro-Par 2014: Parallel Processing Workshops
Publication Date
Page Numbers
511 to 522
Volume
8805
Conference Name
Euro-Par 2014: Parallel Processing Workshops: Resilience 2014
Conference Location
Porto, Portugal
Conference Date
-

Future exascale computing systems, with large numbers of processors, memory elements, and interconnection links, are expected to experience multiple, complex faults that affect both applications and operating-runtime systems. A variety of algorithms, frameworks, and tools are being proposed to realize and/or verify the resilience properties of computations that guarantee correct results on failure-prone computing systems. We analytically show that certain resilient computation problems in the presence of general classes of faults are undecidable, that is, no algorithms exist for solving them. We first show that membership verification in a generic set of resilient computations is undecidable. We then describe classes of faults that can create infinite loops or non-halting computations, whose detection in general is undecidable. Finally, we show certain resilient computation problems to be undecidable by using reductions from the loop detection and halting problems under two formulations, namely, an abstract programming language and Turing machines, respectively. These two reductions highlight different failure effects: the former represents program and data corruption, and the latter illustrates incorrect program execution. These results call for broad-based, well-characterized resilience approaches that complement purely computational solutions using methods such as hardware monitors, co-designs, and system- and application-specific diagnosis codes.
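
The claim that faults can turn a halting computation into a non-halting one can be illustrated with a small sketch. The toy counter machine below is a hypothetical stand-in, not the abstract programming language used in the paper; the instruction set, the `run` function, and the `max_steps` budget are all assumptions made for illustration. It shows one program-corruption fault (a flipped opcode) and one data-corruption fault (a corrupted input register), each of which makes a terminating loop run forever. The step budget plays the role of a simple watchdog: it can report "did not halt within the budget", but it cannot decide whether the computation would ever halt.

```python
from typing import List, Tuple

# Hypothetical instruction set (an assumption, not the paper's language):
#   ("inc", r)         increment register r
#   ("dec", r)         decrement register r
#   ("jnz", r, target) jump to target if register r is nonzero
#   ("halt",)          stop
Instr = Tuple


def run(program: List[Instr], registers: List[int], max_steps: int) -> Tuple[bool, List[int]]:
    """Execute program for at most max_steps instructions.

    Returns (halted, registers). halted == False only means the step budget
    ran out; it is not a proof that the program never halts.
    """
    regs = list(registers)
    pc = 0
    for _ in range(max_steps):
        if pc >= len(program):
            return True, regs
        op = program[pc]
        if op[0] == "halt":
            return True, regs
        if op[0] == "inc":
            regs[op[1]] += 1
            pc += 1
        elif op[0] == "dec":
            regs[op[1]] -= 1
            pc += 1
        elif op[0] == "jnz":
            pc = op[2] if regs[op[1]] != 0 else pc + 1
        else:
            raise ValueError(f"unknown instruction: {op!r}")
    return False, regs


# Fault-free program: move the value of register 0 into register 1.
countdown = [("dec", 0), ("inc", 1), ("jnz", 0, 0), ("halt",)]

# Program corruption: the opcode at address 0 is flipped from "dec" to "inc",
# so register 0 grows forever and the loop never exits.
corrupted_program = [("inc", 0), ("inc", 1), ("jnz", 0, 0), ("halt",)]

if __name__ == "__main__":
    print(run(countdown, [3, 0], max_steps=1000))          # (True, [0, 3])
    print(run(corrupted_program, [3, 0], max_steps=1000)[0])  # False
    # Data corruption: the input register holds -1 instead of 3, so
    # decrementing never reaches zero (Python integers do not wrap around).
    print(run(countdown, [-1, 0], max_steps=1000)[0])       # False
```

Raising `max_steps` only moves the cutoff. For any fixed budget there are halting programs that exceed it, which is why a bounded monitor of this kind complements, rather than replaces, the kind of analysis the abstract shows to be undecidable in general.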