Skip to main content
SHARE
Publication

Self-stabilizing Connected Components...

by Piyush K Sao, Christian Engelmann, Srinivas Eswar, Oded Green, Richard Vuduc
Publication Type
Conference Paper
Book Title
2019 IEEE/ACM 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS)
Publication Date
Page Numbers
50 to 59
Conference Name
The 9th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS) 2019
Conference Location
Denver, Colorado, United States of America
Conference Sponsor
ACM
Conference Date

For the problem of computing the connected components of a graph, this paper considers the design of algorithms that are resilient to transient hardware faults, like bit Hips. More specifically, it applies the technique of self-stabilization. A system is self-stabilizing if, when starting from a valid or invalid state, it is guaranteed to reach a valid state after a finite number of steps. Therefore on a machine subject to a transient fault, a self-stabilizing algorithm could recover if that fault caused the system to enter an invalid state. We give a comprehensive analysis of the valid and invalid states during label propagation and derive algorithms to verify and correct the invalid state. The self-stabilizing label-propagation algorithm performs O (V log V) additional computation and requires O (V) additional storage over its conventional counterpart (and, as such, does not increase asymptotic complexity over conventional label propagation). When run against a battery of simulated fault injection tests, the self-stabilizing label propagation algorithm exhibits more resilient behavior than a triple modular redundancy (TMR) based fault-tolerant algorithm in 80% of cases. From a performance perspective, it also outperforms TMR as it requires fewer iterations in total. Beyond the fault tolerance properties of self-stabilizing label-propagation, we believe, they are useful from the theoretical perspective; and may have other use-cases.