Publication

3-Dimensional Root Cause Diagnosis via Co-analysis...

by Ziming Zheng, Zhiling Lan, Li Yu, Terry R Jones
Publication Type
Conference Paper
Conference Name
9th International Conference on Autonomic Computing
Conference Location
San Jose, California, United States of America
Conference Sponsor
IEEE/ACM

With the growth of system size and complexity, reliability has become a major concern for large-scale systems. When a failure occurs, system administrators typically trace events in Reliability, Availability, and Serviceability (RAS) logs to diagnose the root cause. However, RAS logs contain only limited diagnostic information, and manual processing is time-consuming, error-prone, and not scalable. To address this problem, we present an automated root cause diagnosis mechanism for large-scale HPC systems. Our mechanism examines multiple logs to provide a 3-D fine-grained root cause analysis. Here, 3-D means that the analysis pinpoints the failure layer, the time, and the location of the event that causes the problem.
We evaluate our mechanism using real logs collected from a production IBM Blue Gene/P system at Oak Ridge National Laboratory. It successfully identifies failure layer information for 219 failures during a 23-month period. Furthermore, it effectively identifies the triggering events with time and location information, even when the triggering events occur hundreds of hours before the resulting failures.