Publication

Correlating Log Messages for System Diagnostics...

Publication Type
Conference Paper
Publication Date
Conference Name
Cray User Group
Conference Location
Edinburgh, United Kingdom
Conference Date

In large-scale computing systems, the sheer volume of log data presents daunting challenges for debugging and monitoring. The Oak Ridge Leadership Computing Facility’s premier simulation platform, the Cray XT5 known as Jaguar, can generate a few hundred thousand log entries in less than a minute for many system-level events. Determining the root cause of such events requires analyzing and interpreting a large number of log messages. Most often, log messages are best understood when interpreted collectively rather than individually. In this paper, we present our approach to interpreting log messages by identifying their commonalities and grouping them into clusters. Given a set of log messages within a time interval, we group the messages based on source, target, and/or error type, and correlate them with hardware and application information. We monitor the Lustre log messages in the XT5 console log and show that such grouping assists in detecting the source of system events. By intelligently grouping and correlating events in the log, we are able to provide system administrators with meaningful information in a concise format for root cause analysis.
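
The grouping step described in the abstract (messages within a time interval, clustered by source, target, and/or error type) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the `LogMessage` record, its field names, the helper functions, and the sample values are all hypothetical, and the messages are assumed to be already parsed out of the console log.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class LogMessage:
    """Hypothetical, already-parsed log record (field names are assumptions)."""
    timestamp: float  # seconds relative to some reference point
    source: str       # component that reported the message
    target: str       # component the message refers to
    error_type: str   # coarse error label


def group_messages(
    messages: Sequence[LogMessage],
    start: float,
    end: float,
    keys: Tuple[str, ...] = ("source", "target", "error_type"),
) -> Dict[Tuple[str, ...], List[LogMessage]]:
    """Group messages with start <= timestamp < end by the chosen fields."""
    groups: Dict[Tuple[str, ...], List[LogMessage]] = defaultdict(list)
    for msg in messages:
        if start <= msg.timestamp < end:
            key = tuple(getattr(msg, k) for k in keys)
            groups[key].append(msg)
    return dict(groups)


def summarize(groups: Dict[Tuple[str, ...], List[LogMessage]]):
    """Return (group key, message count) pairs, largest groups first."""
    return sorted(
        ((key, len(msgs)) for key, msgs in groups.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Made-up sample data, for illustration only.
sample = [
    LogMessage(0.5, "node-a", "target-1", "timeout"),
    LogMessage(1.0, "node-b", "target-1", "timeout"),
    LogMessage(1.5, "node-c", "target-1", "timeout"),
    LogMessage(2.0, "node-a", "target-2", "io_error"),
    LogMessage(75.0, "node-d", "target-1", "timeout"),  # outside the interval
]

# Group by target and error type only, within a 60-second interval.
by_target = group_messages(sample, 0.0, 60.0, keys=("target", "error_type"))
summary = summarize(by_target)
for key, count in summary:
    print(key, count)
```

In this made-up sample, grouping on target and error type collapses three messages from different sources into a single cluster, which is the kind of concise view the abstract describes. Correlating the clusters with hardware and application information is not covered by this sketch.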