Abstract
With the advent of big data, the I/O subsystems of large-scale compute
clusters are becoming a center of focus, with more
applications putting greater demands on end-to-end I/O performance. These
subsystems are often complex in design. They comprise multiple hardware and
software layers to cope with the increasing capacity, capability and scalability
requirements of data-intensive applications. The shared nature of storage
resources and the intrinsic
interactions across these layers make realizing user-level, end-to-end
performance gains a great challenge.
We propose a topology-aware resource load balancing strategy to improve
per-application I/O performance. We demonstrate the effectiveness of
our algorithm on an extreme-scale compute cluster, Titan, at the Oak Ridge
Leadership Computing Facility (OLCF). Our experiments with both synthetic
benchmarks and a real-world application show that, even under congestion, our
proposed algorithm can improve large-scale application I/O performance
significantly, both reducing application run times and enabling
higher-resolution simulation runs.