Publication

A Next-Generation Parallel File System Environment for the OLCF...

Publication Type
Conference Paper
Publication Date
Conference Name
Cray User Group
Conference Location
Stuttgart, Germany
Conference Date
-

When deployed in 2008/2009, the Spider system at the
Oak Ridge National Laboratory’s Leadership Computing
Facility (OLCF) was the world’s largest-scale Lustre
parallel file system. Envisioned as a shared parallel
file system capable of delivering both the bandwidth
and capacity requirements of the OLCF’s diverse
computational environment, Spider has since become
a blueprint for shared Lustre environments deployed
worldwide. Spider was designed to support the parallel I/O requirements
of the Jaguar XT5 system and other smaller-scale
platforms at the OLCF, and the upgrade to the Titan
XK6 heterogeneous system will begin to push the limits
of Spider’s original design by mid-2013. With a doubling
in total system memory and a 10x increase in FLOPS, Titan
will require both higher bandwidth and larger total
capacity. Our goal is to provide a 4x increase in total
I/O bandwidth, from over 240 GB/s today to 1 TB/s,
and a doubling in total capacity. While aggregate bandwidth
and total capacity remain important capabilities,
an equally important goal in our efforts is dramatically
increasing metadata performance, currently the Achilles
heel of parallel file systems at leadership scale. We present in
this paper an analysis of our current I/O workloads, our
operational experiences with the Spider parallel file systems,
the high-level design of our Spider upgrade, and
our efforts in developing benchmarks that synthesize our
performance requirements based on our workload characterization
studies.