Timely Result-Data Offloading for Improved HPC Center Scratch Provisioning and Serviceability

by Henri Monti, Ali Butt, Sudharshan S Vazhkudai

Publication Type

Journal

Journal Name

IEEE Transactions on Parallel and Distributed Systems

Publication Date

August, 2011

Page Numbers

1307 to 1322

Abstract

Modern High-Performance Computing (HPC) centers are
facing a data deluge from emerging scientific applications. Supporting
large data entails a significant commitment of the high-throughput
center storage system, the scratch space. However, the
scratch space is typically managed using simple “purge policies,”
without sophisticated end-user data services to balance resource
consumption and user serviceability. End-user data services
such as offloading are performed using point-to-point transfers
that cannot reconcile the center’s purge deadlines with users’ delivery
deadlines, cannot adapt to changing dynamics in the end-to-end
data path, and are not fault-tolerant. Such inefficiencies can
be prohibitive to sustaining high performance.
In this paper, we address the above issues by designing a
framework for the timely, decentralized offload of application
result data. Our framework uses an overlay of user-specified
intermediate and landmark sites to orchestrate a decentralized
fault-tolerant delivery. We have implemented our techniques
within a production job scheduler (PBS) and data transfer tool
(BitTorrent). Our evaluation, using both a real implementation
and supercomputer job log-driven simulations, shows that
offloading times can be significantly reduced (90.4% for a 5 GB
data transfer) and that the exposure window can be minimized while
still meeting center-user Service Level Agreements.
