Publication

Design and Implementation of a Scalable Membership Service for Supercomputer Resiliency-Aware Runtime...

by Yoav Tock, Benjamin Mandler, Jose Moreira, Terry R Jones
Publication Type
Conference Paper
Publication Date
Conference Name
Euro-Par 2013
Conference Location
Aachen, Germany
Conference Sponsor
German Research School for Simulation Sciences, Forschungszentrum Jülich, and RWTH Aachen University
Conference Date
-

As HPC systems and applications get bigger and more complex, we
are approaching an era in which resiliency and run-time elasticity concerns
become paramount. We offer a building block for an alternative resiliency
approach in which computations can make progress while components fail, and
in which the set of nodes can change dynamically throughout a computation's
lifetime. The core of our solution is a hierarchical, scalable membership
service providing eventual consistency semantics. An attribute replication
service is used for hierarchy organization and is also exposed to external
applications. Our solution is based on P2P technologies and provides
resiliency and elastic runtime support at ultra-large scales. The resulting
middleware is general purpose while exploiting the unique features and
architecture of HPC platforms. We have implemented and tested this system on
BlueGene/P with Linux and, using worst-case analysis, evaluated the service's
scalability as effective for up to 1M nodes.
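
To make "eventual consistency semantics" for a membership view more concrete, here is a minimal sketch. It is not the protocol from the paper: the abstract does not specify one, so the versioned-entry merge rule, the ring-shaped push-pull exchange, and every name used here (`Entry`, `merge`, `gossip_round`, `converge`) are assumptions made purely for illustration. Each node keeps a view mapping node names to a versioned record with a liveness flag and some replicated attributes; when two nodes exchange views, the record with the higher version wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical membership record; not taken from the paper."""
    version: int        # counter that grows with every update about a node
    alive: bool         # last known liveness of that node
    attrs: tuple = ()   # replicated attributes, e.g. (("zone", "z0"),)


def merge(local, remote):
    """Return a view that keeps, per node, the entry with the highest version."""
    merged = dict(local)
    for node, entry in remote.items():
        if node not in merged or entry.version > merged[node].version:
            merged[node] = entry
    return merged


def gossip_round(views, order):
    """Each node performs one push-pull exchange with its ring successor."""
    for i, node in enumerate(order):
        peer = order[(i + 1) % len(order)]
        combined = merge(views[node], views[peer])
        views[node] = combined
        views[peer] = dict(combined)


def converge(views, order, max_rounds=100):
    """Run gossip rounds until all views are equal; return the rounds used."""
    for rounds in range(1, max_rounds + 1):
        gossip_round(views, order)
        reference = views[order[0]]
        if all(views[node] == reference for node in order):
            return rounds
    raise RuntimeError("views did not converge within max_rounds")


if __name__ == "__main__":
    order = ["n0", "n1", "n2", "n3"]
    # Initially every node knows only about itself.
    views = {n: {n: Entry(version=1, alive=True)} for n in order}
    # n2 has detected that n3 failed and records it with a newer version.
    views["n2"]["n3"] = Entry(version=2, alive=False)
    rounds = converge(views, order)
    print(f"converged after {rounds} round(s)")
    for node in order:
        print(node, sorted(views[node].items()))
```

In this toy model, views may disagree temporarily, but because the merge rule is deterministic and only ever moves toward higher versions, all nodes end up with the same view once updates stop and exchanges continue. A production service at the scale described above would additionally need failure detection, a hierarchy, and randomized or topology-aware peer selection, none of which this sketch attempts.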