Publication

Reducing Application Runtime Variability on Jaguar XT5...

Publication Type
Conference Paper
Publication Date
Conference Name
Cray Users Group Conference (CUG) 2010
Conference Location
Edinburgh, United Kingdom
Conference Sponsor
Cray
Conference Date

Operating system (OS) noise is defined as interference generated by the OS that
prevents a compute core from performing "useful" work. Compute node kernel
daemons, network interfaces, and other OS-related services are major sources of
such interference. This interference on individual compute cores can vary in
duration and frequency, and can cause de-synchronization (jitter) in collective
communication tasks, which in turn leads to variable (degraded) overall parallel
application performance. This behavior is more observable in large-scale
applications using certain types of collective communication primitives, such
as MPI_Allreduce.
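
The scale dependence described above can be illustrated with a minimal sketch. It is not taken from the paper: it assumes that interruptions on different ranks are independent and that a collective step finishes only when its slowest rank arrives. The function names and the per-rank probability are hypothetical.

```python
def collective_step_time(per_rank_times):
    """A synchronizing collective completes when the slowest rank arrives,
    so the step time is the maximum of the per-rank times."""
    return max(per_rank_times)


def prob_step_delayed(p_rank, n_ranks):
    """Probability that at least one of n_ranks is interrupted during a step.

    Assumption (not from the paper): each rank is interrupted independently
    with probability p_rank per step.
    """
    return 1.0 - (1.0 - p_rank) ** n_ranks


if __name__ == "__main__":
    # One delayed rank out of three stretches the whole step.
    print(collective_step_time([1.0, 1.0, 1.5]))

    # Hypothetical per-rank interruption probability of 1e-4 per step.
    for n in (1, 1_000, 100_000):
        print(n, prob_step_delayed(1e-4, n))
```

Under these assumptions, an interruption that is rare on any single core becomes close to certain somewhere in the job once the rank count is large, which is consistent with the abstract's remark that the effect is more observable at scale.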

This paper presents our effort towards reducing the overall effect of OS noise
on our large-scale parallel applications. Our tests were performed on the
quad-core Jaguar, the Cray XT5 at the Oak Ridge National Laboratory Leadership
Computing Facility (OLCF). At the time of these tests, Jaguar was a 1.4 PFLOPS
supercomputer with 149,504 compute cores and 8 cores per node. We aggregated
OS noise sources onto a single core for each node. The scientific application
was then run on six of the remaining cores in each node. Our results show
that we were able to improve the MPI_Allreduce performance by two orders of
magnitude. We demonstrated up to a 30% boost in the performance of the Parallel
Ocean Program (POP) using this technique.
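
As a rough worked example of the core budget implied by the configuration above, the sketch below uses only the figures stated in the abstract (149,504 compute cores, 8 cores per node, 6 application cores per node). The derived node count and application core count are plain arithmetic and are not figures reported in the abstract; the variable and function names are illustrative.

```python
TOTAL_COMPUTE_CORES = 149_504   # stated in the abstract
CORES_PER_NODE = 8              # stated in the abstract
APP_CORES_PER_NODE = 6          # stated in the abstract


def node_count(total_cores, cores_per_node):
    """Number of nodes, assuming every node has the same core count."""
    if total_cores % cores_per_node != 0:
        raise ValueError("total cores is not a multiple of cores per node")
    return total_cores // cores_per_node


nodes = node_count(TOTAL_COMPUTE_CORES, CORES_PER_NODE)
app_cores = nodes * APP_CORES_PER_NODE
app_fraction = APP_CORES_PER_NODE / CORES_PER_NODE

if __name__ == "__main__":
    print(nodes)         # derived: number of compute nodes
    print(app_cores)     # derived: cores running the application machine-wide
    print(app_fraction)  # derived: share of each node given to the application
```

In this configuration the application occupies three quarters of each node's cores, which is the cost side of the trade-off against the reduced noise.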