An efficient mixed-precision, hybrid CPU-GPU implementation of a nonlinearly implicit one-dimensional particle-in-cell algori...

by Guangye Chen, Luis Chacon, Daniel C Barnes

Publication Type

Journal

Journal Name

Journal of Computational Physics

Publication Date

June 2012

Page Numbers

5374 to 5388

Volume

231

Abstract

Recently, a fully implicit, energy- and charge-conserving particle-in-cell method has been developed
for multi-scale, full-f kinetic simulations [G. Chen, et al., J. Comput. Phys. 230, 18
(2011)]. The method employs a Jacobian-free Newton-Krylov (JFNK) solver and is capable
of using very large timesteps without loss of numerical stability or accuracy. A fundamental
feature of the method is the segregation of particle orbit integrations from the field solver,
while remaining fully self-consistent. This provides great flexibility, and dramatically improves
the solver efficiency by reducing the degrees of freedom of the associated nonlinear
system. However, it requires a particle push per nonlinear residual evaluation, which makes
the particle push the most time-consuming operation in the algorithm. This paper describes
a very efficient mixed-precision, hybrid CPU-GPU implementation of the implicit PIC algorithm.
The JFNK solver is kept on the CPU (in double precision), while the inherent
data parallelism of the particle mover is exploited by implementing it in single-precision on
a graphics processing unit (GPU) using CUDA. Performance-oriented optimizations, with
the aid of an analytical performance model, the roofline model, are employed. Despite being
highly dynamic, the adaptive, charge-conserving particle mover algorithm achieves up
to 300-400 GOp/s (including single-precision floating-point, integer, and logic operations)
on an Nvidia GeForce GTX580, corresponding to 20-25% absolute GPU efficiency (against
the peak theoretical performance) and 50-70% intrinsic efficiency (against the algorithm’s
maximum operational throughput, which neglects all latencies). This is about 200-300 times
faster than an equivalent serial CPU implementation. When the single-precision GPU particle
mover is combined with a double-precision CPU JFNK field solver, overall performance
gains of ∼100× over the double-precision CPU-only serial version are obtained, with no apparent
loss of robustness or accuracy when applied to a challenging long-time-scale ion acoustic
wave simulation.
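
The roofline model named in the abstract bounds a kernel's attainable throughput by the smaller of the peak compute rate and the memory bandwidth multiplied by the kernel's operational intensity. The following is a minimal sketch of that bound and of the efficiency ratios quoted above. It is not the authors' performance model: the function names, the hardware constants, and the two sample intensities are hypothetical, chosen only for illustration. The last lines back out the peak throughput implied by the abstract's own figures, assuming the lower ends of the two quoted ranges (300 GOp/s and 20%) belong together.

```python
def roofline_bound(peak_gops, bandwidth_gbs, intensity_ops_per_byte):
    """Attainable throughput (GOp/s) under the basic roofline model."""
    return min(peak_gops, bandwidth_gbs * intensity_ops_per_byte)


def efficiency(achieved_gops, reference_gops):
    """Fraction of a reference throughput that was actually achieved."""
    return achieved_gops / reference_gops


# Hypothetical hardware numbers (assumptions, not taken from the abstract).
PEAK_GOPS = 1500.0     # assumed peak throughput, GOp/s
BANDWIDTH_GBS = 190.0  # assumed memory bandwidth, GB/s

# Operational intensity above which a kernel is compute-bound ("ridge point").
ridge = PEAK_GOPS / BANDWIDTH_GBS

# Two hypothetical kernels, one on each side of the ridge point.
memory_bound = roofline_bound(PEAK_GOPS, BANDWIDTH_GBS, 1.0)
compute_bound = roofline_bound(PEAK_GOPS, BANDWIDTH_GBS, 20.0)

# Peak implied by the abstract: 300 GOp/s at 20% absolute efficiency.
# Pairing these two lower ends of the quoted ranges is an assumption.
implied_peak = 300.0 / 0.20

if __name__ == "__main__":
    print(f"ridge point: {ridge:.2f} ops/byte")
    print(f"memory-bound kernel: {memory_bound:.0f} GOp/s")
    print(f"compute-bound kernel: {compute_bound:.0f} GOp/s")
    print(f"implied peak: {implied_peak:.0f} GOp/s")
    print(f"absolute efficiency at 400 GOp/s: {efficiency(400.0, implied_peak):.1%}")
```

In this picture, the abstract's "absolute" efficiency compares achieved throughput with the flat compute ceiling, while its "intrinsic" efficiency compares it with a lower, algorithm-specific ceiling that neglects latencies.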
