Abstract
We develop a parallel EM algorithm for multivariate Gaussian mixture models and use it to
perform model-based clustering of a large climate data set. We reformulate three variants of the
EM algorithm for parallel execution and present a new, faster variant. All are implemented
using the single program, multiple data (SPMD) programming model, which can exploit the
combined memory of large distributed computer architectures to
process larger data sets. Displaying the estimated mixture model rather than the data lets
us explore multivariate relationships in a way that scales to data of arbitrary size. We study
the performance of our methodology on simulated data and apply it to a high-resolution
climate data set produced by the Community Atmosphere Model (CAM5). This article
has supplementary material online.
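
To make the SPMD idea concrete, the following is a minimal sketch of one way an EM iteration can be organized so that each process touches only its own block of data. It is an illustration under stated assumptions, not the implementation described in this article: it uses a univariate mixture rather than a multivariate one, it simulates the processes as a list of data chunks in a single Python program, and it replaces the collective sum across processes with a plain sum. The names simulate_chunks, local_stats, em_step, and fit are hypothetical.

```python
import math
import random


def simulate_chunks(n_ranks, n_per_rank, seed=0):
    """Hypothetical helper: each simulated rank draws its own block of data."""
    chunks = []
    for rank in range(n_ranks):
        rng = random.Random(seed + rank)
        # Alternate between two well-separated components, N(-2, 1) and N(3, 1).
        chunks.append([rng.gauss(-2.0 if i % 2 == 0 else 3.0, 1.0)
                       for i in range(n_per_rank)])
    return chunks


def local_stats(chunk, weights, means, variances):
    """E-step on one rank's chunk: accumulate local sufficient statistics."""
    k = len(weights)
    n_k = [0.0] * k   # sum of responsibilities per component
    s1 = [0.0] * k    # sum of responsibility * x
    s2 = [0.0] * k    # sum of responsibility * x^2
    loglik = 0.0
    for x in chunk:
        dens = [weights[j]
                * math.exp(-0.5 * (x - means[j]) ** 2 / variances[j])
                / math.sqrt(2.0 * math.pi * variances[j])
                for j in range(k)]
        total = sum(dens)  # assumed > 0; a robust version would work in logs
        loglik += math.log(total)
        for j in range(k):
            r = dens[j] / total
            n_k[j] += r
            s1[j] += r * x
            s2[j] += r * x * x
    return n_k, s1, s2, loglik


def em_step(chunks, weights, means, variances):
    """One EM iteration: local E-steps, a global sum, then the M-step."""
    k = len(weights)
    stats = [local_stats(c, weights, means, variances) for c in chunks]
    # Stand-in for a collective sum across processes (assumption: in a real
    # SPMD program every rank would obtain these totals from a reduction).
    n_k = [sum(s[0][j] for s in stats) for j in range(k)]
    s1 = [sum(s[1][j] for s in stats) for j in range(k)]
    s2 = [sum(s[2][j] for s in stats) for j in range(k)]
    loglik = sum(s[3] for s in stats)
    n = sum(n_k)
    new_w = [n_k[j] / n for j in range(k)]
    new_m = [s1[j] / n_k[j] for j in range(k)]
    # Small floor keeps a variance from collapsing to zero.
    new_v = [max(s2[j] / n_k[j] - new_m[j] ** 2, 1e-6) for j in range(k)]
    return new_w, new_m, new_v, loglik


def fit(chunks, weights, means, variances, n_iter=50):
    """Run a fixed number of EM iterations and record the log-likelihood."""
    history = []
    for _ in range(n_iter):
        weights, means, variances, ll = em_step(chunks, weights, means, variances)
        history.append(ll)
    return weights, means, variances, history
```

Only the small vectors of sufficient statistics cross process boundaries in this sketch, so the amount of data exchanged per iteration depends on the number of mixture components and not on the number of observations. The recorded log-likelihood is evaluated at the parameters entering each iteration.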