Research Highlight

Parallel Hybrid Metaheuristics with Distributed Intensification and Diversification for Large-scale Optimization in Big Data Statistical Analysis

Figure: The distributed diversification and intensification strategies take less time to find all matching samples when more computing power is harnessed. (Computational Urban Sciences)

Scientific Achievement

Distributed search strategies are implemented within a parallel genetic algorithm as a high-performance computing solution for large-scale causal inference studies based on observational data.

Significance and Impact

When a randomized experiment is not possible, observational data may offer a path toward causal inference, but this requires selecting an optimal subset of observations that mimics experimental data, an NP-hard problem. This work develops parallel hybrid metaheuristics that harness supercomputing power to accelerate subset selection among a large number of observations.

Research Details

  • Prior work established the optimal subset selection approach that enables optimization-based analysis of observational data for causal inference
  • Distributed heuristics are devised to diversify decision-space traversal, avoiding wasteful duplicated effort, and to intensify a search of interest by engaging more processors
  • The heuristics are implemented with non-blocking, asynchronous communication to increase the overlap of computation and communication (a minimal communication sketch appears after this list)
  • The approach is evaluated in a parallel computing environment on the seminal LaLonde CPS dataset, where the Kolmogorov-Smirnov statistic measures the difference between covariate distributions (a worked example follows the list). Desirable scalability is achieved.
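
For illustration, the two-sample Kolmogorov-Smirnov statistic used as the balance measure is the maximum gap between the empirical CDFs of the two samples. The sketch below is a minimal NumPy version with made-up data; it is not the authors' code, and scipy.stats.ks_2samp computes the same quantity.

```python
import numpy as np

def ks_statistic(treated, control):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the empirical CDFs of the two samples."""
    treated = np.sort(np.asarray(treated, dtype=float))
    control = np.sort(np.asarray(control, dtype=float))
    grid = np.concatenate([treated, control])
    # Empirical CDFs of both samples evaluated at every observed value.
    cdf_t = np.searchsorted(treated, grid, side="right") / treated.size
    cdf_c = np.searchsorted(control, grid, side="right") / control.size
    return float(np.abs(cdf_t - cdf_c).max())

# Hypothetical usage: a smaller statistic means better covariate balance
# between treated units and a candidate control subset.
rng = np.random.default_rng(0)
treated = rng.normal(0.3, 1.0, size=300)   # made-up covariate values
control = rng.normal(0.0, 1.0, size=500)
print(ks_statistic(treated, control))
```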

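Below is a minimal mpi4py sketch of the kind of non-blocking, asynchronous message exchange described in the list above, where each process interleaves local search steps with posting and draining peer messages. The tag, the message contents, and the helper functions (local_search_step, handle_message) are illustrative assumptions, not the paper's implementation.

```python
from mpi4py import MPI
import random

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

TAG_LANDSCAPE = 1   # hypothetical tag: share landscape characteristics with peers

best = 0.0

def local_search_step():
    """Stand-in for one hill-climbing move on the subset-selection objective."""
    global best
    best = max(best, random.random())

def handle_message(msg):
    """Stand-in reaction to a peer message, e.g. diversify away from its region."""
    pass

# Post a non-blocking receive so peer messages can be drained between local
# search steps without stalling the search (computation/communication overlap).
recv_req = comm.irecv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG)
pending_sends = []

for step in range(100):
    local_search_step()

    # Asynchronously tell peers what we are exploring so they can diversify
    # away from this region and avoid duplicated effort.
    msg = {"rank": rank, "step": step, "best": best}
    pending_sends += [comm.isend(msg, dest=p, tag=TAG_LANDSCAPE)
                      for p in range(size) if p != rank]

    # Drain any messages that arrived while we were computing.
    done, payload = recv_req.test()
    while done:
        handle_message(payload)
        recv_req = comm.irecv(source=MPI.ANY_SOURCE, tag=MPI.ANY_TAG)
        done, payload = recv_req.test()

    # Retire completed sends without blocking on slow ones.
    pending_sends = [r for r in pending_sends if not r.test()[0]]

comm.Barrier()
recv_req.Cancel()   # clean up the still-pending receive before exit
```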
Facility

The experiments conducted in this paper used the Extreme Science and Engineering Discovery Environment (XSEDE) resources, which are supported by National Science Foundation grant number ACI-1548562. Specifically, the authors acknowledge the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources, i.e., the Stampede2 system, that have contributed to the research results reported within this paper.

Publication

Wendy K. Tam Cho and Yan Y. Liu. 2019. Parallel Hybrid Metaheuristics with Distributed Intensification and Diversification for Large-scale Optimization in Big Data Statistical Analysis. 2019 IEEE International Conference on Big Data (Big Data), Los Angeles, CA, USA, 2019, pp. 3312-3320. DOI: 10.1109/BigData47090.2019.9006045

Funding

This work is supported in part by the Laboratory Directed Research and Development Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725.

Overview

Important insights into many data science problems that are traditionally analyzed via statistical models can be obtained by re-formulating and evaluating them within a large-scale optimization framework. However, the theoretical underpinnings of the statistical model may shift the goal of the decision space traversal from a traditional search for a single optimal solution to a traversal whose purpose is to yield a set of high-quality, independent solutions. We examine statistical frameworks with astronomical decision spaces that translate into optimization problems but are challenging for standard optimization methodologies. We address these new challenges by designing a hybrid metaheuristic with specialized intensification and diversification protocols in the base search algorithm. Our algorithm is extended to the high-performance computing realm on the Stampede2 supercomputer, where we experimentally demonstrate its ability to utilize multiple processors to collaboratively hill climb, broadcast messages to one another regarding landscape characteristics, diversify across the solution landscape, and request aid in climbing particularly difficult peaks.
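
As a rough illustration of that shifted goal, collecting many high-quality, mutually distinct solutions rather than a single optimum, the sketch below shows a restart hill climber that archives distinct local optima. The callables initial, neighbors, score, and distance are problem-specific placeholders assumed for illustration, not the operators used in the paper.

```python
import math, random

def hill_climb_archive(initial, neighbors, score, distance,
                       n_restarts=20, archive_size=10, min_dist=1.0):
    """Restart hill climbing that keeps an archive of high-quality,
    mutually distinct solutions instead of a single optimum."""
    archive = []                              # list of (score, solution)
    for _ in range(n_restarts):
        current = initial()                   # diversification: fresh start point
        current_score = score(current)
        improved = True
        while improved:                       # intensification: greedy ascent
            improved = False
            for cand in neighbors(current):
                s = score(cand)
                if s > current_score:
                    current, current_score, improved = cand, s, True
                    break
        # Archive the local optimum only if it is far from those kept so far.
        if all(distance(current, sol) >= min_dist for _, sol in archive):
            archive.append((current_score, current))
            archive.sort(key=lambda t: t[0], reverse=True)
            del archive[archive_size:]
    return archive

# Toy usage on a multi-modal 1-D function (illustrative only).
result = hill_climb_archive(
    initial=lambda: random.uniform(-10, 10),
    neighbors=lambda x: [x - 0.1, x + 0.1],
    score=lambda x: math.sin(x) - 0.01 * x * x,
    distance=lambda a, b: abs(a - b),
)
print(result[:3])
```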