ORNL's methodical leap into the exascale era

Al Geist. Image credit: Carlos Jones, ORNL

With the May 2022 launch of Frontier, ORNL officially vaulted the United States into the exascale era of high-performance computing. But back in 2008, the feasibility of exascale machines capable of performing a billion billion floating point operations per second looked bleak.

The University of Notre Dame’s Peter M. Kogge led a study sponsored by the Defense Advanced Research Projects Agency to identify exascale’s key hurdles. Published in May 2008, Technology Challenges in Achieving Exascale Systems surveyed HPC experts from universities, research labs and industry to predict whether the nation could — by 2015 — attain a thousandfold increase in computational power over then-new petascale systems. 

Their consensus? Not without changing the trajectory of HPC technology from the state of the art at that time.

But now, 15 years later, exascale computing has become a reality, albeit not quite as soon as DARPA had hoped. Frontier is the harbinger of a new era in computational science, to be followed in the next few years by Aurora at Argonne National Laboratory and El Capitan at Lawrence Livermore National Laboratory. These exascale supercomputers will tackle more extensive problems and answer more complex questions than ever before. 

How did the Oak Ridge Leadership Computing Facility and its vendor partners AMD and HPE Cray overcome the obstacles identified in Kogge’s report? It took a lot of organizational work, several false starts and some fortuitous advances in computing technology.

“We had to build more energy-efficient processors. We had to build more reliable hardware and better interconnects. We had to be able to move data around more efficiently. We had to design algorithms that can use that level of concurrency,” said Jack Dongarra, until recently the director of the University of Tennessee, Knoxville’s Innovative Computing Laboratory. Concurrency is a system’s ability to make progress on many tasks at once rather than strictly one after another; reaching exascale means keeping on the order of a billion operations in flight across the machine at any given moment.

“Those are all complicated things, and it required research to develop the technology necessary to achieve them. But in the intervening 10 years or so, we did just that,” Dongarra said.
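
To make that notion of concurrency concrete, the sketch below shows GPU-style parallelism in CUDA. It is an illustrative example only, not code from any ORNL system, and the kernel name and problem size are made up. A single kernel launch here puts roughly a million independent operations in flight at once; an exascale machine multiplies that kind of parallelism across tens of thousands of GPUs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread updates one array element, so about a million
// operations can be in flight concurrently rather than running
// one after another on a single core.
__global__ void axpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;  // ~1 million elements (illustrative size)
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // One thread per element: the hardware schedules them in parallel.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    axpy<<<blocks, threads>>>(2.0f, x, y, n);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```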

Planning ahead for a new class of supercomputer

To design and construct a next-generation supercomputer, the first step is the trickiest: finding funding. But to convince government agencies — and, in turn, Congress — to spend large sums on cutting-edge technology, a strong case must be made for its necessity and feasibility, and that takes many study groups and a sustained effort to build consensus.

Many reports were published in the decade following that first exascale study, usually with Dongarra’s name on them. Many examined the possibilities — and potential impossibilities — of exascale computing. Most identified the same two technical challenges that needed to be overcome to achieve exascale: reducing energy consumption and increasing the reliability of the chips powering these massive systems.

Meanwhile, exascale alliances between the national labs began forming with the goal of figuring out how to solve those technical challenges — and of gaining enough political traction to move exascale projects through federal agencies. Most fizzled, but forward motion toward exascale came on three fronts.

First, relationships with computer technology vendors were formalized in 2012 with DOE’s FastForward and DesignForward programs. FastForward focused on the processor, memory and storage vendors — such as AMD and Intel — to address power consumption and resiliency issues. DesignForward focused mainly on system integrators such as HPE, Cray and IBM to plan packaging, integration and engineering. Cray was later acquired by HPE in 2019.

Second, the Collaboration of Oak Ridge, Argonne, and Livermore, or CORAL, was formed by DOE in 2012 to streamline the supercomputer procurement process. Each lab was acquiring pre-exascale systems at the time, and it made sense for them to work together. CORAL-2 continues this successful collaboration with the goal of procuring exascale systems.

Third, the Exascale Computing Project launched in 2016 as part of DOE’s Exascale Computing Initiative, assembling over 1,000 researchers from 15 labs, 70 universities and 32 vendors to tackle exascale application development as well as the supporting software libraries and technologies. That software will be a critical factor in determining the early success of the new exascale systems.

“We are very fortunate to have access to a one-of-a-kind, world-class supercomputer, and we don’t want to let it sit idle for even a single minute. We are delivering a wide range of applications and software technologies that can use this precious resource to solve problems of national interest,” said Lori Diachin, ECP project director and deputy associate director for science and technology in LLNL’s Computation Directorate. 

“Everything has to be production-hardened quality, performant and portable on Day One. So, to get the largest return on taxpayer investment, and the fastest route to new scientific discovery, we have to have all these apps ready.”

With these programs in place, requirements for actual exascale systems needed to be solidified for DOE to issue Requests for Information and Requests for Proposals, necessary steps in the government procurement process to spur competition between vendors.

“We need these machines to be stable so that, in concert with our apps and software stack, they can fulfill their promise of being consequential science and engineering instruments for the nation,” said Doug Kothe, former associate laboratory director of ORNL’s Computing and Computational Sciences Directorate, who was replaced as ECP director by Diachin in 2023. “So, we as a community started talking about this years ago — and talking means writing down answers to key requirements questions like: Why? What? How? How well? What do we need to focus on in terms of R&D and software?” 

Many of those questions revolved around what basic architecture the system would ultimately use — which was partly determined by a choice made by the OLCF years earlier to use a piece of consumer PC hardware in supercomputers.

Specifying the future of HPC

Before the OLCF’s Titan debuted in 2012, graphics processing units, or GPUs, were best known for powering high-end gaming PCs. At the time, most supercomputers relied only on central processing units, or CPUs, to crunch their algorithms. But Titan introduced a revolutionary hybrid architecture that paired 16-core AMD Opteron CPUs with NVIDIA Kepler GPUs: the GPUs tackled the computationally intensive math while the CPUs directed the flow of tasks, significantly speeding up calculations.
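
The division of labor Titan pioneered can be sketched roughly as follows. This is a hypothetical CUDA example, not Titan’s actual software: the CPU stages data and queues work on a stream, then is free to keep directing other tasks while the GPU grinds through the numerically heavy kernel.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The numerically heavy work runs on the GPU: one thread per row here.
__global__ void scale_rows(float* m, int rows, int cols, float factor) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows) {
        for (int c = 0; c < cols; ++c) {
            m[r * cols + c] *= factor;
        }
    }
}

int main() {
    const int rows = 4096, cols = 4096;
    size_t bytes = size_t(rows) * cols * sizeof(float);

    float* host;
    cudaMallocHost(&host, bytes);  // pinned memory so async copies can overlap
    for (size_t i = 0; i < size_t(rows) * cols; ++i) host[i] = 1.0f;

    float* dev;
    cudaMalloc(&dev, bytes);

    // The CPU's job is orchestration: copy data, queue the kernel,
    // then move on to other work while the GPU does the math.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream);
    scale_rows<<<(rows + 255) / 256, 256, 0, stream>>>(dev, rows, cols, 2.0f);
    cudaMemcpyAsync(host, dev, bytes, cudaMemcpyDeviceToHost, stream);

    // ... the CPU could prepare the next batch of work here ...

    cudaStreamSynchronize(stream);      // wait for the GPU to finish
    printf("host[0] = %f\n", host[0]);  // expect 2.0

    cudaStreamDestroy(stream);
    cudaFree(dev);
    cudaFreeHost(host);
    return 0;
}
```

Real applications on these systems typically reach the GPU through libraries, directives or higher-level frameworks rather than hand-written kernels, but the pattern is the same: the CPU orchestrates, the GPU computes.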

“Titan’s GPU-based system was a unique supercomputer design at the time,” Kothe said. “I don’t think the whole community bought into it or thought it would become the base technology for exascale. Oak Ridge believed that accelerated-node computing would be the norm for the foreseeable future, and so far that has come to fruition.”

Titan led to many more GPU-based supercomputers, including Summit in 2017 — Titan’s successor at the OLCF — and LLNL’s Sierra in 2018. They both employ NVIDIA V100 GPUs, and that choice also proved important in configuring the upcoming exascale system’s capabilities.

Al Geist, chief technology officer for the ECP and the OLCF, wrote many of the documents for CORAL and CORAL-2. He sees Summit’s architecture as another fortunate turning point in supercomputer design.

“As supercomputers got larger and larger, we expected them to be more specialized and limited to a small number of applications that could exploit their particular capabilities. But what happened was, when Summit was announced, NVIDIA jumped up and said, ‘Oh, by the way, those Volta GPUs have something called tensor cores in them that allow you to do AI calculations and all sorts of additional things,’” Geist said. “They could do the traditional HPC modeling and simulation, but Summit is also very effective at doing high-performance data analytics and artificial intelligence.”

The ability of GPUs to greatly accelerate performance as well as handle mixed-precision math for data science — all while using less power than CPUs — made them the best choice for exascale architecture, especially with more powerful and efficient next-generation chip designs being produced by vendor partners like AMD, Intel and NVIDIA.
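
As a rough illustration of what mixed-precision math means in practice (a generic sketch, not the tensor-core API itself, and the kernel name is made up): the inputs are stored in 16-bit floating point to cut memory traffic, while the arithmetic and the running sum are carried out in 32-bit floating point to preserve accuracy. Tensor cores hard-wire a similar low-precision-input, higher-precision-accumulate pattern for small matrix multiplications.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Dot product with 16-bit (half) inputs and a 32-bit (float) accumulator:
// the inputs take half the memory and bandwidth, while accumulating in
// single precision keeps the rounding error under control.
__global__ void dot_mixed(const __half* a, const __half* b, float* out, int n) {
    float sum = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        sum += __half2float(a[i]) * __half2float(b[i]);
    }
    atomicAdd(out, sum);  // combine each thread's partial sum
}

int main() {
    const int n = 1024;
    __half *a, *b;
    float* out;
    cudaMallocManaged(&a, n * sizeof(__half));
    cudaMallocManaged(&b, n * sizeof(__half));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) {
        a[i] = __float2half(0.5f);
        b[i] = __float2half(2.0f);
    }
    *out = 0.0f;

    dot_mixed<<<1, 256>>>(a, b, out, n);
    cudaDeviceSynchronize();
    printf("dot = %f\n", *out);  // expect 1024.0

    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return 0;
}
```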

“What we found out in the end is that exascale didn’t require this exotic technology that came out of that 2008 report,” Geist said. “We didn’t need special architectures. We didn’t need new programming paradigms. Getting to exascale turned out to be very incremental steps — not a giant leap like we thought it was going to take to get to Frontier.”