US Dept. of Energy Announces Frontier Supercomputer: Cray and AMD to Build 1.5 Exaflop Machine
by Ryan Smith on May 7, 2019 7:40 AM EST
The history of the computing industry is one of constant progress. Processors get faster, storage gets cheaper, and memory gets denser. We see the repercussions of this advancement through all aspects of society, and that extends to the top as well, where national governments continue to invest in bigger and better supercomputers. One part technological necessity and one part technological race, the exascale era of supercomputers is about to begin, as orders for the first exaFLOP-capable systems are now going out. It’s only fitting, then, that this morning the United States Department of Energy is announcing the contract for its fastest supercomputer yet, the Frontier system, which will be built by Cray and AMD.
Frontier is planned for delivery in 2021, and when it’s activated it will become the second and most powerful of the US DOE’s two planned 2021 exascale systems, with performance expected to reach 1.5 exaFLOPS. The ambitious system won’t come cheaply, however; with a price tag of over 500 million dollars for the system alone – and another 100 million dollars for R&D – Frontier is among the most expensive supercomputers ever ordered by the US Department of Energy.
The new supercomputer is being built as part of the US DOE’s CORAL-2 program for supercomputers, with Frontier scheduled to replace Oak Ridge National Laboratory’s current Summit supercomputer. Summit is the current reigning champion in the supercomputer world, with 200 petaFLOPS of performance, and accordingly the US DOE and Oak Ridge are aiming to significantly improve on its performance for the new computer. All told, Frontier should be able to deliver over 7x the performance of Summit, and is expected to be the fastest supercomputer in the world once it’s activated.
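As a quick sanity check on the "over 7x" figure, the ratio can be worked out from the two performance numbers quoted above. This is only a back-of-the-envelope sketch: the variable names are illustrative, and both inputs are headline peak figures rather than measured benchmark results.

```python
# Headline performance figures quoted in the article, in FLOPS.
PETA = 10**15
EXA = 10**18

summit_peak = 200 * PETA    # Summit: 200 petaFLOPS
frontier_peak = 1.5 * EXA   # Frontier: 1.5 exaFLOPS (expected)

# Ratio of the two headline numbers.
speedup = frontier_peak / summit_peak
print(f"Frontier / Summit = {speedup:.1f}x")
```

The result, 7.5x, is consistent with the "over 7x" claim, with the caveat that Frontier's figure is a target rather than a delivered number.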
Like Summit (and Titan before it), Frontier is an open science system, meaning that it’s available to academic researchers to run simulations and experiments on. Accordingly, the lab is expecting the supercomputer to be used for a wide range of projects across numerous disciplines, including not only traditional modeling and simulation tasks, but also more data-driven techniques for artificial intelligence and data analytics. In fact the latter is a bit of new ground for the lab and the system’s eventual users; just as we’ve seen in the enterprise space over the past few years, neural network-based AI is becoming an increasingly popular technique to solve problems and extract analysis from large datasets, and now researchers are looking at how to refine those techniques from the current-generation systems and apply them to exascale-level projects.
US Department of Energy Supercomputers
| | Frontier | Aurora | Summit |
| CPU Architecture | AMD EPYC (Future Zen) | Intel Xeon Scalable | IBM POWER9 |
| GPU Architecture | Radeon Instinct | Intel Xe | NVIDIA Volta |
| Performance (RPEAK) | 1.5 EFLOPS | 1 EFLOPS | 200 PFLOPS |
| Power Consumption | ~30MW | N/A | 13MW |
| Nodes | 100 Cabinets | N/A | 3,400 |
| Laboratory | Oak Ridge | Argonne | Oak Ridge |
| Vendor | Cray | Intel | IBM |
| Year | 2021 | 2021 | 2018 |
Frontier: Powered by Cray & AMD
Officially, the prime contractor for Frontier will be Cray. But looking at the specifications, you could be excused for thinking it was AMD. Cray for its part is partnering with the chipmaker for the system, and as a result AMD is providing most of the core hardware for the new supercomputer. Frontier is designed as a next-generation CPU + accelerator system, with a mix of CPUs and GPUs doing the heavy compute work, and AMD will be supplying both. And as the principal processor provider, AMD will also be taking on much of the responsibility for developing the software stack, working with Cray to develop an enhanced version of its ROCm environment to best extract performance from the massive cluster of CPUs and GPUs.
On the CPU side of matters, AMD will be supplying a customized next-generation EPYC CPU. AMD has confirmed that it’s going to be using a future generation of their Zen CPU cores, and given the timing of the project, we’re almost certainly looking at a Zen 3 or Zen 4 design here. Just how custom AMD’s CPU is remains to be seen, but their announcement has revealed that Frontier’s CPUs will include new instructions for the optimization of AI and supercomputing workloads.
Meanwhile on the GPU side of matters, AMD and Cray are holding their cards a little closer. Rather than naming any architecture or architectural generation, AMD is only saying that the GPUs are “based on the Radeon Instinct family” and have “yet to be announced.” AMD’s current public roadmap goes out to “Next Gen” in 2020, and with GPU development cycles averaging 2 years, this may be the architecture we see. But with the particular needs for a supercomputer, AMD may have something slightly more bespoke.
What the company is confirming for now is that they aren’t holding back on features. The HPC-focused GPU is being designed with Frontier in mind and will incorporate mixed precision compute support. Feeding the beast will be HBM memory, and AMD will be tapping a version of Infinity Fabric to connect the CPUs and GPUs.
In fact while AMD has kept the details on the technology light, it sounds like this version of IF will be the most advanced version yet. AMD is specifically noting that it’s an “incredibly” coherent fabric, calling it the first fully optimized CPU + GPU design for supercomputing. AMD’s GPUs and CPUs will be arranged in a 4-to-1 ratio, with 4 GPUs for each EPYC CPU. It’s worth noting that AMD’s slide shows a mesh with every GPU connected to the CPU and two other GPUs, but I’m not reading too much into this quite yet, as AMD hasn’t disclosed any other details on the IF setup.
With AMD going up to the blade level, tying together all of these nodes will be Cray’s job. For Frontier the supercomputer vendor is launching their new Slingshot interconnect, an equally ambitious interconnect that will support adaptive routing, congestion management, and quality-of-service features. Slingshot is capable of 200Gb/sec per port, with individual blades incorporating a port for each GPU in the blade so that other nodes can directly read and write data to a GPU’s memory. As a result Frontier will have a significant amount of interconnect bandwidth, which is all but necessary in order to allow the system to scale to exaFLOP levels.
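To put the per-port figure in perspective, here is a rough estimate of the injection bandwidth available to one CPU + GPU group. It assumes, purely for illustration, that a group is one EPYC CPU with four GPUs (the 4-to-1 ratio described earlier) and that each GPU's port runs at the full 200Gb/sec; neither the actual node layout nor achievable throughput has been disclosed, so treat the names and the result as a sketch.

```python
# Figures from the article.
PORT_GBPS = 200        # Slingshot: 200 Gb/sec per port
PORTS_PER_GPU = 1      # one port for each GPU in the blade

# Assumption (not confirmed): one group = 1 CPU + 4 GPUs.
GPUS_PER_GROUP = 4

# Aggregate raw link rate for the group, in gigabits per second.
group_gbps = PORT_GBPS * PORTS_PER_GPU * GPUS_PER_GROUP

# Convert gigabits to gigabytes (8 bits per byte), ignoring protocol overhead.
group_gbytes_per_sec = group_gbps / 8

print(f"{group_gbps} Gb/sec, or about {group_gbytes_per_sec:.0f} GB/sec per group")
```

Under those assumptions each group would have 800Gb/sec (roughly 100GB/sec) of raw link bandwidth before any protocol overhead.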
Overall, Frontier will be organized into over 100 Cray Shasta cabinets. And while Cray has not announced a specific power consumption figure for Frontier, with each cabinet rated for 300kW, this would put the complete system at over 30MW. To put that in context, this is over twice the power consumption of the 13MW Summit. So while Frontier is a significantly faster system than the supercomputer it replaces, Cray, AMD, and the US DOE are all feeling the pinch of Dennard scaling slowing down, as power efficiency gains get harder to achieve. In fact, going by a passing comment made in the press briefing, it sounds like Oak Ridge will be installing a total of 40MW of capacity for Frontier, which is a significant amount of power to say the least.
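The power arithmetic above, and what it implies for efficiency, can be sketched as follows. The inputs are the figures quoted in this article; the cabinet count is a lower bound ("over 100"), the 300kW value is a cabinet rating rather than measured draw, and the variable names are illustrative, so the outputs are rough estimates only.

```python
# Figures quoted in the article.
cabinets = 100            # "over 100" Shasta cabinets, so a lower bound
kw_per_cabinet = 300      # rated power per cabinet, in kW
summit_mw = 13            # Summit power consumption, in MW
summit_pflops = 200       # Summit performance, in PFLOPS
frontier_pflops = 1500    # Frontier target: 1.5 exaFLOPS = 1,500 PFLOPS

# Total rated power for Frontier, converted from kW to MW.
frontier_mw = cabinets * kw_per_cabinet / 1000

power_ratio = frontier_mw / summit_mw
perf_ratio = frontier_pflops / summit_pflops

# 1 PFLOPS per MW equals 1 GFLOPS per W (1e15 / 1e6 = 1e9).
summit_eff = summit_pflops / summit_mw
frontier_eff = frontier_pflops / frontier_mw
eff_gain = frontier_eff / summit_eff

print(f"Frontier power: {frontier_mw:.0f} MW ({power_ratio:.1f}x Summit)")
print(f"Performance: {perf_ratio:.1f}x Summit")
print(f"Efficiency: {summit_eff:.1f} -> {frontier_eff:.1f} GFLOPS/W ({eff_gain:.2f}x)")
```

On these numbers, 7.5x the performance comes at about 2.3x the power, which works out to an efficiency gain of roughly 3.25x; the rest of the performance increase has to be paid for with a bigger power budget.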
Along with furthering the US’s own supercomputing leadership goals, securing the Frontier contract also represents a big win for both Cray and AMD. Cray is now involved in both 2021 exascale systems, reinforcing its place in the supercomputing world. Meanwhile AMD, which has spent the current generation on the outside looking in, has now secured a major and prestigious win for both its CPU and GPU divisions.
In fact it’s interesting to note that of the two 2021 exascale systems being ordered, both are coming from full-service processor vendors that supply both CPUs and GPUs. Current-generation systems like Summit use mixed vendors – e.g. IBM + NVIDIA – so the move to integrated vendors is a big shift for these CPU + accelerator systems. Clearly there are technological and procurement benefits to using a single vendor for all of the processors, which benefits both AMD and Intel. Though it’s worth noting that the CORAL-2 program requires the DOE to buy systems based on two different architectures, so if the future is integrated systems, then AMD and Intel are the logical choices.
At any rate, with the contract placed for Frontier, the job is only half-done. AMD and Cray will need to continue developing their hardware and software for the system, not to mention locking down the final specifications for the finished supercomputer. So expect to continue to hear news about Frontier trickle out over the next couple of years, leading up to its installation in 2021.
Source: AMD
77 Comments
Yojimbo - Thursday, May 9, 2019
I meant "for the necessary power and budget constraints".
sa666666 - Tuesday, May 7, 2019
Of course, since Intel can do no wrong, and are _always_ best in _all_ situations. No exceptions. And AMD are _always_ crap. That's your outlook, right? Don't analyze anything at all. Just see Intel -> good, AMD -> bad. And never look deeper. Must be nice to live in a black and white world, devoid of all logic and requirement to actually think about anything.
Koenig168 - Tuesday, May 7, 2019
7X is quite an amazing performance jump. Summit is about 50% more powerful than Sunway TaihuLight, the previous leading supercomputer from 2016 which is in turn 50% more powerful than Tianhe-2A from 2013.
Kevin G - Tuesday, May 7, 2019
This is the one system that has the funding to really flesh out advanced packaging options IE massive interposers. Big thing would be placing numerous amounts of HBM, the massive IO controller, external fabric controller, lots of CPU dies and GPU dies into one modular package. Such a schema would likely show some platform based performance/watt gain because much of the platform need to loop chips together would simply disappear by going on-die. Things can scale up rather well but it then becomes a thermal density issue and anything resembling this would require liquid cooling.
Intel could do that same with Xeon and Xe but AMD appears to have a head start based upon what they've announced and shipped thus far.
amrnuke - Tuesday, May 7, 2019
The cooling solution is liquid-based, thousands of gallons a minute. It's incredible to think about the scale of it!
Jleppard - Tuesday, May 7, 2019
YA Intel can use the paper launched card that they have not even manufactured as of yet. Wonder if that would be on 10nm to!
HStewart - Tuesday, May 7, 2019
Intel is based on Sunny Cove or higher which is 10nm and possibly lower. But keep in mind it not the process that is most important but architexture on the chip. What I really like about Sunny Cove is they added addition store store unit and now there is store/get on two separate parts - sound like twice the speed internal for read/store operations which is significant
Thunder 57 - Tuesday, May 7, 2019
"...and possibly lower". Hahaha that's a good one. "What I really like about Sunny Cove is..." that it is Intel. AMD is really bad and touched me as a child. Intel is good like rele good. They solve all the world's problems one at a time.
Kevin G - Tuesday, May 7, 2019
Intel has stated that they have fully decoupled their core designs from their manufacturing side. The 10 nm troubles have forced their hand on this. They have stated that the Sunny Cove core design is up on 10 nm and is ready for 7 nm when it arrives. What they haven't said but I suspect is that Sunny Cove has a 14 nm version as a fall back.
Kevin G - Tuesday, May 7, 2019
More importantly is that Sky Lake-SP introduced the on-die tile based coherent fabric. Scaling that upward with interposers/EMIB should be straight forward. Things like package power distribution becomes critical.
If Xe could sit on the same tile based on-die fabric, Intel could pull off some magic here. While under performing, Intel's graphic have been rather feature complete and are coherent with the CPU side of things. That is the big feature which AMD is touting with this deal, just with performant parts.