US Dept. of Energy Announces Frontier Supercomputer: Cray and AMD to Build 1.5 Exaflop Machine
by Ryan Smith on May 7, 2019 7:40 AM EST
The history of the computing industry is one of constant progress. Processors get faster, storage gets cheaper, and memory gets denser. We see the repercussions of this advancement through all aspects of society, and that extends to the top as well, where national governments continue to invest in bigger and better supercomputers. One part technological necessity and one part technological race, the exascale era of supercomputers is about to begin, as orders for the first exaFLOP-capable systems are now going out. It’s only fitting then that this morning the United States Department of Energy is announcing the contract for their fastest supercomputer yet, the Frontier system, which will be built by Cray and AMD.
Frontier is planned for delivery in 2021, and when it’s activated it will become the second and most powerful of the US DOE’s two planned 2021 exascale systems, with performance expected to reach 1.5 exaFLOPS. The ambitious system won’t come cheaply, however; with a price tag of over 500 million dollars for the system alone – and another 100 million dollars for R&D – Frontier is among the most expensive supercomputers ever ordered by the US Department of Energy.
The new supercomputer is being built as part of the US DOE’s CORAL-2 program for supercomputers, with Frontier scheduled to replace Oak Ridge National Laboratory’s current Summit supercomputer. Summit is the current reigning champion in the supercomputer world, with 200 petaFLOPS of performance, and accordingly the US DOE and Oak Ridge are aiming to significantly improve on its performance for the new computer. All told, Frontier should be able to deliver over 7x the performance of Summit, and is expected to be the fastest supercomputer in the world once it’s activated.
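As a quick check on the "over 7x" figure, the ratio follows directly from the two peak numbers quoted above. The snippet below is a minimal Python sketch of that arithmetic; the constant and function names are illustrative and do not come from the DOE announcement.

```python
# Peak figures quoted in the article (names here are illustrative).
FRONTIER_PEAK_EFLOPS = 1.5   # expected peak for Frontier
SUMMIT_PEAK_PFLOPS = 200     # Summit's quoted performance

PFLOPS_PER_EFLOPS = 1000     # 1 exaFLOPS = 1,000 petaFLOPS


def speedup_over_summit(frontier_eflops: float, summit_pflops: float) -> float:
    """Return how many times larger Frontier's peak is than Summit's."""
    return frontier_eflops * PFLOPS_PER_EFLOPS / summit_pflops


ratio = speedup_over_summit(FRONTIER_PEAK_EFLOPS, SUMMIT_PEAK_PFLOPS)
print(f"Frontier / Summit peak ratio: {ratio:.1f}x")
```

These are peak figures; how much of that gap shows up in real applications will depend on the workload.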
Like Summit (and Titan before it), Frontier is an open science system, meaning that it’s available to academic researchers to run simulations and experiments on. Accordingly, the lab is expecting the supercomputer to be used for a wide range of projects across numerous disciplines, including not only traditional modeling and simulation tasks, but also more data-driven techniques for artificial intelligence and data analytics. In fact the latter is a bit of new ground for the lab and the system’s eventual users; just as we’ve seen in the enterprise space over the past few years, neural network-based AI is becoming an increasingly popular technique to solve problems and extract analysis from large datasets, and now researchers are looking at how to refine those techniques from the current-generation systems and apply them to exascale-level projects.
US Department of Energy Supercomputers

| |Frontier|Aurora|Summit|
|---|---|---|---|
|CPU Architecture|AMD EPYC|Intel Xeon Scalable|IBM POWER9|
|GPU Architecture|Radeon Instinct|Intel Xe|NVIDIA Volta|
|Performance (RPEAK)|1.5 EFLOPS|1 EFLOPS|200 PFLOPS|
|Laboratory|Oak Ridge|Argonne|Oak Ridge|
Frontier: Powered by Cray & AMD
Officially, the prime contractor for Frontier will be Cray. But looking at the specifications, you could be excused for thinking it was AMD. Cray for its part is partnering with the chipmaker for the system, and as a result AMD is providing most of the core hardware for the new supercomputer. Designed as a next-generation CPU + accelerator system, with a mix of CPUs and GPUs doing the heavy compute work, AMD will be supplying both the CPUs and GPUs for Frontier. And as the principal processor provider, AMD will also be taking on a lot of the responsibility for developing the software stack, with the company working with Cray to develop an enhanced version of their ROCm environment to best extract performance from the massive cluster of CPUs and GPUs.
On the CPU side of matters, AMD will be supplying a customized next-generation EPYC CPU. AMD has confirmed that it’s going to be using a future generation of their Zen CPU cores, and given the timing of the project, we’re almost certainly looking at a Zen 3 or Zen 4 design here. Just how custom AMD’s CPU is remains to be seen, but their announcement has revealed that Frontier’s CPUs will include new instructions for the optimization of AI and supercomputing workloads.
Meanwhile on the GPU side of matters, AMD and Cray are holding their cards a little closer. Rather than naming any architecture or architectural generation, AMD is only saying that the GPUs are “based on the Radeon Instinct family” and have “yet to be announced.” AMD’s current public roadmap goes out to “Next Gen” in 2020, and with GPU development cycles averaging 2 years, this may be the architecture we see. But with the particular needs for a supercomputer, AMD may have something slightly more bespoke.
What the company is confirming for now is that they aren’t holding back on features. The HPC-focused GPU is being designed with Frontier in mind and will incorporate mixed precision compute support. Feeding the beast will be HBM memory, and AMD will be tapping a version of Infinity Fabric to connect the CPUs and GPUs.
In fact while AMD has kept the details on the technology light, it sounds like this version of IF will be the most advanced version yet. AMD is specifically noting that it’s an “incredibly” coherent fabric, calling it the first fully optimized CPU + GPU design for supercomputing. AMD’s GPUs and CPUs will be arranged in a 4-to-1 ratio, with 4 GPUs for each EPYC CPU. It’s worth noting that AMD’s slide shows a mesh with every GPU connected to the CPU and two other GPUs, but I’m not reading too much into this quite yet, as AMD hasn’t disclosed any other details on the IF setup.
With AMD going up to the blade level, tying together all of these nodes will be Cray’s job. For Frontier the supercomputer vendor is launching their new Slingshot interconnect, an equally ambitious interconnect that will support adaptive routing, congestion management, and quality-of-service features. Slingshot is capable of 200Gb/sec per port, with individual blades incorporating a port for each GPU in the blade so that other nodes can directly read and write data to a GPU’s memory. As a result Frontier will have a significant amount of interconnect bandwidth, which is all but necessary in order to allow the system to scale to exaFLOP levels.
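To give a sense of scale, the per-port figure converts to bytes and adds up across a blade as sketched below. The number of GPUs per blade is not stated in the announcement; the value of 4 used here is purely an assumption for illustration (borrowed from the 4-GPUs-per-CPU ratio), and the function names are hypothetical.

```python
SLINGSHOT_GBIT_PER_PORT = 200  # quoted in the article: 200 Gb/sec per port

# ASSUMPTION: GPUs (and therefore ports) per blade. Not disclosed by Cray/AMD.
ASSUMED_GPUS_PER_BLADE = 4


def gbit_to_gbyte(gbit_per_sec: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_sec / 8


def blade_gbit_per_sec(gpus_per_blade: int,
                       gbit_per_port: float = SLINGSHOT_GBIT_PER_PORT) -> float:
    """Aggregate injection bandwidth for a blade with one port per GPU."""
    return gpus_per_blade * gbit_per_port


per_port_gbyte = gbit_to_gbyte(SLINGSHOT_GBIT_PER_PORT)
per_blade_gbit = blade_gbit_per_sec(ASSUMED_GPUS_PER_BLADE)
per_blade_gbyte = gbit_to_gbyte(per_blade_gbit)

print(f"Per port:  {per_port_gbyte:.0f} GB/sec")
print(f"Per blade: {per_blade_gbit:.0f} Gb/sec = {per_blade_gbyte:.0f} GB/sec "
      f"(assuming {ASSUMED_GPUS_PER_BLADE} GPUs)")
```

Changing `ASSUMED_GPUS_PER_BLADE` rescales the blade total linearly; only the 200 Gb/sec per-port figure comes from the source.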
Overall, Frontier will be organized into over 100 Cray Shasta cabinets. And while Cray has not announced a specific power consumption figure for Frontier, with each cabinet rated for 300 kW, this would put the complete system at over 30 MW. To put that in context, this is over twice the power consumption of the 13 MW Summit. So while Frontier is a significantly faster system than the supercomputer it replaces, Cray, AMD, and the US DOE are all feeling the pinch of Dennard scaling slowing down, as power efficiency gains get harder to achieve. In a passing comment made in the press briefing, it sounded like Oak Ridge will be installing a total of 40 MW of capacity for Frontier, which is a significant amount of power to say the least.
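The power estimate above is simple multiplication, and it can be combined with the peak performance figures to get a rough performance-per-megawatt comparison. The Python sketch below does this under two stated assumptions: exactly 100 cabinets (the article only says "over 100") and every cabinet drawing its full 300 kW rating. All names are illustrative.

```python
# Figures quoted in the article.
KW_PER_CABINET = 300          # rated power per Shasta cabinet
SUMMIT_MW = 13                # Summit's power consumption
PLANNED_CAPACITY_MW = 40      # capacity Oak Ridge reportedly plans to install
FRONTIER_PEAK_PFLOPS = 1500   # 1.5 exaFLOPS
SUMMIT_PEAK_PFLOPS = 200

# ASSUMPTION: the article says "over 100" cabinets; 100 is used as a floor.
ASSUMED_CABINETS = 100


def system_power_mw(cabinets: int, kw_per_cabinet: float) -> float:
    """Total power in megawatts if every cabinet draws its rated power."""
    return cabinets * kw_per_cabinet / 1000


frontier_mw = system_power_mw(ASSUMED_CABINETS, KW_PER_CABINET)
power_ratio = frontier_mw / SUMMIT_MW

# Peak petaFLOPS delivered per megawatt for each system.
frontier_pf_per_mw = FRONTIER_PEAK_PFLOPS / frontier_mw
summit_pf_per_mw = SUMMIT_PEAK_PFLOPS / SUMMIT_MW
efficiency_gain = frontier_pf_per_mw / summit_pf_per_mw

print(f"Estimated Frontier power: {frontier_mw:.0f} MW "
      f"({power_ratio:.1f}x Summit, under the {PLANNED_CAPACITY_MW} MW planned)")
print(f"Peak PFLOPS per MW: Frontier {frontier_pf_per_mw:.1f}, "
      f"Summit {summit_pf_per_mw:.1f} ({efficiency_gain:.2f}x)")
```

Because the cabinet count is a floor and the per-cabinet figure is a rating rather than a measured draw, the result is an estimate, not a specification. On these numbers, peak performance per megawatt would rise by roughly 3.25x, yet the system would still need more than twice Summit's power to reach 7.5x its peak.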
Along with furthering the US’s own supercomputing leadership goals, securing the Frontier contract also represents big wins for Cray and AMD. Cray is now involved in both 2021 exascale systems, reinforcing their own place in the supercomputing world. Meanwhile for AMD, who is spending this current generation from the outside looking in, they have now secured a major and prestigious win for both their CPU and GPU divisions.
In fact it’s interesting to note that of the two 2021 exascale systems being ordered, both are coming from full-service processor vendors that supply both CPUs and GPUs. Current-generation systems like Summit use mixed vendors – e.g. IBM + NVIDIA – so the move to integrated vendors is a big shift for these CPU + accelerator systems. Clearly there are technological and procurement benefits to using a single vendor for all of the processors, which benefits both AMD and Intel. Though it’s worth noting that the CORAL-2 program requires the DOE to buy systems based on two different architectures, so if the future is integrated systems, then AMD and Intel are the logical choices.
At any rate, with the contract placed for Frontier, the job is only half-done. AMD and Cray will need to continue developing their hardware and software for the system, not to mention locking down the final specifications for the finished supercomputer. So expect to continue to hear news about Frontier trickle out over the next couple of years, leading up to its installation in 2021.
HStewart - Tuesday, May 7, 2019
200 is over 100 cabinets - it also depends on what is in the cabinets - it's still too early to know how valid the #'s are.
Jleppard - Tuesday, May 7, 2019
Not really, because everyone knows AMD can actually provide twice the cores per socket that Intel can. So the Intel system using twice the cabinet space adds up.
HStewart - Tuesday, May 7, 2019
But power per core on AMD is significantly less than on Intel, and 32 vs 28 is not twice - that is using released technology; this is future tech and it's anybody's guess how many cores will actually be there.
Irata - Tuesday, May 7, 2019
Rome is about to be released and it is 64C/128T max.
HStewart - Tuesday, May 7, 2019
Yes, and Intel has Sunny Cove too; you can't compare future AMD vs existing Intel.
fallaha56 - Tuesday, May 7, 2019
Er no you big shill lol
AMD is shipping 64 cores now vs some fantasy future chip from Intel
-or Intel’s ‘watercooling as standard’ 48 core 400W chip...watt a disaster
Irata - Tuesday, May 7, 2019
Rome is future as in Q3 this year. However, Frontier may even use the next-gen EPYC - same for Intel and Aurora, seeing when they will be deployed.
Korguz - Tuesday, May 7, 2019
HStewart.. yes you can, and we are.. but YOU don't want to because your beloved Intel is not getting the design win.. AMD is.. therefore.. this is a LOSS for Intel.. and a WIN for AMD.. and you can't handle it being an Intel fanboy... you HATE it when something like this happens.. and do EVERYTHING you can to try to put a positive spin on this toward Intel... guess what HStewart.. Intel has played the me too game too...
jospoortvliet - Tuesday, May 7, 2019
> Frontier (...) will become the second and most powerful of the US DOE’s two planned 2021 exascale systems
In other words, they selected Intel for the slow one and AMD for the faster one they are building. The reason is obvious and also in the text: AMD can deliver a faster system with 100 cabinets than Intel with 200 and I bet power and cost of those 200 cabinets adds up to more than for those 100 with AMD. I bet that they would have gone for two AMD systems if they were not forced to pick different vendors.
All of this is completely in line with what we see today from Intel vs AMD in the high-end server market and in line with roadmaps: AMD delivers higher density and better performance per watt than Intel on many large-scale workloads, with a better roadmap. So it should be no surprise to anyone in the industry.
I'm sure Intel will get their act together and catch up but not before 2021.
Yojimbo - Thursday, May 9, 2019
It doesn't work that way. Aurora was not originally supposed to be an exascale system. It was renegotiated and pushed back. Intel was not eligible to win Frontier because they had Aurora. I believe that was part of the RFP (request for proposals). At least they were not eligible to win it with the same type of system they won Aurora with (Xe GPUs). That basically means it would have had to be a CPU-only system, which couldn't very well get to exascale in that time period within the necessary price and budget constraints.
AMD put in a bid and won. The DOE isn't playing politics with the way they hand out the winners to the bids, they are looking at the submitted proposals and selecting the systems they think give them the best value and follow their principles of procurement. I am guessing that AMD is building this system on a razor thin margin compared to Intel, however. I only say that because of the margins they get on their commercial CPUs and GPUs. NVIDIA will be between GPU generations at the time the DOE wants this system to be delivered. Their post-Volta data center chip will go into Perlmutter in 2020 and their generation after that probably wouldn't be available in time for the delivery schedule of Frontier. El Capitan will be delivered a year later and that will probably include NVIDIA's post-post-Volta data center GPU, assuming they manage to win the contract.