Intel Launches Cooper Lake: 3rd Generation Xeon Scalable for 4P/8P Servers
by Dr. Ian Cutress on June 18, 2020 9:00 AM EST

We’ve known about Intel’s Cooper Lake platform for a number of quarters. Initially planned, as far as we understand, as a custom silicon variant of Cascade Lake for Intel’s high-profile customers, it was subsequently productized and aimed at filling a gap in Intel’s roadmap caused by the delayed development of 10nm for Xeon. Originally set to be a full-range update to the product stack, in the last quarter Intel declared that its Cooper Lake platform would end up solely in the hands of its priority customers, and only as a quad-socket or higher platform. Today Intel launches Cooper Lake, and confirms that Ice Lake is set to come out later this year, aimed at the 1P/2P markets.
Count Your Coopers: BFloat16 Support
Cooper Lake Xeon Scalable is officially designated as Intel’s 3rd Generation of Xeon Scalable for high socket count servers. Ice Lake Xeon Scalable, when it launches later this year, will also be called 3rd Generation of Xeon Scalable, except it will target the lower socket count (1P/2P) market.
For Cooper Lake, Intel has made three key additions to the platform. First is the addition of AVX512-based BF16 instructions, allowing users to take advantage of the BF16 number format. A number of key AI workloads, typically done in FP32 or FP16, can now be performed in BF16, achieving almost the same throughput as FP16 while retaining almost the same dynamic range as FP32. Facebook made a big deal about BF16 in its presentation last year at Hot Chips, where it forms a critical part of its Zion platform. At the time the presentation was made, there was no CPU on the market that supported BF16, which led to an amusing exchange at the conference.
BF16 (bfloat16) is a way of encoding a number in binary that keeps the dynamic range of a 32-bit float but in a 16-bit format, such that double the compute can be packed into the same number of bits. The simple table looks a bit like this:
Data Type Representations
Type     | Bits | Exponent | Fraction | Precision | Range | Speed
float32  | 32   | 8        | 23       | High      | High  | Slow
float16  | 16   | 5        | 10       | Low       | Low   | 2x Fast
bfloat16 | 16   | 8        | 7        | Lower     | High  | 2x Fast
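To make the layout concrete, here is a minimal Python sketch (our illustration, not from the article or Intel’s documentation) showing that a bfloat16 value is essentially the top 16 bits of the equivalent float32 pattern. Real hardware and libraries typically round rather than truncate; the rounding step is omitted here for clarity.

```python
# Sketch: float32 <-> bfloat16 conversion by bit truncation (no rounding).
# bfloat16 keeps float32's sign bit and 8-bit exponent but only 7 fraction bits,
# so converting is simply "keep the upper 16 bits" of the float32 pattern.
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Return the 16-bit bfloat16 pattern for a value, by truncation."""
    fp32_bits = struct.unpack("<I", struct.pack("<f", x))[0]  # raw float32 bits
    return fp32_bits >> 16                                    # drop the low 16 bits

def bfloat16_bits_to_float32(b: int) -> float:
    """Expand a bfloat16 bit pattern back into a float32 value."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

if __name__ == "__main__":
    for value in (3.140625, 1e30, 6.1e-5):
        bf = float32_to_bfloat16_bits(value)
        print(f"{value:>12} -> bf16 0x{bf:04x} -> {bfloat16_bits_to_float32(bf)}")
```

The large exponent field is why the range matches FP32 even though the precision drops to roughly two or three decimal digits.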
By using BF16 numbers rather than FP32 numbers, memory bandwidth requirements as well as system-to-system network requirements can be halved. At the scale of a Facebook, an Amazon, or a Tencent, those savings matter. At the time of the presentation at Hot Chips last year, Facebook confirmed that it already had silicon working on its datasets.
Doubling Socket-to-Socket Interconnect Bandwidth
The second upgrade that Intel has made to Cooper Lake over Cascade Lake is in socket-to-socket interconnect. Traditionally Intel’s Xeon processors have relied on a form of QPI/UPI (Ultra Path Interconnect) to connect multiple CPUs together to act as one system. In Cascade Lake Xeon Scalable, the top-end processors each had three UPI links running at 10.4 GT/s. For Cooper Lake there are six UPI links, also running at 10.4 GT/s; however, these links are still driven by only three controllers, so each CPU can still connect directly to only three other CPUs, but the bandwidth of each connection can be doubled.
This means that in Cooper Lake, each CPU-to-CPU connection uses two UPI links, each running at 10.4 GT/s, doubling the bandwidth of the connection. Because the number of links is doubled, rather than the standard itself being evolved, there are no power efficiency improvements beyond anything Intel has done to the manufacturing process. Double the bandwidth between sockets is still a good thing, even if latency and power per bit are unchanged.
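As a rough back-of-the-envelope check (our own figures, not Intel’s spec sheets): a full-width UPI link at 10.4 GT/s is commonly quoted at around 20.8 GB/s per direction, i.e. roughly 2 bytes per transfer, so ganging two links per socket connection simply doubles that. A small sketch under those assumptions:

```python
# Approximate socket-to-socket bandwidth; BYTES_PER_TRANSFER is an assumed
# effective figure (~2 bytes/transfer per direction for a full-width UPI link),
# not an official Intel specification.
TRANSFER_RATE_GT_S = 10.4   # giga-transfers per second per UPI link
BYTES_PER_TRANSFER = 2.0    # assumed payload per transfer, per direction

def connection_bandwidth_gb_s(links_per_connection: int) -> float:
    """One-direction bandwidth for a CPU-to-CPU connection using N ganged links."""
    return links_per_connection * TRANSFER_RATE_GT_S * BYTES_PER_TRANSFER

print(f"Cascade Lake (1 link/connection): {connection_bandwidth_gb_s(1):.1f} GB/s per direction")
print(f"Cooper Lake  (2 links/connection): {connection_bandwidth_gb_s(2):.1f} GB/s per direction")
```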
Intel still uses the double pinwheel topology for its eight socket designs, ensuring at most two hops to any other processor in the set. Eight sockets is the limit for a glueless network – we have already seen companies like Microsoft build servers with 32 sockets using additional glue logic.
Memory and 2nd Gen Optane
The third upgrade for Cooper Lake is the memory support. Intel is now supporting DDR4-3200 with the Cooper Lake Xeon Platinum parts, however only in a 1 DIMM per channel (1 DPC) configuration. 2 DPC is supported, but only at DDR4-2933. Support for DDR4-3200 technically gives the system a boost from 23.46 GB/s per channel to 25.60 GB/s, an increase of 9.1%.
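Those per-channel numbers follow directly from the transfer rate multiplied by the 8-byte (64-bit) width of a DDR4 channel; a quick sketch of the arithmetic:

```python
# Peak DDR4 bandwidth per channel = transfer rate (MT/s) x 8 bytes per transfer (64-bit bus).
def ddr4_channel_bandwidth_gb_s(mt_per_s: int) -> float:
    return mt_per_s * 8 / 1000  # MB/s -> GB/s

bw_2933 = ddr4_channel_bandwidth_gb_s(2933)  # ~23.46 GB/s
bw_3200 = ddr4_channel_bandwidth_gb_s(3200)  # 25.60 GB/s
print(f"DDR4-2933: {bw_2933:.2f} GB/s, DDR4-3200: {bw_3200:.2f} GB/s, "
      f"uplift: {100 * (bw_3200 / bw_2933 - 1):.1f}%")
```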
The base models of Cooper Lake will also be updated to support 1.125 TiB of memory, up from 1 TB. This allows for a 12 DIMM per socket configuration where six modules are 64 GB and six are 128 GB: with six memory channels and two DIMMs per channel, each channel holds one module of each size for an even 192 GB per channel. One of the complaints about Cascade Lake Xeons was that in 1 TB mode, a fully populated system could not have an even capacity per memory channel, so Intel has rectified this situation. In this scenario, the six 128 GB modules could also be Optane. Why Intel didn’t go for the full 12 * 128 GB scenario, we’ll never know.
The higher memory capacity processors will support 4.5 TB of memory, and be listed as ‘HL’ processors.
Cooper Lake will also support Intel’s second generation 200-series Optane DC Persistent Memory, codenamed Barlow Pass. 200-series Optane DCPMM will still be available in 128 GB, 256 GB, and 512 GB modules, same as the first generation, and will also run at the same DDR4-2666 memory speed. Intel claims that this new generation of Optane offers 25% higher memory bandwidth than the previous generation, which we assume comes down to a new generation of Optane controller on the memory and software optimization at the system level.
Intel states that the 25% performance increase is when they compare 1st gen Optane DCPMM to 2nd gen Optane DCPMM at 15 W, both operating at DDR4-2666. Note that the first-gen could operate in different power modes, from 12 W up to 18 W. We asked Intel if the second generation was the same, and they stated that 15 W is the maximum power mode offered in the new generation.
99 Comments
Deicidium369 - Saturday, June 20, 2020 - link
No one, even someone like me only buying 60, are paying any where near MSRP. And for big customers like FB which would likely install hundreds if not thousands of these systems - the MSRP is irrelevant. ~$11K list - my Q60 order was less than $9K per. The only bad press is from the fanboys.. some of them are editors... So yes, was delayed - yet record revenue - so yeah not a bad deal. Companies like FB don't care what the manufacturing process is - they ask "can it do what I need it to do right now?" And apparently 14nm PCIe3 Cooper Lake does.
schujj07 - Saturday, June 20, 2020 - link
If you are buying 60 hosts @ $9k/host I see a lot of waste. At that cost you aren't getting much in a Xeon. You could save huge amounts of money by reducing your number of hosts and sockets.

Deicidium369 - Thursday, June 25, 2020 - link
60 CPUs purchased

16 - Engineering workstations - dual socket - single CPU installed
2 - my engineering workstation - dual socket with dual CPU installed
16 - 4 dual node servers - 2 nodes x 2 sockets - primary datacenter (Colorado)
16 - 4 dual node servers - 2 nodes x 2 sockets - secondary datacenter (Dallas)
6 - 3 single node dual socket - flash arrays - 1 at primary, 1 at secondary, 3rd for engineering
4 - dual node server, 2 nodes x 2 sockets - systems used by IT for testing new software
2 are basically spares, today. The 8 CPUs for the SAP server were originally intended for a different purpose - for a possible replacement for my large SGI TP16000 array - which never materialized (the array is the only remaining IB system in the mix - a Mellanox SwitchX-2 SX6710G made the conversion between the 8 40Gb/s IB to 8 40Gb/s Ethernet).
When we moved to SAP, we had no baseline whatsoever - so it went on its own physical server in our datacenter in Colorado - with a mirror at Level 3 in Dallas. After a year, we decided to virtualize - and after the move to virtual, added the 4th server (nodes 7&8) to the pool - changes made in Colorado are made in Dallas as well.
When we replace the servers in the next ~12 months, the plan is to go back to 6 nodes - whether that is another 2U 2 node configuration, or as individual servers remains to be seen - will most likely be Ice Lake SP to be able to leverage PCIe4 to use dual 100Gb/s Ethernet for the planned network upgrade.
So initially the SAP system was on a dual node, 4 socket total physical server.
The CPUs were $9K per - not the hosts - hosts are servers. I can see why you say hosts - you missed the context.
"The MSRP is irrelevant. ~$11K list - my Q60 order was less than $9K per"
The $11K was in response to flgt post "Unless you're a FB or Intel employee, no one has any idea what the real price they pay for these processors." which was a response to Duncan Macdonald's post about "A 16 core 4 socket 4.5TB Xeon (the 6328HL) has a list price of $4779"
So talking about MSRP/List prices - Duncan made a claim about MSRP prices, flgt responded that very large customers pay less per unit - and I responded with my own experience with purchasing a very small number of CPUs compared to FB prices "even someone like me only buying 60, are paying any where near MSRP."
My post needed to be edited to be "even someone like me only buying 60, are *NOT* paying any where near MSRP"
so $9K per CPU - not $9K per host. You missed the context.
You need to try to be more civil - the constant effort to refute everything I say is fine - but you also need to understand the context, rather than immediately sniping. I have no problem debating the merits of whatever - but the mindless / reactionary responses from you and people like Korguz need to stop. He never offers anything to the conversation and just lies in wait - with probable screen captures to try and make his point - which is "I suck".
Sorry that I didn't choose your preference for my systems - sorry that you and others feel attacked whenever someone states the facts about AMD. I prefer Intel (along with 95% of the server market). When you are putting together a PO to buy the servers and switches, etc for your business - you can choose what you wish, and what your budget will allow.
I have 10 people in my IT department, with 200+ years of experience between us. Decisions on hardware are not made on the fly. Other than now having a 7th and 8th node that is not needed, I have been pretty happy with the decisions we have made. Business continuity and performance were our primary goals, and both were met. The opinions held by posters on a tech forum do not come into play.
schujj07 - Saturday, June 20, 2020 - link
How many of the $9k hosts are your SAP HANA hosts?

Deicidium369 - Saturday, June 20, 2020 - link
Thing is, 4 socket motherboards for Intel exist - they don't for Epyc. If you are buying a 4 socket Intel system - you would buy either an Inspur or a Supermicro - which would be the motherboard, case, those "special power supplies" etc...
Those "special power supplies" are redundant and hot swap - something companies like Sun, SGI, Cisco and every single OEM have had for ages - they are considered STANDARD - not special.
brucethemoose - Thursday, June 18, 2020 - link
So I guess the use case is training on enormous datasets that don't fit into the VRAM of a GPU/AI accelerator?

xenol - Thursday, June 18, 2020 - link
It looks like that, plus using Optane offers very fast persistent storage which, depending on bandwidth needs, can replace DRAM. Either way, having a large amount of very fast storage vs. a split between DRAM and secondary storage seems to have a benefit if you believe Intel's marketing materials.

Deicidium369 - Thursday, June 18, 2020 - link
Ever been involved with a very large SAP install? Once the system is in production and needs to be restarted, the time it takes to bring it down and back up can run to hours and hours - all the while it is not able to be used. Systems like SAP run entirely out of memory - and so during a reboot, a ton of data needs to be loaded from storage to memory. With NVDIMMs a lot of that data can be available with only a cursory check, rather than having to be loaded from relatively slow storage - allowing the system to come up much quicker. Even saving a couple hours on a reboot means the business can be back up and running, saving hours of lost productivity. In most large companies - nothing happens without SAP.

Intel's marketing materials are based on having 95% market share in the datacenter and a long relationship with businesses and their needs. So it's not like they are trying to cram on more cores to convince businesses that is what they need - and making few sales.
Duncan Macdonald - Thursday, June 18, 2020 - link
A PCIe 4.0 NVMe drive can easily transfer over 250 GB/min, so each terabyte of persistent (Optane or equivalent) memory gives a startup advantage of about 4 minutes - hardly a massive advantage.

Deicidium369 - Thursday, June 18, 2020 - link
Funny admins of extremely large SAP and other ERP installs say otherwise.