An update to Intel’s programming reference for software developers indicates that the company will soon begin introducing various AVX-512 instruction set extensions in its consumer CPUs, starting with the processors codenamed Cannon Lake (CNL) and Ice Lake (ICL), both made using 10 nm process technologies. The new extensions will allow future chips to improve performance in certain applications. One of the main open questions is which consumer programs will actually support AVX-512 when the CNL and ICL processors hit the market. In addition to AVX-512, the upcoming processors will introduce a host of other, non-AVX-512 instructions.

AVX-512 Coming to Consumer CPUs

According to the Intel Architecture Instruction Set Extensions and Future Features Programming Reference, Intel’s Cannon Lake CPUs will support AVX512F, AVX512CD, AVX512DQ, AVX512BW, and AVX512VL. This will bring the feature set of these CPUs up to the current level of the Skylake-SP based processors. In addition, the Cannon Lake microarchitecture will support the AVX512_IFMA and AVX512_VBMI instructions. At this point it is not confirmed whether that support will be limited to servers or will also extend to consumer processors; the document’s wording suggests the latter.

Intel originally promised to release Cannon Lake processors in the 2016-2017 timeframe, but delayed the introduction of its 10 nm process technology to 2018, postponing the CPU launch as well. Initially it was expected that the Cannon Lake CPUs would generally resemble the Kaby Lake and Coffee Lake chips with some refinements, but the addition of AVX-512 support means a rather tangible architectural improvement. AVX-512 works on large chunks of data and therefore requires massive memory bandwidth, which the Skylake-SP cores get from large caches and additional memory controllers. Keeping memory bandwidth and power consumption in mind, AVX-512 might not be supported by all Cannon Lake client CPUs, but only by those aimed at higher-performance machines (i.e., no AVX-512 for ULP mobile parts or entry-level desktop SKUs, although this is speculation at this point). Meanwhile, the good news is that by the time AVX-512-supporting Cannon Lake processors arrive, programs for client PCs that take advantage of the latest extensions will likely be available.

The evolution of AVX-512 on general-purpose CPUs is not going to stop there. Intel’s Ice Lake processors will support the AVX512_VPOPCNTDQ instructions (which will also be supported by the Xeon Phi ‘Knights Mill’) as well as the AVX512_VNNI, AVX512_VBMI2, AVX512+VPCLMULQDQ and AVX512_BITALG instructions. The ICL chips will also feature AVX-512 versions of the AES and GFNI instructions used for encryption and error correction: AVX512+VAES and AVX512+GFNI.
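
Intel’s document lists these extensions without much context, so a small illustration may help. As far as Intel’s instruction documentation describes it, the central AVX512_VNNI instruction, VPDPBUSD, multiplies four unsigned bytes by four signed bytes, sums the products and adds them to a 32-bit accumulator in each lane. The sketch below is a scalar Python model of that behavior, not real SIMD code; the function names are made up for this example.

```python
def to_signed8(byte):
    """Interpret a value in 0..255 as a signed 8-bit integer."""
    return byte - 256 if byte >= 128 else byte


def wrap_int32(value):
    """Wrap an integer to signed 32 bits (the non-saturating behavior)."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value


def vpdpbusd_lane(acc, unsigned_bytes, signed_bytes):
    """Model one 32-bit lane: acc + sum of four (u8 * s8) products."""
    assert len(unsigned_bytes) == 4 and len(signed_bytes) == 4
    total = acc
    for u, s in zip(unsigned_bytes, signed_bytes):
        total += (u & 0xFF) * to_signed8(s & 0xFF)
    return wrap_int32(total)


def vpdpbusd_512(acc, a_bytes, b_bytes):
    """Model a full 512-bit register: 16 lanes of 4 bytes each."""
    return [
        vpdpbusd_lane(acc[i], a_bytes[4 * i:4 * i + 4], b_bytes[4 * i:4 * i + 4])
        for i in range(16)
    ]


# Example: unsigned inputs, signed weights (0xFF is -1 as a signed byte).
print(vpdpbusd_lane(10, [1, 2, 3, 4], [1, 1, 0xFF, 2]))
```

A single 512-bit instruction therefore performs 64 byte multiplications and 16 accumulations, which is the inner loop of 8-bit neural network inference.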

Meanwhile, Knights Mill will exclusively support AVX512_4FMAPS and AVX512_4VNNIW (at least for a while: an Intel patch submitted to the Linux kernel states that the upcoming Xeon Phi and Xeon CPUs will support both sets, but descriptions of Linux patches are not always accurate, and plans tend to change).

AVX-512 Support Propagation by Various Intel CPUs
Core/Xeon CPU | Xeon, Core X | General | Xeon Phi | Xeon Phi CPU
Skylake-SP | AVX512BW, AVX512DQ, AVX512VL | AVX512F, AVX512CD | AVX512ER, AVX512PF | Knights Landing
Cannon Lake | AVX512_VBMI, AVX512_IFMA | - | AVX512_4FMAPS, AVX512_4VNNIW | Knights Mill
Ice Lake | AVX512_VNNI, AVX512_VBMI2, AVX512_BITALG, AVX512+VAES, AVX512+GFNI, AVX512+VPCLMULQDQ | AVX512_VPOPCNTDQ | - | -
Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (pages 12 and 13)
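
Software discovers these extensions one feature bit at a time through the CPUID instruction. The sketch below decodes the AVX-512 related bits of CPUID leaf 7, sub-leaf 0, using the bit positions published in Intel’s documentation. It only works on register values passed in as integers; how those values are obtained is left out, and the helper names and the sample register values are hypothetical.

```python
# Bit positions in CPUID leaf 7, sub-leaf 0 (EAX=7, ECX=0).
# Each entry maps an extension name to (register, bit).
AVX512_BITS = {
    "AVX512F": ("ebx", 16),
    "AVX512DQ": ("ebx", 17),
    "AVX512_IFMA": ("ebx", 21),
    "AVX512PF": ("ebx", 26),
    "AVX512ER": ("ebx", 27),
    "AVX512CD": ("ebx", 28),
    "AVX512BW": ("ebx", 30),
    "AVX512VL": ("ebx", 31),
    "AVX512_VBMI": ("ecx", 1),
    "AVX512_VBMI2": ("ecx", 6),
    "GFNI": ("ecx", 8),
    "VAES": ("ecx", 9),
    "VPCLMULQDQ": ("ecx", 10),
    "AVX512_VNNI": ("ecx", 11),
    "AVX512_BITALG": ("ecx", 12),
    "AVX512_VPOPCNTDQ": ("ecx", 14),
    "AVX512_4VNNIW": ("edx", 2),
    "AVX512_4FMAPS": ("edx", 3),
}


def decode_avx512(ebx, ecx, edx):
    """Return the sorted names of the extensions whose feature bit is set."""
    regs = {"ebx": ebx, "ecx": ecx, "edx": edx}
    return sorted(
        name for name, (reg, bit) in AVX512_BITS.items() if (regs[reg] >> bit) & 1
    )


def leaf7_value(names):
    """Build hypothetical (ebx, ecx, edx) values with the given bits set."""
    regs = {"ebx": 0, "ecx": 0, "edx": 0}
    for name in names:
        reg, bit = AVX512_BITS[name]
        regs[reg] |= 1 << bit
    return regs["ebx"], regs["ecx"], regs["edx"]


# Hypothetical register values carrying the five Skylake-SP level sets.
skylake_sp = leaf7_value(["AVX512F", "AVX512CD", "AVX512DQ", "AVX512BW", "AVX512VL"])
print(decode_avx512(*skylake_sp))
```

A set feature bit is not the whole story: before using 512-bit registers, software is also expected to confirm that the operating system saves the extra register state (reported through XCR0).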

According to Intel’s document, the Cannon Lake and Ice Lake processors will have up-to-date AVX-512 support. It is unknown whether the CNL and ICL cores will be used inside future server processors (remember that Intel has the server-specific 'Cascade Lake' product incoming), but if they are, then it looks like Intel’s cores for server and client computers will have the same feature set going forward, at least when it comes to AVX-512 support.

Adding AVX-512 to consumer processors looks like an important development, even though the instruction set was primarily designed for processing the large amounts of data common in server and, to a degree, workstation workloads (such as encoding, rendering, cryptography and deep learning). Apparently, Intel believes that 512-bit INT/FP calculations will be important for mainstream PCs as well. A big question is how exactly Intel plans to implement AVX-512 in the various Cannon Lake and Ice Lake processors. Keep in mind that Intel’s six- and eight-core Skylake-X CPUs officially support one fused 512-bit FMA unit for AVX512F, whereas the chips with 10 or more cores officially support two 512-bit FMA units and can offer up to twice the throughput. So in that respect, there is potential for further differentiation between products.
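
The "up to two times" figure follows from simple arithmetic: a 512-bit FMA unit handles 16 single-precision lanes per cycle, and each fused multiply-add counts as two floating-point operations. A rough sketch of that calculation is below; it gives theoretical peaks only, and the core count and clock speed in the last line are made-up inputs rather than real product specifications.

```python
def peak_flops_per_cycle(fma_units, vector_bits=512, element_bits=32):
    """Peak floating-point operations per core per clock.

    Each FMA unit processes vector_bits / element_bits lanes per cycle,
    and a fused multiply-add counts as two operations per lane.
    """
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2


def peak_gflops(cores, ghz, fma_units, element_bits=32):
    """Theoretical peak GFLOPS for a chip at a sustained AVX-512 clock."""
    return cores * ghz * peak_flops_per_cycle(fma_units, element_bits=element_bits)


one_unit = peak_flops_per_cycle(fma_units=1)   # single precision, one FMA unit
two_units = peak_flops_per_cycle(fma_units=2)  # single precision, two FMA units
print(one_unit, two_units, two_units / one_unit)

# Hypothetical 10-core chip holding 3.0 GHz under AVX-512 load.
print(peak_gflops(cores=10, ghz=3.0, fma_units=2))
```

Real code rarely reaches these peaks, since memory bandwidth and lower AVX-512 clocks both eat into the theoretical advantage.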

In the meantime, Intel’s Cannon Lake and Ice Lake CPUs will add a number of other new instructions for security and performance, and they are certainly worth looking at.

New Instructions to Improve Security, Performance of Upcoming CPUs

In a bid to speed up certain cryptography algorithms, Cannon Lake will feature the SHA-NI instruction set that is already supported by the Goldmont cores. SHA-NI is similar in concept to AES-NI, which was added several generations earlier. Based on Intel’s publications, SHA-NI can speed up the SHA1, SHA256 and SHA224 algorithms. In addition, the new CPUs will support the UMIP security mechanism, which prevents the execution of certain instructions if the privilege level is insufficient, stopping user-space applications from reading system-wide settings.

The Ice Lake chips will bring support for Fast Short REP MOV, a feature that speeds up moving data from one memory location to another with the REP MOV string instructions (as the name suggests, the emphasis appears to be on short copies), which will benefit optimized memory-intensive applications. Keep in mind that we are moving towards persistent memory for a number of server applications, and therefore large amounts of data located in DRAM and/or NVDIMMs will be more common in the future.

Another interesting feature supported by the Ice Lake consumer processors is the CLWB (Cache Line Write Back) instruction for NVM (non-volatile memory) programming. The feature is already supported by the Skylake-SP cores, where it helps software handle non-volatile memory attached to the processor, and it will come to consumer products with Ice Lake. CLWB writes a modified cache line back to memory but does not invalidate it, so the data remains available in the cache if it is needed after the write-back, which improves performance in certain situations. Given the Purley/Skylake-SP context, CLWB is required for upcoming NVDIMMs (based on 3D XPoint), but it is not completely clear how Intel expects to use it on consumer platforms (it makes sense for certain workstation applications, which is why CLWB is supported by SKL-SP). In any case, the addition of CLWB could add some speed in cases where very fast non-volatile storage is used and cache misses are an issue.
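
The difference between CLWB and the older CLFLUSHOPT is easiest to see in a toy model. The sketch below simulates a single cache line with three simplified states; it is not how a real cache-coherence protocol works, and the class and function names are invented for this illustration. It also assumes CLWB always keeps the line cached, whereas real hardware is allowed to evict it anyway.

```python
class ToyCacheLine:
    """A single cache line with a tiny three-state model (not real MESI)."""

    def __init__(self):
        self.state = "invalid"   # "invalid", "clean" or "modified"
        self.cached = None       # value held in the cache
        self.memory = 0          # value held in (persistent) memory
        self.misses = 0          # reads that had to go out to memory

    def write(self, value):
        self.cached, self.state = value, "modified"

    def read(self):
        if self.state == "invalid":
            self.misses += 1     # the line must be fetched again
            self.cached, self.state = self.memory, "clean"
        return self.cached

    def _write_back(self):
        if self.state == "modified":
            self.memory = self.cached

    def clflushopt(self):
        """Write back if modified, then evict the line from the cache."""
        self._write_back()
        self.cached, self.state = None, "invalid"

    def clwb(self):
        """Write back if modified, but keep the line cached as non-modified."""
        self._write_back()
        if self.state == "modified":
            self.state = "clean"


def misses_after(flush_name):
    """Write, flush with the named operation, read again; count misses."""
    line = ToyCacheLine()
    line.write(42)
    getattr(line, flush_name)()  # push the data out to memory
    assert line.memory == 42     # both operations make the data durable
    line.read()                  # touch the same data again
    return line.misses


print(misses_after("clflushopt"), misses_after("clwb"))
```

In both cases the data reaches memory; only the CLFLUSHOPT path pays for a cache miss when the same line is read again.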

There are other features coming in the Goldmont Plus (the heart of the upcoming Gemini Lake SoCs) and Ice Lake processors, namely PTWRITE and RDPID, which seem to be aimed mostly at software developers and may not benefit end users right away.

Instruction Set Extensions of Cannon Lake, Ice Lake and Goldmont+ CPUs
CPU | Instruction | Purpose | Description
Cannon Lake | SHA-NI | Security | Cryptography acceleration.
Cannon Lake | UMIP (User-Mode Instruction Prevention) | Security | Prevents execution of certain instructions if the Current Privilege Level (CPL) is greater than 0. If these instructions were executed while in CPL > 0, user space applications could have access to system-wide settings such as the global and local descriptor tables, the task register and the interrupt descriptor table.
Ice Lake | CLWB (Cache Line Write Back) | Performance | Writes back modified data of a cache line similar to CLFLUSHOPT, but avoids invalidating the line from the cache (and instead transitions the line to non-modified state). CLWB attempts to minimize the compulsory cache miss if the same data is accessed temporally after the line is flushed.
Ice Lake | Fast Short REP MOV | Performance | Enables fast moves of data from one location to another.
Ice Lake | RDPID (Read Processor ID) | General | Quickly reads the processor ID (the IA32_TSC_AUX value) so that software can tell which logical processor it is running on.
Goldmont Plus | PTWRITE (Write Data to a Processor Trace Packet) | Debugging | Unclear.
Goldmont Plus | UMIP | Security | See above
Goldmont Plus | RDPID | General | See above
Source: Intel Architecture Instruction Set Extensions and Future Features Programming Reference (pages 12 and 13)

Some History

Intel and AMD have been adding various instruction set extensions to the x86 architecture since the mid-1990s. Over the past 20 years, both companies have brought in hundreds of new instructions designed to improve performance in various applications, either through SIMD instructions that feed CPU cores large amounts of data at once or through special-purpose hardware. Intel’s latest mainstream extensions are AVX and AVX2; their main purposes were to widen the SIMD registers (for both floating-point and integer operations) to 256 bits and to introduce instructions such as FMA3, which serves the same goal by performing a relatively complex computation in one instruction. To perform 256-bit AVX2 operations, CPUs have to lower their frequency to maintain stability, as cores tend to draw a lot of power under such workloads, but even at lower clock rates AVX/AVX2 make a lot of sense and increase overall throughput.

The next step in the evolution of Intel’s instruction set extensions was AVX-512. With AVX-512, the company decided to introduce different sets of instructions for different applications and to implement them in different products. Some of the AVX-512 extensions are aimed primarily at enterprise workloads, whereas others are needed for supercomputers or high-performance computing. Implementing all of them in all products hardly makes sense for Intel and its customers, so the latest Skylake-SP Xeons (and the high-end desktop processors) support one set of AVX-512 instructions and the Xeon Phis support another. In the meantime, contemporary mainstream consumer CPUs do not support AVX-512 at all. One reason is that the physical implementation significantly increases die size (by up to 15% in the case of the Skylake core), which raises cost; another is that client applications today cannot take advantage of such instructions. This is going to change, as Intel plans to enable support for certain AVX-512 subsets in its future Cannon Lake and Ice Lake processors for mainstream consumers.

Wrapping Up

The addition of AVX-512 to future consumer CPUs is good news for those who use such processors for video encoding, rendering or other applications that are common on workstations. Meanwhile, with the Ice Lake consumer chips, Intel is adding deep learning-specific 512-bit instructions (AVX512_VNNI) as well as NVDIMM-oriented features such as CLWB, although the immediate advantages for this market segment are unclear. Intel is publishing this information to allow developers to prepare for these processors and develop software in advance. In any case, new features are generally welcome because at some point they start to bring tangible advantages.

Source: Intel (via WikiChip Twitter).

50 Comments

  • edzieba - Thursday, October 19, 2017

    Clock-for-clock, a Zen core's AVX 256 throughput is half that of a Coffee Lake (or Kaby Lake) core's.
  • ddriver - Friday, October 20, 2017

    Wrong. It is split into two uops but the two are executed simultaneously nonetheless. Well, technically not, but practically, for all intents and purposes, zen's throughput is not any lower for most avx 256 operations. There are a few very corner cases which are not supported in hardware and emulated which get a hit, but those are rarely used, hence amd's decision to not waste transistors on them for the time being.
  • Kevin G - Friday, October 20, 2017

    Only theoretical peak performance falls into that category. Real code is more than just AVX operands, which lowers the real world benefits of having such a potentially higher throughput. See Amdahl's Law.
  • bcronce - Thursday, October 19, 2017

    AVX can consume large amounts of power because of the nature of the processing. This is why Prime95 can make any AVX supporting cpu run ridiculously hot. Intel had two choices. Lower the clock or raise the voltage. They actually do both, depending on the SKU.
  • extide - Thursday, October 19, 2017

    That's because it only does 128-bits at a time and takes 2 clocks whereas Intel chips do it all at once and take 1 clock. So not only do Intel's chips have a higher clock speed but AMD takes a 50% penalty for taking 2 clocks instead of 1.
  • jospoortvliet - Friday, October 27, 2017

    Actually there are 2 128bit units so the AMD CPU's don't do them one after another.

    And both in Intel and AMD CPU's do the instructions take multiple clock cycles, that's because they are big and heavy and can't be done in a single clock cycle unless you decrease the clock speed a LOT ;-)
    Note that often different instructions take different numbers of clock cycles. It is certainly possible that for some, Intel takes, say, 6 and AMD 5 and for others, AMD takes 12 and Intel 10.
  • Santoval - Tuesday, December 12, 2017

    Zen (up to Epyc, since it is the same core) has no AVX256 blocks, so there is less need to clock down. Zen can do AVX256 by pairing two AVX128 units, but that is not the same thing. Cannon Lake's AVX512 units will be "true" AVX512 ones, not 2 x AVX256 ones, so my guess is that it will need to clock down even further.
  • bobhumplick - Wednesday, August 22, 2018

    zen runs at lower clocks to start with and zen does less work than intels avx2. certain operations cant be done in one clock on the zen. i run an 8700k at 5ghz with no avx offset at ~1.35v stable. the intel cpu gets more avx work done per clock than the zen so of course it will draw more power and throttling clocks to stay within tdp means that the intel actually has a higher throughput of avx instructions even with the lower clocks. it has nothing to do with "more efficient power management"
  • Elstar - Thursday, October 19, 2017

    I wouldn't read too much into this. Intel regularly enables or disables features in order to fulfill other goals like yield management, price point diversity, marketing, etc.

    AVX-512 is actually a great example of this behavior because Intel knows that people who need/want AVX-512 will pay dearly for it, and therefore the feature is only enabled on premium Skylake parts at the moment. Maybe Intel will still view AVX-512 as being a premium feature in the Cannonlake timeframe, or maybe not. Either way, the die logic will almost surely be there.
  • shabby - Thursday, October 19, 2017

    Will canonlake users need a new motherboard to support avx512?
