Tegra X1's GPU: Maxwell for Mobile

Going into today’s announcement of the Tegra X1, while NVIDIA’s choice of CPU had been something of a wildcard, the GPU was a known variable. As announced back at GTC 2014, Erista – which we now know as Tegra X1 – would be a future Tegra product with a Maxwell GPU.

Maxwell of course already launched on the PC desktop as a discrete GPU last year in the Maxwell 1 based GM107 and Maxwell 2 based GM204. However despite this otherwise typical GPU launch sequence, Maxwell marks a significant shift in GPU development for NVIDIA that is only now coming to completion with the launch of the X1. Starting with Maxwell, NVIDIA has embarked on a “mobile first” design strategy for their GPUs; unlike Tegra K1 and its Kepler GPU, Maxwell was designed for Tegra from the start rather than being ported after the fact.

By going mobile-first NVIDIA has been able to reap a few benefits. On the Tegra side in particular, mobile-first means that NVIDIA’s latest and greatest GPUs are appearing in SoCs earlier than ever before – the gap between Maxwell 1 and Tegra X1 is only roughly a year, versus nearly two years for Kepler in Tegra K1. But it also means that NVIDIA is integrating deep power optimizations into their GPU architectures at an earlier stage, which for their desktop GPUs has resulted chart-topping power efficiency, and these benefits are meant to cascade down to Tegra as well.

Tegra X1 then is the first SoC to be developed under this new strategy, and for NVIDIA this is a very big deal. From a feature standpoint NVIDIA gets to further build on their already impressive K1 feature set with some of Maxwell’s new features, and meanwhile from a power standpoint NVIDIA wants to build the best A57 SoC on the market. With everyone else implementing (roughly) the same CPU, the GPU stands to be a differentiator and this is where NVIDIA believes their GPU expertise translates into a significant advantage.

Diving into the X1’s GPU then, what we have is a Tegra-focused version of Maxwell 2. Compared to Kepler before it, Maxwell 2 introduced a slew of new features into the NVIDIA GPU architecture, including 3rd generation delta color compression, streamlined SMMs with greater efficiency per CUDA core, and graphics features such as conservative rasterization, volumetric tiled resources, and multi-frame anti-aliasing. All of these features are making their way into Tegra X1, and for brevity’s sake rather than rehashing all of this we’ll defer to our deep dive on the Maxwell 2 architecture from the launch of the GeForce GTX 980.

For X1 in particular, while every element helps, NVIDIA’s memory bandwidth and overall efficiency increases are going to be among the most important of these improvements since they address two of the biggest performance bottlenecks facing SoC-class GPUs. In the case of memory bandwidth optimizations, memory bandwidth has long been a bottleneck at higher performance levels and resolutions, and while it’s a solvable problem, the general solution is to build a wider (96-bit or 128-bit) memory bus, which is very effective but also drives up the cost and complexity of the SoC and the supporting hardware. In this case NVIDIA is sticking to a 64-bit memory bus, so memory compression is very important for NVIDIA to help drive X1. This coupled with a generous increase in memory bandwidth from the move to LPDDR4 helps to ensure that X1’s more powerful GPU won’t immediately get starved at the memory stage.

Meanwhile just about everything about SoC TDP that can be said has been said. TDP is a limiting factor in all modern mobile devices, which means deceased power consumption directly translates into increased performance, especially under sustained loads. Coupled with TSMC’s 20nm SoC process, Maxwell’s power optimizations will further improve NVIDIA’s SoC GPU performance.

Double Speed FP16

Last but certainly not least however, X1 will also be launching with a new mobile-centric GPU feature not found on desktop Maxwell.  For X1 NVIDIA is implanting what they call “double speed FP16” support in their CUDA cores, which is to say that they are implementing support for higher performance FP16 operations in limited circumstances.

As with Kepler and Fermi before it, Maxwell only features dedicated FP32 and FP64 CUDA cores, and this is still the same for X1. However in recognition of how important FP16 performance is, NVIDIA is changing how they are handling FP16 operations for X1. On K1 FP16 operations were simply promoted to FP32 operations and run on the FP32 CUDA cores; but for X1, FP16 operations can in certain cases be packed together as a single Vec2 and issued over a single FP32 CUDA core.

There are several special cases here, but in a nutshell NVIDIA can pack together FP16 operations as long as they’re the same operation, e.g. both FP16s are undergoing addition, multiplication, etc. Fused multiply-add (FMA/MADD) is also a supported operation here, which is important for how frequently it is used and is necessary to extract the maximum throughput out of the CUDA cores.

In this respect NVIDIA is playing a bit of catch up to the competition, and overall it’s hard to escape the fact that this solution is a bit hack-ish, but credit where credit is due to NVIDIA for at least recognizing and responding to what their competition has been doing. Both ARM and Imagination have FP16 capabilities on their current generation parts (be it dedicated FP16 units or better ALU decomposition), and even AMD is going this route for GCN 1.2. So even if it only works for a few types of operations, this should help ensure NVIDIA doesn’t run past the competition on FP32 only to fall behind on FP16.

So why are FP16 operations so important? The short answer is for a few reasons. FP16 operations are heavily used in Android’s display compositor due to the simplistic (low-precision) nature of the work and the power savings, and FP16 operations are also used in mobile games at certain points. More critical to NVIDIA’s goals however, FP16 can also be leveraged for computer vision applications such as image recognition, which NVIDIA needs for their DRIVE PX platform (more on that later). In both of these cases FP16 does present its own limitations – 16-bits just isn’t very many bits to hold a floating point number – but there are enough cases where it’s still precise enough that it’s worth the time and effort to build in the ability to process it quickly.

Tegra X1 GPU By The Numbers

Now that we’ve covered the X1’s GPU from a feature perspective, let’s take a look the GPU from a functional unit/specification perspective.

Overall the X1’s GPU is composed of 2 Maxwell SMMs inside a single GPC, for a total of 256 CUDA cores. This compares very favorably to the single SMX in K1, as it means certain per-SMM/SMX resources such as the geometry and texture units have been doubled. Furthermore Maxwell’s more efficient CUDA cores means that X1 is capable of further extending its lead over Kepler, as we’ve already seen in the desktop space.

NVIDIA Tegra GPU Specification Comparison
  K1 X1
CUDA Cores 192 256
Texture Units 8 16
ROPs 4 16
GPU Clock ~950MHz ~1000MHz
Memory Clock 930MHz (LPDDR3) 1600MHz (LPDDR4)
Memory Bus Width 64-bit 64-bit
FP16 Peak 365 GFLOPS 1024 GFLOPS
FP32 Peak 365 GFLOPS 512 GFLOPS
Architecture Kepler Maxwell
Manufacturing Process TSMC 28nm TSMC 20nm SoC

Meanwhile outside of the CUDA cores NVIDIA has also made an interesting move in X1’s ROP configuration. At 16 ROPs the X1 has four times the ROPs of K1, and is consequently comparatively ROP heavy. This is as many ROPs as is on a GM107 GPU, for example. With that said, due to NVIDIA’s overall performance goals and their desire to drive 4K displays at 60Hz, there is a definite need to go ROP-heavy to make sure they can push the necessary amount of pixels. This also goes hand-in-hand with NVIDIA’s memory bandwidth improvements (efficiency and actual) which will make it much easier to feed those ROPs. This also puts the ROP:memory controller ratio at 16:1, the same ratio as on NVIDIA’s desktop Maxwell parts.

Finally, let’s talk about clockspeeds and expected performance. While NVIDIA is not officially publishing the GPU clockspeeds for the X1, based on their performance figures it’s easy to figure out. With NVIDIA’s quoted (and promoted) 1 TFLOPs FP16 performance figure for the X1, the clockspeed works out to a full 1GHz for the GPU (1GHz * 2 FP 16 * 2 FMA * 256 = 1 TFLOPs).

This is basically a desktop-class clockspeed, and it goes without saying that is a very aggressive GPU clockspeed for an SoC-class part. We’re going to have to see what design wins X1 lands and what the devices are like, but right now it’s reasonable to expect that mobile devices will only burst here for short periods of time at best. However NVIDIA’s fixed platform DRIVE devices are another story; those can conceivably be powered and cooled well enough that the X1’s GPU can hit and sustain these clockspeeds.

Introduction, CPU, and Uncore GPU Performance Benchmarks
Comments Locked

194 Comments

View All Comments

  • chizow - Monday, January 5, 2015 - link

    Careful, you do mean A8X right? Because Denver K1 is an actual product that absolutely stomps A8, only after Apple somewhat unexpectedly "EnBiggened" their A8 by increasing transistors and functional units 50%, did they manage to match K1's GPU and edge the CPU in multi-core (by adding a 3rd core).

    To say Denver K1 didn't deliver is a bit of a joke, since it is miles ahead of anything on the Android SoC front, and only marginally bested in CPU due to Apple's brute-force approach with A8X while leveraging 20nm early. We see that once the playing field has been leveled 20nm, its no contest in favor of Tegra X1.
  • Jumangi - Monday, January 5, 2015 - link

    I mean a product that is widely available to CONSUMERS dude. And please stop with the "stomping" stuff. It means nothing about its performance with its also vastly higher power consumption. The A8 can exist in a smartphone. What smartphones have the K1? Oh that's right none because you would get a hour of use before your battery was dead. Mobile is about performance and speed. You can diss Apple all you want but from an SoC perspective they do it better than anyone else right now.
  • pSupaNova - Tuesday, January 6, 2015 - link

    @Jumangi try to comprehend what he is saying.
    Apple used a superior process on its A8X and more transistors to just edge the K1 in some CPU benchmarks. While core for core Nvida's is actually more powerful.
    The GPU in the K1 also has near desktop parity etc OpenGL 4.4. Features like Hardware Tessellation are absent from the A8X.
  • Alexey291 - Tuesday, January 13, 2015 - link

    That's great. It really is but lets be honest. A8x is faster than K1.

    And end of the day that is sadly all that matters to the vaaaaaast majority of consumers.

    Frankly even that barely matters. What does though is that games run better on my tablet than they do on yours so to speak. (Actually likely they run better on yours since I'm still using a nexus 10 xD)

    But sure the new paper launch from nv late this year our early next year will be great and the 2.5 devices that x1 will appear in will be amazing. Making sales in hundreds of thousands.
  • SM123456 - Sunday, February 1, 2015 - link

    The point is that the Tegra K1 Denver on 28nm beats the Apple A8 fairly comprehensively on 20nm with the same number of cores. Apple stuck on 50% more cores and 50% more transistors to allow the A8X on 20nm to have a slight edge over the Tegra K1 Denver. This means if Tegra K1 is put on 20nm, it will beat the 3 core Apple A8X with two cores, and the same thing will happen when both move to 16nm.
  • utferris - Monday, April 13, 2015 - link

    Oh. Really? Denver K1 is not even as fast as A8X. Do not mention that it uses more than 2 times energy. I really do not understand people like you going around and saying how good nvidia shit is.
  • eanazag - Wednesday, January 7, 2015 - link

    It'll likely be in the next Shield.
  • name99 - Monday, January 5, 2015 - link

    (1) I wouldn't rave too enthusiastically about Denver. You'll notice nV didn't...
    Regardless of WHY Denver isn't on this core, the fact that it isn't is not a good sign. Spin it however you like, but it shows SOMETHING problematic. Maybe Denver is too complicated to shift processes easily? Maybe it burns too much power? Maybe it just doesn't perform as well as ARM in the real world (as opposed to carefully chosen benchmarks)?

    (2) No-one gives a damn about "how many GPU cores" a SoC contains, given that "GPU core" is a basically meaningless concept that every vendor defines differently. The numbers that actually matter are things like performance and performance/watt.

    (3) You do realize you're comparing a core that isn't yet shipping with one that's been shipping for three months? By the time X1 actually does ship, that gap will be anything from six to nine months. Hell, Apple probably have the A9/A9X in production TODAY at the same level of qualification as X1 --- they need a LONG manufacturing lead time to build up the volumes for those massive iPhone launches. You could argue that this doesn't matter since the chip won't be released until September except that it is quite likely that the iPad Pro will be launched towards the end of Q1, and quite likely that it will be launched with an A9X, even before any Tegra X1 product ships.
  • chizow - Tuesday, January 6, 2015 - link

    @Name99

    1) Huh? Denver is still one of Nvidia's crowning achievements and the results speak for themselves, fastest single-core ARM performance on the planet, even faster than Apple's lauded Cyclone. Why it isn't in this chip has already been covered, its a time to market issue. Same reason Nvidia released a 32-bit ARM early and 64-bit Denver version of Tegra K1 late, time to market. Maybe, in the tight 6 month window they would have needed between bringing Denver and working on Erista, they simply didn't have enough time for another custom SoC? I'm not even an Apple fan and I was impressed with Cyclone when it was first launched. But suddenly, fastest single-core and a dual-core outperforming 4 and even 8-core SoC CPUs is no longer an impressive feat! That's interesting!

    2) Actually, anyone who is truly interested does care, because on paper, a 6-core Rogue XT was supposed to match the Tegra K1 in theoretical FLOPs performance. And everyone just assumed that's what the A8X was when Apple released the updated SoC that matched TK1 GPU performance. The fact it took Apple a custom 8-core variant is actually interesting, because it shows Rogue is not as efficient as claimed, or conversely, Tegra K1 was more efficient (not as likely since real world synthetics match their claimed FLOPs counts). So if 6 core was supposed to match Tegra K1 but it took 8 cores, Rogue XT is 33% less efficient than claimed.

    3) And you do realize, only a simpleton would expect Nvidia to release a processor at the same performance level while claiming a nearly 2x increase in perf/w right? There's live demos and benchmarks of their new X1 SoC for anyone at CES to test, but I am sure the same naysayers will claim the same as they did for the Tegra K1 a year ago, saying it would never fit into a tablet, it would never be as fast as claimed yada yada yada.

    Again, the A9/A9X may be ready later this year, but the X1 is just leveling the playing field at 20nm, and against the 20nm A8/X we see it is no contest. What trick is Apple going to pull out of its hat for A9/A9X since they can't play the 20nm card again? 16nm FinFET? Possible, but that doesn't change the fact Apple has to stay a half step ahead just to remain even with Nvidia in terms of performance.
  • lucam - Wednesday, January 7, 2015 - link

    1) He was saying: why NV didn't continue with Denver design? Being so efficient and only 2 cores why don't shift at 20nn easily? Because they can't and that's it. The other things are speculations.

    2) You still compare apple (not Apple) with pears. Any vendors put inside his proprietary technology with their market strategy, important is to figure how GFLOPS and Texel is capable at same frequency and watt. You don't even know how Img cluster is built and nobody does and you still compare with NV cuda cores. Rogue XT frequency is set at 200mhz, Tegra K1 at 950mhz. Again what the heck you re talking about.

    3) it is still a prototype type with a fan and nobody could check all the real frequency even though 1ghz seem reasonable. Hod dare you can compare a tablet with a reference board?

    Again A9/A9X already exist now as prototypes, Apple doesn't sell chips and doesn't to any those sort of market. They need to see their product in a cycle year life. You live in another planet to not understand that.

Log in

Don't have an account? Sign up now