As someone who analyzes GPUs for a living, one of the more vexing things in my life has been NVIDIA’s Maxwell architecture. The company’s 28nm refresh offered a huge performance-per-watt increase for only a modest die size increase, essentially allowing NVIDIA to offer a full generation’s performance improvement without a corresponding manufacturing improvement. We’ve had architectural updates on the same node before, but never anything quite like Maxwell.

The vexing aspect to me has been that while NVIDIA shared some details about how they improved Maxwell’s efficiency over Kepler, they have never disclosed all of the major improvements under the hood. We know, for example, that Maxwell implemented a significantly altered SM structure that was easier to reach peak utilization on, and thanks to its partitioning wasted much less power on interconnects. We also know that NVIDIA significantly increased the L2 cache size and did a number of low-level (transistor level) optimizations to the design. But NVIDIA has also held back information – the technical advantages that are their secret sauce – so I’ve never had a complete picture of how Maxwell compares to Kepler.

For a while now, a number of people have suspected that one of the ingredients of that secret sauce was that NVIDIA had applied some mobile power efficiency technologies to Maxwell. It was, after all, their original mobile-first GPU architecture, and now we have some data to back that up. Friend of AnandTech and all around tech guru David Kanter of Real World Tech has gone digging through Maxwell/Pascal, and in an article & video published this morning, he outlines how he has uncovered very convincing evidence that NVIDIA implemented a tile based rendering system with Maxwell.

In short, by playing around with some DirectX code specifically designed to look at triangle rasterization, he has come up with some solid evidence that NVIDIA’s handling of tringles has significantly changed since Kepler, and that their current method of triangle handling is consistent with a tile based renderer.


NVIDIA Maxwell Architecture Rasterization Tiling Pattern (Image Courtesy: Real World Tech)

Tile based rendering is something we’ve seen for some time in the mobile space, with both Imagination PowerVR and ARM Mali implementing it. The significance of tiling is that by splitting a scene up into tiles, tiles can be rasterized piece by piece by the GPU almost entirely on die, as opposed to the more memory (and power) intensive process of rasterizing the entire frame at once via immediate mode rendering. The trade-off with tiling, and why it’s a bit surprising to see it here, is that the PC legacy is immediate mode rendering, and this is still how most applications expect PC GPUs to work. So to implement tile based rasterization on Maxwell means that NVIDIA has found a practical means to overcome the drawbacks of the method and the potential compatibility issues.

In any case, Real Word Tech’s article goes into greater detail about what’s going on, so I won’t spoil it further. But with this information in hand, we now have a more complete picture of how Maxwell (and Pascal) work, and consequently how NVIDIA was able to improve over Kepler by so much. Finally, at this point in time Real World Tech believes that NVIDIA is the only PC GPU manufacturer to use tile based rasterization, which also helps to explain some of NVIDIA’s current advantages over Intel’s and AMD’s GPU architectures, and gives us an idea of what we may see them do in the future.

Source: Real World Tech

Comments Locked

191 Comments

View All Comments

  • Wolfpup - Tuesday, August 2, 2016 - link

    I would have loved seeing what this looks like on Kepler, just as a verification that this was a change to Maxwell...a couple generations of Intel hardware too.
  • Scali - Wednesday, August 3, 2016 - link

    Yup, Maxwell v1 (GTX750), Kepler, perhaps even Fermi... See how far we have to go back to see different behaviour.
  • LordConrad - Tuesday, August 2, 2016 - link

    Might this also explain why DirectX 12 and Vulkan work better on AMD cards?
  • Wolfpup - Tuesday, August 2, 2016 - link

    Eh?
  • Scali - Tuesday, August 2, 2016 - link

    Has nothing to do with it, see my post above.
  • Scali - Tuesday, August 2, 2016 - link

    Sorry, forgot to paste link to my post: http://www.anandtech.com/comments/10536/nvidia-max...
  • tuxRoller - Wednesday, August 3, 2016 - link

    No, David Kanter didn't provide any such evidence, his beliefs not withstanding.
    A number of folks chimed in to say that it looks like "clever thread scheduling", but that's about.

    http://www.realworldtech.com/forum/?threadid=15987...

    http://www.realworldtech.com/forum/?threadid=15987...

    http://www.realworldtech.com/forum/?threadid=15987...
  • Scali - Wednesday, August 3, 2016 - link

    "A number of folks chimed in to say that it looks like "clever thread scheduling", but that's about."

    Since apparently the threads are scheduled in a tile-arrangement, what exactly is the difference between saying it's 'clever thread scheduling' and 'tile-based' in this case?
    This patent seems to describe what is going on here: https://www.google.com/patents/US20140118366
    It basically describes a system where tiles are used as cache, and a set of primitives is processed per-tile (which is different from a pure immediate renderer, which may divide a primitive over quads or even tiles, but does not process multiple primitives at the same time).
  • wumpus - Wednesday, August 3, 2016 - link

    Sounds like people are too hung up on "immediate vs. deferred". If it is seen by the programmer as immediate, it has to allow the programmer to read rasterized pixels that the GPU has been instructed to execute regardless of location and write (presumably texture or filter) the results in a different location. This means that the GPU isn't completely picky about the tiles and is effectively using them for caching.

    But the tiles are plainly obvious in David Kanter's demo. And it seems that some commenters simply insist that "tile" has a well defined technical usage that goes beyond "a small portion of a frame buffer [typically square] that is rasterized at the same time". There are plenty of such comments, but I can't see them pointing to such a specific definition anywhere.
  • Scali - Wednesday, August 3, 2016 - link

    Thing is, to the programmer, even a deferred renderer looks exactly the same. You pass it some triangles to render, and whether it renders them immediately or to a tile first, and then flushes them to VRAM, a draw call is an atomic operation from the programmer's perspective, so you can't tell the difference.

Log in

Don't have an account? Sign up now