Diving Deeper: The Maxwell 2 Memory Crossbar & ROP Partitions

Now that we have a basic understanding of the proper specifications of the GTX 970, we can dig deeper into its architecture and layout. As we mentioned previously, the GTX 970 ships with an unusual configuration, one we’ve never quite seen before. In it, thanks to a new feature of the Maxwell architecture, NVIDIA has been able to disable part of a ROP/Memory Controller partition, something they could not do on Kepler or earlier generation designs. As a result, the intended behavior of the GTX 970 is different from that of the GTX 980 or any theoretical Maxwell part with a completely disabled ROP/MC partition.

The biggest impact of this configuration is that it creates the segmented memory conditions NVIDIA outlined in their earlier statement. Due to performance issues from the unbalanced ROP/MC partition, NVIDIA segments the memory into a high-performance 3.5GB segment – what they refer to as segment 0 – and a low-performance 512MB (0.5GB) segment containing the rest of the RAM. It is this segmentation that results in some of the previously-unusual memory allocation behaviors and memory bandwidth benchmarks on the GTX 970.

But in order to understand just why the segments exist, how they are allocated, and how they influence performance, we first need to learn about the Maxwell memory crossbar.

GTX 970 Memory Crossbar

In the above diagram, NVIDIA illustrates at a high level how the Maxwell memory crossbar is laid out, and specifically how it’s configured for the GTX 970. In it we have the GTX 970’s 13 enabled GM204 SMMs at the top connected to the crossbar, while below we have the 4 ROP/MC partitions, with each partition divided up into individual ROP/L2 units and memory controllers. Note that the specific units disabled on any GTX 970 chip will vary – it’s not necessarily SMMs 1/3/5 or the last ROP/MC partition – but from a high-level overview this is consistent for the GTX 970 regardless of which specific physical units have been disabled.

In any case, while the crossbar shows how the SMMs and ROP/MC partitions are connected via the crossbar, memory segmentation is really a story about the ROP/MC partitions. The SMMs do not play a role in how GTX 970’s memory is configured – the crossbar abstracts all of that – and the crossbar itself matters only in as much as the ROP/MC partitions are connected to it.

When it comes to the ROP/MC partitions then, we can see first-hand in the GTX 970 what’s new in the architecture. In Kepler and previous generation designs, NVIDIA’s granularity for disabling ROP/MC partitions was all or nothing – either the entire partition was turned on or the entire partition was turned off. However starting with Maxwell, NVIDIA has gained the ability to disable the individual ROP/L2 units within a partition, allowing one of the two ROP/L2 units to be disabled. This essentially introduces a 3rd possible option: a partially-enabled ROP/MC partition, backed by two memory controllers and a single ROP/L2 unit. Meanwhile the memory controllers are still tied together, and disabling one of them requires disabling the entire partition.

NVIDIA ROP/Memory Controller Configuration Options
  Kepler                             Maxwell
  ROP/MC Partition Fully Enabled     ROP/MC Partition Fully Enabled
  N/A                                ROP/MC Partition Partially Disabled
  ROP/MC Partition Fully Disabled    ROP/MC Partition Fully Disabled

It’s this option that the GTX 970 leverages, and in doing so becomes the first Maxwell part to use this feature. NVIDIA has disabled 1 ROP/L2 unit, removing 8 “ROPs” (or rather 1 unit capable of 8 pixels/clock) and 256KB of L2 cache from the GTX 970. This is why the GTX 970 is left with 56 ROPs spread over the 4 partitions, along with 1.75MB of L2 cache.
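As a quick sanity check, those totals are simply a fully enabled GM204 (GTX 980: 64 ROPs, 2MB of L2) minus the one disabled ROP/L2 unit. A minimal sketch, using only figures from the article:

```python
# Deriving the GTX 970's ROP and L2 totals from a fully enabled GM204
# (GTX 980: 64 ROPs, 2MB L2) minus one disabled ROP/L2 unit.
FULL_ROPS = 64
FULL_L2_KB = 2048           # 2MB of L2 on a fully enabled GM204
UNIT_ROPS = 8               # one ROP/L2 unit handles 8 pixels/clock
UNIT_L2_KB = 256            # ...and contains 256KB of L2

print(FULL_ROPS - UNIT_ROPS)                # 56 ROPs
print((FULL_L2_KB - UNIT_L2_KB) / 1024)     # 1.75 (MB of L2)
```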

Meanwhile there’s one other new feature here that’s activated only on the partially disabled partition, and that’s the link between the first and second units of the ROP partition. Typically each ROP/L2 unit would have a link to a port on the crossbar and a link to its own dedicated 32-bit memory controller channel; however because the GTX 970 disables a ROP/L2 unit, the “buddy” link comes into play. This link is essentially the lynchpin of Maxwell’s new partial disable functionality, and allows the second half of the memory controller to stay active. This link only needs to be active when a ROP/L2 unit is disabled, and NVIDIA has confirmed that it is a full bandwidth link identical to the normal ROP/L2 to MC link, meaning it’s capable of four 32-byte requests per clock (2 reads and 2 writes). Ultimately this link is what makes a partially disabled partition possible, and is also what makes it possible to have the full 256-bit memory bus present and active in spite of the lack of a ROP/L2 unit and its associated crossbar port.

Finally, because of this configuration and the lack of a ROP/L2 unit, we get to the memory segments. Although the full 256-bit memory bus is present and active on the GTX 970, and capable of providing 224GB/sec of combined memory bandwidth between the DRAM modules and the memory controllers, it’s misleading to say the card has that much bandwidth between the memory controllers and everything else, at least in the same sense that the GTX 980 does.

Like virtually every other operation on a GPU, memory accesses are done in parallel to improve throughput. On a fully enabled card such as the GTX 980 these operations are striped across all of the memory controllers in what NVIDIA calls a 1KB stride. In doing this, each 32-bit memory channel needs a direct link to the crossbar through its partner ROP/L2 unit. However in the case of the GTX 970 a wrench is thrown into the works, as there are 7 crossbar ports and 8 memory channels.
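To make the striping concrete, here is an illustrative model of how consecutive addresses map onto channels in 1KB strides. The stride size and channel counts come from the article; the address mapping itself is a simplification for illustration, not NVIDIA's actual hardware logic:

```python
# Illustrative model of striping memory addresses across channels in 1KB
# strides. A simplification, not NVIDIA's actual address mapping.
STRIDE = 1024  # bytes per stripe, per the article

def channel_for_address(addr: int, num_channels: int) -> int:
    """Which memory channel services a given byte address."""
    return (addr // STRIDE) % num_channels

# GTX 980: 8 channels participate, so 8KB of consecutive addresses
# touch every channel exactly once.
print([channel_for_address(a, 8) for a in range(0, 8 * STRIDE, STRIDE)])
# -> [0, 1, 2, 3, 4, 5, 6, 7]

# GTX 970's 3.5GB segment: only 7 channels are striped, so the same
# 8KB of addresses wraps around after 7 channels.
print([channel_for_address(a, 7) for a in range(0, 8 * STRIDE, STRIDE)])
# -> [0, 1, 2, 3, 4, 5, 6, 0]
```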

Ultimately, due to the design of the crossbars and the memory controllers, it is not possible for 1 crossbar port to carry the full load of 2 memory channels in all circumstances. The crossbar port and its attached ROP/L2 unit can access both memory channels at once, splitting up the 4 operations among them, but there is only 1 read return bus and 1 write data bus, and hence in practice it cannot issue identical operations to both memory channels at once. As a result NVIDIA has segmented the GTX 970’s memory into the now-familiar 3.5GB and 512MB segments. In the case of the 3.5GB segment, this behaves otherwise identically to a fully enabled card such as the GTX 980, with the 1KB stride being striped over 7 crossbar ports, and hence 7 DRAM modules. Meanwhile the 8th and final DRAM module sits in its own 512MB segment, and must be addressed by the crossbar on its own.

This in turn is why the 224GB/sec memory bandwidth number for the GTX 970 is technically correct and yet still not entirely useful as we move past the memory controllers, as it is not possible to actually get that much bandwidth at once when doing a pure read or a pure write. In the case of pure reads for example, GTX 970 can read the 3.5GB segment at 196GB/sec (7GHz * 7 ports * 32-bits), or it can read the 512MB segment at 28GB/sec, but it cannot read from both at once; it is a true XOR situation. The same is also true for writes, as only one segment can be written to at a time.
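The arithmetic behind those figures can be checked directly. A quick sketch, assuming GDDR5 at a 7Gbps effective data rate per pin (the article's "7GHz") over 32-bit channels:

```python
# Checking the article's bandwidth figures: GDDR5 at a 7Gbps effective
# data rate per pin, over 32-bit memory channels.
DATA_RATE_GBPS = 7        # Gb/s per pin (the "7GHz" effective rate)
CHANNEL_BITS = 32         # width of one memory channel

per_channel = DATA_RATE_GBPS * CHANNEL_BITS / 8   # GB/s per channel
print(per_channel)        # 28.0  -> peak of the 512MB segment (1 channel)
print(per_channel * 7)    # 196.0 -> peak of the 3.5GB segment (7 channels)
print(per_channel * 8)    # 224.0 -> the combined figure quoted for the card
```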

Unfortunately what this means is that accessing the weaker 512MB segment blocks access to the stronger 3.5GB segment if both memory operations are identical; or put another way, using the 512MB segment can harm the performance of the 3.5GB segment. For example, if we want to issue reads to both segments at once, reading the 512MB segment blocks any other reads to the 3.5GB segment for that cycle. If the 3.5GB segment is blocked in this fashion and doesn't have a non-blocking write to work on instead, it would have to go idle for that cycle, which would reduce the effective memory bandwidth of the 3.5GB segment. This means that taken over time in our example, the larger the percentage of the time the crossbar is reading the 512MB segment, the lower the effective read memory bandwidth would be from the 3.5GB segment.
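To put a rough number on the blocking effect, here is a deliberately simplified model of effective read bandwidth, where a fraction f of read cycles target the 512MB segment. This is a toy illustration of the idea only, not a measurement or NVIDIA's actual behavior:

```python
# Toy model of the blocking effect: if a fraction f of read cycles go to
# the 512MB segment, the 3.5GB segment cannot be read during those cycles.
BIG_GBS, SMALL_GBS = 196.0, 28.0    # per-segment read peaks from the article

def effective_read_bandwidth(f_small: float) -> float:
    """Aggregate read bandwidth when f_small of cycles read the 512MB segment."""
    return (1 - f_small) * BIG_GBS + f_small * SMALL_GBS

print(round(effective_read_bandwidth(0.0), 1))   # 196.0 -> small segment untouched
print(round(effective_read_bandwidth(0.1), 1))   # 179.2 -> 10% of cycles blocked
```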

Despite all of this, achieving peak memory bandwidth performance on the GTX 970 is still possible, but it requires much more effort since simple striping will not do the trick. The easiest and most effective solution in this regard is to interleave reads and writes over the segments, such that one segment is writing while the other segment is reading. Interleaving in this fashion allows both segments to work at once – avoiding the blocking effect of the shared read and write buses – and makes it more likely that both segments are doing useful work rather than waiting for their turn on an operation. However because this is only applicable to situations where more than 3.5GB of VRAM is in use and both segments are necessary, it's only theoretically possible to achieve 224GB/sec when more than 3.5GB of VRAM is in use. In any situation below 3.5GB we are de facto limited to just the larger segment, in which case there are only 7 memory channels (196GB/sec) to stripe memory operations across. NVIDIA could of course interleave operations sooner than that and use both segments more often, but due to the blocking effect we've discussed, the performance hit from using the 512MB segment can quickly become greater than any gains.
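The difference between pure reads and interleaved reads/writes can be summarized numerically. This is a sketch of the concept using the article's peak figures, not the driver's actual scheduling logic:

```python
# Pure reads are an XOR across segments: only one segment can be read per
# cycle, so the best sustained pure-read case is to stay in the 3.5GB segment.
def pure_read_peak_gbs() -> float:
    return 196.0                    # 3.5GB segment alone

# Interleaving lets one segment read while the other writes, so both
# segments' buses do useful work in the same cycle.
def interleaved_peak_gbs() -> float:
    return 196.0 + 28.0             # 224.0 GB/s aggregate (reads + writes)

print(pure_read_peak_gbs())         # 196.0
print(interleaved_peak_gbs())       # 224.0
```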

The end result of all of this is that the GTX 970 has a unique memory layout that needs to be accounted for by the operating system and NVIDIA’s drivers, as the use of multiple segments adds a new level of complexity to optimizing memory performance. The two segments are unequal in both size and performance: the 3.5GB segment is larger, faster, and otherwise preferable to the 512MB segment. From a logical, hierarchical perspective, the 512MB segment essentially serves as an additional layer of memory between the main VRAM and system memory, slower than the 3.5GB segment but offering almost 2x the performance of going out over the PCIe bus to system memory.
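That hierarchy can be summarized as rough one-way bandwidth tiers. The VRAM figures come from the article; the PCIe figure is an assumption on our part (PCIe 3.0 x16, roughly 16GB/s), which is consistent with the article's "almost 2x" comparison:

```python
# Approximate bandwidth tiers in the GTX 970's memory hierarchy.
# The PCIe 3.0 x16 figure (~16GB/s) is an assumption, not from the article.
MEMORY_TIERS_GBS = {
    "3.5GB VRAM segment": 196.0,
    "512MB VRAM segment": 28.0,
    "System memory over PCIe 3.0 x16": 16.0,
}

for tier, bw in sorted(MEMORY_TIERS_GBS.items(), key=lambda kv: -kv[1]):
    print(f"{tier}: {bw:.0f} GB/s")
```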

The upshot here is that while this is the first time NVIDIA has used this specific ROP/MC configuration in a product, this is not the first product they have designed with a segmented or otherwise unbalanced memory configuration. Since the GTX 500 series, NVIDIA has used unbalanced/asymmetrical memory configurations on some midrange SKUs, most recently on the GTX 660 and GTX 660 Ti. In the case of both of those cards, NVIDIA utilized a 192-bit memory bus with 2GB of VRAM attached, which meant that some memory controllers had more VRAM attached to them than others. The end result, as it turns out, is very similar, and while NVIDIA has never explained in depth how they handle memory allocation on those cards, it is very similar to the GTX 970’s memory segmentation. Which is to say that NVIDIA actually has multiple generations of experience with segmented memory; this is not the first time they have implemented it, but rather the first time we’ve seen such a configuration on a high-performance card like the GTX 970.

For Comparison: GTX 660 Ti's Unbalanced Memory Bus

NVIDIA Maxwell Family Memory Controller Configurations
  Part       Memory Segments
  GTX 980    1 Segment (4x8 MC)
  GTX 970    2 Segments (4x7 MC)
  GTX 980M   1 Segment (4x8 MC)
  GTX 970M   1 Segment (3x6 MC)
  GTX 965M   1 Segment (2x4 MC)

As for why NVIDIA is using such a configuration here, the crux of the matter is money and yields. Without the ability to partially disable a ROP/MC partition, NVIDIA would either have to spec a card to use a fully enabled partition – essentially reducing yields for that card and driving up costs – or disable the entire partition and lose all of the benefits of the additional ROPs, memory, and the memory controller. This finer granularity allows NVIDIA to better control how they harvest bad chips and what resulting configurations they bring to market, along with making a single ROP/L2 defect less harmful to overall performance by keeping the rest of a partition online. Otherwise, to stick with a “balanced” configuration with as many crossbar ports as DRAM modules would result in either a higher spec GTX 970, or a lower spec card with a 192-bit memory bus.

In the grand scheme of things then, depending on how you want to define memory capacity, the GTX 970 can be said to have either 3.5GB of VRAM or 4GB of VRAM. Only the first 3.5GB segment is a high-speed (full bandwidth) segment, and for naïve operations this is the segment that software will want to use. However the final 512MB segment is accessible, and despite its lower performance it is still usable and still far better than going to system memory – in which sense the GTX 970 does have, and can access, a full 4GB of VRAM. This means that both answers can be correct, and like so many other facets of modern technology, the true answer is often more complex than a single number.

Comments

  • Mondozai - Monday, January 26, 2015 - link

    When a company intentionally lies to its consumers, that isn't a storm in a teacup. Ryan may believe them but I don't. I agree with him that it's incredibly stupid to do this kind of stuff, but the notion that they didn't know, even after all the manuals were passed around the company? Knowing the number of ROPs is basic stuff for technical marketing.

    And okay if this got missed a single round. But in successive rounds, over a period of almost half a year? C'mon. Nvidia knows that it wouldn't sell as well if they marketed it as "3.5 VRAM" and they tried to cover this shit up.

    I'm guessing Jonah Alben didn't have anything to do with this, and I'm guessing he's pissed as fuck. The big question is if Jen-Hsun knew or not. Their marketing team are not exactly people I'd trust (watch Tom Petersen in any stream and you'll know what I mean).

    Throwing the marketing guys under the bus is poetic justice. But also an easy move. Again, did the CEO know?
  • mapesdhs - Monday, January 26, 2015 - link

    "intentionally lies".. yeah right! So you're saying this is not acceptable, and yet it's ok for AMD
    (and indeed NVIDIA) to market dual-GPU cards by advertising the sum of the VRAM on both
    GPUs, even though an application can only see & access the individual amount? Look at
    *any* seller site spec list for an AMD 295x2, they all say 8GB (ditto the specs page on
    AMD's site), while Anandtech's own review shows quite clearly that it's just 2x4GB, so the
    real amount accessible by an application is 4GB, not 8GB. Surely this is far more of a
    deception than the mistake NVIDIA states they have made with the 970 specs.

    So I call out hypocrisy; your comment is just NVIDIA-bashing when there have been far
    more blatant deceptions in the past, from both sides. NVIDIA does the double-up VRAM
    nonsense as well, e.g. the sale ads for the Titan Z all state 12GB, as do the specs on the
    NVIDIA web site, but again it's just 6GB per GPU, so 6GB max visible to an application.
    Look back in time, you'll see the same mush published for cards like the GTX 295 and
    equivalent ATIs from back then.

    So quit moaning about what is merely a mistake which doesn't change the conclusions
    based on the initial 970 review performance results, and instead highlight the more blatant
    marketing fibs, especially on dual-GPU cards. Or of course feel free to cite in *any* dual-
    GPU review where you complained about the VRAM diddle.

    Sorry if I sound peeved, but your comment started by claiming something is true when
    it's just your opinion, based on what you'd like to believe is true.

  • alacard - Monday, January 26, 2015 - link

    "So you're saying this is not acceptable, and yet it's ok for AMD
    (and indeed NVIDIA) to market dual-GPU cards by advertising the sum of the VRAM on both
    GPUs, even though an application can only see & access the individual amount?"

    That's what's known as a straw-man, he never mentioned anything about dual GPUs. His point about ROPs is perfectly valid--and no Ian it's not ok to lie about that, nor about the amount of cache.

    "Sorry if I sound peeved, but your comment started by claiming something is true when
    it's just your opinion, based on what you'd like to believe is true."

    Why would you give Nvidia the benefit of the doubt here? If you really and truly believed no one brought this up before release or noticed it afterwards then you're a bigger fool than I could have ever guessed you are.

    Sorry if I sound peeved, but your comment started by claiming something is true when
    it's just your opinion, based on what you'd like to believe is true.
  • dragonsqrrl - Monday, January 26, 2015 - link

    "Why would you give Nvidia the benefit of the doubt here?"

    Why would Nvidia want to deceive the whole PC gaming world over something so minor? As Ryan stated in the article that would be genuinely stupid. Can you think of a reason why Nvidia would intentionally seed a slightly inaccurate spec sheet to the press? What would they gain from that? I don't think there's any reason to believe the initial spec sheet was anything other than a mistake by Nvidia, and neither does any credible tech journalist I know of.

    That being said I also highly doubt they weren't aware of the mistake until now. While I think their response to this incident has been good so far, I really think they should've come out with this information sooner (like last week when this started to really heat up). But I think that time was probably spent confirming what had happened and how to present it to the press.
  • alacard - Monday, January 26, 2015 - link

    " Can you think of a reason why Nvidia would intentionally seed a slightly inaccurate spec sheet to the press?"

    Is this a real question or some sort of a joke? You're asking why a company would knowingly inflate a spec sheet for a product they want to sell, and doing so with a straight face? Is that PT Barnum's johnson i see swinging from your asshole?
  • Galidou - Tuesday, January 27, 2015 - link

    People buy performance; don't say a thing about memory bandwidth, ROPs and such, just install it in your computer. You paid less than for some video cards it outperforms; if you don't care about stats, you're on the right track.

    Companies lie to us in advertising all sorts of things on TV and so on. I've seen many LCD monitors advertising X nits and not fully delivering that amount, and no one ever sues them. If the monitor still averages the same or better image quality than the best monitors in its price class, who cares about the advertisement?

    Not saying that lying to improve sales numbers is right, but SO MANY companies do that. Unless it turns out to be a really bad product for the price you paid, then sue them. But don't whine when there's a SLIGHT difference and it still outperforms everything in its price class, uses less power, has good drivers and so on.

    The only reason Nvidia would have to do this intentionally would be to prop up a mediocre video card, a kind of semi failure, which the GTX 970 SURELY isn't. Why would a company need to boost sales when they know it's gonna be sold out for the next month because of its price/performance ratio?
  • FlushedBubblyJock - Friday, January 30, 2015 - link

    Oh, so that's why AMD lied about the number of transistors in the Bulldozer core, claiming it was 2 billion, then months later correcting their lie to journalists and revising it downward quite a large leap to 1.2 billion, a full 40% drop.
    Yes, lying about a cruddy product that never met expectations by pumping up that core transistor count to give the impression of latent power just not yet utilized, by say, optimizations required for the Windows OS to use all the "8"/(4) cores better with improved threading...

    Hahahhaaa no it's not a joke...

  • dragonsqrrl - Tuesday, January 27, 2015 - link

    Wow, disproportionately aggressive response to appropriate and logical questions. I can't tell if you're trying to intentionally mislead others or if you really have no clue what you're talking about. Yes, I'm asking why Nvidia would conspire to intentionally lie about something so minor in the initial spec sheet that would almost certainly be discovered soon after launch? I even tried to help you out a little: What would they gain from that?

    It just takes a simple risk assessment and a little bit of logic to pretty much rule this out as an intentional deception.
  • Galidou - Tuesday, January 27, 2015 - link

    Nvidia's way of thinking by the mad community: ''With the performance-to-cost ratio of that card when it's launched, it will be sold out for weeks to come even if we give the true spec sheets! Let's speak to the marketing department and modify that so it can be SOLD OUT TIMES 2!! YEAH, now you're talking, let's make the community so mad they have to wait for it! YEAH, we want the community to HATE US!''
  • alacard - Tuesday, January 27, 2015 - link

    Galidou, dragonsqrrl: Can you explain how a 970 with one of the DRAM banks partitioned for low-priority data is supposed to operate at 256 bits? Given that the last 512MB chunk is only being accessed as a last resort, and only after all the other RAM is occupied, the memory subsystem could only be operating at 224 bits for the majority of cases.

    I could be wrong but i just don't see it. Given that, we're not merely talking about diminished ROP and cache count, but also a shallower memory interface which NVIDIA marketed specifically as being exactly the same as the 980. Here is a direct quote from their reviewer's guide:

    "Equipped with 13 SMX units and 1664 CUDA Cores the GeForce GTX 970 also has the rending horsepower to tackle next generation gaming. And with its 256-bit memory interface, 4GB frame buffer, and 7Gbps memory the GTX 970 ships with the SAME MEMORY SUBSYSTEM as our flagship GEFORCE GTX 980"

    If it really is only operating at 224 bits, THIS IS A BIG DEAL. Even if it were an honest mistake, it's still a big deal. Giving them the benefit of the doubt and assuming their initial materials were wrong, the idea they didn't notice it after release... come on.

    BTW that PT Barnum comment was just a joke that popped in my head at the last second and i couldn't resist adding it.
