Configurations - Up to 4 GPUs at 6TFLOPs

Starting off with the smallest GPU building blocks, it’s good to remind ourselves what an Imagination GPU looks like – the following is from last year’s A-Series presentation:

PowerVR GPU Comparison
| | AXT-16-512 / BXT-16-512 | GT9524 | GT8525 | GT7200 Plus |
| Core Configuration | 1 SPU (Shader Processing Unit) - "GPU Core", 2 USCs (Unified Shading Clusters) - ALU Clusters (all designs) |
| FP32 FLOPS/Clock (MADD = 2 FLOPs, MUL = 1 FLOP) | 512 (2x (128x MADD)) | 240 (2x (40x MADD+MUL)) | 192 (2x (32x MADD+MUL)) | 128 (2x (16x MADD+MADD)) |
| FP16 Ratio | 2:1 (Vec2) (all designs) |
| Pixels / Clock | 8 | 4 | 4 | 4 |
| Texels / Clock | 16 | 8 | 8 | 4 |
| Architecture | A-Series / B-Series | Series-9XTP (Furian) | Series-8XT (Furian) | Series-7XT (Rogue) |
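
To make the table’s FLOPS accounting concrete, here is a minimal Python sketch (not anything from Imagination, just re-deriving the figures above) that counts a MADD as 2 FLOPs and a MUL as 1 FLOP per ALU lane:

```python
# Re-derive the FP32 FLOPS/clock column: MADD = 2 FLOPs, MUL = 1 FLOP.
# Each entry: (USCs, MADD lanes per USC, extra lanes per USC, extra op type)
configs = {
    "AXT/BXT-16-512": (2, 128, 0,  None),    # 2x (128x MADD)
    "GT9524":         (2, 40,  40, "MUL"),   # 2x (40x MADD+MUL)
    "GT8525":         (2, 32,  32, "MUL"),   # 2x (32x MADD+MUL)
    "GT7200 Plus":    (2, 16,  16, "MADD"),  # 2x (16x MADD+MADD)
}

for name, (uscs, madd_lanes, extra_lanes, extra_op) in configs.items():
    flops = uscs * madd_lanes * 2                      # MADD lanes: 2 FLOPs each
    flops += uscs * extra_lanes * (2 if extra_op == "MADD" else 1 if extra_op else 0)
    print(f"{name:>15}: {flops} FP32 FLOPS/clock")
# -> 512, 240, 192, 128, matching the table
```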

Fundamentally and at a high level, the new B-Series GPU microarchitecture looks very similar to the A-Series. Imagination noted that we should generally expect a 15% increase in performance or efficiency compared to the A-Series, with the building blocks of the two GPU families being largely the same, save for some more important additions such as the new IMGIC (Imagination Image Compression) implementation which we’ll cover in a bit.

An XT GPU still consists of the new SPU design, which houses the new, more powerful TPU (Texture Processing Unit) as well as the new 128-wide ALU design that is scaled into ALU clusters called USCs (Unified Shading Clusters).

Imagination’s current highest-end hardware implementation in the BXT series is the BXT-32-1024, and putting four of these together creates an MC4 GPU. In a high-performance implementation reaching up to 1.5GHz clock speeds, this configuration would offer up to 6TFLOPs of FP32 computing power. Whilst this isn’t quite enough to catch up to Nvidia and AMD, it’s a major leap for a third-party GPU IP provider that’s been mostly active in the mobile space for the last 15 years.
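
That 6TFLOPs figure falls straight out of the per-clock numbers: a single BXT-32-1024 delivers 1024 FP32 FLOPS per clock (4 USCs x 128 MADD lanes x 2 FLOPs per MADD), so four cores at 1.5GHz land at roughly 6.1 TFLOPS. A quick back-of-the-envelope sketch:

```python
# Peak FP32 throughput for a BXT-32-1024 MC4 at 1.5 GHz.
uscs_per_core  = 4
lanes_per_usc  = 128           # 128-wide ALU, one MADD per lane per clock
flops_per_madd = 2
cores          = 4             # MC4: four cores laid out together
clock_hz       = 1.5e9         # high-performance implementation

flops_per_clock = uscs_per_core * lanes_per_usc * flops_per_madd * cores   # 4096
peak_tflops     = flops_per_clock * clock_hz / 1e12
print(f"{flops_per_clock} FP32 FLOPS/clock -> {peak_tflops:.2f} TFLOPS")   # ~6.14 TFLOPS
```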

The company’s BXM series continues to see some architectural differentiation, as not all of its implementations use the ultra-wide ALU design of the XT series. While the BXM-8-256 uses one 128-wide USC, the more area-efficient BXM-4-64 continues to use the 32-wide ALU from the 8XT series. Putting four BXM-4-64 GPUs together gets you to a higher performance tier with better area and power efficiency than a larger single-GPU implementation, as the quick comparison below illustrates.
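
To put that trade-off in numbers (using the figures from Imagination’s design tables further down), a BXM-8-256 and a BXM-4-64 MC4 both end up at 256 FP32 FLOPS per clock, but the four-core option doubles the texel rate while running narrower 32-wide wavefronts. A small illustrative sketch:

```python
# Same per-clock compute, different shape: one wide GPU vs. four narrow ones.
# Figures taken from Imagination's design tables further down.
bxm_8_256    = {"cores": 1, "uscs": 1, "wave": 128, "fp32_per_clk": 256,    "texels_per_clk": 8}
bxm_4_64_mc4 = {"cores": 4, "uscs": 4, "wave": 32,  "fp32_per_clk": 4 * 64, "texels_per_clk": 4 * 4}

for name, cfg in [("BXM-8-256 MC1", bxm_8_256), ("BXM-4-64 MC4", bxm_4_64_mc4)]:
    print(f"{name}: {cfg['fp32_per_clk']} FP32/clk, "
          f"{cfg['texels_per_clk']} texels/clk, wavefront width {cfg['wave']}")
# Both land at 256 FP32/clk, but the MC4 option offers 16 texels/clk vs. 8.
```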

The most interesting aspect of the multi-GPU approach is found in the BXE series, which is Imagination’s smallest GPU IP, purely focused on achieving the best possible area efficiency. Whilst the BXT and BXM series GPUs are only delivered as “primary” cores, the BXE is being offered in the form of both a primary as well as a secondary GPU implementation. The difference here is that the secondary variant of the IP lacks a firmware processor as well as geometry processing, instead fully relying on the primary GPU’s geometry throughput. Imagination says that this configuration would be able to offer quite high compute and fillrate capabilities in an extremely minuscule area.

PowerVR Hardware Designs GPU Comparison
| Family | Texels/Clock | FP32/Clock | Cores | USCs | Wavefront Width | MC Design |
| BXT-32-1024 MC1 | 32 | 1024 | 1 | 4 | 128 | P |
| BXT-16-512 MC1 | 16 | 512 | 1 | 2 | 128 | P |
| BXM-8-256 MC1 | 8 | 256 | 1 | 1 | 128 | P |
| BXM-4-64 MC1 | 4 | 64 | 1 | 1 | 32 | P |
| BXE-4-32 Secondary | 4 | 32 | 1 | 1 | 16 | S |
| BXE-4-32 MC1 | 4 | 32 | 1 | 1 | 16 | P |
| BXE-2-32 MC1 | 2 | 32 | 1 | 1 | 16 | P |
| BXE-1-16 MC1 | 1 | 16 | 1 | 1 | 8 | P |

Putting the different designs into a table, we see that there are only 8 distinct hardware designs for which Imagination has to create RTL and do physical design and timing closure. This is already quite a nice line-up in terms of scaling, from the lowest-end area-focused IP to something that would be used in a premium high-end mobile SoC.

PowerVR MC GPU Configurations
| Family | Texels/Clock | FP32/Clock | Cores | USCs | Wavefront Width | MC Design |
| BXT-32-1024 MC4 | 128 | 4096 | 4 | 16 | 128 | PPPP |
| BXT-32-1024 MC3 | 96 | 3072 | 3 | 12 | 128 | PPP |
| BXT-32-1024 MC2 | 64 | 2048 | 2 | 8 | 128 | PP |
| BXT-32-1024 MC1 | 32 | 1024 | 1 | 4 | 128 | P |
| BXT-16-512 MC1 | 16 | 512 | 1 | 2 | 128 | P |
| BXM-8-256 MC1 | 8 | 256 | 1 | 1 | 128 | P |
| BXM-4-64 MC4 | 16 | 256 | 4 | 4 | 32 | PPPP |
| BXM-4-64 MC3 | 12 | 192 | 3 | 3 | 32 | PPP |
| BXM-4-64 MC2 | 8 | 128 | 2 | 2 | 32 | PP |
| BXM-4-64 MC1 | 4 | 64 | 1 | 1 | 32 | P |
| BXE-4-32 MC4 | 16 | 128 | 4 | 4 | 16 | PSSS |
| BXE-4-32 MC3 | 12 | 96 | 3 | 3 | 16 | PSS |
| BXE-4-32 MC2 | 8 | 64 | 2 | 2 | 16 | PS |
| BXE-4-32 MC1 | 4 | 32 | 1 | 1 | 16 | P |
| BXE-2-32 MC1 | 2 | 32 | 1 | 1 | 16 | P |
| BXE-1-16 MC1 | 1 | 16 | 1 | 1 | 8 | P |

The big flexibility gain for Imagination and their customers is that they can simply take one of the aforementioned hardware designs and scale it up seamlessly by laying out multiple GPUs. On the low end, this creates some very interesting overlaps in terms of compute capabilities, while offering different fillrate capabilities and area-efficiency options.
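
As a rough illustration of that scaling (an assumption-laden sketch rather than anything Imagination ships: it simply scales the MC1 figures linearly with core count and reproduces the P/S design strings from the table above), the MC variants can be derived from the base designs like so:

```python
# Sketch: derive MCn configurations from the base MC1 hardware designs.
# Texels/clock, FP32/clock and USCs scale linearly with core count; BXE
# multi-core designs pair one primary ("P") with secondaries ("S") that
# lack their own firmware processor and geometry processing.
base_designs = {
    # name:        (texels/clk, fp32/clk, USCs, wavefront, family)
    "BXT-32-1024": (32, 1024, 4, 128, "BXT"),
    "BXM-4-64":    (4,  64,   1, 32,  "BXM"),
    "BXE-4-32":    (4,  32,   1, 16,  "BXE"),
}

def mc_config(name, cores):
    texels, fp32, uscs, wave, family = base_designs[name]
    design = "P" + ("S" if family == "BXE" else "P") * (cores - 1)
    return (f"{name} MC{cores}: {texels * cores} texels/clk, "
            f"{fp32 * cores} FP32/clk, {uscs * cores} USCs, design {design}")

for n in (1, 2, 3, 4):
    print(mc_config("BXT-32-1024", n))   # MC4 -> 128 texels/clk, 4096 FP32/clk, PPPP
    print(mc_config("BXE-4-32", n))      # MC4 -> 16 texels/clk, 128 FP32/clk, PSSS
```

Running this reproduces the BXT-32-1024 and BXE-4-32 rows of the table, including the PSSS-style configurations where only the lead BXE core carries the firmware processor and geometry processing.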

At the high end, the biggest advantage is that Imagination can quadruple the processing power of their biggest GPU configuration. Imagination notes that for the BXT series, they no longer created a single design larger than the BXT-32-1024, because the return on investment would simply be smaller and would involve more complex timing closure work than if a customer simply scaled performance up via a multi-core implementation.

Comments

  • myownfriend - Tuesday, October 13, 2020 - link

    Yea like if the back buffer were drawn with on-chip memory... like a tile-based GPU.
  • anonomouse - Tuesday, October 13, 2020 - link

    Probably works out-ish a bit better with a tile-based deferred renderer, since the active data for a given time will be more localized and more predictable.
  • myownfriend - Tuesday, October 13, 2020 - link

    The thing with tile-based GPUs is that they have less data to share between cores since the depth, stencil, and color buffers for each tile are stored on-chip. Since screen-space triangles are split into tiles and one triangle can potentially turn into thousands of fragments, it becomes less bandwidth intensive to distribute work like that. All the work that Imagination in particular has put into HSR to reduce texture bandwidth as well as texture pre-fetch stuff would also benefit them in multi-GPU configurations.
  • SolarBear28 - Tuesday, October 13, 2020 - link

    This tech seems very applicable to ARM Macs although Apple is probably using in-house designs.
  • Luke212 - Tuesday, October 13, 2020 - link

    why would i want to see 2 gpus as 1 gpu? its a terrible idea. its NUMA x 100
  • myownfriend - Tuesday, October 13, 2020 - link

    On an SOC or even in a chiplet design, they wouldn't necessarily have separate memory controllers. We're talking about GPUs as blocks on an SOC.
  • CiccioB - Tuesday, October 13, 2020 - link

    It simplifies things better than seeing them as 2 separate GPUs
  • myownfriend - Sunday, June 6, 2021 - link

    I'm gonna be a weirdo and add to something like half a year later. I'm not sure why seeing two or, in this case, four GPUs is preferable to seeing one in situations where all the GPUs are tile-based and on the same chip.

    Let me think out loud here.

    At the vertex processing stage, you could toss triangles at each GPU and they'll transform them to screen-space then clip, project, and cull them. Their respective tiling engines then determine which tiles each triangle is in and appends that to the parameter and geometry buffer in memory. I can't think of many reasons why they would really need to communicate with each other when making this buffer. After that's done, the fragment shading stage would consist of each GPU requesting tiles and textures from memory, shading and blending them in their own tile memory, and writing out the finished pixels in memory. I can't really find much in that example that makes all four GPUs work differently than one larger one.

    I can see why that might be preferable with IMR GPUs though. If we were to just toss triangles at each GPU they would transform them to screen-space and clip, project, and cull them just like a TBDR. After this, a single IMR GPU would do an early-z test and, if it passes, then proceed with the fragment pipeline. This is where the first big issue comes up in a multi-GPU configuration though: overlapping geometry. Each GPU will be transforming different triangles and some of these triangles may overlap. It would be really useful for GPU0 to know if GPU1 is going to write over the pixels it's about to work on. This would require sharing the z-value of the current pixels between both GPUs. They could just compare z-values at the same stages, but unless they were synced with each other, that wouldn't prevent GPU0 from working on pixels that already passed GPU1's early z-test and are about to be written to memory. Obviously, that would result in a lot of unnecessary on-chip traffic, very un-ideal scaling, and possibly pixels being drawn to buffers that shouldn't have been.

    What might help is to do typical dual-GPU stuff like alternate frame or split-frame rendering so those z-comparisons would only have to happen between the pixels on each chip. The latter raises another problem though. Neither GPU can know what a triangle's final screen-space coordinates are until AFTER they transform it. This means if GPU0 is supposed to be rendering the top slice of the screen and it gets a triangle from the bottom of the screen or across the divide, then it has to know how to deal with that. It could just send that triangle to GPU1 to render. Since they both share the same memory, it has a second option, which is to do the z-comparison thing from before, and GPU0 could render the pixels to the bottom of the screen anyway.

    Obviously you could also bin the triangles like a TBDR, or give each GPU a completely separate task like having one work on the G-buffer while the other creates shadow maps, or have each rendering a different program. Because there are so many ways to use two or more IMRs together and each has its drawbacks, it makes sense to expose them as two separate GPUs. It puts the burden of parallelizing them in someone else's hands. TBDRs don't need to do that because they work more like they normally would. That's why PowerVR Series 5 GPUs pretty much just scaled by putting more full GPUs on the SOC.

    Obviously, these both become a lot more complicated when they're chiplets, especially if they have their own memory controllers but I won't get into that.
  • brucethemoose - Tuesday, October 13, 2020 - link

    Andrei, could you ask Innosilicon for one of those PCIe GPUs?

    Even if it only works for compute workloads, another competitor in the desktop space would be fascinating.

    Also, that is a *conspicuously* flashy and desktop-oriented shroud for something thats ostensibly a cloud GPU.
  • myownfriend - Tuesday, October 13, 2020 - link

    I was thinking the same thing about the shroud.
