
Quorten Blog 1

First blog for all Quorten's blog-like writings

Okay, okay, so I thought it would be awesome to have a good CPU framebuffer graphics rendering library, but it turned out that’s not enough for me. Why? Well, the justification is a little bit difficult to explain, but in a few words, the motivation will be obvious. When you use OpenCL on classic VideoCore IV Raspberry Pi platforms, it ties up the GPU and needs exclusive, non-shared, uninterrupted access to it. Yeah, like many things on Raspberry Pi, the same can be said about the accelerated SPI display output library. This means that you cannot use the Mesa VC4 driver at the same time, so OpenGL cannot be used for graphics rendering in an OpenCL application. And no, VideoCore IV is not good enough to support OpenGL ES 3.1, so no Compute Shaders extension is remotely on the radar either.

So… if you’re creating an OpenCL accelerated application, and you need 3D graphics rendering capabilities at the same time, what is the solution? Well, if you already have your own OpenCL 3D graphics rendering library, this problem is a piece of cake to solve. So, now that the justification is obvious, how do I approach converting the code I have so far to run in parallel?

Let’s start with the easy parts of the problem to solve. Transform and lighting on vertices can be easily and obviously extended to parallel. Matrix multiplication is also easy to extend to parallel. That narrows the final core parallel primitive we need for graphics rendering down to rasterization. Fortunately, for a 3D scanner, we have no need to render textured surfaces, so that simplifies our task a bit. How do we rasterize a triangle in parallel?
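
Just to make the easy part concrete, here’s a minimal sketch of what the per-vertex work looks like as an OpenCL kernel, one work-item per vertex. The names, the packed float4 layout, and the single directional light are all my own assumptions for illustration, not anything fixed by the library so far.

```c
/* Hypothetical sketch: one work-item per vertex.  Positions and normals are
   assumed packed as float4 (w unused for normals), with one directional
   light for untextured Lambertian shading. */
__kernel void xform_vertices(__global const float4 *pos_in,
                             __global const float4 *norm_in,
                             __global float4 *pos_out,
                             __global float *shade_out,
                             float16 mvp,       /* row-major 4x4 matrix */
                             float4 light_dir)  /* normalized, world space */
{
    size_t i = get_global_id(0);
    float4 p = pos_in[i];
    /* 4x4 matrix times vector: one dot product per output row */
    pos_out[i] = (float4)(dot(mvp.s0123, p),
                          dot(mvp.s4567, p),
                          dot(mvp.s89ab, p),
                          dot(mvp.scdef, p));
    /* simple Lambertian term, since we don't need textured surfaces */
    shade_out[i] = max(dot(norm_in[i].xyz, light_dir.xyz), 0.0f);
}
```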

The first and most obvious way to convert to parallel is to break up processing each scanline separately. This does mean that we need to multiply to compute the start and end points of each scanline in parallel, but GPUs are totally up to the task of multiplications, especially when constrained to 16-bit integers. Filling in each pixel within the scanline can also be done in parallel; a conditional move can be used to keep parallel threads executing identical instruction streams in lock-step, even when the scanlines are not all of the same length.
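
Here’s a rough sketch of what I mean, for one flat-bottomed triangle half with a 2D NDRange (column by scanline). Everything here — the names, the uchar framebuffer, the assumption that the triangle has already been split into a flat-bottomed half sitting fully inside the framebuffer — is my own guess at an implementation, not settled code.

```c
/* Sketch: one work-item per (column, scanline) pair.  The NDRange width is
   the widest scanline of this triangle half. */
__kernel void fill_tri_half(__global uchar *fb, int fb_width, int fb_height,
                            int y_top,                    /* first scanline */
                            float x_left, float x_right,  /* edges at y_top */
                            float dxdy_left, float dxdy_right,
                            uchar color)
{
    int col = get_global_id(0);
    int row = get_global_id(1);
    /* one multiply per edge gives this scanline's start and end points */
    float xl = x_left  + dxdy_left  * row;
    float xr = x_right + dxdy_right * row;
    int x = (int)xl + col;
    int y = y_top + row;
    /* clamp the address so threads past the span still make a safe access;
       a real implementation would pad the framebuffer or mask the store */
    int idx = min(y * fb_width + x, fb_width * fb_height - 1);
    int inside = (x < (int)xr) & (x < fb_width);
    /* conditional move instead of a divergent branch keeps the work-items in
       lock-step even though the scanlines differ in length */
    fb[idx] = select(fb[idx], color, (uchar)inside);
}
```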

The primary problem here, then, is not parallelizing triangle rasterization itself, but the fact that you can be left with a significant share of time where most of the parallel threads are idling rather than effectively busy. An alternative you might consider is to render separate triangles in parallel and rasterize a single triangle sequentially. But then the problem you encounter is that these triangles must be strictly non-overlapping, and z-buffering itself can only be computed in parallel by means of a parallel reduce operator.

Let’s further consider the idea of rasterizing several triangles simultaneously. We queue up all triangles to rasterize in the framebuffer, sort them vertically, and we parallel rasterize the scanlines of the framebuffer. For each scanline, we then detect which of the queued triangles intersect it and rasterize them in sequence. This then allows for proper z-buffering.
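
As a very rough sketch of that idea, assuming flat per-triangle depth and a triangle queue pre-sorted by top y (all the names and the layout here are mine):

```c
/* Sketch: one work-item per scanline of the framebuffer.  Triangles crossing
   the scanline are composited strictly in sequence, so the z-test never has
   write conflicts. */
typedef struct { float x0, y0, x1, y1, x2, y2, z; uchar color; } tri2d;

__kernel void raster_scanlines(__global const tri2d *tris, int n_tris,
                               __global uchar *fb, __global float *zbuf,
                               int fb_width)
{
    int y = get_global_id(0);               /* this work-item's scanline */
    float yc = y + 0.5f;                    /* sample at the pixel center */
    for (int t = 0; t < n_tris; t++) {
        tri2d tr = tris[t];
        float vx[4] = { tr.x0, tr.x1, tr.x2, tr.x0 };
        float vy[4] = { tr.y0, tr.y1, tr.y2, tr.y0 };
        if (yc < fmin(vy[0], fmin(vy[1], vy[2])))
            break;                          /* sorted by top y: no later hits */
        /* find where the triangle's edges cross this scanline */
        float xl = MAXFLOAT, xr = -MAXFLOAT;
        for (int e = 0; e < 3; e++) {
            if ((vy[e] <= yc) != (vy[e + 1] <= yc)) {
                float s = (yc - vy[e]) / (vy[e + 1] - vy[e]);
                float x = vx[e] + s * (vx[e + 1] - vx[e]);
                xl = fmin(xl, x);
                xr = fmax(xr, x);
            }
        }
        if (!(xl < xr))
            continue;                       /* scanline misses this triangle */
        /* composite the span sequentially with a plain z-test */
        for (int x = (int)xl; x < (int)xr; x++) {
            int idx = y * fb_width + x;
            if (tr.z < zbuf[idx]) {
                zbuf[idx] = tr.z;
                fb[idx] = tr.color;
            }
        }
    }
}
```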

Okay, the problem here is that we need iterations to do the vertical select, and then sort from left to right, and that could result in non-busy threads. Not so fast. If we don’t care to strictly process one scanline in a left-to-right fashion (i.e. our cache memory is big enough for a single scanline), then we can do a random horizontal rasterization. But yeah, for vertical select, we still need to do a search.

Okay, I think I’m convinced. This seems to be the better of the methods, to try to rasterize triangles all at once in the framebuffer, rather than thinking about rasterizing triangles one at a time.

Related to sorting: different kernels, each built with a different maximum number of sort iterations, could be used.

But, the question of parallel sorting. Now, that can be tough. The first idea that comes to mind is to use a hashing technique: sort into buckets. As long as there are no hash collisions, there will be no need for thread serialization. But collisions are exactly the problem.

Okay, let’s try this take on a parallel sort. Your goal is to have n threads sort into n bins, each of size n. So there are n * n items to sort. Each thread only makes its first write to its own index in the bin. This way, it is guaranteed that the first iteration of sorting will never collide across threads. If a thread needs to write a second item to the same bin, then we can signal “thread overflow” for that thread. We have to wait until non-overflow writes have happened, then we check if we have an empty bin. If so, we write to it, provided that there aren’t too many thread overflows. This will effect a parallel sort that works most of the time for binning.
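
Here’s how I picture the collision-free first pass, sketched with my own names; the follow-up passes that place the spilled items are left out.

```c
/* Sketch: n work-items, each owning n of the n*n items.  Bin b reserves slot
   bins[b*n + tid] for work-item tid, so no two threads ever write the same
   slot.  The host zero-initializes bin_used. */
__kernel void bin_first_pass(__global const ushort *items,  /* n*n keys */
                             __global ushort *bins,         /* n bins * n slots */
                             __global uchar *bin_used,      /* slot occupancy */
                             __global uint *overflow,       /* per-thread count */
                             uint n)
{
    uint tid = get_global_id(0);
    uint spilled = 0;
    for (uint i = 0; i < n; i++) {
        ushort key  = items[tid * n + i];
        uint   b    = key % n;             /* bucket by key, hash-style */
        uint   slot = b * n + tid;         /* this thread's own index in bin b */
        if (!bin_used[slot]) {
            bins[slot] = key;
            bin_used[slot] = 1;
        } else {
            spilled++;                     /* second item for the same bin */
        }
    }
    overflow[tid] = spilled;               /* later passes place these items */
}
```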

Rather than merely sorting triangles vertically, hierarchical binning is more efficient. You divide the image in half, then divide those halves in half, and so on. The triangle is placed in the smallest bin that fits. Gathering all triangles that intersect a scan line is then easy and efficient.
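
The “smallest bin that fits” rule is cheap to compute if the image height is a power of two: the bin is identified by the common high bits of the top and bottom rows. Here is a little C sketch of that rule, under my own naming (and using the GCC/Clang __builtin_clz intrinsic):

```c
#include <stdint.h>

/* Depth of the smallest hierarchy node containing rows [y_min, y_max]:
   depth 0 is the whole image, each deeper level halves the bins.  `levels`
   is log2 of the image height.  E.g. bin_depth(37, 42, 8) == 4, because rows
   37 and 42 share their top four bits (both fall in 16-row bin number 2). */
static int bin_depth(uint32_t y_min, uint32_t y_max, int levels)
{
    uint32_t diff = y_min ^ y_max;         /* bits where the rows differ */
    int common = diff ? __builtin_clz(diff) - (32 - levels) : levels;
    return common < 0 ? 0 : common;        /* very tall spans land at the root */
}
```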

Now, those last two points together? We now have a reasonable way to approach parallel sorting that coincides closely with our target data structure. Yes, we still have to run multiple iterations for each bucket, but at least we have some justification for the reason why, and we still have a logarithmic number of bucket levels.

Yeah… so the point is, we have an efficient means of executing symmetric binning in parallel, but if there is going to be steep asymmetry in the results, that is where we are less efficient. We might be able to respond by dynamically adapting and terminating early to re-partition the bucketing.

Oh, okay, now I go searching about the Internet. I see what’s going on here… efficient GPU sorting algorithms are in fact implemented around merge sort, bitonic merge sort in particular. However… how does the performance compare with efficient CPU algorithms? Generally, the performance matches the CPU algorithms or is faster by a small factor such as 2.5. This is based off of the assumption that the CPU runs at a higher clock speed than the GPU, say 6 times faster. So, for Raspberry Pi… because VideoCore IV GPU memory access is so slow, CPU sorting is probably more efficient, end of story. Okay, fine, if I’m targeting Raspberry Pi, then that makes sorting easy, not to mention that, because of the unified memory architecture (UMA), there is no round-tripping delay.

20200626/DuckDuckGo gpu parallel sort
20200626/https://developer.nvidia.com/gpugems/gpugems2/part-vi-simulation-and-numerical-algorithms/chapter-46-improved-gpu-sorting
20200626/DuckDuckGo bitonic merge sort
20200626/https://en.wikipedia.org/wiki/Bitonic_sorter
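
For reference, the core of a bitonic merge sort on the GPU is a single compare-exchange kernel that the host enqueues O(log² n) times. Here is my own generic sketch of that pass (my parameter names, not the chapter’s code), assuming a power-of-two number of keys:

```c
/* One compare-exchange pass of a bitonic merge sort.  The host enqueues this
   kernel once per (stage, pass) combination. */
__kernel void bitonic_pass(__global uint *keys, uint pass_width, uint stage_width)
{
    uint i = get_global_id(0);
    uint partner = i ^ pass_width;          /* element to compare against */
    if (partner > i) {                      /* each pair is handled once */
        /* the sort direction alternates per stage_width block, which is what
           builds the bitonic runs that later stages merge */
        int ascending = ((i & stage_width) == 0);
        uint a = keys[i], b = keys[partner];
        if ((a > b) == ascending) {         /* out of order for this direction */
            keys[i] = b;
            keys[partner] = a;
        }
    }
}
```

The host side then just loops `for (k = 2; k <= n; k <<= 1) for (j = k / 2; j > 0; j /= 2)`, enqueueing the kernel with `stage_width = k`, `pass_width = j`, and a global size of n.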

Wow, that GPU Gems 2 book? It’s an old book, published in the year 2005, yet it describes such sophisticated and modern GPU programming concepts! All except for OpenCL, that is. This is a pre-CUDA book, it looks like.

Okay, what does Wikipedia have to say about scanline rendering?

20200626/https://en.wikipedia.org/wiki/Scanline_rendering

Hints, hints, that’s what I need. Apparently, z-buffered rendering better lends itself to parallel compute and is more efficient with scenes of very high complexity. Yeah, yeah, I just need to figure out how the rendering is done in parallel. Is it done per-triangle? How are memory conflicts from overlapping parallel writes prevented?

Yeah, really, I don’t know the best way to do it, so I’ll just give my wild guess based off of the hints. First off, you want to divide up the framebuffer into tiles. Each tile is where one GPU processor core works, so that the cores can execute independently of each other. This way, you can have each core process one triangle at a time, in its own image tile, where there will be no conflicting accesses. At the top level, looping within each tile, you run through the list of triangles to be rasterized, discarding all the ones that are out of bounds and only working on triangles that are in bounds. Then, what do the SIMD parallel threads do? Really the only thing they can do is assist with rasterizing the current triangle, which is known not to overlap with itself, of course. So you just put them to work like so and brace yourself for the fact that you may be wasting 50% or more of the available compute capacity. Alternatively, it could be possible to put the SIMD threads to work in yet smaller tiles, using conditional move for no-ops as necessary.
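
Putting that wild guess in kernel form helps me see the thread utilization issue. Here’s a sketch where one work-group owns one tile and one work-item owns one pixel (flat per-triangle depth, my own names throughout — a guess, not an established design):

```c
/* Sketch: one work-group per tile, one work-item per pixel in that tile.
   Each pixel is owned by exactly one work-item, so the z-test needs no
   synchronization. */
typedef struct { float x0, y0, x1, y1, x2, y2, z; uchar color; } tri2d;

float edge_fn(float ax, float ay, float bx, float by, float px, float py)
{
    return (px - ax) * (by - ay) - (py - ay) * (bx - ax);
}

__kernel void raster_tiles(__global const tri2d *tris, int n_tris,
                           __global uchar *fb, __global float *zbuf,
                           int fb_width, int tile_size)
{
    int tile_x = (int)get_group_id(0) * tile_size;
    int tile_y = (int)get_group_id(1) * tile_size;
    int px = tile_x + (int)get_local_id(0);
    int py = tile_y + (int)get_local_id(1);
    int idx = py * fb_width + px;
    float zmin = zbuf[idx];
    uchar col  = fb[idx];
    for (int t = 0; t < n_tris; t++) {
        tri2d tr = tris[t];
        /* cheap cull: skip triangles whose bounding box misses this tile
           (uniform across the work-group, so no divergence here) */
        float bx0 = fmin(tr.x0, fmin(tr.x1, tr.x2));
        float bx1 = fmax(tr.x0, fmax(tr.x1, tr.x2));
        float by0 = fmin(tr.y0, fmin(tr.y1, tr.y2));
        float by1 = fmax(tr.y0, fmax(tr.y1, tr.y2));
        if (bx1 < tile_x || bx0 >= tile_x + tile_size ||
            by1 < tile_y || by0 >= tile_y + tile_size)
            continue;
        /* all three edge functions non-negative (CCW winding) = inside */
        float cx = px + 0.5f, cy = py + 0.5f;
        int inside = edge_fn(tr.x0, tr.y0, tr.x1, tr.y1, cx, cy) >= 0.0f &&
                     edge_fn(tr.x1, tr.y1, tr.x2, tr.y2, cx, cy) >= 0.0f &&
                     edge_fn(tr.x2, tr.y2, tr.x0, tr.y0, cx, cy) >= 0.0f;
        if (inside && tr.z < zmin) {        /* private z-test, no conflicts */
            zmin = tr.z;
            col  = tr.color;
        }
    }
    zbuf[idx] = zmin;
    fb[idx] = col;
}
```

This is the “brace yourself” case: a small triangle lights up only a few of the tile’s work-items, while the rest just idle through the edge tests.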

Okay, how about a better method of using the SIMD capacity? You step through rasterizing the triangle almost sequentially, except that you parallelize only across adjacent pixels in a scanline. Okay, that wastes less when drawing large triangles, but it wastes more when drawing small triangles. Actually, the problem is, when you have lots of small triangles, you really can’t draw them efficiently using either method.

Well, how was acceleration historically done in GPUs? First they accelerated 2D primitives, then they moved to 3D, at least on the PC. In other words, the hardest primitives to accelerate, so to speak, were done first. Okay, fine, I’ll try to rebase my thought process.

Obviously, I’d say, if you’re accelerating 2D graphics, you’d start by accelerating bit-block transfers. Now, this is obviously really easy to do in parallel since the operator always operates on a nice rectangular grid. Your parallel threads can easily be kept as busy as possible. Indeed, drawing any arbitrary polygon can be done via masking for clipping, then you just do a “bit-block transfer” style drawing operation to fill the interior. Yes, I know, I know, now you’re talking about being wasteful at the edges, but that would likely be the best you can do for 2D via a naive approach. And, finally… there is no attempt made to accelerate simultaneous drawing; high-level drawing commands are always done sequentially.
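
For what it’s worth, that masked-fill style of 2D acceleration translates into about the simplest kernel imaginable — a hedged sketch with hypothetical names, one work-item per pixel of the destination rectangle:

```c
/* Sketch: fill a rectangle through a per-pixel coverage mask (1 byte per
   pixel, 0 or 1).  The mask wastes work at the polygon's edges but keeps
   every thread uniformly busy. */
__kernel void masked_fill(__global uchar *fb, int fb_width,
                          __global const uchar *mask,
                          int rect_x, int rect_y, int rect_w,
                          uchar color)
{
    int x = get_global_id(0);               /* 0 .. rect_w - 1 */
    int y = get_global_id(1);               /* 0 .. rect_h - 1 */
    int dst = (rect_y + y) * fb_width + (rect_x + x);
    uchar m = mask[y * rect_w + x];
    fb[dst] = select(fb[dst], color, m);    /* conditional move, no branch */
}
```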

So, all your stated problems in your naive implementation? Clearly, they’d also have been present in an early 2D video card.

So… fine, parallel rasterization is hard. Now I’ll try to summarize my techniques thus far.

  • Parallel “brush”: plot a matrix of pixels simultaneously. Works for drawing thin lines, thick lines, triangles, polygons (by extension), and bit-block transfers. Bezier curves may also work. Very common early 2D video card parallel acceleration technique.

    The main cause of sequential compute slow-down is drawing a large number of small primitives, such as would be the case when doing high poly count 3D rendering. Therefore, this drawing technique is not suitable for 3D acceleration.

  • Tiling: Split up an image into a series of tiles, with only one thread capable of writing into each tile. This is most efficient in combination with queuing and sorting rasterization commands.

    The main cause of sequential slowdown is the need to sequentially process stacked primitives. Drawing is most efficient in the case of completely non-overlapping primitives. However, if a large number of overlapping primitives is to be expected, a multi-layered z-buffer can be used so that parallel rasterization followed by a parallel reduce can be used (see the sketch after this list). An ID buffer and deferred shading can potentially be used to reduce memory consumption and enable additional rendering speedups. Subdivision can be used to handle drawing large primitives, thereby enabling parallel processing of each piece of the larger primitive.

    Parallel sorting algorithms can be used for efficient sorting on the GPU, without the need to round-trip back to the CPU.

  • Scanline rendering: Basically like a less sophisticated version of tiling. At the very least, primitives are sorted vertically; maybe they are even bucketed by vertical size. You can then gather all primitives that intersect with a given scanline and render the contents of the scanline. Each scanline can then be rendered in parallel. If the contents of a scanline are sorted horizontally, then the scanline can be subdivided into runs that are rendered in parallel. Within a scanline or run, you process the primitives sequentially so that you can implement proper z-buffering.

    This method requires you to be able to efficiently compute an arbitrary vertical intersect across a primitive, which is easy in the case of triangles.
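
Here’s what I have in mind for the multi-layered z-buffer mentioned in the tiling bullet: rasterization appends fragments per pixel without caring about order, and a separate pass reduces each pixel’s fragment list down to the nearest one. The layer count, layout, and names are all my own assumptions.

```c
/* Sketch: the reduce pass over a layered z-buffer, one work-item per pixel.
   Rasterization has already appended up to LAYERS fragments per pixel. */
#define LAYERS 8

__kernel void zreduce(__global const float *layer_z,     /* LAYERS * n_pixels */
                      __global const uchar *layer_color, /* LAYERS * n_pixels */
                      __global const uchar *layer_count, /* fragments per pixel */
                      __global float *zbuf, __global uchar *fb,
                      int n_pixels)
{
    int p = get_global_id(0);
    float best_z = MAXFLOAT;
    uchar best_c = 0;                        /* background color */
    int count = layer_count[p];
    for (int l = 0; l < LAYERS && l < count; l++) {
        float z = layer_z[l * n_pixels + p];
        if (z < best_z) {                    /* keep the nearest fragment */
            best_z = z;
            best_c = layer_color[l * n_pixels + p];
        }
    }
    zbuf[p] = best_z;
    fb[p]   = best_c;
}
```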

Okay, for the sake of rendering 3D scan results, I think we have a clear winner: tiling. When you think about it, the worst overlap factor in drawing a 3D scan result is fairly limited. A perfectly convex object will only have a max overlap factor of two. Of course, most 3D scans are not perfectly convex, but mostly convex, and they have surface texture ripple. In this case, you can get an estimate of the max overlap factor by multiplying the max cross section thickness (i.e. diameter) by the polygon density (i.e. triangles per unit distance). Worst comes to worst, you’ll have that many overlapping triangles. You then adjust your parallel buffer allocations downward by the max factor you think is reasonable for serial data processing.

Now, what about subdividing large triangles? Ideally, of course, 3D scanning does not produce such triangles, so this is something you wouldn’t have to worry about. But what if you zoom in on a specific area? Okay, but then we can roughly calculate the required subdivision factor and all is good. So, the end-all be-all is really that only if you want to support general-purpose 3D rendering do you need a means to subdivide large primitives into smaller ones to take advantage of parallel rendering via tiling. But how?

Here is my idea of the process. The first step is classification. You have your list of triangles, and you want to tag them with a size classification. This is an easy map operation. Now, you want to collate and count each size class so that you can schedule the proper subdivision jobs. But, how do you collate in parallel?

Now, here’s where the clever tricks come into play, on mastering your higher-order vector functions. After doing a map to classify, we can do further map operations to split by class. Then you have either “in-class” or “null item.” Now we want to convert this array to an array of target indices in the respective collated arrays. This is a special kind of unfold-style map. We can use reduce* and friends that we’ve used for computing factorials to convert this “in-class” or “null item” array to the target indices array. Now we just take the last index in the array and add one to determine the allocation size. We allocate memory accordingly, and finally do a map to store the collated items.
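
A tiny CPU reference of that collate step, with my own names — on the GPU, the exclusive-scan loop in the middle is exactly the part that gets replaced by the parallel reduce/scan primitives, and the final scatter is an easy parallel map:

```c
#include <stdlib.h>

/* Gather the indices of all triangles whose size class equals `cls`.
   Returns the count; *out receives a newly allocated index array. */
static int collate_class(const int *size_class, int n, int cls, int **out)
{
    int *target = malloc(n * sizeof *target);
    int count = 0;
    for (int i = 0; i < n; i++) {          /* exclusive scan over the flags */
        target[i] = count;                 /* "in-class or null" -> destination */
        count += (size_class[i] == cls);
    }
    /* allocation size = last target index + 1 (i.e. the final count) */
    int *bin = malloc((count > 0 ? count : 1) * sizeof *bin);
    for (int i = 0; i < n; i++)            /* scatter, an easy parallel map */
        if (size_class[i] == cls)
            bin[target[i]] = i;
    free(target);
    *out = bin;
    return count;
}
```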

Alternatively, you can use a parallel sort to bin together classes, then a parallel reduce to obtain the counts of each class.

Now, once you’ve got the triangles classified by size, you can run parallel subdivides accordingly. Simply put, the size class is the number of iterations that needs to be performed to subdivide a triangle down to the target size. So, classifying triangles like this allows you to more efficiently run parallel threads executing identical instruction streams in lock-step, keeping no-ops in threads to a minimum.
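
Concretely, I picture the size class as something like this (my own formula, assuming each midpoint-subdivision pass halves the edge lengths):

```c
#include <math.h>

/* Number of subdivision passes needed to bring a triangle's longest edge
   under the target size.  E.g. size_class_of(70.0f, 10.0f) == 3, since
   three halvings take 70 down to 8.75. */
static int size_class_of(float longest_edge, float target_edge)
{
    if (longest_edge <= target_edge)
        return 0;                          /* already small enough */
    return (int)ceilf(log2f(longest_edge / target_edge));
}
```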

Alternatively, rather than subdividing triangles for tiled rasterization, you can simply duplicate large primitives into each tile’s render queue that it intersects with. Then, we rely on the tile knowing how to intersect it and render only the subset that crosses the tile. Essentially, we borrow the same assumption we were using for scanline rendering. The main advantage of this is not simply eliminating the subdivision computation, but also allowing for pixel-perfect rendering of triangles specified via integer coordinates, since slopes cannot be represented precisely on such subdivided triangles.

Then again… even if you don’t need to subdivide the large primitives, classifying them is a bit tougher. How do you classify single pixels for parallel rendering? Easy, you just modulo-divide to compute the tile number. With powers of two, it is just bit shifting. But, for larger objects, you need to do a bounds comparison. If the overlap is at most 4 tiles, such as for primitives that are no larger than the tile itself but may be offset from the tile grid, that is easy. But, for larger objects, you may need to iterate. So then, what it comes down to is that you do need to classify by size, but only for the sake of binning the same large primitive into multiple tiles. Essentially, you can compute the range of tiles it covers up front, but then you need to loop to bin it into all tiles that it occupies. Yeah, literally… it’s a prelude to the underlying parallel rasterize operation!
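
That prelude looks roughly like this in C (power-of-two tile size assumed; the names and the enqueue callback are hypothetical) — for a single pixel or a tile-sized primitive the two loops degenerate to one or a few iterations:

```c
#define TILE_SHIFT 5                       /* 32x32-pixel tiles */

/* Bin a primitive's bounding box (inclusive pixel coords) into every tile it
   touches.  A single pixel reduces to one shift-and-multiply. */
static void bin_bbox(int x0, int y0, int x1, int y1,
                     int tiles_per_row, int prim_id,
                     void (*enqueue)(int tile, int prim))
{
    int tx0 = x0 >> TILE_SHIFT, tx1 = x1 >> TILE_SHIFT;
    int ty0 = y0 >> TILE_SHIFT, ty1 = y1 >> TILE_SHIFT;
    for (int ty = ty0; ty <= ty1; ty++)
        for (int tx = tx0; tx <= tx1; tx++)
            enqueue(ty * tiles_per_row + tx, prim_id);
}
```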

Okay, I think I’ve got this design really solid by now. Parallel compute, I’ve got a good method to get that going overall. The problem still is… idle threads. If you’ve got uniform geometry coverage, idle threads will not be an issue. Otherwise, this comes down to maybe another iteration of classification and binning, building a work queue and issuing bundles of work to work-groups.