Anton Schreiner

06-11-2022

16:23

Also wanted to share some unfinished thoughts on GPU traversals. Let's imagine we have a two-level tree for checking whether a point is inside any of the leaf nodes. On the first level we have planes and on the second AABB nodes. A simple impl is 2 nested loops: the outer one walks the plane level, the inner one walks the AABB level.

we know that the tree always has planes on top and ends with AABB nodes, and we need to check whether a point is inside any of the leaf AABB nodes
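To make the setup concrete, here is a minimal scalar sketch of the nested-loop query. Everything named here (`AABB`, `BottomNode`, `PlaneNode`, `point_in_tree`) is made up for illustration. It also assumes that a plane node only passes the point on to its children when the point is on its non-negative side, and that the bottom level is a flat list of boxes; the actual node layout isn't spelled out above.

```python
from dataclasses import dataclass, field


@dataclass
class AABB:
    lo: tuple
    hi: tuple

    def contains(self, p):
        # Inside if every coordinate lies within [lo, hi] on its axis.
        return all(l <= x <= h for l, x, h in zip(self.lo, p, self.hi))


@dataclass
class BottomNode:
    # Assumption: the bottom level is flattened into a list of boxes.
    boxes: list


@dataclass
class PlaneNode:
    normal: tuple
    offset: float
    children: list = field(default_factory=list)  # PlaneNode or BottomNode

    def accepts(self, p):
        # Assumption: descend only if dot(normal, p) + offset >= 0.
        return sum(n * x for n, x in zip(self.normal, p)) + self.offset >= 0.0


def point_in_tree(root, p):
    stack = [root]
    while stack:                      # outer loop: plane level
        node = stack.pop()
        if isinstance(node, PlaneNode):
            if node.accepts(p):
                stack.extend(node.children)
        else:
            for box in node.boxes:    # inner loop: AABB level
                if box.contains(p):
                    return True
    return False


# Hypothetical example tree: two planes on top, two bottom nodes.
tree = PlaneNode((1.0, 0.0, 0.0), 0.0, [
    PlaneNode((0.0, 1.0, 0.0), 0.0, [BottomNode([AABB((1, 1, 0), (2, 2, 1))])]),
    BottomNode([AABB((5, -1, -1), (6, 0, 0))]),
])
print(point_in_tree(tree, (1.5, 1.5, 0.5)))
print(point_in_tree(tree, (-1.0, 1.5, 0.5)))
```

On a GPU each thread would run this with its own point, and the inner `for` is where threads in the same wave start to diverge.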

with a simple impl, threads that encounter an AABB node enter the second loop while the other threads in the wave have to wait, which is not good.

a better approach would be a 2-pass traversal. The 1st pass only collects the bottom tree nodes and waits for all the threads to finish collecting. The 2nd pass does the bottom-level traversal with all the threads that finished collecting.

much better lane occupancy, i.e. less yellow, which designates disabled lanes
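A rough way to put numbers on that is a toy cost model of one wave running in lockstep. This is only a sketch under loud assumptions: each lane's traversal is given up front as a list of steps, a plane test and an inner-loop iteration each cost one wave instruction, and appending a bottom node to a per-lane list in pass 1 costs the same as a plane test. The function names are hypothetical.

```python
# Each lane is a list of steps in visiting order:
#   0     -> plane test (top level)
#   k > 0 -> bottom-level node that needs k inner-loop iterations


def naive_nested(lanes):
    """Wave instructions issued by the 2-nested-loops version."""
    issued = 0
    for step in range(max(len(lane) for lane in lanes)):
        current = [lane[step] for lane in lanes if step < len(lane)]
        if any(k == 0 for k in current):
            issued += 1          # plane branch, AABB lanes masked off
        issued += max(current)   # inner loop lasts as long as the slowest lane
    return issued


def two_pass(lanes):
    """Wave instructions issued by collect-then-traverse."""
    pass1 = max(len(lane) for lane in lanes)  # plane test or list append
    pass2 = max(sum(lane) for lane in lanes)  # each lane drains its own list
    return pass1 + pass2


def occupancy(lanes, issued):
    """Fraction of lane slots doing plane tests or AABB iterations."""
    useful = sum(lane.count(0) + sum(lane) for lane in lanes)
    return useful / (issued * len(lanes))


# Hypothetical wave of 4 lanes; three of them hit a bottom node,
# each at a different outer-loop step.
wave = [
    [0, 4, 0, 0],
    [0, 0, 4, 0],
    [0, 0, 0, 4],
    [0, 0, 0, 0],
]
for name, fn in (("naive", naive_nested), ("two-pass", two_pass)):
    issued = fn(wave)
    print(name, issued, occupancy(wave, issued))
```

In this model the naive version pays for the inner loop once per outer step in which any lane hits a bottom node, while the two-pass version pays only for the lane with the most bottom-level work in total. The gap is largest when lanes hit their AABB nodes at different outer steps, as in the example wave.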

lanes are kind of an imaginary entity, haven't seen a good way of visualizing them for real workloads (except for HW simulators). every time you see a v_*_f32 instruction you need to remember that there's an exec mask that turns lanes off/on. The control flow is actually scalar.

So that piece of context (the wave) is going to keep spinning as long as there's at least one bit set in the exec mask. HW doesn't magically reorder, stop and continue per-lane progress, i.e. there's only one instruction pointer per wave, which is why it's called SIMD/SIMT
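The "keeps spinning while any exec bit is set" behaviour can be sketched with a tiny model of a wave executing a loop. This is an illustration, not how any particular ISA encodes it: `run_wave_loop` and its `trip_counts` argument (how many iterations each lane wants) are hypothetical, and the exec mask is modelled as a plain integer bitmask.

```python
def run_wave_loop(trip_counts):
    """Return the exec mask seen at the start of every loop iteration."""
    width = len(trip_counts)
    remaining = list(trip_counts)
    # Lanes with nothing to do never get their exec bit set.
    exec_mask = sum(1 << lane for lane in range(width) if remaining[lane] > 0)
    history = []
    while exec_mask != 0:              # scalar branch on the whole mask
        history.append(exec_mask)
        next_mask = exec_mask
        for lane in range(width):      # one "vector instruction"
            if (exec_mask >> lane) & 1:
                remaining[lane] -= 1   # only enabled lanes make progress
                if remaining[lane] == 0:
                    next_mask &= ~(1 << lane)  # lane is done, turn it off
        exec_mask = next_mask
    return history


masks = run_wave_loop([1, 3, 2, 0])
print([format(m, "04b") for m in masks])
```

The loop body is issued once per iteration for the whole wave, so the number of iterations equals the largest trip count, and lanes that finish early just sit there with their bit cleared.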

in the literature you can read about all kinds of models with a separate instruction pointer per lane and how that could improve occupancy in some cases. AFAIK there's no HW that actually does that, could be wrong


