Short answer: because compute shaders give you more effective tools for performing complex computations.
Long answer:
Perhaps the biggest advantage they provide (in the case of tracing) is the ability to control exactly how work is executed on the GPU. This matters when you are tracing a complex scene. If your scene is trivial (e.g. a Cornell Box), the difference is negligible. Trace some spheres in your fragment shader all day long. Check out http://shadertoy.com/ to witness the madness that can be achieved with modern GPUs and fragment shaders.
But. If your scene and shading are quite complex, you need to control how the work is done. Rendering a quad and tracing in a fragment shader will, at best, make your application hang while the driver cries, changes its legal name and moves to the other side of the world ... and, at worst, crash the driver. Many drivers will abort if a single operation takes too long (which almost never happens under standard use, but will happen very quickly once you start tracing 1M-poly scenes).
So, you are doing too much work in the fragment shader ... what is the next logical step? Well, limit the workload. Draw smaller quads to control how much of the screen you are tracing at once. Or use glScissor. Make the workload smaller and smaller until your driver can handle it.
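To make the "smaller and smaller workload" idea concrete, here is a minimal sketch (in Python, purely for illustration) of splitting the screen into tiles. The function name `scissor_tiles` and the 256-pixel tile size are my own assumptions, not anything from a real API; each rectangle is in the (x, y, width, height) form that glScissor takes.

```python
def scissor_tiles(width, height, tile):
    """Yield (x, y, w, h) rectangles that together cover a width x height screen.

    Hypothetical helper: each rectangle could be handed to glScissor (or drawn
    as a small quad) so that only one tile is traced per draw call.
    """
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            # Clamp the last row/column so tiles never extend past the screen.
            yield (x, y, min(tile, width - x), min(tile, height - y))


# Example: a 1920x1080 screen traced in tiles of at most 256x256 pixels.
tiles = list(scissor_tiles(1920, 1080, 256))
```

If the driver still chokes, you would shrink `tile` and try again, which is exactly the manual tuning described above.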
Guess what we just reinvented? Compute shader work groups! Work groups are the compute shader's mechanism for controlling job size, and they are a much better abstraction for it than fragment-level hackery (when we are dealing with a task this complex). Now we can control very precisely how many rays we dispatch, and we can do so without being tightly coupled to screen space. For a simple tracer, this just adds unnecessary complexity. For a "real" one, it means we can easily do sub-pixel raycasting on a jittered grid for AA, a huge number of raycasts per pixel for path tracing if we want, etc.
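As a rough illustration of what "decoupled from screen space" buys you, the sketch below counts rays rather than pixels and works out how many work groups would cover them, then generates jittered-grid sample positions for one pixel. The names (`num_work_groups`, `jittered_samples`) and the numbers (640x480, 4x4 samples, a local size of 64) are assumptions chosen for the example; in a real compute shader the sampling would happen per invocation on the GPU.

```python
import random


def num_work_groups(total_rays, local_size):
    """Number of work groups needed to cover total_rays (ceiling division)."""
    return (total_rays + local_size - 1) // local_size


def jittered_samples(px, py, n, rng):
    """Return n*n sample positions inside pixel (px, py) on a jittered grid.

    The pixel is split into n x n cells and one random point is taken per cell.
    """
    samples = []
    for j in range(n):
        for i in range(n):
            sx = px + (i + rng.random()) / n
            sy = py + (j + rng.random()) / n
            samples.append((sx, sy))
    return samples


# Assumed example sizes: 640x480 image, 4x4 rays per pixel, 64 threads per group.
width, height, n, local_size = 640, 480, 4, 64
total_rays = width * height * n * n
groups = num_work_groups(total_rays, local_size)
samples = jittered_samples(10, 20, n, random.Random(0))
```

The point is that the job size is expressed in rays: changing `n` changes the amount of work dispatched without touching the image resolution.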
Other features of compute shaders that are useful for performant, industrial-strength tracers:
- Shared memory among the threads of a work group (allows, for example, packet tracing, in which an entire packet of spatially coherent rays is traced at the same time to exploit memory coherence and the ability to communicate with neighboring rays)
- Scatter writes allow compute shaders to write to arbitrary image locations (note: images and textures differ in subtle ways, but the advantage remains relevant); you no longer have to trace directly from a known pixel location.
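Both ideas can be sketched on the CPU, with the caveat that this is only a toy model and not how a GPU tracer is actually written. `packetize` groups rays by the screen tile they start in, which is roughly the set of rays one work group would trace together; `scatter_accumulate` lets every contribution pick its own destination pixel. Both function names and the tiny 4x4 image are assumptions for illustration.

```python
def packetize(ray_pixels, packet_dim):
    """Group ray origins (pixel coordinates) into packets of neighbouring pixels."""
    packets = {}
    for x, y in ray_pixels:
        key = (x // packet_dim, y // packet_dim)  # which tile the ray starts in
        packets.setdefault(key, []).append((x, y))
    return packets


def scatter_accumulate(width, height, hits):
    """Accumulate (x, y, value) contributions into a flat image buffer.

    Each contribution chooses its own destination pixel, the way a compute
    shader invocation can write to any image location. On a real GPU,
    concurrent writes to the same pixel would need atomics or a reduction.
    """
    image = [0.0] * (width * height)
    for x, y, value in hits:
        if 0 <= x < width and 0 <= y < height:  # discard off-screen writes
            image[y * width + x] += value
    return image


# Toy example: 16 rays on a 4x4 image grouped into 2x2 packets,
# then a few contributions splatted to arbitrary pixels.
pixels = [(x, y) for y in range(4) for x in range(4)]
packets = packetize(pixels, 2)
image = scatter_accumulate(4, 4, [(0, 0, 1.0), (3, 2, 0.5), (3, 2, 0.25), (9, 9, 1.0)])
```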
In general, the architecture of modern GPUs is designed to support this kind of task more naturally through compute. Personally, I wrote a real-time progressive path tracer using MLT, kd-tree acceleration and a number of other computationally expensive techniques (PT is already very expensive). I tried to stay in a fragment shader / full-screen quad for as long as I could. Once my scene was complex enough to require an acceleration structure, my driver began to choke no matter what hackery I pulled. I re-implemented it in CUDA (not quite the same as compute, but it leverages the same fundamental GPU architectural advances), and all was well with the world.
If you really want to dig deep, take a look at section 3.1 here: https://graphics.cg.uni-saarland.de/fileadmin/cguds/papers/2007/guenther_07_BVHonGPU/Guenter_et_al._-_Realtime_Ray_Tracing_on_GPU_with_BVH-based_Pack . Honestly, the best answer to this question would be an extensive discussion of GPU microarchitecture, and I'm not at all qualified to give it. Looking at modern GPU tracing papers like the one above will give you an idea of how deep the performance considerations go.
One final note: any performance advantage of compute over fragment shaders in the context of raytracing a complex scene has nothing to do with rasterization or vertex shader overhead and the like. For a complex scene with complex shading, the bottlenecks are entirely in the tracing computations, which, as discussed, compute shaders have tools for implementing more efficiently.