Sometimes, even if you know you're starting with somewhat suboptimal performance, the ability to take CPU code you've already written and tested and run it on the GPU is very valuable.
Many years ago (approx 2011-2012) my own introduction to CUDA came by way of a neat .NET library, Cudafy, that allowed you to annotate certain methods in your C# code for GPU execution. Obviously the subset of C# that could be supported was quite small, but it was "the same" code you could use elsewhere, so you could test (slowly) the nominal correctness of your code on the CPU first. Even now GPU tooling/debugging is not as good as on the CPU, and back then it was way worse, so being able to debug/test nearly identical code on the CPU first was a big help. Of course sometimes the abstraction broke down and you ended up having to look at the generated CUDA source, but that was pretty rare.
This was a while ago, not long after Unity released Unity.Mathematics and Burst. I was porting (part of) my CPU toy pathtracer to a compute shader. At one point, I literally just copy-pasted chunks of my CPU code straight into an HLSL file, fully expecting it to throw some syntax errors or need tweaks. But nope. It ran perfectly, no changes needed. It felt kinda magical and made me realize I could actually debug stuff on the CPU first, then move it over to the GPU with almost zero hassle.
For folks who don't know: Unity.Mathematics is a package that ships a low-level math library whose types (`float2`, `float3`, `float4`, `int4x4`, etc.) are a 1-to-1 mirror of HLSL's built-in vector and matrix types. Because the syntax, swizzling, and operators are identical, any pure-math function you write in C# compiles under Burst to SIMD-friendly machine code on the CPU and can be dropped into a `.hlsl` file with almost zero edits for the GPU.
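To make the mirror concrete, here is a minimal sketch (the function and its names are my own illustration, not taken from the comments above): a pure-math routine written against Unity.Mathematics that Burst can compile on the CPU, and whose body also works as HLSL with essentially no edits.

```csharp
using Unity.Mathematics;
using static Unity.Mathematics.math;

public static class ShadingMath
{
    // Illustrative Lambertian diffuse term. float3, dot, normalize and max
    // are spelled the same way in HLSL, so only the C# class/static
    // scaffolding around this body is language-specific.
    public static float3 Diffuse(float3 albedo, float3 normal, float3 lightDir)
    {
        float ndotl = max(dot(normalize(normal), normalize(lightDir)), 0.0f);
        return albedo * ndotl;
    }
}
```

Pasted into a `.hlsl` file, the same function compiles once the `public static` qualifiers and the enclosing class are stripped, which is what makes the "debug on CPU, then move to GPU" workflow described above so cheap.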