I think the opposite is where things need to go.
Having a wide SIMD ALU quickly accessible from your CPU core is very useful, especially since it shares the same memory system and offers a much more flexible programming model that lets you do everything in a single source.
The programming model is not very flexible at the lowest level: one has to build all the software infrastructure to communicate with the GPU (which boils down to sending commands and receiving responses). There are languages (like Futhark, Julia, or even Python) that handle all that boilerplate transparently.
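To make "sending commands and receiving responses" concrete, here is a toy sketch of that host-side handshake. Everything here (Command, FakeDevice, submit, read_back) is made up for illustration; a real driver would write commands into a ring buffer, ring a doorbell register, and wait on a fence before reading results back.

```python
from dataclasses import dataclass, field

@dataclass
class Command:
    kernel: str   # name of the device program to run
    args: tuple   # payload shipped across the bus

@dataclass
class FakeDevice:
    """Stands in for the GPU driver: executes queued commands in order."""
    results: list = field(default_factory=list)

    def submit(self, cmd: Command) -> None:
        # A real driver copies the command into a queue and notifies the
        # device; here we just "execute" it immediately on the host.
        if cmd.kernel == "vec_add":
            a, b = cmd.args
            self.results.append([x + y for x, y in zip(a, b)])

    def read_back(self):
        # Real code would map device memory or wait on a completion fence.
        return self.results.pop(0)

dev = FakeDevice()
dev.submit(Command("vec_add", ([1, 2, 3], [10, 20, 30])))
print(dev.read_back())  # [11, 22, 33]
```

The languages mentioned above hide exactly this plumbing: you write the array expression once and the compiler/runtime generates the command traffic for you.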
The main problem is, afaik, that these languages don't give enough control over where the code will run. At some point, one will want to describe all the algorithms in a single language and somehow describe how the workload should be distributed across all the processors, or at least that's what I've been thinking about for a while. Once you have that level of control, the need for a versatile CPU is less clear. Note that nowadays people seem happy with hybrid solutions where the code is scattered across several languages (e.g., one for the main program and one for the shaders, or for the client-side UI), so maybe my position is not very strong.
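The "one language, explicit placement" idea could look something like this toy sketch: the same source file holds both kinds of code, and a (hypothetical) decorator records where each function should run. A real system would compile each body for its target; here the decorator only tags it.

```python
# Registry mapping function names to their intended processor.
TARGETS = {}

def place(where):
    """Hypothetical annotation: tag a function with 'cpu' or 'gpu'."""
    def wrap(fn):
        TARGETS[fn.__name__] = where
        return fn
    return wrap

@place("gpu")
def saxpy(a, x, y):
    # Regular, data-parallel work: a natural fit for the wide SIMD/GPU side.
    return [a * xi + yi for xi, yi in zip(x, y)]

@place("cpu")
def reduce_max(xs):
    # Irregular control flow: keep it on the latency-optimized core.
    return max(xs)

print(TARGETS)                      # {'saxpy': 'gpu', 'reduce_max': 'cpu'}
print(saxpy(2.0, [1, 2], [3, 4]))  # [5.0, 8.0]
```

Once placement is spelled out like this, a scheduler (not shown) could dispatch each call to the right processor without the programmer switching languages.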
HW-wise, is it possible that integrated GPUs are the first step toward an architecture where the CPU and GPU have better interconnections (i.e., larger communication bandwidth and smaller latency), to the point where SIMD becomes moot? There is also the SWAR approach, where one doesn't rely on intrinsic SIMD instructions but instead emulates them with ordinary scalar ones (though it's probably not very realistic for floating-point computation).
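For reference, here is the classic SWAR ("SIMD within a register") trick, sketched in Python on an emulated 64-bit word: eight 8-bit lanes are added lane-wise with plain scalar operations by masking off each lane's top bit so carries can't cross lane boundaries, then restoring the top bits with XOR.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF
HI = 0x8080808080808080  # top bit of every byte lane

def swar_add_u8(x: int, y: int) -> int:
    """Add eight packed 8-bit lanes at once (each lane wraps mod 256)."""
    low = (x & ~HI & MASK64) + (y & ~HI & MASK64)  # carries stay in-lane
    return (low ^ ((x ^ y) & HI)) & MASK64         # fix each lane's bit 7

def pack(bytes8):
    """Pack eight ints (0..255) into one 64-bit word, lane 0 lowest."""
    acc = 0
    for i, b in enumerate(bytes8):
        acc |= (b & 0xFF) << (8 * i)
    return acc

def unpack(word):
    return [(word >> (8 * i)) & 0xFF for i in range(8)]

a = pack([1, 2, 3, 250, 0, 7, 128, 255])
b = pack([1, 2, 3, 10, 0, 9, 128, 1])
print(unpack(swar_add_u8(a, b)))  # per-lane sums mod 256: [2, 4, 6, 4, 0, 16, 0, 0]
```

This is why SWAR works well for small integer lanes but not floats: there is no bitmask that stops a floating-point rounding/normalization step at a lane boundary.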
Some other ideas:
- Apple has a neural engine in its latest chips, which is basically dedicated HW for neural networks
- In the wild, people are getting more and more interested in building custom ASICs to cut out the software middleman's cost: for them, the CPU solution is not good enough
- Intel recently introduced a matrix-operations extension (AMX) in their CPUs: maybe at some point they'll bake full GPU capabilities directly into the CPU? I am a little worried about the resulting ISA.
Anyway, I am not an HW engineer, nor a very good software one. I only have a limited view of the difficulties of writing good, CPU- or GPU-efficient code. My first post was prompted by remembering the first "large scale" multicore CPUs 15 years ago (specifically the UltraSPARC T1), which weren't SIMD-heavy. The direction naturally shifted toward heavier SIMD as CPUs tried to compete with GPUs, when it seems to me that CPUs and GPUs were originally complementary.
I tend to support modular solutions, but I don't know how costly that would be in terms of efficiency at the HW level.