The technique itself isn't new in Clang either. This is about a new implementation of it. The difference from the existing implementation is that it runs later in the pipeline: it's deferred to the machine-specific code generation phase, whereas the existing implementation runs in the middle-end and is target-agnostic.
The other major piece is that this hot-cold split is more efficient. Rather than thunking out the cold code via a function call, it just jumps to the cold basic block, which avoids the register spilling and function call overhead.
The function call overhead itself is irrelevant, because by definition these blocks are cold. The saving/restoring of callee-clobbered registers does affect the code size of the hot function though, so that's important.
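To make the call-versus-jump distinction concrete, here's a rough hand-written sketch at the source level. It is only an analogy, not what either compiler pass actually emits. The names `report_negative`, `sum_outlined`, and `sum_inline` are made up for illustration, and `[[gnu::cold]]`, `[[gnu::noinline]]`, and `__builtin_expect` are GCC/Clang extensions, so this assumes one of those compilers.

```cpp
#include <cstdio>

// Hypothetical cold handler, outlined by hand. The `cold` and `noinline`
// attributes (GCC/Clang extensions) keep it out of the hot function's body.
[[gnu::cold, gnu::noinline]] static int report_negative(int value, int index) {
    std::fprintf(stderr, "negative value %d at index %d\n", value, index);
    return 0;  // a rejected element contributes nothing to the sum
}

// Call-based variant: the cold path is reached through a real function call.
// The hot loop has to set up arguments and keep its live values safe across
// the call, and that extra code sits inside the hot function.
int sum_outlined(const int* data, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        if (__builtin_expect(data[i] < 0, 0)) {
            total += report_negative(data[i], i);
        } else {
            total += data[i];
        }
    }
    return total;
}

// Branch-based variant: the cold path is an ordinary basic block of the same
// function. A block-level splitter is free to place such a block elsewhere and
// reach it with a plain jump, with no calling convention involved.
int sum_inline(const int* data, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) {
        if (__builtin_expect(data[i] < 0, 0)) {
            std::fprintf(stderr, "negative value %d at index %d\n", data[i], i);
        } else {
            total += data[i];
        }
    }
    return total;
}
```

Both functions compute the same result. The difference the comments above are pointing at is in the generated code for the hot loop, which you would have to inspect in the disassembly for your own compiler and flags.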
This was common practice at Microsoft in the mid-90s, maybe 95-97ish. The set of apps involved was called BBT: "Basic Block Tools". Windows and SQL Server, among others, were post-processed with profiling data. It also deduped basic blocks, reducing binary bloat from inlining even without profiling data. It just needed some additional info in the debug symbols to work.
Identical code folding still breaks debugging with modern LLVM. It can make it seem like the call came from an impossible place in a stack sample. Is it something that Microsoft solved long ago?