Whole program optimization is good if your inputs never change and you have the infrastructure to benchmark your build and recompile with profiling on each release. JITs do this automatically. A team at work had this release process:
1. Cut the branch at around midnight.
2. Run the build with a portion of production traffic for a few hours.
3. Collect the profiling info and feed it back into the build.
4. Repush the binary to be tested again.
5. In the morning, the team would manually push the binary to prod.
The benefit was clear: about a 20% reduction in CPU. However, getting to this level of automation is not easy, and you get it out of the box with JITs.
One other thing: it's easy to become dependent on such performance gains. The team that had this process got into a difficult situation: there was a bug in one of the releases, they couldn't roll back, and they had to cherry-pick a fix. A change of a few lines of code had to go through the whole push-profile-rebuild-test cycle before it could be rolled out. Pushing the non-profiled build would have violated several latency SLOs and fired pagers. Instead, they had to wait several hours with the bug in place, stressing over how soon the profiling would be done.
Better PGO tooling can use the profiles from a previous version of the code, which is almost but not quite the same, to compile a PGO-optimized build of the patched version.
If there is no tooling to do that, a subset of the training data, small enough to be processed quickly, can be used to gather enough profile data to get most of the benefit. So, say, instead of 20% faster code after 6 hours, you get 10% faster code after 15 minutes.
It is also possible to use PGO to find the critical optimizations in the PGO-optimized build that produce most of the gains, and then either add annotations in the code (branch taken, branch not taken, force inline, never inline, etc.) or split functions the way the PGO-optimized build does (e.g., in the common case, inline the guarding if statement at the beginning of the function but leave the rest of the function out of line).
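As a minimal C++ sketch of that pattern, assuming a GCC/Clang-style toolchain: a hypothetical cache lookup where the hot hit path is force-inlined and its guarding branch is marked as the expected case, while the cold miss path is split into a separate never-inlined function, roughly the shape a PGO-optimized build tends to produce. The Cache type and its methods are invented for the example; the attributes shown are GCC/Clang spellings (C++20's [[likely]]/[[unlikely]] are an alternative for the branch hints).

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical lookup cache, used only to illustrate the annotation pattern.
struct Cache {
    std::unordered_map<std::string, int64_t> entries;

    // Slow path: kept out of line so the hot caller stays small. A PGO build
    // often makes the same split automatically; here it is forced by hand.
    __attribute__((noinline)) int64_t load_and_insert(const std::string& key);

    // Hot path: force-inlined into callers, with the guarding branch marked
    // as the expected (taken) case, mimicking what the profile told the
    // compiler in the PGO-optimized build.
    __attribute__((always_inline)) inline int64_t get(const std::string& key) {
        auto it = entries.find(key);
        if (__builtin_expect(it != entries.end(), 1)) {  // common case: cache hit
            return it->second;
        }
        return load_and_insert(key);  // rare case: miss, stays out of line
    }
};

int64_t Cache::load_and_insert(const std::string& key) {
    // Stand-in for an expensive miss path (disk read, RPC, recompute).
    int64_t value = static_cast<int64_t>(key.size());
    entries.emplace(key, value);
    return value;
}

int main() {
    Cache c;
    c.get("warm-up");                       // miss: takes the out-of-line path
    return c.get("warm-up") == 7 ? 0 : 1;   // hit: takes the inlined fast path
}
```

The design choice mirrors what the profiled binary does: the branch hint and forced inlining keep the hit path cheap at every call site, while the noinline miss path keeps cold code out of the instruction cache.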