Depends a lot on the compiler and target arch. You'll miss a lot of stack accesses, or count too many. You can't get around looking at the final executable if you want good results. And for more complex targets, you ultimately need to know what the pipeline does and how the caches behave if you want a good bound on the cycle count. That's assuming you're on anything more complex than an ATmega, for which op counting might be enough.

I work in this domain; lots of people do measurements, which only give a ballpark and are risky since you might miss the worst case (which is important for safety-critical systems, where a latency spike at the wrong moment might be fatal). Pure op counting is bad since the results grossly overestimate: e.g. you always have to assume a cache miss if you don't know the cache state, or a pipeline stall, or a DRAM refresh, or... Look at the complexity of PowerPC; that should give you a rough idea what we're usually dealing with (and yeah, I'm talking embedded here).
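To make the overestimation point concrete, here's a toy sketch (not a real WCET tool; all latency numbers are made-up assumptions) contrasting a naive op-counting bound, which must assume every access misses, with a bound where static cache analysis can prove some accesses hit:

```python
# Toy illustration of why pure op counting grossly overestimates.
# Latency numbers are invented for the example, not from any real chip.

HIT_CYCLES = 3      # assumed L1 hit latency
MISS_CYCLES = 120   # assumed miss penalty (DRAM access)

def naive_bound(n_mem_ops, n_alu_ops, alu_cycles=1):
    """Op counting with unknown cache state: every memory
    access must be assumed to miss."""
    return n_mem_ops * MISS_CYCLES + n_alu_ops * alu_cycles

def cache_aware_bound(n_mem_ops, n_alu_ops, proven_hits, alu_cycles=1):
    """If cache analysis proves some accesses always hit,
    the bound tightens considerably."""
    misses = n_mem_ops - proven_hits
    return (misses * MISS_CYCLES
            + proven_hits * HIT_CYCLES
            + n_alu_ops * alu_cycles)

# A code region with 1000 memory ops and 2000 ALU ops, where
# analysis proves 900 of the accesses hit the cache:
print(naive_bound(1000, 2000))             # 122000 cycles
print(cache_aware_bound(1000, 2000, 900))  # 16700 cycles
```

The naive bound is over 7x the cache-aware one here, and that's before modeling pipeline stalls; real analysis also has to handle timing anomalies, where a local cache hit can paradoxically lengthen the global worst case.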
To me that "sometimes" feels like "I can wrestle some bears with my bare hands, e.g. a teddy bear" ;-)