Show HN: CodSpeed – Continuous Performance Measurement

jammycrisp · on July 13, 2023

> we measure the number of instructions and memory/cache accesses through CPU instrumentation performed with Valgrind. This approach gives repeatable and consistent results that couldn’t be obtained with a time based statistical approach, especially in extremely noisy CI and cloud environments.

This is a neat approach! I'm curious how well it maps to actual perf degradations though. Valgrind models an old CPU with a more naive branch predictor. For low-level branch-y code (say a native JSON parser), I'd be curious how well valgrind's simulated numbers map to real world measurements?

My probably naive intuition guesses that some low-level branchy code that valgrind thinks may be slower may run fine on a modern CPU (better branch predictor, deeper cache hierarchy). I'd expect false negatives to be rarer though - if valgrind thinks it's faster it probably is? What's your experience been like here?

art049 · on July 13, 2023

>This is a neat approach! I'm curious how well it maps to actual perf degradations though. Valgrind models an old CPU with a more naive branch predictor. For low-level branch-y code (say a native JSON parser), I'd be curious how well valgrind's simulated numbers map to real world measurements?

We didn't try it on this specific case but on we found that on branchy code valgrind does aggravate the branching cost. Probably, we could mitigate this issue by collecting more data relative to the branching and incorporate those in our reported numbers to map more accurately to the reality and more recent CPUs.

>My probably naive intuition guesses that some low-level branchy code that valgrind thinks may be slower may run fine on a modern CPU (better branch predictor, deeper cache hierarchy). I'd expect false negatives to be rarer though - if valgrind thinks it's faster it probably is? What's your experience been like here?

Totally! We never encountered a false positive in the reports yet. But as you mentioned since valgrind models an old CPU, it's likely to happen. But even though the cache simulated has a quite old, it still improves the relevance of our measures. When we have some time, we'd really enjoy refreshing the cache simulation of valgrind since it would probably eliminate some edge cases and reflect memory accesses more accurately.

foota · on July 13, 2023

Seems interesting... how did you land on valgrind vs some other means of simulation? Looking at valgrind, it sounds like cachegrind? Imo seems like the biggest gap is non instruction, non cache sources of latency, like mutex contention or kernel slowness? (Or does it capture kernel delays?)

Pur most recent performance issues have been from someone accidentally creating a new thread pool in a request, from generating tons of stack traces in an error handling path, and from some thread hop delays. Sounds like the first two would probably be caught but maybe not the third?

on July 13, 2023

[deleted]