
It's true that it quantizes (aka bins) the samples, so it isn't right for tests that need to be 100% sample-perfect, at least vertically. It's a compromise between a few tradeoffs: easy readability just from looking at the code itself (you could use images, but then there's a separate file to keep track of, or you're looking at binary data as a float[]) versus strict correctness. How you weigh those tradeoffs definitely depends on what you're doing, and in my case, most of the potential bugs relate to horizontal time resolution, not vertical sample-depth resolution.

If the precise values of these floats are important in your domain (which they very well may be), a combination of approaches would probably be good! Would love to hear how well this approach works for you. Keep me updated :)



I'm not sure it makes sense to separate "vertical" correctness from "horizontal" correctness when the question is "did the feature behave", though. To extend the example in TFA: if your fade progress went from 0 -> 0.99 but stopped before it actually reached 1 for some reason, you might find that you still had a small but present signal on the output, and if its peak-to-peak amplitude was < 0.1, the test wouldn't catch it.
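A tiny sketch of that failure mode (hypothetical bin size of 0.1, plain Python): a residual signal smaller than one vertical bin quantizes to flat zero, so a plot-based comparison sees nothing.

```python
import math

def quantize(samples, step=0.1):
    # Round each sample to the nearest multiple of `step`,
    # mimicking the vertical binning of a character-based plot.
    return [round(s / step) * step for s in samples]

# A leftover signal with peak-to-peak amplitude 0.08 (< one bin)...
residual = [0.04 * math.sin(2 * math.pi * i / 16) for i in range(64)]

# ...quantizes to all zeros, so the plot-based test passes anyway.
print(all(q == 0.0 for q in quantize(residual)))  # True
```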

Obviously, any time you're working with floating-point sample data, the precise values will almost never be bit-accurate against what your model predicts (sometimes even if that model is a previous run of the same system with the same inputs, as in this case); it's about defining an acceptable deviation. I guess what I'm saying is that for audio software, a peak-to-peak error of 0.1 equates to a signal at -20 dBFS (ref dBFS @ 1.0), which is of course quite a large amount of error for an audio signal, so perhaps using higher-resolution graphs would be a good idea.
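For reference, the conversion being used there (hypothetical helper name, plain Python):

```python
import math

def peak_to_dbfs(peak, ref=1.0):
    # dB of a peak amplitude relative to full scale (ref = 1.0).
    return 20 * math.log10(peak / ref)

print(peak_to_dbfs(0.1))  # -20.0 dBFS: a loud error for audio
```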

(Has anyone made a tool to diff sixels yet? /s)


Fair points here. Unfortunately, adding more vertical resolution starts to get a little unwieldy to navigate through. Maybe it could use different characters to multiply the resolution to something sufficiently less forgiving of errors. If it could choose between even 3 characters, for example, it would effectively squash 3 possible values into one line, tripling the resolution.
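One possible shape for that (hypothetical glyph set; three sub-levels per character row triples the effective vertical resolution without adding lines):

```python
def row_char(value, row_lo, row_hi):
    # Map a value within one row's span to one of three glyphs,
    # so each character row encodes three sub-bins instead of one.
    glyphs = "_-~"  # hypothetical low / mid / high markers
    span = row_hi - row_lo
    idx = int((value - row_lo) / span * 3)
    return glyphs[min(max(idx, 0), 2)]

# Values near the bottom, middle, and top of a 0.0-0.3 row:
print(row_char(0.02, 0.0, 0.3))  # '_'
print(row_char(0.15, 0.0, 0.3))  # '-'
print(row_char(0.28, 0.0, 0.3))  # '~'
```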


I think more resolution may give you more false negatives, which might not be helpful. We've used similar tools for integration testing at work, and the smallest, usually irrelevant, change can bust the reference cases because of the high detail in the reference. That means going through all the changed tests only to find that everything is still fine.

For this, just thinking about sound: I wonder if you could invert the reference waveform and add it to the test signal to see how well it cancels? Then instead of just knowing there was a diff, you'd get a measurement of the degree of the diff.
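That null-test idea might look something like this (hypothetical helper, plain Python): add the inverted reference to the test signal and report the residual as an RMS level in dB, rather than a bare pass/fail.

```python
import math

def residual_db(test, reference, floor=-120.0):
    # Sum the test signal with the inverted reference and measure
    # the RMS of whatever fails to cancel, in dB (ref = 1.0 RMS).
    residual = [t - r for t, r in zip(test, reference)]
    rms = math.sqrt(sum(s * s for s in residual) / len(residual))
    return 20 * math.log10(rms) if rms > 0 else floor

ref = [math.sin(2 * math.pi * i / 32) for i in range(256)]

print(residual_db(ref, ref))  # -120.0: perfect cancellation
# A 0.1% gain error leaves a residual around -63 dB: small,
# but now it's a measurement instead of a binary diff.
print(residual_db([s * 0.999 for s in ref], ref))
```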



