> Accuracy gains are due primarily to the specific use of a DCT representation, which turns out to work curiously well for image classification.
It would seem quantization is a useful tool for any sort of NN-style application.
If the expected output is intended to be human-like, why not feed it information that a typical human could not distinguish from a lossless representation? Seems like a simple game of expectations and information theory.
That's kind of the key theory behind why JPEG (and other lossy encodings) work at all. A perfect being would see a JPEG next to a PNG or TIFF and find the first repugnantly error-ridden.
But we tend to ignore high-frequency data's specifics most of the time, so it psychologically works.
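To make the JPEG point concrete, here is a minimal sketch of the DCT-then-quantize idea on a single row of eight pixels. It is not the real JPEG pipeline (which works on 8x8 blocks with standard quantization tables and entropy coding). The pixel values and the `steps` list are made up for illustration, with coarser steps for the higher-frequency coefficients.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1 / n)
        s += sum(
            c[k] * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def quantize(coeffs, steps):
    """Snap each coefficient to the nearest multiple of its step size."""
    return [round(c / q) * q for c, q in zip(coeffs, steps)]

row = [52, 55, 61, 66, 70, 61, 64, 73]   # illustrative pixel values
steps = [4, 4, 8, 8, 16, 32, 64, 64]     # hypothetical: coarser at high frequencies

coeffs = dct2(row)
restored = idct2(quantize(coeffs, steps))
max_err = max(abs(a - b) for a, b in zip(row, restored))
print("max pixel error after quantization:", round(max_err, 2))
```

The low-frequency coefficients carry most of the signal, so throwing away precision in the high ones changes each pixel only a little, which is the part we tend not to notice.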
I often wonder though, what do my cat and dog hear when I'm playing compressed music? Does it sound like a muddy phone call to them?
Audio is decidedly less "compressible" in human perceptual terms. The brain is amazingly skilled at detecting time delay and frequency deviations, so this perceptual baseline likely extends (mostly) to your pets.
You can fool the eyes a lot more easily. You can take away 50% or more of the color information before even a skilled artist will start noticing.
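One common way this happens in practice is chroma subsampling. Below is a rough sketch of the 4:2:0 scheme, assuming a Y/Cb/Cr image where each chroma plane is averaged over 2x2 blocks. The function names are my own, not from any library.

```python
def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (height and width must be even)."""
    h, w = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1] + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, w, 2)
        ]
        for y in range(0, h, 2)
    ]

def samples_kept(width, height):
    """Fraction of all Y/Cb/Cr samples that survive 4:2:0 subsampling."""
    full = 3 * width * height                                   # Y, Cb, Cr at full resolution
    sub = width * height + 2 * (width // 2) * (height // 2)     # full Y, quarter-size Cb and Cr
    return sub / full

print(samples_kept(8, 8))
```

Each chroma plane keeps only a quarter of its samples, and the image as a whole keeps half, while the brightness channel stays at full resolution.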
There are real differences in audio perception, though. Frequency range and sensitivity to different frequencies are big differences in other animals; I would expect cats (who chase rodents, which often have very high pitched or even ultrasonic vocalizations) to be more sensitive to high frequencies than humans, and thus low passed / low sample rate audio could sound 'bad.'
Another aspect is time resolution. Song birds can have 2-4x the time resolution of human hearing, which helps distinguish sounds in their very fast, complex calls. This may lead to better perception of artifacts in lossy coding schemes, but it's hard to say for sure.
True but hearing is logarithmic in both volume and frequency domains. Double the power does not equate to anything near double the loudness. Similarly each doubling of frequency is only one octave higher. Hearing up to 80khz doesn't mean hearing 4x more than humans... 10 octaves for humans, 12 octaves for cats. In a musical sense it probably isn't noticeable.
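The octave arithmetic is easy to check: the number of octaves in a range is log2(high / low). A quick sketch, using the 20 Hz lower bound and the 20 kHz / 80 kHz upper bounds quoted in this thread (not measured values):

```python
import math

def octaves(low_hz, high_hz):
    """Number of octaves (frequency doublings) between two frequencies."""
    return math.log2(high_hz / low_hz)

human = octaves(20, 20_000)
cat = octaves(20, 80_000)
print(f"human: {human:.2f} octaves, cat: {cat:.2f} octaves, extra: {cat - human:.2f}")
```

So a 4x higher upper limit works out to two extra octaves on top of roughly ten.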
The extreme upper limit of human hearing is around 20khz, so cats really are hearing things that we don't, and for good reasons.
Sensitivity to different frequency ranges is more or less independent of anything else. Birds have heightened frequency response in the range they vocalize in, which helps them hear others of their species. Same for us; we vocalize at relatively low frequencies, so most of our hearing ability is focused on that range. There is also a range below which we don't hear: infrasound, which is utilized by elephants.
Logarithmic perception is certainly real, but the tuning of which frequency ranges an animal is more or less sensitive to is certainly species dependent.
As a comparison with removing the top two octaves from a cat's hearing, try removing the top two octaves from an audio file compared to your hearing range (lowpass at 5 kHz or less if you have hearing range loss, and/or resample to 10 kHz/ksps or less) and see if the results are musically noticeable. (At least for humans, the result is intelligible but heavily muffled, I can't speak for my pet cats though.)
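If you don't have an audio editor handy, here is a rough pure-Python sketch of the 5 kHz low-pass, assuming a Hamming-windowed sinc FIR filter (my choice, any decent low-pass would do) and two test tones instead of real music. For an actual listening test you would run your own audio through the same filter.

```python
import math

def lowpass_taps(cutoff_hz, sample_rate, num_taps=101):
    """Hamming-windowed sinc FIR low-pass, normalised to unity gain at DC."""
    fc = cutoff_hz / sample_rate
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        t = n - mid
        sinc = 2 * fc if t == 0 else math.sin(2 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [t / total for t in taps]

def fir(signal, taps):
    """Valid-mode convolution (output is shorter than the input)."""
    k = len(taps)
    return [
        sum(taps[j] * signal[i + k - 1 - j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

def tone(freq_hz, sample_rate, n):
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

RATE = 44_100
taps = lowpass_taps(5_000, RATE)
kept = rms(fir(tone(1_000, RATE, 2048), taps))    # below the cutoff
cut = rms(fir(tone(10_000, RATE, 2048), taps))    # in the removed top octaves
print(f"1 kHz tone RMS after filter:  {kept:.3f}")
print(f"10 kHz tone RMS after filter: {cut:.5f}")
```

The 1 kHz tone passes essentially untouched while the 10 kHz tone is almost entirely gone, which is where the muffled sound comes from.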
That's fair... I'm just saying you shouldn't compare 20hz-20khz vs 20hz-80khz then decide cats can hear "three times" as much as humans. Two octaves is more than zero but a lot less than 3x.
while work has been done to characterize frequency sensitivity across species (which does vary quite a bit, especially in the higher ranges (>20khz)), i haven't seen any work that has been done to explore frequency domain perceptual masking curves in a cross species context.
since some species use their auditory systems for spatial localization, i would guess that the perceptual system would be totally different in those contexts.
No, audio compression doesn't filter out high frequencies, that's just what computer audio as a whole does. And I don't think there's enough of those high frequency components in what humans typically record for a cat or dog to notice the difference. As far as compression, the tricks that work on us should work on them.
the early xing mp3 codec famously cut everything off above 18khz, but that was out of spec. :)
instead, perceptual audio compression typically discards or coarsely quantizes frequencies that neighbor other frequencies with lots of power. the effect being exploited is called perceptual (auditory) masking: the loud component hides its quieter neighbors. to the best of my knowledge, we do not actually know if it works the same way in animal auditory systems.
>MP3 compression works by reducing (or approximating) the accuracy of certain components of sound that are considered (by psychoacoustic analysis) to be beyond the hearing capabilities of most humans.
-via Wikipedia
This holds true for most other audio compression as well.
Now, it's true that max recording frequency is bounded by sample rate via the Nyquist theorem, but that doesn't mean we're incapable of recording at higher fidelity - we just don't bother most of the time, because on consumer hardware it's going to be filtered out eventually anyway (or just not reproduced well enough, due to low-quality physical hardware). Recording studios will regularly produce masters that far exceed that normal hearing range though.
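For anyone who wants the Nyquist bound in numbers, a small sketch: the highest representable frequency is half the sample rate, and a tone above that folds back down (aliases) unless it is filtered out first. The 30 kHz tone and the 96 kHz rate are just illustrative picks.

```python
def nyquist_hz(sample_rate_hz):
    """Highest frequency that can be represented at a given sample rate."""
    return sample_rate_hz / 2

def alias_hz(tone_hz, sample_rate_hz):
    """Apparent frequency of a pure tone sampled with no anti-alias filter."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

print(nyquist_hz(44_100))          # limit for CD-rate audio
print(alias_hz(30_000, 44_100))    # a 30 kHz tone folds down into the audible band
print(alias_hz(30_000, 96_000))    # at 96 kHz the same tone is captured as-is
```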