No, not really. If you're off by a millisecond (the minimum, assuming 64-sample buffers and a 64 kHz sampling rate), frequencies that are even multiples of 500 Hz will be nullified, those that are odd multiples of 500 Hz will be amplified, and the rest will fall somewhere between those extremes. (This is known as a "comb filter"; see [1] and [2].)
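To put numbers on that, here is a minimal sketch of the gain at a few frequencies. It assumes the second sound is a polarity-inverted copy delayed by exactly 1 ms, i.e. y(t) = x(t) - x(t - delay), which is the model that puts the nulls at even multiples of 500 Hz. The name `comb_gain` and the sample frequencies are just illustrative choices.

```python
import math

# Assumed delay: one 64-sample buffer at a 64 kHz sampling rate = 1 ms.
DELAY_S = 64 / 64_000

def comb_gain(freq_hz, delay_s=DELAY_S):
    """Amplitude gain of y(t) = x(t) - x(t - delay_s) for a sine at freq_hz.

    |1 - exp(-j*2*pi*f*delay)| simplifies to 2*|sin(pi*f*delay)|.
    """
    return 2.0 * abs(math.sin(math.pi * freq_hz * delay_s))

if __name__ == "__main__":
    for f in (50, 100, 500, 1000, 1500, 2000):
        print(f"{f:5d} Hz  gain = {comb_gain(f):.3f}")
```

Gains below 1 mean attenuation, gains near 0 mean cancellation, and gains near 2 mean the tone comes out at twice its original amplitude.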
Your examples demonstrate the behavior of such filtering only for frequencies much less than 500 Hz, which will indeed be attenuated. However, a camera "click" sound, being both brief and noisy, contains lots of high-frequency content which will be, on the whole, preserved.
In fact, assuming the "click" sound can be approximated by white noise, its total energy will be doubled by such filtering!
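The doubling can be checked numerically. The sketch below assumes the same model as the comb filter described above (an inverted copy delayed by 1 ms) and a flat spectrum up to an assumed Nyquist frequency of 32 kHz; it averages the power gain 4*sin^2(pi*f*delay) over that band. `mean_power_gain` is a made-up helper name.

```python
import math

DELAY_S = 0.001        # assumed 1 ms offset between the two sounds
NYQUIST_HZ = 32_000    # assumed: half of a 64 kHz sampling rate

def mean_power_gain(delay_s=DELAY_S, f_max=NYQUIST_HZ, steps=32_000):
    """Average power gain of y(t) = x(t) - x(t - delay_s) over 0..f_max.

    For white noise (flat spectrum) this average is the ratio of output
    energy to input energy.
    """
    total = 0.0
    for i in range(steps):
        f = (i + 0.5) * f_max / steps          # midpoint of each frequency bin
        amplitude_gain = 2.0 * math.sin(math.pi * f * delay_s)
        total += amplitude_gain ** 2           # power gain = amplitude gain squared
    return total / steps

if __name__ == "__main__":
    print(f"mean power gain: {mean_power_gain():.6f}")
```

The average comes out at 2 because sin^2 averages to 1/2 over whole periods, and the band here spans a whole number of comb periods.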
So yes, you do need to play the sounds within the same millisecond.
[1] http://en.wikipedia.org/wiki/Comb_filter
[2] http://en.wikipedia.org/wiki/File:Comb_filter_response_ff_ne...