OP specifically mentioned contention, though -- not marginally higher cost of atomic inc/dec vs plain inc/dec.
> For our use case, we in fact do not use std::shared_ptr in our implementation, but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention).
A single-threaded program will not have cross-core contention whether it uses std::atomic<> refcounts or plain integer refcounts, period. You're right that non-atomic refcounts can be anywhere from somewhat cheaper to a lot cheaper than atomic refcounts, depending on the platform. But that is orthogonal to cross-core contention.
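To make the distinction concrete, here is a minimal sketch of the kind of class the quoted comment describes -- my own invention (including the name local_shared_ptr), not their code. The refcount is a plain integer, so bumping it is an ordinary add with no lock prefix and no cross-core cache traffic, but the class is only safe if every owner of a given object lives on the same thread.

```cpp
#include <cstddef>
#include <utility>

// Minimal sketch of a single-threaded, non-atomic refcounted pointer.
// The count is a plain std::size_t, so inc/dec compile to ordinary adds
// with no lock prefix and no memory-ordering constraints.
template <typename T>
class local_shared_ptr {
    struct control_block {
        T value;
        std::size_t refs;
    };
    control_block* cb_ = nullptr;

    void release() {
        if (cb_ && --cb_->refs == 0) delete cb_;  // plain decrement
        cb_ = nullptr;
    }

public:
    local_shared_ptr() = default;

    template <typename... Args>
    static local_shared_ptr make(Args&&... args) {
        local_shared_ptr p;
        p.cb_ = new control_block{T(std::forward<Args>(args)...), 1};
        return p;
    }

    local_shared_ptr(const local_shared_ptr& other) : cb_(other.cb_) {
        if (cb_) ++cb_->refs;                     // plain increment, no std::atomic
    }

    local_shared_ptr& operator=(const local_shared_ptr& other) {
        if (this != &other) {
            release();
            cb_ = other.cb_;
            if (cb_) ++cb_->refs;
        }
        return *this;
    }

    ~local_shared_ptr() { release(); }

    T* get() const { return cb_ ? &cb_->value : nullptr; }
    T& operator*() const { return cb_->value; }
    T* operator->() const { return &cb_->value; }
};
```

Copying one of these costs a single plain increment; the win over std::shared_ptr is the cheaper instruction and the extra compiler freedom, not the absence of contention, since a single thread never contends with itself.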
> not marginally higher cost of atomic inc/dec vs plain inc/dec.
Note that the difference is not so marginal, and it is not just in the hardware instructions: non-atomic operations generally allow for more optimizations by the compiler.
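A hedged illustration of the compiler-optimization point (my example, not from the thread): with a plain integer the compiler can coalesce refcount updates or keep the count in a register, while with std::atomic it generally emits one locked read-modify-write per call, even with memory_order_relaxed.

```cpp
#include <atomic>

// Plain int: the three increments typically fold into a single `refs += 3`,
// and the count can stay in a register across surrounding code.
void bump_plain(int& refs) {
    ++refs;
    ++refs;
    ++refs;
}

// std::atomic: current GCC/Clang emit three separate locked adds on x86-64;
// in practice they do not coalesce atomic RMWs, even with memory_order_relaxed.
void bump_atomic(std::atomic<int>& refs) {
    refs.fetch_add(1, std::memory_order_relaxed);
    refs.fetch_add(1, std::memory_order_relaxed);
    refs.fetch_add(1, std::memory_order_relaxed);
}
```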
The actual intrinsic is like 8-9 cycles on Zen4 or Ice Lake (vs 1 for plain add). It's something if you're banging on it in a hot loop, but otherwise not a ton. (If refcounting is hot in your design, your design is bad.)
It's comparable to, like, two integer multiplies, or a single integer division. Yes, there is also some effect on ordering, since the atomic constrains how surrounding memory operations can be reordered.
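If you want to see the per-operation gap on your own machine, here is a rough microbenchmark sketch -- mine, not the commenter's numbers -- assuming GCC or Clang on x86-64, where the empty inline asm acts as an optimization barrier. The exact figures vary by microarchitecture and clock speed.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>

int main() {
    constexpr long N = 100'000'000;

    // Plain increment in a hot loop. The empty asm statement stops the
    // compiler from folding the loop into a single `plain += N`, so the
    // comparison stays apples-to-apples.
    long plain = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) {
        ++plain;
        asm volatile("" : "+r"(plain));
    }
    auto t1 = std::chrono::steady_clock::now();

    // Relaxed atomic increment: one locked add per iteration, uncontended
    // (single thread), so this measures the instruction cost only.
    std::atomic<long> shared{0};
    auto t2 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) {
        shared.fetch_add(1, std::memory_order_relaxed);
    }
    auto t3 = std::chrono::steady_clock::now();

    auto ns = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::nanoseconds>(b - a).count();
    };
    std::printf("plain : %.2f ns/op (count %ld)\n", double(ns(t0, t1)) / N, plain);
    std::printf("atomic: %.2f ns/op (count %ld)\n", double(ns(t2, t3)) / N, shared.load());
}
```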
Can’t you have cross-core contention purely because of other processes doing atomics that happen to collide on a cache-line address in the lock broadcast?
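Not an answer from the thread, but a hedged sketch of the underlying effect: unrelated processes normally don't share cache lines unless they map shared memory, and on modern x86 an aligned locked operation typically locks only the owning cache line rather than broadcasting a bus lock, so the collision you would actually observe is two cores hammering atomics on the same line. The sketch below (my code, assuming 64-byte cache lines) shows that line-level contention within one process by comparing counters that share a line against counters padded onto separate lines.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SameLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};              // typically lands on a's cache line
};

struct SeparateLines {
    alignas(64) std::atomic<long> a{0};  // each counter gets its own 64-byte line
    alignas(64) std::atomic<long> b{0};
};

// Two threads each hammer their own counter; only the cache-line placement
// differs between the two runs. Returns a rough ns-per-increment figure.
template <typename Counters>
double run(long iters) {
    Counters c;
    auto start = std::chrono::steady_clock::now();
    std::thread w1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread w2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
    w1.join();
    w2.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iters;
}

int main() {
    const long iters = 50'000'000;
    std::printf("same cache line     : %.2f ns/op\n", run<SameLine>(iters));
    std::printf("separate cache lines: %.2f ns/op\n", run<SeparateLines>(iters));
}
```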