
> but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention

Why would there be contention in a single threaded program?



Atomics aren't free even without contention. The slogan of the language is "you don't pay for what you don't use", and it's really not great that there's no non-atomic refcount in the standard. The fact that it is atomic by default has also led people to assume guarantees that it doesn't provide, which was trivially predictable when the standard first introduced it.


OP specifically mentioned contention, though -- not marginally higher cost of atomic inc/dec vs plain inc/dec.

> For our use case, we in fact do not use std::shared_ptr in our implementation, but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention).

A single-threaded program will not have cross-core contention whether it uses std::atomic<> refcounts or plain integer refcounts, period. You're right that non-atomic refcounts can be anywhere from somewhat cheaper to a lot cheaper than atomic refcounts, depending on the platform. But that is orthogonal to cross-core contention.


> not marginally higher cost of atomic inc/dec vs plain inc/dec.

Note that the difference is not so marginal, and it is not just in the hardware instructions: non-atomic operations generally allow more optimization by the compiler.


The actual intrinsic is like 8-9 cycles on Zen4 or Ice Lake (vs 1 for plain add). It's something if you're banging on it in a hot loop, but otherwise not a ton. (If refcounting is hot in your design, your design is bad.)

It's comparable to like, two integer multiplies, or a single integer division. Yes, there is some effect on program order.


Can’t you have cross core contention just purely because of other processes doing atomics that happen to have a cache line address collision in the lock broadcast?


Related to this, GNU's libstdc++ shared_ptr implementation actually opts not to use atomic arithmetic when it infers that the program is not using threads.


I never heard of this and went to check in the source and it really does exist: https://codebrowser.dev/llvm/include/c++/11/ext/concurrence....


The code you linked is a compile-time configuration option, which doesn't quite match "infer" IMO. I think GP is thinking of the way that libstdc++ basically relies on the linker to tell it whether libpthread is linked in and skips atomic operations if it isn't [0].

[0]: https://snf.github.io/2019/02/13/shared-ptr-optimization/


It's a compile-time flag which is defined when libpthread is linked into the binary.


Sure, but I think that's independent of what eMSF was describing. From libgcc/gthr.h:

    /* If this file is compiled with threads support, it must
           #define __GTHREADS 1
       to indicate that threads support is present.  Also it has define
       function
         int __gthread_active_p ()
       that returns 1 if thread system is active, 0 if not.
I think the mechanism eMSF was describing (and the mechanism in the blogpost I linked) corresponds to __gthread_active_p().

I think the distinction between the two should be visible in some cases - for example, what happens for shared libraries that use std::shared_ptr and don't link libpthread, but are later used with a binary that does link libpthread?


Hm, not sure. I can see that shared_ptr::_M_release [0] is implemented in terms of __exchange_and_add_dispatch [1], which is in turn implemented in terms of __is_single_threaded [2]. __is_single_threaded falls back to __gthread_active_p only when __GTHREADS is defined and glibc's __libc_single_threaded (from <sys/single_threaded.h>) is not available.

Implementation of __gthread_active_p is indeed a runtime check [3] which AFAICS applies only to single-threaded programs. Perhaps the shared-library use-case also fits here?

Strange optimization IMHO, so I wonder what the motivation behind it was. The fast path in this case relies on a plain word-sized increment [4] without actually using the atomics [5].

[0] https://codebrowser.dev/llvm/include/c++/11/bits/shared_ptr_...

[1] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[2] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[3] https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...

[4] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[5] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....


> Implementation of __gthread_active_p is indeed a runtime check [3] which AFAICS applies only to single-threaded programs. Perhaps the shared-library use-case also fits here?

The line you linked is for some FreeBSD/Solaris versions which appear to have some quirks with the way pthreads functions are exposed in their libc. I think the "normal" implementation of __gthread_active_p is on line 248 [0], and that is a pretty straightforward check against a weak symbol.

> Strange optimization IMHO so I wonder what was the motivation behind it.

I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.

> The cost function being optimized in this case is depending on WORD being atomic [4] without actually using the atomics [5].

Not entirely sure what you're getting at here? The former is used for single-threaded programs so there's ostensibly no need for atomics, whereas the latter is used for non-single-threaded programs.

[0]: https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...


> Not entirely sure what you're getting at here?

> I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.

Obviously yes. What I am wondering is what benefit it brings in practice. A single-threaded program with shared_ptrs using atomics vs. shared_ptrs using plain words seems like a non-problem to me; I doubt it has a measurable performance impact. Atomics slow down the program only under contention, and single-threaded programs can't have contention.


> What I am wondering is what benefit does it bring in practice. Single-threaded program with shared-ptr's using atomics vs shared-ptr's using WORDs seem like a non-problem to me - e.g. I doubt it has a measurable performance impact.

I mean, the blog post basically starts with an example where the performance impact is noticeable:

> I found that my Rust port of an immutable RB tree insertion was significantly slower than the C++ one.

And:

> I just referenced pthread_create in the program and the reference count became atomic again.

> Although uninteresting to the topic of the blog post, after the modifications, both programs performed very similarly in the benchmarks.

So in principle an insert-heavy workload for that data structure could see a noticeable performance impact.

> Atomics are slowing down the program only when it comes to contention, and single-threaded programs can't have them.

Not entirely sure I'd agree? My impression is that while uncontended atomics are not too expensive they aren't exactly free compared to the corresponding non-atomic instruction. For example, Agner Fog's instruction tables [0] states:

> Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

And there's this blog post [1], which compares the performance of various concurrency mechanisms/implementations including uncontended atomics and "plain" code and shows that uncontended atomics are still slower than non-atomic operations (~3.5x if I'm reading the raw data table correctly).

So if the atomic instruction is in a hot loop then I think it's quite plausible that it'll be noticeable.

[0]: https://www.agner.org/optimize/instruction_tables.pdf

[1]: https://travisdowns.github.io/blog/2020/07/06/concurrency-co...


Thanks, I'll revisit your comment. Some interesting things you shared.


People assume non-existent guarantees such as?


"Is shared_ptr thread safe?" is a classic question, asked thousands of times. The answer, by the way, is "it's as thread safe as a regular pointer."



