A safe, non-owning C++ pointer class (rosemanlabs.com)
49 points by niekb 3 months ago | hide | past | favorite | 47 comments


> Note that our use case is in a single-threaded context. Hence, the word safe should not be interpreted as ‘thread-safe.’ Single-threadedness greatly simplifies the design; we need not reason about race conditions such as one where an object is simultaneously moved and accessed on different threads. Extending the design to a thread-safe one is left as an exercise to the reader.

Why intentionally design a worse alternative to std::weak_ptr, which has been around since C++11?


(Author here.) That is a good question. For our use case, we in fact do not use std::shared_ptr in our implementation, but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention). However, when I wrote the blog post, I replaced that not-so-well-known class with std::shared_ptr for the sake of accessibility to a general C++ audience. By doing so, it indeed becomes natural to ask why one wouldn't use std::weak_ptr, which I hadn't realised when writing the post.

One reason this design can still be beneficial even when the standard std::shared_ptr is used in its implementation is when you do not want the pointee object to be managed by a std::shared_ptr (which is a requirement if you want to use std::weak_ptr). E.g., if you want to ensure that multiple objects of that type are laid out next to each other in memory, instead of scattered around the heap.

Another goal of the post is to show this idea, namely to use a shared_ptr<T*> (instead of a shared_ptr<T>), which is kind of non-standard but can, as I hope to have convinced you, sometimes be useful.


> but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention

Why would there be contention in a single threaded program?


Atomics aren't free even without contention. The slogan of the language is "you don't pay for what you don't use", and it's really not great that there's no non-atomic refcount in the standard. The fact that it is atomic by default has also led people to assume guarantees that it doesn't provide, which was trivially predictable when the standard first introduced it.


OP specifically mentioned contention, though -- not marginally higher cost of atomic inc/dec vs plain inc/dec.

> For our use case, we in fact do not use std::shared_ptr in our implementation, but instead a single-threaded shared_ptr-like class that has no atomics (to avoid cross-core contention).

A single-threaded program will not have cross-core contention whether it uses std::atomic<> refcounts or plain integer refcounts, period. You're right that non-atomic refcounts can be anywhere from somewhat cheaper to a lot cheaper than atomic refcounts, depending on the platform. But that is orthogonal to cross-core contention.


> not marginally higher cost of atomic inc/dec vs plain inc/dec.

Note that the difference is not so marginal, and it is not just in hardware instructions: the non-atomic operations generally allow more optimizations by the compiler.


The actual intrinsic is like 8-9 cycles on Zen4 or Ice Lake (vs 1 for plain add). It's something if you're banging on it in a hot loop, but otherwise not a ton. (If refcounting is hot in your design, your design is bad.)

It's comparable to like, two integer multiplies, or a single integer division. Yes, there is some effect on program order.


Can’t you have cross-core contention purely because of other processes doing atomics that happen to have a cache-line address collision in the lock broadcast?


Related to this, GNU's libstdc++ shared_ptr implementation actually opts not to use atomic arithmetic when it infers that the program is not using threads.


I never heard of this and went to check in the source and it really does exist: https://codebrowser.dev/llvm/include/c++/11/ext/concurrence....


The code you linked is a compile-time configuration option, which doesn't quite match "infer" IMO. I think GP is thinking of the way that libstdc++ basically relies on the linker to tell it whether libpthread is linked in and skips atomic operations if it isn't [0].

[0]: https://snf.github.io/2019/02/13/shared-ptr-optimization/


It's a compile-time flag which is defined when libpthread is linked into the binary.


Sure, but I think that's independent of what eMSF was describing. From libgcc/gthr.h:

    /* If this file is compiled with threads support, it must
           #define __GTHREADS 1
       to indicate that threads support is present.  Also it has define
       function
         int __gthread_active_p ()
       that returns 1 if thread system is active, 0 if not.
I think the mechanism eMSF was describing (and the mechanism in the blogpost I linked) corresponds to __gthread_active_p().

I think the distinction between the two should be visible in some cases - for example, what happens for shared libraries that use std::shared_ptr and don't link libpthread, but are later used with a binary that does link libpthread?


Hm, not sure. I can see that shared_ptr::_M_release [0] is implemented in terms of __exchange_and_add_dispatch [1], which is in turn implemented in terms of __is_single_threaded [2]. __is_single_threaded will use __gthread_active_p iff __GTHREADS is not defined and the <sys/single_threaded.h> header is not included.

Implementation of __gthread_active_p is indeed a runtime check [3] which AFAICS applies only to single-threaded programs. Perhaps the shared-library use-case also fits here?

Strange optimization IMHO so I wonder what was the motivation behind it. The cost function being optimized in this case is depending on WORD being atomic [4] without actually using the atomics [5].

[0] https://codebrowser.dev/llvm/include/c++/11/bits/shared_ptr_...

[1] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[2] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[3] https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...

[4] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....

[5] https://codebrowser.dev/llvm/include/c++/11/ext/atomicity.h....


> Implementation of __gthread_active_p is indeed a runtime check [3] which AFAICS applies only to single-threaded programs. Perhaps the shared-library use-case also fits here?

The line you linked is for some FreeBSD/Solaris versions which appear to have some quirks in the way pthreads functions are exposed in their libc. I think the "normal" implementation of __gthread_active_p is on line 248 [0], and that is a pretty straightforward check against a weak symbol.

> Strange optimization IMHO so I wonder what was the motivation behind it.

I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.

> The cost function being optimized in this case is depending on WORD being atomic [4] without actually using the atomics [5].

Not entirely sure what you're getting at here? The former is used for single-threaded programs so there's ostensibly no need for atomics, whereas the latter is used for non-single-threaded programs.

[0]: https://codebrowser.dev/kde/include/x86_64-linux-gnu/c++/11/...


> Not entirely sure what you're getting at here?

> I believe the motivation is to avoid needing to pay the cost of atomics when there is no parallelism going on.

Obviously yes. What I am wondering is what benefit does it bring in practice. Single-threaded program with shared-ptr's using atomics vs shared-ptr's using WORDs seem like a non-problem to me - e.g. I doubt it has a measurable performance impact. Atomics are slowing down the program only when it comes to contention, and single-threaded programs can't have them.


> What I am wondering is what benefit does it bring in practice. Single-threaded program with shared-ptr's using atomics vs shared-ptr's using WORDs seem like a non-problem to me - e.g. I doubt it has a measurable performance impact.

I mean, the blog post basically starts with an example where the performance impact is noticeable:

> I found that my Rust port of an immutable RB tree insertion was significantly slower than the C++ one.

And:

> I just referenced pthread_create in the program and the reference count became atomic again.

> Although uninteresting to the topic of the blog post, after the modifications, both programs performed very similarly in the benchmarks.

So in principle an insert-heavy workload for that data structure could see a noticeable performance impact.

> Atomics are slowing down the program only when it comes to contention, and single-threaded programs can't have them.

Not entirely sure I'd agree? My impression is that while uncontended atomics are not too expensive they aren't exactly free compared to the corresponding non-atomic instruction. For example, Agner Fog's instruction tables [0] states:

> Instructions with a LOCK prefix have a long latency that depends on cache organization and possibly RAM speed. If there are multiple processors or cores or direct memory access (DMA) devices, then all locked instructions will lock a cache line for exclusive access, which may involve RAM access. A LOCK prefix typically costs more than a hundred clock cycles, even on single-processor systems. This also applies to the XCHG instruction with a memory operand.

And there's this blog post [1], which compares the performance of various concurrency mechanisms/implementations including uncontended atomics and "plain" code and shows that uncontended atomics are still slower than non-atomic operations (~3.5x if I'm reading the raw data table correctly).

So if the atomic instruction is in a hot loop then I think it's quite plausible that it'll be noticeable.

[0]: https://www.agner.org/optimize/instruction_tables.pdf

[1]: https://travisdowns.github.io/blog/2020/07/06/concurrency-co...


Thanks, I'll revisit your comment. Some interesting things you shared.


People assume non-existent guarantees such as?


"Is shared_ptr thread safe?" is a classic question, asked thousands of times. The answer, by the way, is "it's as thread safe as a regular pointer".


> laid out next to each other in memory

Moving the goalposts. But just to follow that thought: decoupling alloc+init via e.g. placement-new to achieve this introduces a host of complications not considered in your solution.

If that layout _is_ a requirement, and you don't want a totally nonstandard foundation lib with nonstandard types promiscuously necessitating more nonstandard types, you want a std::vector+index handle.


They never mention std::weak_ptr, which makes me think they aren't aware of it. Yes, this looks pretty useless and unsafe (isn't everything multi-threaded these days..)


> isn't everything multi-threaded these days..

There are alternative ways to utilize a machine with multiple cores, e.g. running one thread per CPU core and not sharing state between those threads; within each such thread you then have single-threaded "semantics".


weak_ptr supports this -- it's only mt-safe if you specialize it with std::atomic


Last I checked weak_ptr is always atomic (ignoring weird attempted glibc magic when you don’t link against pthread)



Oh sure, a single weak_ptr instance itself is not safe for concurrent access through non-const methods. But weak_ptr -> shared_ptr reacquisition is atomic, and all control-block operations are:

> Note that the control block used by std::weak_ptr and std::shared_ptr is thread-safe: different non-atomic std::weak_ptr objects can be accessed using mutable operations, such as operator= or reset, simultaneously by multiple threads, even when these instances are copies or otherwise share the same control block internally. The type T may be an incomplete type.

There’s no variant of shared_ptr / weak_ptr that is non atomic in the standard library AFAIK.


Multi-threading does not imply shared ownership, it can also be achieved with message passing.


We purposefully didn't use shared_ptr, and hence weak_ptr. With these, it is all too easy to construct the "bad" version, where the reference-count stub and pointer are stored far away in memory from the object itself, requiring a double dereference to access the object, which is bad for cache performance. Instead we derived from a shareable class that holds the reference count, to make sure it is close in memory.

We were happy to use unique_ptr, however.


With make_shared it's guaranteed to be a single allocation these days, so you shouldn't necessarily have cache-locality problems. I do think there are benefits to things being intrusively refcounted, as you save 8 bytes per object. And if you give up the weak count you can save even more.


The atomics in std::weak_ptr are >20x more expensive even with 0 contention.


This sounds very similar to how base::WeakPtr works in Chromium [0]. It's a reasonable design, but it only works as long as the pointer is accessed only from the thread it was created on.

[0] https://chromium.googlesource.com/chromium/src/+/HEAD/base/m...


I’ve recently read the third edition of Bjarne’s “A Tour of C++” (which is actually a good read). I feel the author of this post could benefit from doing so as well.


For situations like this, I prefer a generational index where the lookup fails if the object has been destroyed. For context, the "manager" that holds the objects referred to typically has a lifetime of the whole program.


Interesting, but I see no real use case where it would be useful. Usually raw pointers/references are used to pass a value to a function without ownership transfer, and it's almost always true that the value remains valid until the callee returns. Other use cases, like storing such a pointer in a struct, are dangerous, and one should minimize doing this.


shared_ptr will bite you in the rear if you ever need well-defined semantics about when an object is destructed. It has a lot of good use cases, especially in async code bases where you want to effectively cancel callbacks if the captured variable has gone away. Proactive cancellation is much more difficult by comparison. There are other ways to achieve this result, but the one used in the article is a fine choice.


One time, I was patching up some buggy code that had dangling pointers, just to stop it from crashing. My approach was to check whether the vtable was correct. Sure, actually fixing the underlying bug would have been a lot better, but this was enough to stop the crashing.


> Extending the design to a thread-safe one is left as an exercise to the reader.

Doesn't get much glibber than that!


That was mostly meant as irony/a joke, but I admit that's not really clear from the text... For the sake of clarity: if you need thread-safety, it's probably best to just use std::shared_ptr / std::weak_ptr.


It's a common misconception that std::shared_ptr is thread safe. The counter is thread safe, but the actual shared_ptr instance itself cannot be shared across multiple threads.

There is now atomic_shared_ptr which is thread safe.


It is now a specialization of std::atomic: std::atomic<std::shared_ptr<T>>.


std::span<> is another option. Especially when paired with libc++'s hardening mode(s). Apparently, Google has deployed them in production.


Hardened std::span just adds checks to operator[] that would normally only happen in at(). Same for operator* and operator->. It doesn't really have any relevance to the problem the article is written about.


What? std::span is equivalent to just a pointer+length pair. It doesn't know anything about whether the underlying object(s) are still valid.


This just seems intentionally bad, to show where Rust would be better. This is yet another example of what I call "corner-case" instruction, which I define as: "I am going to take an obviously terrible corner case that shows what an awful developer can do to break a program, then demonstrate my brilliance by introducing my (highly biased) opinionated point..."

In this particular case, the subtle implication is that Rust is preferred because it doesn't allow unsafe memory operations such as the one demonstrated. Really, all it demonstrates is that you can write really bad C++.


You could implement the same smart-pointer library in Rust and it would be fine. Rust doesn't magically solve the problems around defined destruction ordering when using ref-counted pointers. I try very hard to model my usage of Rc or Arc to be very similar to what this article showcases, for basically the same reasons I imagine they do. I'm actually inspired to write a crate with these semantics to make it harder to mess up.


Sure enough I found it: https://crates.io/crates/refbox



