Recently I had the task of optimizing a UI.
This UI used a lot of data records managed by a non-standard shared pointer implementation, which requires the target instance to hold the reference count as a member (an intrusive reference count).
The reference count was implemented as a single atomic to be thread safe. It is increased each time a shared pointer is constructed and decreased as soon as one is destroyed.
When the reference count reaches zero, the instance is destroyed.

For the meaning of the memory orderings used below, have a look at Atomics, memory barriers and reordering.

#include <atomic>

// this is the class which holds the reference count in itself.
class ClassWithReferenceCount {
    protected:
        std::atomic<int> ref_{0};
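    public:
        // virtual, so that `delete this` in dec_ref() also destroys
        // derived classes correctly
        virtual ~ClassWithReferenceCount() = default;
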
        void inc_ref() {
            ref_.fetch_add(1, std::memory_order_relaxed);
        }

        void dec_ref() {
            // decrease ref_ by one. fetch_sub returns the previous value of ref_
            // memory_order_release ensures that everything this thread wrote
            // to the instance happens before the decrement becomes visible
            if (ref_.fetch_sub(1, std::memory_order_release) == 1) {
                // the acquire fence synchronizes with the release decrements
                // of all other threads, so their writes to the instance are
                // visible here before the instance is destroyed
                std::atomic_thread_fence(std::memory_order_acquire);
                delete this;
            }
        }
};


// this concept requires that an instance of T has
// the two callables inc_ref and dec_ref
template<typename T>
concept ReferenceCountable =
requires (T a) {
    a.inc_ref();
    a.dec_ref();
};

// this is the shared pointer type for any class that fulfills the
// ReferenceCountable requirements.
template < ReferenceCountable T >
class SharedPointer {
    private:
        T* ptr_;
    public:
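        SharedPointer(T* inst):ptr_(inst) { if (ptr_) ptr_->inc_ref(); }
        // a copy shares ownership, so it has to increase the count as well;
        // the compiler-generated copy would only copy the raw pointer
        SharedPointer(const SharedPointer& other):ptr_(other.ptr_) { if (ptr_) ptr_->inc_ref(); }
        SharedPointer& operator=(const SharedPointer& other) {
            // increase first, so that self-assignment is safe
            if (other.ptr_) other.ptr_->inc_ref();
            if (ptr_) ptr_->dec_ref();
            ptr_ = other.ptr_;
            return *this;
        }
        ~SharedPointer() { if (ptr_) ptr_->dec_ref(); }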
        T& operator*() const { return *ptr_; }
        T* operator->() const { return ptr_; }
        explicit operator bool() const noexcept { return ptr_ != nullptr; }
};

using MySharedPointer = SharedPointer<ClassWithReferenceCount>;

This implementation works fine and allows a ClassWithReferenceCount instance to ‘hold’ itself by simply increasing its own reference count ref_.
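
To illustrate what ‘holding’ itself can look like, here is a minimal sketch. PendingJob, start(), finish() and self_holding_demo() are hypothetical names that are not part of the original code, and the virtual destructor is an assumption added so that `delete this` works for the derived class:

```cpp
#include <atomic>
#include <cassert>

// Same intrusive counter as above, plus a virtual destructor (assumption)
// so that `delete this` destroys derived objects correctly.
class ClassWithReferenceCount {
    protected:
        std::atomic<int> ref_{0};
    public:
        virtual ~ClassWithReferenceCount() = default;
        void inc_ref() { ref_.fetch_add(1, std::memory_order_relaxed); }
        void dec_ref() {
            if (ref_.fetch_sub(1, std::memory_order_release) == 1) {
                std::atomic_thread_fence(std::memory_order_acquire);
                delete this;
            }
        }
};

// Hypothetical example: a job that has to stay alive until it has finished,
// even if nobody else holds a pointer to it any more.
class PendingJob : public ClassWithReferenceCount {
        bool* destroyed_;  // set in the destructor, for demonstration only
    public:
        explicit PendingJob(bool* destroyed) : destroyed_(destroyed) {}
        ~PendingJob() override { *destroyed_ = true; }

        void start()  { inc_ref(); }  // the job holds itself
        void finish() { dec_ref(); }  // drops the self reference, may delete this
};

// Returns true if the job survived its last external user and was
// destroyed only after finish().
bool self_holding_demo() {
    bool destroyed = false;
    PendingJob* job = new PendingJob(&destroyed);
    job->start();       // count is 1, held by the job itself
    job->inc_ref();     // an external user appears, count is 2
    job->dec_ref();     // the external user is gone, count is 1
    bool alive_while_running = !destroyed;
    assert(alive_while_running);
    job->finish();      // count is 0, the job deletes itself
    return alive_while_running && destroyed;
}
```

After finish() the pointer must not be touched any more, because the instance may already be gone.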

But in a single-threaded environment, or wherever you can ensure that the pointers and the instance are only ever used from one thread, the atomic reference counting can(!) become a performance issue.
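
One possible way to make the counter type selectable is sketched below. This is not the code from the UI project, just an assumption of how it could look: BasicReferenceCount, SingleThreadedReferenceCount, ThreadSafeReferenceCount, use_count(), Record and single_threaded_demo() are all hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <type_traits>

// Hypothetical base class: Counter is either `int` (single thread only)
// or `std::atomic<int>` (thread safe).
template <typename Counter>
class BasicReferenceCount {
    protected:
        Counter ref_{0};
    public:
        virtual ~BasicReferenceCount() = default;

        void inc_ref() {
            if constexpr (std::is_same_v<Counter, std::atomic<int>>) {
                ref_.fetch_add(1, std::memory_order_relaxed);
            } else {
                ++ref_;  // plain increment, no synchronization at all
            }
        }

        void dec_ref() {
            if constexpr (std::is_same_v<Counter, std::atomic<int>>) {
                if (ref_.fetch_sub(1, std::memory_order_release) == 1) {
                    std::atomic_thread_fence(std::memory_order_acquire);
                    delete this;
                }
            } else {
                if (--ref_ == 0) delete this;
            }
        }

        // current count, for demonstration only
        int use_count() const { return ref_; }
};

using SingleThreadedReferenceCount = BasicReferenceCount<int>;
using ThreadSafeReferenceCount     = BasicReferenceCount<std::atomic<int>>;

// Hypothetical data record that must never leave its thread.
struct Record : SingleThreadedReferenceCount {
    int value = 42;
};

// Returns the count observed while one reference is still alive.
int single_threaded_demo() {
    Record* r = new Record;
    r->inc_ref();
    r->inc_ref();
    r->dec_ref();
    int count = r->use_count();
    r->dec_ref();  // last reference, the record deletes itself
    return count;
}
```

Both variants offer inc_ref and dec_ref, so they can be used with the same kind of shared pointer. The non-atomic one is only safe as long as the single-thread guarantee really holds.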

I measured counting with an atomic and a non-atomic counter using Google Benchmark:

#include <benchmark/benchmark.h>
#include <atomic>
#include <numeric>

using counter_type = unsigned long;

static counter_type counter = 0;
void NonAtomic() { ++counter; }

static void BM_NonAtomic(benchmark::State& state) {
    for (auto _ : state) NonAtomic();
}
BENCHMARK(BM_NonAtomic);

std::atomic<counter_type> atomic_counter;
void Atomic() { atomic_counter.fetch_add(1, std::memory_order_relaxed); }

static void BM_Atomic(benchmark::State& state) {
    for (auto _ : state) Atomic();
}
BENCHMARK(BM_Atomic);

BENCHMARK_MAIN();

The result isn’t really surprising. The non-atomic increment is more than 3 times faster:

-------------------------------------------------------
Benchmark             Time             CPU   Iterations
-------------------------------------------------------
BM_NonAtomic       3.10 ns         3.10 ns    224320150
BM_Atomic          10.5 ns         10.4 ns     65751778

So in a high-performance environment where such instances aren’t shared between threads, a non-atomic reference count can make a difference.