Atomics, memory barriers and reordering
When a program runs several threads, one or more synchronization points are usually required.
C++ offers many of them; here I try to explain the memory orders for atomic operations.
Don’t use operators
First of all: although some atomics offer operators such as pre- or post-increment, don't use them!
One reason: you cannot specify a memory order (more on this later).
But more important: you can easily write expressions that are not atomic as a whole:
std::atomic<int> i{};
i += i * 2;
This single expression consists of three operations, two of them atomic but executed separately:
- atomically load the value of i
- multiply it by 2
- atomically add the result to i
Between the atomic load and the atomic add another thread may modify i, so the expression as a whole is not atomic.
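If the whole update has to be one atomic step, it can be written as an explicit compare-exchange loop with the memory orders spelled out. Here is a minimal sketch of that idea (the function name and the chosen memory orders are just assumptions for illustration):
#include <atomic>

std::atomic<int> i{};

void triple_i() {
    // read the current value, then retry the read-modify-write
    // until no other thread has changed `i` in between
    int expected = i.load(std::memory_order_relaxed);
    while (!i.compare_exchange_weak(expected,
                                    expected + expected * 2,
                                    std::memory_order_release,    // on success
                                    std::memory_order_relaxed)) { // on failure
        // on failure `expected` has been updated to the current value of `i`
    }
}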
For a better understanding of what memory barriers and orders are for, let's take a deeper (but not necessarily precise) look.
Reordering
Both compilers and CPUs may reorder instructions for optimization: the CPU implicitly, for example through out-of-order and speculative execution, the compiler explicitly during optimization, but both respect the data dependencies within a single thread.
This means that you cannot be sure that the code executed at runtime is exactly what you have intentionally written.
Example:
 1  int data_has_changed = 0;
 2  int data = 0;
 3
 4  // thread A
 5  data = 100;
 6  data_has_changed = 1;
 7
 8  // thread B
 9  if (data_has_changed)
10      std::cout << data << std::endl;
If the compiler or CPU decides to move or execute line 6 before line 5 in thread A, thread B may see data_has_changed signaled without data being set.
Memory and caches
Nearly all mainstream CPUs have one or more caches (L1, L2, L3 …) per core.
These caches hold data from RAM close to the core, because they can be accessed much faster by the CPU cores than the RAM itself. The closer a cache is placed to its core, the smaller it is: an L1 cache, for instance, mostly has a size between 16 and 64 KB and holds both instructions and data, while an L2 cache can have a size of a few MB and caches data from RAM.
The L3 cache is typically shared and used by the CPU itself for cache coherency, so it is not of interest here.
Both cache size and data alignment can also affect the computation time, but that is another topic.
In practice this means that a change your code makes to a (non-atomic) variable is not necessarily visible globally at that moment, even if the variable is declared globally and is therefore accessible from all threads (cores).
Simply put, atomics ensure that an atomic store makes the data cached by your thread visible to other threads (think of it as writing the cache back to RAM), while an atomic load reads the current value (think of it as reading from RAM).
With the memory order given to each atomic operation this can be controlled more precisely, and it even allows synchronizing non-atomic data.
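As a minimal sketch of that visibility guarantee, using the default (strongest) memory order that is explained further below, a value stored atomically in one thread becomes visible to an atomic load in another thread:
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> shared{0};

int main() {
    // the atomic store makes the new value visible to other threads
    std::jthread writer([] { shared.store(42); });

    // the atomic load eventually observes the stored value
    std::jthread reader([] {
        while (shared.load() == 0) {} // spin until the store becomes visible
        std::cout << shared.load() << std::endl;
    });
} // both jthreads join here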
Mutex objects
Both problems of reordering and caching can simply be resolved by using mutex objects:
#include <mutex>
#include <thread>

std::mutex mx;
int data = 0;
bool has_data = false;

void producer() {
    int i = 100;
    while (i) {
        { // enter lock scope, begin of "critical section"
            std::unique_lock _(mx);
            if (!has_data) {
                data = --i;
                has_data = true;
            }
        } // leave lock scope, end of "critical section"
        std::this_thread::yield();
    }
}

void consumer() {
    while (true) {
        int i = -1;
        { // enter lock scope, begin of "critical section"
            std::unique_lock _(mx);
            if (has_data) {
                i = data;
                has_data = false;
            }
        } // leave lock scope, end of "critical section"
        if (i == 0)
            break;
        std::this_thread::yield();
    }
}

int main() {
    // create and start one consumer and one producer thread
    std::jthread cons(consumer), prd(producer);
    return 0;
}
Neither the compiler nor the CPU will move operations from inside a critical section to the outside; operations placed before or after the critical section may be moved into it, but they cannot cross it to the other side.
The mutex also ensures that global data changed inside the critical section becomes visible to other threads once the current thread leaves it.
A mutex is an easy and well-maintainable (because easy to understand) way to synchronize data access between threads.
But when it comes to performance, there may be better solutions, especially for small data as in the example above.
Acquire-Release semantics
There are a few memory orders in C++ you can use with std::atomic which can solve the above problem without mutexes and therefore potentially a bit faster:
 1  int data = 0;
 2  std::atomic<bool> has_data{false};
 3
 4  void producer() {
 5      int i = 100;
 6      while (i) {
 7          // acquire: synchronizes with a release operation on has_data in another
 8          // thread, so any data written before that release is visible here
 9          if (!has_data.load(std::memory_order_acquire)) {
10              data = --i;
11              // release: makes all writes of this thread up to this point visible
12              // to other threads that acquire has_data
13              has_data.exchange(true, std::memory_order_release);
14          }
15          std::this_thread::yield();
16      }
17  }
18
19  void consumer() {
20      while (true) {
21          int i = -1;
22          if (has_data.load(std::memory_order_acquire)) {
23              i = data;
24              has_data.exchange(false, std::memory_order_release);
25          }
26
27          if (i == 0)
28              break;
29
30          std::this_thread::yield();
31      }
32  }
33
34  int main() {
35      // create and start one consumer and one producer thread
36      std::jthread cons(consumer), prd(producer);
37      return 0;
38  }
Here has_data is an atomic which is exchanged by the executing thread (consumer or producer) depending on its state.
By loading has_data with acquire, all writes other threads made before their release operation on has_data become visible to the loading thread.
This also means that data does not have to be synchronized separately, because the write to data happens before the write to has_data due to the release memory order. Release ensures that all writes of the current thread become visible to other threads that use the same atomic, in this case has_data. And of course has_data itself won't be changed behind the back of the current thread between the acquire and the release (lines 9 to 13 and 22 to 24 in the example), because each side only flips the flag after observing the state it is waiting for.
Acquire and release do not have to appear close together as in the example above; they can also be used across the boundaries of a callable, since they form an ordering mechanism that affects both compilation and runtime as a whole.
std::atomic<unsigned int> lock{0};
std::string foo{};

void producer() {
    lock.fetch_add(1, std::memory_order_release); // first add, lock is now odd
    // an odd lock tells all other threads that foo is currently being written
    foo = compute_it();
    lock.fetch_add(1, std::memory_order_release); // second add, lock is even again
    // at this point lock is even and foo is set,
    // both changes are now visible to other threads
}

void consumer() {
    // busy loop as long as `lock` is odd (a write is in progress) or still 0 (no write yet);
    // the acquire load synchronizes with the release in the other thread
    unsigned int v = lock.load(std::memory_order_acquire);
    while (v % 2 || v == 0)
        v = lock.load(std::memory_order_acquire);
    std::string my_foo = get_foo();
}
By the way: coming from locking, the terms acquire and release can be hard to grasp in the context of memory orders.
I memorize them this way:
release 'releases' the cache to memory,
acquire 'acquires' the memory into the cache.
Sequential consistency ordering
Another, but stricter, ordering is sequential consistency, std::memory_order_seq_cst.
It guarantees that all atomic operations using this order, even on different atomic objects, are seen by all threads in one single total order, the order in which they were performed.
Neither CPU nor compiler will reorder them.
This order can be used for loads (then it behaves like acquire), stores (release) and read-modify-write operations (acquire + release).
There is a big section about this on cppreference with a lot of examples.
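To illustrate the guarantee, here is a minimal sketch of the classic demonstration (a variant of the example shown there): with sequentially consistent operations all threads agree on a single total order of the two stores, so at least one of the two readers must see both flags set.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false}, y{false};
std::atomic<int> z{0};

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst)) {} // wait until x is set
    if (y.load(std::memory_order_seq_cst))
        ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst)) {} // wait until y is set
    if (x.load(std::memory_order_seq_cst))
        ++z;
}

int main() {
    {
        std::jthread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    } // all four threads have finished here
    // in the single total order one of the stores comes first, so the reader
    // waiting for the later store must also see the earlier one: z is never 0
    assert(z.load() != 0);
}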