Atomics, memory barriers and reordering
When a program runs several threads, one or more synchronization points are usually required.
C++ offers many of them; here I try to explain the memory orders for atomic operations.
Don’t use operators
First of all: although some atomics offer operators such as pre- or post-increment, don't use them!
One reason: you cannot specify a memory order (more on this later).
But more important: you can easily write expressions that are not atomic as a whole:
std::atomic<int> i{};
i += i * 2;
This single expression consists of three operations, two of them atomic but executed separately:
- atomically load the value of i
- multiply it by 2
- atomically add the result to i
Between the atomic load and the atomic add another thread may modify i, so the expression as a whole is not atomic.
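If the whole update has to be one atomic step, it can be written as an explicit compare-exchange loop with the memory orders spelled out. Here is a minimal sketch of that idea (the function name and the chosen memory orders are just assumptions for illustration):
#include <atomic>

std::atomic<int> i{};

void triple_i() {
    // read the current value, then retry the read-modify-write
    // until no other thread has changed `i` in between
    int expected = i.load(std::memory_order_relaxed);
    while (!i.compare_exchange_weak(expected,
                                    expected + expected * 2,
                                    std::memory_order_release,    // on success
                                    std::memory_order_relaxed)) { // on failure
        // on failure `expected` has been updated to the current value of `i`
    }
}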
For a better understanding of what memory barriers and orders are for, let's take a deeper (but not necessarily precise) look.
Reordering
Both compilers and CPUs may reorder instructions for optimization: the CPU implicitly, for example through out-of-order and speculative execution, the compiler explicitly during optimization, but both respect the data dependencies within a single thread.
This means that you cannot be sure that the code executed at runtime is exactly what you have intentionally written.
Example:
 1  int data_has_changed = 0;
 2  int data = 0;
 3
 4  // thread A
 5  data = 100;
 6  data_has_changed = 1;
 7
 8  // thread B
 9  if (data_has_changed)
10      std::cout << data << std::endl;
If the compiler or CPU decides to move or execute line 6 before line 5 in thread A, thread B may see data_has_changed signaled without data being set.
Memory and caches
Nearly all mainstream CPUs have one or more caches (L1, L2, L3 …) per core.
These caches hold data from RAM close to the core, because they can be accessed much faster by the CPU cores than the RAM itself. The closer a cache is placed to its core, the smaller it is: an L1 cache, for instance, mostly has a size between 16 and 64 KB and holds both instructions and data, while an L2 cache can have a size of a few MB and caches data from RAM.
The L3 cache is typically shared and used by the CPU itself for cache coherency, so it is not of interest here.
Both cache size and data alignment can also affect the computation time, but that is another topic.
In practice this means that a change your code makes to a (non-atomic) variable is not necessarily visible globally at that moment, even if the variable is declared globally and is therefore accessible from all threads (cores).
Simply put, atomics ensure that an atomic store makes the data cached by your thread visible to other threads (think of it as writing the cache back to RAM), while an atomic load reads the current value (think of it as reading from RAM).
With the memory order given to each atomic operation this can be controlled more precisely, and it even allows synchronizing non-atomic data.
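As a minimal sketch of that visibility guarantee, using the default (strongest) memory order that is explained further below, a value stored atomically in one thread becomes visible to an atomic load in another thread:
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> shared{0};

int main() {
    // the atomic store makes the new value visible to other threads
    std::jthread writer([] { shared.store(42); });

    // the atomic load eventually observes the stored value
    std::jthread reader([] {
        while (shared.load() == 0) {} // spin until the store becomes visible
        std::cout << shared.load() << std::endl;
    });
} // both jthreads join here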
Mutex objects
Both problems of reordering and caching can simply be resolved by using mutex objects:
#include <mutex>
#include <thread>

std::mutex mx;
int data = 0;
bool has_data = false;

void producer() {
    int i = 100;
    while (i) {
        { // enter lock scope, begin of "critical section"
            std::unique_lock _(mx);
            if (!has_data) {
                data = --i;
                has_data = true;
            }
        } // leave lock scope, end of "critical section"
        std::this_thread::yield();
    }
}

void consumer() {
    while (true) {
        int i = -1;
        { // enter lock scope, begin of "critical section"
            std::unique_lock _(mx);
            if (has_data) {
                i = data;
                has_data = false;
            }
        } // leave lock scope, end of "critical section"
        if (i == 0)
            break;
        std::this_thread::yield();
    }
}

int main() {
    // create and start one consumer and one producer thread
    std::jthread cons(consumer), prd(producer);
    return 0;
}
Neither the compiler nor the CPU will move operations from inside a critical section to the outside; operations placed before or after the critical section may be moved into it, but they cannot cross it to the other side.
The mutex also ensures that global data changed inside the critical section becomes visible to other threads once the current thread leaves it.
A mutex is an easy and well-maintainable (because easy to understand) way to synchronize data access between threads.
But when it comes to performance, there may be better solutions, especially for small data as in the example above.
Acquire-Release semantics
There are a few memory orders in C++ you can use with std::atomic which can solve the above problem without mutexes and therefore potentially a bit faster:
 1  int data = 0;
 2  std::atomic<bool> has_data{false};
 3
 4  void producer() {
 5      int i = 100;
 6      while (i) {
 7          // acquire: synchronizes with a release operation on has_data in another
 8          // thread, so any data written before that release is visible here
 9          if (!has_data.load(std::memory_order_acquire)) {
10              data = --i;
11              // release: makes all writes of this thread up to this point visible
12              // to other threads that acquire has_data
13              has_data.exchange(true, std::memory_order_release);
14          }
15          std::this_thread::yield();
16      }
17  }
18
19  void consumer() {
20      while (true) {
21          int i = -1;
22          if (has_data.load(std::memory_order_acquire)) {
23              i = data;
24              has_data.exchange(false, std::memory_order_release);
25          }
26
27          if (i == 0)
28              break;
29
30          std::this_thread::yield();
31      }
32  }
33
34  int main() {
35      // create and start one consumer and one producer thread
36      std::jthread cons(consumer), prd(producer);
37      return 0;
38  }
Here has_data is an atomic which is exchanged by the executing thread (consumer or producer) depending on its state.
By loading has_data with acquire, all writes other threads made before their release operation on has_data become visible to the loading thread.
This also means that data does not have to be synchronized separately, because the write to data happens before the write to has_data due to the release memory order. Release ensures that all writes of the current thread become visible to other threads that use the same atomic, in this case has_data. And of course has_data itself won't be changed behind the back of the current thread between the acquire and the release (lines 9 to 13 and 22 to 24 in the example), because each side only flips the flag after observing the state it is waiting for.
Acquire and release do not have to appear close together as in the example above; they can also be used across the boundaries of a callable, since they form an ordering mechanism that affects both compilation and runtime as a whole.
std::atomic<unsigned int> lock{0};
std::string foo{};

void producer() {
    lock.fetch_add(1, std::memory_order_release); // first add, lock is now odd
    // an odd lock tells all other threads that foo is currently being written
    foo = compute_it();
    lock.fetch_add(1, std::memory_order_release); // second add, lock is even again
    // at this point lock is even and foo is set,
    // both changes are now visible to other threads
}

void consumer() {
    // busy loop as long as `lock` is odd (a write is in progress) or still 0 (no write yet);
    // the acquire load synchronizes with the release in the other thread
    unsigned int v = lock.load(std::memory_order_acquire);
    while (v % 2 || v == 0)
        v = lock.load(std::memory_order_acquire);
    std::string my_foo = get_foo();
}
By the way: coming from locking, the terms acquire and release can be hard to grasp in the context of memory orders.
I memorize them this way:
release 'releases' the cache to memory,
acquire 'acquires' the memory into the cache.
Sequential consistency ordering
Another, but stricter, ordering is sequential consistency, std::memory_order_seq_cst.
It guarantees that all atomic operations using this order, even on different atomic objects, are seen by all threads in one single total order, the order in which they were performed.
Neither CPU nor compiler will reorder them.
This order can be used for loads (then it behaves like acquire), stores (release) and read-modify-write operations (acquire + release).
There is a big section about this on cppreference with a lot of examples.
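To illustrate the guarantee, here is a minimal sketch of the classic demonstration (a variant of the example shown there): with sequentially consistent operations all threads agree on a single total order of the two stores, so at least one of the two readers must see both flags set.
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x{false}, y{false};
std::atomic<int> z{0};

void write_x() { x.store(true, std::memory_order_seq_cst); }
void write_y() { y.store(true, std::memory_order_seq_cst); }

void read_x_then_y() {
    while (!x.load(std::memory_order_seq_cst)) {} // wait until x is set
    if (y.load(std::memory_order_seq_cst))
        ++z;
}

void read_y_then_x() {
    while (!y.load(std::memory_order_seq_cst)) {} // wait until y is set
    if (x.load(std::memory_order_seq_cst))
        ++z;
}

int main() {
    {
        std::jthread a(write_x), b(write_y), c(read_x_then_y), d(read_y_then_x);
    } // all four threads have finished here
    // in the single total order one of the stores comes first, so the reader
    // waiting for the later store must also see the earlier one: z is never 0
    assert(z.load() != 0);
}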