A few days ago I started refactoring a legacy string class, mainly with the goal of speeding up some static stuff using constexpr. While adapting the methods one by one, I came across one called utf8_len - it returns the number of characters in the contained UTF-8 string. Ok, a better name would be something like char_count or maybe utf8_char_count, but changing the API itself was not an option.

I remember having refactored this a few years ago based on a suggestion I found on Stack Overflow:

The simple c-while loop

# include <cstdint>
# include <string>

size_t utf8_length(const std::string& source) {
  const char* _s = source.c_str();

  // count every byte that is not a UTF-8 continuation byte (10xx xxxx)
  size_t len = 0;
  while (*_s)
    len += (uint8_t)((*_s++ & 0xc0) != 0x80);

  return len;
}
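
A minimal usage sketch of the function above (the test string is my own; "Äbc" is one two-byte character plus two single-byte characters, i.e. 3 characters in 4 bytes):

# include <cassert>
# include <string>

int main() {
  // "Äbc" spelled out as raw UTF-8 bytes to stay independent of the source encoding;
  // utf8_length is the function from the listing above
  const std::string s = "\xC3\x84" "bc";
  assert(utf8_length(s) == 3);  // 3 characters ...
  assert(s.size() == 4);        // ... in 4 bytes
}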

As all of the UTF-8 strings used are well formed and error checked, this works quite well and is pretty fast with both compilers / OSes.

chars (bytes)    gcc¹    msvc¹
1,000 (1,412)    0.869   0.710
100 (145)        0.115   0.759²
10 (18)          0.033   0.096
0 (0)            0.016   0.016

The C++20 solution

I recently joined a C++20 training held by Rainer Grimm. Amongst other things, he showed us the usage of std::ranges, so I tried implementing the same behaviour with it:

# include <cstdint>
# include <string>
# include <ranges>
# include <algorithm>

size_t utf8len(const std::string& content) {
  return std::ranges::count_if(content, [](auto c) { return (c & 0xc0) != 0x80; });
}

With gcc: wow! This takes half the time of the c-while loop - and it's usable in constexpr contexts! With msvc it takes a bit more time; presumably gcc optimizes better, or the Linux developers have written a faster standard library…

chars (bytes)    gcc¹    msvc¹
1,000 (1,412)    0.362   0.919
100 (145)        0.057   0.917
10 (18)          0.020   0.110
0 (0)            0.014   0.012
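
To illustrate the constexpr claim, here is a minimal sketch, assuming a std::string_view variant (the name utf8len_sv is mine) so that the whole computation can run at compile time:

# include <algorithm>
# include <cstddef>
# include <ranges>
# include <string_view>

constexpr std::size_t utf8len_sv(std::string_view content) {
  return static_cast<std::size_t>(
    std::ranges::count_if(content, [](char c) { return (c & 0xc0) != 0x80; }));
}

// "Äöü" spelled out as raw UTF-8 bytes: 3 characters in 6 bytes
static_assert(utf8len_sv("\xC3\x84\xC3\xB6\xC3\xBC") == 3);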

The C++17 algorithm solution

Ok, let's try std::accumulate from <numeric>:

# include <cstdint>
# include <string>
# include <numeric>

__attribute__ ((noinline)) size_t utf8len(const std::string& content) {
  return std::accumulate(
    content.cbegin(), content.cend(), (size_t)0,
    [](size_t count, uint8_t c)->size_t {
      return count + (size_t)((c & 0xc0) != 0x80);
    });
}

chars (bytes)    gcc¹    msvc¹
1,000 (1,412)    0.335   0.653
100 (145)        0.041   0.705²
10 (18)          0.021   0.095
0 (0)            0.015   0.011

Surprisingly, this is (slightly) faster than the std::ranges solution; with msvc it also beats the c-while loop.

Cache optimized

A few weeks ago I watched a presentation Fedor Pikus held at CppCon 2021 about branch prediction (YouTube link) and another one about memory alignment (honestly, I forgot the presenter…).

So I wrote an algorithm³ which I thought would have nearly no branch misses and would use the cache optimally. Unfortunately my perf installation doesn't show cycles and branch misses, but I'm still on it.

# include <bit>
# include <cstdint>
# include <string>

__attribute__ ((noinline)) size_t utf8len(const std::string& content) {
  using block_type = uint64_t;

  constexpr block_type all_bits = ~(block_type)0;
  constexpr block_type xor_mask = static_cast<block_type>(0x8080'8080'8080'8080);
  constexpr block_type bit_mask = static_cast<block_type>(0xC0C0'C0C0'C0C0'C0C0);

  size_t result = 0;
  const block_type* ptr = (const block_type*)(content.c_str()),
    * end = ptr + (content.size() / sizeof(block_type));

  // process the string in full 8-byte blocks
  block_type i{};
  while (ptr < end) {
    i = (*ptr ^ xor_mask) & bit_mask;      // classify the top two bits of each byte
    i = (i | i << 1 | i >> 1) & xor_mask;  // leave exactly one bit per counted byte
    result += std::popcount(i);
    ++ptr;
  }

  // mask away the bytes of the last (partial) block that don't belong to the string
  block_type mask{};
  size_t bytes_left = content.size() % sizeof(block_type);

  if constexpr (std::endian::native == std::endian::little) {
    mask = ~(all_bits << (bytes_left * 8));
  }
  else {
    mask = ~(all_bits >> (bytes_left * 8));
  }

  i = (*ptr ^ xor_mask) & bit_mask & mask;
  i = (i | i << 1 | i >> 1) & xor_mask;
  result += std::popcount(i);
  return result;
}

This took me a few hours to write, and the result is rather disillusioning with gcc. The msvc-compiled binaries, however, are fast as hell on Windows for larger strings - nearly 5x faster than the c-while loop for strings larger than 100 bytes!

chars (bytes)    gcc¹    msvc¹
1,000 (1,412)    0.523   0.153
100 (145)        0.074   0.170²
10 (18)          0.031   0.045
0 (0)            0.022   0.033

Cache optimized using std::accumulate

Since std::accumulate was the fastest gcc / Linux solution so far, I tried to combine it with the cache-optimized one:

# include <bit>
# include <cstdint>
# include <numeric>
# include <string>

size_t utf8len(const std::string& content) {
  using block_type = uint64_t;
  constexpr block_type all_bits = ~(block_type)0;
  constexpr block_type xor_mask = static_cast<block_type>(0x8080'8080'8080'8080);
  constexpr block_type bit_mask = static_cast<block_type>(0xC0C0'C0C0'C0C0'C0C0);

  const block_type* ptr = (const block_type*)(content.c_str()),
    * end = ptr + (content.size() / sizeof(block_type));

  // accumulate over the full 8-byte blocks
  size_t result = std::accumulate(
    ptr, end, (size_t)0,
    [](size_t count, const block_type p)->size_t {
      block_type i = (p ^ xor_mask) & bit_mask;
      i = (i | i << 1 | i >> 1) & xor_mask;
      return count + std::popcount(i);
    });

  // mask away the bytes of the last (partial) block that don't belong to the string
  size_t bytes_left = content.size() % sizeof(block_type);

  block_type mask{};
  if constexpr (std::endian::native == std::endian::little) {
    mask = ~(all_bits << (bytes_left * 8));
  }
  else {
    mask = ~(all_bits >> (bytes_left * 8));
  }

  block_type i = (*end ^ xor_mask) & bit_mask & mask;
  i = (i | i << 1 | i >> 1) & xor_mask;
  result += std::popcount(i);
  return result;
}

chars (bytes)    gcc¹    msvc¹
1,000 (1,412)    0.549   0.154
100 (145)        0.078   0.218²
10 (18)          0.031   0.045
0 (0)            0.027   0.033

Ok, time is money, let's stop here. On both systems it performs like the hand-written cache-optimized algorithm. The pure bytewise std::accumulate character counting is the fastest of these on my Linux machine, whilst my cache-optimized algorithm performs best under Windows. It seems as if the GCC standard library is better optimised (or optimisable?) than the Microsoft implementation. I'll use them conditionally.
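
A minimal sketch of what "conditionally" could look like, assuming the two implementations above are available under the hypothetical names utf8len_accumulate and utf8len_blocks:

# include <cstddef>
# include <string>

// hypothetical names for the std::accumulate and the cache-optimized variants above
std::size_t utf8len_accumulate(const std::string& content);
std::size_t utf8len_blocks(const std::string& content);

std::size_t utf8len(const std::string& content) {
#if defined(_MSC_VER)
  return utf8len_blocks(content);      // the block variant was fastest with msvc here
#else
  return utf8len_accumulate(content);  // plain std::accumulate was fastest with gcc here
#endif
}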

How it works

Counting UTF-8 characters relies on the multibyte definition of the encoding (see also this Wikipedia article). For character counting, the only interesting bits in a byte of a UTF-8 string are the first two (most significant) bits. If the first bit is set (0x80) but the second is not, the byte is a continuation chunk of a multibyte character and must not be counted:

11xx xxxx -> 1 (multibyte character start byte)
10xx xxxx -> 0 (multibyte character subsequent byte, always preceded by a start or another subsequent byte - do not count)
00xx xxxx -> 1 (single byte character)
01xx xxxx -> 1 (single byte character)
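
This counting rule boils down to a one-byte predicate, checked here at compile time (the helper name counts_as_char is mine):

# include <cstdint>

// a byte counts as a character unless its top two bits are 10
constexpr bool counts_as_char(std::uint8_t b) { return (b & 0xC0) != 0x80; }

static_assert( counts_as_char(0b1100'0011));  // multibyte start byte
static_assert(!counts_as_char(0b1000'0100));  // multibyte subsequent byte
static_assert( counts_as_char(0b0110'1111));  // single byte character ('o')
static_assert( counts_as_char(0b0011'1111));  // single byte character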

To calculate this, each byte is XOR'ed with 1000 0000 (0x80) and masked to its first two bits. This leads to:

11xx --> 01xx (start byte)
10xx --> 00xx (subsequ. byte)
01xx --> 11xx (single byte)
00xx --> 10xx (single byte)

Then any set bit among the first two is expanded to the first bit position by OR'ing the value from the previous step with copies of itself shifted left and right by one:

01xx --> 11xx (start byte)
00xx --> 00xx (subsequ. byte)
11xx --> 11xx (single byte)
10xx --> 11xx (single byte)

Finally, only the first bit is kept, and the bits set in the result are counted (std::popcount):

11xx --> 10xx --> 1 (start byte and single byte, 1 bit)
00xx --> 00xx --> 0 (subsequence byte, 0 bits)

To use the CPU cache optimally on an x86_64 architecture, the string is read in 8-byte chunks, and the algorithm avoids branches (so branch prediction cannot fail).

This way 8 bytes can be evaluated in fewer steps with a drastically reduced loop count (simplified example with the 4 bytes of the string "onÄ"):

o        n        Ä (1st)   Ä (2nd)
01101111 01101110 11000011  10000100   original bytes
11000000 11000000 01000000  00000000   masked to the first 2 bits and XOR'ed with 1000 0000
11000000 11000000 11000000  00000000   first 2 bits expanded
10000000 10000000 10000000  00000000   masked to the first bit: 3 bits set = 3 chars
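
The same steps, extended to a full 8-byte block so that no tail masking is needed, as a compile-time check (the helper name count_in_block and the doubled example string "onÄonÄ" are mine):

# include <bit>
# include <cstdint>

constexpr std::uint64_t xor_mask = 0x8080'8080'8080'8080ULL;
constexpr std::uint64_t bit_mask = 0xC0C0'C0C0'C0C0'C0C0ULL;

constexpr int count_in_block(std::uint64_t block) {
  std::uint64_t i = (block ^ xor_mask) & bit_mask;  // classify the top two bits of each byte
  i = (i | i << 1 | i >> 1) & xor_mask;             // leave one bit per counted byte
  return std::popcount(i);                          // number of characters in this block
}

// "onÄonÄ" = 6 characters in 8 bytes (6F 6E C3 84 6F 6E C3 84),
// read as a little-endian 64-bit integer
static_assert(count_in_block(0x84C3'6E6F'84C3'6E6FULL) == 6);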

Summary

It's difficult to say which solution is best here. If you're a Linux developer using gcc, you should prefer the std::ranges solution: it is best overall in terms of performance and maintenance. If you're on Windows using MSVC and performance isn't your biggest concern, you should have an eye on std::accumulate as a good compromise between maintainability (readability) and performance.

gcc

chars (bytes) c-while std::ranges std::accumulate cache optimized
1,000 (1,412) 0.869 0.362 0.335 0.523
100 (145) 0.115 0.057 0.041 0.074
10 (18) 0.033 0.020 0.021 0.031
0 (0) 0.016 0.014 0.015 0.022

msvc

chars (bytes) c-while std::ranges std::accumulate cache optimized
1,000 (1,412) 0.710 0.919 0.653 0.153
100 (145) 0.759 0.917 0.705 0.170
10 (18) 0.096 0.110 0.095 0.045
0 (0) 0.016 0.012 0.011 0.033

  1. Duration in seconds for 1,000,000 repetitions. Note that the results of the two compilers can't be compared to each other, since I measured them on different machines.

  2. I measured this a dozen times: this algorithm performs better on 1,000 bytes than on 100! It must have something to do with the test string allocation / memory allocation algorithm (SSO is surely not the reason). I will research this later.

  3. This algorithm may read past the memory boundaries of the given string. But since the string's start address as well as the size of its allocated memory block are aligned (except in the SSO case), on most environments this out-of-bounds read won't cause a SIGSEGV. I personally tried this on Windows 10 (msvc 2022, toolset 14.x) as well as on Ubuntu 22.04 (gcc 9, 11 & 13) and didn't run into any problems.