An odyssey of UTF-8 string character counting optimization
A few days ago I started refactoring a legacy string class, mainly with the goal of speeding up some static operations using `constexpr`.
While adapting each method one by one, I came across one called `utf8_len` - it returns the number of characters in the contained UTF-8 string. Granted, a better name would be something like `char_count` or maybe `utf8_char_count`, but changing the API itself was not an option.
I remember having refactored this a few years ago, using a suggestion I found on Stack Overflow:
The simple C while loop
```cpp
#include <cstdint>
#include <string>

size_t utf8_length(const std::string& source) {
    const char* _s = source.c_str();
    size_t len = 0;
    while (*_s)
        len += (uint8_t)((*_s++ & 0xc0) != 0x80);
    return len;
}
```
Since all of the UTF-8 strings used are well formed and error checked, this works quite well and is pretty fast with both compilers / OSes.
| chars (bytes) | gcc¹ | msvc¹ |
|---|---|---|
| 1,000 (1,412) | 0.869 | 0.710 |
| 100 (145) | 0.115 | 0.759² |
| 10 (18) | 0.033 | 0.096 |
| 0 (0) | 0.016 | 0.016 |
The C++20 solution
I recently attended a C++20 training held by Rainer Grimm.
Among other things he showed us the usage of `std::ranges`.
So I tried implementing the same behaviour using `std::ranges`:
```cpp
#include <cstdint>
#include <string>
#include <ranges>
#include <algorithm>

size_t utf8len(const std::string& content) {
    return std::ranges::count_if(content, [](auto c) { return (c & 0xc0) != 0x80; });
}
```
With gcc: wow! This takes half the time of the C while loop - and it's `constexpr`-usable!
With MSVC it takes a bit more time; presumably gcc optimizes better, or the Linux developers have written a faster standard library…
| chars (bytes) | gcc¹ | msvc¹ |
|---|---|---|
| 1,000 (1,412) | 0.362 | 0.919 |
| 100 (145) | 0.057 | 0.917 |
| 10 (18) | 0.020 | 0.110 |
| 0 (0) | 0.014 | 0.012 |
The C++17 algorithm solution
OK, let's try `std::accumulate` from `<numeric>`:
```cpp
#include <cstdint>
#include <string>
#include <numeric>

__attribute__ ((noinline)) size_t utf8len(const std::string& content) {
    return std::accumulate(
        content.cbegin(), content.cend(), (size_t)0,
        [](size_t count, uint8_t c) -> size_t {
            return count + (size_t)((c & 0xc0) != 0x80);
        });
}
```
| chars (bytes) | gcc¹ | msvc¹ |
|---|---|---|
| 1,000 (1,412) | 0.335 | 0.653 |
| 100 (145) | 0.041 | 0.705² |
| 10 (18) | 0.021 | 0.095 |
| 0 (0) | 0.015 | 0.011 |
Surprisingly (if only slightly) faster than the `std::ranges` solution; with MSVC it even beats the C while loop.
Cache optimized
A few weeks ago I watched a presentation Fedor Pikus held at CppCon 2021 about branch prediction (youtube link), and another one about memory alignment (honestly, I forgot the presenter…).
So I wrote an algorithm³ which I thought would have nearly no branch misses and would use the cache optimally. Unfortunately my perf installation doesn't show cycles and branch misses, but I'm still on it.
```cpp
#include <bit>       // std::popcount, std::endian
#include <cstdint>
#include <string>

__attribute__ ((noinline)) size_t utf8len(const std::string& content) {
    using block_type = uint64_t;
    constexpr block_type all_bits = ~(block_type)0;
    constexpr block_type xor_mask = static_cast<block_type>(0x8080'8080'8080'8080);
    constexpr block_type bit_mask = static_cast<block_type>(0xC0C0'C0C0'C0C0'C0C0);
    size_t result = 0;
    const block_type* ptr = (const block_type*)(content.c_str()),
                    * end = ptr + (content.size() / sizeof(block_type));
    block_type i{};
    while (ptr < end) {
        i = (*ptr ^ xor_mask) & bit_mask;
        i = (i | i << 1 | i >> 1) & xor_mask;
        result += std::popcount(i);
        ++ptr;
    }
    block_type mask{};
    size_t bytes_left = content.size() % sizeof(block_type);
    if constexpr (std::endian::native == std::endian::little) {
        mask = ~(all_bits << (bytes_left * 8));
    }
    else {
        mask = ~(all_bits >> (bytes_left * 8));
    }
    i = (*ptr ^ xor_mask) & bit_mask & mask;
    i = (i | i << 1 | i >> 1) & xor_mask;
    result += std::popcount(i);
    return result;
}
```
This took me a few hours to write, and the result with gcc is disillusioning. The MSVC-compiled binaries, however, are fast as hell on Windows for larger strings - nearly 5× faster than the C while loop for strings larger than 100 bytes!
| chars (bytes) | gcc¹ | msvc¹ |
|---|---|---|
| 1,000 (1,412) | 0.523 | 0.153 |
| 100 (145) | 0.074 | 0.170² |
| 10 (18) | 0.031 | 0.045 |
| 0 (0) | 0.022 | 0.033 |
Cache optimized using std::accumulate
Since `std::accumulate` was the fastest gcc / Linux solution so far, I tried to combine it with the cache-optimized one:
```cpp
#include <bit>       // std::popcount, std::endian
#include <cstdint>
#include <numeric>
#include <string>

size_t utf8len(const std::string& content) {
    using block_type = uint64_t;
    constexpr block_type all_bits = ~(block_type)0;
    constexpr block_type xor_mask = static_cast<block_type>(0x8080'8080'8080'8080);
    constexpr block_type bit_mask = static_cast<block_type>(0xC0C0'C0C0'C0C0'C0C0);
    const block_type* ptr = (const block_type*)(content.c_str()),
                    * end = ptr + (content.size() / sizeof(block_type));
    size_t result = std::accumulate(
        ptr, end, (size_t)0,
        [](size_t count, const block_type p) -> size_t {
            block_type i = (p ^ xor_mask) & bit_mask;
            i = (i | i << 1 | i >> 1) & xor_mask;
            return count + std::popcount(i);
        });
    size_t bytes_left = content.size() % sizeof(block_type);
    block_type mask{};
    if constexpr (std::endian::native == std::endian::little) {
        mask = ~(all_bits << (bytes_left * 8));
    }
    else {
        mask = ~(all_bits >> (bytes_left * 8));
    }
    block_type i = (*end ^ xor_mask) & bit_mask & mask;
    i = (i | i << 1 | i >> 1) & xor_mask;
    result += std::popcount(i);
    return result;
}
```
| chars (bytes) | gcc¹ | msvc¹ |
|---|---|---|
| 1,000 (1,412) | 0.549 | 0.154 |
| 100 (145) | 0.078 | 0.218² |
| 10 (18) | 0.031 | 0.045 |
| 0 (0) | 0.027 | 0.033 |
OK, time is money, let's stop here. It performs like my hand-written cache-optimized algorithm on both systems.
The pure `std::accumulate` bytewise character counting is the fastest of these on my Linux machine, whilst my cache-optimized algorithm performs best under Windows. It seems as if the GCC standard library is better optimised (or optimisable?) than the Microsoft implementation. I'll use them conditionally.
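Choosing conditionally could look like this (a sketch, not my final code; the names `utf8len_accumulate` and `utf8len_blocks` are hypothetical, with the bodies being the ones shown above):

```cpp
#include <cstdint>
#include <numeric>
#include <string>

// Bytewise std::accumulate version from above - fastest with gcc on Linux.
size_t utf8len_accumulate(const std::string& content) {
    return std::accumulate(
        content.cbegin(), content.cend(), (size_t)0,
        [](size_t count, uint8_t c) -> size_t {
            return count + (size_t)((c & 0xc0) != 0x80);
        });
}

size_t utf8len(const std::string& content) {
#if defined(_MSC_VER)
    return utf8len_blocks(content);      // MSVC: the cache-optimized block version wins
#else
    return utf8len_accumulate(content);  // gcc: bytewise accumulate wins
#endif
}
```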
How it works
Counting UTF-8 characters relies on the multibyte definition of this encoding (see also this wikipedia article). The only bits in a byte of a UTF-8 string that matter for character counting are the first two (most significant) bits. If the first bit is set (0x80) but the second is not, the byte is a continuation byte of a multibyte character and thus must not be counted:

```
11xx xxxx -> 1 (multibyte character start byte)
10xx xxxx -> 0 (multibyte character continuation byte, always preceded by a start byte or another continuation byte - do not count)
00xx xxxx -> 1 (single byte character)
01xx xxxx -> 1 (single byte character)
```
To calculate this, each byte is XOR'ed with `1000 0000` (0x80) and masked to its first two bits (0xC0). This leads to:

```
11xx --> 01xx (start byte)
10xx --> 00xx (continuation byte)
01xx --> 11xx (single byte)
00xx --> 10xx (single byte)
```
Then either of the first two bits is expanded by OR'ing the value from the previous step with its left- and right-shifted copies:

```
01xx --> 11xx (start byte)
00xx --> 00xx (continuation byte)
11xx --> 11xx (single byte)
10xx --> 11xx (single byte)
```
Finally, only the first bit is kept (masked with 0x80) and all bits set in the result are counted with `std::popcount`:

```
11xx --> 10xx --> 1 (start byte and single byte, 1 bit)
00xx --> 00xx --> 0 (continuation byte, 0 bits)
```
To use the CPU cache optimally on an x86_64 architecture, the string is read in 8-byte chunks, and the algorithm doesn't use branches (to allow the best branch prediction success).
This way 8 bytes can be evaluated in fewer steps with a drastically shorter loop (simplified example with the 4-byte string "onÄ"):
| o | n | Ä (byte 1) | Ä (byte 2) | |
|---|---|---|---|---|
| 01101111 | 01101110 | 11000011 | 10000100 | Original value |
| 11000000 | 11000000 | 01000000 | 00000000 | XOR'ed with 1000 0000, masked to the first 2 bits |
| 11000000 | 11000000 | 11000000 | 00000000 | First 2 bits expanded |
| 10000000 | 10000000 | 10000000 | 00000000 | Masked by the first bit, 3 bits left = 3 chars |
Summary
It's difficult to say what the best solution is here.
If you're a Linux developer using gcc, you should prefer the `std::ranges` solution.
It is best overall (performance, maintainability, `constexpr`).
If you're on Windows using MSVC and performance isn't your biggest problem, you should have an eye on `std::accumulate` as an optimal compromise between maintainability (readability) and performance.
gcc
| chars (bytes) | c-while | std::ranges | std::accumulate | cache optimized |
|---|---|---|---|---|
| 1,000 (1,412) | 0.869 | 0.362 | 0.335 | 0.523 |
| 100 (145) | 0.115 | 0.057 | 0.041 | 0.074 |
| 10 (18) | 0.033 | 0.020 | 0.021 | 0.031 |
| 0 (0) | 0.016 | 0.014 | 0.015 | 0.022 |
msvc
| chars (bytes) | c-while | std::ranges | std::accumulate | cache optimized |
|---|---|---|---|---|
| 1,000 (1,412) | 0.710 | 0.919 | 0.653 | 0.153 |
| 100 (145) | 0.759 | 0.917 | 0.705 | 0.170 |
| 10 (18) | 0.096 | 0.110 | 0.095 | 0.045 |
| 0 (0) | 0.016 | 0.012 | 0.011 | 0.033 |
1. Duration in seconds for 1,000,000 repetitions. Note that the different compiler results can't be compared to each other, since I checked both on different machines.
2. I measured this a dozen times: this algorithm performs better on 1,000 bytes than on 100! It must have something to do with the test string allocation / memory allocation algorithm (SSO is surely not the reason). I will research this later.
3. This algorithm may read beyond the memory boundaries of the given string. But since the string is aligned from its beginning, as is the size of the memory block allocated for it (except in case of SSO), on most environments this out-of-bounds read won't cause a SIGSEGV. I personally tried this on Windows 10 (msvc 2022, toolset 14.x) as well as on Ubuntu 22.04 (gcc 9, 11 & 13) and didn't get any problems.