Performance of std algorithms // dev/log

Recently I came up with the question if using the std algorithms with lambdas makes my code possibly better readable but slower.
So I did a small benchmark comparing it.

As subject I’ve choosen conditionally copying the content of one vector into another.

At first the setup for each following benchmark run:

#include <benchmark/benchmark.h>
#include <vector>
#include <numeric>

using source_type = std::vector<int>;

source_type source;
source_type target;

static void DoSetup(const benchmark::State& state) {
    source.clear();
    // state.range(0) contains what gets passed when using ->Args(..)
    source.resize(state.range(0));
    // here we simply fill the source vector
    // starting from 0 increased by one until its end
    std::iota(std::begin(source), std::end(source), 0);
}

Now the first algorithm which operates directly and doesn’t use lambdas:

void CopyDirect() {
    for (auto v:source) {
        if (v % 3) {
            target.push_back(v);
        }
    }
}

static void BM_CopyDirect(benchmark::State& state) {
    for (auto _:state) {
        state.PauseTiming();
        source_type().swap(target);
        state.ResumeTiming();
        CopyDirect();
    }
}

BENCHMARK(BM_CopyDirect)->Arg(1)->Arg(100)->Arg(1'000)->Arg(10'000)->Setup(DoSetup);

And the one that uses std algorithms for getting the same result:

void CopyAlg() {
    std::copy_if(
        std::begin(source), std::end(source),
        std::back_inserter(target),
        [](const auto& v) {
            return v % 3 != 0;
        }
    );
}

static void BM_CopyAlg(benchmark::State& state) {
    for (auto _:state) {
        state.PauseTiming();
        source_type().swap(target);
        state.ResumeTiming();
        CopyAlg();
    }
}

BENCHMARK(BM_CopyAlg)->Arg(1)->Arg(100)->Arg(1'000)->Arg(10'000)->Setup(DoSetup);

I ran the benchmark compiled with gcc 13.2 and got a not really surprising result:

--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_CopyDirect/1           2173 ns         2173 ns       326137
BM_CopyDirect/100         5997 ns         5996 ns       106952
BM_CopyDirect/1000       28822 ns        28815 ns        24120
BM_CopyDirect/10000     252893 ns       252505 ns         2741
BM_CopyAlg/1              2204 ns         2199 ns       320805
BM_CopyAlg/100            7000 ns         6985 ns        97109
BM_CopyAlg/1000          37553 ns        37538 ns        18609
BM_CopyAlg/10000        340517 ns       339462 ns         2052

The bigger the loop, the bigger the difference between the std algorithm and the raw loop gets.
I think the problem at this point can be both the back_inserter as the lambda as well, since each may require an additional call.

The clang compiled binary creates the same result, but as always runs slower than the gcc compiled one.