Performance of std algorithms
Recently I came up with the question if using the std algorithms with lambdas makes my code possibly better readable but slower.
So I did a small benchmark comparing it.
As subject I’ve choosen conditionally copying the content of one vector into another.
At first the setup for each following benchmark run:
#include <benchmark/benchmark.h>
#include <vector>
#include <numeric>
using source_type = std::vector<int>;
source_type source;
source_type target;
static void DoSetup(const benchmark::State& state) {
source.clear();
// state.range(0) contains what gets passed when using ->Args(..)
source.resize(state.range(0));
// here we simply fill the source vector
// starting from 0 increased by one until its end
std::iota(std::begin(source), std::end(source), 0);
}
Now the first algorithm which operates directly and doesn’t use lambdas:
void CopyDirect() {
for (auto v:source) {
if (v % 3) {
target.push_back(v);
}
}
}
static void BM_CopyDirect(benchmark::State& state) {
for (auto _:state) {
state.PauseTiming();
source_type().swap(target);
state.ResumeTiming();
CopyDirect();
}
}
BENCHMARK(BM_CopyDirect)->Arg(1)->Arg(100)->Arg(1'000)->Arg(10'000)->Setup(DoSetup);
And the one that uses std algorithms for getting the same result:
void CopyAlg() {
std::copy_if(
std::begin(source), std::end(source),
std::back_inserter(target),
[](const auto& v) {
return v % 3 != 0;
}
);
}
static void BM_CopyAlg(benchmark::State& state) {
for (auto _:state) {
state.PauseTiming();
source_type().swap(target);
state.ResumeTiming();
CopyAlg();
}
}
BENCHMARK(BM_CopyAlg)->Arg(1)->Arg(100)->Arg(1'000)->Arg(10'000)->Setup(DoSetup);
I ran the benchmark compiled with gcc 13.2 and got a not really surprising result:
--------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------
BM_CopyDirect/1 2173 ns 2173 ns 326137
BM_CopyDirect/100 5997 ns 5996 ns 106952
BM_CopyDirect/1000 28822 ns 28815 ns 24120
BM_CopyDirect/10000 252893 ns 252505 ns 2741
BM_CopyAlg/1 2204 ns 2199 ns 320805
BM_CopyAlg/100 7000 ns 6985 ns 97109
BM_CopyAlg/1000 37553 ns 37538 ns 18609
BM_CopyAlg/10000 340517 ns 339462 ns 2052
The bigger the loop, the bigger the difference between the std algorithm and the raw loop gets.
I think the problem at this point can be both the back_inserter as the lambda as well, since each may require an additional call.
The clang compiled binary creates the same result, but as always runs slower than the gcc compiled one.