From: Ivan Matek (libbooze_at_[hidden])
Date: 2025-06-12 11:15:26
On Wed, Jun 11, 2025 at 6:33 PM Joaquin M López Muñoz via Boost <
boost_at_[hidden]> wrote:
>
> Unlike fast_multiblock*, [multi]block is branchless, so your results
> make sense. I wonder if multiblock<K> should be made branchful as
> the working memory area can get quite large for large values of K.
>
Can you elaborate on why you would want to change multiblock? To save memory
accesses that are unnecessary and slow down the rest of the program because
they occupy cache, or purely to reduce the number of instructions executed?
If that is the reason, on my CPU it does not seem to matter a lot, i.e.
making the AVX2 code branchless neither helps nor hurts performance; maybe
the CPU decides to fetch the data anyway...
This is what I did to make the AVX code branchless; it did not affect
performance in any way I could detect:
static BOOST_FORCEINLINE bool check(const value_type& x, boost::uint64_t hash)
{
    bool found = true;
    for(int i = 0; i < k/8; ++i){
        /* accumulate the result instead of returning early; the loop always
           runs all k/8 iterations (although && can still short-circuit the
           per-block check itself) */
        found = found && check_m256ix2(x[i], hash, 8);
        hash = detail::mulx64(hash); /* remix the hash for the next block */
    }
    if constexpr(k % 8){
        /* tail block when k is not a multiple of 8 */
        found = found && check_m256ix2(x[k/8], hash, k % 8);
    }
    return found;
}
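For contrast, a branchful check exits as soon as one block misses, which, as
I understand it, is roughly the shape fast_multiblock* has. This is only a
sketch (check_branchful is a name I made up) reusing the same check_m256ix2
and detail::mulx64 helpers from above:

static BOOST_FORCEINLINE bool check_branchful(const value_type& x, boost::uint64_t hash)
{
    for(int i = 0; i < k/8; ++i){
        /* early exit: when nearly all lookups are unsuccessful this branch
           is well predicted and later blocks are never touched */
        if(!check_m256ix2(x[i], hash, 8)) return false;
        hash = detail::mulx64(hash);
    }
    if constexpr(k % 8){
        return check_m256ix2(x[k/8], hash, k % 8);
    }
    return true;
}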
But back to multiblock and making it branchful:
With my CPU/compiler (clang is much faster than gcc for this benchmark) it
outperforms fast_multiblock for L2-, L3-, and RAM-sized filters when it comes
to mixed lookups, so I am not certain it needs changing.
For unsuccessful lookups fast_multiblock wins a little, but I am not sure
that unsuccessful lookups will be close to 100% in all use cases.
Maybe the best option is to allow users to configure whether a filter is
branchy or not, but with so many filters I am not sure another config option
is a good idea, as it may confuse users; a rough sketch of what that could
look like follows.
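Purely as an illustration of the configuration idea (none of these names
exist in boost::bloom; the tags and check_blocks are made up), the knob could
be a compile-time policy that the subfilter's check dispatches on:

#include <cstdint>
#include <type_traits>

/* hypothetical policy tags -- not an existing boost::bloom API */
struct branchless_lookup {};
struct branchful_lookup  {};

/* BlockCheck and Remix stand in for the real per-block test and hash
   remixer (e.g. check_m256ix2 and detail::mulx64) */
template<typename Policy, int K, typename Block,
         typename BlockCheck, typename Remix>
bool check_blocks(const Block* blocks, std::uint64_t hash,
                  BlockCheck block_check, Remix remix)
{
    if constexpr(std::is_same_v<Policy, branchful_lookup>){
        for(int i = 0; i < K; ++i){
            if(!block_check(blocks[i], hash)) return false; /* early exit */
            hash = remix(hash);
        }
        return true;
    }else{
        bool found = true;
        for(int i = 0; i < K; ++i){ /* always runs all K iterations */
            found = found && block_check(blocks[i], hash);
            hash = remix(hash);
        }
        return found;
    }
}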
In any case, here are the numbers for 2 similar filters on my machine: 1M,
3M, and 35M values (I have a lot of L3 cache, so I needed a large number of
values to go beyond L3).
*CLANG*
*1M/L2*
filter                    filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               1.90735
FPR                       0.0581
insertion_time            4.30731
successful_lookup_time    4.62453
unsuccessful_lookup_time  4.64252
mixed_lookup_time         4.6812
filter                    filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               1.90735
FPR                       0.0629
insertion_time            4.33433
successful_lookup_time    4.47549
unsuccessful_lookup_time  3.46791
mixed_lookup_time         11.0634
*3M/L3*
filter                    filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               5.72205
FPR                       0.0609667
insertion_time            7.01126
successful_lookup_time    6.40535
unsuccessful_lookup_time  6.30024
mixed_lookup_time         6.20703
filter                    filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               5.72205
FPR                       0.0632
insertion_time            7.17907
successful_lookup_time    6.13649
unsuccessful_lookup_time  6.17347
mixed_lookup_time         12.0361
*35M/RAM*
filter                    filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               66.7572
FPR                       0.0635571
insertion_time            19.3996
successful_lookup_time    18.6527
unsuccessful_lookup_time  18.1346
mixed_lookup_time         18.2745
filter                    filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               66.7572
FPR                       0.06368
insertion_time            18.9966
successful_lookup_time    18.729
unsuccessful_lookup_time  16.1566
mixed_lookup_time         21.5246
In case somebody is curious, here are the GCC 14 numbers:
*GCC*
*1M*
filter                    filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               1.90735
FPR                       0.0581
insertion_time            13.0513
successful_lookup_time    8.06424
unsuccessful_lookup_time  8.09518
mixed_lookup_time         8.21242
filter                    filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               1.90735
FPR                       0.0629
insertion_time            5.68284
successful_lookup_time    5.7248
unsuccessful_lookup_time  4.32124
mixed_lookup_time         11.7601
*3M*
filter                    filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               5.72205
FPR                       0.0609667
insertion_time            14.0387
successful_lookup_time    9.44954
unsuccessful_lookup_time  9.3228
mixed_lookup_time         9.41488
filter                    filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               5.72205
FPR                       0.0632
insertion_time            6.97644
successful_lookup_time    6.69201
unsuccessful_lookup_time  5.9029
mixed_lookup_time         12.6044
*35M*
filter                    filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               66.7572
FPR                       0.0635571
insertion_time            26.4349
successful_lookup_time    32.8604
unsuccessful_lookup_time  33.0485
mixed_lookup_time         33.0334
filter                    filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB               66.7572
FPR                       0.06368
insertion_time            20.6438
successful_lookup_time    18.1056
unsuccessful_lookup_time  16.2122
mixed_lookup_time         23.2072
(in some cases over 50% slower than clang)
What I find quite interesting is that if you benchmark only with clang or
only with gcc, you will reach totally different conclusions about which
filter is faster. Not to mention that before measuring I would have assumed
that for the 35M case memory latency dominates, so differences in
optimization would not matter...
For example, mixed_lookup_time for the 35M case:
clang: 18.2745 (multiblock) vs 21.5246 (fast_multiblock64) -- multiblock is
about 15% faster
gcc: 33.0334 (multiblock) vs 23.2072 (fast_multiblock64) -- fast_multiblock64
is about 30% faster
P.S. I am obviously comparing clang and gcc on just one benchmark; please do
not extrapolate this to general optimizer performance.