$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
From: Ivan Matek (libbooze_at_[hidden])
Date: 2025-06-13 07:23:39
On Thu, Jun 12, 2025 at 8:56â¯PM Joaquin M López Muñoz via Boost <
boost_at_[hidden]> wrote:
> This is short circuiting (at least in theory). You may try using bitwise
> AND (&). Also,
>
Doh. :) Thank you for catching that.
On my machine when I do that difference is miniscule/noise.
I have bumped up now K to 14, so numbers are not directly comparable to
previous numbers, but most important thing is that there is no
difference(for Clang).
Clang
*1M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 1.90735
FPR 0.0737
insertion_time 4.97109
successful_lookup_time 5.75406
unsuccessful_lookup_time 5.73877
mixed_lookup_time 5.77369
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 1.90735
FPR 0.0736
insertion_time 5.19262
successful_lookup_time 5.8273
unsuccessful_lookup_time 5.81782
mixed_lookup_time 5.81926
*35M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0752343
insertion_time 19.7033
successful_lookup_time 19.8497
unsuccessful_lookup_time 20.9164
mixed_lookup_time 20.6151
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0748914
insertion_time 19.734
successful_lookup_time 20.3514
unsuccessful_lookup_time 19.5756
mixed_lookup_time 19.375
as before GCC performance for multiblock is bad, so now SIMD code easily
wins
GCC
*1M*filter filter<int, 1ul, multiblock<unsigned long,
14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 1.90735
FPR 0.0737
insertion_time 11.846
successful_lookup_time 9.76772
unsuccessful_lookup_time 9.80503
mixed_lookup_time 9.81788
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 1.90735
FPR 0.0736
insertion_time 6.17691
successful_lookup_time 6.43346
unsuccessful_lookup_time 6.45574
mixed_lookup_time 6.4289
*35M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0752343
insertion_time 30.3168
successful_lookup_time 36.3754
unsuccessful_lookup_time 37.8927
mixed_lookup_time 38.045
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0748914
insertion_time 21.2682
successful_lookup_time 21.2692
unsuccessful_lookup_time 21.0058
mixed_lookup_time 19.7536
As before unsuccessful lookups get worse slightly compared to branchful
version, as expected, since much more extra work is done.
SIMD code branchful vs branchless for 3M(Clang):
successful_lookup_time 6.5231
unsuccessful_lookup_time 5.71519
mixed_lookup_time 12.8607
insertion_time 6.44847
successful_lookup_time 6.68598
unsuccessful_lookup_time 6.66759
mixed_lookup_time 6.71642
> Again, an interesting area to investigate. You're giving
> me a lot of work :-)
>
Too hard work for me to help with :), but one speculative idea: it may be
reasonable to prepare lookups(i.e. make_m256ix2(hash,kp); ) for all hashes
unconditionally, but still have branchy code
in terms of breaking when first _mm256_testc_si256 detects no match.
Of simpler tasks: if you want I can try to make pull request for adding
mixed results to benchmark.
Downside is that benchmark already has a lot of columns and rows so users
might find it confusing.
On the other hand I think mixed benchmark is valuable info for people who
will not have 90% + of hits or misses in their usages.
>
> I'd rather release the lib now in time for Boost 1.89 and then postpone
> this analysis for 1.90, as it's going to take some work.
>
I agree, I never meant to imply this is critical fix needed situation, I
just found it quite interesting.
for the record this is check now:
static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash)
{
bool found = true;
for(int i=0;i<k/8;++i){
found &= check_m256ix2(x[i],hash,8);
hash=detail::mulx64(hash);
}
if constexpr (k%8){
found &= check_m256ix2(x[k/8],hash,k%8);
}
return found;
}