From: Ivan Matek (libbooze_at_[hidden])
Date: 2025-06-13 07:23:39


On Thu, Jun 12, 2025 at 8:56 PM Joaquin M López Muñoz via Boost <
boost_at_[hidden]> wrote:

> This is short circuiting (at least in theory). You may try using bitwise
> AND (&). Also,
>

Doh. :) Thank you for catching that.
On my machine when I do that difference is miniscule/noise.
I have bumped up now K to 14, so numbers are not directly comparable to
previous numbers, but most important thing is that there is no
difference(for Clang).

Clang
*1M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 1.90735
FPR 0.0737
insertion_time 4.97109
successful_lookup_time 5.75406
unsuccessful_lookup_time 5.73877
mixed_lookup_time 5.77369
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 1.90735
FPR 0.0736
insertion_time 5.19262
successful_lookup_time 5.8273
unsuccessful_lookup_time 5.81782
mixed_lookup_time 5.81926

*35M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0752343
insertion_time 19.7033
successful_lookup_time 19.8497
unsuccessful_lookup_time 20.9164
mixed_lookup_time 20.6151
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0748914
insertion_time 19.734
successful_lookup_time 20.3514
unsuccessful_lookup_time 19.5756
mixed_lookup_time 19.375

as before GCC performance for multiblock is bad, so now SIMD code easily
wins

GCC

*1M*filter filter<int, 1ul, multiblock<unsigned long,
14ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 1.90735
FPR 0.0737
insertion_time 11.846
successful_lookup_time 9.76772
unsuccessful_lookup_time 9.80503
mixed_lookup_time 9.81788
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 1.90735
FPR 0.0736
insertion_time 6.17691
successful_lookup_time 6.43346
unsuccessful_lookup_time 6.45574
mixed_lookup_time 6.4289

*35M*
filter filter<int, 1ul, multiblock<unsigned long, 14ul>,
1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0752343
insertion_time 30.3168
successful_lookup_time 36.3754
unsuccessful_lookup_time 37.8927
mixed_lookup_time 38.045
filter filter<int, 1ul, fast_multiblock64<14ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0748914
insertion_time 21.2682
successful_lookup_time 21.2692
unsuccessful_lookup_time 21.0058
mixed_lookup_time 19.7536

As before unsuccessful lookups get worse slightly compared to branchful
version, as expected, since much more extra work is done.

SIMD code branchful vs branchless for 3M(Clang):

successful_lookup_time 6.5231
unsuccessful_lookup_time 5.71519
mixed_lookup_time 12.8607

insertion_time 6.44847
successful_lookup_time 6.68598
unsuccessful_lookup_time 6.66759
mixed_lookup_time 6.71642

> Again, an interesting area to investigate. You're giving
> me a lot of work :-)
>

Too hard work for me to help with :), but one speculative idea: it may be
reasonable to prepare lookups(i.e. make_m256ix2(hash,kp); ) for all hashes
unconditionally, but still have branchy code
in terms of breaking when first _mm256_testc_si256 detects no match.

Of simpler tasks: if you want I can try to make pull request for adding
mixed results to benchmark.
Downside is that benchmark already has a lot of columns and rows so users
might find it confusing.
On the other hand I think mixed benchmark is valuable info for people who
will not have 90% + of hits or misses in their usages.

>
> I'd rather release the lib now in time for Boost 1.89 and then postpone
> this analysis for 1.90, as it's going to take some work.
>

I agree, I never meant to imply this is critical fix needed situation, I
just found it quite interesting.

for the record this is check now:

static BOOST_FORCEINLINE bool check(const value_type& x,boost::uint64_t hash)
{
  bool found = true;
  for(int i=0;i<k/8;++i){
    found &= check_m256ix2(x[i],hash,8);
    hash=detail::mulx64(hash);
  }
  if constexpr (k%8){
    found &= check_m256ix2(x[k/8],hash,k%8);
  }
  return found;
}