From: Ivan Matek (libbooze_at_[hidden])
Date: 2025-06-14 17:32:17


One more data point that confirms what you suggested (and what intuitively
makes sense): for large K the effect of breaking early on unsuccessful
lookups is more impressive (when the filter fits in L2).

Here are the K == 19 results. I picked K == 19 only because it is bigger than
16, no other reason; maybe it is a silly number to use for a real bloom
filter. Clang results only.

*1M*
branchless:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 2.38419
FPR 0.0174
insertion_time 6.89076
successful_lookup_time 7.72433
unsuccessful_lookup_time 7.76004
mixed_lookup_time 7.78691

branchful:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 2.38419
FPR 0.0174
insertion_time 6.808
successful_lookup_time 7.09307
unsuccessful_lookup_time 4.9642
mixed_lookup_time 12.8065

For L3 sizes the branchless code performs much better in relative terms, even
though the branchy code still wins for unsuccessful lookups; I would assume
this is because of the large memory latencies.

*3M*
branchless:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 7.15256
FPR 0.0153667
insertion_time 10.1734
successful_lookup_time 9.1581
unsuccessful_lookup_time 9.0949
mixed_lookup_time 9.10151

branchful:
filter filter<int, 1ul, fast_multiblock64<19ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 7.15256
FPR 0.0153667
insertion_time 8.9318
successful_lookup_time 8.79925
unsuccessful_lookup_time 8.34029
mixed_lookup_time 14.4601

There is also another interesting fact about performance that will take a bit
to explain, so I ask for patience:
fast_multiblock64<8> is much faster than fast_multiblock64<11> for
unsuccessful_lookup_time. This does not initially make sense, because we would
expect to execute the same number of instructions (with the early return, an
unsuccessful lookup should usually bail out after the first SIMD check in both
cases), and the branch is easily predictable since 100% of lookups fail
(FPR aside).
I am not an asm expert, but it seems clang interleaves the SIMD code for the
11 case, so some work that is not strictly necessary before the branch is done
ahead of the branch. That is great when we do not take the early return, but
is wasted when close to 100% of lookups are unsuccessful; see the sketch below.
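
To make the two code shapes concrete, here is a minimal sketch of the general
pattern (my illustration, not the library's actual code): the branchless
version checks every word unconditionally, which lets the compiler interleave
the work, while the branchful version returns on the first miss.

#include <cstddef>
#include <cstdint>

// branchless: every word is checked unconditionally and the results are
// AND-ed together; no early exit, so the compiler may interleave freely
bool may_contain_branchless(const std::uint64_t* words,
                            const std::uint64_t* masks, std::size_t n)
{
    bool res = true;
    for(std::size_t i = 0; i < n; ++i)
        res &= (words[i] & masks[i]) == masks[i];
    return res;
}

// branchful: return on the first word that misses; with ~100% unsuccessful
// lookups the branch predicts well and later words are never touched
bool may_contain_branchful(const std::uint64_t* words,
                           const std::uint64_t* masks, std::size_t n)
{
    for(std::size_t i = 0; i < n; ++i)
    {
        if((words[i] & masks[i]) != masks[i]) return false;
    }
    return true;
}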
This made me investigate a bit, and on my machine I can get really great
performance for unsuccessful_lookup_time when I combine a SIMD filter with a
scalar filter for the "remainder" (e.g. SIMD 8, scalar 3 for K == 11).

This combined filter (bobw_filter) has really great performance for
unsuccessful lookups. It is unclear if this is just a microbenchmark anomaly
(if there were extra code, as in a real application, maybe the extra SIMD code
executed by fast_multiblock64<11ul> would not matter).

*3M*
filter bobw_filter<11ul> // fast_multiblock64<8> + multiblock<3>
capacity MB 5.72205
FPR 0.0555667
insertion_time 7.7845
successful_lookup_time 6.48931
unsuccessful_lookup_time 3.45824
mixed_lookup_time 12.2271

filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 5.72205
FPR 0.0609667
insertion_time 6.6754
successful_lookup_time 6.161
unsuccessful_lookup_time 6.19519
mixed_lookup_time 6.09886

filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange> // branchful
capacity MB 5.72205
FPR 0.0632
insertion_time 6.33386
successful_lookup_time 5.87341
unsuccessful_lookup_time 5.86074
mixed_lookup_time 12.0457

I would not bother you with a small difference, but 3.46 vs 6.20 and 5.86
looks quite impressive, with the disclaimer I mentioned before that it may be
just a tight microbenchmark difference, and that this only matters when people
are purely focused on unsuccessful lookups.

The advantage holds even for a large number of elements for unsuccessful
lookups, but the other timings look quite bad in relative terms.

*35M*
filter bobw_filter<11ul>
capacity MB 66.7572
FPR 0.0569229
insertion_time 23.9595
successful_lookup_time 22.1188
unsuccessful_lookup_time 10.6815
mixed_lookup_time 24.646

filter filter<int, 1ul, multiblock<unsigned long, 11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.0635571
insertion_time 19.1072
successful_lookup_time 18.0308
unsuccessful_lookup_time 18.0789
mixed_lookup_time 18.0465

filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul, hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 66.7572
FPR 0.06368
insertion_time 18.6431
successful_lookup_time 16.2872
unsuccessful_lookup_time 14.8029
mixed_lookup_time 21.1569

I understand that combining two filters, instead of modifying one filter to
contain a mix of SIMD and scalar code, is not really the greatest comparison,
but it was much easier for me to make a combined filter than to figure out how
to mix scalar and SIMD code inside one filter.

// includes/usings assumed for the review candidate; exact header layout may
// differ:
#include <boost/bloom/filter.hpp>
#include <boost/config.hpp>  // BOOST_FORCEINLINE
#include <boost/cstdint.hpp> // boost::uint64_t
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <variant> // std::monostate

using namespace boost::bloom;

template<std::size_t K>
struct bobw_filter
{
    // a scalar "remainder" filter is needed whenever K is not a multiple of 4
    static constexpr bool has_scalar = K % 4 != 0;
    bobw_filter() = default;
    // the scalar part gets the K % 4 share of the total capacity...
    static constexpr size_t scalar_capacity(size_t full_capacity)
    {
        return full_capacity*(K%4)/K;
    }
    // ...and the SIMD part the remaining (K/4)*4 share
    static constexpr size_t simd_capacity(const size_t full_capacity)
    {
        return full_capacity*(K/4*4)/K;
    }
    explicit bobw_filter(size_t capacity) : simd_f(simd_capacity(capacity))
    {
        if constexpr(has_scalar)
        {
            assert(scalar_capacity(capacity)!=0);
            scalar_f.reset(scalar_capacity(capacity));
        }
    }

    // TODO: is applying mulx64 to x directly the correct thing to do, or
    // should we apply it to the hash of x?
    BOOST_FORCEINLINE void insert(const int& x)
    {
        simd_f.insert(x);
        if constexpr(has_scalar)
        {
            scalar_f.insert(detail::mulx64(x));
        }
    }
    BOOST_FORCEINLINE bool may_contain(const int& x) const
    {
        if constexpr(has_scalar)
        {
            // both parts must report a hit; the scalar part sees a rehashed
            // value so the two filters probe independent bit positions
            return simd_f.may_contain(x) &&
                   scalar_f.may_contain(detail::mulx64(x));
        }
        else
        {
            return simd_f.may_contain(x);
        }
    }
    auto capacity() const
    {
        if constexpr(has_scalar)
        {
            return scalar_f.capacity() + simd_f.capacity();
        }
        else
        {
            return simd_f.capacity();
        }
    }
    [[no_unique_address]]
    std::conditional_t<has_scalar,
        filter<int, 1, multiblock<boost::uint64_t, K % 4>, 1>,
        std::monostate> scalar_f;
    filter<int, 1, fast_multiblock64<K / 4 * 4>, 1> simd_f;
    using value_type = typename decltype(simd_f)::value_type;
};
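
For reference, a minimal usage sketch (my own, with made-up numbers, and
assuming the filter constructor takes its capacity in bits):

#include <cstdio>

int main()
{
    // for K == 11 and 1'000'000 bits the SIMD part gets 727'272 bits (8/11)
    // and the scalar part 272'727 bits (3/11)
    bobw_filter<11> f(1'000'000);
    f.insert(42);
    bool hit  = f.may_contain(42);   // always true: no false negatives
    bool miss = f.may_contain(1337); // false, except with probability FPR
    std::printf("%d %d\n", hit, miss);
}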