From: Ivan Matek (libbooze_at_[hidden])
Date: 2025-05-20 19:07:20


On Tue, May 20, 2025 at 6:31 PM Joaquin M López Muñoz via Boost <
boost_at_[hidden]> wrote:

> > User might think: my CPU supports AVX2, so surely it will use SIMD
> > algorithms. But "available" here refers to compiler options (and,
> > obviously, CPU support when the binary runs), not just CPU support.
> > I know I am not telling you anything you do not know, I just think a
> > large percentage of users might misunderstand what "available" means.
>
> Yes, you're right, I can rewrite "are available" as "are enabled at compile
> time".
>

Thank you, I believe that is a big improvement.
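
To make "enabled at compile time" concrete for future readers: GCC, Clang
and MSVC only predefine __AVX2__ when the matching flag is passed (-mavx2
or /arch:AVX2), so whatever macro the library actually checks internally,
the gate is roughly of this shape, independent of what the CPU the binary
later runs on supports:

#if defined(__AVX2__)
  // SIMD code path compiled in (requires -mavx2 / /arch:AVX2 or equivalent)
#else
  // portable fallback compiled in, even if the CPU at runtime has AVX2
#endif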

>
> Umm, yes, maybe. Anyway, scratch what I said about compilers
> not really caring about const vs. static const: adding static to your
> snippet severely pessimizes the codegen, with static initialization
> guards and all. So there goes your explanation of why static
> was not used :-)

Yes, I have noticed static messes it up, although for ints
<https://godbolt.org/z/oYM15zYoW> the compiler is smart enough not to emit
that guard. That is one of the reasons why I am so paranoid that this
optimization might stop working with some future compiler.
SIMD intrinsics may be harder for the compiler to reason about than "just"
ints.
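
To show what I mean with a minimal standalone sketch (not code from the
library, and whether the guard actually appears depends on the compiler
and flags):

#include <immintrin.h> // compile with -mavx2

int add_static_int(int x) {
    static const int c = 42;                        // constant-initialized: no guard
    return x + c;
}

__m256i add_static_simd(__m256i x) {
    static const __m256i c = _mm256_set1_epi32(1);  // may get a thread-safe init guard
    return _mm256_add_epi32(x, c);
}

__m256i add_const_simd(__m256i x) {
    const __m256i c = _mm256_set1_epi32(1);         // typically folded away completely
    return _mm256_add_epi32(x, c);
}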

> For the record, during development I examined
> the generated code for all fast_multiblockXX classes with the three
> major compilers, on Intel and ARM, to check that nothing looked bad.
>
I agree that 99% of the time it will never break, since I presume compilers
will rarely regress in this manner... but I still think there is a tiny
chance they might. :)

One more question:
I have some handcrafted tests (where the bloom filter is so small that it
fits in L1/L2 cache, and the hit rate of lookups is 0% (besides false
positives)), and the SIMD one is a bit slower than the non-SIMD one for
certain values of K.

constexpr size_t num_inserted = 10'000;
constexpr double fpr = 1e-5;
constexpr size_t K = 5;

using vanilla_filter = boost::bloom::filter<
    uint64_t, 1, boost::bloom::multiblock<uint64_t, K>, 1>;
using simd_filter = boost::bloom::filter<
    uint64_t, 1, boost::bloom::fast_multiblock64<K>, 1>;
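
For reference, the shape of the loop I am timing is roughly the following
(a simplified sketch rather than my exact harness; capacity_for, insert and
may_contain are the member names as I read them from the library's docs):

#include <boost/bloom/filter.hpp>
#include <boost/bloom/multiblock.hpp>
#include <boost/bloom/fast_multiblock64.hpp>
#include <chrono>
#include <cstdint>
#include <cstdio>

template<typename Filter>
void run(const char* name) {
    constexpr std::size_t num_inserted = 10'000;
    constexpr double      fpr          = 1e-5;

    Filter f(Filter::capacity_for(num_inserted, fpr)); // capacity in bits
    for (std::uint64_t i = 0; i < num_inserted; ++i) f.insert(i);

    // look up only keys that were never inserted, so the hit rate is 0%
    // apart from false positives
    std::size_t hits = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (std::uint64_t i = num_inserted; i < 100 * num_inserted; ++i) {
        hits += f.may_contain(i) ? 1 : 0;
    }
    auto t1 = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
    std::printf("%s: %lld ns, %zu false positives\n", name, (long long)ns, hits);
}

int main() {
    constexpr std::size_t K = 5; // same aliases as above
    using vanilla_filter = boost::bloom::filter<std::uint64_t, 1,
        boost::bloom::multiblock<std::uint64_t, K>, 1>;
    using simd_filter = boost::bloom::filter<std::uint64_t, 1,
        boost::bloom::fast_multiblock64<K>, 1>;

    run<vanilla_filter>("multiblock");
    run<simd_filter>("fast_multiblock64");
}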

I presume that is expected, since it is hard to make sure SIMD is always
faster, but I just wanted to double-check with you that this is not an
unexpected result.
So to recap my question: if the bloom filter fits in L1 or L2 cache, is it
best practice to check whether the SIMD or the normal version is faster,
instead of assuming SIMD always wins?