From: Joaquin M López Muñoz (joaquinlopezmunoz_at_[hidden])
Date: 2025-06-11 16:33:37


El 10/06/2025 a las 14:52, Ivan Matek escribió:
> One more thing I noticed/checked and I believe may be of interest,
> again not in particular for this proposed Knuth multiplier change, but
> in general.
>
> Some filters have much weaker performance for mixed lookup.
> To be honest I have expected that for some(branch misprediction) but I
> do not know enough about all filters to say if slowdowns I have
> observed are as expected.
> Interesting that some fast_ have bad performance for mixed lookup,
> although naively I would assume SIMD code is branchless. I presume the
> issue is that sometimes we operate on more than 256 bits so there is
> branching...

fast_multiblock* has branches (early termination on lookup)
at k = 8, 16, ..., that is, whenever a full subblock (a pair of __m128is,
a __m256i, a pair of __m256is or a uint32x4x2_t, depending on the
subfilter and the SIMD technology) has been processed.

> What I find interesting beside this slowdown is that
> unordered_flat_set  also has bad performance for mixed lookup so bloom
> filters that do not degrade on mixed lookup look much better than if
> we just compare 100%/0% cases.
> For example if we round performance to 1 decimal digit and compare
> unordered_flat set
> 4.7/3.1/*9.4*
> vs
> filter<int, 1ul, block<unsigned long, 7ul>, 1ul, hash<int>,
> allocator<int>, mcg_and_fastrange>
> 3.3/3.3/*3.3*
>
> comparison looks much much different if we drop mixed
> performance(bolded part).

Unlike fast_multiblock*, [multi]block is branchless, so your results
make sense. I wonder if multiblock<K> should be made branchful as
the working memory area can get quite large for large values of K.

Joaquin M Lopez Munoz