Boost mailing page: Re: [bloom] Benchmarks with Knuth multiplier-based hash production

Date view	Thread view	Subject view	Author view

From: Joaquin M LÃ³pez MuÃ±oz (joaquinlopezmunoz_at_[hidden])
Date: 2025-06-12 18:56:09

Next message: Ivan Matek: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"
Previous message: Peter Dimov: "Re: [atomic] RFC: Bitwise operations for enums"
In reply to: Ivan Matek: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"
Next in thread: Ivan Matek: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"
Reply: Ivan Matek: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"
Reply: Kostas Savvidis: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"

El 12/06/2025 a las 13:15, Ivan Matek escribiÃ³:
>
>
> On Wed, Jun 11, 2025 at 6:33â€¯PM Joaquin M LÃ³pez MuÃ±oz via Boost
> <boost_at_[hidden]> wrote:
>
>
> Unlike fast_multiblock*, [multi]block is branchless, so your results
> make sense. I wonder if multiblock<K> should be made branchful as
> the working memory area can get quite large for large values of K.
>
> Can you elaborate why would you want to change multiblock?
> To save memory accesses that are unnecessary and slow down rest of the
> program
> because they occupy cache or purely to reduce number of instructions
> executed?
>

Well, both. Consider the following cost model:

* Succesful lookup executes in A*K cycles for some constant factor A.
* Branchless unsuccesful lookup is also A*K, exactly the same as succesful
lookup.
* Branchful unsuccesful lookup when the array is half full executes in
B*(1*0.5 + 2*0.25 + 3*0.125 + ... + K*2^-K-1) cycles . Since the sum
decays exponentially, this is roughly equivalent to some constant C
irrespective of K.

Now, the thing is that for a sufficiently large K, A*K (branchless) is
going to
be larger than C (branchful). The best approach would be then to have a
hybrid algorithm that branches every L operations. Determining such a
value is an interesting experiment.

> If that is the case on my CPU it does not seem the matter a lot, i.e.
> changing AVX2
> to be branchless does not help or hurt performance, maybe CPU anyway
> decides
> to fetch data...
>
> This is what I did to make avx code branchless, did not affect
> performance in a way
> I could detect:
> Â [...]
> found = found && check_m256ix2(x[i],hash,8);

This is short circuiting (at least in theory). You may try using bitwise
AND (&). Also,
try with large values of K: you should reach a point where branchless is
worse.

> But back to the multiblock and making it branchful:
> With my CPU/compiler(clang isÂ ðŸš€Â for this benchmark compared to gcc)
> it outperforms
> fast_multiblock for L2, L3, RAM size filters when it comes to mixed
> lookups, so I am not
> certain it needs changing.

Yes, it's hard to decide based on these numbers. For branchless
lookup, mixed lookup obviously takes the same as successful or
unsuccesful lookup. For branchful lookup, a cost model for mixed
lookup would be

P*M*(A*K) + (1-P)*M*C

where P is the percentange of successful lookups and M is an
extra cost factor on top of the non-mixed approach due to branch
mispredictions. Again, an interesting area to investigate. You're giving
me a lot of work :-)

I'd rather release the lib now in time for Boost 1.89 and then postpone
this analysis for 1.90, as it's going to take some work.

Joaquin M Lopez Munoz

Next message: Ivan Matek: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"
Previous message: Peter Dimov: "Re: [atomic] RFC: Bitwise operations for enums"
In reply to: Ivan Matek: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"
Next in thread: Ivan Matek: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"
Reply: Ivan Matek: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"
Reply: Kostas Savvidis: "Re: [bloom] Benchmarks with Knuth multiplier-based hash production"

Date view	Thread view	Subject view	Author view