$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
From: Joaquin M López Muñoz (joaquinlopezmunoz_at_[hidden])
Date: 2025-06-12 18:56:09
El 12/06/2025 a las 13:15, Ivan Matek escribió:
>
>
> On Wed, Jun 11, 2025 at 6:33â¯PM Joaquin M López Muñoz via Boost
> <boost_at_[hidden]> wrote:
>
>
> Unlike fast_multiblock*, [multi]block is branchless, so your results
> make sense. I wonder if multiblock<K> should be made branchful as
> the working memory area can get quite large for large values of K.
>
> Can you elaborate why would you want to change multiblock?
> To save memory accesses that are unnecessary and slow down rest of the
> program
> because they occupy cache or purely to reduce number of instructions
> executed?
>
Well, both. Consider the following cost model:
* Succesful lookup executes in A*K cycles for some constant factor A.
* Branchless unsuccesful lookup is also A*K, exactly the same as succesful
lookup.
* Branchful unsuccesful lookup when the array is half full executes in
B*(1*0.5 + 2*0.25 + 3*0.125 + ... + K*2^-K-1) cycles . Since the sum
decays exponentially, this is roughly equivalent to some constant C
irrespective of K.
Now, the thing is that for a sufficiently large K, A*K (branchless) is
going to
be larger than C (branchful). The best approach would be then to have a
hybrid algorithm that branches every L operations. Determining such a
value is an interesting experiment.
> If that is the case on my CPU it does not seem the matter a lot, i.e.
> changing AVX2
> to be branchless does not help or hurt performance, maybe CPU anyway
> decides
> to fetch data...
>
> This is what I did to make avx code branchless, did not affect
> performance in a way
> I could detect:
> Â [...]
> found = found && check_m256ix2(x[i],hash,8);
This is short circuiting (at least in theory). You may try using bitwise
AND (&). Also,
try with large values of K: you should reach a point where branchless is
worse.
> But back to the multiblock and making it branchful:
> With my CPU/compiler(clang is ð for this benchmark compared to gcc)
> it outperforms
> fast_multiblock for L2, L3, RAM size filters when it comes to mixed
> lookups, so I am not
> certain it needs changing.
Yes, it's hard to decide based on these numbers. For branchless
lookup, mixed lookup obviously takes the same as successful or
unsuccesful lookup. For branchful lookup, a cost model for mixed
lookup would be
P*M*(A*K) + (1-P)*M*C
where P is the percentange of successful lookups and M is an
extra cost factor on top of the non-mixed approach due to branch
mispredictions. Again, an interesting area to investigate. You're giving
me a lot of work :-)
I'd rather release the lib now in time for Boost 1.89 and then postpone
this analysis for 1.90, as it's going to take some work.
Joaquin M Lopez Munoz