From: Ivan Matek (libbooze_at_[hidden])
Date: 2025-06-16 01:27:31


On Sun, Jun 15, 2025 at 6:45 PM Joaquin M López Muñoz via Boost <
boost_at_[hidden]> wrote:

> El 14/06/2025 a las 19:32, Ivan Matek escribió:
> Yes, this is odd. If your thesis is correct, the budget optimizer of the
> compiler seems
> to have decided to go branchless despite what the code says.
>

I was wrong: I just checked with the VTune asm view and the branch is definitely
there (if I am reading the asm correctly). This is for K==11,
unmodified (branchful) AVX2 code.
I will need to investigate further to respond to the other comments you made;
I just wanted to say asap that this speculation of mine was wrong.
I now believe the prefetch is the cause of the difference in performance: as you
can see, for K==11 we have 3 prefetches, and that is why we suffer in the
unsuccessful lookup case.

Address Source Line Assembly CPU Time Instructions Retired
0x6740 0 Block 7:
0x6740 0 xor %edx, %edx
0x6742 246 add %rdx, %rax
0x6745 0 add $0x4, %rcx
0x6749 0 cmp %rsi, %rcx
0x674c 246 jz 0x681c <Block 10>
0x6752 0 Block 8:
0x6752 0 movsxdl (%rcx), %rdx 0usec 117,450,000
0x6755 0 mulx %r12, %r9, %rdx 0usec 143,550,000
0x675a 0 xor %r9, %rdx 0usec 136,300,000
0x675d 0 or $0x1, %rdx 0usec 118,900,000
0x6761 0 mulx %rdi, %rdx, %r9 0usec 130,790,000

0x6766 0 prefetcht0z (%r8,%r9,1) 0usec 111,360,000
0x676b 0 prefetcht0z 0x40(%r8,%r9,1) 0usec 147,320,000
0x6771 0 prefetcht0z 0x80(%r8,%r9,1) 0usec 132,820,000
0x677a 0 vmovdquy (%r8,%r9,1), %ymm0 198.739usec 109,910,000
0x6780 0 vmovdquy 0x20(%r8,%r9,1), %ymm1 0usec 134,270,000
0x6787 0 vmovq %rdx, %xmm2 99.370usec 128,180,000
0x678c 0 vpbroadcastq %xmm2, %ymm2 0usec 122,670,000
0x6791 0 vpsllvq %ymm4, %ymm2, %ymm2 0usec 133,980,000
0x6796 0 vpsrld $0x1a, %ymm2, %ymm2 99.370usec 100,920,000
0x679b 0 vpmovzxdq %xmm2, %ymm3 0usec 163,850,000
0x67a0 0 vpsllvq %ymm3, %ymm5, %ymm3 0usec 117,450,000
0x67a5 0 vextracti128 $0x1, %ymm2, %xmm2 0usec 130,790,000
0x67ab 0 vpmovzxdq %xmm2, %ymm2 0usec 139,200,000
0x67b0 0 vptest %ymm3, %ymm0 0usec 97,440,000
0x67b5 0 setb %r10b 0usec 175,450,000
0x67b9 0 vpsllvq %ymm2, %ymm5, %ymm0 0usec 126,150,000
0x67be 0 vptest %ymm0, %ymm1 0usec 120,060,000
0x67c3 0 setb %r11b 0usec 122,960,000
0x67c7 0 test %r10b, %r11b 99.370usec 258,390,000

0x67ca 0 jz 0x6740 <Block 7> 99.370usec 0
0x67d0 0 Block 9:
0x67d0 0 add %r8, %r9 0usec 118,610,000
0x67d3 0 add $0x40, %r9 0usec 124,120,000
0x67d7 0 vmovdquy (%r9), %ymm0 0usec 131,660,000
0x67dc 0 mulx %r12, %rdx, %r9 0usec 131,080,000
0x67e1 0 xor %rdx, %r9 0usec 112,230,000
0x67e4 0 vmovq %r9, %xmm1 0usec 140,650,000
0x67e9 0 vpbroadcastq %xmm1, %xmm1 0usec 124,410,000
0x67ee 0 vpsllvq %xmm6, %xmm1, %xmm1 0usec 136,590,000
0x67f3 0 vpsrld $0x1a, %xmm1, %xmm1 0usec 109,910,000
0x67f8 0 vpmovzxdq %xmm1, %ymm1 0usec 134,850,000
0x67fd 0 vpsllvq %ymm1, %ymm7, %ymm1 0usec 118,030,000
0x6802 0 xor %edx, %edx 0usec 103,530,000
0x6804 0 vptest %ymm1, %ymm0 0usec 168,780,000
0x6809 0 setb %dl 0usec 96,860,000
0x680c 246 add %rdx, %rax 0usec 143,550,000
0x680f 0 add $0x4, %rcx 0usec 140,940,000
0x6813 0 cmp %rsi, %rcx
0x6816 246 jnz 0x6752 <Block 8> 0usec 285,360,000

If we prefetch 50% more data (K==8 prefetches only 2 cache lines, K==11
prefetches 3 cache lines) and we never use it (when we are benchmarking
unsuccessful lookups), performance suffers.
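To put rough numbers on it (my own back-of-envelope sketch, not library code;
I am assuming a fast_multiblock64<K> block is K 64-bit words and that blocks
are not cache-line aligned, so a block can straddle one extra line):

#include <cstddef>
#include <cstdio>

int main()
{
  constexpr std::size_t cacheline=64;
  for(std::size_t k:{8u,11u}){
    std::size_t block_bytes=k*8; // K 64-bit words
    // worst case number of lines an unaligned span of block_bytes can touch
    std::size_t lines=(block_bytes+cacheline-2)/cacheline+1;
    std::printf("K=%zu: %zu bytes, up to %zu cache lines\n",k,block_bytes,lines);
  }
}

This prints 2 lines for K==8 and 3 lines for K==11, matching the prefetch
counts in the asm above.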

So if I just change one line (commenting out the prefetch), unsuccessful lookup
performance changes drastically (presumably because 3M is in the area where
we are over the L2 cache size of my CPU).

3M
with prefetch
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 5.00679
FPR 0.159833
insertion_time 6.07411
successful_lookup_time 5.62967

unsuccessful_lookup_time 5.64557
mixed_lookup_time 11.9758

without prefetch

const unsigned char* next_element(boost::uint64_t& h)const noexcept
{
  auto p=ar.buckets+hs.next_position(h)*bucket_size;
  for(std::size_t i=0;i<prefetched_cachelines;++i){
    //BOOST_BLOOM_PREFETCH((unsigned char*)p+i*cacheline);
  }
  return p;
}

filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 5.00679
FPR 0.159833
insertion_time 6.02206
successful_lookup_time 5.03205

unsuccessful_lookup_time 3.58734
mixed_lookup_time 11.3664

For 1M there is no noticeable difference between prefetch and no prefetch (as
all data fits in L2). To confirm this is all about memory, I tried 35M, and
here again commenting out the prefetch helps with unsuccessful lookups.

35M
with prefetch:
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 58.4126
FPR 0.160926
insertion_time 17.7846
successful_lookup_time 15.6446

unsuccessful_lookup_time 14.7353
mixed_lookup_time 21.5229

without prefetch:
filter filter<int, 1ul, fast_multiblock64<11ul>, 1ul,
hash<int>, allocator<int>, mcg_and_fastrange>
capacity MB 58.4126
FPR 0.160926
insertion_time 18.0509
successful_lookup_time 13.8877

unsuccessful_lookup_time 11.4105
mixed_lookup_time 19.807
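For reference, scaling the reported capacity MB figures (my own arithmetic; I
am only assuming a per-core L2 somewhere in the ~1-2 MB range, not a specific
size) shows why 1M stays cache-resident while 3M and 35M do not:

#include <cstdio>

int main()
{
  // capacity MB reported above for fast_multiblock64<11>
  const double mb_3m=5.00679, mb_35m=58.4126;
  const double bytes_per_element=mb_3m*1024*1024/3000000; // ~1.75
  std::printf("bytes/element ~= %.2f\n",bytes_per_element);
  std::printf("1M  -> ~%.2f MB (roughly L2-sized)\n",
              bytes_per_element*1000000/(1024*1024));
  std::printf("3M  -> %.2f MB, 35M -> %.2f MB (well past L2)\n",mb_3m,mb_35m);
}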

In the end this is kind of obvious; I was so focused on codegen that I
forgot obvious things about caching.

Still, I do not think all these measurements were useless:
I can imagine a world in which you expect most lookups to be
unsuccessful, and then you disable prefetching except maybe for the initial
block (which is always needed).
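For example, something along these lines (just a sketch of the idea based on
next_element() above; SKIP_EXTRA_PREFETCH is a hypothetical knob, not an
existing boost::bloom option):

const unsigned char* next_element(boost::uint64_t& h)const noexcept
{
  auto p=ar.buckets+hs.next_position(h)*bucket_size;
  BOOST_BLOOM_PREFETCH((unsigned char*)p); // initial cache line, always needed
#if !defined(SKIP_EXTRA_PREFETCH) // hypothetical option
  for(std::size_t i=1;i<prefetched_cachelines;++i){
    BOOST_BLOOM_PREFETCH((unsigned char*)p+i*cacheline);
  }
#endif
  return p;
}

This would keep the successful-lookup behaviour for the first line while
trading the rest of the prefetching for the unsuccessful-lookup gains
measured above.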