$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
Subject: Re: [boost] interest in structure of arrays container?
From: Michael Marcin (mike.marcin_at_[hidden])
Date: 2016-10-27 02:10:07
On 10/26/2016 1:33 PM, Chris Glover wrote:
>>
>>  I guess some optimisation from way yonder (something modern compilers do
>> routinely, even on a Monday morning!)... but more than probable irrelevant
>> nowadays...
>>
>> degski
>>
>
> I might be pessimistic, but I never trust the compiler and generally check
> what's being output. In this case, FWIW, on MSVC2015, the bit-twiddling
> version generates faster code than the mod version -- about 25% faster. I
> didn't test gcc or clang.
>
> Using google benchmark:
>
> Code:
>
> static void AlignedMod(benchmark::State& state)
> {
>     while (state.KeepRunning())
>     {
>         for(int i = state.range_x(); i < 128; i += state.range_y())
>         {
>             bool aligned = (i % 16) == 0;
>             benchmark::DoNotOptimize(aligned);
>         }
>     }
> }
> BENCHMARK(AlignedMod)->ArgPair(1, 1);
>
> static void AlignedAnd(benchmark::State& state)
> {
>     while (state.KeepRunning())
>     {
>         for(int i = state.range_x(); i < 128; i += state.range_y())
>         {
>             bool aligned = ((i - 1) & 15) == 0;
>             benchmark::DoNotOptimize(aligned);
>         }
>     }
> }
> BENCHMARK(AlignedAnd)->ArgPair(1, 1);
>
> Generated code of the inner loop:
>
> Mod version:
> mov         eax,ebx
> and         eax,8000000Fh
> jge         AlignedMod+50h
> dec         eax
> or          eax,0FFFFFFF0h
> inc         eax
> test        eax,eax
> lea         rcx,[aligned]
> sete        byte ptr [aligned]
> call        07FF73B84A180h
> add         ebx,dword ptr [rdi+1Ch]
> cmp         ebx,80h
> jl          AlignedMod+40h
>
>
> And version:
> lea         eax,[rbx-1]
> test        al,0Fh
> lea         rcx,[aligned]
> sete        byte ptr [aligned]
> call        07FF73B84A180h
> add         ebx,dword ptr [rdi+1Ch]
> cmp         ebx,80h
> jl          AlignedAnd+40h
>
>
> Result:
> Benchmark                                   Time           CPU Iterations
> -------------------------------------------------------------------------
> AlignedMod/1/1                            204 ns        203 ns    4072727
> AlignedAnd/1/1                            153 ns        154 ns    4977778
>
It's very dangerous to draw conclusions from benchmarks like this.
The operation and elapsed time are way too small for the results to be 
distinguishable from noise on the system.
AlignedAnd has a random '- 1' in it so it's not even checking if i is 16 
byte aligned.
AlignedMod is taking the mod of a signed integer. Make it use an 
unsigned integer and some very familiar code gets generated.
         for ( unsigned i = state.range_x(); i < 128; i += state.range_y() )
00007FF7ACBA316C  mov         ebx,dword ptr [rdi+14h]
00007FF7ACBA316F  cmp         ebx,80h
00007FF7ACBA3175  jae         AlignedMod+12h (07FF7ACBA3132h)
00007FF7ACBA3177  nop         word ptr [rax+rax]
         {
             bool aligned = (i % 16) == 0;
00007FF7ACBA3180  test        bl,0Fh
             benchmark::DoNotOptimize( aligned );
00007FF7ACBA3183  lea         rcx,[aligned]
         {
             bool aligned = (i % 16) == 0;
00007FF7ACBA3188  sete        byte ptr [aligned]
             benchmark::DoNotOptimize( aligned );
00007FF7ACBA318D  call        benchmark::internal::UseCharPointer 
(07FF7ACBA1019h)
     {
         for ( unsigned i = state.range_x(); i < 128; i += state.range_y() )
00007FF7ACBA3192  add         ebx,dword ptr [rdi+1Ch]
00007FF7ACBA3195  cmp         ebx,80h
00007FF7ACBA319B  jb          AlignedMod+60h (07FF7ACBA3180h)
         }
     }
00007FF7ACBA319D  jmp         AlignedMod+12h (07FF7ACBA3132h)
}
00007FF7ACBA319F  mov         rbx,qword ptr [rsp+38h]
00007FF7ACBA31A4  mov         rsi,qword ptr [rsp+40h]
00007FF7ACBA31A9  add         rsp,20h
00007FF7ACBA31AD  pop         rdi
00007FF7ACBA31AE  ret
I don't agree that it is more obscure. I think it's perfectly natural to 
understand that the bottom 4 bits must be zero in order to be multiple 
of 16.
I recommend to everyone the book Hacker's Delight which will make you 
truly appreciate how wonderful integers are.