Subject: Re: [boost] interest in structure of arrays container?
From: Andreas Schäfer (gentryx_at_[hidden])
Date: 2016-10-26 04:23:02


On 22:13 Tue 25 Oct, Michael Marcin wrote:
> On 10/25/2016 12:22 PM, Larry Evans wrote:
> >
> > Hmmm. I didn't realize you'd have to run the benchmark
> > several times to get stable results. I guess that reflects
> > my ignorance of how benchmarks should be run.
>
> The code was just a quick example hacked up to show large difference
> between different techniques.
>
> If you want to compare similar techniques you'll need a more robust
> benchmark.
>
> It would be easy to convert it to use:
> https://github.com/google/benchmark
>
> Which is quite good.

When doing performance measurements you have to take into account the
most common sources of noise:

1. Other processes might eat up CPU time or memory bandwidth.

2. The OS might decide to move your benchmark from one core to
   another, so you're losing all L1+L2 cache entries. (Solution:
   thread pinning; see the sketch after this list.)

3. Thermal conditions and thermal inertia may affect if/when the CPU
   increases its clock speed. (Solution: either disable turbo mode or
   run the benchmark long enough to even out the thermal
   fluctuations.)
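
For item 2, a minimal Linux-only sketch of thread pinning via
pthread_setaffinity_np (the function name pin_to_core is mine; this
assumes glibc, where g++ defines _GNU_SOURCE by default):

  #include <pthread.h>
  #include <sched.h>

  // Pin the calling thread to one core so the scheduler can't
  // migrate it mid-benchmark and throw away the L1/L2 caches.
  void pin_to_core(int core)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
  }

For item 3, on Intel systems running the intel_pstate driver, turbo
can be disabled via /sys/devices/system/cpu/intel_pstate/no_turbo.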

AFAIK Google Benchmark doesn't do thread pinning and cannot control
turbo mode. LIKWID ( https://github.com/RRZE-HPC/likwid ) can be used
to set clock frequencies and pin threads, and can read the performance
counters of the CPU. It might be a good idea to use Google Benchmark
and LIKWID together.
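
To sketch what that combination might look like (BM_update and the
binary name are placeholders, not from the original benchmark):

  #include <benchmark/benchmark.h>

  static void BM_update(benchmark::State& state)
  {
      while (state.KeepRunning()) {
          // run one update step of the particle system here
      }
  }
  BENCHMARK(BM_update);
  BENCHMARK_MAIN();

and then run the resulting binary pinned to a fixed core:

  likwid-pin -c 0 ./soa_compare

(Exact likwid-pin options may differ between LIKWID versions.)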

> > Could you explain how running a couple of times achieves
> > stable results (actually, on some occasions, I've run the
> > benchmark and got completely unexpected results; I suspect
> > it was because some application daemon was stealing cycles
> > from the benchmark, leading to the unexpected results).
> >
> >> Interestingly your SSE code is ~13% faster than the
> >> LibFlatArray code for large particle counts.
> >
> > Actually, the SSE code was the OP's.
> >
>
> Actually it originates from:
>
> https://software.intel.com/en-us/articles/creating-a-particle-system-with-streaming-simd-extensions

Ah, thanks for the info.

> > From the above, the LibFlatArray and SSE methods are the
> > fastest. I'd guess that a new "SoA block SSE" method, which
> > uses the _mm_* methods, would narrow the difference. I'll
> > try to figure out how to do that. I notice:
> >
> > #include <mmintrin.h>
> >
> > doesn't produce a compile error; however, that #include
> > doesn't have the _mm_add_ps used here:
> >
> > https://github.com/cppljevans/soa/blob/master/soa_compare.benchmark.cpp#L621
> >
> >
> > Do you know of some package I could install on my ubuntu OS
> > that makes those SSE functions, such as _mm_add_ps,
> > available?
> >
> > [snip]
>
> If you're using gcc I think the header is <xmmintrin.h>.

Which header you need should not depend on the compiler, but on the
CPU model. Or rather: on the vector ISA the CPU supports:

  http://stackoverflow.com/questions/11228855/header-files-for-simd-intrinsics
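
For SSE, <xmmintrin.h> (or the umbrella header <immintrin.h>)
declares _mm_add_ps; <mmintrin.h> only covers the old MMX integer
intrinsics, which is why your #include compiled but didn't provide
the function. A minimal, self-contained example:

  #include <xmmintrin.h> // SSE: __m128, _mm_set_ps, _mm_add_ps, ...

  int main()
  {
      __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
      __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
      __m128 c = _mm_add_ps(a, b); // element-wise float addition
      float out[4];
      _mm_storeu_ps(out, c);       // out = {6, 8, 10, 12}
      return 0;
  }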

Cheers
-Andreas

-- 
==========================================================
Andreas Schäfer
HPC and Supercomputing
Institute for Multiscale Simulation
Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany
+49 9131 85-20866
PGP/GPG key via keyserver
http://www.libgeodecomp.org
==========================================================