$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
Subject: [boost] Accelerating algorithms with SIMD - Segmented iterators and alternatives
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-10-11 11:19:13
Hi,
I have, along with Joel Falcou, been looking a bit at how we can 
integrate SIMD into iterator-based algorithms to accelerate them.
Unfortunately, there are a couple of issues.
The basic idea is to provide a range adaptor that adapts of range of 
PODs, where N consecutive PODs are stored contiguously, into a range of 
vectors (or "packs") of those PODs of size N.
Therefore, instead of:
std::vector<float> tab(20);
float sum = boost::accumulate(tab, 0);
we can write something like:
std::vector< float, simd::allocator<float> > tab(20);
float sum = simd::sum(
   boost::accumulate(
     simd::packed_range<4>(tab),
     simd::zero_
   )
);
Which does the same but using SSE instructions, adding four floats at a 
time, and typically providing a 4x speed-up.
This can be applied to have the same result for any binary operation 
which is commutative and associative (which unfortunately isn't the case 
of many operations, but can still be useful).
Here, however, it works because 1) the memory is well aligned thanks to 
the use of our custom allocator and 2) the size, 20, is a multiple of 
the pack size, 4.
We'd be interested in providing a mechanism where we could also support 
ranges of data aligned at an arbitrary address and of an arbitrary size. 
The SSE would only be used in the middle, the borders being handled 
element per element.
A first solution consists in padding the beginning and the end with zeroes.
Say you have 1 2 3 4 5 6 7
with only 3 being aligned well enough for SSE usage.
so we'd turn that into | 0 0 1 2 | 3 4 5 6 | 7 0 0 0 |.
However, it is very important that the code to deal with borders does 
not affect the efficiency of the code in the middle. This is not 
possible with the normal iterator/range design.
Our data is clearly cut into three segments: the beginning that is not 
aligned, the data that is rightly aligned and sized, and the data at the 
end that is not big enough to fill a whole pack.
Therefore, segmented iterators could provide us with a solution to do 
this efficiently; but so could other solutions as well.
Wouldn't it be better to be able to use the mechanism so that you could 
deal with the borders in an element-per-element way while you could deal 
with the middle bit in a vector-per-vector way?
With C++ iterators, you pull data, but if we could use an iteration 
paradigm where you push data instead, such as calling a function object 
for each element, we could call an overloaded function that would be 
able to accept both a single element and a pack.
Such a solution would also provide the speed benefits of segmented 
iterators, and all high-level standard algorithms can be expressed with 
this iteration paradigm just as well as with iterators.
I have already suggested such an alternative to segmented iterators 
(where I had a few misconceptions about segmented iterators, I thought 
they broke compatibility, but they don't, they still maintain the legacy 
non-segmented interface) there:
<http://thread.gmane.org/gmane.comp.lib.boost.user/61636/focus=61647>,
with an example of how you could use that to iterate over a binary tree 
there:
<http://www.boostpro.com/vault/index.php?action=downloadfile&filename=generator_iterator.zip&directory=>
Converting from those things to a pair of iterators is possible through 
the use of a coroutine, as demonstrated in the example.
I would like to have feedback on the general idea of providing tools to 
use SIMD with iterators and related algorithms, and ideas about whether 
there is sufficient interest in the alternative to iterators for me to 
develop them.
The intent would be to submit both the alternative to iterators and the 
SIMD helpers to Boost.