From: Joerg Walter (jhr.walter_at_[hidden])
Date: 2003-05-15 16:57:38


Hi Csaba,

you wrote:

> Once I played with comparing the Intel Math kernel with some
> implementations I did.
> I found out that loop unrolling may double the speed.
> Still I could never quite get close to the speed of the Intel Math kernel
> (for large matrices), presumably due to insufficient caching.
> The above applies to row-major matrices.
> For column-major matrices loop unrolling achieved the same speed as the
> Intel Math kernel.

I've been playing with loop unrolling in the past, too (see
BOOST_UBLAS_USE_DUFF_DEVICE), but never found a satisfactory solution to
that performance problem. Compilers seem to be stressed enough already by
the templated code.
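For reference, here is a minimal sketch of the Duff's-device style of
unrolling that BOOST_UBLAS_USE_DUFF_DEVICE hints at (this is an
illustrative stand-alone function, not the uBLAS implementation): the loop
body is replicated four times and the switch jumps into the middle of the
loop to handle the remainder, so the loop branches only once per four
elements.

```cpp
#include <cstddef>

// Duff's-device unrolled inner product (illustrative sketch).
// The switch falls through into the do-while to dispatch the
// n % 4 leftover iterations, then the loop runs full blocks of 4.
double unrolled_dot(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    if (n == 0) return sum;
    std::size_t i = 0;
    std::size_t blocks = (n + 3) / 4;   // number of passes through the loop
    switch (n % 4) {
        case 0: do { sum += x[i] * y[i]; ++i;
        case 3:      sum += x[i] * y[i]; ++i;
        case 2:      sum += x[i] * y[i]; ++i;
        case 1:      sum += x[i] * y[i]; ++i;
                } while (--blocks > 0);
    }
    return sum;
}
```

Whether this beats a plain loop depends heavily on the compiler; modern
optimizers often unroll (and vectorize) the simple version on their own.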

> Below is the code if anyone wants to give it a try.
> Maybe ublas could make use of some performance optimizations..

For small matrices I'm still waiting for the first (or next? ;-) compiler
to vectorize inlined template code (ICC being the hottest candidate; I
never had a chance to check KAI). For larger matrices I've been playing
with some crude high level optimizations, see
high level optimizations, see

http://groups.yahoo.com/group/ublas-dev/message/461

I don't know if they're really useful.

> (better it should be connected to some optimized BLAS
> implementations..?)

Yep. Either low level (using explicit bindings) or high level (using
specialized evaluators). Both have been discussed in the past and are
still undecided.
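To make the "specialized evaluator" idea concrete, here is a hedged sketch
(the type and function names are illustrative, not the uBLAS API): a
generic prod() carries a naive fallback, and the body of that overload is
exactly the place where a binding would instead forward to an optimized
BLAS routine such as dgemm.

```cpp
#include <cstddef>
#include <vector>

// Illustrative dense row-major matrix; not the uBLAS matrix class.
struct DenseMatrix {
    std::size_t rows, cols;
    std::vector<double> data;   // row-major storage
    double& operator()(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double  operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Fallback evaluator: naive O(n^3) product. A low-level binding would
// replace this body with a call into an optimized BLAS (e.g. dgemm).
// The k loop sits in the middle so both a and c are walked row-major.
DenseMatrix prod(const DenseMatrix& a, const DenseMatrix& b) {
    DenseMatrix c{a.rows, b.cols, std::vector<double>(a.rows * b.cols, 0.0)};
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                c(i, j) += a(i, k) * b(k, j);
    return c;
}
```

The high-level variant would keep this signature and dispatch on the
matrix type, so user code never sees whether the naive loop or the BLAS
binding ran.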

Thanks,

Joerg