From: Toon Knapen (toon.knapen_at_[hidden])
Date: 2002-06-27 19:36:35


On Thursday 27 June 2002 11:47, Martin Weiser wrote:
> Dear all,
>
> without giving a full review report (I haven't yet used uBLAS sufficiently
> to post a meaningful statement), I'd like to draw the attention to two
> points I haven't read so far in the discussion.
>
> [ If you're familiar with cache blocking techniques for linear algebra,
> you may wish to skip right to the end of the posting. ]
>
> Although the abstraction penalty is mostly insignificant compared to
> equivalent handwritten code, there remains a huge performance gap to
> optimized numerical kernels. I've done a preliminary comparison of dgemm
> performance with different matrix sizes N (see
> http://www.zib.de/weiser/ublas_review.gif), measuring the flops/s
> multiplying two square matrices of size N (N=1,...,1000). The competitors
> are a vendor optimized BLAS implementation (Sun Performance Library) and
> a straightforward naive "C" implementation. The environment is GCC 3.1
> (options -O3 -funroll-loops) on UltraSparc 10/300.

Graphs like this would be useful in the uBLAS documentation!
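For reference, the straightforward naive "C" competitor is presumably just the
textbook triple loop, something like the sketch below (I haven't seen the
actual benchmark code, so the details are assumed):

    // Naive n x n matrix multiply, c = a * b, row-major contiguous
    // storage. Roughly the kind of baseline described above; details assumed.
    void naive_gemm(int n, const double* a, const double* b, double* c)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
            {
                double sum = 0.0;
                for (int k = 0; k < n; ++k)
                    sum += a[i * n + k] * b[k * n + j];
                c[i * n + j] = sum;
            }
    }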

I also noticed this performance drop. It was to be expected, since uBLAS does
not contain blocking strategies, i.e. cache tiling of the loops. To obtain
maximal performance, two different approaches were discussed about 8 months
ago (also involving Andrew Lumsdaine); both are described below.
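By "blocking" I mean cache tiling, roughly like the following sketch (my own
illustration, not uBLAS code; the tile size 64 is a guess for illustration,
real implementations tune it per machine):

    #include <algorithm>

    // Cache-blocked multiply, c += a * b for square n x n row-major
    // matrices (c must be zeroed first). BS is chosen so that three
    // BS x BS tiles fit in cache; 64 is illustrative, not tuned.
    const int BS = 64;

    void blocked_gemm(int n, const double* a, const double* b, double* c)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                {
                    const int imax = std::min(ii + BS, n);
                    const int kmax = std::min(kk + BS, n);
                    const int jmax = std::min(jj + BS, n);
                    for (int i = ii; i < imax; ++i)
                        for (int k = kk; k < kmax; ++k)
                        {
                            const double aik = a[i * n + k];
                            for (int j = jj; j < jmax; ++j)
                                c[i * n + j] += aik * b[k * n + j];
                        }
                }
    }

The tiles stay resident in cache across the inner loops; the vendor kernels
do essentially this, but hand-tuned in assembler for each machine.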

1) Bindings to BLAS
Most high-performance computer vendors put a lot of effort into optimising
BLAS on their machines, writing the kernels directly in assembler. I don't
think any compiler will ever be able to match this performance (first of all,
a compiler does not know about the different cache sizes that can be
exploited). This is at least a short-term solution.
I've also discussed this with Joerg and, as you suggest if I understand you
correctly, it is very hard to detect within an expression the sub-expressions
for which a BLAS call could be made. So the user should state explicitly
whether he wants a blas-gemm or a ublas-prod.
I started writing such bindings yesterday; you can find my preliminary
bindings in the attachment.
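To give a flavour of what such a binding looks like, here is a stripped-down
sketch (the attachment contains the actual code; the wrapper name blas_gemm
and the details here are mine, and I assume the usual underscored Fortran
BLAS symbol):

    #include <boost/numeric/ublas/matrix.hpp>

    // Fortran BLAS entry point (underscored symbol assumed).
    extern "C" void dgemm_(const char* transa, const char* transb,
                           const int* m, const int* n, const int* k,
                           const double* alpha,
                           const double* a, const int* lda,
                           const double* b, const int* ldb,
                           const double* beta,
                           double* c, const int* ldc);

    typedef boost::numeric::ublas::matrix<
        double, boost::numeric::ublas::column_major> matrix;

    // c = a * b via BLAS; the user calls this explicitly instead of
    // prod(). c must already be sized a.size1() x b.size2().
    void blas_gemm(const matrix& a, const matrix& b, matrix& c)
    {
        const int m = a.size1(), n = b.size2(), k = a.size2();
        const double one = 1.0, zero = 0.0;
        dgemm_("N", "N", &m, &n, &k,
               &one, &a.data()[0], &m,
                     &b.data()[0], &k,
               &zero, &c.data()[0], &m);
    }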

2) Optimising the uBLAS implementation
The review is mostly about the interface; optimising the C++ implementation
for the different compilers should be looked at in a following stage, IMHO.
This is very tricky and very subtle (just look at Jeremy's thesis, which
describes the performance differences between using an '==' or a '<'
condition in for loops), and the tuning needs to be updated every time a new
compiler is released.
This is a long-term solution.
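If I remember the point from the thesis correctly, it is about details as
small as the loop termination test, e.g.:

    // Two spellings of the same loop over [0, n); some compilers
    // optimise one form better than the other, which is exactly the
    // kind of per-compiler subtlety meant above.
    void scale_lt(int n, double* x, double s)
    {
        for (int i = 0; i < n; ++i)   // relational test
            x[i] *= s;
    }

    void scale_ne(int n, double* x, double s)
    {
        for (int i = 0; i != n; ++i)  // equality test
            x[i] *= s;
    }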
I do think it can really coexist with the BLAS calls: for small matrices it's
easier, and, if the uBLAS implementation is heavily optimised, long
expressions could be quicker (due to loop fusion) than several calls to BLAS
(which require creating temporaries). For instance, when multiplying 3
matrices you need two gemm calls and a temporary to store the result of the
first gemm; here an optimised uBLAS should be able to outpace BLAS (please
correct me if I'm wrong).
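Concretely, with explicit calls the three-matrix product needs a named
temporary (a sketch; the function name is mine):

    #include <boost/numeric/ublas/matrix.hpp>

    using boost::numeric::ublas::matrix;
    using boost::numeric::ublas::prod;

    // r = a * b * c with explicit products: two gemm-style calls and a
    // temporary t. A heavily optimised expression-template library could
    // in principle evaluate a * b * c without materialising t at all.
    matrix<double> abc(const matrix<double>& a,
                       const matrix<double>& b,
                       const matrix<double>& c)
    {
        matrix<double> t = prod(b, c);  // first product -> temporary
        return prod(a, t);              // second product
    }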

toon