From: Frank Astier (fastier_at_[hidden])
Date: 2005-09-21 18:53:31
Ian -
Thanks for the correction. It got rid of the bizarre effect that
ublas seemed to always do the same amount of work, regardless of
sparsity :-)
I implemented my own sparse matrix too, optimized for multiplication
with dense vectors (I didn't even write random access to elements, I
don't need it!). Below 10% non-zeros, I get at least 3 times better
than the naive dense multiplication, and up to 40 times better than
ublas on small matrix sizes (around 256x256). The matrices I handle
don't have any structure I can take advantage of, unfortunately.
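Roughly, the layout is per-row arrays of non-zero values and their
column indices, so the multiply is one tight loop per row. A
simplified sketch (not my exact code, names invented):

    #include <cstddef>
    #include <vector>

    // One row: non-zero values and their column indices, stored contiguously.
    struct Row {
        std::vector<float> val;
        std::vector<std::size_t> col;
    };

    // y = M * x, with M stored as one Row per matrix row.
    void multiply(const std::vector<Row>& m, const float* x, float* y) {
        for (std::size_t i = 0; i < m.size(); ++i) {
            float sum = 0.0f;   // accumulate in a register
            for (std::size_t k = 0; k < m[i].val.size(); ++k)
                sum += m[i].val[k] * x[m[i].col[k]];
            y[i] = sum;
        }
    }
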
I wish the ublas documentation had enough details about what to expect.
Frank
On 2005/09/21, at 15:26, Ian McCulloch wrote:
> Frank Astier wrote:
>
>
>> I'm still trying to multiply a sparse matrix by a vector efficiently,
>> comparing ublas to just looping over the whole matrix.
>>
>> For a square matrix of size 256x256 of floats with % non-zeros
>> between 10 and 90%, ublas does not seem to be efficient compared to
>> hand-crafted code that just carries out the naive matrix
>> multiplication. axpy_prod performs worse than prod. sparse_matrix in
>> 1_32 performs 3 times worse than compressed_matrix. I was hoping
>> ublas would be efficient even with matrices of that size, and not
>> that far off from "naive" code speed.
>>
>
> Firstly, your matrix is ALWAYS FULL, it is not sparse.  Your
> initialization code is:
>
>     for (size_t i = 0; i < nrows; ++i)
>       for (size_t j = 0; j < ncols; ++j)
>         m(i, j) = vm[i][j] = float(rand()%10)/10.0 < pctNZ ? 1.0 : 0.0;
>
> Setting an element of a sparse matrix to zero really does insert an
> element that is zero; it is not a no-operation (as it would be in
> Matlab, for example).
>
> You should change the last line to
>
>          if (float(rand()%10)/10.0 < pctNZ) m(i,j) = 1.0;
>
> It might be worth re-trying axpy_prod with this in mind.  However, the
> usual prod() in ublas is still slow for these fillings:
>
> Linux 2.6.12-gentoo-r4 x86_64 AMD Athlon(tm) 64 Processor 3200+
>
> fill prod dense
> 0.1 2.63 0.16
> 0.2 4.27 0.16
> 0.3 5.46 0.16
> 0.4 6.26 0.16
> 0.5 6.67 0.16
> 0.6 7.05 0.17
> 0.7 7.18 0.16
> 0.8 7.19 0.16
> 0.9 7.15 0.16
>
> (boost 1.32)
>
> However, a matrix size of 256x256 is too small nowadays to get good
> performance from sparse algorithms - or rather, modern processors are
> very fast at in-cache sequential accesses and penalize anything else
> very heavily.  Any time you have indirect addressing you will lose
> badly, and it is very easy to stall the pipeline (eg, if the compiler
> cannot guarantee there is no aliasing on the left hand side, it might
> pessimistically sequence the operations so that it updates the left
> hand side before loading the next right hand side element).
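>
> To illustrate the aliasing point: accumulating into a local variable
> instead of into y[i] directly lets the compiler keep the sum in a
> register whether or not it can prove y does not alias a or x.  A
> sketch (not the code from my benchmark):
>
>     #include <cstddef>
>
>     void matvec(const float* a, const float* x, float* y, std::size_t n) {
>         for (std::size_t i = 0; i < n; ++i) {
>             float sum = 0.0f;               // local accumulator, aliasing-proof
>             for (std::size_t j = 0; j < n; ++j)
>                 sum += a[i*n + j] * x[j];   // sequential, cache-friendly reads
>             y[i] = sum;                     // one store per row
>         }
>     }
>
> Writing y[i] += a[i*n + j] * x[j] in the inner loop instead can force
> a store and reload on every iteration.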
>
> Also, a filling of 0.1 is not very sparse.  The difference is probably
> the cost of the indirect access (ie, my *very rough* guess would be
> that an optimized sparse multiply with 10% non-zeros would be about
> the same speed as an optimized dense multiply).  For a sparse matrix
> implemented as a map or list structure, you will not get any speed
> improvement for matrix-vector multiply until the filling is down
> around 0.001.
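>
> The problem with a map-based layout is that every element visited is a
> separate tree node: a pointer chase and a likely cache miss per
> non-zero.  A sketch of what that multiply looks like (names invented,
> not my class):
>
>     #include <cstddef>
>     #include <map>
>     #include <vector>
>
>     typedef std::vector<std::map<std::size_t, float> > MapMatrix;
>
>     // y = M * x; each ++it follows a pointer to another tree node.
>     void multiply(const MapMatrix& m, const float* x, float* y) {
>         for (std::size_t i = 0; i < m.size(); ++i) {
>             float sum = 0.0f;
>             for (std::map<std::size_t, float>::const_iterator it =
>                      m[i].begin(); it != m[i].end(); ++it)
>                 sum += it->second * x[it->first];
>             y[i] = sum;
>         }
>     }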
>
> Using one of my sparse matrix classes (implemented as a vector of
> std::map, same as uBLAS I think (?) ), I get
>
> fill prod dense
> 0.1 0.79 0.16
> 0.2 1.82 0.16
> 0.3 2.62 0.16
> 0.4 3.21 0.17
> 0.5 3.55 0.16
> 0.6 3.73 0.16
> 0.7 3.83 0.16
> 0.8 3.91 0.16
> 0.9 3.89 0.16
>
> Still very bad compared with the dense case.  To optimize
> matrix-vector multiplies, you need a much better sparse format, such
> as Sparse BLAS CSR.  In that case, inserting and removing elements
> are expensive, but matrix-vector multiplies are cheap (though they
> still involve indirect addressing).  I think there is no such storage
> format in uBLAS?  But it would be interesting to compare Sparse BLAS
> performance in this case.
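>
> For reference, CSR packs all the non-zeros into a single value array
> plus a column-index array, with a row-pointer array marking where each
> row starts.  The multiply then streams through contiguous memory, with
> only the read of x being indirect - a sketch, with invented names:
>
>     // y = A * x, with A in CSR form: row[i]..row[i+1] delimits row i.
>     void csr_matvec(int n, const float* val, const int* col,
>                     const int* row, const float* x, float* y) {
>         for (int i = 0; i < n; ++i) {
>             float sum = 0.0f;
>             for (int k = row[i]; k < row[i+1]; ++k)
>                 sum += val[k] * x[col[k]];   // one indirect read per non-zero
>             y[i] = sum;
>         }
>     }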
>
> Even moving to 2560x2560 matrices, the performance is still poor:
>
> fill prod dense
> 0.1 292 22
> 0.2 440 21
> 0.3 562 20
> 0.4 632 21
> 0.5 678 21
> 0.6 714 21
> 0.7 731 21
> 0.8 734 21
> 0.9 725 21
>
> The uBLAS result is at least a factor of 2 slower than it could be,
> but even so the dense performance blows it away.  Interestingly, the
> naive vector-of-vectors dense matrix has almost exactly the same
> performance as sgemv from BLAS (acml-2.5.0) - for this matrix size it
> is completely memory-bandwidth limited and further optimizing the
> dense matrix-vector multiply is pointless.  For matrices that fit in
> the L1 cache, sgemv is twice as fast (or more) than the naive version.
>
> In summary: if you want good sparse matrix-vector multiply
> performance, you need to use a data structure optimized for access,
> rather than insertion/removal.  It would be very interesting to
> compare with sparse BLAS performance: http://math.nist.gov/spblas/
>
> Cheers,
> Ian