Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-02-10 16:08:31
"doctool" finally completed to create a page which contains some benchmarks for
the multithreaded LU-factorization 
http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session7/page01.html
It also contains a plot that shows the potential of the recursive LU-factorization:
http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session7/page01.html#toc3
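
Just to make explicit what "recursive" means here: without pivoting, the idea boils
down to factoring the top-left block, doing two triangular solves, updating the
trailing block with a GEMM and recursing on it.  An untested toy sketch (not the
implementation behind the plots; a tuned version would of course do the triangular
solves and the trailing update with blocked TRSM/GEMM instead of naive triple loops):

//---------------------------------------------------------------------------------
#include <algorithm>
#include <cstddef>

// Unblocked LU (no pivoting) of the leading n x n block of A (row-major,
// leading dimension lda); L (unit diagonal) and U are stored in place.
template <typename T>
void
lu_unblocked(std::size_t n, T *A, std::size_t lda)
{
    for (std::size_t j=0; j<n; ++j) {
        for (std::size_t i=j+1; i<n; ++i) {
            A[i*lda+j] /= A[j*lda+j];
            for (std::size_t k=j+1; k<n; ++k) {
                A[i*lda+k] -= A[i*lda+j]*A[j*lda+k];
            }
        }
    }
}

// Recursive LU (no pivoting): factor A11, solve L11*U12 = A12 and L21*U11 = A21,
// update A22 -= L21*U12 (the GEMM that dominates the runtime and where the
// micro-kernel pays off), then recurse on A22.
template <typename T>
void
lu_recursive(std::size_t n, T *A, std::size_t lda, std::size_t bs=32)
{
    if (n<=std::max<std::size_t>(bs, 1)) {
        lu_unblocked(n, A, lda);
        return;
    }
    std::size_t n1 = n/2;
    std::size_t n2 = n - n1;
    T *A11 = A;
    T *A12 = A + n1;
    T *A21 = A + n1*lda;
    T *A22 = A + n1*lda + n1;

    lu_recursive(n1, A11, lda, bs);             // A11 = L11*U11

    for (std::size_t i=0; i<n1; ++i) {          // A12 := L11^{-1}*A12
        for (std::size_t k=0; k<i; ++k) {
            for (std::size_t j=0; j<n2; ++j) {
                A12[i*lda+j] -= A11[i*lda+k]*A12[k*lda+j];
            }
        }
    }
    for (std::size_t j=0; j<n1; ++j) {          // A21 := A21*U11^{-1}
        for (std::size_t i=0; i<n2; ++i) {
            for (std::size_t k=0; k<j; ++k) {
                A21[i*lda+j] -= A21[i*lda+k]*A11[k*lda+j];
            }
            A21[i*lda+j] /= A11[j*lda+j];
        }
    }
    for (std::size_t i=0; i<n2; ++i) {          // A22 := A22 - A21*A12
        for (std::size_t k=0; k<n1; ++k) {
            for (std::size_t j=0; j<n2; ++j) {
                A22[i*lda+j] -= A21[i*lda+k]*A12[k*lda+j];
            }
        }
    }

    lu_recursive(n2, A22, lda, bs);
}
//---------------------------------------------------------------------------------
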
On 02 Feb 2016, at 23:26, Michael Lehn <michael.lehn_at_[hidden]> wrote:
> I also started to write down some notes on the rationale behind the micro-kernel:
> 
> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session6/page01.html
> 
> 
> On 02 Feb 2016, at 21:43, Michael Lehn <michael.lehn_at_[hidden]> wrote:
> 
>> (So the pages are re-done.)
>> 
>> 
>> 
>> The bug is that NR is BlockSize<T>::NR divided by the vector length.  So this
>> should work:
>> 
>> //---------------------------------------------------------------------------------
>> template <typename Index, typename T>
>> typename std::enable_if<std::is_floating_point<T>::value,
>> void>::type
>> ugemm(Index kc, T alpha,
>>       const T *A, const T *B,
>>       T beta,
>>       T *C, Index incRowC, Index incColC)
>> {
>>    A = (const T*) __builtin_assume_aligned (A, 128);
>>    B = (const T*) __builtin_assume_aligned (B, 128);
>>    static const unsigned div = 32/sizeof(T);      // elements of T per 256-bit vector
>>    static const Index MR = BlockSize<T>::MR;
>>    static const Index NR = BlockSize<T>::NR/div;  // NR counted in vectors, not scalars
>> 
>>    typedef T vx __attribute__((vector_size (32)));
>> 
>>    vx P[MR*NR] __attribute__ ((aligned (128))) = {};  // MR x NR micro-tile accumulator
>>    const vx *B_ = (const vx *)B;                       // packed B, viewed as vectors
>>    for (Index l=0; l<kc; ++l) {
>>        for (Index i=0; i<MR; ++i) {
>>            for (Index j=0; j<(NR); ++j) {
>>                P[i * NR + j] += A[i]*B_[j];
>>            }
>>        }
>>        A += MR;
>>        B_ += NR;
>>    }
>> 
>>    // write back the micro-tile: C = beta*C + alpha*P
>>    T *P_ = (T *)P;
>>    for (Index j=0; j<NR*div; ++j) {
>>        for (Index i=0; i<MR; ++i) {
>>            C[i*incRowC+j*incColC] *= beta;
>>            C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
>>        }
>>    }
>> }
>> //---------------------------------------------------------------------------------
>> 
>> However, as Karl pointed out, the update should also treat the case beta==0
>> as a special case.
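>> 
>> For the update above that would look roughly like this (untested):
>> 
>> //---------------------------------------------------------------------------------
>>    T *P_ = (T *)P;
>>    if (beta==T(0)) {
>>        // with beta==0, C must not be read at all (it might be uninitialized)
>>        for (Index j=0; j<NR*div; ++j) {
>>            for (Index i=0; i<MR; ++i) {
>>                C[i*incRowC+j*incColC] = alpha*P_[i*NR*div+j];
>>            }
>>        }
>>    } else {
>>        for (Index j=0; j<NR*div; ++j) {
>>            for (Index i=0; i<MR; ++i) {
>>                C[i*incRowC+j*incColC] *= beta;
>>                C[i*incRowC+j*incColC] += alpha*P_[i*NR*div+j];
>>            }
>>        }
>>    }
>> //---------------------------------------------------------------------------------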
>> 
>> But I don't get any speedup.  What mainly matters for performance is
>> the loop
>> 
>>    for (Index l=0; l<kc; ++l) {
>>        for (Index i=0; i<MR; ++i) {
>>            for (Index j=0; j<(NR); ++j) {
>>                P[i * NR + j] += A[i]*B_[j];
>>            }
>>        }
>>        A += MR;
>>        B_ += NR;
>>    }
>> 
>> From my experience, only once this loop is optimized to the extreme can a little
>> bit of extra performance be gained by treating some special cases in
>> the update
>> 
>>    T *P_ = (T *)P;
>>    for (Index j=0; j<NR*div; ++j) {
>>        for (Index i=0; i<MR; ++i) {
>>            C[i*incRowC+j*incColC] *= beta;
>>            C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
>>        }
>>    }
>> 
>> The main reason my code looks weird is that I also want to use it
>> for teaching.  The way it is coded, you can easily map it back and
>> forth to the assembly code.
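>> 
>> For double with AVX and FMA, for instance, one k-iteration of that inner loop is
>> just two aligned loads, a broadcast per row and a few fused multiply-adds.
>> Roughly (untested, assuming MR=4 and NR=8 for double, i.e. two vectors of four
>> doubles per row; compile with -mfma):
>> 
>> //---------------------------------------------------------------------------------
>> #include <immintrin.h>
>> 
>> // one k-iteration of P[i*NR+j] += A[i]*B_[j] for double, MR=4, NR=2 vectors
>> static inline void
>> ugemm_step(const double *A, const double *B, __m256d P[4][2])
>> {
>>     __m256d b0 = _mm256_load_pd(B);                  // B_[0] (packed, 32-byte aligned)
>>     __m256d b1 = _mm256_load_pd(B+4);                // B_[1]
>>     for (int i=0; i<4; ++i) {
>>         __m256d a = _mm256_broadcast_sd(A+i);        // splat A[i]
>>         P[i][0] = _mm256_fmadd_pd(a, b0, P[i][0]);   // fused multiply-add
>>         P[i][1] = _mm256_fmadd_pd(a, b1, P[i][1]);
>>     }
>> }
>> //---------------------------------------------------------------------------------
>> 
>> (The point of choosing MR and NR this way is that all MR*NR accumulators fit
>> into the vector register file, so P can stay in registers across the whole
>> kc loop.)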
>> 
>> 
>>> new kernel:
>>> 
>>> template <typename Index, typename T>
>>> typename std::enable_if<std::is_floating_point<T>::value,
>>> void>::type
>>> ugemm(Index kc, T alpha,
>>>       const T *A, const T *B,
>>>       T beta,
>>>       T *C, Index incRowC, Index incColC)
>>> {
>>>    A = (const T*) __builtin_assume_aligned (A, 128);
>>>    B = (const T*) __builtin_assume_aligned (B, 128);
>>>    static const unsigned div = 32/sizeof(T);
>>>    static const Index MR = BlockSize<T>::MR;
>>>    static const Index NR = BlockSize<T>::NR/div;
>>> 
>>>    typedef T vx __attribute__((vector_size (32)));
>>> 
>>>    vx P[MR*NR] __attribute__ ((aligned (128))) = {};
>>>    const vx *B_ = (vx *)B;
>>>    for (Index l=0; l<kc; ++l) {
>>>        for (Index i=0; i<MR; ++i) {
>>>            for (Index j=0; j<(NR); ++j) {
>>>                P[i * NR + j] += A[i]*B_[j];
>>>            }
>>>        }
>>>        A += MR;
>>>        B_ += NR;
>>>    }
>>> 
>>>    T *P_ = (T *)P;
>>>    for (Index j=0; j<NR; ++j) {
>>>        for (Index i=0; i<MR; ++i) {
>>>            C[i*incRowC+j*incColC] *= beta;
>>>            C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
>>>        }
>>>    }
>>> }
>> 