Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-02-10 16:08:31
"doctool" finally completed to create a page which contains some benchmarks for
the multithreaded LU-factorization 
http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session7/page01.html
It also contains a plot that shows the potential of the recursive LU-factorization:
http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session7/page01.html#toc3
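
Just to make explicit what "recursive" means here: without pivoting, the idea boils
down to factoring the top-left block, doing two triangular solves, updating the
trailing block with a GEMM and recursing on it.  An untested toy sketch (not the
implementation behind the plots; a tuned version would of course do the triangular
solves and the trailing update with blocked TRSM/GEMM instead of naive triple loops):

//---------------------------------------------------------------------------------
#include <algorithm>
#include <cstddef>

// Unblocked LU (no pivoting) of the leading n x n block of A (row-major,
// leading dimension lda); L (unit diagonal) and U are stored in place.
template <typename T>
void
lu_unblocked(std::size_t n, T *A, std::size_t lda)
{
    for (std::size_t j=0; j<n; ++j) {
        for (std::size_t i=j+1; i<n; ++i) {
            A[i*lda+j] /= A[j*lda+j];
            for (std::size_t k=j+1; k<n; ++k) {
                A[i*lda+k] -= A[i*lda+j]*A[j*lda+k];
            }
        }
    }
}

// Recursive LU (no pivoting): factor A11, solve L11*U12 = A12 and L21*U11 = A21,
// update A22 -= L21*U12 (the GEMM that dominates the runtime and where the
// micro-kernel pays off), then recurse on A22.
template <typename T>
void
lu_recursive(std::size_t n, T *A, std::size_t lda, std::size_t bs=32)
{
    if (n<=std::max<std::size_t>(bs, 1)) {
        lu_unblocked(n, A, lda);
        return;
    }
    std::size_t n1 = n/2;
    std::size_t n2 = n - n1;
    T *A11 = A;
    T *A12 = A + n1;
    T *A21 = A + n1*lda;
    T *A22 = A + n1*lda + n1;

    lu_recursive(n1, A11, lda, bs);             // A11 = L11*U11

    for (std::size_t i=0; i<n1; ++i) {          // A12 := L11^{-1}*A12
        for (std::size_t k=0; k<i; ++k) {
            for (std::size_t j=0; j<n2; ++j) {
                A12[i*lda+j] -= A11[i*lda+k]*A12[k*lda+j];
            }
        }
    }
    for (std::size_t j=0; j<n1; ++j) {          // A21 := A21*U11^{-1}
        for (std::size_t i=0; i<n2; ++i) {
            for (std::size_t k=0; k<j; ++k) {
                A21[i*lda+j] -= A21[i*lda+k]*A11[k*lda+j];
            }
            A21[i*lda+j] /= A11[j*lda+j];
        }
    }
    for (std::size_t i=0; i<n2; ++i) {          // A22 := A22 - A21*A12
        for (std::size_t k=0; k<n1; ++k) {
            for (std::size_t j=0; j<n2; ++j) {
                A22[i*lda+j] -= A21[i*lda+k]*A12[k*lda+j];
            }
        }
    }

    lu_recursive(n2, A22, lda, bs);
}
//---------------------------------------------------------------------------------
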
On 02 Feb 2016, at 23:26, Michael Lehn <michael.lehn_at_[hidden]> wrote:
> I also started to write down some notes on the rationale behind the micro-kernel:
> 
> http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session6/page01.html
> 
> 
> On 02 Feb 2016, at 21:43, Michael Lehn <michael.lehn_at_[hidden]> wrote:
> 
>> (So the pages are re-done.)
>> 
>> 
>> 
>> The bug is that NR is BlockSize<T>::NR divided by the vector length.  So this
>> should work:
>> 
>> //---------------------------------------------------------------------------------
>> template <typename Index, typename T>
>> typename std::enable_if<std::is_floating_point<T>::value,
>> void>::type
>> ugemm(Index kc, T alpha,
>>       const T *A, const T *B,
>>       T beta,
>>       T *C, Index incRowC, Index incColC)
>> {
>>    A = (const T*) __builtin_assume_aligned (A, 128);
>>    B = (const T*) __builtin_assume_aligned (B, 128);
>>    static const unsigned div = 32/sizeof(T);      // elements of T per 256-bit vector
>>    static const Index MR = BlockSize<T>::MR;
>>    static const Index NR = BlockSize<T>::NR/div;  // NR counted in vectors, not scalars
>> 
>>    typedef T vx __attribute__((vector_size (32)));
>> 
>>    vx P[MR*NR] __attribute__ ((aligned (128))) = {};  // MR x NR micro-tile accumulator
>>    const vx *B_ = (const vx *)B;                       // packed B, viewed as vectors
>>    for (Index l=0; l<kc; ++l) {
>>        for (Index i=0; i<MR; ++i) {
>>            for (Index j=0; j<(NR); ++j) {
>>                P[i * NR + j] += A[i]*B_[j];
>>            }
>>        }
>>        A += MR;
>>        B_ += NR;
>>    }
>> 
>>    // write back the micro-tile: C = beta*C + alpha*P
>>    T *P_ = (T *)P;
>>    for (Index j=0; j<NR*div; ++j) {
>>        for (Index i=0; i<MR; ++i) {
>>            C[i*incRowC+j*incColC] *= beta;
>>            C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
>>        }
>>    }
>> }
>> //---------------------------------------------------------------------------------
>> 
>> However, as Karl pointed out, the update should also treat the case beta==0
>> as a special case.
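>> 
>> For the update above that would look roughly like this (untested):
>> 
>> //---------------------------------------------------------------------------------
>>    T *P_ = (T *)P;
>>    if (beta==T(0)) {
>>        // with beta==0, C must not be read at all (it might be uninitialized)
>>        for (Index j=0; j<NR*div; ++j) {
>>            for (Index i=0; i<MR; ++i) {
>>                C[i*incRowC+j*incColC] = alpha*P_[i*NR*div+j];
>>            }
>>        }
>>    } else {
>>        for (Index j=0; j<NR*div; ++j) {
>>            for (Index i=0; i<MR; ++i) {
>>                C[i*incRowC+j*incColC] *= beta;
>>                C[i*incRowC+j*incColC] += alpha*P_[i*NR*div+j];
>>            }
>>        }
>>    }
>> //---------------------------------------------------------------------------------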
>> 
>> But I don't get any speedup.  What mainly matters for performance is
>> the loop
>> 
>>    for (Index l=0; l<kc; ++l) {
>>        for (Index i=0; i<MR; ++i) {
>>            for (Index j=0; j<(NR); ++j) {
>>                P[i * NR + j] += A[i]*B_[j];
>>            }
>>        }
>>        A += MR;
>>        B_ += NR;
>>    }
>> 
>> From my experience, only once this loop is optimized to the extreme can a little
>> bit of extra performance be gained by treating some special cases in
>> the update
>> 
>>    T *P_ = (T *)P;
>>    for (Index j=0; j<NR*div; ++j) {
>>        for (Index i=0; i<MR; ++i) {
>>            C[i*incRowC+j*incColC] *= beta;
>>            C[i*incRowC+j*incColC] += alpha*P_[i * NR * div + j];
>>        }
>>    }
>> 
>> The main reason my code looks weird is that I also want to use it
>> for teaching.  The way it is coded, you can easily map it back and
>> forth to the assembly code.
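>> 
>> For double with AVX and FMA, for instance, one k-iteration of that inner loop is
>> just two aligned loads, a broadcast per row and a few fused multiply-adds.
>> Roughly (untested, assuming MR=4 and NR=8 for double, i.e. two vectors of four
>> doubles per row; compile with -mfma):
>> 
>> //---------------------------------------------------------------------------------
>> #include <immintrin.h>
>> 
>> // one k-iteration of P[i*NR+j] += A[i]*B_[j] for double, MR=4, NR=2 vectors
>> static inline void
>> ugemm_step(const double *A, const double *B, __m256d P[4][2])
>> {
>>     __m256d b0 = _mm256_load_pd(B);                  // B_[0] (packed, 32-byte aligned)
>>     __m256d b1 = _mm256_load_pd(B+4);                // B_[1]
>>     for (int i=0; i<4; ++i) {
>>         __m256d a = _mm256_broadcast_sd(A+i);        // splat A[i]
>>         P[i][0] = _mm256_fmadd_pd(a, b0, P[i][0]);   // fused multiply-add
>>         P[i][1] = _mm256_fmadd_pd(a, b1, P[i][1]);
>>     }
>> }
>> //---------------------------------------------------------------------------------
>> 
>> (The point of choosing MR and NR this way is that all MR*NR accumulators fit
>> into the vector register file, so P can stay in registers across the whole
>> kc loop.)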
>> 
>> 
>>> new kernel:
>>> 
>>> template <typename Index, typename T>
>>> typename std::enable_if<std::is_floating_point<T>::value,
>>> void>::type
>>> ugemm(Index kc, T alpha,
>>>       const T *A, const T *B,
>>>       T beta,
>>>       T *C, Index incRowC, Index incColC)
>>> {
>>>    A = (const T*) __builtin_assume_aligned (A, 128);
>>>    B = (const T*) __builtin_assume_aligned (B, 128);
>>>    static const unsigned div = 32/sizeof(T);
>>>    static const Index MR = BlockSize<T>::MR;
>>>    static const Index NR = BlockSize<T>::NR/div;
>>> 
>>>    typedef T vx __attribute__((vector_size (32)));
>>> 
>>>    vx P[MR*NR] __attribute__ ((aligned (128))) = {};
>>>    const vx *B_ = (vx *)B;
>>>    for (Index l=0; l<kc; ++l) {
>>>        for (Index i=0; i<MR; ++i) {
>>>            for (Index j=0; j<(NR); ++j) {
>>>                P[i * NR + j] += A[i]*B_[j];
>>>            }
>>>        }
>>>        A += MR;
>>>        B_ += NR;
>>>    }
>>> 
>>>    T *P_ = (T *)P;
>>>    for (Index j=0; j<NR; ++j) {
>>>        for (Index i=0; i<MR; ++i) {
>>>            C[i*incRowC+j*incColC] *= beta;
>>>            C[i*incRowC+j*incColC] += alpha*P_[i * NR + j];
>>>        }
>>>    }
>>> }
>> 