Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-29 11:13:28
Oh no, grr.  Yes, you are right.  Sorry, the stuff gets generated
automatically, and it again copied "fma.hpp" from the wrong
server/directory.
Could you check again:
http://www.mathematik.uni-ulm.de/~lehn/test_ublas/fma.hpp
The kernel requires MR=4 and NR=12 and uses the vfmadd231pd instruction.
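For reference, the inner loop of such a kernel looks roughly like this (a
minimal sketch of my own, not the exact code from fma.hpp; with -mfma the
_mm256_fmadd_pd intrinsic compiles to an FMA instruction such as
vfmadd231pd):

    #include <immintrin.h>

    // Sketch of a 4x12 double micro-kernel: MR=4 rows of a packed A panel,
    // NR=12 columns of a packed B panel, kc rank-1 updates accumulated in
    // twelve ymm registers (3 per row of C).
    void ugemm_4x12(long kc, const double *A, const double *B, double *AB)
    {
        __m256d ab[4][3] = {};                  // 4x12 accumulators
        for (long l=0; l<kc; ++l) {
            __m256d b0 = _mm256_loadu_pd(B);    // 12 doubles of the B panel
            __m256d b1 = _mm256_loadu_pd(B+4);
            __m256d b2 = _mm256_loadu_pd(B+8);
            for (int i=0; i<4; ++i) {
                __m256d ai = _mm256_broadcast_sd(A+i);
                ab[i][0] = _mm256_fmadd_pd(ai, b0, ab[i][0]);   // fused multiply-add
                ab[i][1] = _mm256_fmadd_pd(ai, b1, ab[i][1]);
                ab[i][2] = _mm256_fmadd_pd(ai, b2, ab[i][2]);
            }
            A += 4;                             // advance packed A by MR
            B += 12;                            // advance packed B by NR
        }
        for (int i=0; i<4; ++i) {               // store the 4x12 result block
            _mm256_storeu_pd(AB + i*12,     ab[i][0]);
            _mm256_storeu_pd(AB + i*12 + 4, ab[i][1]);
            _mm256_storeu_pd(AB + i*12 + 8, ab[i][2]);
        }
    }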
Quoting palik imre <imre_palik_at_[hidden]>:
> Are you sure about the replacement?  This fma.hpp is practically
> identical to your avx.hpp.
>
>
>     On Friday, 29 January 2016, 0:21, Michael Lehn   
> <michael.lehn_at_[hidden]> wrote:
>
>
> Could you also compile with -DHAVE_FMA?  However, I noticed that I
> messed up the file "fma.hpp" when I created the page.  So you should
> replace it with
>  http://www.mathematik.uni-ulm.de/~lehn/test_ublas/fma.hpp
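> For example (assuming the same flags as the AVX build below, with -mavx
> and -DHAVE_AVX swapped for their FMA counterparts; the exact line is my
> guess):
>
>     g++ -Ofast -mavx -mfma -Wall -std=c++11 -DNDEBUG -DHAVE_FMA -fopenmp -DM_MAX=1500 matprod.cc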
> At the moment I am regenerating the pages on the website.
>
>
> On 28 Jan 2016, at 18:08, palik imre <imre_palik_at_[hidden]> wrote:
>
> Wow:
> $ cd session4
> $ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp   
> -DM_MAX=1500 matprod.cc
> matprod.cc: In function ‘double estimateGemmResidual(const MA&,
> const MB&, const MC0&, const MC1&)’:
> matprod.cc:94:40: warning: typedef ‘TC0’ locally defined but not
> used [-Wunused-local-typedefs]
>      typedef typename MC0::value_type   TC0;
>                                         ^
> $ ./a.out
> #    m     n     k  uBLAS: t1      MFLOPS   Blocked: t2    MFLOPS      Diff nrm1
>    100   100   100  0.000465896    4292.8   0.00405376     493.369
>    200   200   200  0.0046712      3425.24  0.00244891     6533.51
>    300   300   300  0.0161551      3342.59  0.00314843     17151.4     1.94151e-16
>    400   400   400  0.032688       3915.81  0.00240584     53204       8.40477e-17
>    500   500   500  0.0544033      4595.31  0.00291218     85846.4     3.52062e-17
>    600   600   600  0.0838317      5153.18  0.00453539     95250.9     1.63452e-17
>    700   700   700  0.130899       5240.69  0.00585201     117225      8.44769e-18
>    800   800   800  0.201121       5091.46  0.010872       94186.5     4.72429e-18
>    900   900   900  0.286407       5090.65  0.0151735      96088.7     2.80443e-18
>   1000  1000  1000  0.432211       4627.37  0.0707567      28265.9     1.7626e-18
>   1100  1100  1100  0.511146       5207.9   0.0186911      142420      1.14521e-18
>   1200  1200  1200  0.666975       5181.6   0.025109       137640      7.79963e-19
>   1300  1300  1300  0.863769       5087.01  0.0283398      155047      5.45468e-19
>   1400  1400  1400  1.09638        5005.57  0.143209       38321.6     3.90302e-19
>   1500  1500  1500  1.40352        4809.33  0.120096       56204.9     2.8667e-19
> $ cd ../session2
> $ g++ -Ofast -mavx -Wall -std=c++11 -DNDEBUG -DHAVE_AVX -fopenmp   
> -DM_MAX=1500 matprod.cc
> [ec2-user_at_ip-10-0-46-255 session2]$ ./a.out
> #    m     n     k  uBLAS: t1      MFLOPS   Blocked: t2    MFLOPS      Diff nrm1
>    100   100   100  0.000471888    4238.29  0.000231317    8646.14
>    200   200   200  0.00431625     3706.92  0.00121122     13209.9
>    300   300   300  0.0153292      3522.69  0.00336464     16049.3     1.07937e-06
>    400   400   400  0.0317138      4036.1   0.00712568     17963.2     4.06488e-06
>    500   500   500  0.052809       4734.04  0.0121626      20554.9     9.09947e-06
>    600   600   600  0.0828121      5216.63  0.020657       20913       1.65243e-05
>    700   700   700  0.131053       5234.51  0.0318276      21553.6     2.71365e-05
>    800   800   800  0.196825       5202.58  0.0482679      21214.9     4.12109e-05
>    900   900   900  0.281006       5188.51  0.0671323      21718.3     5.93971e-05
>   1000  1000  1000  0.386332       5176.89  0.0906054      22073.7     8.23438e-05
>   1100  1100  1100  0.51667        5152.22  0.124346       21408.1     0.000109566
>   1200  1200  1200  0.668425       5170.37  0.159701       21640.5     0.000142817
>   1300  1300  1300  0.860445       5106.66  0.203472       21595.1     0.000182219
>   1400  1400  1400  1.08691        5049.18  0.249427       22002.4     0.000226999
>   1500  1500  1500  1.38244        4882.67  0.307519       21949.9     0.000280338
>
> This is on Haswell
>
>
>     On Wednesday, 27 January 2016, 1:39, Michael Lehn   
> <michael.lehn_at_[hidden]> wrote:
>
>
>
> On 23 Jan 2016, at 18:53, palik imre <imre_palik_at_[hidden]> wrote:
>
> Hi All,
> What's next?  I mean, what is the development process for uBLAS?
>
> Now we have a C-like implementation that outperforms both the
> mainline and the branch version (axpy_prod).  What will we do with
> that?
> As far as I see it, we have the following options:
>
> 1) Create a C++ template-magic implementation out of it.  But for
>    this we would at least need compile-time access to the target
>    instruction set.  Any idea how to do that?  (See the sketch after
>    this list.)
> 2) Create a compiled-library implementation out of it, and choose the
>    implementation at run time based on the CPU capabilities.
> 3) Include some good defaults/defines, and hope the user will use
>    them.
> 4) Don't include it, and do something completely different.
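>
> A rough illustration of what 1) and 2) could look like (my own sketch,
> not uBLAS code): compilers predefine __AVX__/__FMA__ when the matching
> -m flags are active, and GCC/Clang provide __builtin_cpu_supports for
> run-time checks.
>
>     #include <cstdio>
>
>     static void gemm_fma()     { std::puts("FMA kernel"); }     // placeholder
>     static void gemm_avx()     { std::puts("AVX kernel"); }     // kernels only
>     static void gemm_generic() { std::puts("generic kernel"); }
>
>     int main()
>     {
>     #if defined(__FMA__)
>         gemm_fma();                          // option 1: chosen at compile time
>     #elif defined(__AVX__)
>         gemm_avx();
>     #else
>         if (__builtin_cpu_supports("fma"))   // option 2: chosen at run time
>             gemm_fma();
>         else
>             gemm_generic();
>     #endif
>     }
>
> A real library would compile each kernel for its own target (e.g. in
> separate translation units), but the selection logic stays this simple.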
>
>
> What do you think?
>
>
> At the moment I am not certain, but pretty sure, that you don't have
> to rewrite uBLAS to get good performance.  uBLAS is a pretty fine
> library and already has what you need.  So just extend it.  At least
> for 383 lines of code :-)
> As I was pretty busy the last few days, I could not continue until
> today.  I made some minimal modifications to adapt the GEMM
> implementation to take advantage of uBLAS:
>
> - The GEMM frame algorithm and the pack functions now work with any
>   uBLAS matrix that supports element access through the notation
>   A(i,j).
> - Only the matrix C in the operation C <- beta*C + alpha*A*B is
>   required to be stored row-major or col-major.  The other matrices
>   can be expressions.  Here I need some help to get rid of these ugly
>   lines of code:
>
>     TC *C_ = &C(0,0);
>     const size_type incRowC = &C(1,0) - &C(0,0);
>     const size_type incColC = &C(0,1) - &C(0,0);
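>
> One possibility would be a small overload set keyed on the orientation
> (just a sketch of mine, not existing uBLAS API; it assumes C is a
> dense ublas::matrix):
>
>     #include <cstddef>
>     #include <boost/numeric/ublas/matrix.hpp>
>
>     namespace ublas = boost::numeric::ublas;
>
>     // Row-major: rows are size2() elements apart, row elements contiguous.
>     template <typename T>
>     void get_strides(const ublas::matrix<T, ublas::row_major> &C,
>                      std::size_t &incRow, std::size_t &incCol)
>     {
>         incRow = C.size2();
>         incCol = 1;
>     }
>
>     // Col-major: column elements contiguous, columns size1() apart.
>     template <typename T>
>     void get_strides(const ublas::matrix<T, ublas::column_major> &C,
>                      std::size_t &incRow, std::size_t &incCol)
>     {
>         incRow = 1;
>         incCol = C.size1();
>     }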
> - Things like the following should work without performance penalty
>   (and in particular without creating temporaries):
>
>     blas::axpy_prod(3*A1+A2, B1 - B2, C, matprodUpdate)
>
> - Matrices A and B can be symmetric or whatever.  Of course, if A or
>   B is triangular, a specialized implementation can take advantage of
>   that.
> - The storage format does not matter.  You can mix row-major and
>   col-major, packed, … (see the example below).
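>
> For instance, something along these lines should then be possible (my
> own example; it assumes the axpy_prod overload and the matprodUpdate
> tag from the session4 code):
>
>     namespace ublas = boost::numeric::ublas;
>
>     const std::size_t m = 500, n = 500, k = 500;
>     ublas::matrix<double, ublas::row_major>    A1(m, k), A2(m, k);
>     ublas::matrix<double, ublas::column_major> B1(k, n), B2(k, n);
>     ublas::matrix<double>                      C(m, n);
>
>     // expression operands and mixed storage orders in a single call:
>     blas::axpy_prod(3*A1 + A2, B1 - B2, C, matprodUpdate);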
> Here is the page for this:
>  http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session4/page01.html
> Please note that creating the benchmarks for the symmetric
> matrix-matrix product takes really long, because the current uBLAS
> implementation seems to be much slower than for general matrices.  So
> I reduced the problem size for now.  It would be helpful if some
> people could reproduce the benchmarks and also check different cases:
> - different expressions
> - different element types, e.g. A with floats, B with doubles, etc.
>
> At the moment there is only a micro kernel for double.  The
> performance depends on the common type of A and B.  So with
> complex<..> the reference implementation is used.  But the
> performance should at least be constant in this case.
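>
> To make "the common type of A and B" concrete, my reading is that the
> kernel selection behaves like std::common_type:
>
>     #include <type_traits>
>
>     // float/double panels promote to double, so the double micro
>     // kernel applies (my interpretation, not a quote from the code).
>     static_assert(std::is_same<std::common_type<float, double>::type,
>                                double>::value,
>                   "mixed float/double uses the double kernel");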
> It would in particular be nice to have benchmarks from people with an
> Intel Sandy Bridge or Haswell, as the micro kernels are optimized for
> these architectures.  If there is interest, I can extend the
> benchmarks to compare with Intel MKL.  For Linux, a free
> non-commercial version is available.
>
> Cheers,
> Michael