Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-01-22 19:41:40
I re-ran all the benchmarks with -Ofast -mavx (as the hardware on which I generate the
benchmarks does not have AVX2).  So now the original ublas::axpy_prod() is actually
doing way better ...
On 22 Jan 2016, at 21:27, Michael Lehn <michael.lehn_at_[hidden]> wrote:
> Wow Imre!
> 
> Ok, that is actually a significant difference :-)
> 
> I have just added a new version to my site.  Unfortunately the computer I used for creating
> the page does not have posix_memalign().  So I had to hack my own function for that.  But
> I think for a proof of concept it will do.
> 
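> A minimal sketch of such a fallback (not the actual code from the tar-ball, just the
> usual over-allocate-and-round-up trick) could look like this:
> 
>     #include <cstdlib>
>     #include <cstdint>
> 
>     // Allocate `size` bytes aligned to `alignment` (a power of two) by
>     // over-allocating and stashing the original pointer right before the
>     // returned block, so aligned_free() can recover it.
>     void *aligned_malloc(std::size_t alignment, std::size_t size)
>     {
>         void *raw = std::malloc(size + alignment + sizeof(void *));
>         if (!raw) return nullptr;
>         std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
>         p = (p + alignment - 1) & ~(std::uintptr_t(alignment) - 1);
>         reinterpret_cast<void **>(p)[-1] = raw;
>         return reinterpret_cast<void *>(p);
>     }
> 
>     void aligned_free(void *ptr)
>     {
>         if (ptr) std::free(reinterpret_cast<void **>(ptr)[-1]);
>     }
> 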
> But most of all I also added micro-kernels that are (fairly) optimised for AVX and FMA.  The
> micro-kernel for AVX requires MR=4, NR=8; for FMA it requires MR=4, NR=16.  Otherwise
> the reference implementation gets selected.  Default values for all parameters can now be
> overridden when compiling, e.g.
> 
> 	g++ -O3 -Wall -std=c++11 -DHAVE_AVX  -DBS_D_MC=512  matprod.cc
> 
> The optimised micro-kernels are only included when compiling with -DHAVE_AVX or -DHAVE_FMA.
> 
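> As a sketch of what such overridable defaults might look like (the macro name BS_D_MC
> follows the -DBS_D_MC example above; the other names and the default values here are
> just placeholders):
> 
>     // gemm.hpp (sketch): every blocking parameter gets a default that can be
>     // overridden on the compiler command line, e.g. -DBS_D_MC=512
>     #ifndef BS_D_MC
>     #define BS_D_MC 256     // rows of the packed A block
>     #endif
>     #ifndef BS_D_KC
>     #define BS_D_KC 256     // common dimension of the packed A and B blocks
>     #endif
>     #ifndef BS_D_NC
>     #define BS_D_NC 4096    // columns of the packed B block
>     #endif
>     #ifndef BS_D_MR
>     #define BS_D_MR 4       // rows of the micro-kernel tile
>     #endif
>     #ifndef BS_D_NR
>     #define BS_D_NR 8       // columns of the micro-kernel tile
>     #endif
> 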
> I put all this stuff here
> 
> 	http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2/page01.html
> 
> and the tar-ball 
> 
> 	http://www.mathematik.uni-ulm.de/~lehn/test_ublas/session2.tgz
> 
> contains all required files:
> 
> 	session2/avx.hpp
> 	session2/fma.hpp
> 	session2/gemm.hpp
> 	session2/matprod.cc
> 
> Of course I would be interested in how all this performs on other platforms.
> 
> 
> Cheers,
> 
> Michael
> 
> 
> On 22 Jan 2016, at 16:20, palik imre <imre_palik_at_[hidden]> wrote:
> 
>> Hi All,
>> 
>> In the meantime I enabled avx2 ...
>> 
>> Theoretical CPU performance maximum: clock * vector-size * ALUs/Core * 2 ~ 69.6GFLOPS
>> So my tuned run below is at ~25% of that.
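>> 
>> (One decomposition consistent with that figure, assuming a ~4.35 GHz boost clock and
>> double precision with AVX2/FMA, i.e. 4 doubles per 256-bit register, 2 FMA ALUs per
>> core and 2 flops per FMA, would be
>> 
>>     4.35 * 4 * 2 * 2 ~ 69.6 GFLOPS
>> 
>> with the tuned run below topping out around 17 GFLOPS.)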
>> 
>> Compiler is gcc 4.8.3
>> 
>> vanilla run:
>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod matprod.cc
>> $ ./matprod
>> #   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
>>  100   100   100  0.000350402      5707.73      0.00116104      1722.59               0
>>  200   200   200   0.00445094      3594.74      0.00819996      1951.23               0
>>  300   300   300    0.0138515      3898.49       0.0266515      2026.15     1.06987e-06
>>  400   400   400    0.0266447      4803.96       0.0613506      2086.37     4.01475e-06
>>  500   500   500    0.0424372      5891.06        0.119345      2094.77     8.99605e-06
>>  600   600   600    0.0724648      5961.51        0.203187      2126.12     1.63618e-05
>>  700   700   700     0.115464      5941.25        0.325834      2105.36     2.69547e-05
>>  800   800   800     0.173003      5918.98        0.480655      2130.43     4.09449e-05
>>  900   900   900     0.248077       5877.2        0.689972      2113.13     5.87376e-05
>> 1000  1000  1000      0.33781      5920.49        0.930591      2149.17     8.16264e-05
>> 1100  1100  1100       0.5149      5169.93         1.25507         2121     0.000108883
>> 1200  1200  1200     0.670628      5153.38         1.62732      2123.74     0.000141876
>> 1300  1300  1300     0.852692      5153.09         2.06708      2125.71     0.000180926
>> 1400  1400  1400      1.06695      5143.65         2.56183      2142.22     0.000225975
>> 1500  1500  1500       1.3874       4865.2         3.16532      2132.49     0.000278553
>> 1600  1600  1600      1.77623      4612.03          3.8137      2148.05     0.000338106
>> 1700  1700  1700       2.3773      4133.26         4.56665      2151.69     0.000404458
>> 1800  1800  1800      3.06381      3807.03         5.40317      2158.73      0.00048119
>> 1900  1900  1900       3.9039      3513.92         6.37295      2152.53     0.000564692
>> 2000  2000  2000      4.79166      3339.13         7.43399      2152.28     0.000659714
>> 2100  2100  2100      6.04946      3061.76         8.62429      2147.65     0.000762223
>> 2200  2200  2200      7.39085       2881.4         9.86237      2159.32     0.000875624
>> 2300  2300  2300      9.00453      2702.42         11.2513      2162.78      0.00100184
>> 2400  2400  2400      10.3952      2659.68         12.7491      2168.62      0.00113563
>> 2500  2500  2500      12.2283      2555.55         14.4615      2160.92      0.00128336
>> 2600  2600  2600      13.8912      2530.51         16.1965      2170.34      0.00144304
>> 2700  2700  2700       15.391      2557.72         18.1998      2162.99      0.00161411
>> 2800  2800  2800      17.5673      2499.19         20.2171      2171.63      0.00180035
>> 2900  2900  2900      19.4621      2506.31         22.5482      2163.28      0.00199765
>> 3000  3000  3000      21.4506      2517.42         24.9477      2164.53      0.00221028
>> 3100  3100  3100        23.71      2512.95         27.5144      2165.48      0.00243877
>> 3200  3200  3200      25.9051      2529.85         30.2816      2164.22      0.00267766
>> 3300  3300  3300      28.1949      2549.18          33.176      2166.45      0.00293379
>> 3400  3400  3400      30.7235      2558.56         36.0156      2182.61       0.0032087
>> 3500  3500  3500      34.0419      2518.95         39.3929      2176.79      0.00349827
>> 3600  3600  3600      37.0562      2518.12         42.7524      2182.62      0.00380447
>> 3700  3700  3700      39.7885      2546.11         46.4748      2179.81      0.00412621
>> 3800  3800  3800      43.6607      2513.56         50.2119      2185.62       0.0044694
>> 3900  3900  3900      46.5104      2550.78         54.4822      2177.56      0.00482355
>> 4000  4000  4000      50.6098      2529.15         58.7686      2178.03      0.00520289
>> 
>> tuned run:
>> 
>> $ g++ -Wall -W -std=gnu++11 -Ofast -march=core-avx2 -mtune=core-avx2 -g -DNDEBUG -I gemm -o matprod2 matprod2.cc
>> $ ./matprod2
>> #   m     n     k  uBLAS:   t1       MFLOPS   Blocked:   t2      MFLOPS        Diff nrm1
>>  100   100   100  0.000351671      5687.13     0.000316612      6316.88               0
>>  200   200   200   0.00419531      3813.78      0.00159044      10060.1               0
>>  300   300   300    0.0141153      3825.62      0.00421113      12823.2     1.07645e-06
>>  400   400   400    0.0291599      4389.59      0.00858138        14916     4.00614e-06
>>  500   500   500    0.0483492      5170.72       0.0166519      15013.3     8.96808e-06
>>  600   600   600    0.0725783      5952.19       0.0279634      15448.7     1.63386e-05
>>  700   700   700     0.113891      6023.29        0.043077        15925     2.69191e-05
>>  800   800   800     0.171416      5973.79       0.0627796        16311     4.09782e-05
>>  900   900   900     0.243677      5983.32       0.0922766      15800.3     5.88092e-05
>> 1000  1000  1000     0.335158      5967.33        0.123339      16215.5     8.15988e-05
>> 1100  1100  1100     0.515776      5161.15         0.16578      16057.5     0.000108991
>> 1200  1200  1200     0.662706      5214.98        0.205989      16777.6     0.000141824
>> 1300  1300  1300     0.845952      5194.15         0.27637        15899      0.00018111
>> 1400  1400  1400      1.06712      5142.82        0.332118      16524.2     0.000225958
>> 1500  1500  1500      1.38147      4886.11        0.409224      16494.6     0.000278265
>> 1600  1600  1600      1.72238      4756.21        0.492314      16639.8     0.000338095
>> 1700  1700  1700      2.38508      4119.77        0.603566      16279.9     0.000404362
>> 1800  1800  1800      3.12034      3738.05        0.717409      16258.5     0.000481575
>> 1900  1900  1900      3.93668      3484.66        0.824933      16629.2     0.000564727
>> 2000  2000  2000      4.76038      3361.07        0.941643      16991.6      0.00065862
>> 2100  2100  2100      5.90627      3135.99         1.12226      16504.2     0.000762307
>> 2200  2200  2200      7.26419      2931.64         1.28213      16609.9     0.000876699
>> 2300  2300  2300      8.88171      2739.79         1.45247      16753.5      0.00100222
>> 2400  2400  2400      10.4956      2634.26         1.62705      16992.7      0.00113566
>> 2500  2500  2500       11.913      2623.18         1.87499      16666.7      0.00128371
>> 2600  2600  2600      13.7057      2564.77          2.1156      16615.6      0.00144259
>> 2700  2700  2700      15.5959      2524.13         2.33957      16826.1      0.00161501
>> 2800  2800  2800      17.1121      2565.67         2.57445      17053.8      0.00179901
>> 2900  2900  2900      19.4167      2512.16         2.92445      16679.4      0.00199764
>> 3000  3000  3000      21.3239      2532.37         3.18891      16933.7      0.00220999
>> 3100  3100  3100      23.5049      2534.88          3.5305      16876.4      0.00243845
>> 3200  3200  3200      25.7362      2546.45         3.81708      17169.1      0.00267581
>> 3300  3300  3300      28.4467      2526.62         4.25869        16877      0.00293513
>> 3400  3400  3400      30.4607      2580.63         4.67999      16796.6      0.00320688
>> 3500  3500  3500      33.7737      2538.96         5.04289      17004.1      0.00349667
>> 3600  3600  3600      36.9633      2524.45           5.414      17235.3      0.00380237
>> 3700  3700  3700      39.5153      2563.71         6.04875      16748.2      0.00412583
>> 3800  3800  3800      42.9412      2555.68         6.48985      16910.1      0.00446785
>> 3900  3900  3900      46.5282      2549.81         7.05844        16808      0.00482701
>> 4000  4000  4000      50.2218      2548.69         7.42442      17240.4      0.00520272
>> 
>> As the generated machine code is completely different, I guess gcc notices the aligned allocation and uses the alignment information for optimisation.
>> 
>> Cheers,
>> 
>> Imre
>> 
>> 
>> On Friday, 22 January 2016, 15:09, Michael Lehn <michael.lehn_at_[hidden]> wrote:
>> 
>> 
>> Hi Imre,
>> 
>> thanks for running the benchmarks.  Of course you are right that using aligned memory for the buffers improves
>> performance.  I also did not really put any effort into optimising the parameters MC, NC, KC, MR and NR.  I will
>> compare different variants and report them on the website
>> 
>> 	http://www.mathematik.uni-ulm.de/~lehn/test_ublas/index.html
>> 
>> I modified my benchmark program such that it also computes the FLOPS as
>> 
>> 	FLOPS = 2*m*n*k/time_elapsed
>> 
>> See
>> 
>> 	http://www.mathematik.uni-ulm.de/~lehn/test_ublas/download/session1/matprod.cc
>> 
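>> The timing itself is roughly along these lines (a sketch, not the exact code from
>> matprod.cc; it just applies the formula above to a timed call):
>> 
>>     #include <chrono>
>>     #include <cstddef>
>> 
>>     // Time a single product and convert to MFLOPS via 2*m*n*k/time_elapsed.
>>     template <typename MatProd>
>>     double mflops(std::size_t m, std::size_t n, std::size_t k, MatProd &&matprod)
>>     {
>>         auto t0 = std::chrono::high_resolution_clock::now();
>>         matprod();
>>         auto t1 = std::chrono::high_resolution_clock::now();
>>         double elapsed = std::chrono::duration<double>(t1 - t0).count();  // seconds
>>         return 2.0 * m * n * k / elapsed / 1e6;
>>     }
>> 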
>> Could you re-run your benchmarks and post the different MFLOPS you get? That is important for actually tuning things.
>> On my machine my code only reaches 20% of the peak performance (about 5 GFLOPS instead of 25.6 GFLOPS).   So
>> a speedup of 2.5 would be impressive but still far from peak performance.
>> 
>> Cheers,
>> 
>> Michael
>> 
>> 
>> On 22 Jan 2016, at 11:03, palik imre <imre_palik_at_[hidden]> wrote:
>> 
>>> Sorry for posting more or less the same thing twice.  I got confused by the javascript interfaces.
>>> 
>>> It seems I also forgot to enable avx for my last measurements.  With that + my blocking and alignment changes, performance according to my tests is something like 250% higher than running Michael's original code (with avx).
>>> 
>>> Cheers,
>>> 
>>> Imre
>>> 
>>> 
>>> On Friday, 22 January 2016, 10:33, palik imre <imre_palik_at_[hidden]> wrote:
>>> 
>>> 
>>> Hi Michael,
>>> 
>>> your block sizes are far from optimal.  MR & NR should be multiples of the L1 cache line size (i.e. 16 for double on Intel).  Also, the blocks should be allocated aligned to L1 cache lines (e.g., via posix_memalign()).
>>> 
>>> This alone brought something like 50% speedup for my square matrix test.
>>> 
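>>> To make the alignment part concrete (a sketch, assuming a POSIX system and plain
>>> double buffers for the packed blocks; 64 bytes is the L1 cache line size on current
>>> Intel CPUs):
>>> 
>>>     #include <cstdlib>
>>> 
>>>     // Allocate n doubles aligned to a 64-byte L1 cache line; release with free().
>>>     double *alloc_aligned_buffer(std::size_t n)
>>>     {
>>>         void *p = nullptr;
>>>         if (posix_memalign(&p, 64, n * sizeof(double)) != 0) {
>>>             return nullptr;
>>>         }
>>>         return static_cast<double *>(p);
>>>     }
>>> 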
>>> I will have a look at the other parameters + the whole thing via perf during the weekend.
>>> 
>>> Cheers,
>>> 
>>> Imre
>>> 
>>> 
>>> 
>>> On Friday, 22 January 2016, 0:28, "ublas-request_at_[hidden]" <ublas-request_at_[hidden]> wrote:
>>> 
>>> 
>>> Subject: Re: [ublas] Matrix multiplication performance
>>> 
>>> Hi Nasos,
>>> 
>>> first of all I don't want to take credit that isn't mine and want to point out that this is not my algorithm.  It is based on
>>> 
>>>    http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>>> 
>>>    https://github.com/flame/blis
>>> 
>>> For a few cores (4-8) it can easily be made multithreaded.  For many-core processors like the Intel Xeon Phi this is a bit more
>>> sophisticated, but still not too hard.  The demo I posted does not use micro-kernels that exploit SSE, AVX or
>>> FMA instructions.  With such kernels the matrix product is on par with Intel MKL, just like BLIS.  For my platforms I wrote
>>> my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
>>> 
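>>> For readers who have not seen the scheme: the micro-kernel is the only piece that has
>>> to be vectorised by hand.  A plain reference version (no intrinsics, interface modelled
>>> loosely on such BLIS-style kernels) computes one MR x NR tile of C from packed panels
>>> of A and B roughly like this:
>>> 
>>>     #include <cstddef>
>>> 
>>>     // C(MR x NR) <- beta*C + alpha*A*B, with A packed column-wise and B packed
>>>     // row-wise so the k-loop streams through both panels contiguously.
>>>     template <typename T, int MR, int NR>
>>>     void ugemm_ref(std::size_t kc, T alpha, const T *A, const T *B,
>>>                    T beta, T *C, std::size_t incRowC, std::size_t incColC)
>>>     {
>>>         T AB[MR * NR] = {};                      // accumulator for the tile
>>> 
>>>         for (std::size_t l = 0; l < kc; ++l) {   // kc rank-1 updates
>>>             for (int j = 0; j < NR; ++j) {
>>>                 for (int i = 0; i < MR; ++i) {
>>>                     AB[i + j * MR] += A[i] * B[j];
>>>                 }
>>>             }
>>>             A += MR;
>>>             B += NR;
>>>         }
>>>         for (int j = 0; j < NR; ++j) {           // scale and write back
>>>             for (int i = 0; i < MR; ++i) {
>>>                 T &c = C[i * incRowC + j * incColC];
>>>                 c = beta * c + alpha * AB[i + j * MR];
>>>             }
>>>         }
>>>     }
>>> 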
>>> Maybe you could help me to integrate your code in the benchmark example I posted above.
>>> 
>>> About Blaze:  Do they have their own implementation of a matrix-matrix product?  It seems to require a
>>> tuned BLAS implementation ("Otherwise you get only poor performance") for the matrix-matrix product.
>>> IMHO they have only tuned the "easy" stuff like BLAS Level 1 and Level 2.  In that case it makes more
>>> sense to compare the performance with the actual underlying GEMM implementation.  But if I am wrong,
>>> let me know.
>>> 
>>> About the block sizes: In my experience you get better performance if you choose them dynamically at runtime
>>> depending on the problem size.  Depending on the architecture you can just specify ranges like 256 - 384 for
>>> the blocking factor MC.  In my code it then also needs to satisfy the restriction that it is divisible by the factor MR.
>>> I know that doing things at compile time is a C++ fetish.  But the runtime overhead is negligible and having
>>> blocks of similar sizes easily pays off.
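>>> 
>>> As a sketch of what I mean (the function name and the MR default are made up, and the
>>> range limits 256 and 384 are assumed to be multiples of MR):
>>> 
>>>     #include <algorithm>
>>>     #include <cstddef>
>>> 
>>>     // Choose MC at runtime: split m into panels of similar size, round up to a
>>>     // multiple of MR and clamp into [mcMin, mcMax].
>>>     std::size_t chooseMC(std::size_t m, std::size_t mcMin = 256,
>>>                          std::size_t mcMax = 384, std::size_t MR = 4)
>>>     {
>>>         if (m == 0) return mcMin;
>>>         std::size_t panels = (m + mcMax - 1) / mcMax;   // fewest panels with mc <= mcMax
>>>         std::size_t mc     = (m + panels - 1) / panels; // even split across the panels
>>>         mc = ((mc + MR - 1) / MR) * MR;                 // round up to a multiple of MR
>>>         return std::min(std::max(mc, mcMin), mcMax);    // clamp into the allowed range
>>>     }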
>>> 
>>> Cheers,
>>> 
>>> Michael
>>> *************************************
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> ublas mailing list
>>> ublas_at_[hidden]
>>> http://listarchives.boost.org/mailman/listinfo.cgi/ublas
>>> Sent to: michael.lehn_at_[hidden]
>> 
>> 
>> 
> 
> _______________________________________________
> ublas mailing list
> ublas_at_[hidden]
> http://listarchives.boost.org/mailman/listinfo.cgi/ublas
> Sent to: michael.lehn_at_[hidden]
>