Subject: Re: [ublas] Matrix multiplication performance
From: Michael Lehn (michael.lehn_at_[hidden])
Date: 2016-02-16 11:09:53
On 22 Jan 2016, at 00:28, nasos <nasos_i_at_[hidden]> wrote:
> Michael,
> please see below
> 
> On 01/21/2016 05:23 PM, Michael Lehn wrote:
>> Hi Nasos,
>> 
>> first of all I don't want to take undue credit and want to point out that this is not my algorithm.  It is based on
>> 
>> 	http://www.cs.utexas.edu/users/flame/pubs/blis2_toms_rev3.pdf
>> 
>> 	https://github.com/flame/blis
>> 
>> For a few cores (4-8) it can easily be made multithreaded.  For many-core systems like the Intel Xeon Phi this is a bit more
>> involved but still not too hard.
> Setting up Phis is indeed an issue, especially because they are "locked" to icpc. OpenMP is working properly though.
> 
>> The demo I posted does not use micro-kernels that exploit SSE, AVX or
>> FMA instructions.  With such micro-kernels the matrix product is on par with Intel MKL, just like BLIS.  For my platforms I wrote
>> my own micro-kernels, but the interface of the function ugemm is compatible with BLIS.
>> 
> If you compile with -O3 I think you are getting near-optimal SSE vectorization. gcc is truly impressive, and the Intel compiler even more so.
>> Maybe you could help me to integrate your code in the benchmark example I posted above.
>> 
> I will try to find some time to spend on the code. 
>> About Blaze:  Do they have their own implementation of a matrix-matrix product?  It seems to require a
>> tuned BLAS implementation for the matrix-matrix product (otherwise you get only poor performance).
> I will check the benchmarks I ran. I think I was using MKL with Blaze, but Blaze is taking it a step further (I am not sure how) and they are getting better performance than the underlying GEMM. Their benchmarks indicate that they are faster than MKL (https://bitbucket.org/blaze-lib/blaze/wiki/Benchmarks#!row-major-matrixmatrix-multiplication).
I started similar experiments with BLAZE today and had a closer look at their internal implementation.  By default
they call an external BLAS backend.  On my machine I used Intel MKL.  But you are right, they also have
an internal implementation that can be used if no external BLAS is available.  I will publish the results on this page:
        http://www.mathematik.uni-ulm.de/~lehn/test_blaze/index.html
At the moment the benchmarks for the internal BLAZE implementation of the matrix-matrix product look
poor.  I asked Klaus Iglberger (the author of BLAZE) to check the compiler flags that I used.  So please don't take the
current results at face value.
Cheers,
Michael