Subject: [boost] [Multiprecision] Benchmarking
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2012-06-14 11:40:33


Dear All,

I have attempted to evaluate the proposed multiprecision library by
applying it to the "Delaunay Flip" function for 32-bit integer
coordinates, as this is one of the few things that I've ever
implemented that needed more than 64-bit integers:

inline bool delaunay_test(int32_t ax, int32_t ay, int32_t bx, int32_t by,
                           int32_t cx, int32_t cy, int32_t dx, int32_t dy)
{
   // Test whether the quadrilateral ABCD's diagonal AC should be flipped to BD.
   // This is the Cline & Renka method.
   // Flip if the sum of the angles ABC and CDA is greater than 180 degrees.
   // Equivalently, flip if sin(ABC + CDA) < 0.
   // Trig identity: cos(ABC) * sin(CDA) + sin(ABC) * cos(CDA) < 0
   // We can use scalar and vector products to find sin and cos, and simplify
   // to the following code.
   // Numerical robustness is important. This code addresses it by performing
   // exact calculations with large integer types.

   i64 ax64 = ax, ay64 = ay, bx64 = bx, by64 = by,
       cx64 = cx, cy64 = cy, dx64 = dx, dy64 = dy;

   i64 cos_abc = (ax64-bx64)*(cx64-bx64) + (ay64-by64)*(cy64-by64);
   i64 cos_cda = (cx64-dx64)*(ax64-dx64) + (cy64-dy64)*(ay64-dy64);
   if (cos_abc >= 0 && cos_cda >= 0) return false;
   if (cos_abc < 0 && cos_cda < 0) return true;

   i64 sin_abc = (ax64-bx64)*(cy64-by64) - (cx64-bx64)*(ay64-by64);
   i64 sin_cda = (cx64-dx64)*(ay64-dy64) - (ax64-dx64)*(cy64-dy64);

   i128 sin_sum = static_cast<i128>(sin_abc)*cos_cda
                + static_cast<i128>(cos_abc)*sin_cda;

   return sin_sum < 0;
}

As you can see, with 32-bit inputs, the first set of computations yields
64-bit results, and about half of the calls terminate at this stage.
Otherwise a calculation yielding a 128-bit result is carried out. In
practice, the results of the subtractions are typically small and the
128-bit value rarely has more than 64 significant bits.

I have compared the following types used for i64 and i128 in the code
above (a typedef sketch for one configuration follows the list):

(1) int64_t and int64_t (which gives WRONG ANSWERS, but is useful for comparison)
(2) int64_t and my own "quick and dirty" int128_t.
(3) cpp_int.
(4) int64_t and cpp_int.
(5) cpp_int without expression templates.
(6) int64_t and cpp_int without expression templates.
(7) int64_t and mp_int128_t.
(8) mpz_int.
(9) int64_t and mpz_int.
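
For reference, each numbered configuration just swaps the two typedefs
used by delaunay_test. A sketch of configuration (7), assuming the
header path and type name of the version under review:

#include <stdint.h>
#include <boost/multiprecision/cpp_int.hpp>

typedef int64_t i64;                              // first-stage arithmetic
typedef boost::multiprecision::mp_int128_t i128;  // 128-bit second stage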

Mean times per function call are as follows:

(1) 39.3 ns
(2) 61.9 ns
(3) 513 ns
(4) 103 ns
(5) 446 ns
(6) 102 ns
(7) 102 ns
(8) 2500 ns
(9) 680 ns

(AMD Athlon(tm) II X2 250e Processor @ 3 GHz; g++ 4.6.1.)

Note that:
- GMP is slowest, despite oft-repeated claims about its good performance.
- Expression templates make the code slower.
- Even the fastest Boost.Multiprecision type, mp_int128_t, is
significantly slower than my own quick hack.

This is all a bit disappointing. Perhaps the problem is that I'm
dealing only with 128 bits and the intended application is for much
larger values?

I am reminded of my experience with xint last year. See e.g. this thread:

     http://thread.gmane.org/gmane.comp.lib.boost.devel/215565

The current proposal seems much better - xint was hundreds of times
slower than my ad-hoc code - but I still find it discouraging. I
suspect that the performance difference may be attributable to the use
of sign-magnitude representation; as a result, the implementation is
full of sign-testing like this:

template <unsigned MinBits, bool Signed, class Allocator>
inline void eval_subtract(cpp_int_backend<MinBits, Signed, Allocator>& result,
                          const signed_limb_type& o)
{
    if(o)
    {
       if(o < 0)
          eval_add(result, static_cast<limb_type>(-o));
       else
          eval_subtract(result, static_cast<limb_type>(o));
    }
}

I think I understand the motivation for sign-magnitude for
variable-length representation, but I don't see any benefit in the case
of e.g. mp_int128_t.
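
For a fixed-width type, a two's-complement representation would reduce
the same operation to one uniform subtract-with-borrow, with no
add-or-subtract dispatch. Here is a minimal sketch, using a hypothetical
fixed_int128 of my own devising, not anything from the proposal:

#include <stdint.h>

struct fixed_int128
{
   uint64_t lo;
   uint64_t hi;   // two's complement: the top bit of hi is the sign

   void subtract(int64_t o)
   {
      // Sign-extend o to 128 bits: the high word is all-ones iff o < 0.
      uint64_t o_lo = static_cast<uint64_t>(o);
      uint64_t o_hi = o < 0 ? ~static_cast<uint64_t>(0) : 0;
      // One subtract-with-borrow, whatever the sign of o:
      uint64_t borrow = lo < o_lo ? 1 : 0;
      lo -= o_lo;
      hi -= o_hi + borrow;
   }
};

The sign extension is straight-line arithmetic, so nothing corresponds
to the branch between eval_add and eval_subtract above.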

I also wonder about the work involved in e.g. a 64 x 64 -> 128-bit
multiplication. Here is my version, based on 32x32->64 multiplies:

inline int128_t mult_64x64_to_128(int64_t a, int64_t b)
{
   // Make life simple by dealing only with positive numbers.
   // (Caveat: a or b equal to INT64_MIN would overflow on negation;
   // this quick hack ignores that case.)
   bool neg = false;
   if (a<0) { neg = !neg; a = -a; }
   if (b<0) { neg = !neg; b = -b; }

   // Divide input into 32-bit halves:
   uint32_t ah = a >> 32;
   uint32_t al = a & 0xffffffff;
   uint32_t bh = b >> 32;
   uint32_t bl = b & 0xffffffff;

   // Long multiplication, with 64-bit temporaries:

   //              ah al
   //         *    bh bl
   //         ----------
   //              al*bl       (t1)
   //      +    ah*bl          (t2, shifted left 32 bits)
   //      +    al*bh          (t3, shifted left 32 bits)
   //      + ah*bh             (t4, shifted left 64 bits)
   //      -------------

   uint64_t t1 = static_cast<uint64_t>(al)*bl;
   uint64_t t2 = static_cast<uint64_t>(ah)*bl;
   uint64_t t3 = static_cast<uint64_t>(al)*bh;
   uint64_t t4 = static_cast<uint64_t>(ah)*bh;

   int128_t r(t1);
   r.high = t4;
   r += int128_t(t2) << 32;
   r += int128_t(t3) << 32;

   if (neg) r = -r;

   return r;
}
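
The four partial products fall straight out of writing a = ah*2^32 + al
and b = bh*2^32 + bl, so that

   a*b = ah*bh*2^64 + (ah*bl + al*bh)*2^32 + al*bl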

So I need four 32x32->64 multiplies to perform one 64x64->128 multiply.

But we can't write this:

   int64_t a = ..., b = ...;
   int128_t r = a * b;

because here a*b is a 64x64->64 multiply; instead we have to write
something like

   r = static_cast<int128_t>(a)*b;

so that our operator* overload is found; but this operator* doesn't
know that its int128_t argument is actually a promoted int64_t, and it
has to do a 128x64->128 or 128x128->128 multiply, requiring 8 or 16
32x32->64 multiplies. The only chance of avoiding the extra work is if
the compiler can do constant-propagation, but I have my doubts about that.
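
The escape hatch is to give up operator syntax and name the mixed-width
operation directly; with the mult_64x64_to_128 above, the wide step of
the Delaunay test could be written, for example, as:

   int128_t sin_sum = mult_64x64_to_128(sin_abc, cos_cda);
   sin_sum += mult_64x64_to_128(cos_abc, sin_cda);
   return sin_sum < 0;

but that rather defeats the point of having natural operator syntax in
the first place.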

Any thoughts?

My test code is here: http://chezphil.org/tmp/boost_mp_review.tbz2

$ make
$ cat nn100000.dat | benchmark

(Note that this needs a fix to the resize() issue reported earlier in
order to get accurate answers, but that doesn't much alter the
execution time.)

Regards, Phil.