Subject: Re: [boost] [lockfree] Review
From: Helge Bahmann (hcb_at_[hidden])
Date: 2011-08-08 06:41:33
On Monday 08 August 2011 10:59:47 Grund, Holger wrote:
> > > Agreed, this is not impossible, but I still tend to think we should
> > > strive for a more efficient implementation if at all possible.
> >
> > Where do you see room for improvement? It is a fallacy to assume that
> > "most efficient implementation" always means "there is a machine
> > instruction providing a 1:1 translation of my high-level construct".
> > Look at this from the POV of cache synchronisation cost (which is the
> > real cost, not the number of instructions), and you will realize that
> > there is not much you can do (assuming you can squeeze the data copies
> > as well as the sequence counter into the same cacheline).
> >
> > This approach BTW is already way faster than e.g. using a 64-bit mmx
> > register and paying the cost of mmx->gpr transfers on x86.
>
> That doesn't match my experience. Even in the noncontended case, I would be
> very surprised to see anything "way faster".
mmx -> gpr has quite significant latency; don't forget that you need to
shuffle around a fair bit, plus the cost of the eventual emms.
Also don't forget that moving mmx -> gpr defeats a large portion of the CPU's
out-of-order and speculative execution capability: CPUs cannot in general
track dependencies across different register classes.
> However, under any kind of contention I do expect the MMX MOVQ version to
> be significantly faster.
Assuming that you manage to put everything into a single cache line, I doubt
that you will see any difference at all under contention: the real cost is
the cache line transfer, and transferring a single one has a latency of ~150
cycles (and that is unlikely to decrease with modern CPUs). After
that, the number of bytes read out of the cache line is basically not
measurable anymore.
> And of course, 64 bits is less than 32 + 2 * 64.
It would rather be 32 + 64 + 32 (there is no point in reading the "inactive"
copy, but the processor can certainly read it speculatively).
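
To make the byte counting concrete, here is a minimal sketch of the kind of
scheme under discussion: a 32-bit sequence counter plus two copies of the
64-bit value, all in one cache line. It is only one plausible reading of
"32 + 64 + 32" (counter, active copy, counter re-read). The name Versioned64,
the 64-byte line size, the single-writer restriction and the use of default
(sequentially consistent) atomics are all assumptions made for illustration;
this is not code from Boost.Lockfree or Boost.Atomic.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical illustration: emulate a 64-bit atomic load with 32-bit accesses.
// Assumes a 64-byte cache line and exactly ONE writer thread.
struct alignas(64) Versioned64 {
    std::atomic<std::uint32_t> seq{0};  // low bit selects the active copy
    std::atomic<std::uint32_t> lo[2];   // low halves of the two copies
    std::atomic<std::uint32_t> hi[2];   // high halves of the two copies

    Versioned64() {
        for (int i = 0; i < 2; ++i) { lo[i].store(0); hi[i].store(0); }
    }

    // Writer: fill the inactive copy, then publish it by bumping the counter.
    void store(std::uint64_t v) {
        std::uint32_t s = seq.load();
        std::uint32_t next = (s + 1) & 1u;
        lo[next].store(static_cast<std::uint32_t>(v));
        hi[next].store(static_cast<std::uint32_t>(v >> 32));
        seq.store(s + 1);
    }

    // Reader: counter (32 bits), active copy (64 bits), counter again (32 bits).
    // The inactive copy is never read explicitly.
    std::uint64_t load() const {
        for (;;) {
            std::uint32_t s1 = seq.load();
            std::uint32_t idx = s1 & 1u;
            std::uint32_t l = lo[idx].load();
            std::uint32_t h = hi[idx].load();
            std::uint32_t s2 = seq.load();
            if (s1 == s2)  // no publish in between, so l and h belong together
                return (static_cast<std::uint64_t>(h) << 32) | l;
            // Otherwise the writer moved on; retry. Counter wrap-around
            // (2^32 publishes during one read) is ignored in this sketch.
        }
    }
};

// Everything a reader touches sits in one (assumed 64-byte) cache line.
static_assert(sizeof(Versioned64) == 64, "expected to occupy one cache line");
static_assert(alignof(Versioned64) == 64, "expected cache-line alignment");
```

The default memory ordering keeps the sketch easy to reason about; weaker
orderings are possible but need explicit fences on both sides. Either way the
reader pulls in a single cache line, which is the cost that dominates under
contention.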
Helge