From: Peter Dimov (pdimov_at_[hidden])
Date: 2007-08-23 10:35:11
Howard Hinnant wrote:
> On Aug 22, 2007, at 2:59 PM, Zach Laine wrote:
>
>> Could the folks who object to the
>> current design spell it out for me a bit more explicitly -- what in
>> the design is dangerous/inconvenient enough to throw away one or more
>> of the 4 goals?
>
> I would like to see an answer to Zach's question too. I do not know
> what the major objection is with the "current proposal". I only know
> that people are suggesting alternatives.
I deliberately only submitted an alternative for consideration instead of
poking holes in your design or argumentation. But if you insist...
I don't believe that my suggested alternative throws away any of the four
goals, hypothetical malicious vendor D notwithstanding. It is still possible
for the vendor to meet the goals. In addition, it adds:
Goal 5: the ability to control the level of checking globally;
Goal 6: ... without source changes or even recompilation.
This can help if hypothetical user D, lured by the "no overhead, fewer L1
cache misses!" slogan, uses unchecked<> as a matter of habit. "Checking is
for other people."
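To make Goal 6 concrete: since the checked condition already stores the
mutex pointer, the check itself can be gated by a process-wide flag read at
run time. A sketch of one possible mechanism (the environment variable and
the lk.mutex() accessor are illustrative assumptions, not part of the
proposal):

#include <cstdlib> // getenv, abort

inline bool condition_checking_enabled()
{
    // read once per process, e.g. from the environment;
    // flipping it needs neither source changes nor recompilation
    static const bool enabled = std::getenv( "CONDITION_CHECKING" ) != 0;
    return enabled;
}

template< class Mutex > class condition
{
    Mutex * pm_;

public:

    explicit condition( Mutex * pm = 0 ): pm_( pm ) {}

    template< class Lock > void wait( Lock & lk )
    {
        if( pm_ != 0 && condition_checking_enabled()
            && lk.mutex() != pm_ )
        {
            std::abort(); // lock does not own the associated mutex
        }

        // ... block on the underlying condition variable ...
    }
};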
On reflection though, I'll change the constructor from

    explicit condition( Mutex * pm = 0 );

to

    explicit condition( Mutex * pm );

as it's too easy to accidentally disable checking by not including the
condition in the member init list.
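The failure mode, in a sketch (class X and its members are illustrative):

class X
{
    mutex mx_;
    condition<mutex> cond_; // meant to be tied to mx_

    X() // oops: cond_ omitted from the init list; with pm = 0 as a
        // default this compiles quietly and checking is silently off
    {
    }

    // intended: X(): cond_( &mx_ ) {}
    // without the default argument, the omission is a compile error
};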
Going back to:
class shared_mutex
{
    typedef mutex mutex_t;
    typedef condition< unchecked<mutex_t> > cond_t;

    mutex_t mut_;   // protects state_
    cond_t gate1_;  // waited on only under contention
    cond_t gate2_;
    unsigned state_;
    ...
and the L1 cache miss argument for:
class A
{
    shared_mutex mx_;
    ...
};

vector<A> v;
1. The size of shared_mutex, according to your numbers, is 104 bytes. If we
assume 16 bytes of state in A, this makes sizeof(A) 120. The addition of two
pointers makes it 128. This is a 7% increase, but it also happens to round
the size of A up to 128, so an A never straddles a (64-byte) cache line; the
"more bloated" version will actually be faster. Note that I picked the
number 16 before realizing that. :-) If A had 24 bytes of state, the two
pointers would of course be detrimental.
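The arithmetic, spelled out (a sketch; the 104-byte figure comes from your
numbers, and the pointers are assumed to be 4 bytes each, consistent with
the totals above):

// mocked-up layouts, just to check the sizes
struct shared_mutex_now     { char data_[ 104 ]; };                // 104
struct shared_mutex_checked { char data_[ 104 ];
                              void * p1_; void * p2_; };           // 112

struct A_now     { shared_mutex_now mx_;     char state_[ 16 ]; }; // 120
struct A_checked { shared_mutex_checked mx_; char state_[ 16 ]; }; // 128

// with 64-byte cache lines, 128 = 2 * 64, so an A_checked in an array
// never straddles a line boundary; a 120-byte A_now periodically does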
2. I'm having a hard time imagining a program where the L1 cache misses due
to the increased size of A would matter. An object of type A holds a mutex
and two condition variables, a not insignificant amount of kernel resources,
so it would be hard to keep enough A's in memory for the +7% L1 misses to
show up.
3. What is sizeof(pthread_rwlock_t) on your platform? Is it not something
like 48-56 bytes? That's half the size of the above shared_mutex, so users
who are conscious of L1 cache misses will not use shared_mutex anyway.
4. The vector v would need to be protected by its own rwlock as well, since
you can't reallocate it while someone is accessing the A's; this creates a
central bottleneck. vector< shared_ptr<A> > does not suffer from this
problem, as you can reallocate it while someone holds a reference to an
element in the form of a shared_ptr.
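A sketch of the difference (the thread annotations are illustrative):

vector<A> v;                 // reader: A & a = v[ i ];
                             // writer: v.push_back( ... );
                             // reallocation invalidates a, so every access,
                             // even through an existing reference, must go
                             // through the central rwlock

vector< shared_ptr<A> > v2;  // reader: shared_ptr<A> p = v2[ i ];
                             // (copied out under a brief lock)
                             // writer: v2.push_back( ... );
                             // reallocation moves the shared_ptrs,
                             // but *p is unaffected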
5. If optimizing the L1 cache friendliness of shared_mutex is an important
goal, I would consider moving the conditions to the heap as they aren't
accessed on the fast path.
class shared_mutex
{
    mutex mx_;
    unsigned state_;
    void * rest_;   // the conditions, heap-allocated, off the fast path
};
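Here rest_ would point to the cold part, allocated once in the constructor
(a sketch; the struct and its members are my assumptions):

struct shared_mutex_rest
{
    condition< unchecked<mutex> > gate1_;
    condition< unchecked<mutex> > gate2_;

    explicit shared_mutex_rest( mutex * pm ): gate1_( pm ), gate2_( pm )
    {
    }
};

// in shared_mutex's constructor:
//     rest_ = new shared_mutex_rest( &mx_ );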
I know that your actual shared_mutex will use atomics on state_, so you
might even be able to make
class shared_mutex
{
    unsigned state_; // , state2_?
    void * rest_;
};
work. Uncontended access now doesn't touch *rest_, and L1 cache misses are a
fraction of what they used to be. Contended access does cause more cache
misses, but this is overshadowed by the cost of the contention itself.
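For illustration, the uncontended path could then look something like this
(a sketch; atomic_load/atomic_cas stand in for whatever primitives the
implementation uses, and write_entered_ for the writer bit in state_):

bool shared_mutex::try_lock_shared_fast()
{
    unsigned st = atomic_load( &state_ );

    if( st & write_entered_ ) return false; // writer active or pending

    // claim a reader slot; only state_ is touched, *rest_ stays cold
    return atomic_cas( &state_, st, st + 1 );
}

void shared_mutex::lock_shared()
{
    if( try_lock_shared_fast() ) return; // common case, one cache line

    lock_shared_slow(); // contended path: goes through *rest_ and
                        // waits on the gate conditions
}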