Subject: Re: [boost] [lockfree::fifo] Review
From: Tim Blechmann (tim_at_[hidden])
Date: 2009-12-05 09:27:14
thanks for your comments!
> First of all I think that having a library of lock-free 
> algorithms/containers would be greatly beneficial to a great number of 
> boost users. On the other hand, having implemented several lock-free 
> algorithms, I know that it is notoriously hard to get them right. We'll 
> need all the help we can get from knowledgeable people in the field. The 
> implementation needs to be thoroughly reviewed. Let me give the first 
> attempt at that.
i know how difficult it is to get lock-free data structures right, and
several issues have already been found and fixed by people reviewing my code ...
> Second, this algorithm makes certain assumptions about the ability to 
> access memory that has been freed. The algorithm can try to read memory 
> that has been de-allocated. (See further down for description of how). 
> Several debug implementations of malloc/free store every allocated 
> element on one page and when the element is freed, the page is marked 
> protected for both reading and writing. That means that this algorithm 
> can only work with a subset of custom allocators. Your "free_list" for 
> example can't work with it since it's possible for a node to be freed 
> when you call "deallocate". The "caching_freelist" and "static_freelist" 
> on the other hand work just fine because they never return the memory 
> back to the system.
the free_list class should not be used for the stack/fifo classes (i
should remove it to avoid confusion). i chose the freelist approach,
since hazard pointers and pass-the-buck are or may be subject to patents.
> Third, and the biggest limitation, is the algorithm's assumption that 
> you can store a pointer to an element plus an ABA tag in a contiguous 
> memory space that can be passed to a CAS instruction. I believe not all 
> platforms provide a DCAS instruction, that's why you're trying to work 
> around this issue by using "packed pointers", i.e. storing the ABA tag 
> in the part of the pointer that presumably is never used. The problem 
> with that, however, is that some libc implementations of malloc/free do 
> use all the bits in the pointer for their own needs, so you can't assume 
> that you can meddle with them.
puh, can you elaborate on this? i wasn't aware of any malloc
implementation making use of the unused address-space bits.
> My experience with this algorithm shows that the safest approach is to 
> use a pre-allocated array for the queue and replace pointers with 
> indexes into the array. Assuming that the size of an index is 
> sufficiently lower than the size of a pointer, you can use the remaining 
> bits to store the ABA tag. This alleviates both point 2 and 3. This does 
> mean however, that the queue can't have an unbound size, the size has to 
> be specified during the queue instantiation.
that would greatly improve portability ... it would even be
possible to use e.g. 16-bit indices and 16-bit tags on 32-bit platforms
without dcas
> Fourth, on platforms where CAS is implemented using LL/SC (such as 
> PowerPC and MIPS), it is possible to livelock. The problem with LL/SC on 
> most platforms is that the write granulum for marking LL's address as 
> "dirty" isn't just the memory pointed to by LL, but the whole cache line 
> where the memory resides. On these platforms it is important to make 
> sure that each "node" of the queue is located on a separate cache line.
interesting point! i should handle the size and alignment of the
nodes on these platforms.
> Here's the code of your enqueue:
> 
> 
>      bool enqueue(T const & t)
>      {
>          node * n = alloc_node(t);
> 
>          if (n == NULL)
>              return false;
> 
>          for (;;)
>          {
>              atomic_node_ptr tail (tail_); //1
>              read_memory_barrier(); //2
>              atomic_node_ptr next (tail->next); //3
> 
>              if (likely(tail == tail_)) //4
>              {
>                  if (next.get_ptr() == 0) //5
>                  {
>                      if ( tail->next.cas(next, n) )//6
>                      {
>                          tail_.cas(tail, n);
>                          return true;
>                      }
>                  }
>                  else
>                      tail_.cas(tail, next.get_ptr());
>              }
>          }
>      }
> 
> 
> 1. You're missing write barrier after a call to alloc_node().
> Rational: alloc_node sets the "next" pointer of the newly allocated node 
> to NULL. That is a store to shared memory. Setting next pointer to NULL 
> has to be observable to other threads BEFORE they observe the inserted 
> node. If you don't insert a write barrier, it is possible for other 
> threads to see the new element in the list before element's next pointer 
> is initialized with NULL.
i was assuming that cas issues a full memory barrier (at least my cas
implementation does), so i don't think this would be an issue.
> 2. You've misplaced the read_barrier.
> Rational: Line 4 checks the validity of the load done on line 3. That 
> means that the read_barrier should be between lines 3 and 4. If you 
> don't have a read_barrier after loading *tail->next, then this load can 
> happen after the load of tail_ on line 4, which defies the purpose of 
> line 4.
> You still need to have a barrier between lines 1 and 3, but here only a 
> data dependency barrier will suffice, which is a weaker form of a read 
> barrier and on most all platforms is a no-op since data dependent loads 
> can be handled by processors automatically.
thanks for this explanation ...
> 3. Your tail_ and head_ should be volatile. Furthermore, the getters for 
> you tagged_ptr (like get_ptr, operator->() and operator*  ) should 
> return pointers/references to volatile types.
> Rational: If your tail_ is not volatile, then the compiler is free to 
> load head_ just once and line 4 effectively becomes a no-op.
this was already pointed out a few days ago and is fixed in my repository.
> 1. Again, you need to put read_barrier between lines 4 and 5. Only data 
> dependency barrier will suffice at line 2.
> 2. You need to add write barrier after line 7 for the same reason as in 
> the enqueue.
same as above
> Hope this was descriptive,
i am currently porting the data structures to the boost.atomic proposal,
which provides better language support for specifying the memory order
constraints. the code is available from a separate branch of my
repository. i would be curious whether you could have a look at that
implementation as well.
thanks a lot for your review!
cheers, tim
--
tim_at_[hidden]
http://tim.klingt.org

There's no such entity as "most people". These are generalities.
All generalities are meaningless. You've got to pin it down to a
specific person doing a specific thing at a specific time and space.
  William S. Burroughs