Subject: Re: [boost] [flat_map] Any interest in a flat_map variant that is much faster on insertion/erase at the price of a slower lookup?
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2015-03-22 12:40:26


Giacomo Drago wrote:
> On 2015-03-22 14:41, Phil Endecott wrote:
>> There is an in-memory B-tree at google code that looks decent, though
>> I've not personally used it:
>>
>> http://code.google.com/p/cpp-btree/
>>
>> If you click "Wiki pages - UsageInstructions" at the bottom left of
>> that page you'll find tables giving the memory overhead. A B-tree has
>> no per-item pointers, but it does have per-page pointers; you can
>> asymptotically approach zero overhead by increasing the page size.
>> With an infinite page size you effectively have a flat_map, with
>> a page size of one you have a binary tree.
>>
>> In practice, a more important influence on memory footprint than the
>> per-page pointers is the empty space in the pages that results from
>> splitting after insertion, and after deletion but before merging.
>> But a flat_map implemented on a std::vector has a similar issue, i.e.
>> that std::vector doubles its size when it re-allocates. So a
>> std::vector will be wasting up to half of the memory that it has
>> allocated.
>
> On a vector you can easily shrink_to_fit, thus bringing the overhead to
> zero (if you neglect the vector's size field and the pointer to its buffer).

And you can compact ("vacuum") a B-tree, if you want.

> I still think a flat_map (not necessarily this
> flat_map in particular) has a reason to exist.

For me, the main uses of flat data structures have been:

(a) When it's important to have no pointers at all, so that I can store
the data in a memory-mapped file or perhaps in interprocess shared
memory. (Though most recently I decided to use offset pointers from
Boost.Interprocess along with a Boost.Intrusive map, and that worked
well.)

(b) When the data has a "life cycle" with distinct phases, e.g. an
initial phase of bulk creation followed by a phase of random lookup.
In this case you only sort once, so the computational complexity is
optimal.

I would not want to use them either when you have fine-grained mixing
of insertion and lookup, or when the amount of data is large enough
that the poor locality of reference starts to dominate.

Regards, Phil.