$include_dir="/home/hyper-archives/geometry/include"; include("$include_dir/msg-header.inc") ?>
Subject: Re: [geometry] Storage allocator questions
From: Tim Finer (tim.finer_at_[hidden])
Date: 2014-11-29 14:03:46
Hi Henry,
On 11/29/14, 10:48 AM, Henry Roeland wrote:
> Hi Adam,
>
> It would be greet if I can focus on storage allocator and leave the 
> rtree part to you or somebody else.
>
> I hope I can still ask you guys questions concerning this storage 
> allocator? I have some global Idea's and probably need help both on 
> design level and implementation.
>
>
> Questions so far:
>
> 1. Is paging the only mechanism to get memory growth under control for 
> an rtree? Its the only one I can think of, but maybe there are other 
> ways to explore?
>
>
> 2. Is there any way an rtree can be read-only? Or set read-only after 
> (bulk)insert?
>
>
> 3. One problem/challenge is how to get and keep region(box) 
> coordinates "divided" over the page nodes. At least thats my 
> understanding of an (r)tree. You need the region coordinates as fast 
> lookup ID to know if a query should "dive" into the page or its sub 
> pages or not.
> Crazy idea: Maybe the allocator should manage an rtree of pages with 
> references/pointers to the "real" tree nodes?
>
> Questions:
>
>   * Is my understanding of data lookup right? So coordinates determine
>     the location of the data inside the RTree?
>   * If this is true does this then mean that all Pages (Loaded or
>     unloaded) must be known including its region coordinates?
>   * This in order to get/keep the tree balanced and know which page
>     must be loaded (by coordinate query)?
>
>
> Lets discuss the different actions an storage allocator can/must handle:
>
>   * Load in Page data on request by RTree
>       o "Load" can mean anything: From File, From DB, From
>         MessageQueue,...
>       o "Request" means query hits coordinate region that matches the
>         Page coordinates
>   * Bulk Load multiple pages at once???
>
>
>   * Unload Page data on request by Storage Manager itself(?)
>       o "Unload" can mean:
>           + To /dev/null :-) Only useful when read-only rtree
>           + To File: Memory mapped or otherwise
>           + To DB: Save data to (spatial) database
>           + To MessageQueue
>           + ???
>       o "Request" can mean:
>           + Smart_pointer(s) say that nodes inside page are not used
>             anymore
>           + Smart_pointer(s) say that page is not used anymore
>           + Amount of usage(hit count) is small
>           + ???
>   * Bulk Unload multiple pages at once???
>
>
> NOTE: The storage allocator must, as Adam already pointed out, deliver 
> an interface to make above points possible. But not implement them in 
> anyway.
>
> Other questions/points that come to my mind are:
>
>   * Must the storage allocator always store? What about transactions
>     between in memory (RTree) and Persistence storage?
>   * State-full allocator: Who owns the nodes and the data inside it? I
>     always thought of an allocator like a factory that does not own
>     the data...
>   * When data is owned by another STL(like) container then IMHO a
>     storage allocator has no use. This because the storage allocator
>     then does not own the data and has no means to free/unload it.
>
>
>
> Above is going top down by requirements/features but code-wise I will 
> try to handle in my next e-mail.
>
> Thanks for your time,
>
> Kind regards,
>
> Henry Roeland
>
>
>> On 26 nov. 2014, at 22:12, Adam Wulkiewicz <adam.wulkiewicz_at_[hidden] 
>> <mailto:adam.wulkiewicz_at_[hidden]>> wrote:
>>
>> Hi Henry,
>>
>> Henry Roeland wrote:
>>>
>>>> On 26 nov. 2014, at 17:48, Adam Wulkiewicz 
>>>> <adam.wulkiewicz_at_[hidden] <mailto:adam.wulkiewicz_at_[hidden]>> wrote:
>>>>
>>>> Henry Roeland wrote:
>>>>> Dear all,
>>>>>
>>>>> First let me introduce myself: my name is Henry Roeland 37 years 
>>>>> old and currently working as software engineer for a company that 
>>>>> delivers maps to car companies like BMW and VW.
>>>>> I'm working in the testtools team that is responsible for a map 
>>>>> viewer and maptest batch tool.
>>>>> Technics that we use are C/C++, Sqlite, spatialite and of corse boost.
>>>>>
>>>>> Personally I have multiple years of C++ experience and busy 
>>>>> getting up to speed with geometry technics in 2D. NeXT to that 
>>>>> just started with studying advanced datastructures and algoritms 
>>>>> like b+tree and r*tree.
>>>>>
>>>>> Currently I see your rtree as a perfect candidate for in memory 
>>>>> container/index/cache in our map interface. As I wrote in my 
>>>>> previous email the feature that is not supported yet is paging. 
>>>>> How can I help to accomplish this feature? Are there specs? Or am 
>>>>> I the only one in favior of this feature?
>>>> Certainly not the only one. People are asking about it from time to 
>>>> time, using Interprocess' mapped file as a replacement. You're the 
>>>> only one who is willing to help with the implementation. I greately 
>>>> appreciate it.
>>>>
>>>>
>>>> As I mentioned before I was thinking about a stateful 
>>>> allocator/storage/manager handling file or files/pages, cacheing 
>>>> etc. This allocator would return pointers which when dereferenced 
>>>> would force the allocator to load and cache data. Then, there would 
>>>> be a mechanism to tell the allocator that some pointers/data aren't 
>>>> used anymore. It could be done explicitly by calling some function 
>>>> on allocator like:
>>>>
>>>> persistent_allocator alloc(...);
>>>> rtree<..., persistent_allocator> rt(..., alloc);
>>>>
>>>> rt.insert(...);
>>>> rt.insert(...);
>>>> rt.insert(...);
>>>> alloc.flush();
>>>>
>>> I don't know about an rtree but from my homework:-) concerning 
>>> b-trees I understood that pages can be handled as nodes themselves. 
>>> Pages can be split and contain multiple nodes depending on size. If 
>>> we could create a stable interface for Page(Nodes) then this 
>>> interface can become public leaving the Node interface hidden???
>>> It can also have other advantages to alloc a group of nodes (Page) 
>>> together in terms of memory alignment(Your don't need a smart 
>>> pointer for every node only for the page?), Also in terms of 
>>> (de)serialisation and bulk load handling a group of nodes would IMHO 
>>> be better (if its possible).
>>
>> Of course you're right, nodes could (or even should) be grouped in 
>> pages, loaded together etc. Also we should probably group close nodes 
>> together in one page (e.g. parents with children), allocator hint 
>> could be used for this. But I was thinking about leaving the desicion 
>> how to store nodes to the allocator/storage. As you said one could 
>> use big pages, one could store 1 node per file, or a database etc. 
>> And still somehow a page or some other storage or Archive must know 
>> what data should be saved/loaded per node. In my scenario this would 
>> just be implemented in a Serialization-like way. Higher level aspects 
>> like pages would be closed in allocator/storage. Of course assuming 
>> that this could be done this way but I feel that it could.
>>
>> I think that a pointer still would have to store some ID of a node 
>> inside a page to identify an exact placement of the data. Or at least 
>> some global ID somehow indexed to access a specific node in a 
>> specific page.
>>
>>>
>>>> The next step could be adding of some hooks or traits for 
>>>> allocators which would be called by the rtree internally on the end 
>>>> of some operation, e.g. insert. So this wouldn't be a C++ allocator 
>>>> anymore since it'd have additional methods.
>>>>
>>>> To avoid storing a reference to allocator/storage in those smart 
>>>> pointers the rtree could notify the allocator/storage when a 
>>>> pointer is dereferenced and data is required. So the responsibility 
>>>> for the notification of allocator/storage would be moved from a 
>>>> pointer to the rtree.
>>>>
>>> The Page Node can administrate its size and its free size and have a 
>>> smart_pointer telling when it was last used and dependency_checker 
>>> /unload checker with responsibility for notification. Maybe we can 
>>> use Boost Signals2 for notification?
>>>
>>>
>>> Together with a Page/Storage manager I'm thinking about the 
>>> following points:
>>> - Policy for when to load/unload. traits as you point it out.
>>> - Size of rtree must be known at all times. If some limit is reached 
>>> start unloading (allocator can tell the size?) to bottom pages that 
>>> are less used?
>>> - 25%/50%/75% percent of the top of the tree must remain in memory 
>>> for speed???
>>> - define 3 actions/states for a page: loaded, unloaded, unloaded but 
>>> cached as memory map.????
>>
>> So what you're describing here is how a paging allocator/storage 
>> could work. Sure if the storage was notified by the rtree about the 
>> performed actions it'd be able to do whatever it needs, cache pages, 
>> store some kind of usage counter to know which pages are used most 
>> often, support various policies, etc.
>>
>> I was describing more low level stuff that would have to be 
>> implemented on the rtree side. Basically 2 things notifications for 
>> actions and an interface for serialization of nodes and rtree header. 
>> Or would you prefer if I handled it, that you could focus on the 
>> implementation of a specific allocator/storage?
>>
>> There is also another thing that is mentioned in another thread 
>> (started by Tim). There should be some way of creation non-rtree, 
>> temporary pages, e.g. for data stored in a vector or some other 
>> container designed specifically for this purpose. But this could be 
>> handled later.
>>
>>>
>>>> Finally the rtree could notify the allocator/storage each time the 
>>>> data owned by a pointer is no longer needed but this probably 
>>>> wouldn't change anything. Furthermore modifying and releasing the 
>>>> nodes before an operation is finished wouldn't be safe in case when 
>>>> an exception was thrown. In fact, this allocator should keep a copy 
>>>> of nodes in memory during a whole operation and then at the end of 
>>>> the operation save the data into a persistent storage.
>>>>
>>>>
>>>> The next thing is how to serialize data from and to node. We should 
>>>> ask ourselves should we allow the users to implement other 
>>>> persistent storage variants on their own. If the answer was no, 
>>>> then we could just internally in the persistent allocator use the 
>>>> internals of the rtree. But if we wanted to allow it (which IMO 
>>>> should be done) the most straightforward way would be to expose the 
>>>> interface used by the rtree internally to access node's elements. 
>>>> Exposing it would mean that we couldn't change it in the future, so 
>>>> I'd prefer to avoid it. Instead, we could mimic or directly use 
>>>> Boost.Serialization interface for Nodes. In this scenario a 
>>>> specific Node would know what data must be saved and loaded and 
>>>> it'd be able to save or load to/from some generic Archive. 
>>>> Depending on the needs this could be some temporary 1-node Archive 
>>>> gathering data for later or a target Archive capable to serialize 
>>>> data on the fly, it'd depend on an allocator/storage.
>>>>
>>>> This way we could also support versioning on a node level, the same 
>>>> way how it's done in Serialization. So changes would have to be 
>>>> done on a nodes level not in the user-defined allocator. An example 
>>>> could be an implementation of a rtree variant storing additional 
>>>> data in nodes (hilbert r-tree) or additional types of nodes 
>>>> (PRtree). Also arbitrary user-defined Values would be serialized 
>>>> the same way (using Serialization or familiar Serialization-like 
>>>> interface).
>>>>
>>>> And this way we'd also support Serialization in one go.
>>>>
>>>>
>>>> We'd probably need some file header with the parameters of an rtree 
>>>> both in persistent storage and in serialization support. Similar as 
>>>> with nodes, an rtree could know how to load/save the header from/to 
>>>> some Archive. The rtree should e.g. check if it's possible to use 
>>>> persistent storage and load data, e.g. if it wasn't created using 
>>>> incompatible parameters, etc.
>>>>
>>>>
>>> How is it with backwards compatibility? I can imagine that Storage 
>>> work can break something already running?
>>
>> I think that this won't be problematic.
>>
>>> If I'm talking rubbish just tell me I can handle it:-)
>>>
>>
>> All ideas are valuable :)
>>
>>> Thanks for beneath info to getting me started. I will have to read 
>>> it a few times to understand it all.
>>
>> If you have any questions just ask. If I handled it on the rtree side 
>> and you was working on the allocator/storage, you probably wouldn't 
>> even be forced to know all of this internal stuff. Not that I have 
>> something against it :)
>>
>> Regards,
>> Adam
I have a request that the allocator be the sole source of allocation.  
One of my problems with the current implementation is that one of the 
constructors creates a temporary vector of centroids as a source for a 
packing algorithm.  My understanding is that the allocator needs to be 
generic enough to offer up a temporary buffer (something similar to 
std::get_temporary_buffer) so that the packing algorithm can take 
advantage of it without going to the heap.
Best,
Tim