From: Sebastian.Karlsson_at_[hidden]
Date: 2008-08-04 11:31:19
>>>> 1) In the overview section, performance is nowhere to be seen as a
>>>> goal, which for my use case is very important. If I were to use
>>>> the binary archive, how well would it perform in comparison to
>>>> a hand-crafted, optimized serialization approach? I've seen in
>>>> the examples that strings seem to be used to identify data;
>>>> won't this create a large overhead for both deserialization and
>>>> storage?
>>>
>>> Performance is a secondary goal that I have worked on, especially the
>>> serialization of large dense arrays. This is now as fast as any hand
>>> crafted approach. What data structures are you interested in?
>>>
>>> Matthias
>>
>> I'm reading an XML file into a custom tree data structure, parsing
>> the string representations into their correct types, stored as
>> boost::any. I'm hoping that deserialization using boost::serialization
>> will be considerably faster than using libxml2, which I use to parse
>> the XML file. The node data in this structure pretty much looks
>> like:
>>
>> std::vector< DataCollection > children; // Naturally, all the children of this node
>> std::string name; // This is the tag name in XML
>> boost::any value; // This is <b>value</b> in XML
>> std::map< std::string, boost::any > attributes; // Not entirely surprisingly, the attributes of the XML node
>>
>> The values stored in boost::any will be fairly lightweight, so I
>> would reckon that the majority of data read will actually be
>> std::string, for keys into the attributes as well as the name of the
>> node. So I guess I'm having a little bit of everything, hehe.
>>
>> Since I won't send this data over the network, and if I make a build
>> for another system I can just ship different data files, I'm more
>> interested in speed and the flexibility which boost::serialization
>> offers. I'd be very interested in your changes, Matthias.
>
> There are not many optimizations for XML files: most of the overhead is
> in parsing the strings. If you are interested in performance, a binary
> archive will always be faster than an XML one. Most of the
> optimizations for binary archives are already in Boost 1.35.
>
> I have a couple of questions:
>
> 1. why are your attributes a std::map< std::string, boost::any >  and
> not a std::map< std::string, std::string > ? How do you find out which
> type to use?
>
> 2. why is your value a boost::any? How do you know the type to use?
When I parse the XML file with libxml2, I have a list where the
different types have registered a regex filter, which is used to
find the real type. Let's say you have, for example,
<elem position="3 3 3">; that will match the vector3 filter and
construct a boost::any holding that vector3. I have a pretty neat
system running here where new types just need to register with
FilterList. My DataCollection then has a
Type& GetAttribute< Type >( const std::string& ), which basically
wraps the any_cast and asserts that the typeids match. This way I get
pretty decent type safety, and since the client knows what type to
expect, it works out in the end.
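
To make the scheme above concrete, here is a minimal sketch of how such a filter list and typed accessor might fit together. It is not the actual code: Vector3, the Register/Parse/SetAttribute members, MakeExampleFilters and the regex patterns are all made up for illustration, and std::any with std::regex (C++17) stand in for boost::any and whatever regex library is really used, so that the sketch has no external dependencies.

```cpp
#include <any>
#include <cassert>
#include <functional>
#include <map>
#include <regex>
#include <string>
#include <typeinfo>
#include <utility>
#include <vector>

// Hypothetical three-component vector, standing in for "vector3".
struct Vector3 { double x, y, z; };

// Hypothetical FilterList: each type registers a regex plus a converter
// that builds the typed value from the match.
class FilterList {
public:
    using Converter = std::function<std::any(const std::smatch&)>;

    void Register(const std::string& pattern, Converter convert) {
        filters_.emplace_back(std::regex(pattern), std::move(convert));
    }

    // Uses the first filter whose regex matches the whole string;
    // falls back to storing the raw std::string (an assumption).
    std::any Parse(const std::string& text) const {
        std::smatch match;
        for (const auto& [pattern, convert] : filters_) {
            if (std::regex_match(text, match, pattern)) return convert(match);
        }
        return std::any(text);
    }

private:
    std::vector<std::pair<std::regex, Converter>> filters_;
};

// Only the attribute part of DataCollection is sketched here.
class DataCollection {
public:
    void SetAttribute(const std::string& key, std::any value) {
        attributes_[key] = std::move(value);
    }

    // Wraps any_cast and asserts that the stored type matches Type.
    template <typename Type>
    Type& GetAttribute(const std::string& key) {
        std::any& stored = attributes_.at(key);
        assert(stored.type() == typeid(Type));
        return *std::any_cast<Type>(&stored);
    }

private:
    std::map<std::string, std::any> attributes_;
};

// Example registrations; the patterns are illustrative only.
FilterList MakeExampleFilters() {
    FilterList filters;
    filters.Register(
        R"(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*)",
        [](const std::smatch& m) {
            return std::any(Vector3{std::stod(m[1].str()),
                                    std::stod(m[2].str()),
                                    std::stod(m[3].str())});
        });
    filters.Register(R"(-?\d+)", [](const std::smatch& m) {
        return std::any(std::stoi(m[0].str()));
    });
    return filters;
}
```

With this, node.SetAttribute("position", filters.Parse("3 3 3")) stores a Vector3, and node.GetAttribute< Vector3 >("position") hands it back, while asking for the wrong type trips the assert in a debug build.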
I don't really know how boost::serialization works under the hood, but I
was expecting to get a healthy speedup due to:
A) libxml2 needs to parse the string data, locating the start/end of XML
elements, which I presume is pretty costly since it means searching
through all the string data.
B) When I use libxml2, it first parses data into a string, which I then
need to extract and match at runtime to construct the real type.
C) I'm hoping the binary archive will take up less memory, resulting
in less I/O. I strip the XML formatting, for example.
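
To illustrate what I mean in A) and C), here is a rough sketch of a length-prefixed binary layout for string attributes. This is not the Boost binary archive format, which I have not looked at; Encode, Decode and the helper names are hypothetical, the byte order is whatever the machine uses, and the reader trusts its input (no bounds checks). The point is only that a reader that knows each string's length up front can copy it out directly instead of scanning for delimiters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <utility>

using Attributes = std::map<std::string, std::string>;

// Appends a 32-bit value in the machine's native byte order.
void WriteU32(std::string& out, std::uint32_t value) {
    out.append(reinterpret_cast<const char*>(&value), sizeof value);
}

// Reads a 32-bit value at pos and advances pos past it.
std::uint32_t ReadU32(const std::string& in, std::size_t& pos) {
    std::uint32_t value = 0;
    std::memcpy(&value, in.data() + pos, sizeof value);
    pos += sizeof value;
    return value;
}

// A string is stored as its length followed by its raw bytes.
void WriteString(std::string& out, const std::string& text) {
    WriteU32(out, static_cast<std::uint32_t>(text.size()));
    out.append(text);
}

// The length prefix tells the reader exactly how many bytes to copy,
// so no search for an end marker is needed.
std::string ReadString(const std::string& in, std::size_t& pos) {
    const std::uint32_t size = ReadU32(in, pos);
    std::string text = in.substr(pos, size);
    pos += size;
    return text;
}

// Layout: entry count, then key/value pairs as length-prefixed strings.
std::string Encode(const Attributes& attributes) {
    std::string out;
    WriteU32(out, static_cast<std::uint32_t>(attributes.size()));
    for (const auto& [key, value] : attributes) {
        WriteString(out, key);
        WriteString(out, value);
    }
    return out;
}

Attributes Decode(const std::string& in) {
    std::size_t pos = 0;
    Attributes attributes;
    const std::uint32_t count = ReadU32(in, pos);
    for (std::uint32_t i = 0; i < count; ++i) {
        std::string key = ReadString(in, pos);
        std::string value = ReadString(in, pos);
        attributes.emplace(std::move(key), std::move(value));
    }
    return attributes;
}
```

Whether this ends up smaller than the XML depends on the data: each string costs four extra bytes for its length, but the tag and quote characters go away.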
I'm also entertaining the thought of having much more complex objects
stored from my application, kind of using the binary archive as a
cache. I haven't really explored that area all that much yet, though.