From: jbandela_at_[hidden]
Date: 2001-10-17 23:31:14
Thanks for bringing up this problem. However, short of a fairly 
radical redesign, the Policy has to be contained by value in each 
iterator, because it holds the value of the next token. In 
addition, I believe the iterator_adaptor library that underlies 
Tokenizer contains policies by value. A TokenizerFunction, however, 
need only be contained by value if it has non-constant state. Here is 
a quick solution that could be applied to specific instances.
template<class TokenizerFunc>
class TokFunRef{
   TokenizerFunc* f_;
public:
   TokFunRef(TokenizerFunc& f):f_(&f){}
   template<class InputIterator,class Token>
   bool operator()(InputIterator& next,InputIterator end,Token& tok){
      return (*f_)(next,end,tok);
   }
   void reset(){f_->reset();}
};
 typedef char_delimiters_separator<char> tok_f;
 typedef TokFunRef<tok_f> tok_f_ref;
 tok_f tok_func(false,":");
 while (getline(cin,mystr)) {
   tokenizer<tok_f_ref> tokens(mystr,tok_func);
   tokenizer<tok_f_ref>::iterator field(tokens.begin());
   // A rare hit:
   if (*field == "XXYYZZ") {
     // process data...
   }
 }
This would get rid of the extraneous copies, at the cost of a slight 
increase in complexity. In addition, you would have to make sure that 
the TokenizerFunction used did not contain non-constant state data, 
and that the TokenizerFunction object stayed in scope for as long as 
you were using the Tokenizer. 
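To make the saving concrete, here is a minimal standalone sketch that 
does not use Boost. CountingSeparator and first_token are hypothetical 
names invented for this illustration; only TokFunRef comes from the 
code above. The separator counts its own copy constructions, so you 
can see that passing it by value costs a copy while passing the 
wrapper does not.

```cpp
#include <string>

// Hypothetical stand-in for a TokenizerFunction whose only state is
// constant (a delimiter string). It counts copy constructions.
struct CountingSeparator {
    static int copies;   // copy constructions so far
    std::string delims;

    explicit CountingSeparator(const std::string& d) : delims(d) {}
    CountingSeparator(const CountingSeparator& o) : delims(o.delims) { ++copies; }

    // Reads one token from [next, end); stops after the first delimiter.
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        tok = Token();
        if (next == end) return false;
        for (; next != end; ++next) {
            if (delims.find(*next) != std::string::npos) { ++next; return true; }
            tok += *next;
        }
        return true;
    }
    void reset() {}
};
int CountingSeparator::copies = 0;

// The wrapper from the message: it stores a pointer, so copying it is
// cheap and never copies the wrapped function object.
template <class TokenizerFunc>
class TokFunRef {
    TokenizerFunc* f_;
public:
    TokFunRef(TokenizerFunc& f) : f_(&f) {}
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        return (*f_)(next, end, tok);
    }
    void reset() { f_->reset(); }
};

// Hypothetical helper: takes the function object by value, the way
// iterator-based code usually does, and returns the first token of s.
template <class Func>
std::string first_token(Func f, const std::string& s) {
    std::string::const_iterator it = s.begin();
    std::string::const_iterator end = s.end();
    std::string tok;
    f(it, end, tok);
    return tok;
}
```

Calling first_token(sep, line) with a CountingSeparator copies the 
delimiter string once per call; calling it with a 
TokFunRef<CountingSeparator> built from the same sep copies only a 
pointer. The caveat above still applies: sep has to outlive every 
wrapper that points at it.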
This solution could be extended to be transparent. First, change the 
TokenizerFunction concept to require one of the following typedefs:
typedef TokFunRef<ThisFunction> PolicyFunction; // If constant state
OR
typedef ThisFunction PolicyFunction; // If non-constant state
where ThisFunction is the type of the specific TokenizerFunction.
Then a very small change in tokenizer.hpp would keep a local copy in 
the tokenizer object, and each iterator would have a reference to 
that copy. 
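A rough sketch of how that could look follows. It is only an 
illustration of the idea, not the actual tokenizer.hpp change: 
MiniTokenizer, CharSeparator and FixedWidthSeparator are hypothetical 
names made up for this example, and MiniTokenizer builds the policy 
inside first() where a real tokenizer would build it inside each 
iterator.

```cpp
#include <cstddef>
#include <string>

template <class TokenizerFunc>
class TokFunRef {
    TokenizerFunc* f_;
public:
    TokFunRef(TokenizerFunc& f) : f_(&f) {}
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        return (*f_)(next, end, tok);
    }
    void reset() { f_->reset(); }
};

// Constant state only: iterators can share one instance by reference.
struct CharSeparator {
    typedef TokFunRef<CharSeparator> PolicyFunction;
    std::string delims;
    explicit CharSeparator(const std::string& d) : delims(d) {}
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        tok = Token();
        if (next == end) return false;
        for (; next != end; ++next) {
            if (delims.find(*next) != std::string::npos) { ++next; return true; }
            tok += *next;
        }
        return true;
    }
    void reset() {}
};

// Non-constant state (a running count): each iterator needs its own copy.
struct FixedWidthSeparator {
    typedef FixedWidthSeparator PolicyFunction;
    std::size_t width;
    std::size_t emitted;   // mutated on every call
    explicit FixedWidthSeparator(std::size_t w) : width(w), emitted(0) {}
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        tok = Token();
        if (next == end) return false;
        for (std::size_t i = 0; i < width && next != end; ++i, ++next) tok += *next;
        ++emitted;
        return true;
    }
    void reset() { emitted = 0; }
};

// Hypothetical tokenizer: keeps one local copy of the function and
// hands out whatever the function's PolicyFunction typedef selects.
template <class Func>
class MiniTokenizer {
    Func f_;            // the local copy kept by the tokenizer
    std::string text_;
public:
    typedef typename Func::PolicyFunction policy_type;
    MiniTokenizer(const std::string& s, const Func& f) : f_(f), text_(s) {}
    const Func& function() const { return f_; }
    // Returns the first token, using a policy built from the local copy.
    std::string first() {
        policy_type p(f_);   // reference wrapper or full copy, per the typedef
        std::string::const_iterator it = text_.begin();
        std::string::const_iterator end = text_.end();
        std::string tok;
        p(it, end, tok);
        return tok;
    }
};
```

With CharSeparator the policy is a pointer-sized wrapper around the 
tokenizer's copy, so copying an iterator would no longer copy the 
delimiter string. With FixedWidthSeparator the policy is a full copy, 
so the tokenizer's own instance is left untouched when a policy 
advances.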
Let me know what you think of this solution.
Regards,
John R. Bandela
--- In boost_at_y..., "George A. Heintzelman" <georgeh_at_a...> wrote:
> 
> Hi,
> 
> I was spending some time staring at my profiling results, and found 
> something a little bit disturbing in the tokenizer.
> 
> The token_iterators ultimately contain a copy of the policies object,
> which contains the function which decides where the tokens begin and
> end. Some manner of reference to the function is necessary, of course.
> However, the iterator contains the policy by value. The policy is
> allowed to have state.
> 
> However, there is no distinguishing between the aspects of the policy
> which are constant and unchanging, as with the strings used to hold the
> separator lists in char_delimiters_separator and escaped_list_separator,
> and the aspects which really are stateful, as with the current_offset
> in the offset_separator.
> 
> So, that means that whenever I copy a token_iterator, I wind up making
> two unnecessary string copies.
> 
> Most of the STL algorithms take iterators by value, sometimes 
> recursively, assuming that iterators are cheap to copy....
> 
> Thus, in my application, a sizeable fraction of the time used in 
> tokenizer-related work ends up copying and destroying 
> char_delimiters_separator objects. If you write a loop like this:
> 
> typedef char_delimiters_separator<char> tok_f;
> tok_f tok_func(false,":");
> while (getline(cin,mystr)) {
>   tokenizer<tok_f> tokens(mystr,tok_func);
>   tokenizer<tok_f>::iterator field(tokens.begin());
>   // A rare hit:
>   if (*field == "XXYYZZ") {
>     // process data...
>   }
> }
> 
> You wind up creating and destroying at least 2 copies, perhaps 3 if
> your optimizer is less good, of tok_func for every line in the input.
> If you pull the declaration of tokens out of the loop and use
> tokens.assign(mystr), you can get rid of one of them (this trick should
> be documented). But if you then call
> for_each(tokens.begin(),tokens.end(), pred), you make at least 2 more
> copies...
> 
> It seems to me that the token_iterators could be treated just like
> iterators into any other container -- that is, that they are invalid if
> the container, the tokenizer, is mucked with. In fact, right now they
> are persistent and would continue to be valid beyond the death of the
> tokenizer, since each has its own complete copy of the TokenizerFunc.
> (though not, of course, beyond the death of the string being tokenized.
> No matter what, the validity of the token_iterators should be
> documented...).
> 
> In any case, it seems like this is violating the no-use-no-pay
> principle. I'm not sure the best way to deal with this, but is there
> any way this overhead could be avoided?
> 
> George Heintzelman
> georgeh_at_a...