From: George A. Heintzelman (georgeh_at_[hidden])
Date: 2001-10-17 18:45:31
Hi,
I was spending some time staring at my profiling results, and found 
something a little bit disturbing in the tokenizer.
The token_iterators ultimately contain a copy of the policies object, 
which contains the function which decides where the tokens begin and 
end. Some manner of reference to the function is necessary, of course. 
However, the iterator contains the policy by value, and the policy is 
allowed to have state.
No distinction is made between the aspects of the policy that are 
constant and unchanging (the strings used to hold the separator lists 
in char_delimiters_separator and escaped_list_separator) and the 
aspects that really are stateful (the current_offset in the 
offset_separator).
So, that means that whenever I copy a token_iterator, I wind up making 
two unnecessary string copies.
Most of the STL algorithms take iterators by value, sometimes 
recursively, assuming that iterators are cheap to copy....
Thus, in my application, a sizeable fraction of the time spent in 
tokenizer-related work goes to copying and destroying 
char_delimiters_separator objects. If you write a loop like this:
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>
using namespace std;
using namespace boost;

typedef char_delimiters_separator<char> tok_f;
tok_f tok_func(false,":");
string mystr;
while (getline(cin,mystr)) {
  tokenizer<tok_f> tokens(mystr,tok_func);
  tokenizer<tok_f>::iterator field(tokens.begin());
  // A rare hit (check against end() before dereferencing):
  if (field != tokens.end() && *field == "XXYYZZ") {
    // process data...
  }
}
You wind up creating and destroying at least two copies of tok_func 
for every line of input, perhaps three if your optimizer is less 
capable. If you pull the declaration of tokens out of the loop and use 
tokens.assign(mystr), you can get rid of one of them (this trick 
should be documented). But if you then call 
for_each(tokens.begin(), tokens.end(), pred), you make at least two 
more copies...
It seems to me that the token_iterators could be treated just like 
iterators into any other container -- that is, that they are invalid if 
the container, the tokenizer, is mucked with. In fact, right now they 
are persistent and would remain valid beyond the death of the 
tokenizer, since each has its own complete copy of the TokenizerFunc 
(though not, of course, beyond the death of the string being 
tokenized). No matter what, the validity of the token_iterators should 
be documented.
In any case, it seems like this is violating the no-use-no-pay 
principle. I'm not sure of the best way to deal with this, but is 
there any way this overhead could be avoided?
George Heintzelman
georgeh_at_[hidden]