From: George A. Heintzelman (georgeh_at_[hidden])
Date: 2001-10-17 18:45:31
Hi,
I was spending some time staring at my profiling results, and found 
something a little bit disturbing in the tokenizer.
The token_iterators ultimately contain a copy of the policies object, 
which contains the function which decides where the tokens begin and 
end. Some manner of reference to the function is necessary, of course. 
However, the iterator contains the policy by value, and the policy is 
allowed to have state.
No distinction is made between the aspects of the policy that are 
constant and unchanging (the strings used to hold the separator lists 
in char_delimiters_separator and escaped_list_separator) and the 
aspects that really are stateful (the current_offset in the 
offset_separator).
So, that means that whenever I copy a token_iterator, I wind up making 
two unnecessary string copies.
Most of the STL algorithms take iterators by value, sometimes 
recursively, assuming that iterators are cheap to copy....
Thus, in my application, a sizeable fraction of the time spent in 
tokenizer-related work goes to copying and destroying 
char_delimiters_separator objects. If you write a loop like this:
#include <boost/tokenizer.hpp>
#include <iostream>
#include <string>
using namespace std;
using namespace boost;

typedef char_delimiters_separator<char> tok_f;
tok_f tok_func(false,":");
string mystr;
while (getline(cin,mystr)) {
  tokenizer<tok_f> tokens(mystr,tok_func);
  tokenizer<tok_f>::iterator field(tokens.begin());
  // A rare hit (check against end() before dereferencing):
  if (field != tokens.end() && *field == "XXYYZZ") {
    // process data...
  }
}
You wind up creating and destroying at least two copies of tok_func 
for every line of input, perhaps three if your optimizer is less 
capable. If you pull the declaration of tokens out of the loop and use 
tokens.assign(mystr), you can get rid of one of them (this trick 
should be documented). But if you then call 
for_each(tokens.begin(), tokens.end(), pred), you make at least two 
more copies...
It seems to me that the token_iterators could be treated just like 
iterators into any other container -- that is, that they are invalid if 
the container, the tokenizer, is mucked with. In fact, right now they 
are persistent and would remain valid beyond the death of the 
tokenizer, since each has its own complete copy of the TokenizerFunc 
(though not, of course, beyond the death of the string being 
tokenized). No matter what, the validity of the token_iterators should 
be documented.
In any case, it seems like this is violating the no-use-no-pay 
principle. I'm not sure of the best way to deal with this, but is 
there any way this overhead could be avoided?
George Heintzelman
georgeh_at_[hidden]