Boost mailing page: Re: tokenizer profiling results


From: jbandela_at_[hidden]
Date: 2001-10-17 23:31:14

Next message: Drew.Whitehouse_at_[hidden]: "python linking problem"
Previous message: George A. Heintzelman: "tokenizer profiling results"
In reply to: George A. Heintzelman: "tokenizer profiling results"
Next in thread: todor_at_[hidden]: "Re: [boost] tokenizer profiling results"

Thanks for bringing up this problem. However, without a more or less
radical redesign, the Policy needs to be contained by value in each
iterator because it contains the value of the next token. In
addition, I believe the iterator_adaptor library that underlies
Tokenizer contains policies by value. However, a TokenizerFunction
needs to be contained by value only if it has non-constant state. Here
is a quick solution that could be applied to specific instances.

template<class TokenizerFunc>
class TokFunRef{
   TokenizerFunc* f_;
public:
   TokFunRef(TokenizerFunc& f):f_(&f){}
   template<class InputIterator,class Token>
   bool operator()(InputIterator& next,InputIterator end,Token& tok){
      return (*f_)(next,end,tok);
   }
   void reset(){f_->reset();}
};

typedef char_delimiters_separator<char> tok_f;
typedef TokFunRef<tok_f> tok_f_ref;
tok_f tok_func(false,":");
while (getline(cin,mystr)) {
   tokenizer<tok_f_ref> tokens(mystr,tok_func);
   tokenizer<tok_f_ref>::iterator field(tokens.begin());
   // A rare hit:
   if (*field == "XXYYZZ") {
     // process data...
   }
}

This would get rid of the extraneous copies, at a slight increase in
complexity. In addition, you would have to make sure that the
TokenizerFunction used did not contain non-constant state data, and
that the TokenizerFunction object stayed in scope as long as you were
using the Tokenizer.
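
To make the saving concrete, here is a minimal sketch that does not
depend on Boost. CountingSeparator is a hypothetical stand-in for a
TokenizerFunction with constant state (it is not part of the Tokenizer
library, and it counts its own copies); first_token is likewise a
made-up helper standing in for any algorithm that takes its functor by
value. TokFunRef is the wrapper shown above.

```cpp
#include <cassert>
#include <string>

// Hypothetical TokenizerFunction with constant state (a delimiter string).
// It counts how often it is copy-constructed.
struct CountingSeparator {
    static int copies;
    std::string delims;
    explicit CountingSeparator(const std::string& d) : delims(d) {}
    CountingSeparator(const CountingSeparator& o) : delims(o.delims) { ++copies; }

    // Extracts the next token from [next, end); returns false if none is left.
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) const {
        tok = Token();
        while (next != end && delims.find(*next) != std::string::npos) ++next;
        if (next == end) return false;
        while (next != end && delims.find(*next) == std::string::npos) tok += *next++;
        return true;
    }
    void reset() {}
};
int CountingSeparator::copies = 0;

template <class TokenizerFunc>
class TokFunRef {
    TokenizerFunc* f_;  // non-owning: the function must outlive the wrapper
public:
    TokFunRef(TokenizerFunc& f) : f_(&f) {}
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        return (*f_)(next, end, tok);
    }
    void reset() { f_->reset(); }
};

// Hypothetical helper: takes the functor by value, as STL algorithms do.
template <class Func>
std::string first_token(Func f, const std::string& s) {
    std::string::const_iterator it = s.begin();
    std::string tok;
    f(it, s.end(), tok);
    return tok;
}

// Copies of CountingSeparator made when it is passed directly.
int copies_by_value() {
    CountingSeparator sep(":");
    CountingSeparator::copies = 0;
    std::string t = first_token(sep, "XXYYZZ:rest");
    assert(t == "XXYYZZ");
    return CountingSeparator::copies;
}

// Copies of CountingSeparator made when only the wrapper is passed.
int copies_via_ref() {
    CountingSeparator sep(":");  // must stay in scope while ref is in use
    CountingSeparator::copies = 0;
    TokFunRef<CountingSeparator> ref(sep);
    std::string t = first_token(ref, "XXYYZZ:rest");
    assert(t == "XXYYZZ");
    return CountingSeparator::copies;
}
```

In this sketch the by-value call copies the separator (and its string)
once per call, while the wrapped call copies only a pointer. The
lifetime caveat shows up in copies_via_ref: sep has to outlive ref.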

This solution could be extended to be transparent. First, change
TokenizerFunction to require one of the following typedefs:

typedef TokFunRef<ThisFunction> PolicyFunction; // If constant state

typedef ThisFunction PolicyFunction; // If non-constant state

Where ThisFunction is the type of the specific TokenizerFunction.

Then a very small change in tokenizer.hpp would keep a local copy in
the tokenizer object, and each iterator would have a reference to
that copy.
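
Roughly, the idea could look like the sketch below. This is not the
actual tokenizer.hpp change: MiniTokenizer, its nested iterator (with
an at_end() member instead of a real end iterator), and DelimSeparator
are hypothetical names invented for illustration. Only TokFunRef and
the PolicyFunction typedef come from the proposal above.

```cpp
#include <cassert>
#include <string>

template <class TokenizerFunc>
class TokFunRef {
    TokenizerFunc* f_;
public:
    TokFunRef(TokenizerFunc& f) : f_(&f) {}
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        return (*f_)(next, end, tok);
    }
    void reset() { f_->reset(); }
};

// Hypothetical function with constant state only, so iterators may share it.
// A stateful function would instead write: typedef ThisFunction PolicyFunction;
struct DelimSeparator {
    typedef TokFunRef<DelimSeparator> PolicyFunction;
    static int copies;
    std::string delims;
    explicit DelimSeparator(const std::string& d) : delims(d) {}
    DelimSeparator(const DelimSeparator& o) : delims(o.delims) { ++copies; }
    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) const {
        tok = Token();
        while (next != end && delims.find(*next) != std::string::npos) ++next;
        if (next == end) return false;
        while (next != end && delims.find(*next) == std::string::npos) tok += *next++;
        return true;
    }
    void reset() {}
};
int DelimSeparator::copies = 0;

// Hypothetical tokenizer: owns the one local copy of the function. Each
// iterator holds a PolicyFunction built from that copy.
template <class TokenizerFunc>
class MiniTokenizer {
    TokenizerFunc f_;
    std::string s_;
public:
    typedef typename TokenizerFunc::PolicyFunction policy_type;

    class iterator {
        policy_type policy_;
        std::string::const_iterator pos_, end_;
        std::string tok_;
        bool valid_;
    public:
        iterator(policy_type p, std::string::const_iterator b,
                 std::string::const_iterator e)
            : policy_(p), pos_(b), end_(e), valid_(false) { ++*this; }
        iterator& operator++() { valid_ = policy_(pos_, end_, tok_); return *this; }
        const std::string& operator*() const { return tok_; }
        bool at_end() const { return !valid_; }
    };

    MiniTokenizer(const std::string& s, const TokenizerFunc& f) : f_(f), s_(s) {}
    iterator begin() { return iterator(policy_type(f_), s_.begin(), s_.end()); }
};

// Copies of DelimSeparator made while creating and copying iterators.
int copies_while_iterating() {
    DelimSeparator sep(":");
    MiniTokenizer<DelimSeparator> tokens("XXYYZZ:a:b", sep);  // the one local copy
    DelimSeparator::copies = 0;
    MiniTokenizer<DelimSeparator>::iterator it = tokens.begin();
    MiniTokenizer<DelimSeparator>::iterator a = it, b = a, c = b;  // cheap copies
    assert(*c == "XXYYZZ");
    ++it;
    assert(*it == "a");
    return DelimSeparator::copies;
}
```

With the TokFunRef typedef, copying an iterator copies a pointer plus
the current token, not the delimiter string. The price is the one
George already suggested accepting: the iterators become invalid once
the tokenizer object goes away.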

Let me know what you think of this solution.

Regards,

John R. Bandela

--- In boost_at_y..., "George A. Heintzelman" <georgeh_at_a...> wrote:
>
> Hi,
>
> I was spending some time staring at my profiling results, and found
> something a little bit disturbing in the tokenizer.
>
> The token_iterators ultimately contain a copy of the policies object,
> which contains the function which decides where the tokens begin and
> end. Some manner of reference to the function is necessary, of course.
> However, the iterator contains the policy by value. The policy is
> allowed to have state.
>
> However, there is no distinguishing between the aspects of the policy
> which are constant and unchanging, as with the strings used to hold the
> separator lists in char_delimiters_separator and escaped_list_separator,
> and the aspects which really are stateful, as with the current_offset
> in the offset_separator.
>
> So, that means that whenever I copy a token_iterator, I wind up making
> two unnecessary string copies.
>
> Most of the STL algorithms take iterators by value, sometimes
> recursively, assuming that iterators are cheap to copy....
>
> Thus, in my application, a sizeable fraction of the time used in
> tokenizer-related work ends up copying and destroying
> char_delimiters_separator objects. If you write a loop like this:
>
> typedef char_delimiters_separator<char> tok_f;
> tok_f tok_func(false,":");
> while (getline(cin,mystr)) {
>    tokenizer<tok_f> tokens(mystr,tok_func);
>    tokenizer<tok_f>::iterator field(tokens.begin());
>    // A rare hit:
>    if (*field == "XXYYZZ") {
>      // process data...
>    }
> }
>
> You wind up creating and destroying at least 2 copies, perhaps 3 if
> your optimizer is less good, of tok_func for every line in the input.
> If you pull the declaration of tokens out of the loop and use
> tokens.assign(mystr), you can get rid of one of them (this trick should
> be documented). But if you then call
> for_each(tokens.begin(),tokens.end(), pred), you make at least 2 more
> copies...
>
> It seems to me that the token_iterators could be treated just like
> iterators into any other container -- that is, that they are invalid if
> the container, the tokenizer, is mucked with. In fact, right now they
> are persistent and would continue to be valid beyond the death of the
> tokenizer, since each has its own complete copy of the TokenizerFunc.
> (though not, of course, beyond the death of the string being tokenized.
> No matter what, the validity of the token_iterators should be
> documented...).
>
> In any case, it seems like this is violating the no-use-no-pay
> principle. I'm not sure the best way to deal with this, but is there
> any way this overhead could be avoided?
>
> George Heintzelman
> georgeh_at_a...
