From: George A. Heintzelman (georgeh_at_[hidden])
Date: 2001-10-18 09:42:13
todo_at_[hidden] wrote:
> From: "George A. Heintzelman" <georgeh_at_[hidden]>
> 
> > I was spending some time staring at my profiling results, and found
> > something a little bit disturbing in the tokenizer.
[snip]
> I've had the same concerns. I've found that on my implementation
> sizeof(token_iterator<char_delimiters_separator>) == 44. This
> is really gross.
> 
> I've ended with the solution with low cost of copying, at
> the expense of "degradation" of token iterator from
> ForwardIterator to InputIterator. The "degradation" occurs
> because the tokenizer function instance and the current token
> are kept in the state object that is shared between iterator
> instances. 
[snip]
Well, I think this might not be good, if it can be avoided. Several std 
algorithms that require ForwardIterators would be quite conceivably 
useful with a tokenizer -- search, find_first_of, and find_end come to 
mind -- and the ability to look ahead in a sequence would often be 
useful to user algorithms as well.
I think one problem is that ForwardIterator actually encompasses two 
separate refinements of InputIterator: first, it is an InputIterator 
for which you can make meaningful copies which can be independently 
incremented, and second it is an InputIterator for which the 
dereference is an lvalue, assuming it's a mutable iterator.
The ideal tag for the token_iterator would be one for which the first 
is true, but not the second.
Alternatively, we could change the value_type -- a ForwardIterator's 
dereference is required to be a reference, but we can decide what it is 
a reference to! We could pretty easily use a new object which 
represents a substring. If that object just contained two iterators, 
and only created std::string temporaries when required, this would 
probably improve tokenizer's performance in other dimensions as well.
The more I think about it, the more this seems like the right thing to 
do. But this alone doesn't solve the problem of big iterators. I think 
what should happen is that the TokenizerFunction be split up into the 
function proper (including static/closure-like state) and the 
per-iteration state the function needs. What I'm getting at is a 
token_iterator layout which looks like this:
typedef token_representation<underlying_iterator> value_type;
value_type token;
compressed_pair<TokenizerFunction *, State<TokenizerFunction> > func_and_state;
Here we give token_representation a fully-featured suite of string-like 
assignments, conversions, etc., but it contains just the start and end 
iterators and has a lifetime limited by that of the original parsed 
string. (This substring class would probably be useful in other 
contexts too -- regex::match_results comes to mind -- and it's a lot 
like the Range representations I've seen talked about.) The original 
TokenizerFunction object referred to lives in the tokenizer object, and 
can now itself hold the start and end iterators of the original 
sequence, which the iterator also currently carries around, if I'm 
reading things correctly.
Then we specialize things for the default case, where 
State<TokenizerFunction> is void; otherwise, token_iterator::advance 
calls something like this:
token = tokenizer_function(token.end, state); // state is a State<TokenizerFunction>&
and the end token_iterator has start == the end of the original 
sequence as the TokenizerFunction sees it.
This would be a 12-byte object for non-stateful iterators on most 
implementations, which is much more palatable than the current 40 (on 
my implementation) or 44 on yours; and copying is just three 
word-copies, so it will be pretty cheap -- no ref-counting or anything 
to worry about. As an added bonus, this representation would let 
offset_separator dispense with true state, since the TokenizerFunction 
could just compute its position as the difference from the start 
iterator stored once in the tokenizer object.
Now we have ForwardIterator status without any real trouble. We could 
even consider allowing the user to define additional functions (eg, 
moving backwards) if they wanted the iterator to satisfy Bidirectional 
or even RandomAccess iterator requirements and the tokenizer function 
in use was suitable (offset_separator, for instance, could very 
naturally support a RandomAccess iterator).
So, thoughts? Is this worth hacking together an attempt at an 
implementation?
George Heintzelman
georgeh_at_[hidden]