From: George A. Heintzelman (georgeh_at_[hidden])
Date: 2001-10-18 09:42:13
todo_at_[hidden] wrote:
> From: "George A. Heintzelman" <georgeh_at_[hidden]>
> 
> > I was spending some time staring at my profiling results, and found
> > something a little bit disturbing in the tokenizer.
[snip]
> I've had the same concerns. I've found that on my implementation
> sizeof(token_iterator<char_delimiters_separator>) == 44. This
> is really gross.
> 
> I've ended with the solution with low cost of copying, at
> the expense of "degradation" of token iterator from
> ForwardIterator to InputIterator. The "degradation" occurs
> because the tokenizer function instance and the current token
> are kept in the state object that is shared between iterator
> instances. 
[snip]
Well, I think this might not be good, if it can be avoided. Several std 
algorithms that require ForwardIterators would be quite conceivably 
useful with a tokenizer -- search, find_first_of, and find_end come to 
mind -- and the ability to look ahead in a sequence would often be 
useful to user algorithms as well.
I think one problem is that ForwardIterator actually encompasses two 
separate refinements of InputIterator: first, it is an InputIterator 
for which you can make meaningful copies which can be independently 
incremented, and second it is an InputIterator for which the 
dereference is an lvalue, assuming it's a mutable iterator.
The ideal tag for the token_iterator would be one for which the first 
is true, but not the second.
Alternatively, we could change the value_type -- a ForwardIterator's 
dereference is required to be a reference, but we can decide what it is 
a reference to! We could pretty easily use a new object which 
represents a substring. If that object just contained two iterators, 
and only created std::string temporaries when required, this would 
probably improve tokenizer's performance in other dimensions as well.
The more I think about it, the more this seems like the right thing to 
do. But this alone doesn't solve the problem of big iterators. I think 
what should happen is that the TokenizerFunction be split up into the 
function proper (including static/closure-like state) and the 
per-iteration state the function needs. What I'm getting at is a 
token_iterator layout which looks like this:
typedef token_representation<underlying_iterator> value_type;
value_type token;
compressed_pair<TokenizerFunction *, State<TokenizerFunction> > func_and_state;
Here we give token_representation a fully-featured suite of string-like 
assignments, conversions, etc., but it contains just the start and end 
iterators and has a lifetime limited by that of the original parsed 
string. (This substring class would probably be useful in other 
contexts too -- regex::match_results comes to mind -- and it's a lot 
like the Range representations I've seen talked about.) The original 
TokenizerFunction object referred to lives in the tokenizer object, and 
can now itself hold the start and end iterators of the original 
sequence, which the iterator also currently carries around, if I'm 
reading things correctly.
Then we specialize things for the default case, where 
State<TokenizerFunction> is void; otherwise, token_iterator::advance 
calls something like this:
token = tokenizer_function(token.end, state); // state is a State<TokenizerFunction>&
and the end token_iterator has start == the end of the original 
sequence as the TokenizerFunction sees it.
This would be a 12-byte object for non-stateful iterators on most 
implementations, which is much more palatable than the current 40 (on 
my implementation) or 44 on yours; and copying is just three 
word-copies, so it will be pretty cheap -- no ref-counting or anything 
to worry about. As an added bonus, this representation would let 
offset_separator dispense with true state, since the TokenizerFunction 
could just compute its position as the difference from the start 
iterator stored once in the tokenizer object.
Now we have ForwardIterator status without any real trouble. We could 
even consider allowing the user to define additional functions (eg, 
moving backwards) if they wanted the iterator to satisfy Bidirectional 
or even RandomAccess iterator requirements and the tokenizer function 
in use was suitable (offset_separator, for instance, could very 
naturally support a RandomAccess iterator).
So, thoughts? Is this worth hacking together an attempt at an 
implementation?
George Heintzelman
georgeh_at_[hidden]