From: jbandela_at_[hidden]
Date: 2000-09-04 12:13:08
I agree with Aleksey. The problem of backtracking is a real one.
Consider this code to process an assignment:
if(iter != end && iter->type == VAR){
    // check if the next token is an =
    iterator_type temp(iter); // copy first; writing temp(++iter) would move iter itself
    ++temp;
    if(temp != end && temp->type == OP_EQ){
        // Process the assignment
        ...
        iter = temp;
    }
    // Otherwise do not change iter
}
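To make the lookahead-and-commit pattern above concrete, here is a minimal,
compilable sketch. The Token struct, VAR, and OP_EQ are hypothetical
stand-ins for whatever the real tokenizer produces, not its actual types:

```cpp
#include <string>
#include <vector>

// Hypothetical token types, standing in for the real tokenizer's output.
enum TokenType { VAR, OP_EQ, NUMBER };

struct Token {
    TokenType type;
    std::string text;
};

// Returns true (and advances iter past "name =") when an assignment head
// is present; otherwise leaves iter exactly where it was (backtracking).
bool try_assignment_head(std::vector<Token>::const_iterator& iter,
                         std::vector<Token>::const_iterator end)
{
    if (iter == end || iter->type != VAR)
        return false;
    // Copy the iterator instead of advancing iter itself.
    std::vector<Token>::const_iterator temp = iter;
    ++temp;
    if (temp != end && temp->type == OP_EQ) {
        ++temp;
        iter = temp;   // commit: iter now points past the '='
        return true;
    }
    return false;      // no '=' follows: iter is unchanged
}
```

The point is that the caller keeps the original iterator alive until it
decides to commit, which is exactly what a pure output-only tokenize
algorithm cannot offer.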
Another problem is that you may not want to process the entire input,
but rather get, say, the first two tokens and grab the rest as a string.
For example, suppose you have ini-style file parsing code that reads
name/value pairs like this:

name=value

We could have escape characters so that the name can contain = signs:

this\=test=Hello World=first program in C

To parse this quickly, we get the first two tokens ("this=test" and
"=") and just grab the rest of the line ("Hello World=first program in
C"), since we want to be able to support value strings with an
arbitrary format. An example might be base64 encoding, which uses =.
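A minimal sketch of that two-tokens-then-rest strategy (the unescape
rule here, backslash escaping the next character, is an assumption for
illustration, not the actual parser's behavior):

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Split "name=value" where '\' escapes the next character in the name.
// Everything after the first unescaped '=' is taken verbatim as the
// value, so the value may itself contain '=' in any format.
std::pair<std::string, std::string> parse_pair(const std::string& line)
{
    std::string name;
    std::size_t i = 0;
    for (; i < line.size(); ++i) {
        if (line[i] == '\\' && i + 1 < line.size())
            name += line[++i];   // escaped character: keep it, drop '\'
        else if (line[i] == '=')
            break;               // first unescaped '=' ends the name
        else
            name += line[i];
    }
    std::string value =
        (i < line.size()) ? line.substr(i + 1) : std::string();
    return std::make_pair(name, value);
}
```

For the line above, this yields the name "this=test" and the value
"Hello World=first program in C" untouched.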
Thanks again for looking at the code and commenting.
--- In boost_at_[hidden], "Aleksey Gurtovoy" <alexy_at_m...> wrote:
> Daryle Walker (<darylew_at_m...>) wrote:
> 
> 
> > I looked at the recent TokenIterator stuff, and I wonder if there
> > is a way to simplify tokenization.  Not every part of a concept
> > has to be a class; we could replace the token iterator class with
> > an algorithm.  How about:
> >
> [snip]
> > template <typename Tokenizer, typename In, typename Out>
> > Tokenizer
> > tokenize (
> >     In         src_begin,
> >     In         src_end,
> >     Out        dst_begin,
> >     Tokenizer  tok )
> > {
> >     // Send any prefix tokens
> >     while ( tok )
> >         *dst_begin++ = *tok;
> >
> >     while ( src_begin != src_end )
> >     {
> >         // Give input symbols to tokenizer
> >         tok( *src_begin++ );
> >
> >         // If a token can now be formed, send it
> >         while ( tok )
> >             *dst_begin++ = *tok;
> >     }
> >
> >     // Return the tokenizer in case more symbols are needed
> >     return tok;
> > }
> >
> 
> That's always good to look at something from a different
> perspective :).
> However, I don't think that replacing the token iterator concept
> with some 'tokenize' algorithm would be beneficial. Actually, I
> don't think that there is a common generic representation of all
> sequence parsing algorithms which we could factor out and turn into
> some (useful) 'tokenize' function template. For instance, the
> algorithm you posted pretty much rules out backtracking tokenizers.
> The fact is that iteration through an original input sequence that
> needs to be tokenized is too tightly tied to the parsing algorithm,
> and I don't think there is much sense in breaking these
> dependencies. So actually we don't care about how the input
> sequence is iterated during the parsing process - that's the
> tokenizer's work. What we want is some standard way to get the
> results of its job, in a form that doesn't impose unnecessary
> requirements on users' code and that integrates well with the
> standard library itself. IMO, the iterator concept is exactly what
> we need. It doesn't force you to process the whole input sequence
> all at once and put the results somewhere, although you could do so
> easily if you want to. It directly supports the common pattern of
> many (high-level) parsing algorithms - read new lexeme -> go to
> some state -> do some work -> read new lexeme. It doesn't make
> constraining assumptions about how tokenizers work; and it allows
> tokenizers to have a minimal interface (e.g. a tokenizer may be
> implemented as just a function). As for the complexity of the
> current implementation (which concerns me too) - I hope it will be
> simplified a lot after we nail down the concepts.
> 
> --Aleksey