From: Robert Zeh (razeh_at_[hidden])
Date: 2003-12-05 09:24:02


John Torjo <john.lists_at_[hidden]> writes:

> Dear boosters,
>
> While trying to implement slice range (in rtl - range template
> library), I came across the token_iterator class.
> While examining it, I found the TokenizerFunction concept too
> complicated: it basically unites two concepts.
>
>
> The way I see it, implementing a tokenizer involves two concepts:
> 1. finding where each token begins and ends (this can be implemented
> incredibly simply, see below)
>
> 2. parsing the token, and returning the result.
>
>
> By keeping the above separated, we get simpler code and more reusability.
>
> A simple example: you want to parse each word in a file.
> As results, you might want the words themselves, (who knows?) only
> the first 10 letters of each word, the first letter of each word, or
> the word length.
> Keep the two concepts separated, and the implementation is a breeze
> (and efficient as well).
>
> Here's a possible implementation of parsing words:
> // are 'first' and 'second' part of the same word,
> // i.e. does no new word begin right after 'first'?
> bool are_from_same_word( char first, char second) {
>     if ( !isspace(second)) return true;
>     return isspace(first) != 0;
> }
>
> // trim leading and trailing whitespace from [begin, end)
> void ignore_space(const char *& begin, const char *&end) {
>     while ( begin != end && isspace(*begin)) ++begin;
>     while ( begin != end && isspace(end[-1])) --end;
> }
>
> // parser #1: the word itself
> std::string parse_word( const char * begin, const char *end) {
>     ignore_space(begin,end);
>     return std::string( begin, end);
> }
>
> // parser #2: just the word's length
> int parse_word_len( const char * begin, const char *end) {
>     ignore_space(begin,end);
>     return end - begin;
> }
>
> ... etc.
>
>
> The above is a very generic solution that does not apply only to strings.
> (Also, I was thinking of a better name: slice - something that slices a
> range into multiple ranges and computes something for each such range.
> The result is another range.)
>
> I will do some coding over the next few days and post the results.
>
> Best,
> John
>
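
For concreteness, here is a rough, hypothetical sketch of how the two
concepts might compose. The for_each_slice driver and print_word below
are illustrative additions (they are not part of John's post); his
helpers are repeated only so the example compiles on its own.

#include <cctype>
#include <iostream>
#include <string>

bool are_from_same_word(char first, char second) {
    if (!std::isspace(second)) return true;
    return std::isspace(first) != 0;
}

void ignore_space(const char*& begin, const char*& end) {
    while (begin != end && std::isspace(*begin)) ++begin;
    while (begin != end && std::isspace(end[-1])) --end;
}

std::string parse_word(const char* begin, const char* end) {
    ignore_space(begin, end);
    return std::string(begin, end);
}

// one possible "parser": print the word in brackets
void print_word(const char* begin, const char* end) {
    std::cout << '[' << parse_word(begin, end) << "]\n";
}

// hypothetical driver: slice [begin, end) into maximal runs of characters
// that are_from_same_word() considers part of the same word, and hand
// each slice to the parser
template <class Parser>
void for_each_slice(const char* begin, const char* end, Parser parse) {
    while (begin != end) {
        const char* token_end = begin + 1;
        while (token_end != end &&
               are_from_same_word(token_end[-1], *token_end))
            ++token_end;
        parse(begin, token_end);
        begin = token_end;
    }
}

int main() {
    const char text[] = "  the quick  brown fox";
    for_each_slice(text, text + sizeof(text) - 1, print_word);
    // prints [the] [quick] [brown] [fox], one per line
}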

One of the nice features of the current TokenizerFunction concept is
that it is a single-pass algorithm and will work with input
iterators. I'm not sure how to keep the algorithm single-pass if you
split the token delimitation from the token creation.
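
To make the single-pass point concrete, here is a hypothetical functor
following the shape of the TokenizerFunction concept (bool
operator()(next, end, tok) plus reset()); skip_space_tokenizer is an
illustrative sketch, not Boost code. It appends characters to the token
while advancing the iterator exactly once, so even an istreambuf_iterator
over a stream is enough. The split design, as sketched, hands parse_word()
the token's [begin, end) range, which a pure input iterator cannot be
walked over a second time to provide without buffering.

#include <cctype>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>

struct skip_space_tokenizer {
    void reset() {}   // part of the TokenizerFunction concept's shape

    template <class InputIterator, class Token>
    bool operator()(InputIterator& next, InputIterator end, Token& tok) {
        tok = Token();
        while (next != end && std::isspace(*next)) ++next;  // skip delimiters
        if (next == end) return false;                       // no token left
        for (; next != end && !std::isspace(*next); ++next)
            tok += *next;                                    // build token as we scan
        return true;
    }
};

int main() {
    std::istringstream in("  the quick  brown fox ");
    std::istreambuf_iterator<char> next(in), end;            // input iterators
    skip_space_tokenizer tokenize;
    std::string tok;
    while (tokenize(next, end, tok))
        std::cout << '[' << tok << "]\n";                    // [the] [quick] ...
}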

Robert