Subject: [boost] tokenizer for overlapping delimiters?
From: Jorge Cardoso Leitão (jorgecarleitao_at_[hidden])
Date: 2015-04-23 06:30:53


Dear boost devs,

I'm writing here because I've been coding a tokenizer that I believe, given
its generality, could be an addition to Boost. I'm asking for a judgement
on whether the idea has a place in Boost or not.

The interface I'm proposing is a function of two arguments, a string (call
it text) and a set of strings (call it key-terms), that returns a
vector<string> (call it result) fulfilling 4 constraints (a possible C++
signature is sketched after the list):

   1. a string in the result can only be either a key-term or a string
   between two key-terms;
   2. the concatenation of the result is always the original text;
   3. a key-term containing other key-terms has priority over the ones it
   contains;
   4. a key-term overlapping another has priority based on its position in
   the text.
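
In C++, the interface could look like the following declaration (a
hypothetical spelling of the signature; the exact names are mine, not
necessarily those of the proof of concept):

#include <set>
#include <string>
#include <vector>

// Proposed interface: split `text` into key-terms and the strings
// between them, subject to constraints 1-4 above.
std::vector<std::string> tokenize(const std::string& text,
                                  const std::set<std::string>& key_terms);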

A tokenizer that divides a string by delimiters is a special case of this
interface in which the key-terms are the delimiters. This is what
Boost.Tokenizer already covers, where the key-terms are *non-overlapping*.
The critical addition here is the ability to deal with *overlapping
key-terms*.

A common use case for overlapping key-terms arises when you have key-terms
that you want to treat as single tokens but that overlap with common
delimiters. A practical example:

tokenize a string (with words separated by spaces) and guarantee that both
`"United States of America"` and `"United States of Brazil"` are
interpreted as single tokens.

The non-triviality appears because this feat requires tracking which
key-terms are partially matched against a previously consumed sub-string,
and backtracking when a match fails (e.g. "United States of " is a common
prefix of both terms above, but as soon as the next character appears,
either one or both can be discarded as potential matches).
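
To illustrate the state that needs to be kept, here is a minimal sketch of
the candidate filtering step, assuming the algorithm keeps the set of
key-terms still alive at the current position (the helper name and shape
are mine, for illustration only):

#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: given the candidate key-terms that still match after
// `matched` characters have been consumed, keep only those whose next
// expected character is `c`. An empty result means the partial match failed
// and the algorithm must backtrack.
std::vector<std::string> filter_candidates(
    const std::vector<std::string>& candidates, std::size_t matched, char c)
{
    std::vector<std::string> alive;
    for (const std::string& term : candidates)
        if (matched < term.size() && term[matched] == c)
            alive.push_back(term);
    return alive;
}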

Some examples in pseudo-code (note how they fulfil constraints 1-4):

tokenize("the end", {}) --> {"the end"}
tokenize("the end", {" "}) --> {"the", " ", "end"}
tokenize("foo-bar", {"foo", "foo-bar"}) --> {"foo-bar"}
tokenize("the end", {"the e", " ", "end"}) --> {"the e", "nd"}
tokenize("foo-tars ds", {"foo", "foo-bar", "-tar"}) --> {"foo", "-tar", "s
ds"}

As a proof of concept, I've implemented this interface and the respective
test cases, which you can find at https://github.com/jorgecarleitao/tokenize.
Any change is possible to accommodate Boost's standards: this can be
generalized to arbitrary sequences and arbitrary types, changed to use
iterators, documented, better tested, etc.

But before anything else, I would like to ask for an opinion on whether
this is sufficiently general and useful to be considered for Boost.

Thank you for your time,
Best regards,
Jorge