From: Vladimir Pozdyayev (pvv_at_[hidden])
Date: 2004-12-22 03:18:29
John Maddock wrote:
> This is an area I want to explore though, if I can get this next lot of code
> out the door, then I'll create a cvs branch to experiment with this, if you
> want to suggest / experiment with a design for the abstract creator in the
> meantime, then go for it.
I finally got around to refactoring the ANFA code. Take a look at
http://groups.yahoo.com/group/boost/files/anfa-regex/anfa091.zip
(once again, this is not a full-scale regex library... yet)
The DESIGN file content is appended to this message.
--
Best regards,
Vladimir Pozdyayev.
----------------------------------------------------------------------
-= Regex Design Issues =-
The core classes.
* charset
Provides "bool operator( character )". Nothing much to say apart
from that, but do see the Charset Issues section.
* charset creator
Supports arbitrary charset expressions (within the limits of a
given set of possible operations). Implementations, however, are
not required to provide _all_ the declared functionality; calls
for unsupported features should result in appropriate exceptions.
Also provides the "void create( charset & )" function which is
used to initialize a charset newly created by "matcher creator".
The "abstract_charset_creator" class provides stubs for every
expression element that a regex parser might request.
Implementations with limited functionality can inherit from it and
redefine only those functions that should actually do something
useful.
* matcher
Provides the low-level matching functionality, say, finding the
first occurrence of bla-bla-bla. On the other hand, replacing all
occurrences is a high-level action, for it consists of (1)
finding them and (2) creating a modified string---so it should go
into the "regex" class. (On the other other hand, if it is
possible to do replacement on the fly while searching, this
becomes a low-level action. I don't know if it can be done in a
sufficiently general way, however.)
* matcher creator
Like charset creator, only for matchers.
* parser
The syntax parser. Takes the input string in the form of
begin-end iterators, and issues a sequence of charset/matcher
creator calls ending with "matcher_creator::create" (or an
exception). In essence, simply provides the function
"void parse( matcher &m, iterator begin, iterator end )".
A parser must be consistent with the properties of "creator"
classes.
* regex
A wrapper for the "matcher" class. Provides the high-level
creation & utilization routines.
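To make the "stubs plus selective overrides" idea for charset creators concrete, here is a minimal sketch. It is not the code from anfa091.zip: the names "unsupported_feature", "simple_charset", "abstract_charset_creator_sketch", "limited_charset_creator" and the operations "add_char", "add_range", "invert" are all made up for illustration, and it assumes a C++11 compiler and plain "char".

```cpp
#include <cassert>
#include <set>
#include <stdexcept>

// All names below are hypothetical; they only illustrate the pattern.
struct unsupported_feature : std::logic_error {
    explicit unsupported_feature(const char* what) : std::logic_error(what) {}
};

// A charset reduced to its essence: a membership test on a character.
struct simple_charset {
    std::set<char> members;
    bool operator()(char c) const { return members.count(c) != 0; }
};

// Base class: every expression element has a stub that throws.
class abstract_charset_creator_sketch {
public:
    virtual ~abstract_charset_creator_sketch() {}
    virtual void add_char(char) { throw unsupported_feature("add_char"); }
    virtual void add_range(char, char) { throw unsupported_feature("add_range"); }
    virtual void invert() { throw unsupported_feature("invert"); }
    virtual void create(simple_charset&) { throw unsupported_feature("create"); }
};

// Limited implementation: single characters and ranges, no inversion.
class limited_charset_creator : public abstract_charset_creator_sketch {
    std::set<char> pending_;
public:
    void add_char(char c) override { pending_.insert(c); }
    void add_range(char lo, char hi) override {
        // Loop on int so that hi == CHAR_MAX cannot overflow the counter.
        for (int c = lo; c <= hi; ++c) pending_.insert(static_cast<char>(c));
    }
    // Fills the target and leaves the creator empty, ready for reuse.
    void create(simple_charset& out) override {
        out.members.swap(pending_);
        pending_.clear();
    }
};

// Returns true if the creator rejects invert() as unsupported.
bool invert_is_unsupported(abstract_charset_creator_sketch& c) {
    try { c.invert(); } catch (const unsupported_feature&) { return true; }
    return false;
}
```

A parser written against the base class can call any element; a creator that does not implement it reports that through the exception rather than silently producing a wrong charset.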
How they are connected.
A sample from "regex.cpp":
  typedef basic_regex<
      basic_simple_parser<
          basic_charset_creator< wchar_t >,
          basic_anfa_matcher_creator< basic_charset< wchar_t > >,
          basic_anfa_matcher< basic_charset< wchar_t > >
      >,
      basic_anfa_matcher< basic_charset< wchar_t > >
  > regex;
On "creators" and "create" functions.
The name is somewhat misleading, since they fill target objects
with compiled data rather than create them. Still, "charset
compiler" sounds a bit weird... or does it?
Anyway. The "creators" are subject to the following uses and
requirements. They must be able to destruct themselves gracefully
even if the expression they are being fed is only half
done (in case someone has thrown an "unsupported" exception). The
"create" function must do an implicit "pop" from the expression
stack, so that the "creator" object can be reused. The "create"
function can assume there's only one top-level expression tree
node on the stack.
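The requirements above could be met roughly as follows. This is only a sketch under my own assumptions, not the ANFA implementation: "node", "compiled_target", "stack_creator_sketch", "push_literal", "concat" and "pending" are hypothetical names, and it uses C++11 "std::unique_ptr" so that a half-built expression is released automatically when the creator is destroyed. It also checks the "exactly one node" precondition instead of merely assuming it, which is a defensive extra.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical expression tree node; children are owned by the parent.
struct node {
    std::string label;
    std::unique_ptr<node> left, right;
};

// Hypothetical target object that "create" fills with compiled data.
struct compiled_target {
    std::unique_ptr<node> root;
};

class stack_creator_sketch {
    // Owning stack: destroying the creator frees any unfinished subtrees.
    std::vector<std::unique_ptr<node>> stack_;

    std::unique_ptr<node> pop() {
        if (stack_.empty()) throw std::logic_error("expression stack underflow");
        std::unique_ptr<node> n = std::move(stack_.back());
        stack_.pop_back();
        return n;
    }

public:
    void push_literal(char c) {
        std::unique_ptr<node> n(new node);
        n->label = std::string(1, c);
        stack_.push_back(std::move(n));
    }

    // Replaces the two topmost nodes with one "concat" node.
    void concat() {
        if (stack_.size() < 2) throw std::logic_error("concat needs two operands");
        std::unique_ptr<node> rhs = pop();
        std::unique_ptr<node> lhs = pop();
        std::unique_ptr<node> n(new node);
        n->label = "concat";
        n->left = std::move(lhs);
        n->right = std::move(rhs);
        stack_.push_back(std::move(n));
    }

    // Moves the single top-level node into the target; the implicit pop
    // leaves the stack empty, so the same creator can be reused.
    void create(compiled_target& out) {
        if (stack_.size() != 1) throw std::logic_error("expected one top-level node");
        out.root = pop();
    }

    std::size_t pending() const { return stack_.size(); }
};
```

If a parser abandons an expression halfway (for example after an "unsupported" exception), the nodes still on the stack are simply destroyed along with the creator.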
-= Charset Issues =-
(Should I rename them to "character sets" for consistency with
the full-names style?)
All the above templates allow considerable freedom in mixing
different character types, let alone different character
encodings. E.g., the sample program has all regex templates
instantiated with wide characters, but the regex string itself is
char-typed. This clearly needs to be controlled.
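One possible way to control this, at least for character types (encodings are a separate problem), is a compile-time check that the input iterator's value type matches the charset's character type. The sketch below is an assumption of mine and not part of the current code: "charset_sketch", "matcher_sketch" and "match_all" are hypothetical, and it relies on C++11 "static_assert" and "std::is_same".

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <type_traits>

// Hypothetical charset: a closed character range [lo, hi].
template <class Char>
struct charset_sketch {
    typedef Char char_type;
    Char lo, hi;
    bool operator()(Char c) const { return lo <= c && c <= hi; }
};

// Hypothetical matcher that refuses input of a different character type.
template <class Charset>
struct matcher_sketch {
    typedef typename Charset::char_type char_type;
    Charset set;

    // True if every character in [begin, end) belongs to the charset.
    template <class Iterator>
    bool match_all(Iterator begin, Iterator end) const {
        typedef typename std::iterator_traits<Iterator>::value_type input_char;
        static_assert(std::is_same<input_char, char_type>::value,
                      "input character type differs from charset character type");
        for (; begin != end; ++begin)
            if (!set(*begin)) return false;
        return true;
    }
};
```

With this in place, passing a "std::string" to a matcher instantiated for "wchar_t" fails to compile instead of silently widening each character.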