From: Erik Wien (wien_at_[hidden])
Date: 2004-10-19 19:34:40
Peter Dimov wrote:
> It appears that there are two schools of thought when it comes to string
> design. One approach treats a string purely as a sequential container of
> values. The other tries to represent "string values" as a coherent whole.
> It doesn't help that in the simple case where the value_type is char the
> two approaches result in mostly identical semantics.
>
> My opinion is that the std::char_traits<> experiment failed and
> conclusively demonstrated that the "string as a value" approach is a dead
> end, and that practical string libraries must treat a string as a
> sequential container, vector<char>, vector<char16_t> and vector<char32_t>
> in our case.
>
> The interpretation of that sequence of integers as a concrete string value
> representation needs to be done by algorithms.
That is kinda what my current implementation does, but the container is not
directly accessible by the user (nor do I think it should be). Instead I
wrap the vector of code points in a class and provide different types of
iterators to iterate through the vector at different "character levels",
instead of external algorithms. You can therefore access the string on a
code unit level, but the casual user would not necessarily know (or care)
about that. Instead he would use the "string as a value" approach, using
strings to represent a sentence, a word, or some other language construct.
When most people think of a string, they think of text, not the underlying
binary representation, and that, in my opinion, is the notion a library
should be designed around.
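
Something along these lines (a rough sketch only; the names and the
32-bit code point storage are illustrative assumptions of mine, not the
actual implementation):

#include <cstdint>
#include <vector>

class unicode_string
{
public:
    typedef std::uint32_t code_point;
    typedef std::vector<code_point>::const_iterator code_point_iterator;

    explicit unicode_string(std::vector<code_point> points)
        : points_(points) {}

    // Lowest level: iterate over the stored code points directly.
    code_point_iterator code_points_begin() const { return points_.begin(); }
    code_point_iterator code_points_end()   const { return points_.end(); }

    // A grapheme-level iterator would walk the same storage but treat
    // 'o' followed by U+0308 as one character; it needs Unicode property
    // tables, so only its place in the interface is suggested here.

private:
    std::vector<code_point> points_; // hidden; no direct container access
};

The point of the layering is that == and the other value operations can be
defined on the class at whatever character level is appropriate, while the
lower levels stay reachable through the explicit iterators.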
> In other words, I believe that string::operator== should always perform
> the per-element comparison std::equal( lhs.begin(), lhs.end(),
> rhs.begin() ) that is specified in the Container requirements table.
>
> If I want to test whether two sequences of char16_t's, interpreted as
> UTF16 Unicode strings, would represent the same string in a printed form,
> I should be given a dedicated function that does just that - or an
> equivalent. Similarly, if I want to normalize a sequence of chars that are
> actually UTF8, I'd call the appropriate 'normalize' function/algorithm.
Though I see where you are coming from, I don't agree with you on that. In
my opinion a good Unicode library should hide as much as possible of the
complexity of the actual character representation from the user. If we were
to require the user to know that a direct binary comparison of strings is
not the same as an actual textual comparison, we lose some of the simplicity
of the library. Most users of such a library would not know that the
character ö can be represented both as the single code point U+00F6 and as
'o' followed by a combining diaeresis (U+0308), and that as a consequence,
calling == on two strings could result in the behaviour "ö" != "ö". By
removing the need for such knowledge on the user's part, we reduce the
learning curve considerably, which is one of the main reasons for
abstracting this functionality in the first place.
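
To make the pitfall concrete, here is a self-contained toy (my own
illustration, not the library's API). The compose() below handles only the
single pair 'o' + U+0308 -> U+00F6, just enough to show why raw element
comparison and textual comparison disagree; a real implementation would use
the full Unicode canonical composition data.

#include <cstdint>
#include <iostream>
#include <vector>

typedef std::vector<std::uint32_t> code_points;

// Toy canonical composition: rewrite 'o' + combining diaeresis (U+0308)
// into the precomposed o-umlaut (U+00F6). Real normalization covers the
// whole Unicode composition table, not just this pair.
code_points compose(code_points s)
{
    for (std::size_t i = 0; i + 1 < s.size(); )
    {
        if (s[i] == 0x006F && s[i + 1] == 0x0308)
        {
            s[i] = 0x00F6;
            s.erase(s.begin() + i + 1);
        }
        else
            ++i;
    }
    return s;
}

int main()
{
    code_points const a = { 0x00F6 };         // "ö", precomposed
    code_points const b = { 0x006F, 0x0308 }; // "ö", decomposed

    std::cout << "raw:        " << (a == b) << '\n';                   // 0
    std::cout << "normalized: " << (compose(a) == compose(b)) << '\n'; // 1
}

The raw element-wise comparison says the strings differ; after even this
minimal normalization they compare equal, which is what the casual user
expects == to mean.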