From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2004-10-19 05:27:52


I've recently started on the first draft of a Unicode library.

One assumption I think is wrong is that wchar_t is suitable for
Unicode. Correct me if I'm wrong, but IIRC wchar_t is only 16 bits
wide on Microsoft compilers, for example. On those compilers the
utf8_codecvt_facet implementation will truncate any code point
above 0xFFFF. (U+1D12C will come out as U+D12C.)
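
For example, a minimal sketch of the truncation (not the actual
facet code; the cast stands in for storing the decoded value in a
16-bit wchar_t):

#include <cassert>
#include <boost/cstdint.hpp>

int main()
{
    boost::uint32_t codepoint = 0x1D12C; // needs more than 16 bits
    // What storing it in a 16-bit wchar_t amounts to:
    boost::uint16_t truncated =
        static_cast<boost::uint16_t>(codepoint);
    assert(truncated == 0xD12C); // a different codepoint entirely
    return 0;
}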

I think defining unicode::code as uint32_t would be much better.
The problem is that the standard library only provides codecvt
specializations for wchar_t and char, so a Unicode codecvt cannot
be made without manually adding (dummy) implementations of
codecvt<unicode::code, char, mbstate_t> to the std namespace. I
guess this is the reason Ron Garcia just used wchar_t.
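
Roughly like this (a sketch only: the conversion logic is omitted,
a usable facet would also need the public in()/out() wrappers that
the char and wchar_t specializations have, and whether specializing
on a typedef for a built-in type is strictly conforming is
debatable):

#include <cstddef>
#include <cwchar> // mbstate_t
#include <locale>
#include <boost/cstdint.hpp>

namespace unicode { typedef boost::uint32_t code; }

namespace std {

// The "dummy" specialization mentioned above.
template <>
class codecvt<unicode::code, char, mbstate_t>
    : public locale::facet, public codecvt_base
{
public:
    typedef unicode::code intern_type;
    typedef char extern_type;
    typedef mbstate_t state_type;

    static locale::id id;

    explicit codecvt(size_t refs = 0) : locale::facet(refs) {}

protected:
    virtual result do_in(mbstate_t&,
        const extern_type* from, const extern_type*,
        const extern_type*& from_next,
        intern_type* to, intern_type*, intern_type*& to_next) const
    {
        // UTF-8 -> UTF-32 decoding would go here.
        from_next = from; to_next = to;
        return error;
    }

    virtual result do_out(mbstate_t&,
        const intern_type* from, const intern_type*,
        const intern_type*& from_next,
        extern_type* to, extern_type*, extern_type*& to_next) const
    {
        // UTF-32 -> UTF-8 encoding would go here.
        from_next = from; to_next = to;
        return error;
    }
};

locale::id codecvt<unicode::code, char, mbstate_t>::id;

} // namespace std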

About Unicode strings:
I suggest having a codepoint_string, with the string of code units
as a template parameter. Its interface should work with 21-bit
code point values (stored in 32 bits), while internally these are
converted to UTF-8 or UTF-16, or remain UTF-32.
template <class CodeUnitString> class codepoint_string {
    CodeUnitString code_units;
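    // begin(), end(), operator[] etc. present 32-bit code point
    // values, decoding the underlying code units on the fly.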
    // ...
};
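
For instance, the encoding work that a push_back on
codepoint_string<std::string> would hide might look like this
(append_utf8 is a hypothetical helper, just to illustrate):

#include <string>
#include <boost/cstdint.hpp>

// Encode one code point as UTF-8 and append the resulting one to
// four code units.
void append_utf8(std::string& code_units, boost::uint32_t cp)
{
    if (cp < 0x80) {
        code_units += static_cast<char>(cp);
    } else if (cp < 0x800) {
        code_units += static_cast<char>(0xC0 | (cp >> 6));
        code_units += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        code_units += static_cast<char>(0xE0 | (cp >> 12));
        code_units += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        code_units += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        code_units += static_cast<char>(0xF0 | (cp >> 18));
        code_units += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        code_units += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        code_units += static_cast<char>(0x80 | (cp & 0x3F));
    }
}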

The real unicode::string would be the character string, whose
interface works in terms of characters: a base character together
with its combining marks.
template <class CodePointString> class string {
    CodePointString codepoints;
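    // iteration works character by character: each step yields a
    // base code point together with its combining marks.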
    // ...
};

So unicode::string<unicode::codepoint_string<std::string> > would
be a UTF-8-encoded string that is manipulated in terms of its
characters.
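
To make "characters" concrete: "é" can be stored as the two code
points U+0065 U+0301 (e followed by a combining acute accent), but
should count as a single character. A rough sketch (checking only
the combining diacritical marks block, where real code would
consult the Unicode character database):

#include <cstddef>
#include <vector>
#include <boost/cstdint.hpp>

// Crude approximation of "is a combining mark".
bool is_combining(boost::uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Each character is a base code point plus any combining marks
// that follow it.
std::size_t character_count(std::vector<boost::uint32_t> const& cps)
{
    std::size_t n = 0;
    for (std::size_t i = 0; i < cps.size(); ++i)
        if (!is_combining(cps[i]))
            ++n;
    return n;
}

// { 0x65, 0x301 } (a decomposed "é") counts as 1 character even
// though it is 2 code points.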

unicode::string should take care of searching correctly, that is,
for a character string rather than a code point string.
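
A sketch of why: a code point level search can report a match that
ends in the middle of a character.

#include <algorithm>
#include <cassert>

int main()
{
    // "café" with a decomposed é: 'e' followed by U+0301,
    // COMBINING ACUTE ACCENT.
    unsigned haystack[] = { 'c', 'a', 'f', 'e', 0x0301 };
    unsigned needle[]   = { 'c', 'a', 'f', 'e' };

    // A code point level search claims "cafe" occurs in "café":
    assert(std::search(haystack, haystack + 5,
                       needle, needle + 4) == haystack);

    // A character level search should reject this match: the
    // final "e" of the needle falls in the middle of the
    // character e + U+0301.
    return 0;
}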

operator< has never done "the right thing" anyway: it compares by
code unit value, which does not handle the difference between
uppercase and lowercase sensibly, for example. Probably, locales
should be used for collation. The Unicode collation algorithm is
pretty well specified.
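
(By code unit value "Z" < "a", since 0x5A < 0x61, which is no use
for sorting.) Locale-based comparison would look something like
this; the locale name is platform-dependent, and whether the
system's locales implement the Unicode collation algorithm is
another question:

#include <locale>
#include <string>

// Compare two strings with the locale's collate facet rather than
// by code unit value.
bool precedes(std::string const& a, std::string const& b)
{
    std::locale loc("en_US.UTF-8"); // platform-dependent name
    return loc(a, b); // uses std::collate<char>
}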

Hope all this is clear...
Regards,
Rogier