From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2004-10-19 05:27:52
I've recently started on the first draft of a Unicode library.
One assumption I think is wrong is that wchar_t is suitable for
Unicode. Correct me if I'm wrong, but IIRC wchar_t is 16 bits on
Microsoft compilers, for example. On these compilers the
utf8_codecvt_facet implementation will cut off any code point above
0xFFFF. (U+1D12C will come out as U+D12C.)
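A minimal sketch of the problem, assuming sizeof(wchar_t) == 2 as on
Microsoft compilers (the code point is just an example):

    #include <cassert>

    int main() {
        unsigned long cp = 0x1D12C;            // a code point outside the BMP
        wchar_t w = static_cast<wchar_t>(cp);  // silently truncated to 16 bits
        assert(w == 0xD12C);                   // a completely different character
    }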
I think defining unicode::code as uint32_t would be much better. The
problem is that codecvt is only implemented for wchar_t and char, so
it's not possible to make a Unicode codecvt without manually adding a
(dummy) specialisation of codecvt<unicode::code, char, mbstate_t> to
the std namespace. I guess this is the reason that Ron Garcia just
used wchar_t.
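Roughly, the facet would have to take this shape (a sketch only;
unicode::code and the overridden members are assumptions, not
existing library code):

    #include <locale>
    #include <boost/cstdint.hpp>

    namespace unicode {
        typedef boost::uint32_t code;

        // A UTF-8 facet would derive from the codecvt instantiation
        // for unicode::code. Implementations are only required to
        // provide the char and wchar_t instantiations, hence the
        // need for a dummy codecvt<code, char, mbstate_t> in std.
        class utf8_codecvt
            : public std::codecvt<code, char, std::mbstate_t>
        {
        protected:
            virtual result do_in(std::mbstate_t& state,
                const char* from, const char* from_end,
                const char*& from_next,
                code* to, code* to_end, code*& to_next) const;
            // do_out, do_length, do_max_length, ... likewise
        };
    }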
About Unicode strings:
I suggest having a codepoint_string, with the string of code units as
a template parameter. Its interface should work with 21-bit (in
practice, 32-bit) values, while internally these are converted to
UTF-8 or UTF-16, or remain UTF-32.

    template <class CodeUnitString> class codepoint_string {
        // The underlying encoded string, e.g. std::string for UTF-8.
        CodeUnitString code_units;
        // ...
    };
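To make this concrete, here is a minimal sketch of appending a code
point when CodeUnitString is std::string (UTF-8). The encoding steps
are standard UTF-8; the member itself is an assumption about the
eventual interface, and validation (surrogates, range) is omitted:

    // Append one code point, encoding it into the underlying
    // UTF-8 code unit string.
    void push_back(boost::uint32_t cp) {
        if (cp < 0x80) {                   // 1 byte: ASCII
            code_units += char(cp);
        } else if (cp < 0x800) {           // 2 bytes
            code_units += char(0xC0 | (cp >> 6));
            code_units += char(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {         // 3 bytes
            code_units += char(0xE0 | (cp >> 12));
            code_units += char(0x80 | ((cp >> 6) & 0x3F));
            code_units += char(0x80 | (cp & 0x3F));
        } else {                           // 4 bytes, up to U+10FFFF
            code_units += char(0xF0 | (cp >> 18));
            code_units += char(0x80 | ((cp >> 12) & 0x3F));
            code_units += char(0x80 | ((cp >> 6) & 0x3F));
            code_units += char(0x80 | (cp & 0x3F));
        }
    }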
The real unicode::string would be the character string: its interface
presents a base character together with its combining marks as one
character.

    template <class CodePointString> class string {
        // The underlying code point string; iteration and searching
        // work on whole characters (base + combining marks).
        CodePointString codepoints;
        // ...
    };
So unicode::string<unicode::codepoint_string<std::string> > would be a
UTF-8-encoded string that is manipulated through its characters.
unicode::string should take care of searching correctly, i.e. for a
character string rather than for a code point sequence.
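An example of why this matters: take "cafe" followed by U+0301
(combining acute). That is five code points but four characters, the
last being "e" + acute. A code point search for "e" finds a hit at
index 3; a character search must reject it, because that "e" is only
part of a character. The boundary test is essentially this
(simplified, and ucd::combining_class is an assumed lookup into the
Unicode character database):

    // A match only counts if it starts and ends on a character
    // boundary. Simplified rule: a code point starts a new
    // character iff it does not combine with the previous one.
    bool is_character_start(boost::uint32_t cp) {
        return ucd::combining_class(cp) == 0;
    }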
operator< has never done "the right thing" anyway: it compares raw
code unit values, which is not dictionary order; its handling of
uppercase versus lowercase is one example. Probably, locales should
be used for collation. The Unicode collation algorithm is pretty
well specified.
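Locale-based collation already works with the standard library along
these lines (a sketch; whether a Unicode-aware locale is installed is
implementation-dependent):

    #include <locale>
    #include <string>

    // Compare two strings by a locale's collation rules rather
    // than by raw code unit values.
    bool collate_less(const std::wstring& a, const std::wstring& b,
                      const std::locale& loc)
    {
        const std::collate<wchar_t>& col =
            std::use_facet<std::collate<wchar_t> >(loc);
        return col.compare(a.data(), a.data() + a.size(),
                           b.data(), b.data() + b.size()) < 0;
    }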
Hope all this is clear...
Regards,
Rogier