Subject: Re: [boost] Unicode and codecvt facets
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-07-05 13:01:40
On 05/07/10 17:27, Artyom wrote:
>>
>> As some may know, I am working on a Unicode library that I plan to submit to
>> Boost fairly soon.
>>
>
> Take a look at the Boost.Locale proposal.
I know of it, yes.
But my library purposely *doesn't* use the standard C++ locale subsystem
because it's slow, broken, and inflexible.
Nevertheless I want to provide the ability to bridge my library with
that system.
>
>> The codecs in that library are based around iterators and ranges, but since
>> there was some demand for support
>> for codecvt facets I am working on adapting those into that form as well.
>>
>> Unfortunately, it seems it is only possible to subclass std::codecvt<char,
>> char, mbstate_t> and
>> std::codecvt<wchar_t, char, mbstate_t>.
>
> Yes, these are actually the only specialized classes.
I was hoping I could specialize some more myself.
Some implementations appear to support using arbitrary codecvt facets
just fine, but not GCC's and MSVC's.
> More than that,
> std::codecvt<char, char, mbstate_t>
> should be a "no-convert" facet.
I'm talking about types derived from these.
There is no requirement that subclasses of std::codecvt<char, char,
mbstate_t> be non-converting; only std::codecvt<char, char, mbstate_t>
itself is.
> You can derive from these two classes and re-implement them (like I did in
> Boost.Locale).
That's indeed what I said I can do, but as I said I find that very limiting.
> Also, I strongly recommend taking a look at locales and iostreams in the
> standard library if you are working with Unicode in C++.
The thing is, I'm not sure it's worth delving into it too much. On top
of being a so-so design, the popular implementations seem to all do
things differently and have different limitations.
>
>>
>> What I wonder is if there is really a point to facets, then.
>> std::codecvt<wchar_t, char, mbstate_t> means that the in-memory charset would
>> be UTF-16 or UTF-32 (depending on the size of wchar_t) while the file would be
>> UTF-8.
>
> Not exactly: the narrow encoding may be any 8-bit encoding, even something like
> Latin-1 or Shift-JIS (and UTF-8 as well).
My library doesn't aim at providing code conversion from/to every
character set ever invented, which is why I just put UTF-8 in there.
Regardless, I intend to allow defining a codecvt facet from any pair of
objects modeling the Converter concept, so nothing would prevent someone
from writing their own or chaining them to do whatever they want,
provided the result converts between char and char or between wchar_t
and char, since it seems there is no way around that.
That way you can also do normalization, case conversion or whatnot with
a codecvt facet.
> C++0x provides char16_t and char32_t to fix this bug in the standard.
GCC has those types in C++0x mode, but doesn't support codecvt facets
with them.
>
>>
>> Why do people even use utf8_codecvt_facet anyway? What's wrong with dealing
>> with UTF-8 rather than
>> maybe UTF-16 or UTF-32?
>>
>
> Ask Windows developers; they use wide strings because it is the only way to work
> correctly with their OS.
utf8_codecvt_facet is a utility provided by Boost in the detail
namespace, which some libraries not particularly tied to Windows appear
to use.