Subject: Re: [boost] Unicode and codecvt facets
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2010-07-05 13:01:40
On 05/07/10 17:27, Artyom wrote:
>>
>> As some may know, I am working on a Unicode library that I plan to submit to
>> Boost fairly soon.
>>
>
> Take a look at the Boost.Locale proposal.
I know of it, yes.
But my library purposely *doesn't* use the standard C++ locale subsystem
because it's slow, broken, and inflexible.
Nevertheless I want to provide the ability to bridge my library with
that system.
>
>> The codecs in that library are based around iterators and ranges, but since
>> there was some demand for support
>> for codecvt facets I am working on adapting those into that form as well.
>>
>> Unfortunately, it seems it is only possible to subclass std::codecvt<char,
>> char, mbstate_t> and
>> std::codecvt<wchar_t, char, mbstate_t>.
>
> Yes, these are actually the only specialized classes.
I was hoping I could specialize some more myself.
Some implementations appear to support using arbitrary codecvt facets
just fine, but not GCC's and MSVC's.
> More than that,
> std::codecvt<char, char, mbstate_t>
> should be a "no-convert" facet.
I'm talking about types derived from these.
There is no requirement that subclasses of std::codecvt<char, char,
mbstate_t> be non-converting; only std::codecvt<char, char, mbstate_t>
itself is.
> You can derive from these two classes and re-implement them (like I did in
> Boost.Locale).
That's indeed what I said I can do, but as I said I find that very limiting.
> Also, I strongly recommend taking a look at locales and iostreams in the
> standard library if you are working with Unicode in C++.
The thing is, I'm not sure it's worth delving into it too much. On top
of being a so-so design, the popular implementations seem to all do
things differently and have different limitations.
>
>>
>> What I wonder is if there is really a point to facets, then.
>> std::codecvt<wchar_t, char, mbstate_t> means that the in-memory charset would
>> be UTF-16 or UTF-32 (depending on the size of wchar_t) while the file would be
>> UTF-8.
>
> Not exactly: the narrow encoding may be any 8-bit encoding, even something like
> Latin-1 or Shift-JIS (and UTF-8 as well).
My library doesn't aim at providing code conversion from/to every
character set ever invented, which is why I just put UTF-8 in there.
Regardless, I intend to allow defining a codecvt facet from any pair of
objects modeling the Converter concept, so nothing would prevent someone
from writing their own or chaining them to do whatever they want,
provided the result converts between char and char or between wchar_t
and char, since it seems there is no way around that.
That way you can also do normalization, case conversion or whatnot with
a codecvt facet.
> C++0x provides char16_t and char32_t to fix this bug in the standard.
GCC has those types in C++0x mode, but doesn't support codecvt facets
with them.
>
>>
>> Why do people even use utf8_codecvt_facet anyway? What's wrong with dealing
>> with UTF-8 rather than
>> maybe UTF-16 or UTF-32?
>>
>
> Ask Windows developers; they use wide strings because it is the only way to work
> correctly with their OS.
utf8_codecvt_facet is a utility provided by Boost in the detail
namespace, which some libraries not particularly tied to Windows appear
to use.