Boost mailing page: Re: [boost] Unicode and codecvt facets

Subject: Re: [boost] Unicode and codecvt facets
From: Artyom (artyomtnk_at_[hidden])
Date: 2010-07-05 12:27:46

Next message: Robert Ramey: "[boost] auto-link with gcc/windows"
Previous message: Mathias Gaunard: "[boost] Unicode and codecvt facets"
In reply to: Mathias Gaunard: "[boost] Unicode and codecvt facets"
Next in thread: Mathias Gaunard: "Re: [boost] Unicode and codecvt facets"
Reply: Mathias Gaunard: "Re: [boost] Unicode and codecvt facets"

>
> As some may know, I am working on a Unicode library that I plan to submit
> to Boost fairly soon.
>

Take a look at the Boost.Locale proposal.

> The codecs in that library are based around iterators and ranges, but since
> there was some demand for support for codecvt facets I am working on
> adapting those into that form as well.
>
> Unfortunately, it seems it is only possible to subclass
> std::codecvt<char, char, mbstate_t> and
> std::codecvt<wchar_t, char, mbstate_t>.

Yes, these are actually the only specialized classes. More than that,
std::codecvt<char, char, mbstate_t> should be a "noconvert" facet.
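To illustrate the "noconvert" point, here is a minimal sketch that asks the classic locale's char-to-char facet whether it ever converts anything. The function name is made up for this example; only std::use_facet, std::locale::classic() and always_noconv() come from the standard library.

```cpp
#include <cwchar>
#include <locale>

// Hypothetical helper: true when the classic locale's
// codecvt<char, char, mbstate_t> facet never performs a conversion.
bool narrow_codecvt_is_noconv() {
    typedef std::codecvt<char, char, std::mbstate_t> facet_type;
    const facet_type& f = std::use_facet<facet_type>(std::locale::classic());
    return f.always_noconv();
}
```

This is why a narrow std::filebuf can pass bytes through unchanged.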

> I personally don't know and understand that much about iostreams/locales,
> but I have looked quickly at libstdc++'s implementation and it doesn't seem
> like it is possible for std::locale to contain any other instance of
> codecvt.

You can derive from these two classes and re-implement them (as I did in
Boost.Locale).

Also, I strongly recommend taking a look at locales and iostreams in the
standard library if you are working with Unicode in C++.
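As a rough sketch of what deriving and re-implementing can look like (this is not the Boost.Locale code), the facet below subclasses std::codecvt<wchar_t, char, mbstate_t> and treats the narrow side as Latin1. The names latin1_codecvt and latin1_to_wide are invented for this example, and it assumes a C++0x/C++11 compiler for override, noexcept and nullptr.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical facet: narrow side is Latin1, wide side holds code points.
class latin1_codecvt : public std::codecvt<wchar_t, char, std::mbstate_t> {
public:
    explicit latin1_codecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    // Narrow to wide: every Latin1 byte maps to the code point of equal value.
    result do_in(std::mbstate_t&, const char* from, const char* from_end,
                 const char*& from_next, wchar_t* to, wchar_t* to_end,
                 wchar_t*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end)
            *to_next++ = static_cast<wchar_t>(
                static_cast<unsigned char>(*from_next++));
        return from_next == from_end ? ok : partial;
    }

    // Wide to narrow: anything above U+00FF is not representable in Latin1.
    result do_out(std::mbstate_t&, const wchar_t* from, const wchar_t* from_end,
                  const wchar_t*& from_next, char* to, char* to_end,
                  char*& to_next) const override {
        from_next = from;
        to_next = to;
        while (from_next != from_end && to_next != to_end) {
            if (static_cast<unsigned long>(*from_next) > 0xFF)
                return error;
            *to_next++ = static_cast<char>(
                static_cast<unsigned char>(*from_next++));
        }
        return from_next == from_end ? ok : partial;
    }

    // Latin1 is stateless, so there is never a shift sequence to write.
    result do_unshift(std::mbstate_t&, char* to, char*,
                      char*& to_next) const override {
        to_next = to;
        return noconv;
    }

    int do_encoding() const noexcept override { return 1; }  // 1 byte per char
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }

    int do_length(std::mbstate_t&, const char* from, const char* from_end,
                  std::size_t max) const override {
        const std::size_t n = static_cast<std::size_t>(from_end - from);
        return static_cast<int>(n < max ? n : max);
    }
};

// Hypothetical helper: install the facet in a locale and widen a string.
std::wstring latin1_to_wide(const std::string& bytes) {
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;
    std::locale loc(std::locale::classic(), new latin1_codecvt);
    const facet_type& cvt = std::use_facet<facet_type>(loc);

    std::wstring out(bytes.size(), L'\0');
    if (bytes.empty())
        return out;
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const std::codecvt_base::result r =
        cvt.in(state, bytes.data(), bytes.data() + bytes.size(), from_next,
               &out[0], &out[0] + out.size(), to_next);
    assert(r == std::codecvt_base::ok);
    (void)r;
    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

Because the derived class inherits the base facet's id, the locale built this way hands out latin1_codecvt wherever codecvt<wchar_t, char, mbstate_t> is requested, so a wide file stream imbued with it would use the same conversion.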

>
> What I wonder is if there is really a point to facets, then.
> std::codecvt<wchar_t, char, mbstate_t> means that the in-memory charset
> would be UTF-16 or UTF-32 (depending on the size of wchar_t) while the file
> would be UTF-8.

Not exactly. The narrow encoding may be any 8-bit encoding, even something
like Latin1 or Shift-JIS (and UTF-8 as well).

> The problem is that wchar_t is platform-dependent and not really reliable,
> so it's not really something I'd recommend to use as the in-memory
> representation to deal with Unicode.

Welcome to the broken Unicode world of C++. Yes, wchar_t is platform
dependent; if you want to use it you should support both encodings, UTF-16
and UTF-32 (technically it may even be 8 bits wide, but there are no such
implementations).
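A small sketch of what supporting both encodings means in practice: the same code point needs a surrogate pair when wchar_t is 16 bits wide and a single unit when it is 32 bits wide. The function name is made up for this example, and it assumes the input is a valid Unicode scalar value and that char32_t (C++0x) is available.

```cpp
#include <string>

// Hypothetical helper: append one code point to a wide string, using UTF-16
// when wchar_t is 2 bytes wide and UTF-32 otherwise.
// Assumes cp is a valid Unicode scalar value.
void append_code_point(std::wstring& out, char32_t cp) {
    if (sizeof(wchar_t) == 2 && cp > 0xFFFF) {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    } else {
        out.push_back(static_cast<wchar_t>(cp));
    }
}
```

Any code that indexes or counts wchar_t units has to allow for both outcomes.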

C++0x provides char16_t and char32_t to fix this bug in the standard.
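For comparison, a short sketch with the new character types, whose encodings do not depend on the platform: a u"" literal is UTF-16 and a U"" literal is UTF-32. The two function names are invented for this example, and it assumes a compiler that already supports these C++0x features.

```cpp
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane.
std::u16string smiley_utf16() { return u"\U0001F600"; }  // surrogate pair
std::u32string smiley_utf32() { return U"\U0001F600"; }  // single code unit
```

The UTF-16 string holds two code units and the UTF-32 string holds one, whatever the width of wchar_t is.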

>
> Why do people even use utf8_codecvt_facet anyway? What's wrong with dealing
> with UTF-8 rather than maybe UTF-16 or UTF-32?
>

Ask Windows developers: they use wide strings because it is the only way to
work correctly with their OS.

Artyom
