Subject: Re: [boost] Review Request: Boost.Locale
From: Gevorg Voskanyan (v_gevorg_at_[hidden])
Date: 2010-05-24 13:57:35


Artyom wrote:
> - There is absolutely no information given about std::mbstate_t, which
> should save intermediate data between conversions, so there is actually
> no way to pass anything between sequential calls of
> std::codecvt<...>::in/out. So even if I observe the first surrogate of a
> pair, there is no way to pass this information to the next call, and thus
> I lose this information.

Ah, yes, mbstate_t. It may be good enough for UTF-8 (multibyte sequence) but may not be usable for UTF-16 (multi-wchar_t sequence :-) on Windows). Thanks, that fully explains it.
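
To make the "multi-wchar_t sequence" concrete, here is a minimal sketch of how one code point above U+FFFF splits into two UTF-16 code units. The function name to_surrogates is my own, purely for illustration; it is not part of any library discussed here.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical helper: splits a code point above U+FFFF into its
// UTF-16 surrogate pair and returns it as {high, low}.
inline std::pair<std::uint16_t, std::uint16_t> to_surrogates(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return {high, low};
}
```

If the caller's output buffer has room for only the high surrogate, the low one has to be remembered somewhere until the next call, and that "somewhere" is what mbstate_t is supposed to be.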

> This is exactly the reason you can't implement utf-8 - utf-16 codepage
> conversion using codecvt facet.

And still, codecvt<char16_t, char, mbstate_t> is specified to convert between UTF-8 and UTF-16 in the C++0x draft. That seems to suggest the new standard will require mbstate_t to be usable for UTF-16 as well.

> On the other hand, there are no such limitations for UTF-32 encodings,
> as there is no information to preserve between calls.
>
> Additional note: it is also not possible to convert stateful encodings
> like UTF-7, as there is no way to move state around.
>
> So generally std::codecvt is not well designed to be derived
> from, so the only way to do stream conversion correctly is to redesign
> this facet, but in that case you can't use it with the std::iostreams library.

Yes, I see.

> >
> > For the original (non-compliance) point I raised it would
> > be interesting to see how well codecvt< char32_t, char,
> > std::mbstate_t > is going to be implemented under windows
> > :)
>
> There is no problem to implement it correctly.

My point is that, if that is implemented correctly, then strictly speaking an implementation where wchar_t is 16 bits wide (sizeof(wchar_t) == 2 with 8-bit bytes) will become non-conforming according to 3.9.1/5. Which would be interesting to see :)
As intended by the standard, wchar_t should have at least 21 bits for C++ implementations supporting Unicode, but of course that isn't going to be fixed for Windows compilers in the foreseeable future.
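
The 21-bit figure comes from the largest Unicode code point, U+10FFFF. A small sketch of a check for whether a given character type can hold it; holds_any_code_point is a name I made up for this illustration, and the code assumes a compiler that already provides char16_t and char32_t:

```cpp
#include <cassert>
#include <limits>

// Hypothetical helper: true if CharT can represent every Unicode code
// point, i.e. its maximum value is at least U+10FFFF (21 value bits).
template <typename CharT>
constexpr bool holds_any_code_point()
{
    return static_cast<unsigned long long>(std::numeric_limits<CharT>::max())
           >= 0x10FFFFull;
}
```

holds_any_code_point<wchar_t>() would be expected to come out false where wchar_t is 16 bits wide and true where it is 32 bits wide.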

> >
> > BTW, I see some interesting additions to codecvts in n3090,
> > 22.5.
> > Any plans to implement them in Boost.Locale?
>
> Along the same lines, when char32_t/char16_t become available, hopefully
> these facets will be implemented. But today it is impossible to
> implement UTF-16 codecvt facets.

You're right, implementing them would require implementation-specific knowledge about std::mbstate_t.
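
For illustration, this is roughly what the state would have to carry. It is only a sketch with an explicit, hand-rolled state struct; utf16_state, result and write_utf16 are hypothetical names, and nothing here says how any real std::mbstate_t is laid out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical explicit state: the low surrogate still owed to the caller.
// Low surrogates lie in DC00..DFFF, so 0 can mean "nothing pending".
struct utf16_state {
    std::uint16_t pending_low = 0;
};

enum class result { ok, partial };

// Writes the UTF-16 form of cp into [out, out_end), resuming from st.
// Advances out past whatever was written. Returns partial if the buffer
// filled up first; call again with the same cp and more room.
inline result write_utf16(std::uint32_t cp, utf16_state& st,
                          std::uint16_t*& out, std::uint16_t* out_end)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (st.pending_low == 0) {
        if (out == out_end) return result::partial;
        if (cp < 0x10000) {
            *out++ = static_cast<std::uint16_t>(cp);
            return result::ok;
        }
        const std::uint32_t v = cp - 0x10000;
        *out++ = static_cast<std::uint16_t>(0xD800 + (v >> 10));
        st.pending_low = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    }
    if (out == out_end) return result::partial;
    *out++ = st.pending_low;  // finish the pair started earlier
    st.pending_low = 0;
    return result::ok;
}
```

With a one-unit output buffer, the first call for U+1F600 writes the high surrogate and returns partial; the second call writes the low surrogate from the saved state. A codecvt facet has no portable way to stash that pending unit inside mbstate_t.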

> My personal opinion: avoid wide characters and any "Unicode"
> characters, because they are the best way to fool yourself about "Unicode"
> support; in reality they do not provide any advantage over plain
> char and UTF-8 encoding.
>
> So, unless you are using the Win32 API, avoid wide characters.
> However, too many programmers would disagree with me, especially
> Windows programmers who grew up on the "Unicode" and "Wide" API.
> So Boost.Locale fully supports wide characters.

Despite having started as a Windows programmer myself, I don't disagree with you on this point. On the contrary, I've always been uncomfortable with the Windows A/W API, and would've much preferred UTF-8 instead, as is the case in the *nix world. Another reason I am still forced to use wide characters is wxWidgets, which (in its 2.x releases) assumes ANSI unless wxUSE_UNICODE is defined to a non-zero value, in which case it uses wide characters in its API, essentially following the Windows model. Fortunately, this is going to change in the soon-to-be-released wxWidgets 3.0, which will have a UTF-8 interface.

> >
> > The non-iterator interface is a real pain when using codecvt, I
> > admit.
>
> I think the best interface would be something like a boost::iostreams
> filter, but I think this should be part of the iostreams library rather
> than localization. Also, it should not pass through a wide encoding in
> the middle when converting UTF-8 to ISO-8859-8.
>
> But that is different story.
>
> For simple string conversion boost::locale provides from_utf/to_utf
> that work correctly with utf-8/16/32.

Looking forward to Boost.Locale review!

> Artyom

Artyom, thank you very much for sharing your insightful ideas and satisfying my curiosity!

Best Regards,
Gevorg