Boost mailing page: Re: [boost] [unicode] Interest Check / Proof of Concept


Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: James Porter (porterj_at_[hidden])
Date: 2008-11-19 17:27:07

Next message: Stjepan Rajko: "Re: [boost] math statistical distribution: multivariate gaussian"
Previous message: Anthony Williams: "Re: [boost] MSVC exception_ptr"
In reply to: Zach Laine: "Re: [boost] [unicode] Interest Check / Proof of Concept"
Next in thread: Eric Niebler: "Re: [boost] [unicode] Interest Check / Proof of Concept"

Zach Laine wrote:
> I would love to see a Unicode support library added to Boost.
> However, I question the usefulness of another string class, or in this
> case another hierarchy of string classes. Interoperability with
> std::string (and QString, and CString, and a thousand other
> API-specific string classes) is always thorny. I'd much rather see an
> iterators- and algorithms-based approach, along the lines of your
> ct_string::iterator.

It might get equally thorny just trying to get the algorithms to
recognize all the strange varieties of strings out there without writing
iterator facades for the lot of them! It's probably possible, but I'm
not sure I'd want it to be the primary interface for encoding. Most custom
string types (both QString and CString, for instance) are designed to
work with only one encoding (UTF-16 seems popular), so if you had some
reason that you needed to store your strings in UTF-8, or - god forbid -
Shift-JIS, you'd be out of luck.

This is especially important when you're reading in arbitrary data whose
encoding you don't know at compile-time. If someone sends me a message
encoded in Shift-JIS and I want to forward it on, I don't want to have
to decode it into UTF-8 and then re-encode it into Shift-JIS before I
send it; I just want to store it in Shift-JIS.
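As a rough illustration of what "just store it in Shift-JIS" could look like, here is a minimal sketch of a runtime-tagged string. All of the names (encoding, tagged_string, forward) are hypothetical and are not taken from the proof-of-concept library under discussion; the sketch only assumes that the bytes and an encoding tag travel together.

```cpp
#include <string>
#include <utility>

// Hypothetical names throughout; not part of any Boost library.
enum class encoding { utf8, utf16le, shift_jis };

// Raw bytes plus a tag that records, at run time, how to interpret them.
class tagged_string {
public:
    tagged_string(std::string bytes, encoding enc)
        : bytes_(std::move(bytes)), enc_(enc) {}

    const std::string& bytes() const { return bytes_; }
    encoding enc() const { return enc_; }

private:
    std::string bytes_;
    encoding enc_;
};

// Forwarding copies the bytes and the tag untouched; nothing is transcoded.
tagged_string forward(const tagged_string& incoming) {
    return tagged_string(incoming.bytes(), incoming.enc());
}
```

A message that arrives as Shift-JIS leaves as the same Shift-JIS bytes, and the tag is available if some later step does need to decode it.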

> Instead of doing this:
>
>> baz.encode(bar,rt::utf8);
>
> I'd rather be able to do something like this:
>
> typedef std::basic_string<some_32bit_char_type> unicode_string;
>
> unicode_string u_string = /*...*/;
> std::string std_string = /*...*/;
>
> typedef boost::recoding_iterator<boost::ucs4, boost::utf8> ucs4_to_utf8_iter;
> std::copy(ucs4_to_utf8_iter(u_string.begin()),
> ucs4_to_utf8_iter(u_string.end()), std::back_inserter(std_string));

std::strings aren't really appropriate for this purpose, at least not
without a lot of changes to their interface, since they're designed for
compile-time-tagged, fixed-width-encoding strings. In your examples, you
have to remember what the source encoding is. This is easy enough if you
know that "all my strings are in UTF-8", but if you start working with
runtime-tagged strings (see my Shift-JIS example above), you'd need to
keep track of every encoding in use.
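To make that concrete, here is roughly what the quoted UCS-4 to UTF-8 copy amounts to using only the standard library. This is a sketch, not the proposed boost::recoding_iterator (which is hypothetical in the quoted mail); encode_utf8 and ucs4_to_utf8 are names I made up for illustration. Note that nothing in the types records that the result is UTF-8; the caller simply has to know.

```cpp
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: writes the UTF-8 encoding of one code point to `out`.
template <typename OutputIt>
OutputIt encode_utf8(char32_t cp, OutputIt out) {
    // Surrogates and values above U+10FFFF are not valid code points.
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        throw std::invalid_argument("invalid Unicode code point");

    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// The caller must already know the input is UCS-4; the returned
// std::string carries no record that its bytes are UTF-8.
std::string ucs4_to_utf8(const std::u32string& in) {
    std::string out;
    auto it = std::back_inserter(out);
    for (char32_t cp : in)
        it = encode_utf8(cp, it);
    return out;
}
```

This works fine when every string in the program has a known, fixed encoding. With runtime-tagged input you would need a switch over the source encoding at each call site to pick the right conversion.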

- Jim
