Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: James Porter (porterj_at_[hidden])
Date: 2008-11-19 17:27:07
Zach Laine wrote:
> I would love to see a Unicode support library added to Boost.
> However, I question the usefulness of another string class, or in this
> case another hierarchy of string classes.  Interoperability with
> std::string (and QString, and CString, and a thousand other
> API-specific string classes) is always thorny.  I'd much rather see an
> iterators- and algorithms-based approach, along the lines of your
> ct_string::iterator.
It might get equally thorny just trying to get the algorithms to 
recognize all the strange varieties of strings out there without writing 
iterator facades for the lot of them! It's probably possible, but I'm 
not sure I'd want it to be the primary interface for encoding. Most custom 
string types (both QString and CString, for instance) are designed to 
work with only one encoding (UTF-16 seems popular), so if you had some 
reason that you needed to store your strings in UTF-8, or - god forbid - 
Shift-JIS, you'd be out of luck.
This is especially important when you're reading in arbitrary data whose 
encoding you don't know at compile-time. If someone sends me a message 
encoded in Shift-JIS and I want to forward it on, I don't want to have 
to decode it into UTF-8 and then re-encode it into Shift-JIS before I 
send it; I just want to store it in Shift-JIS.
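
To make the store-and-forward case concrete, here is a minimal sketch of
a runtime-tagged string. The names (encoding, tagged_string, forward)
are hypothetical and chosen only for illustration; they are not taken
from the proposed library. The point is that the raw bytes and their
encoding tag travel together, so forwarding a message never has to
decode or re-encode it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoding tag; a real library would cover far more encodings.
enum class encoding { utf8, utf16, shift_jis };

// A byte buffer that carries its encoding as a runtime value.
class tagged_string {
public:
    tagged_string(std::string bytes, encoding enc)
        : bytes_(std::move(bytes)), enc_(enc) {}

    const std::string& bytes() const { return bytes_; }
    encoding tag() const { return enc_; }

private:
    std::string bytes_;  // raw bytes, stored exactly as received
    encoding enc_;       // known only at runtime
};

// Forwarding hands the message on untouched: same bytes, same tag.
inline tagged_string forward(const tagged_string& received) {
    return received;
}
```

A caller would build a tagged_string from the incoming bytes plus
whatever encoding the message header declared, and pass the result of
forward() straight to the sender. Transcoding only happens if some
consumer explicitly asks for a different encoding.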
>  Instead of doing this:
> 
>>        baz.encode(bar,rt::utf8);
> 
> I'd rather be able to do something like this:
> 
> typedef std::basic_string<some_32bit_char_type> unicode_string;
> 
> unicode_string u_string = /*...*/;
> std::string std_string = /*...*/;
> 
> typedef boost::recoding_iterator<boost::ucs4, boost::utf8> ucs4_to_utf8_iter;
> std::copy(ucs4_to_utf8_iter(u_string.begin()),
> ucs4_to_utf8_iter(u_string.end()), std::back_inserter(std_string));
std::strings aren't really appropriate for this purpose, at least not 
without a lot of changes to their interface, since they're designed for 
compile-time-tagged, fixed-width-encoding strings. In your examples, you 
have to remember what the source encoding is. This is easy enough if you 
know that "all my strings are in UTF-8", but if you start working with 
runtime-tagged strings (see my Shift-JIS example above), you'd need to 
keep track of every encoding in use.
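
The bookkeeping problem can be sketched as follows. This is an
illustration under stated assumptions, not a proposed interface:
text_encoding, runtime_string, and count_code_points are hypothetical
names, and Latin-1 stands in for Shift-JIS only because its rule (one
byte per character) keeps the example short. With a bare std::string
the caller has to supply the encoding on every call; with a tagged
string the stored tag drives the dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical runtime tag; only two encodings to keep the sketch small.
enum class text_encoding { utf8, latin1 };

// Bare std::string: the caller must remember and pass the encoding.
inline std::size_t count_code_points(const std::string& bytes,
                                     text_encoding enc) {
    if (enc == text_encoding::latin1)
        return bytes.size();  // Latin-1: one byte per character

    // UTF-8: count every byte that is not a continuation byte (10xxxxxx).
    std::size_t n = 0;
    for (unsigned char c : bytes)
        if ((c & 0xC0) != 0x80)
            ++n;
    return n;
}

// Runtime-tagged string: the encoding is stored alongside the bytes.
struct runtime_string {
    std::string bytes;
    text_encoding enc;
};

// The tag is read from the string itself, so callers cannot get it wrong.
inline std::size_t count_code_points(const runtime_string& s) {
    return count_code_points(s.bytes, s.enc);
}
```

Passing the wrong tag to the first overload silently gives a wrong
answer for the same bytes, which is exactly the mistake the second
overload rules out.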
- Jim