Subject: Re: [boost] [general] What will string handling in C++ look like in the future [was Always treat ... ]
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-22 14:53:27


On Sat, Jan 22, 2011 at 12:36 AM, Patrick Horgan <phorgan1_at_[hidden]> wrote:
> On 01/21/2011 01:54 AM, Matus Chochlik wrote:
>>
>> ... elision by patrick...
>> Why not boost::string (explicitly stating in the docs that it is
>> UTF-8-based) ?
>> the name u8string suggests to me that it is meant for some special case
>> of character encoding and the (encoding agnostic/native) std::string
>> is still the way
>> to go.
>
> I think that's the truth.  std::string has some performance guarantees that
> a utf-8 based string wouldn't be able to keep.  std::string can do things,
> and people do things with std::string that a utf-8 based string can't do.

If this were really the case, then the problems you describe would already be
happening on all the platforms that use UTF-8 as the default encoding for
every locale.

>  If you set LC_COLLATE to en_US.utf8 or the equivalent (I hate the way
> locale names are not as standardized as you might like), then most of the
> standard algorithms will be locale aware and operations on your string will
> be muchly aware of the string encoding.  By switching locales, you can then
> operate on strings with other encodings.  utf-8_string isn't intended to
> operate like that.  It's specialized.
>>
>> IMO we should send the message that UTF-8 is
>> "normal"/"(semi-)standard"/"de-facto-standard"
>> and the other encodings like the native_t (or even ansi_t,
>> ibm_cp_xyz_t, string16_t,
>> string32_t, ...) are the special cases and they should be treated as such.
>
> Why would people want to lose so much of the functionality of std::string?

What functionality would they lose, exactly? Again, on many platforms
the default encoding for all (or nearly all) locales already is UTF-8, so
if you get a string from the OS API and store it in a std::string, then
it is UTF-8 encoded. I do an equal share of programming on Windows
and Linux, and I have yet to run into the problems you describe on
Linux, where the default encoding has been UTF-8 for some time now.
Actually, today I encounter more problems on Windows, where I can't
set the locale to use UTF-8 and consequently have to transcode data
from socket connections or files manually.

If you are talking about being able to have indexed random access
to "logical characters", for example on Windows with some special
encodings, then this is only platform-specific and unportable functionality.
What I propose is to extend the interface so that it would allow you to
handle the "raw byte sequences" that are now used to represent strings
of logical characters in a platform-independent way by using the Unicode
standard.
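
To make the distinction between raw bytes and logical characters concrete,
here is a minimal sketch of reading a UTF-8 std::string as a sequence of
Unicode code points. The name decode_utf8 is hypothetical (it is not an
existing Boost or standard library function), and the sketch treats one code
point as one "logical character", which ignores combining sequences.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of Boost or the standard library.
// Decodes UTF-8 bytes into Unicode code points. Malformed bytes are
// replaced with U+FFFD. Overlong forms and surrogates are NOT rejected
// here; this is an illustration, not a conforming decoder.
std::vector<char32_t> decode_utf8(const std::string& bytes)
{
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        if (len > bytes.size() - i) {                   // truncated sequence
            out.push_back(0xFFFD);
            break;
        }
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, a string holding the bytes "Mat\xC3\xBA\xC5\xA1" has size() == 7
but decodes to 5 code points. Indexing the decoded vector gives random access
to code points on any platform, at the cost of a linear decoding pass.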

>  The only advantage of a utf8_string would be automatic and continual
> verification that it's a valid utf-8 encoded string that otherwise acts as
> much as possible like a std::string.  For that you would give up a lot of
> other functionality.

Again, what exactly would you give up? The gain is not only what you describe.
For example, when a portable application writes text into a file and the
file is sent to a machine running a different platform, the string can be
read there without explicit transcoding (which would mean picking a
library/tool that can do the transcoding and using it explicitly everywhere
you potentially handle data that may come from different platforms).
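
The verification mentioned in the quoted text is also what a reader on the
other machine would want to run on incoming bytes before trusting them. A
minimal sketch of such a check follows. The name is_valid_utf8 is hypothetical
and is only one way a UTF-8-based string class might validate its contents.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not an existing Boost or standard API.
// Returns true if 's' is well-formed UTF-8: correct sequence lengths,
// proper continuation bytes, no overlong forms, no surrogates, and no
// code points above U+10FFFF.
bool is_valid_utf8(const std::string& s)
{
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if (b < 0x80) { ++i; continue; }            // ASCII byte

        std::size_t len = 0;
        unsigned long cp = 0;
        unsigned long min = 0;                      // smallest code point for this length
        if ((b & 0xE0) == 0xC0)      { len = 2; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; min = 0x10000; }
        else return false;                          // stray continuation or invalid lead byte

        if (len > n - i) return false;              // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                 // overlong encoding
        if (cp > 0x10FFFF) return false;            // outside the Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
        i += len;
    }
    return true;
}
```

A file written as UTF-8 on one platform passes this check unchanged on
another, whereas text in a legacy single-byte encoding that contains
non-ASCII bytes (for example a lone 0xE9) usually fails it, which signals
that transcoding is needed.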

BR,

Matus