Subject: Re: [boost] [unicode] Interest Check / Proof of Concept
From: Zach Laine (whatwasthataddress_at_[hidden])
Date: 2008-11-20 09:08:57


> Eric Niebler wrote:
>>
>> Agree. Thanks Zach. I'm discouraged that every time the issue of a Unicode
>> library comes up, the discussion immediately descends into a debate about
>> how to design yet another string class. Such a high level wrapper *might* be
>> useful (strong emphasis on "might"), but the core must be the Unicode
>> algorithms, and the design for a Unicode library must start there.
>
> Since it seems like there's a lot of concern with making a new string type,
> how about the following (off-the-cuff):
>
> * Iterator filters a la Zach's message:

[snip]

> * Runtime-defined filters:
>
> typedef boost::recoding_iterator<boost::utf16,boost::runtime>
> utf16_to_any_iter;
> boost::runtime *my_codec = /*...*/;
> std::copy(utf16_to_any_iter(u_string.begin(), my_codec),
> utf16_to_any_iter(u_string.end(), my_codec),
> std::back_inserter(std_string));

Yes, that's what I was thinking as well. In fact, if you look at the
Boost.GIL any_image<> and any_image_view<> templates, you'll see that
they allow the user to specify a limited number of variants (a la
Boost.Variant). So it's more restrictive than a Boost.Any, but that
might be an advantage if it allows you to detect more errors at
compile time. I think that in a given use case, one will know the
maximum number of encodings that are possible. Just
something to consider.
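
To make the bounded-set idea concrete, here is a minimal sketch. It
assumes C++17's std::variant as a stand-in for Boost.Variant, and the
tag types, any_encoding, and max_units are hypothetical names invented
for illustration, not part of GIL or any proposed Unicode library.

```cpp
#include <cstddef>
#include <variant>

// Hypothetical encoding tags (illustrative names only).
struct utf8_tag  {};
struct utf16_tag {};
struct utf32_tag {};

// A closed set of encodings selectable at runtime, in the spirit of
// GIL's any_image<>: the caller lists every alternative up front.
using any_encoding = std::variant<utf8_tag, utf16_tag, utf32_tag>;

// Maximum number of code units needed to encode one code point.
constexpr std::size_t max_units(utf8_tag)  { return 4; }
constexpr std::size_t max_units(utf16_tag) { return 2; }
constexpr std::size_t max_units(utf32_tag) { return 1; }

// Runtime dispatch over the closed set. If an alternative is added to
// any_encoding without a matching overload above, this stops compiling.
inline std::size_t max_units(const any_encoding& enc) {
    return std::visit([](auto tag) { return max_units(tag); }, enc);
}
```

With an open-ended any-style holder, a forgotten encoding would only
show up as a failed cast when that encoding is first encountered.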

> * Shorthand for the above two points:
>
> boost::transcode(u_string, boost::utf16(),
> std_string, boost::utf8());
>

Looks good, but is this function an assignment, or an append?
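
For illustration, the two semantics could be kept apart roughly as
below. This is only a sketch under assumptions: it hard-codes UTF-16 to
UTF-8, assumes well-formed input, and the names encode_utf8,
transcode_append, and transcode_assign are hypothetical, not the
proposed boost::transcode interface.

```cpp
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical helper: write one code point as UTF-8 to an output iterator.
template <typename Out>
Out encode_utf8(char32_t cp, Out out) {
    if (cp < 0x80) {
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800) {
        *out++ = static_cast<char>(0xC0 | (cp >> 6));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        *out++ = static_cast<char>(0xE0 | (cp >> 12));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        *out++ = static_cast<char>(0xF0 | (cp >> 18));
        *out++ = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        *out++ = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        *out++ = static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Append form: decode UTF-16 and write UTF-8 wherever `out` points.
// Unpaired surrogates are not diagnosed in this sketch.
template <typename Out>
Out transcode_append(const std::u16string& src, Out out) {
    for (std::size_t i = 0; i < src.size(); ++i) {
        char32_t cp = src[i];
        const bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < src.size() &&
            src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (src[i + 1] - 0xDC00);
            ++i;
        }
        out = encode_utf8(cp, out);
    }
    return out;
}

// Assignment form: clear the destination, then reuse the append form.
inline void transcode_assign(const std::u16string& src, std::string& dst) {
    dst.clear();
    transcode_append(src, std::back_inserter(dst));
}
```

Building the assignment form on top of the append form keeps a single
conversion loop, and the iterator-based append also works with
destinations other than std::string.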

> * String views that can wrap up the encoding type and the data (a container
> of some kind: strings, vector<char>s, ropes, etc):
>
> boost::estring_view<utf8> my_utf8_string(std_string);
> boost::estring_view<> my_rt_string(str, my_codec);
>
> boost::transcode(my_utf8_string, my_rt_string);

Yes. Views are notably absent in my original post. I think views are
essential for encodings that are variable in length (e.g. UTF-8).
Getting the character-location of code point N, or vice versa, and
doing it efficiently, is a must-have.
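
As a rough sketch of what such a view would have to answer for UTF-8,
consider the following. The class utf8_view and its member names are
hypothetical, the input is assumed to be well-formed, and every lookup
here is a linear scan; an efficient view would presumably cache or
index offsets instead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool is_continuation(char c) {
    return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Hypothetical non-owning view over UTF-8 data. The viewed string must
// outlive the view.
class utf8_view {
public:
    explicit utf8_view(const std::string& s) : data_(&s) {}

    // Number of code points: every non-continuation byte starts one.
    std::size_t length() const {
        std::size_t n = 0;
        for (char c : *data_)
            if (!is_continuation(c)) ++n;
        return n;
    }

    // Byte offset at which code point `n` starts, or the size in bytes
    // if `n` is past the end.
    std::size_t byte_offset(std::size_t n) const {
        std::size_t seen = 0;
        for (std::size_t i = 0; i < data_->size(); ++i) {
            if (is_continuation((*data_)[i])) continue;
            if (seen == n) return i;
            ++seen;
        }
        return data_->size();
    }

    // Index of the code point that contains the byte at `offset`.
    std::size_t code_point_index(std::size_t offset) const {
        assert(offset < data_->size());
        std::size_t idx = 0;
        for (std::size_t i = 1; i <= offset; ++i)
            if (!is_continuation((*data_)[i])) ++idx;
        return idx;
    }

private:
    const std::string* data_;
};
```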

> Luckily, most of the work I've done is in making the encoding facets
> extensible and chooseable at runtime, so I wouldn't mourn the loss of my
> (frankly none-too-zazzy) string class.

This is just what I was hoping for. In any case, the bulk of the work
will probably be in the algorithms and in the number of supported
encodings.

Zach