Subject: Re: [boost] [string] proposal
From: Matus Chochlik (chochlik_at_[hidden])
Date: 2011-01-28 16:57:59


On Fri, Jan 28, 2011 at 10:31 PM, Dean Michael Berris
<mikhailberis_at_[hidden]> wrote:
> On Sat, Jan 29, 2011 at 5:13 AM, Matus Chochlik <chochlik_at_[hidden]> wrote:
>> On Fri, Jan 28, 2011 at 9:46 PM, Dean Michael Berris
>>>>
>>>>   All the discussion started because we need UTF-8
>>>>   in strings; now we are back to the beginning?
>>>>
>>>
>>> No, the discussion started because we need a UTF-8 view of data. You
>>> missed the point I was making. And you didn't understand the document
>>> I wrote.
>>
>> Sorry, but no. The discussion was started by the proposal that we
>> should by default treat std::strings as if they were UTF-8 encoded.
>> Artyom should know because he was the one who made the original
>> proposal. The whole 'view' idea was brought up only much later.
>>
>
> And the point I was making was that, doing precisely this was the
> "wrong" way of doing it. Assuming a default encoding is "unnecessary"
> as an encoding is largely a matter of interpretation of data
> ultimately.
>
> I was attempting to solve the problem that is std::string. In the
> process I'm moving the issue away from the underlying data and moving
> it to a matter of interpretation. To do that in a manner that would
> make sense as how I see it, that means moving it into a view of the
> data that is held in a string. The string would be the data structure,
> the view an interpretation of it.
>
> I never precluded that the string can hold UTF-8 encoded data, but
> saying that is the default achieves nothing and is ultimately
> unnecessary. In the design I've been proposing the point of the matter
> is, interpreting data in a given encoding is separate from how the
> data is actually stored. Now let's say you have a UTF-8 string
> builder, what else would that write in memory aside from UTF-8 encoded
> data? It will though still yield a string, which could be interpreted
> many different ways -- I just don't see the encoding as something
> intrinsic to the string. That means a string can hold UTF-8 encoded
> data and I can wrap that in a view for UTF-16 and see that it will not
> validate correctly -- unless I wrap the string with a view for UTF-8
> first then pass that into a view for UTF-16 and transcoding can happen
> on the fly.
>
> Writing algorithms that deal with strings is different from writing
> algorithms that deal with encoded text. Those are two different levels.
>
> This explaining, and trying to explain again, the whole point of the
> matter makes me sound like a broken record. If you still don't get
> what I'm saying then I guess I'm going to have to try a different
> route and just show what I mean in terms of code at some point in
> time.

Dean, believe me, I got what you said the first time you said
it, like 200 posts ago. I know that the string data is ultimately
stored in memory as a sequence of bytes. But then you
proposed to solve my problem by suggesting the view<Encoding>
template. Then, like 50 posts ago, we finally agreed on typedef-ing
it and naming it 'text', since using something called
view<encoding_tag> is not acceptable to me.

Now, if this

typedef view<utf8_encoding_tag> text;

is the only line of code where I see the encoding, and
I'll be able to do all the text handling, i.e. searching
for code points/characters (not only bytes), searching for
words, concatenation, splitting, writing it to a file, socket,
etc. and reading it from a file, socket, etc., using it
with some c_str-like adapter with C APIs, etc., basically
doing (nearly) everything that I was able to do with std::string
*without* ever mentioning the encoding again, then you already
have me convinced. If I cannot do those things without specifying
the encoding (unless necessary), then this is useless to me
for text handling.
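
To make the requirement concrete, here is a minimal sketch of what such
a 'text' typedef could look like. Only the names view, utf8_encoding_tag
and text come from this thread; every member function below (code_points,
length, find, c_str, bytes, operator+) is a hypothetical illustration and
not an existing or proposed Boost interface. The decoder assumes
well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Hypothetical tag type; the thread names it but never defines it.
struct utf8_encoding_tag {};

// Hypothetical view: interprets the bytes of a std::string under Encoding.
template <typename Encoding>
class view;

template <>
class view<utf8_encoding_tag> {
public:
    static const std::size_t npos = static_cast<std::size_t>(-1);

    explicit view(std::string bytes) : bytes_(std::move(bytes)) {}

    // Decode the stored bytes into code points.
    // Assumption: the input is well-formed UTF-8 (no validation here).
    std::vector<char32_t> code_points() const {
        std::vector<char32_t> out;
        std::size_t i = 0;
        while (i < bytes_.size()) {
            unsigned char lead = static_cast<unsigned char>(bytes_[i]);
            // Sequence length is determined by the lead byte.
            std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2
                            : lead < 0xF0 ? 3 : 4;
            // Keep only the payload bits of the lead byte.
            char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
            // Each continuation byte contributes six more bits.
            for (std::size_t k = 1; k < len && i + k < bytes_.size(); ++k)
                cp = (cp << 6) |
                     (static_cast<unsigned char>(bytes_[i + k]) & 0x3F);
            out.push_back(cp);
            i += len;
        }
        return out;
    }

    // Length in code points, not bytes.
    std::size_t length() const { return code_points().size(); }

    // Index (in code points) of the first occurrence of cp, or npos.
    std::size_t find(char32_t cp) const {
        std::vector<char32_t> cps = code_points();
        for (std::size_t i = 0; i < cps.size(); ++i)
            if (cps[i] == cp) return i;
        return npos;
    }

    // Concatenating two UTF-8 sequences byte-wise yields valid UTF-8.
    friend view operator+(const view& a, const view& b) {
        return view(a.bytes_ + b.bytes_);
    }

    // c_str-like adapter for C APIs, and access to the raw storage.
    const char* c_str() const { return bytes_.c_str(); }
    const std::string& bytes() const { return bytes_; }

private:
    std::string bytes_;
};

// The one line where the encoding is mentioned.
typedef view<utf8_encoding_tag> text;
```

With this in place, user code would construct a text from UTF-8 bytes,
call length() or find() in terms of code points, concatenate with +, and
hand c_str() to a C API, without naming the encoding again. Searching
for words, splitting and I/O are left out of the sketch.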

Peace, Love, Best regards,

Matus