Boost mailing page: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet


Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Peter Dimov (lists_at_[hidden])
Date: 2015-10-09 17:38:56

Next message: Peter Dimov: "Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet"
Previous message: Louis Dionne: "Re: [boost] [Build] Test file name conflicts"
In reply to: Artyom Beilis: "Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet"

Artyom Beilis wrote:
> > What I meant by that is for instance
> >
> > - is 0xCC 0x81 a valid UTF-8 string?
> > - is 0x65 0xCC 0x81 0xCC 0x81 a valid UTF-8 string?
>
> Both are valid strings, and both are meaningless on their own, i.e. an
> accent without a letter, or two identical accents.
>
> Being illogical in human terms or representation does not make them UTF-8
> illegal.
>
> UTF-8 is simple, human language processing is complex.

My point here is that strictly valid UTF-8 is the valid multibyte encoding
of a valid codepoint sequence, and that the definition of "valid codepoint
sequence" may vary depending on context, so that in some contexts the above
sequences are considered invalid.

Drawing the line so that codepoints above U+10FFFF and lone surrogates are
invalid, but the above sequences are valid, is an arbitrary decision. Not
that this decision is wrong; it isn't. But it may not be what the user
needs.
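
To make the distinction concrete, here is a minimal C++ sketch of the two layers. It assumes one particular choice of "strict" (reject overlong forms, surrogates and values above U+10FFFF) and, layered on top, one hypothetical context rule (no leading combining mark from U+0300..U+036F). The names decode_strict, is_combining_diacritic and valid_in_context are illustrative and are not taken from nowide or any Boost library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode strictly well-formed UTF-8 into codepoints.
// Returns false on overlong forms, surrogates, values above U+10FFFF,
// truncated sequences, or stray bytes.
bool decode_strict(const std::string& in, std::vector<char32_t>& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0, min = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; min = 0x10000; }
        else return false;                        // stray continuation or invalid lead byte
        assert(len >= 1 && len <= 4);
        if (in.size() - i < len) return false;    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return false;                      // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        out.push_back(cp);
        i += len;
    }
    return true;
}

// Combining Diacritical Marks block only; a deliberately narrow test.
bool is_combining_diacritic(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical context-specific rule: the text must be well-formed UTF-8
// and must not start with a combining diacritic.
bool valid_in_context(const std::string& s)
{
    std::vector<char32_t> cps;
    if (!decode_strict(s, cps)) return false;
    return cps.empty() || !is_combining_diacritic(cps.front());
}
```

Under these assumptions, decode_strict accepts both 0xCC 0x81 (U+0301 alone) and 0x65 0xCC 0x81 0xCC 0x81, while valid_in_context rejects the first and still accepts the second. A context that also wanted to reject the doubled accent would need yet another rule, which is the sense in which the line is arbitrary.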

Saying "invalid UTF-8 is just invalid, period" doesn't always work very
well, although it's a good default. There are cases in which you have to
handle specific kinds of invalid UTF-8 (but not any invalid UTF-8) and
having to write UTF-8 encoding/decoding functions for every such instance
does not really contribute to either security or correctness. It's better -
I posit - to have functions that can be configured to handle various invalid
forms of UTF-8 (that is, to accept certain invalid UTF-8, not necessarily to
produce it, of course).
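
A configurable decoder along these lines could look like the following sketch. The utf8_policy struct, its two flags and decode_utf8 are hypothetical names chosen for illustration, not an existing Boost interface. The two relaxations shown are encoded surrogates (as found in CESU-8 or WTF-8 data) and the two-byte form 0xC0 0x80 for U+0000 (as used by "Modified UTF-8"); everything else stays rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical policy: each flag opts in to one specific invalid form.
struct utf8_policy {
    bool allow_surrogates;    // accept encoded U+D800..U+DFFF
    bool allow_overlong_nul;  // accept 0xC0 0x80 as U+0000
};

// Decode UTF-8, accepting only the invalid forms enabled in the policy.
// Decoding only: nothing here ever produces invalid UTF-8.
bool decode_utf8(const std::string& in, const utf8_policy& p,
                 std::vector<char32_t>& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp = 0, min = 0;
        std::size_t len = 0;
        if (b < 0x80)                { cp = b;        len = 1; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; min = 0x10000; }
        else return false;                        // invalid lead byte
        assert(len >= 1 && len <= 4);
        if (in.size() - i < len) return false;    // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) {
            // Overlong: tolerated only for the exact pair C0 80, if enabled.
            bool is_c0_80 = (b == 0xC0 && len == 2 && cp == 0);
            if (!(p.allow_overlong_nul && is_c0_80)) return false;
        }
        if (cp >= 0xD800 && cp <= 0xDFFF && !p.allow_surrogates) return false;
        if (cp > 0x10FFFF) return false;          // never configurable here
        out.push_back(cp);
        i += len;
    }
    return true;
}
```

With both flags false this behaves as a strict decoder, which keeps "invalid is just invalid" as the default; a caller that has to read, say, surrogate-bearing input sets allow_surrogates alone and does not need a hand-written decoder for that one case.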
