Subject: Re: [boost] [nowide] Library Updates and Boost's broken UTF-8 codecvt facet
From: Peter Dimov (lists_at_[hidden])
Date: 2015-10-09 17:38:56
Artyom Beilis wrote:
> > What I meant by that is for instance
> >
> > - is 0xCC 0x81 a valid UTF-8 string?
> > - is 0x65 0xCC 0x81 0xCC 0x81 a valid UTF-8 string?
>
> Both are valid strings, and both are meaningless on their own, i.e. an
> accent without a letter, or the same accent twice.
>
> Being illogical in human terms or representation does not make them
> illegal UTF-8.
>
> UTF-8 is simple, human language processing is complex.

My point here is that strictly valid UTF-8 is the valid multibyte encoding 
of a valid codepoint sequence, and that the definition of "valid codepoint 
sequence" may vary depending on context, such that in some contexts the 
above sequences are considered invalid.
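
To illustrate the two layers, here is a minimal C++ sketch of a check that
runs on already-decoded codepoints, after the byte-level decoding has
succeeded. Everything in it is hypothetical: the names
is_combining_diacritic, sequence_policy and is_acceptable_sequence are made
up for this example, and treating only U+0300..U+036F as combining marks is
a deliberate simplification (a real check would need Unicode character
data). Note that 0xCC 0x81 decodes to U+0301, COMBINING ACUTE ACCENT.

```cpp
#include <cstddef>
#include <vector>

// Simplification (assumption): only the Combining Diacritical Marks
// block, U+0300..U+036F, is treated as "combining" here.
inline bool is_combining_diacritic(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Hypothetical, context-dependent rules layered on top of UTF-8 decoding.
enum sequence_policy : unsigned {
    any_scalar_sequence  = 0,       // accept every codepoint sequence
    reject_leading_mark  = 1u << 0, // a mark with no base character before it
    reject_repeated_mark = 1u << 1  // the same mark twice in a row
};

// Returns true if the decoded sequence is acceptable under `policy`.
inline bool is_acceptable_sequence(const std::vector<char32_t>& cps,
                                   unsigned policy)
{
    for (std::size_t i = 0; i < cps.size(); ++i) {
        if (!is_combining_diacritic(cps[i])) continue;
        if ((policy & reject_leading_mark) && i == 0) return false;
        if ((policy & reject_repeated_mark) && i > 0 && cps[i - 1] == cps[i])
            return false;
    }
    return true;
}
```

Under any_scalar_sequence both example sequences pass; under
reject_leading_mark the first (U+0301 alone) fails, and under
reject_repeated_mark the second (U+0065 U+0301 U+0301) fails.
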

Drawing the line so that codepoints above U+10FFFF and lone surrogates are 
invalid, while the above sequences are valid, is an arbitrary decision. 
That doesn't make the decision wrong; it isn't. But it may not be what 
the user needs.

Saying "invalid UTF-8 is just invalid, period" doesn't always work very 
well, although it's a good default. There are cases in which you have to 
handle specific kinds of invalid UTF-8 (but not arbitrary invalid UTF-8) and 
having to write UTF-8 encoding/decoding functions for every such instance 
does not really contribute to either security or correctness. It's better - 
I posit - to have functions that can be configured to handle various invalid 
forms of UTF-8 (that is, to accept certain invalid UTF-8, not necessarily to 
produce it, of course).
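
As a rough sketch of what such a configurable function could look like, the
C++ below decodes UTF-8 strictly by default and accepts two specific
invalid forms only when asked to: three-byte encoded surrogates, and the
overlong 0xC0 0x80 encoding of U+0000 used by "Modified UTF-8". The names
utf8_options, allow_surrogates, allow_overlong_nul and decode_utf8 are
invented for this example; this is not the interface of Boost.Nowide or of
any existing library, only one possible shape under those assumptions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical option flags; strict decoding is the default.
enum utf8_options : unsigned {
    utf8_strict        = 0,
    allow_surrogates   = 1u << 0, // accept U+D800..U+DFFF encoded as 3 bytes
    allow_overlong_nul = 1u << 1  // accept 0xC0 0x80 as U+0000
};

// Decodes all of `in` into `out`. Returns false at the first sequence
// that the selected options do not accept.
inline bool decode_utf8(const std::string& in, unsigned opts,
                        std::vector<char32_t>& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        char32_t min; // smallest codepoint that needs `len` bytes
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false; // stray continuation byte, or 0xF8..0xFF

        if (in.size() - i < len) return false; // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false; // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < min) { // overlong form: only 0xC0 0x80 can be opted into
            const bool is_c0_80 = (len == 2 && cp == 0);
            if (!(is_c0_80 && (opts & allow_overlong_nul))) return false;
        }
        if (cp > 0x10FFFF) return false;
        if (cp >= 0xD800 && cp <= 0xDFFF && !(opts & allow_surrogates))
            return false;

        out.push_back(cp);
        i += len;
    }
    return true;
}
```

With utf8_strict, both sequences from the start of the thread decode
successfully (to U+0301, and to U+0065 U+0301 U+0301), while
0xED 0xA0 0x80 and 0xC0 0x80 are rejected; passing allow_surrogates or
allow_overlong_nul accepts exactly that one additional form and nothing
else. The function only accepts these forms; it never produces them.
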