From: Gavin Lambert (boost_at_[hidden])
Date: 2020-01-07 23:16:52
On 7/01/2020 14:58, Yakov Galka wrote:
>> So, while unfortunate, v3 made the correct choice. Paths have to be
>> kept in their original encoding between original source (command line,
>> file, or UI) and file API usage, otherwise you can get weird errors when
>> transcoding produces a different byte sequence that appears identical
>> when actually rendered, but doesn't match the filesystem. Transcoding
>> is only safe when you're going to do something with the string other
>> than using it in a file API.
>
> See above, malformed UTF-16 can be converted to WTF-8 (a UTF-8 superset)
> and back losslessly. The unprecedented introduction of a platform specific
> interface into the standard was, still is, and will always be, a horrible
> mistake.
Given that WTF-8 is not itself supported by the C++ standard library
(and the other formats are), that doesn't seem like a valid argument.
You'd have to campaign for that to be added first.
The main problem though is that once you start allowing transcoding of
any kind, it's a slippery slope to other conversions that can make lossy
changes (such as applying different canonicalisation formats, or
adding/removing layout codepoints such as RTL markers).
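As a rough illustration of the canonicalisation problem, the sketch below hard-codes two UTF-8 spellings of "café" instead of calling a real normalisation library. The names `nfc_name`, `nfd_name` and `same_bytes` are made up for this example. Both strings render identically, but a filesystem that compares names byte-for-byte would treat them as different files.

```cpp
#include <cassert>
#include <string>

// "café" with precomposed U+00E9 (NFC form): bytes 63 61 66 C3 A9.
const std::string nfc_name = "caf\xC3\xA9";
// "café" as 'e' plus combining acute U+0301 (NFD form): 63 61 66 65 CC 81.
const std::string nfd_name = "cafe\xCC\x81";

// Hypothetical helper: stands in for a byte-exact filename comparison.
bool same_bytes(const std::string& a, const std::string& b) {
    return a == b;
}

// Self-check; call from main() to confirm the two spellings differ.
void check_normalisation_differs() {
    assert(nfc_name.size() == 5);
    assert(nfd_name.size() == 6);
    assert(!same_bytes(nfc_name, nfd_name));
}
```

If a path read in one form is silently renormalised to the other before it reaches the file API, the lookup can fail even though nothing visibly changed.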
Also, if you read the WTF-8 spec, it notes that it is not legal to
directly concatenate two WTF-8 strings (you either have to convert them
back to UTF-16 first, or apply special handling where the trailing
bytes of the first string meet the leading bytes of the second), which
immediately renders it a poor choice for a path storage format. And
indeed a poor choice for any purpose. (I suspect many people who are
using it have conveniently forgotten that part.)
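A minimal sketch of the special handling that concatenation needs, assuming both inputs are already well-formed WTF-8. The function name `wtf8_concat` is hypothetical and is not part of any standard or Boost API. When the left string ends with an encoded lead surrogate (ED A0..AF xx) and the right string starts with an encoded trail surrogate (ED B0..BF xx), the two 3-byte sequences have to be replaced by one 4-byte sequence for the supplementary code point. Plain appending would leave a surrogate pair in the result, which is not well-formed WTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Value of byte i of s as an unsigned integer (avoids char sign-extension).
static std::uint32_t byte_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// Hypothetical helper: concatenate two well-formed WTF-8 strings,
// merging a surrogate pair that straddles the boundary.
std::string wtf8_concat(const std::string& a, const std::string& b) {
    const std::size_t n = a.size();
    const bool lead_at_end = n >= 3
        && byte_at(a, n - 3) == 0xED
        && (byte_at(a, n - 2) & 0xF0) == 0xA0
        && (byte_at(a, n - 1) & 0xC0) == 0x80;
    const bool trail_at_start = b.size() >= 3
        && byte_at(b, 0) == 0xED
        && (byte_at(b, 1) & 0xF0) == 0xB0
        && (byte_at(b, 2) & 0xC0) == 0x80;
    if (!(lead_at_end && trail_at_start)) {
        return a + b;  // no pair at the boundary: plain append is fine
    }

    // Decode the two 3-byte sequences back to UTF-16 code units.
    const std::uint32_t lead = 0xD000u
        | ((byte_at(a, n - 2) & 0x3F) << 6) | (byte_at(a, n - 1) & 0x3F);
    const std::uint32_t trail = 0xD000u
        | ((byte_at(b, 1) & 0x3F) << 6) | (byte_at(b, 2) & 0x3F);
    // Combine the pair into one supplementary code point.
    const std::uint32_t cp =
        0x10000u + ((lead - 0xD800u) << 10) + (trail - 0xDC00u);

    // Re-encode as a single 4-byte sequence between the remaining parts.
    std::string out(a, 0, n - 3);
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    out.append(b, 3, std::string::npos);
    return out;
}

// Self-check; call from main(). U+1F600 is the pair D83D DE00.
void check_wtf8_concat() {
    const std::string lead = "\xED\xA0\xBD";   // lone U+D83D
    const std::string trail = "\xED\xB8\x80";  // lone U+DE00
    assert(wtf8_concat(lead, trail) == "\xF0\x9F\x98\x80");
    assert(wtf8_concat(lead, trail) != lead + trail);
    assert(wtf8_concat("ab", "cd") == "abcd");
}
```

Path code does a lot of appending (`/`, `+=`, joining components), so every one of those operations would need this boundary check if WTF-8 were the storage format.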
Although on a related note, I think C++11/20 dropped the ball a bit on
the new encoding-specific character types. It's definitely an
improvement on the prior method, but it would have been better to do
something like:
struct ansi_encoding_t;
struct utf_encoding_t;
typedef encoded_char<ansi_encoding_t, 8> char_t;
typedef encoded_char<utf_encoding_t, 8> char8_t;
typedef encoded_char<utf_encoding_t, 16> char16_t;
Where "encoded_char<E,N>" has storage size equal to N bits (blittable,
and otherwise behaves like a standard integer type) but also carries
around an arbitrary encoding tag type E. This could be used to
distinguish "a string encoded in UTF-8" from "a string encoded in WTF-8"
or "a string encoded in EBCDIC". And supplemental libraries could
define additional encodings and conversion functions, and algorithms
could operate on generic strings of any encoding.
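One way this idea might look in practice is sketched below, under several assumptions. Everything lives in a made-up `sketch` namespace with names such as `utf8_char` so it does not collide with the real `char8_t` and `char16_t`. The `wtf_encoding_t` tag, `storage_for`, `encoded_string`, `code_unit_count` and `is_ascii` are all hypothetical. Only equality is implemented, not the full integer-like behaviour described above, and strings are modelled as `std::vector` to avoid needing a `std::char_traits` specialisation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

namespace sketch {

// Encoding tags: only ever used as type-level labels.
struct ansi_encoding_t;
struct utf_encoding_t;
struct wtf_encoding_t;  // hypothetical extra encoding a library might add

// Map a bit width to an unsigned integer type of exactly that size.
template <int Bits> struct storage_for;
template <> struct storage_for<8>  { using type = std::uint8_t; };
template <> struct storage_for<16> { using type = std::uint16_t; };
template <> struct storage_for<32> { using type = std::uint32_t; };

// A code unit of Bits bits that also carries its encoding in the type.
template <class Encoding, int Bits>
struct encoded_char {
    using encoding = Encoding;
    typename storage_for<Bits>::type value;

    friend constexpr bool operator==(encoded_char a, encoded_char b) {
        return a.value == b.value;
    }
    friend constexpr bool operator!=(encoded_char a, encoded_char b) {
        return !(a == b);
    }
};

using ansi_char  = encoded_char<ansi_encoding_t, 8>;
using utf8_char  = encoded_char<utf_encoding_t, 8>;
using utf16_char = encoded_char<utf_encoding_t, 16>;
using wtf8_char  = encoded_char<wtf_encoding_t, 8>;

template <class Encoding, int Bits>
using encoded_string = std::vector<encoded_char<Encoding, Bits>>;

// Same storage size as a raw integer, trivially copyable ("blittable"),
// yet UTF-8 and WTF-8 code units are distinct types.
static_assert(sizeof(utf8_char) == 1, "8-bit code unit");
static_assert(sizeof(utf16_char) == 2, "16-bit code unit");
static_assert(std::is_trivially_copyable<utf8_char>::value, "blittable");
static_assert(!std::is_same<utf8_char, wtf8_char>::value, "distinct types");

// Generic algorithm: accepts a string of any encoding and width.
template <class Encoding, int Bits>
std::size_t code_unit_count(const encoded_string<Encoding, Bits>& s) {
    return s.size();
}

// Encoding-specific function: only accepts UTF-8, so passing a WTF-8 or
// ANSI string fails to compile instead of being silently misread.
inline bool is_ascii(const encoded_string<utf_encoding_t, 8>& s) {
    for (utf8_char c : s) {
        if (c.value >= 0x80) return false;
    }
    return true;
}

// Self-check; call from main().
inline void check_encoded_char() {
    encoded_string<utf_encoding_t, 8> text{utf8_char{0x68}, utf8_char{0x69}};
    assert(code_unit_count(text) == 2);
    assert(is_ascii(text));
}

}  // namespace sketch
```

With this shape, a conversion such as UTF-8 to WTF-8 would be an explicit function between two different string types, so accidental mixing shows up as a compile error.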