Subject: Re: [boost] [review] Review of Nowide (Unicode) starts today
From: Groke, Paul (paul.groke_at_[hidden])
Date: 2017-06-12 18:14:32


Artyom Beilis wrote:
> On Mon, Jun 12, 2017 at 6:05 PM, Vadim Zeitlin via Boost
> <boost_at_[hidden]> wrote:
> > On Mon, 12 Jun 2017 17:58:32 +0300 Artyom Beilis via Boost
> <boost_at_[hidden]> wrote:
> >
> > AB> By definition: you can't handle file names that can't be
> > AB> represented in UTF-8, as no valid UTF-8 representation exists.
> >
> > This is a nice principle to have in theory, but very unfortunate in
> > practice because at least under Unix systems such file names do occur
> > in the wild (maybe less often now than 10 years ago, when UTF-8 was
> > less ubiquitous, but it's still hard to believe that the problem has
> > completely disappeared). And there are ways to solve it, e.g. I think
> > glib represents such file names using special characters from a PUA
> > and there are other possible approaches, even if, admittedly, none of
> them is perfect.
> >
>
> Please note: Under POSIX platforms no conversions are performed and no
> UTF-8 validation is done as this is incorrect:
>
> http://cppcms.com/files/nowide/html/index.html#qna

Well... what's correct on POSIX platforms is a matter of opinion. If you go with the strict interpretation, then even conversion from the current locale to UTF-8 must be considered incorrect. But then you cannot rely on *anything* except that 0x00 is NUL and 0x2F is the path separator, which makes any kind of isdigit/toupper/tolower/... string parsing/processing "incorrect" as well.
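
To make the point concrete, here's a minimal sketch (the file name is made up, and note that some file systems, e.g. Apple's, do enforce UTF-8 names):

    #include <cstdio>

    int main() {
        // On POSIX a file name is just a byte string; only 0x00 (the
        // terminator) and 0x2F ('/') have a fixed meaning. The byte 0xFF
        // can never occur in valid UTF-8, yet this name is perfectly
        // legal on most POSIX file systems.
        const char name[] = "test-\xFF.txt";
        if (std::FILE* f = std::fopen(name, "w"))
            std::fclose(f); // created a file whose name is not valid UTF-8
        return 0;
    }

No locale, no encoding - the kernel just stores the bytes.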

> The only case is when the Windows wide API returns/creates invalid
> UTF-16 - which can happen only when invalid UTF-16 surrogate pairs are
> generated - and those have no valid UTF-8 representation.
>
> On the other hand, creating deliberately invalid UTF-8 is a very
> problematic idea.

Since the UTF-8 conversion is only done on/for Windows, and Windows doesn't guarantee that all wchar_t paths (or strings in general) are valid UTF-16, wouldn't it make more sense to just *define* that the library always uses WTF-8, which allows round-tripping of all possible 16-bit strings? If it's documented that way, it shouldn't matter - especially since users of the library cannot rely on the strings being valid UTF-8 anyway, at least not in portable applications.
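
For reference, the core of the WTF-8 idea is tiny - roughly the sketch below, assuming 16-bit wchar_t as on Windows. This is not Nowide's actual code, just an illustration of the encoding rule: well-formed surrogate pairs become normal 4-byte sequences, while unpaired surrogates get the 3-byte form that strict UTF-8 forbids.

    #include <cstdint>
    #include <string>

    static void encode_code_point(std::string& out, std::uint32_t cp) {
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) { // includes lone surrogates D800-DFFF
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    std::string to_wtf8(const wchar_t* s, std::size_t n) {
        std::string out;
        for (std::size_t i = 0; i < n; ++i) {
            std::uint32_t cp = static_cast<std::uint16_t>(s[i]);
            // Combine a well-formed surrogate pair into one code point...
            if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n) {
                std::uint32_t lo = static_cast<std::uint16_t>(s[i + 1]);
                if (lo >= 0xDC00 && lo <= 0xDFFF) {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                    ++i;
                }
            }
            // ...and let unpaired surrogates fall through unchanged.
            encode_code_point(out, cp);
        }
        return out;
    }

Decoding back to UTF-16 is the exact mirror image, so every possible wchar_t string round-trips losslessly.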

I agree that the over-long zero/NUL encoding of modified UTF-8 might still be problematic though, and therefore WTF-8 might be the better choice. That still leaves some files that can theoretically exist on a Windows system inaccessible (i.e. those with embedded NUL characters in their names), but those are not accessible via the "usual" Windows APIs (CreateFileW etc.) either, so this should be acceptable.
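
To spell out the difference on one example (hypothetical values):

    // Modified UTF-8 (as used by e.g. Java/JNI) encodes U+0000 as the
    // over-long pair 0xC0 0x80, so the result never contains a 0x00 byte;
    // WTF-8 keeps the strict one-byte encoding, so a string with an
    // embedded NUL cannot pass through a NUL-terminated char* interface.
    const wchar_t wide[]          = { L'a', L'\0', L'b' };
    const char    modified_utf8[] = { 'a', '\xC0', '\x80', 'b' }; // no 0x00
    const char    wtf8[]          = { 'a', '\0', 'b' };           // embedded NUL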