Subject: Re: [boost] [strings][unicode] Proposals for Improved String Interoperability in a Unicode World
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2012-01-29 09:28:57
On 01/29/2012 02:53 PM, Artyom Beilis wrote:
> No, MSVC does not allow creating both "ש×××" and L"ש×××" literals
> as Unicode (UTF-8, UTF-16); for all other compilers that is the
> default behavior.
And it shouldn't.
String literals are in the execution character set. On Windows the 
execution character set is what it calls ANSI. That much is not going to 
change.
>>>   1. BOM should not be used in source code; no compiler except MSVC
>>>      uses it, and most do not support it.
>>
>> According to Yakov, GCC supports it now.
>> It would be nice if it could work without any BOM, though.
>>
>
> GCC's default input and literal encoding is UTF-8. BOM is not needed.
That's not what I'm saying. What we want is a unified way to set UTF-8 
as the source character set.
The problem is that MSVC requires a BOM, but GCC used not to allow one.
>>>   2. Setting a UTF-8 BOM makes narrow literals be encoded in the ANSI
>>>      encoding, which makes the BOM with MSVC even more useless
>>>      (crap... sorry).
>>
>> That's the correct behaviour.
>
> No, it is unspecified behavior according to the standard.
It isn't.
> The standard does not specify what narrow encoding should be used;
> that is why u8"" was created.
The standard specifies that it is the execution character set. MSVC 
specifies that for its implementation, the execution character set is ANSI.
> All (but MSVC) compilers create UTF-8 literals and use UTF-8 input
> and this is the default.
That's because for those other compilers, you are in a case where the 
source character set is the same as the execution character set.
With MSVC, if you don't do anything, both your source and execution 
character sets are ANSI. If you set your source character set to UTF-8, 
your execution character set still remains ANSI.
On non-Windows platforms, UTF-8 is the most common execution character 
set, so you can have a setup where source = execution = UTF-8, but you 
can't do that on Windows.
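For reference, a sketch of how each toolchain selects these two character sets on the command line (the MSVC options shown were added in later Visual Studio releases and are an assumption relative to when this was written; the GCC options are longstanding):

```shell
# GCC: select the source and execution character sets explicitly
g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 main.cpp

# MSVC (later releases): /utf-8 sets both the source and execution
# character sets to UTF-8; without it, both default to the ANSI code page
cl /utf-8 main.cpp
```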
But that is irrelevant to the standard.
>> Use u8 string literals if you want UTF-8.
>
> Why on earth should I do this?
Because it makes perfect sense and it's the way it's supposed to work.
> All the world around uses UTF-8. Why should I specify u8"" if it is
> something that can be easily defined at the compiler level?
Because otherwise you're not independent of the execution character set.
Writing your program with Unicode allows you not to depend on 
platform-specific encodings; that doesn't mean it makes them go away.
I repeat, narrow string literals are and will remain in the execution 
character set. Expecting those to end up as UTF-8 data is wrong and not 
portable.
> All we need is some flag for MSVC that tells that string
> literals encoding is UTF-8.
That "flag" is using the u8 prefix on those string literals.
Remember: the encoding used for the data in a string literal is 
independent from the encoding used to write the source.
> AFAIR, neither gcc4.6 nor msvc10 supports u8"".
Unicode string literals have been in GCC since 4.5.
However, there are indeed practical problems with using the standard 
mechanisms, because they're not always implemented.