Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Artyom (artyomtnk_at_[hidden])
Date: 2011-01-17 13:09:13
> I've done some research, and it looks like it would require little
> effort to create an os::string_t type that uses the current locale, and
> assume all raw std::strings that contain eight-bit values are coded in
> that instead.
>
> Design-wise, ascii_t would need to change slightly after this, to throw
> on anything that can't fit into a *seven*-bit value, rather than
> eight-bit. I'll add the default-character option to both types as well,
> and maybe make other improvements as I have time.
>
Unfortunately, this is not the correct approach either.
For example, why do you think it is safe to pass the ASCII subset of UTF-8
to the current non-UTF-8 locale?
Shift-JIS, which is in use with the Windows ANSI API, has a different
subset in the 0-127 range - it is not ASCII!
Also, if you want to use std::codecvt facets...
Don't rely on them unless you know where they come from!
1. By default they are a no-op - in the default C locale
2. Under most compilers they are not implemented properly.
OS \ Compiler   MSVC   GCC    SunOS/stlport   SunOS/standard
-------------------------------------------------------------------
Windows         ok     none   -               -
Linux           -      ok     ?               ?
Mac OS X        -      none   -               -
FreeBSD         -      none   -               -
Solaris         -      none   buggy!          ok-but-non-standard
Bottom line: don't rely on the "current locale" :-)
>
> Artyom, since you seem to have more experience with this stuff than I,
> what do you think? Would those alterations take care of your objections?
>
The rule of thumb is as follows:
- When you handle strings as text storage, just use std::string
- When you make a system call
  a) on POSIX - pass it as is
  b) on Windows - convert from UTF-8 to the Wide API
- When handling text as text (i.e. formatting, collation etc.),
  use a good library.
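To illustrate step (b), here is a minimal sketch of the UTF-8 to UTF-16
conversion you would do just before calling a Wide API function. The name
utf8_to_utf16 is hypothetical (it is not part of Boost or any library
mentioned here), and the decoder is hand-written only to keep the example
portable. On Windows itself, MultiByteToWideChar with CP_UTF8 does the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper, for illustration only: decode a UTF-8 std::string
// and re-encode it as UTF-16. Throws std::invalid_argument on bad input.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;      // code point being decoded
        std::size_t extra = 0;     // number of continuation bytes expected
        std::uint32_t min_cp = 0;  // smallest value allowed for this length

        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; min_cp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        // Reject overlong forms, surrogate code points and out-of-range values.
        if (cp < min_cp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 code point");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the result can be copied into a
std::wstring and handed to the Wide variant of the call. On POSIX nothing of
the kind is needed: the std::string is passed through unchanged.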
I would strongly recommend reading the answer by Pavel Radzivilovsky
on Stack Overflow:
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful/1855375#1855375
He is a hard-core Windows programmer, designer, architect and developer,
and still he chose UTF-8!
The problem is that the issue is so complicated that there is only
one way to make it both completely general and right:
decide what you are working with and stick with it.
In the CppCMS project I work on (and I developed Boost.Locale
because of it) I stick with UTF-8 by default and use plain
std::string - works like a charm.
Inventing "special Unicode strings or storage" does not
improve anybody's understanding of Unicode, nor does it improve
its handling.
Best,
Artyom