From: Vladimir Prus (ghost_at_[hidden])
Date: 2001-11-01 04:36:55
Ronald Garcia wrote:
>     MA> is there any relation between your work and vladimir prus, who
>     MA> uploaded some codecvt code about a month ago?
>     MA> http://groups.yahoo.com/group/boost/message/17772
>
> I have taken a look at the above message and the code that it refers
> to.  I can't quite grasp what the code is doing,
> but according to descriptions it appears to provide two codecvt
> facets: one converting from a utf-8 external (file) representation to
> ucs2 internally (memory), and back, while the other converts from ucs2
> externally to utf8 internally.  I may be wrong and so the author may
> wish to correct me here.
Indeed, I wish to correct. The other codecvt converts from ucs2 externally to 
ucs2 internally -- i.e. it does no conversion. As far as I can tell, the C++ 
standard does not require the default conversion facet to use any particular 
encoding, and under bcc external files are treated as something called a 
"multibyte string". I have no idea what that is, but it does not seem to be 
ucs2 at all.
>     MA> is there a reason not to introduce a fixed typedef
>     MA> boost::ucs4_t, as a uint32_t?  then there could be a version
>     MA> of this that would work on any platform.  as you know, on
>     MA> win32 (and elsewhere?) wchar_t is 16bits, so you are currently
>     MA> forcing platform-specific specialization.
>
> I chose to implement the facet as a template to avoid making solid
> decisions about the types used to represent utf-8 elements and
> ucs-4 elements.  It makes sense that compilers with large enough
> wchar_t should use std::codecvt<wchar_t,char,std::mbstate_t>,
> wofstream, and wifstream for file streaming, but you
> are correct that for windows one would have to provide
> specializations.  I'm pretty new to this area of the C++ library and
> so I'm trying to get a feel for what works best.
Correction again -- wchar_t is 16 bits for *some* Windows compilers. But in 
principle, the ability to use any type for the internal character would be 
desirable (and it costs nothing to have it).
>     MA> even on systems where wchar_t is 32bits, there are no
>     MA> guarantees that the implementation character set is unicode.
>     MA> even if __STDC_ISO_10646__ is defined, i'm not sure if that
>     MA> strictly guarantees that the values are comparable with cast
>     MA> ints, because it (i think) is still implementation defined
>     MA> what the signedness and endianness is of wchar_t storage, even
>     MA> if the code value space is unicode.
>
> I'm not sure what you are referring to here.  Could you run that by me
> again?
I'm not sure either. The signedness of wchar_t is not important if it is 
32 bits wide, since, IIRC, Unicode requires only 31 bits. And I don't 
understand how the endianness of wchar_t storage can matter at all. Regarding 
the relation between wchar_t and Unicode, we have:
std::2.13.3/2:
The value of a wide-character literal containing a single c-char has value 
equal to the numerical value of the encoding of the c-char in the execution 
wide-character set.
std::2.2/3:
The values of the members of the execution character sets are 
implementation-defined...
This, in theory, seems to mean that wide literals can use an arbitrary 
encoding, but I really doubt this is ever the case in practice. So I think we 
are free to assume that it's ok to use wchar_t for Unicode, provided it's 
wide enough.
I also think that the performance aspects of a ucs2 codecvt should be 
considered.
Regards,
Vladimir