From: Alberto Barbati (abarbati_at_[hidden])
Date: 2002-11-04 19:14:24


Hi Boosters,

I read in the list archives that there was a proposal by Vladimir Prus
about codecvt facets for utf8-ucs2 and ucs2-ucs2 conversion. The
proposal dates back to last year. I wonder what happened to it.

However, I would like to up the ante and propose a much wider choice of
codecvt facets that could be used to effectively process Unicode text
files. This new proposal aims to be fully conformant to Unicode 3.2
requirements. Therefore, I will refer to the utf-8, utf-16 and utf-32
encodings of Unicode code points, disregarding the ucs-2 and ucs-4
counterparts.

The proposal shall include facets to convert:

   external      internal
   utf-8     ->  utf-16*  (BMP only - no surrogates)
   utf-8     ->  utf-16*  (all planes - surrogates allowed)
   utf-16LE  ->  utf-16
   utf-16BE  ->  utf-16
   utf-16**  ->  utf-16
   utf-8     ->  utf-32
   utf-16LE  ->  utf-32
   utf-16BE  ->  utf-32
   utf-16**  ->  utf-32
   utf-32    ->  utf-32

Notes:
(*) There are two utf-8 -> utf-16 facets because a 4-byte utf-8 code
unit sequence maps to a utf-16 surrogate pair. If the application
won't handle surrogates anyway, it can opt for a more optimized facet.
(Such processing is probably not conformant; is this "optimization"
really needed?)
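
As a rough illustration of the mapping mentioned in (*), the sketch
below decodes one 4-byte utf-8 sequence and splits the resulting code
point into a surrogate pair. The function names are hypothetical and not
part of any proposed interface, and the input is assumed to be
well-formed (no validation is attempted).

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Decode a 4-byte utf-8 sequence into a code point.
// Assumption: the four bytes are well-formed (lead 11110xxx, then three
// 10xxxxxx continuation bytes); no validation is done here.
inline std::uint32_t decode_utf8_4(const unsigned char* b)
{
    return ((b[0] & 0x07u) << 18) |
           ((b[1] & 0x3Fu) << 12) |
           ((b[2] & 0x3Fu) << 6)  |
            (b[3] & 0x3Fu);
}

// Split a code point outside the BMP (U+10000..U+10FFFF) into a
// utf-16 surrogate pair: first = high (lead), second = low (trail).
inline std::pair<std::uint16_t, std::uint16_t>
to_surrogate_pair(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20 significant bits remain
    return std::make_pair(
        static_cast<std::uint16_t>(0xD800 + (v >> 10)),     // top 10 bits
        static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low 10 bits
}
```

For example, the bytes F0 9D 84 9E decode to U+1D11E, which becomes the
pair D834 DD1E. A BMP-only facet would never need the second function.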

(**) External utf-16 facets will detect an initial BOM (U+FEFF) to
select the endian-ness of the external stream. (What to do if there is
no BOM? Default to the endian-ness of the platform?)
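
The BOM check described in (**) could look roughly like the sketch
below. The enum and function names are hypothetical. The "no BOM" case
is only reported to the caller, since the fallback policy is exactly the
open question above.

```cpp
#include <cassert>
#include <cstddef>

enum class utf16_byte_order { little_endian, big_endian, unknown };

// Inspect the first two bytes of an external utf-16 stream.
// U+FEFF is serialized as FE FF in big-endian order and as FF FE in
// little-endian order. Returns unknown if no BOM is present.
inline utf16_byte_order detect_utf16_bom(const unsigned char* p,
                                         std::size_t n)
{
    assert(p != nullptr || n == 0);
    if (n >= 2) {
        if (p[0] == 0xFE && p[1] == 0xFF)
            return utf16_byte_order::big_endian;
        if (p[0] == 0xFF && p[1] == 0xFE)
            return utf16_byte_order::little_endian;
    }
    return utf16_byte_order::unknown;  // caller decides on a default
}
```

A facet built on this would skip the two BOM bytes when one is found and
then behave like the corresponding utf-16LE or utf-16BE facet.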

All proposed facets will be implemented as class templates, in order to
avoid any explicit reference to wchar_t or any other fixed-size integral
type. A compile-time assertion will simply be used to ensure that the
supplied type is large enough to hold the internal characters. (For
platforms where wchar_t has fewer than 32 bits, an application that
wants to use utf-32 facets will thus be responsible for choosing a
suitable integral type, defining char_traits and specializing
basic_*stream accordingly.)
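
A minimal sketch of such a compile-time check follows. It uses the
C++11 static_assert keyword for brevity; something like
BOOST_STATIC_ASSERT would play the same role on older compilers. All
names are hypothetical, and the threshold of 21 value bits (enough for
U+10FFFF) is an assumption about what "large enough" means for utf-32.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// True if T is an integer type with at least 21 value bits, which is
// enough to represent every code point up to U+10FFFF.
template <typename T>
constexpr bool can_hold_code_points()
{
    return std::numeric_limits<T>::is_integer &&
           std::numeric_limits<T>::digits >= 21;
}

// Hypothetical stand-in for the internal side of a utf-32 facet.
// Instantiating it with a type that is too narrow fails to compile.
template <typename InternT>
struct utf32_internal
{
    static_assert(can_hold_code_points<InternT>(),
                  "InternT is too narrow to hold utf-32 characters");

    static InternT from_code_point(std::uint32_t cp)
    {
        assert(cp <= 0x10FFFF);
        return static_cast<InternT>(cp);
    }
};
```

With this, utf32_internal<char32_t> compiles, while instantiating the
template with a 16-bit character type is rejected at compile time.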

(future directions) The facets could use template policies, for example
to customize error handling (for instance, if a non-character is
encountered the conversion may either signal an error or ignore the
non-character).
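
To make the policy idea concrete, here is a small sketch in which the
reaction to a non-character is a template parameter. The policy names
and the copy_code_points function are invented for illustration only. A
real facet would apply the policy inside its conversion routines rather
than on a vector of code points.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Non-characters are U+FDD0..U+FDEF and every code point whose low
// 16 bits are FFFE or FFFF.
inline bool is_noncharacter(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

// Policy: treat a non-character as a conversion error.
struct fail_on_noncharacter
{
    static bool skip(std::uint32_t) { return false; }
};

// Policy: silently drop the non-character and continue.
struct ignore_noncharacter
{
    static bool skip(std::uint32_t) { return true; }
};

// Copy code points from in to out, consulting ErrorPolicy whenever a
// non-character is met. Returns false if the policy signals an error;
// out then holds whatever was copied before the error.
template <typename ErrorPolicy>
bool copy_code_points(const std::vector<std::uint32_t>& in,
                      std::vector<std::uint32_t>& out)
{
    for (std::uint32_t cp : in) {
        if (is_noncharacter(cp)) {
            if (!ErrorPolicy::skip(cp))
                return false;
            continue;
        }
        out.push_back(cp);
    }
    return true;
}
```

Because the policy is chosen at compile time, an application that never
wants error reporting pays nothing for it at run time.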

The library is explicitly directed to platforms where char is an 8-bit
type. Support for other platforms can be included in subsequent
revisions, according to interest.

Is there any interest in this proposal? Any feedback is appreciated.

Alberto Barbati