Boost mailing page: Re: [boost] Thoughts on Unicode and Strings


From: Reece Dunn (msclrhd_at_[hidden])
Date: 2004-04-17 07:29:52

Next message: David Abrahams: "[boost] Re: move semantics"
Previous message: Rogier van Dalen: "[boost] Re: Thoughts on Unicode and Strings"
Maybe in reply to: Reece Dunn: "[boost] Thoughts on Unicode and Strings"

Marshall Clow wrote:
>>>"Reece Dunn" writes:
>>>> RATIONALE: Why standardize on the UTF-32 view? Because UTF-8 and UTF-16
>>>> are multi-character encodings of UTF-32 (not considering combining marks
>>>> at this stage), whereas UTF-32 is a single character encoding.

>I'm pretty sure that this is a bad assumption.

Why is this a bad assumption?

At the unicode_string level, we are talking about individual Unicode
characters (code points) as specified by unicode.org. As an example, U+0020
(space) takes a single code unit in every encoding; U+2192 (rightwards arrow)
requires 3 bytes in UTF-8 but a single code unit in UTF-16 and UTF-32; U+1Dxxx
(I think these include the Fraktur characters) requires 4 UTF-8, 2 UTF-16 and
1 UTF-32 code units.
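
As a rough illustration of those counts, here is a minimal sketch that returns
the number of code units one code point needs in each encoding. The function
names (utf8_length, utf16_length, utf32_length) are made up for this example
and are not part of any proposed interface; surrogate code points are not
rejected.

```cpp
#include <cassert>
#include <cstdint>

// Number of 8-bit code units needed to encode one code point in UTF-8.
inline int utf8_length(std::uint32_t cp)
{
   assert(cp <= 0x10FFFF); // largest valid Unicode code point
   if (cp < 0x80)    return 1; // U+0000..U+007F
   if (cp < 0x800)   return 2; // U+0080..U+07FF
   if (cp < 0x10000) return 3; // U+0800..U+FFFF
   return 4;                   // U+10000..U+10FFFF
}

// Number of 16-bit code units needed in UTF-16 (2 means a surrogate pair).
inline int utf16_length(std::uint32_t cp)
{
   assert(cp <= 0x10FFFF);
   return cp < 0x10000 ? 1 : 2;
}

// UTF-32 always uses exactly one 32-bit code unit per code point.
inline int utf32_length(std::uint32_t cp)
{
   assert(cp <= 0x10FFFF);
   return 1;
}
```

For instance, utf8_length(0x2192) is 3, while utf8_length(0x1D504) is 4 and
utf16_length(0x1D504) is 2.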

Treating a Unicode string as a virtual UTF-32 string (no matter what the
underlying encoding is) makes it easier to use at a higher level, because
you are dealing with the characters as they appear in the Unicode
tables. This helps when the string mixes characters of different encoded
widths:
U+300A hello U+300B ==> [<<] hello [>>]
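
To make the "virtual UTF-32" idea concrete, here is a small sketch that walks
UTF-8 storage and yields one 32-bit value per character. The names
next_code_point and utf32_view are hypothetical, a std::string stands in for
the underlying storage, and the input is assumed to be well-formed UTF-8 (no
validation is attempted).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode the code point that starts at s[pos] and advance pos past it.
// Assumes s holds well-formed UTF-8; malformed input is not detected.
inline std::uint32_t next_code_point(const std::string& s, std::size_t& pos)
{
   assert(pos < s.size());
   const unsigned char lead = static_cast<unsigned char>(s[pos++]);
   int trailing = 0;         // number of continuation bytes to read
   std::uint32_t cp = lead;  // correct as-is for a 1-byte (ASCII) sequence
   if      (lead >= 0xF0) { trailing = 3; cp = lead & 0x07; }
   else if (lead >= 0xE0) { trailing = 2; cp = lead & 0x0F; }
   else if (lead >= 0xC0) { trailing = 1; cp = lead & 0x1F; }
   while (trailing-- > 0 && pos < s.size())
      cp = (cp << 6) | (static_cast<unsigned char>(s[pos++]) & 0x3F);
   return cp;
}

// Present the whole UTF-8 string as a sequence of code points.
inline std::vector<std::uint32_t> utf32_view(const std::string& utf8)
{
   std::vector<std::uint32_t> out;
   for (std::size_t pos = 0; pos < utf8.size(); )
      out.push_back(next_code_point(utf8, pos));
   return out;
}
```

With the bracketed "hello" above stored as UTF-8 (3 + 5 + 3 = 11 bytes),
utf32_view returns 7 elements, the first being 0x300A and the last 0x300B. A
real implementation would more likely expose this as an iterator than copy
into a vector.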

>You can't just ignore combining characters.

I am not ignoring combining characters. All I'm saying is that dealing with
grapheme clusters at this stage makes processing Unicode strings too
complex. They should be treated as a view *on top of the underlying
unicode_string representation*.

>I believe that Miro posted an example of how (even using UTF-32), you
>may not have a single character <<-->> single "entry" mapping.

I understand that now (see my other post), but dealing with it all at one
level would make the interface too complex and would become too difficult to
manage. You could have something like:

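// A grapheme cluster, held as a pair of UTF-32 iterators into a unicode_string.
struct grapheme_cluster: public std::pair< unicode_string::utf32_iterator,
                                           unicode_string::utf32_iterator >
{
   // Initialise the pair from the string's UTF-32 begin and end iterators.
   inline grapheme_cluster( unicode_string & us ):
      std::pair< unicode_string::utf32_iterator,
                 unicode_string::utf32_iterator >
      ( us.utf32_begin(), us.utf32_end())
   {
   }

...

   // True when the cluster is a lone character with no combining marks.
   inline bool is_single() const
   {
      return( first == second );
   }

   // The first (base) character of the cluster.
   inline unicode_string::utf32_t get_base() const
   {
      return( *first );
   }

   bool advance(); // implementation defined; false iff end of string
...
};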

NOTE: if is_single() is true, then get_base() will be the value of the
Unicode character; otherwise it is the primary character with the combining
characters removed.
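
To show how such a view could sit on top of a UTF-32 sequence, here is a
deliberately simplified sketch. Everything in it is an assumption made for
illustration: utf32_string is a stand-in typedef, cluster_range and
next_cluster are hypothetical names, and is_combining only recognises the
Combining Diacritical Marks block (U+0300..U+036F) rather than the full
Unicode segmentation rules.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for the UTF-32 view of a unicode_string (hypothetical).
typedef std::vector<std::uint32_t> utf32_string;

// Simplification: treat only U+0300..U+036F as combining characters.
inline bool is_combining(std::uint32_t cp)
{
   return cp >= 0x0300 && cp <= 0x036F;
}

// Half-open index range [first, last) covering one cluster.
struct cluster_range
{
   std::size_t first;
   std::size_t last;
};

// Find the cluster starting at index start: one base character followed by
// any run of combining characters. An empty range means end of string.
inline cluster_range next_cluster(const utf32_string& s, std::size_t start)
{
   assert(start <= s.size());
   cluster_range r = { start, start };
   if (start == s.size())
      return r;
   r.last = start + 1;
   while (r.last < s.size() && is_combining(s[r.last]))
      ++r.last;
   return r;
}

// Count clusters by stepping from one cluster to the next.
inline std::size_t count_clusters(const utf32_string& s)
{
   std::size_t n = 0;
   for (std::size_t pos = 0; pos < s.size(); pos = next_cluster(s, pos).last)
      ++n;
   return n;
}
```

For the three code points 'e', U+0301, 'x', next_cluster(s, 0) covers indices
[0, 2) and count_clusters(s) is 2. The underlying string code never needs to
know about clusters, which is the point of keeping the two levels separate.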

Regards,
Reece
