Subject: Re: [boost] GSoC Unicode library: second preview
From: Scott McMurray (me22.ca+boost_at_[hidden])
Date: 2009-06-20 15:22:55


2009/6/20 Artyom <artyomtnk_at_[hidden]>
>
> > UTF-16 ... This is the recommended encoding for dealing with
> > Unicode internally for general purposes
>
> To be honest, it is most error prone encoding to work with Unicode:
>

Amen.

Really, I don't see why people don't just use UTF-8 all over the
place. Even UTF-32 isn't as convenient as most would like, since you
still have combining code points and other similar complications.

As a programmer, what I really care about is usually some nebulous
concept of "characters", and one character can easily be 3 code points
or 1/3 of a code point.
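
To make the "3 code points or 1/3 of a code point" remark concrete, here
is a minimal sketch. It assumes Python and its standard unicodedata
module purely for illustration; the variable names are made up and have
nothing to do with the proposed Boost library.

```python
import unicodedata

# One user-perceived character made of three code points:
# "e" + COMBINING CIRCUMFLEX ACCENT + COMBINING ACUTE ACCENT.
stacked = "e\u0302\u0301"
stacked_code_points = len(stacked)                           # 3
stacked_utf8_bytes = len(stacked.encode("utf-8"))            # 5
stacked_utf32_units = len(stacked.encode("utf-32-le")) // 4  # 3

# One code point that stands for three letters: U+FB03, the "ffi" ligature.
ligature = "\ufb03"
ligature_code_points = len(ligature)                         # 1
ligature_letters = unicodedata.normalize("NFKD", ligature)   # "ffi"
```

Even in UTF-32 the stacked character still takes three units, so a
fixed-width encoding does not give you one unit per "character".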

It feels like the only way to get Unicode string handling right (at
the application level, not library or render levels) is to deal
entirely in strings and regexes.

Suppose I have "difficult" spelled with the "ffi" ligature code point,
and I do a perl-style split on /i/. I should probably be getting "d",
the "ff" ligature code point, and "cult". I know that if I tried to
code that by hand in every application, I'd miss all kinds of evil
corner cases like that.
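
The ligature case can be sketched as follows, again assuming Python's
standard re and unicodedata modules just for illustration. The sketch
uses NFKC compatibility normalization before splitting, which is one
possible approach and not necessarily what a Unicode-aware split should
do: it yields the plain letters "ff" rather than the "ff" ligature code
point, so putting the ligature back would be a separate step.

```python
import re
import unicodedata

# "difficult" spelled with U+FB03 (the "ffi" ligature) in the middle.
word = "di\ufb03cult"

# A naive split only sees the standalone "i"; the one hidden inside
# the ligature is missed.
naive = re.split("i", word)

# Expanding compatibility characters first exposes the hidden "i".
expanded = unicodedata.normalize("NFKC", word)
pieces = re.split("i", expanded)
```

Here naive is ['d', '\ufb03cult'], while pieces is ['d', 'ff', 'cult'].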