Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Robert Kawulak (robert.kawulak_at_[hidden])
Date: 2011-01-18 18:00:59
> From: Artyom
> Ok, let's think: what do you need iterators for? Accessing "characters"?
> If so, you are most likely doing something terribly wrong, as you ignore
> the fact that codepoint != character.
>
> I would say such an iterator is wrong by design unless you develop
> a Unicode algorithm that operates on code points.
Now wouldn't it be nice if ascii_t (or whatever it's called) and utf*_t string classes had 3 kinds of iterators:
- storage iterator (char, wchar_t etc.),
- codepoint iterator,
- character iterator.
You could then reuse many existing algorithms to perform operations on a level that is sufficient in a given situation, like:
- bitwise copy:
std::copy(utf8_1.storage_begin(), utf8_1.storage_end(),
utf8_2.storage_begin())
- check if utf32 is a substring of utf8, codepoint-wise:
std::search(utf8.codepoint_begin(), utf8.codepoint_end(),
utf32.codepoint_begin(), utf32.codepoint_end())
- character-wise copy of ascii_t to utf16_t, considering the codepage of the ascii object:
utf16_t utf16(ascii.character_begin(), ascii.character_end())
- count codepoints:
std::distance(utf8.codepoint_begin(), utf8.codepoint_end())
- count characters:
std::distance(utf8.character_begin(), utf8.character_end())
- get the 5th codepoint (std::advance returns void and needs a named iterator, so std::next fits better):
*std::next(utf8.codepoint_begin(), 4)
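To make the storage and codepoint levels concrete, here is a minimal sketch, assuming well-formed UTF-8 input and doing no validation at all. The names codepoint_iterator and utf8_view are hypothetical and are not part of any existing or proposed Boost interface. A character iterator is left out, since splitting text into user-perceived characters needs Unicode property tables.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical forward iterator that decodes code points from UTF-8 bytes.
// Assumption: the input is well-formed UTF-8; nothing is validated.
class codepoint_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;  // decoded on the fly, returned by value

    codepoint_iterator() = default;
    explicit codepoint_iterator(const char* p) : p_(p) {}

    char32_t operator*() const {
        assert(p_ != nullptr);
        const unsigned char lead = static_cast<unsigned char>(*p_);
        const std::size_t n = length(lead);
        // Keep only the payload bits of the lead byte.
        char32_t cp = (n == 1) ? lead : (lead & (0x7Fu >> n));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t i = 1; i < n; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3Fu);
        return cp;
    }

    codepoint_iterator& operator++() {
        p_ += length(static_cast<unsigned char>(*p_));
        return *this;
    }
    codepoint_iterator operator++(int) {
        codepoint_iterator old = *this;
        ++*this;
        return old;
    }

    friend bool operator==(codepoint_iterator a, codepoint_iterator b) { return a.p_ == b.p_; }
    friend bool operator!=(codepoint_iterator a, codepoint_iterator b) { return a.p_ != b.p_; }

private:
    // Number of bytes in the sequence that starts with this lead byte.
    static std::size_t length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }

    const char* p_ = nullptr;
};

// Hypothetical non-owning view exposing two of the three iterator levels.
struct utf8_view {
    const std::string& bytes;

    std::string::const_iterator storage_begin() const { return bytes.begin(); }
    std::string::const_iterator storage_end() const { return bytes.end(); }

    codepoint_iterator codepoint_begin() const {
        return codepoint_iterator(bytes.data());
    }
    codepoint_iterator codepoint_end() const {
        return codepoint_iterator(bytes.data() + bytes.size());
    }
};
```

With a view over the UTF-8 bytes of "naïve €", std::distance over the storage range gives 10 while std::distance over the codepoint range gives 7, and the standard algorithms need no changes to work at either level.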
I don't know Unicode's quirks well enough to tell how useful this interface would be, but it seems interesting. What do you think?
Best regards,
Robert