Boost mailing page: Re: [boost] [General] Always treat std::strings as UTF-8


Subject: Re: [boost] [General] Always treat std::strings as UTF-8
From: Robert Kawulak (robert.kawulak_at_[hidden])
Date: 2011-01-18 18:00:59

Next message: Andreas Masur: "Re: [boost] namespace boost?"
Previous message: Peter Dimov: "Re: [boost] [General] Always treat std::strings as UTF-8"
In reply to: Artyom: "Re: [boost] [General] Always treat std::strings as UTF-8"
Next in thread: Chad Nelson: "Re: [boost] [General] Always treat std::strings as UTF-8"
Reply: Chad Nelson: "Re: [boost] [General] Always treat std::strings as UTF-8"

> From: Artyom
> OK, let's think about what you need iterators for. Accessing "characters"?
> If so, you are most likely doing something terribly wrong, as you ignore
> the fact that codepoint != character.
>
> I would say such an iterator is wrong by design unless you develop
> a Unicode algorithm that relates to code points.

Now wouldn't it be nice if ascii_t (or whatever it's called) and utf*_t string classes had 3 kinds of iterators:
- storage iterator (char, wchar_t etc.),
- codepoint iterator,
- character iterator.

You could then reuse many existing algorithms to perform operations on a level that is sufficient in a given situation, like:

- bitwise copy (assuming utf8_2 already has enough storage):
    std::copy(utf8_1.storage_begin(), utf8_1.storage_end(),
        utf8_2.storage_begin())
- check if utf32 is a substring of utf8, codepoint-wise:
    std::search(utf8.codepoint_begin(), utf8.codepoint_end(),
        utf32.codepoint_begin(), utf32.codepoint_end())
- character-wise copy of ascii_t to utf16_t, considering the codepage of the ascii object:
    utf16_t utf16(ascii.character_begin(), ascii.character_end())
- count codepoints:
    std::distance(utf8.codepoint_begin(), utf8.codepoint_end())
- count characters:
    std::distance(utf8.character_begin(), utf8.character_end())
- get the 5th codepoint (std::advance needs an lvalue iterator and returns void, so std::next fits better here):
    *std::next(utf8.codepoint_begin(), 4)
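
To make the codepoint-level examples above a bit more concrete, here is a minimal sketch of what such an iterator could look like over plain std::string storage. Everything in it is hypothetical: codepoint_iterator, codepoint_begin/codepoint_end and the three helper functions are names made up for illustration, not part of any existing library, and the decoder assumes well-formed UTF-8 (no validation). A character iterator is left out on purpose, since it would need Unicode segmentation data.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

// Hypothetical codepoint iterator over UTF-8 bytes (illustration only).
// Assumption: the input is well-formed UTF-8; nothing is validated.
class codepoint_iterator {
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type        = char32_t;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const char32_t*;
    using reference         = char32_t;  // decoded on the fly, returned by value

    codepoint_iterator() = default;
    explicit codepoint_iterator(const char* p) : p_(p) {}

    // Decode the codepoint that starts at the current byte.
    char32_t operator*() const {
        const unsigned char lead = static_cast<unsigned char>(*p_);
        const std::size_t len = length(lead);
        char32_t cp = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t i = 1; i < len; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[i]) & 0x3F);
        return cp;
    }

    // Step over all bytes of the current codepoint.
    codepoint_iterator& operator++() {
        p_ += length(static_cast<unsigned char>(*p_));
        return *this;
    }
    codepoint_iterator operator++(int) {
        codepoint_iterator tmp = *this;
        ++*this;
        return tmp;
    }

    friend bool operator==(codepoint_iterator a, codepoint_iterator b) { return a.p_ == b.p_; }
    friend bool operator!=(codepoint_iterator a, codepoint_iterator b) { return a.p_ != b.p_; }

private:
    // Sequence length in bytes, derived from the lead byte.
    static std::size_t length(unsigned char lead) {
        if (lead < 0x80) return 1;
        if ((lead >> 5) == 0x06) return 2;
        if ((lead >> 4) == 0x0E) return 3;
        return 4;
    }

    const char* p_ = nullptr;
};

// Free functions standing in for the proposed member functions.
inline codepoint_iterator codepoint_begin(const std::string& s) {
    return codepoint_iterator(s.data());
}
inline codepoint_iterator codepoint_end(const std::string& s) {
    return codepoint_iterator(s.data() + s.size());
}

// Count codepoints with a standard algorithm.
inline std::size_t count_codepoints(const std::string& utf8) {
    return static_cast<std::size_t>(
        std::distance(codepoint_begin(utf8), codepoint_end(utf8)));
}

// Codepoint-wise substring check of a UTF-32 needle in a UTF-8 haystack.
inline bool contains_codepoints(const std::string& utf8, const std::u32string& utf32) {
    const codepoint_iterator last = codepoint_end(utf8);
    return std::search(codepoint_begin(utf8), last, utf32.begin(), utf32.end()) != last;
}

// Zero-based access: n == 4 yields the 5th codepoint.
inline char32_t nth_codepoint(const std::string& utf8, std::size_t n) {
    return *std::next(codepoint_begin(utf8), static_cast<std::ptrdiff_t>(n));
}
```

With this, count_codepoints and s.size() give the codepoint and storage counts respectively, and contains_codepoints compares a UTF-8 string against a std::u32string without converting either one. Note that the iterator only models a forward iterator loosely, because operator* returns a value rather than a reference.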

I don't know Unicode's quirks well enough to tell how useful this interface would be, but it seems interesting. What do you think?

Best regards,
Robert
