From: Pavol Droba (droba_at_[hidden])
Date: 2008-08-28 09:53:18


Martin Lütken wrote:
> Martin Lutken wrote:
>> Anyone who knows how this could be made possible?
>> I suppose I need a locale facet like the std::ctype, but which works for
>> UTF-8, and not just for ASCII a-z,A-Z. I guess the information in a table
>> like this (http://www.unicode.org/Public/UNIDATA/CaseFolding.txt)
>> could be used.
>>
>
> This might not work out-of-the-box. The StringAlgo lib is designed around sequences
> of characters. Since UTF-8 is a variable-width character encoding, the algorithms
> in the library would not work as expected.
>
> To make it work, you will need a container with iterators that
> iterate over meta-characters, not bytes.
>
>> If it's better/easier just to convert the string to UTF-32 before doing
>> case-insensitive compares and replaces, I could live with that.
>
> If you meant UTS-32 and you have a corresponding locale implementation, then
> this approach is a viable solution.
>
> Sorry, what is UTS-32? I tried to Google it: 351 results, none of them
> looking related to character encodings.
>
> I found this article on Wikipedia on UTF-32/UCS-4:
> http://en.wikipedia.org/wiki/UTF-32
>
> Is this not what I need?
> I suspect that many people must have run into similar problems. Perhaps we should
> add a 32-bit string class to Boost. And until I get a better understanding, I will
> keep calling it UTF-32 :-)
>

Sorry, I mixed it up a little. I meant UCS-4, a.k.a. a fixed-width encoding. I was not
aware that UTF-32 is de facto the same.

Anyway, the statement about usability with StringAlgo still holds. It can work with
any fixed-size encoding, as long as you have the corresponding locales.
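
A minimal sketch of the fixed-width route, written against the standard library
only (it is not StringAlgo code): decode UTF-8 into a std::u32string, then compare
code point by code point. The names decode_utf8, simple_fold and iequals_utf8 are
hypothetical. The decoder assumes well-formed input, and simple_fold covers only
ASCII and Latin-1 as a stand-in for a real locale or the full CaseFolding.txt table.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points. Sketch only: assumes well-formed
// input and performs no validation of continuation bytes.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        // The lead byte tells how many continuation bytes follow.
        int extra = b < 0x80 ? 0 : b < 0xE0 ? 1 : b < 0xF0 ? 2 : 3;
        char32_t cp = extra == 0 ? b : b & (0x3F >> extra);
        for (int k = 1; k <= extra && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out.push_back(cp);
        i += static_cast<std::size_t>(extra) + 1;
    }
    return out;
}

// Hypothetical stand-in for a locale's case mapping: folds A-Z and the
// Latin-1 capitals U+00C0..U+00DE (except U+00D7, the multiplication sign).
char32_t simple_fold(char32_t c) {
    if (c >= U'A' && c <= U'Z') return c + 0x20;
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 0x20;
    return c;
}

// Case-insensitive equality on the fixed-width representation.
bool iequals_utf8(const std::string& a, const std::string& b) {
    const std::u32string ua = decode_utf8(a), ub = decode_utf8(b);
    return ua.size() == ub.size() &&
           std::equal(ua.begin(), ua.end(), ub.begin(),
                      [](char32_t x, char32_t y) {
                          return simple_fold(x) == simple_fold(y);
                      });
}
```

Once the text is a sequence of 32-bit code points, each element is one character,
which is the property the algorithms rely on.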

It could theoretically also work with variable-width characters, provided you
have a container/locale framework that allows operating on meta-characters.
I'm not sure how efficient it would be, though.
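
To illustrate what iterating over meta-characters could look like, here is a rough
sketch of an input iterator that walks UTF-8 bytes but yields whole code points,
so that standard algorithms see characters instead of bytes. The class name
utf8_iterator and the two helper functions are hypothetical, and the code assumes
well-formed UTF-8 (it does no validation and no bounds checking inside a sequence).

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <string>

// Input iterator over a UTF-8 byte range that dereferences to code points.
class utf8_iterator {
public:
    using iterator_category = std::input_iterator_tag;
    using value_type = char32_t;
    using difference_type = std::ptrdiff_t;
    using pointer = const char32_t*;
    using reference = char32_t;

    explicit utf8_iterator(const char* p) : p_(p) {}

    // Decode the code point starting at the current byte.
    char32_t operator*() const {
        unsigned char b = static_cast<unsigned char>(*p_);
        int extra = width(b) - 1;
        char32_t cp = extra == 0 ? b : b & (0x3F >> extra);
        for (int k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(p_[k]) & 0x3F);
        return cp;
    }
    // Advance by one code point, i.e. by 1 to 4 bytes.
    utf8_iterator& operator++() {
        p_ += width(static_cast<unsigned char>(*p_));
        return *this;
    }
    utf8_iterator operator++(int) { utf8_iterator t = *this; ++*this; return t; }
    bool operator==(const utf8_iterator& o) const { return p_ == o.p_; }
    bool operator!=(const utf8_iterator& o) const { return p_ != o.p_; }

private:
    // Sequence length in bytes, derived from the lead byte.
    static int width(unsigned char b) {
        return b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
    }
    const char* p_;
};

// Number of code points (not bytes) in a UTF-8 string.
std::size_t count_code_points(const std::string& s) {
    return static_cast<std::size_t>(std::distance(
        utf8_iterator(s.data()), utf8_iterator(s.data() + s.size())));
}

// Index, in code points, of the first occurrence of cp, or -1 if absent.
std::ptrdiff_t code_point_index(const std::string& s, char32_t cp) {
    utf8_iterator first(s.data()), last(s.data() + s.size());
    utf8_iterator hit = std::find(first, last, cp);
    return hit == last ? -1 : std::distance(first, hit);
}
```

Every dereference decodes again and std::distance has to walk the whole range,
which gives a feel for where the efficiency concern comes from.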

Best regards,
Pavol.