Boost users' mailing page: [Boost-users] Xpressive: UTF-8 and diacritics

From: Allan Odgaard (gusixpl02_at_[hidden])
Date: 2008-08-25 15:58:10

Next message: Eric Niebler: "Re: [Boost-users] Xpressive: UTF-8 and diacritics"
Previous message: François Mauger: "Re: [Boost-users] cross-platfrom binary serialization?"
Next in thread: Eric Niebler: "Re: [Boost-users] Xpressive: UTF-8 and diacritics"
Reply: Eric Niebler: "Re: [Boost-users] Xpressive: UTF-8 and diacritics"

It looks like the traits aspect of Xpressive is geared toward
characters, so I assume that Xpressive is not directly usable with
UTF-8 encoded text. Am I correct?

It might work by making the character type a 32-bit integer and
then using iterator adapters which expose the sequence as UCS-4 code
points (after all, the sequence is merely “encoded”), but that leads
me to the next question: diacritics.
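
To make the iterator-adapter idea concrete, here is a minimal sketch in plain standard C++. It assumes well-formed UTF-8 input and does no validation. The names utf8_code_point_iterator and to_code_points are hypothetical and not part of Xpressive, and whether Xpressive's traits would accept a 32-bit character type is a separate, open question.

```cpp
#include <cassert>
#include <string>

// Hypothetical adapter: walks a UTF-8 byte sequence and yields one
// UCS-4 code point per step. Assumes well-formed input (no validation).
class utf8_code_point_iterator {
public:
    explicit utf8_code_point_iterator(const char* p) : p_(p) { assert(p != nullptr); }

    // Decode the code point that starts at the current byte.
    char32_t operator*() const {
        const unsigned char lead = byte(0);
        const int len = sequence_length(lead);
        if (len == 1) return lead;
        char32_t cp = lead & (0x7F >> len);      // payload bits of the lead byte
        for (int i = 1; i < len; ++i)
            cp = (cp << 6) | (byte(i) & 0x3F);   // 6 payload bits per continuation byte
        return cp;
    }

    // Advance by one whole code point, not by one byte.
    utf8_code_point_iterator& operator++() {
        p_ += sequence_length(byte(0));
        return *this;
    }

    bool operator!=(const utf8_code_point_iterator& other) const { return p_ != other.p_; }

private:
    unsigned char byte(int i) const { return static_cast<unsigned char>(p_[i]); }

    // Number of bytes in the sequence, derived from the lead byte.
    static int sequence_length(unsigned char lead) {
        if (lead < 0x80) return 1;          // 0xxxxxxx
        if ((lead >> 5) == 0x06) return 2;  // 110xxxxx
        if ((lead >> 4) == 0x0E) return 3;  // 1110xxxx
        return 4;                           // 11110xxx (assumed well-formed)
    }

    const char* p_;
};

// Convenience helper: collect all code points of a UTF-8 string.
std::u32string to_code_points(const std::string& utf8) {
    std::u32string out;
    utf8_code_point_iterator it(utf8.data());
    const utf8_code_point_iterator end(utf8.data() + utf8.size());
    for (; it != end; ++it) out.push_back(*it);
    return out;
}
```

With this sketch, the UTF-8 bytes "e\xCC\x81" (e followed by U+0301) come out as two code points, while "\xC3\xA9" (precomposed U+00E9) comes out as one.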

For example, something like é in decomposed Unicode is two code points
(e followed by a combining ´ mark), so even when the sequence is
iterated as UCS-4 code points, a regexp of “.” will match just the e,
not the actual (rendered) character.
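
The gap between code points and rendered characters can be illustrated with a small sketch that groups a base code point with the combining marks that follow it. It uses a deliberate simplification: only U+0300 to U+036F (the Combining Diacritical Marks block) are treated as combining, which is far less than full Unicode grapheme-cluster segmentation covers. The function names are hypothetical and this is not something Xpressive is known to provide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplifying assumption: only U+0300..U+036F count as combining marks.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Length, in code points, of the rendered character starting at pos:
// one base code point plus any combining marks directly after it.
std::size_t cluster_length(const std::u32string& s, std::size_t pos) {
    assert(pos < s.size());
    std::size_t end = pos + 1;
    while (end < s.size() && is_combining_mark(s[end])) ++end;
    return end - pos;
}

// Number of rendered characters under the same assumption.
std::size_t count_clusters(const std::u32string& s) {
    std::size_t n = 0;
    for (std::size_t pos = 0; pos < s.size(); pos += cluster_length(s, pos)) ++n;
    return n;
}
```

For the decomposed sequence U"e\u0301", a matcher that consumes one code point stops after the e, whereas cluster_length reports 2, so a "." that should match the rendered character would have to consume both code points.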

Since I was unable to find any discussion of this while searching for
Xpressive, I am curious to hear if any thoughts have gone into these
issues.
