Boost mailing page: [boost] Re: Any interest in adding unicode support to boost?


From: Miro Jurisic (macdev_at_[hidden])
Date: 2004-10-19 16:07:08

Next message: Beman Dawes: "Re: [boost] Any interest in adding unicode support to boost?"
Previous message: Edward Diener: "[boost] Re: Any interest in adding unicode support to boost?"
In reply to: Erik Wien: "[boost] Re: Any interest in adding unicode support to boost?"
Next in thread: Erik Wien: "[boost] Re: Any interest in adding unicode support to boost?"
Reply: Erik Wien: "[boost] Re: Any interest in adding unicode support to boost?"

In article <cl3nps$4d8$1_at_[hidden]>, "Erik Wien" <wien_at_[hidden]> wrote:

> Hi. Thanks for the feedback!

My pleasure :-)

> "Miro Jurisic" <macdev_at_[hidden]> wrote in message
> news:macdev-BACD3C.13585519102004_at_sea.gmane.org...
> > I generally agree with this design approach, but I don't think that code
> > point iterators alone are sufficient.
>
> Neither do I, as a matter of fact, but this is as far as I have come right
> now. :) There would probably be different types of iterators (or iterator
> wrappers) made available to enable iterations over everything from code units
> to code points/abstract characters.

Yes, I agree.

> > Iteration over encoded characters and abstract characters would be needed
> > for some algorithms to function sensibly. For example, the simple task of:
> >
> > find(begin, end, "ü")
> >
> > needs to use abstract characters in order to be able to find precomposed
> > and decomposed versions of ü.
> >
>
> True... And this is a point where implementation would be less than trivial.

Yeah, that's how far I got before I decided that I didn't have the time to deal
with the problem given my current schedule.
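
To make the find() problem concrete, here is a minimal sketch, not a real implementation. It assumes C++11 char32_t strings holding one code point per element, and compose_toy is a hypothetical stand-in for proper canonical composition that knows about exactly one pair (u followed by U+0308). A plain code point search misses the decomposed ü, while a search over composed text finds it:

```cpp
#include <string>

// Toy stand-in for canonical composition (hypothetical helper):
// it handles ONLY the pair u + U+0308 -> U+00FC and nothing else.
std::u32string compose_toy(const std::u32string& in) {
    std::u32string out;
    for (char32_t cp : in) {
        if (cp == 0x0308 && !out.empty() && out.back() == U'u') {
            out.back() = 0x00FC;  // merge into precomposed u-diaeresis
        } else {
            out.push_back(cp);
        }
    }
    return out;
}

// Search that compares raw code points, one by one.
bool contains_code_points(const std::u32string& text,
                          const std::u32string& needle) {
    return text.find(needle) != std::u32string::npos;
}

// Search that first brings both sides to the same (toy) composed form.
bool contains_equivalent(const std::u32string& text,
                         const std::u32string& needle) {
    return compose_toy(text).find(compose_toy(needle)) != std::u32string::npos;
}
```

With the text U"Mu\u0308nchen" and the needle U"\u00FC", contains_code_points returns false and contains_equivalent returns true. A real version would need the full Unicode composition data and a decision about whether a bare "u" should match inside "ü", which this sketch sidesteps.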

> > Again, taking this example, let's say that do_some_operation performs
> > canonicalization to some Unicode canonical form; you can't do this by
> > iterating over code points.
>
> Nope. A code unit iterator would be needed for things like that.

I am pretty sure you mean abstract character here, not code unit. My
understanding of the Unicode terminology is that the decomposed version of ü
consists of

one abstract character (ü)
two encoded characters (u, ¨)
two UTF-32 code units (0x00000075 0x00000308)
two UTF-16 code units (0x0075 0x0308)
three UTF-8 code units (0x75 0xCC 0x88)

but perhaps I have it backwards...
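
The code unit counts above can be checked with a small sketch. It assumes C++11 and valid, non-surrogate code points; encode_utf8 and encode_utf16 are names I made up for illustration, not part of any existing library:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Encode a sequence of code points as UTF-8 code units.
// Assumes every input value is a valid, non-surrogate code point.
std::vector<std::uint8_t> encode_utf8(const std::u32string& cps) {
    std::vector<std::uint8_t> out;
    for (char32_t cp : cps) {
        if (cp < 0x80) {
            out.push_back(static_cast<std::uint8_t>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}

// Encode a sequence of code points as UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(const std::u32string& cps) {
    std::vector<std::uint16_t> out;
    for (char32_t cp : cps) {
        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            char32_t v = cp - 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));
        }
    }
    return out;
}
```

For the decomposed sequence U"u\u0308", encode_utf8 yields the three code units 0x75 0xCC 0x88 and encode_utf16 yields the two code units 0x0075 0x0308, matching the list above. The precomposed U"\u00FC" gives two UTF-8 code units (0xC3 0xBC) and one UTF-16 code unit.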

> The implementation described here would not pose too much of a problem, I was
> thinking more of the problems that arise when you take things like collation
> and locales into consideration. From what I understand there is a real issue
> in enabling proper unicode support in the standard classes like locale, ctype
> and collate, as they assume things that do not necessarily apply to a unicode
> representation of text. A failure to enable good support in those classes
> (at least locale and ctype), would also make the iostream support break, and
> things start to snowball. I could very well be wrong on this (Actually, I
> hope I am! :) ), as I haven't had the time to read up on all issues
> concerning this. But again, this is one of many problems I hope running this
> project will help reveal.

I don't know enough about locales to comment on this, unfortunately.

meeroh
