From: Rogier van Dalen (rogiervd_at_[hidden])
Date: 2005-07-27 23:24:52


Hi Graham,

On 7/25/05, Graham <Graham_at_[hidden]> wrote:
> [...]
> If we can agree the interface/separation of Unicode character data
> from string interfaces, then I believe that we will move forward
> quickly from there, as we are then 'just' talking about algorithm
> optimisation on a known data set to create the best possible string
> implementation.

OK, we agree on this; I was incorrectly lumping things together.

> >> How have you hooked in dictionary word break support for languages
> like
> >> Thai
>
> >IMO that would be beyond the scope of a general Unicode library.
>
> It is both outside the scope and fundamental to the approach, as
> this case must be handled/provided for.
>
> In my experience this is handled by the dictionary pass [outside the
> scope of this support] adding special break markers into the text [which
> need to be supported transparently as Unicode characters that happen to
> be in the private use range at this level] so that the text and string
> iterators can then be handled normally. The fact that the break markers
> are special characters in the private use range should not be relevant
> or special at this level.

You mean that we invent a set of private characters that the
dictionary pass should use?
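If so, just to check my understanding, here is a minimal sketch of such a dictionary pass (the marker code point U+E000 and the greedy longest-match segmentation are my assumptions, not your design):

```python
# Hypothetical sketch: a dictionary pass marks word boundaries by inserting
# a private-use code point (U+E000 here is an assumed choice) so downstream
# break iterators can treat the marker like any other Unicode character.
WORD_BREAK = "\uE000"  # private-use marker

def insert_breaks(text, dictionary):
    """Greedy longest-match segmentation; real Thai breaking is more subtle."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest dictionary word starting at position i first.
        for length in range(min(len(text) - i, 10), 0, -1):
            if text[i:i + length] in dictionary:
                out.append(text[i:i + length])
                i += length
                break
        else:
            out.append(text[i])  # unknown character: pass through as-is
            i += 1
    return WORD_BREAK.join(out)
```

Downstream text and string iterators would then see the marker as an ordinary private-use character and break on it, without any Thai-specific logic at that level.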

> >> How far have you gone? Do you have support for going from logical to
> >> display on combined ltor and rtol ? Customised glyph conversion for
> >> Indic Urdu?
>
> >Correct me if I'm wrong, but I think these issues become important
> >only when rendering Unicode strings. Aren't they thus better handled
> >by Uniscribe, Pango, or similar OS-specific libraries? I think a Boost
> >Unicode library should focus on processing symbolic Unicode strings
> >and keep away from what happens when they are displayed, just like
> >std::basic_string does.
>
> Unfortunately I believe that there may be serious limitations in this
> approach.
>
> I strongly believe that even if we do not actually write all the code we
> must not be in a position where, for example, you have to use a
> Uniscribe library based on Unicode 4 and a Boost library based on
> Unicode 4.1. [This is even ignoring UniScribe's 'custom' handling].
>
> We must provide a Unicode character system on which all libraries can
> operate consistently.
>
> Even working out a grapheme break may require different sets of
> compromises that must work consistently for any set of inter-related
> libraries to be successful.

Do you have an example? I'm having trouble envisioning a situation in
which libraries based on different Unicode versions actually cause
conflicts.

> As another example of display controlling data organisation: what
> happens if you want a page of text to display the same on several
> machines?

Can you elaborate? In what cases is this vital and how does display
influence data organisation?

> This is actually a very difficult thing to do, due to limitations in
> Windows GDI scaling [which is not floating point but 'rounds' scaling
> calculations, and which can result in as much as a +/-10% difference
> in simple string lengths on different machines unless handled
> specifically; e.g. IIIIIIIIIIIIIIII can be the same length as WXWXWX
> on one machine, but there can be a 20% difference on another machine].
> It requires access to the data conversion mechanisms, and requires
> that you know how you are going to perform the rendering.

I fear I don't understand what you mean. It sounds to me like you're
suggesting defining a new font format for the Boost Unicode library.

> >Why would you want to do canonical decomposition explicitly in a
> >regular expression?
>
> Let me give two examples:
>
> First why?
>
> If you use a regular expression to search for <e acute> - [I use angle
> brackets <> to describe a single character for this e-mail] then
> logically you should find text containing:

And <acute> is a combining acute?

> <e><acute> and <e acute> as these are both visually the same when
> displayed to the user.

> Second why do we need to know? If we decompose arbitrarily then we can
> cover over syntax errors and act unexpectedly:
> [...]

Yes, the Unicode library should by default process grapheme clusters
rather than code points. This would automatically solve the regex
issue.
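To illustrate the equivalence with Python's stdlib (purely as a stand-in for whatever the Boost library would do internally): normalising both pattern and text to a canonical form before matching makes the two encodings of <e acute> compare equal.

```python
import unicodedata

def canonically_contains(text, pattern):
    """Substring search under canonical equivalence (NFD both sides first)."""
    return unicodedata.normalize("NFD", pattern) in unicodedata.normalize("NFD", text)

# <e acute> as one code point (U+00E9) vs <e><combining acute> (U+0065 U+0301):
# both forms are found regardless of how the text was originally encoded.
print(canonically_contains("caf\u00e9", "e\u0301"))  # True
print(canonically_contains("cafe\u0301", "\u00e9"))  # True
```

A full solution would additionally require matches to align on grapheme cluster boundaries — plain substring search can still match a bare <e> inside the <e><acute> cluster — which is exactly why cluster-based processing should be the default.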

> >> We will need to create a utility to take the 'raw'/ published unicode
> >> data files along with user defined private characters to make these
> >> tables which would then be used by the set of functions that we will
> >> agree such as isnumeric, ishangul, isstrongrtol, isrtol etc.
> >
> >I find the idea of users embedding private character properties
> >_within_ the standard Unicode tables, and building their own slightly
> >different version of the Unicode library, scary. Why is this needed?
>
> It is important that the private use range, which is part of the
> Unicode spec, be handled consistently with the other Unicode ranges,
> otherwise we end up having to write everything twice!
>
> The private use range is in the Unicode spec specifically as it has been
> recognised that any complex Unicode system will need private use
> characters.
>
> Classic examples are implementations that move special display
> characters into portions of the private use ranges to allow for optimal
> display of visible tabs, visible cr, special characters like Thai word
> breaks, and of course completely non-standard characters like a button
> that can be embedded in text and would be entirely implementation
> specific. Having the breaking characteristics of these characters be
> handled consistently with all Unicode characters is a massive
> simplification for coding.
>
> I strongly believe that we must therefore allow each developer who wants
> to use the Unicode system the ability to add these private use character
> properties into their own personal main character tables so they are
> handled consistently with all other characters, but acknowledge that
> these are implementation specific.
>
> This private use character data would NOT be published or distributed -
> the facility to merge them in during usage allows each developer the
> access to add their own private use data for their own system only.

But surely this means every app would have to come with a different DLL?
I'm not so sure about this. For many cases other markup (XML or
something) would do. Maybe other people have opinions about this?
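For what it's worth, here is roughly how I picture the overlay working if we did go that route (all names and defaults below are my assumptions, and the built-in table is a tiny stand-in for the generated UCD data):

```python
# Tiny stand-in for the generated Unicode character database tables.
UNICODE_CATEGORY = {0x0041: "Lu", 0x0E01: "Lo", 0x0301: "Mn"}

def make_category_lookup(private_overrides):
    """Overlay user-supplied properties for the Private Use Area at lookup time."""
    def category(cp):
        if 0xE000 <= cp <= 0xF8FF:                  # BMP Private Use Area
            return private_overrides.get(cp, "Co")  # "Co" = private use, per the UCD
        return UNICODE_CATEGORY.get(cp, "Cn")       # "Cn" = unassigned
    return category

# A developer declares their private break marker to behave like a separator;
# this stays in their build only and is never distributed with the tables.
category = make_category_lookup({0xE000: "Zs"})
```

That would keep the private data out of the published tables while letting the property queries treat PUA characters uniformly — but, as I say, I suspect markup would serve many of these cases just as well.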

Regards,
Rogier