Subject: Re: [boost] [rfc] Unicode GSoC project
From: Mathias Gaunard (mathias.gaunard_at_[hidden])
Date: 2009-05-15 10:04:30


Graham wrote:

> A good reloadable character library is in the vault.

I'll be reviewing it in a while.
I'm not too sure about the memory layout it uses (__uni_char_data could
really be compressed to use less memory for example), nor about the
interface it exposes, but it does seem to work well.

About is_grapheme_break, though: isn't the implementation for legacy
grapheme clusters rather than extended ones?

> I think that a grapheme is more of an iterator concept than a data type
> concept. By specialising it you will unnecessarily complicate any
> library. Don't forget that, for example, the current grapheme may start
> as one character, then suddenly 'grab' the surrounding characters as it
> makes a combined glyph.
> I have never found a use case in practise where specialising the
> grapheme as other than a validated series of code points was helpful.

A grapheme is nothing more than a subrange of code points, at least in
my current design.
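
To illustrate what I mean, here is a rough sketch, not the actual design.
All names are hypothetical, and the break rule is deliberately crude: a
code point in one of two combining-mark blocks attaches to the preceding
code point. Real grapheme breaking needs the full property tables.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical names; this is not the API of any existing Boost library.
using code_point_iterator = std::u32string::const_iterator;

// A grapheme is just a [first, second) subrange of the code point sequence.
using grapheme = std::pair<code_point_iterator, code_point_iterator>;

// Assumption: only these two combining-mark blocks extend a grapheme.
// This is a crude stand-in for the real Unicode property lookup.
inline bool is_combining_mark(char32_t c) {
    return (c >= 0x0300 && c <= 0x036F) || (c >= 0x20D0 && c <= 0x20FF);
}

// Split s into graphemes: one code point plus any combining marks after it.
// The returned iterators point into s, so s must outlive the result.
inline std::vector<grapheme> graphemes(const std::u32string& s) {
    std::vector<grapheme> out;
    auto it = s.begin();
    while (it != s.end()) {
        auto start = it++;
        while (it != s.end() && is_combining_mark(*it)) ++it;
        out.emplace_back(start, it);
    }
    return out;
}
```

No new data type is involved: each grapheme is only a pair of iterators
into the existing code point sequence.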

> The two cases where graphemes are important is in display [which
> requires intermediate glyph conversion anyway, and works just as well on
> runs of code points, so code points are fine] and in editing - and the
> grapheme-ness here alters during typing.

It's also useful for grapheme-level searching.

Searching for the substring "foo" in the string "foo\u20d7" shouldn't
match anything, because the end of the match does not fall on a grapheme
boundary: U+20D7 is a combining mark that attaches to the final "o".
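
A minimal sketch of such a search, under the same kind of simplifying
assumption (only two combining-mark blocks are treated as extending a
grapheme; the function names are made up for this example):

```cpp
#include <cstddef>
#include <string>

// Assumption: only these two combining-mark blocks extend a grapheme.
// A real implementation would consult the Unicode property tables.
inline bool extends_grapheme(char32_t c) {
    return (c >= 0x0300 && c <= 0x036F) || (c >= 0x20D0 && c <= 0x20FF);
}

// True if a grapheme boundary lies just before index pos (pos <= s.size()).
// The start and the end of the text always count as boundaries.
inline bool is_boundary(const std::u32string& s, std::size_t pos) {
    return pos == 0 || pos == s.size() || !extends_grapheme(s[pos]);
}

// Like std::u32string::find, but only accepts matches whose two ends
// both fall on grapheme boundaries. Returns npos if there is none.
inline std::size_t grapheme_find(const std::u32string& text,
                                 const std::u32string& pattern) {
    std::size_t pos = text.find(pattern);
    while (pos != std::u32string::npos) {
        if (is_boundary(text, pos) &&
            is_boundary(text, pos + pattern.size()))
            return pos;
        pos = text.find(pattern, pos + 1);  // try the next candidate
    }
    return std::u32string::npos;
}
```

With this, grapheme_find(U"foo\u20d7", U"foo") returns npos, whereas a
plain code point search would report a match at index 0.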

> if you
> can do graphemes then you can do words, paragraphs etc as they are all
> just attributes of the characters with simple rules. Graphemes come in
> to their own for text display and editing and you would need these as
> well to be able to support that.

Those are not as important in my opinion, and since my time is limited,
they won't be the focus.

Adding them later makes perfect sense, however.