Boost mailing page: Re: [boost] Comment on string / unicode discussion

From: Joel de Guzman (joel_at_[hidden])
Date: 2006-07-05 21:23:58

Next message: Joel de Guzman: "Re: [boost] Interest in super string class?"
Previous message: Jeff Garland: "Re: [boost] Comment on string / unicode discussion"
In reply to: Sean Parent: "[boost] Comment on string / unicode discussion"
Next in thread: Robert Ramey: "Re: [boost] Comment on string / unicode discussion"

Sean Parent wrote:
> I don't have enough time to delve deeply into this thread but I
> thought I'd make a few passing comments.
>
> Adobe has a fairly major string class problem (we joke that every
> project must have its own string class - which is nearly true).
> There isn't such a thing as a single type of string - there are _many_
> purposes and you need to be able to handle things like language and
> style runs and large, large blocks of text with efficient edits, UI
> substitutions (which are aware of things like split negation and
> masculine/feminine forms), language based ordering, different
> encodings...
>
> We need another string class like a hole in the head.
>
> What we do need - are good standard algorithms which can be applied
> to any string class.
>
> I believe this is doable with the current iterator interface.
>
> I believe it's possible (meaning I've done some quick experiments) to
> define an input iterator (actually as strong as a non-mutating
> forward iterator) and output iterator, which do conversions. This
> means that you can define operations in terms of unicode encoding
> (though some operations such as ordering may still require a locale).
>
> Consider -
>
> to_lower(first, last, output)
> to_upper(first, last, output)
>
> such transformations can work with any encoding (you can uppercase
> UTF-8 into UTF-32). They can't work in-situ (but I don't think
> to_upper or to_lower really can work in-situ - certainly not in UTF-8
> and probably not in UTF-16, and I believe there are some multi-
> character forms that even break in UTF-32...). It is possible though
> to wrap them with a replace function for in-place operations.
>
> The current std::find() will work with such iterator adapters to find
> a single UTF-32 character (in any encoded sequence).
>
> Currently with ASL we're taking such an approach for localization
> strings: replacing an existing string class for localized strings at
> Adobe with a small set of functions and _any_ string class (any
> sequence of code units), including std::string, std::vector (or deque
> or list).
>
> You might take a look here for some ideas:
> <http://opensource.adobe.com/group__asl__xstring.html>.
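
For readers who want something concrete, here is a rough sketch of the converting-iterator idea quoted above. It is not ASL's API: the names utf8_to_utf32_iterator and make_utf32 are made up for illustration, the adapter assumes well-formed UTF-8 and does no validation, and it is tagged as an input iterator to stay on the safe side even though it can be traversed more than once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <string>

namespace sketch {

// Hypothetical adapter: walks UTF-8 code units and yields UTF-32 code points.
// Assumes the underlying sequence is well-formed UTF-8 (no error handling).
template <typename Iter>
class utf8_to_utf32_iterator
{
public:
    typedef std::input_iterator_tag iterator_category;
    typedef char32_t                value_type;
    typedef std::ptrdiff_t          difference_type;
    typedef const char32_t*         pointer;
    typedef char32_t                reference; // decoded on the fly, by value

    utf8_to_utf32_iterator() : pos_() {}
    explicit utf8_to_utf32_iterator(Iter pos) : pos_(pos) {}

    // Decode the code point that starts at the current position.
    char32_t operator*() const
    {
        Iter it = pos_;
        const unsigned char lead = static_cast<unsigned char>(*it);
        const std::size_t n = length(lead);
        char32_t cp = (n == 1) ? static_cast<char32_t>(lead)
                               : static_cast<char32_t>(lead & (0x7Fu >> n));
        for (std::size_t i = 1; i < n; ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(*++it) & 0x3Fu);
        return cp;
    }

    // Skip all code units of the current code point.
    utf8_to_utf32_iterator& operator++()
    {
        for (std::size_t n = length(static_cast<unsigned char>(*pos_)); n > 0; --n)
            ++pos_;
        return *this;
    }
    utf8_to_utf32_iterator operator++(int)
    {
        utf8_to_utf32_iterator tmp(*this);
        ++*this;
        return tmp;
    }

    Iter base() const { return pos_; } // position in the UTF-8 sequence

    friend bool operator==(const utf8_to_utf32_iterator& a,
                           const utf8_to_utf32_iterator& b) { return a.pos_ == b.pos_; }
    friend bool operator!=(const utf8_to_utf32_iterator& a,
                           const utf8_to_utf32_iterator& b) { return !(a == b); }

private:
    // Number of code units in the sequence introduced by this lead byte.
    static std::size_t length(unsigned char lead)
    {
        if (lead < 0x80)        return 1;
        if ((lead >> 5) == 0x6) return 2;
        if ((lead >> 4) == 0xE) return 3;
        return 4;
    }
    Iter pos_;
};

template <typename Iter>
utf8_to_utf32_iterator<Iter> make_utf32(Iter it)
{
    return utf8_to_utf32_iterator<Iter>(it);
}

// Plain std::find locating one UTF-32 character in a UTF-8 std::string.
inline void example()
{
    const std::string s = std::string("caf") + "\xC3\xA9" + " au lait";
    auto hit = std::find(make_utf32(s.begin()), make_utf32(s.end()),
                         char32_t(0x00E9)); // U+00E9, e with acute accent
    assert(hit.base() - s.begin() == 3);    // starts at code unit 3
}

} // namespace sketch
```

The same adapter should work over std::vector<char>, std::deque<char> or std::list<char>, since it only needs increment and dereference from the underlying iterator.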

This is very close to what I have in mind. The main difference is that
the functions/algorithms I have in mind take ranges instead of iterators.
Thus:

to_lower(src, dest)
to_upper(src, dest)
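
To make the range form concrete, a minimal sketch might look like the following. It assumes that "range" means anything std::begin/std::end accept, that dest is any container with push_back, and that the code units are plain ASCII chars; the namespace and the helper names lower_unit/upper_unit are made up for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <iterator>
#include <list>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical per-unit helpers: ASCII / "C" locale only, not Unicode-aware.
inline char lower_unit(char c)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}
inline char upper_unit(char c)
{
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Src: any range of char code units. Dest: any container with push_back.
template <typename Src, typename Dest>
void to_lower(const Src& src, Dest& dest)
{
    for (auto it = std::begin(src); it != std::end(src); ++it)
        dest.push_back(lower_unit(*it));
}

template <typename Src, typename Dest>
void to_upper(const Src& src, Dest& dest)
{
    for (auto it = std::begin(src); it != std::end(src); ++it)
        dest.push_back(upper_unit(*it));
}

// The same two algorithms applied to unrelated "string" types.
inline void example()
{
    const std::string src = "Hello, World";

    std::vector<char> lower;
    to_lower(src, lower);                       // std::string -> std::vector
    assert(std::string(lower.begin(), lower.end()) == "hello, world");

    const std::list<char> units(src.begin(), src.end());
    std::string upper;
    to_upper(units, upper);                     // std::list -> std::string
    assert(upper == "HELLO, WORLD");
}

} // namespace sketch
```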

With these, I could make Fusion-like wrappers that transform them into
something like:

some_string s1 = to_lower(src);
some_string s2 = to_upper(src);

where to_lower and to_upper return cheap views that are in and of
themselves valid strings/ranges. They are cheap because the actual
conversions/transformations are done on demand (think lazy evaluation).
So, as with expression template techniques, there are no expensive
temporaries when you perform seemingly expensive tasks like:

some_string s = f1(f2(f3(f4(src))));

And yes, because they are generic, those string algorithms can work
on any string type that satisfies some basic requirements.
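
A rough sketch of such lazy views, under some assumptions of mine: the view is called transform_view (a made-up name), it borrows an lvalue source and takes ownership of a temporary one (so nested views do not dangle), it handles ASCII chars only, and the result is materialized through the target's iterator-pair constructor rather than an implicit conversion.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <iterator>
#include <string>
#include <type_traits>
#include <utility>

namespace sketch {

// Hypothetical per-unit helpers: ASCII only.
inline char lower_unit(char c)
{ return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }
inline char upper_unit(char c)
{ return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); }

// Src is an lvalue reference (borrowed source) or a value (owned nested view).
template <typename Src>
class transform_view
{
    typedef typename std::remove_reference<Src>::type range_type;
    typedef decltype(std::begin(std::declval<const range_type&>())) base_iterator;

public:
    typedef char (*function_type)(char);

    class iterator
    {
    public:
        typedef std::input_iterator_tag iterator_category;
        typedef char                    value_type;
        typedef std::ptrdiff_t          difference_type;
        typedef const char*             pointer;
        typedef char                    reference;

        iterator(base_iterator pos, function_type f) : pos_(pos), f_(f) {}
        char operator*() const { return f_(*pos_); } // work happens here
        iterator& operator++() { ++pos_; return *this; }
        iterator operator++(int) { iterator tmp(*this); ++pos_; return tmp; }
        friend bool operator==(const iterator& a, const iterator& b)
        { return a.pos_ == b.pos_; }
        friend bool operator!=(const iterator& a, const iterator& b)
        { return !(a == b); }

    private:
        base_iterator pos_;
        function_type f_;
    };

    transform_view(Src&& src, function_type f)
        : src_(std::forward<Src>(src)), f_(f) {}

    iterator begin() const { return iterator(std::begin(source()), f_); }
    iterator end() const { return iterator(std::end(source()), f_); }

private:
    const range_type& source() const { return src_; }
    Src src_;
    function_type f_;
};

// Building a view copies no characters and transforms nothing.
template <typename Src>
transform_view<Src> to_lower(Src&& src)
{ return transform_view<Src>(std::forward<Src>(src), &lower_unit); }

template <typename Src>
transform_view<Src> to_upper(Src&& src)
{ return transform_view<Src>(std::forward<Src>(src), &upper_unit); }

inline void example()
{
    std::string src = "Hello";
    auto view = to_upper(to_lower(src));     // nested views, nothing evaluated
    src[0] = 'j';                            // change the source afterwards
    std::string s(view.begin(), view.end()); // one pass, no intermediate string
    assert(s == "JELLO");                    // the view saw the updated source
}

} // namespace sketch
```

Because each view only stores its source and a function pointer, nesting f1(f2(f3(f4(src)))) in this style costs one traversal when the result is finally copied out.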

Regards,

-- 
Joel de Guzman
http://www.boost-consulting.com
http://spirit.sf.net
