Subject: Re: [boost] [gsoc] Request Feedback for Boost.Ustr Unicode String Adapter
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-08-14 10:12:44


Soares Chen Ruo Fei wrote:

>> with non-Unicode CJK encodings
>> like Shift-JIS or GBK there is no
>> way to go backward

> Ahh I see so that's quite nasty, but actually it still can be done
> with the sacrifice on efficiency. Basically since the iterator already
> has the begin and end boundary iterators it can simply reiterate all
> over from the beginning of the string. Although doing so is roughly
> O(N^2) it shouldn't make significant impact as developers rarely use
> this multi-byte encoding and even seldom use the reverse decoding
> function.

As a general point, I believe it's a bad idea to hide a surprise like
O(N^2) instead of O(N) complexity in a "rare" case. Doing so means
that users will implement something that seems to work, and then get
bitten later when it doesn't work in the field. (For example, the
first time that a customer in Japan tries to process a 1 MB file and it
takes a million times longer than expected.)
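
To make the cost concrete, here is a small sketch that counts the
decode steps needed to walk a string backward when every step back
has to rescan from the start. The function names are made up for
illustration, and the lead-byte ranges are a simplified assumption
about Shift-JIS rather than a complete decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Simplified Shift-JIS lead-byte test (assumed ranges; a real decoder
// would also validate the trail byte).
inline bool is_lead_byte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Returns the offset of the character that ends at `pos`, found by
// decoding forward from offset 0. Every character decoded on the way
// adds one to `steps`.
inline std::size_t prev_by_rescan(const std::string& s, std::size_t pos,
                                  std::size_t& steps) {
    assert(pos > 0 && pos <= s.size());
    std::size_t i = 0, prev = 0;
    while (i < pos) {
        prev = i;
        const bool two_bytes =
            is_lead_byte(static_cast<unsigned char>(s[i])) &&
            (i + 1 < s.size());
        i += two_bytes ? 2 : 1;
        ++steps;
    }
    return prev;
}

// Walks the whole string backward, one character at a time, and
// returns the total number of forward decode steps that required.
inline std::size_t reverse_walk_cost(const std::string& s) {
    std::size_t steps = 0;
    for (std::size_t pos = s.size(); pos > 0; )
        pos = prev_by_rescan(s, pos, steps);
    return steps;
}
```

For N characters this comes to N(N+1)/2 decode steps, so 1000
characters cost 500500 steps where a forward pass needs 1000.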

It would be better to not provide the inefficient case at all. Compare
with how std::list doesn't provide random access, even though it could
do so in O(N). Looking at your character set iterator, it seems to me
that you could offer both a forward-only and a bidirectional iterator
for the UTF encodings, but only the forward-only one for these other
encodings. A forward-only iterator also has no need to store the begin
iterator, which saves space.
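
Something along these lines, as a sketch only: every name below is
hypothetical and none of it is Boost.Ustr's actual interface. The
idea is that each encoding declares the strongest traversal it can
support cheaply, and the iterator's stored state follows from that.

```cpp
#include <iterator>
#include <type_traits>

// Hypothetical encoding tags, for illustration only.
struct utf8_tag {};
struct shift_jis_tag {};

// The strongest traversal each encoding supports at constant cost
// per step.
template <class Encoding> struct traversal_of;
template <> struct traversal_of<utf8_tag> {
    using type = std::bidirectional_iterator_tag;
};
template <> struct traversal_of<shift_jis_tag> {
    using type = std::forward_iterator_tag;
};

// Forward-only state: current position and end.
template <class Base, class Category>
struct iterator_state {
    Base cur;
    Base end;
};

// Bidirectional state also keeps begin, so that stepping back knows
// where to stop.
template <class Base>
struct iterator_state<Base, std::bidirectional_iterator_tag> {
    Base begin;
    Base cur;
    Base end;
};

// Skeleton of the decoding iterator: the category comes from the
// encoding. operator++ would exist for every encoding; operator--
// would be provided only when the category is bidirectional.
template <class Encoding, class Base>
struct codepoint_iterator_skeleton {
    using iterator_category = typename traversal_of<Encoding>::type;
    iterator_state<Base, iterator_category> state;
};
```

With this, code that tries to reverse-iterate over Shift-JIS fails
to compile instead of silently running in quadratic time.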

Regards, Phil.