From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2007-10-17 16:23:02


Hi James, thanks for replying.

James Porter wrote:
> I've been thinking about this off and on as well, though have been a
> little too busy to give it the write-up it deserves. That said, I think
> your code is a pretty good start. While I agree that tagged strings
> shouldn't automatically convert on assignment, I think recode() isn't
> the most useful way to go about it.
>
> In practice, I expect that most code conversion would occur during I/O,
> so I'd prefer to see the conversion done by the stream itself. recode()
> could still exist as a convenience function, though.

Yes, other people have suggested similar things. Even if it were true
that most charset conversion occurred during I/O - and that has not been
my experience in my own work - I would still argue that charset
conversion should be available for use in other contexts.

I see my recode() member function (i.e. utf8_string s2 =
s1.recode<utf8>()) ultimately being a convenience wrapper around some
sort of free function or functor. The need to track shift states and
partial characters makes this a bit complex, though.
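To make the idea concrete, here is a minimal sketch of a free recode
function underneath a tagged-string type. All the names (tagged_string,
recode_to_utf8, the latin1/utf8 tags) are illustrative, not the actual
code from my example; a real version would also have to carry
shift-state and partial-character context between calls, which this
stateless latin1-to-UTF-8 case conveniently avoids:

```cpp
#include <cassert>
#include <string>

// Hypothetical charset tag types; names are illustrative only.
struct latin1 {};
struct utf8 {};

// A trivial stand-in for a tagged string: bytes plus a compile-time tag.
template <typename Charset>
struct tagged_string {
    std::string data;  // bytes in the tagged encoding
};

// The free conversion function that a recode() member could forward to.
// Latin-1 to UTF-8 is stateless, so no shift-state tracking is needed here.
inline tagged_string<utf8> recode_to_utf8(const tagged_string<latin1>& in)
{
    tagged_string<utf8> out;
    for (unsigned char c : in.data) {
        if (c < 0x80) {
            out.data += static_cast<char>(c);            // ASCII passes through
        } else {
            out.data += static_cast<char>(0xC0 | (c >> 6));    // lead byte
            out.data += static_cast<char>(0x80 | (c & 0x3F));  // continuation
        }
    }
    return out;
}
```

A stateful conversion (e.g. from ISO-2022) would instead be a functor
holding its state between chunks, with recode() as the one-shot
convenience over it.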

> On the subject of converting between different encodings of strings, I
> noticed that you had some concerns about assignment between two
> different encodings using the same underlying type (latin1_string s =
> utf8_string("foo") for example). This could be resolved by using a
> nominally different char_traits class when inheriting from basic_string.

Yes; it has been suggested that they differ in their state_type. I
plan to investigate this, but if someone more knowledgeable would like
to do so, please go ahead.
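For reference, a sketch of what the state_type suggestion might look
like - the tagged_traits / latin1_state / utf8_state names are my own
invention for illustration, and I have not yet checked how well real
iostream facets tolerate this:

```cpp
#include <cassert>
#include <cwchar>        // std::mbstate_t
#include <string>
#include <type_traits>

// Hypothetical per-charset state types; the tag is all that differs.
struct latin1_state { std::mbstate_t st; };
struct utf8_state   { std::mbstate_t st; };

// Inherit everything from char_traits<char> except state_type.
template <typename State>
struct tagged_traits : std::char_traits<char> {
    using state_type = State;
};

using latin1_string = std::basic_string<char, tagged_traits<latin1_state>>;
using utf8_string   = std::basic_string<char, tagged_traits<utf8_state>>;

// The two strings are now distinct types, so
//   latin1_string s = utf8_string("foo");
// fails to compile, which is exactly the behaviour we want.
static_assert(!std::is_same<latin1_string, utf8_string>::value,
              "encodings must not be assignment-compatible");
```

As James notes, the catch is that standard streams expect
char_traits<char> exactly, so this buys type safety at the cost of
stream compatibility.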

> However, this would cause problems with I/O streams, since they expect a
> particular character type and char_traits. This goes back to my point
> above: the I/O streams should be aware of string tagging (if not
> directly responsible for it).

I imagine that an I/O streams library or some sort of adapter layer
compatible with these strings would be necessary.

> I'll need to think about how to specify character sets so that they're
> usable at compile time and run time, though my instinct would be to use
> subclasses that can be stored in a map of some sort. The subclassing
> would handle compile-time tagging, and the map would handle run-time
> tagging:
>
> class utf8 : public charset_base { ... };
> charset_map["utf8"] = new utf8();
>
> ...
>
> tagged_string<utf8> foo;
> rt_tagged_string bar;
> bar.set_encoding("utf8");
>
> This should combine the benefits of your first and third choices (type
> tags and objects), though I haven't thought about this enough to be
> confident that it's the right way to go.

Yes, this has some advantages. But a map has the disadvantage that
lookups are more expensive than the enum-indexed array that I have: in
my code, getting the char* name of a charset is a compile-time-constant
operation. I'm not sure how much that matters in practice.
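To show the trade-off I mean, here is a sketch of the two lookup
directions side by side. The enum values, names and helper functions
are illustrative stand-ins, not my actual code:

```cpp
#include <cassert>
#include <cstring>
#include <map>
#include <string>

// Illustrative enum-based registry, as in my approach.
enum charset_id { cs_latin1, cs_utf8, cs_count };

const char* const charset_names[cs_count] = { "ISO-8859-1", "UTF-8" };

// Enum-to-name is a constant-time array index; with a literal
// argument the compiler can fold it entirely.
inline const char* name_of(charset_id id) { return charset_names[id]; }

// The map-based direction James suggests: name-to-id costs an
// O(log n) string comparison lookup at run time.
inline std::map<std::string, charset_id> make_charset_map()
{
    std::map<std::string, charset_id> m;
    m["ISO-8859-1"] = cs_latin1;
    m["UTF-8"]      = cs_utf8;
    return m;
}
```

The two are not mutually exclusive, of course: the map could simply
store the enum (or a pointer into the same table), giving run-time
tagging on top of the cheap compile-time case.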

Thanks for your feedback. Does anyone else have any comments? Do
please have a look at my example code
(http://svn.chezphil.org/libpbe/trunk/examples/charsets.cc) and tell me
how well it would fit in with your approaches to charset conversion.

Regards,

Phil.