From: Andrey Semashev (andysem_at_[hidden])
Date: 2007-06-20 16:31:21


Mathias Gaunard wrote:
> Andrey Semashev wrote:
>
>> I'd like to note that Unicode consumes more memory than narrow
>> encodings.
>
> That's quite dependent on the encoding used.
> The most popular memory-saving Unicode encoding is UTF-8 though, which
> doubles the size needed for non-ASCII characters compared to
> ISO-8859-*, for example. It's not that problematic though.

UTF-8 is a variable-length encoding, which complicates processing
considerably. I'd rather stick to UTF-16 if I had to use Unicode. And
it's already twice as large as ASCII.

> Alternatives which use even less memory exist, but they have other
> disadvantages.
>
>
>> This may not be desirable in all cases, especially when the
>> application is not intended to support multiple languages in its
>> majority of strings (which, in fact, is a quite common case).
>
> Algorithms to handle text boundaries, tailored grapheme clusters,
> collations (some of which are context-sensitive) etc. are needed to
> process correctly any one language.
> So you need Unicode anyway, and better reuse the Unicode stuff than work
> on top of a legacy encoding.

I'm not saying that we don't need Unicode support. We do!
I'm only saying that in many cases plain ASCII does its job perfectly
well: logging, system messages, simple text formatting, texts in
restricted character sets such as numbers, phone numbers, and
identifiers of all kinds. There are cases where i18n is not needed at
all - mostly server-side apps with minimal UI. Being forced to use
Unicode internally in those cases means an increased memory footprint
and degraded performance due to encoding-translation overhead.