From: Alexander Grund (alexander.grund_at_[hidden])
Date: 2020-06-18 06:45:52
> I think it has most of what's needed, though it seems that the
> type conversion __builtin_convertvector, which is needed to
> expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present
> in newer versions of g++ than I have.
Then it's likely not very useful for now. Maybe later, once that compiler
version is more widespread.
> // Attempt to decode the subset of UTF-8 with code points < 256.
> // Format is either 0xxxxxxx -> 0xxxxxxx
> // or 110---xx 10yyyyyy -> xxyyyyyy
> // The input mustn't start or finish in the middle of a multi-byte
> // character.
> // Other inputs produce undefined outputs.
Good code for that special case, but I think "undefined outputs" is not
acceptable. The other SIMD UTF-8 conversions I've seen basically all
focus on ASCII: they convert as much ASCII as possible in bulk and fall
back to one-by-one decoding once a non-ASCII byte is found.
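A rough sketch of that pattern, to make it concrete. It is not taken from any particular library: `utf8_to_utf32`, `decode_one` and `replacement` are hypothetical names, and an 8-byte word test stands in for a real SIMD register, so the bulk path is only an approximation of what an SSE/AVX/NEON version would do. Malformed or truncated input yields U+FFFD instead of undefined output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical: code point emitted for malformed or truncated input.
constexpr char32_t replacement = 0xFFFD;

// Hypothetical scalar fallback: decodes one sequence starting at s[i] with
// full validation. On error it consumes a single byte and returns U+FFFD.
inline char32_t decode_one(const std::string& s, std::size_t& i) {
    const auto b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) { ++i; return b0; }
    std::size_t len;
    char32_t cp, lowest;  // lowest: smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }
    else { ++i; return replacement; }                    // stray continuation, 0xF8..0xFF
    if (s.size() - i < len) { ++i; return replacement; } // sequence runs past the end
    for (std::size_t k = 1; k < len; ++k) {
        const auto b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) { ++i; return replacement; }
        cp = (cp << 6) | (b & 0x3F);
    }
    // Reject overlong forms, surrogates and values above U+10FFFF.
    if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        ++i;
        return replacement;
    }
    i += len;
    return cp;
}

inline std::u32string utf8_to_utf32(const std::string& s) {
    std::u32string out;
    out.reserve(s.size());
    std::size_t i = 0;
    while (i < s.size()) {
        // Bulk path: a block of 8 bytes is pure ASCII if no byte has its
        // high bit set, so it can be widened without any decoding.
        while (s.size() - i >= 8) {
            std::uint64_t chunk;
            std::memcpy(&chunk, s.data() + i, 8);
            if (chunk & 0x8080808080808080ULL) break;
            for (int k = 0; k < 8; ++k)
                out.push_back(static_cast<unsigned char>(s[i + k]));
            i += 8;
        }
        // Fallback: one checked sequence, then retry the bulk path.
        if (i < s.size()) out.push_back(decode_one(s, i));
    }
    return out;
}
```

The size check before each 8-byte load is what keeps the bulk path from reading past the end of the input; the tail shorter than 8 bytes always goes through the checked scalar decoder.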
> That will be quick, but it does lack a few things; it doesn't check if
> it has reached the end of the input and it doesn't do any error checking.
So not really usable either. BUT: compare with Boost.Locale, which has a
`decode` and a `decode_valid` function, where the latter assumes valid UTF-8.
However, checking for end-of-input is obviously a must.
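To illustrate the distinction, here is a minimal sketch of a decoder that trusts the input to be well-formed UTF-8 but still refuses to read past the end. This is not Boost.Locale's code; `decode_trusted`, `decode_all_trusted` and the `incomplete` sentinel are hypothetical names chosen for the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical sentinel: the input ended in the middle of a sequence.
constexpr char32_t incomplete = 0xFFFFFFFEu;

// Assumes well-formed UTF-8: continuation bytes, overlong forms and
// surrogates are NOT checked. The only check is against `end`.
inline char32_t decode_trusted(const char*& p, const char* end) {
    assert(p < end);  // caller must not pass an empty range
    const auto b0 = static_cast<unsigned char>(*p);
    // Number of continuation bytes implied by the lead byte.
    const int extra = b0 < 0x80 ? 0 : b0 < 0xE0 ? 1 : b0 < 0xF0 ? 2 : 3;
    if (end - p <= extra) {  // sequence would run past the end
        p = end;
        return incomplete;
    }
    // Payload mask of the lead byte: 0x1F, 0x0F or 0x07 for 1, 2 or 3 extras.
    char32_t cp = static_cast<char32_t>(extra == 0 ? b0 : b0 & (0x3F >> extra));
    ++p;
    for (int k = 0; k < extra; ++k, ++p)
        cp = (cp << 6) | (static_cast<unsigned char>(*p) & 0x3F);
    return cp;
}

// Convenience wrapper for the example: decodes a whole string.
inline std::u32string decode_all_trusted(const std::string& s) {
    std::u32string out;
    const char* p = s.data();
    const char* const end = p + s.size();
    while (p < end) out.push_back(decode_trusted(p, end));
    return out;
}
```

The bounds check costs one comparison per sequence, which is small next to the validation work that gets skipped.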
BTW: Does Boost.Text have functions or overloads where you can specify
that text is in a specific encoding/normalization?
If not, I think this should be added. Sometimes you get text from an
internal function and already know those things, so you can skip
verification and conversion.
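One possible shape for such an overload, sketched with a tag type. None of this is Boost.Text's API: the `sketch` namespace, `text`, `assume_valid_utf8` and `is_well_formed_utf8` are all made up for illustration, and the validator is deliberately simplistic. A normalization tag could follow the same pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

namespace sketch {  // hypothetical names, not an existing library interface

// Structural check only: each lead byte is followed by the right number of
// continuation bytes. A real validator would also reject overlong forms,
// surrogates and code points above U+10FFFF.
inline bool is_well_formed_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const auto b = static_cast<unsigned char>(s[i]);
        const std::size_t len = b < 0x80            ? 1
                              : (b & 0xE0) == 0xC0  ? 2
                              : (b & 0xF0) == 0xE0  ? 3
                              : (b & 0xF8) == 0xF0  ? 4
                                                    : 0;
        if (len == 0 || s.size() - i < len) return false;
        for (std::size_t k = 1; k < len; ++k)
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return false;
        i += len;
    }
    return true;
}

// Tag that lets the caller state "this is already valid UTF-8".
struct assume_valid_utf8_t {};
constexpr assume_valid_utf8_t assume_valid_utf8{};

class text {
public:
    // Default: verify the input, throw if it is not well-formed.
    explicit text(std::string s) : data_(std::move(s)) {
        if (!is_well_formed_utf8(data_))
            throw std::invalid_argument("input is not well-formed UTF-8");
    }
    // Opt-out: the caller vouches for the encoding, so no scan is done.
    text(assume_valid_utf8_t, std::string s) : data_(std::move(s)) {}

    const std::string& bytes() const { return data_; }

private:
    std::string data_;
};

}  // namespace sketch
```

Usage would look like `sketch::text t(sketch::assume_valid_utf8, get_internal_string());`, where `get_internal_string` stands for whatever internal function already guarantees the encoding. Making the unchecked path require an explicit tag keeps the safe constructor as the default.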