From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2020-06-15 19:05:24


Dear All,

I have been looking at the UTF-8 decoding code in the proposed
Boost.Text, as this is a problem I've looked at myself in the past.
I've mentioned an issue with the copyright in another message.
Here are my other observations.

1. The SIMD code is x86-specific. It doesn't need to be; I think
it could use GCC's vector extensions to do the same thing and be
portable to other SIMD instruction sets. (Clang provides the same
extensions; I'm not sure what you need to do on MSVC/Windows.)
See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

2. The SIMD code only seems to provide a fast path for bytes < 0x80,
falling back to sequential code for everything else. I guess I was
expecting something more sophisticated.

3. The code used for bytes >= 0x80, and in all cases for non-x86,
is here:
https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_iterator.hpp
around lines 400-560. It implements a state machine, which surprises
me; it takes much less code and gives better performance if you write
out the bit-testing and shifting etc. explicitly. The state machine
seems to be about 50% slower than my existing UTF-8 decoding code.

4. There aren't enough comments anywhere in the code I've looked at!
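
A rough sketch of what point 1 could look like, assuming GCC or Clang.
It only covers the "all bytes < 0x80" fast path from point 2. The name
ascii_prefix_length is made up for illustration and is not part of
Boost.Text, and I have not measured how this compares with the x86
intrinsics.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// GCC/Clang vector extension: two 64-bit lanes, 16 bytes in total.
// No x86 intrinsics are involved; the compiler picks the instructions.
typedef std::uint64_t u64x2 __attribute__((vector_size(16)));

// Hypothetical helper: length of the leading run of bytes < 0x80.
std::size_t ascii_prefix_length(const unsigned char* p, std::size_t n)
{
    const u64x2 high_bits = {0x8080808080808080ULL, 0x8080808080808080ULL};

    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        u64x2 chunk;
        std::memcpy(&chunk, p + i, sizeof chunk);  // alignment-safe load
        const u64x2 masked = chunk & high_bits;    // keep bit 7 of each byte
        if ((masked[0] | masked[1]) != 0)
            break;                                 // some byte is >= 0x80
    }

    // Scalar tail: finishes the last partial chunk and also finds the
    // exact position of the first non-ASCII byte after a break above.
    while (i < n && p[i] < 0x80)
        ++i;
    return i;
}
```

A caller would copy the first ascii_prefix_length(p, n) bytes straight
through and hand the rest to the sequential decoder.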

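To make point 3 concrete, here is a minimal sketch of decoding with
the bit-testing and shifting written out explicitly. It is neither
the Boost.Text code nor the decoder I benchmarked against; the name
decode_utf8 and its return convention are assumptions chosen for
illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: decodes one code point from [p, end).
// Returns the number of bytes consumed (1 to 4) and stores the code
// point in cp, or returns 0 (cp untouched) if the input is truncated
// or ill-formed.
std::size_t decode_utf8(const unsigned char* p, const unsigned char* end,
                        char32_t& cp)
{
    if (p == end) return 0;
    const std::size_t avail = static_cast<std::size_t>(end - p);
    const std::uint32_t b0 = p[0];

    // A continuation byte has the form 10xxxxxx.
    auto cont = [](unsigned char b) { return (b & 0xC0) == 0x80; };

    if (b0 < 0x80) {                       // 0xxxxxxx: ASCII
        cp = b0;
        return 1;
    }
    if ((b0 & 0xE0) == 0xC0) {             // 110xxxxx 10xxxxxx
        if (avail < 2 || !cont(p[1])) return 0;
        const std::uint32_t c = ((b0 & 0x1F) << 6) | (p[1] & 0x3Fu);
        if (c < 0x80) return 0;            // overlong encoding
        cp = c;
        return 2;
    }
    if ((b0 & 0xF0) == 0xE0) {             // 1110xxxx + 2 continuations
        if (avail < 3 || !cont(p[1]) || !cont(p[2])) return 0;
        const std::uint32_t c = ((b0 & 0x0F) << 12)
                              | ((p[1] & 0x3Fu) << 6)
                              |  (p[2] & 0x3Fu);
        if (c < 0x800) return 0;                   // overlong encoding
        if (c >= 0xD800 && c <= 0xDFFF) return 0;  // surrogate
        cp = c;
        return 3;
    }
    if ((b0 & 0xF8) == 0xF0) {             // 11110xxx + 3 continuations
        if (avail < 4 || !cont(p[1]) || !cont(p[2]) || !cont(p[3]))
            return 0;
        const std::uint32_t c = ((b0 & 0x07) << 18)
                              | ((p[1] & 0x3Fu) << 12)
                              | ((p[2] & 0x3Fu) << 6)
                              |  (p[3] & 0x3Fu);
        if (c < 0x10000 || c > 0x10FFFF) return 0; // overlong / too big
        cp = c;
        return 4;
    }
    return 0;                              // stray continuation or 0xF8+
}
```

For example, the three bytes E2 82 AC decode to U+20AC with a return
value of 3, while the overlong pair C0 80 is rejected with 0.
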
Regards, Phil.