$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2020-06-17 15:42:03
Zach Laine wrote:
> On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
> <boost_at_[hidden]> wrote:
>> 1. The SIMD code is x86-specific. It doesn't need to be; I think
>> it could use gcc's vector builtins to do the same thing and be
>> portable to other SIMD implementations. (Clang provides the same
>> builtins; I'm not sure about what you need to do on MSVC/Windows.)
>> See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
>
> That page describes vector-friendly data types and arithmetic
> operations. It does not seem to support the operations actually used
> in the code currently in Boost.Text.
I think it has most of what's needed, though it seems that the
type conversion __builtin_convertvector, which is needed to
expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present
in newer versions of g++ than I have.
>> 2. The SIMD code only seems to provide a fast path for bytes < 0x80,
>> falling back to sequential code for everything else. I guess I was
>> expecting something more sophisticated.
>
> The code makes the fast path extra fast, but the slow path, being
> quite branchy, is not really amenable to vectorization. If you have
> an implementation that proves that claim false, I'm happy to use it.
Well I've never written any SIMD code before, but I thought I'd
have a look. The following is just a proof of concept; it just
converts 16 bytes of UTF-8 to one-byte-per-character ISO-8859-1.
(Because, as noted above, I don't have a good way to spread out
the bytes.)
int utf8_to_latin1(uint8_t* out_p, const uint8_t* in_p)
{
// Attempt to decode the subset of UTF-8 with code points < 256.
// Format is either 0xxxxxxx -> 0xxxxxxx
// or 110---xx 10yyyyyy -> xxyyyyyy
// The input mustn't start or finish in the middle of a multi-byte
// character.
// Other inputs produce undefined outputs.
// Process 16 bytes at a time (for 128-bit SIMD).
using uint8_x16 = uint8_t __attribute__((vector_size(16)));
// Load input:
uint8_x16 in;
for (int i = 0; i < 16; ++i) in[i] = in_p[i];
// Each byte is one of three types:
uint8_x16 is_onebyte = (in & 0b10000000) == 0b00000000;
uint8_x16 is_first_of_two = (in & 0b11100000) == 0b11000000;
uint8_x16 is_second_of_two = (in & 0b11000000) == 0b10000000;
// (If all bytes are is_onebyte we could take a fast path now.)
// Get the bits that each byte contributes to the result:
uint8_x16 bits = is_onebyte ? in
: is_first_of_two ? (in << 6)
: is_second_of_two ? (in & 0b00111111)
: 0;
// Make a shifted copy:
// (Isn't there a better way to do this?)
uint8_x16 shifts = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0 };
uint8_x16 shifted = __builtin_shuffle(bits,shifts);
// Combine the two bytes, where required:
uint8_x16 combined = is_first_of_two ? (bits | shifted)
: bits;
// Form a shuffle to pack the result, skipping the second bytes;
// this is non-SIMD. Any tricks to make this quicker?
uint8_x16 shuffle;
int to = 0, from = 0;
while (from < 16) {
shuffle[to++] = from++;
if (is_second_of_two[from]) ++from;
}
// Apply the shuffle:
uint8_x16 packed = __builtin_shuffle(combined,shuffle);
// Store result.
for (int i = 0; i < 16; ++i) out_p[i] = packed[i];
return to; // Number of characters in result.
}
That seems to work, and does generate vector instructions on ARM64.
I've no idea if it faster than the alternatives.
>> 3. The code used for bytes >= 0x80, and in all cases for non-x86,
>> is here:
>> https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_iterator.hpp
>> around lines 400-560. It implements a state machine, which surprises
>> me; it takes much less code and gives better performance if you write
>> out the bit-testing and shifting etc. explicitly. This seems to be
>> about 50% slower than my existing UTF-8 decoding code.
>
> Could you point me to that code, and let me use your benchmarks to
> verify? I'm happy to do something faster!
Essentially you could just replace the body of the advance() function at
line 536 of transcode_iterator.hpp with something like this:
char8_t b0 = *(p++);
IF_LIKELY((b0&0x80)==0) return b0;
char8_t b1 = *(p++);
IF_LIKELY((b0&0xe0)==0xc0) return (b1&0x3f) | ((b0&0x1f)<<6);
char8_t b2 = *(p++);
IF_LIKELY((b0&0xf0)==0xe0) return (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
char8_t b3 = *(p++);
IF_LIKELY((b0&0xf8)==0xf0) return (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);
That will be quick, but it does lack a few things; it doesn't check if
it has reached the end of the input and it doesn't do any error checking.
Regards, Phil.