Boost mailing page: Re: [text] SIMD UTF-8 decoding

Date view	Thread view	Subject view	Author view

From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2020-06-17 15:42:03

Next message: Phil Endecott: "Re: [text] Copyright notices"
Previous message: pbristow_at_[hidden]: "Boost.Random"
In reply to: Zach Laine: "Re: [text] SIMD UTF-8 decoding"
Next in thread: Alexander Grund: "Re: [text] SIMD UTF-8 decoding"
Reply: Alexander Grund: "Re: [text] SIMD UTF-8 decoding"

Zach Laine wrote:
> On Mon, Jun 15, 2020 at 2:06 PM Phil Endecott via Boost
> <boost_at_[hidden]> wrote:
>> 1. The SIMD code is x86-specific. It doesn't need to be; I think
>> it could use gcc's vector builtins to do the same thing and be
>> portable to other SIMD implementations. (Clang provides the same
>> builtins; I'm not sure about what you need to do on MSVC/Windows.)
>> See: https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html
>
> That page describes vector-friendly data types and arithmetic
> operations. It does not seem to support the operations actually used
> in the code currently in Boost.Text.

I think it has most of what's needed, though it seems that the
type conversion __builtin_convertvector, which is needed to
expand e.g. a UTF-8 byte to UTF-32 with zero bytes, is only present
in newer versions of g++ than I have.

>> 2. The SIMD code only seems to provide a fast path for bytes < 0x80,
>> falling back to sequential code for everything else. I guess I was
>> expecting something more sophisticated.
>
> The code makes the fast path extra fast, but the slow path, being
> quite branchy, is not really amenable to vectorization. If you have
> an implementation that proves that claim false, I'm happy to use it.

Well I've never written any SIMD code before, but I thought I'd
have a look. The following is just a proof of concept; it just
converts 16 bytes of UTF-8 to one-byte-per-character ISO-8859-1.
(Because, as noted above, I don't have a good way to spread out
the bytes.)

int utf8_to_latin1(uint8_t* out_p, const uint8_t* in_p)
{
  // Attempt to decode the subset of UTF-8 with code points < 256.
  // Format is either 0xxxxxxx -> 0xxxxxxx
  // or 110---xx 10yyyyyy -> xxyyyyyy
  // The input mustn't start or finish in the middle of a multi-byte
  // character.
  // Other inputs produce undefined outputs.

// Process 16 bytes at a time (for 128-bit SIMD).
using uint8_x16 = uint8_t __attribute__((vector_size(16)));

  // Load input:
  uint8_x16 in;
  for (int i = 0; i < 16; ++i) in[i] = in_p[i];

  // Each byte is one of three types:
  uint8_x16 is_onebyte = (in & 0b10000000) == 0b00000000;
  uint8_x16 is_first_of_two = (in & 0b11100000) == 0b11000000;
  uint8_x16 is_second_of_two = (in & 0b11000000) == 0b10000000;

// (If all bytes are is_onebyte we could take a fast path now.)

  // Get the bits that each byte contributes to the result:
  uint8_x16 bits = is_onebyte ? in
                 : is_first_of_two ? (in << 6)
                 : is_second_of_two ? (in & 0b00111111)
                 : 0;

  // Make a shifted copy:
  // (Isn't there a better way to do this?)
  uint8_x16 shifts = { 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,0 };
  uint8_x16 shifted = __builtin_shuffle(bits,shifts);

  // Combine the two bytes, where required:
  uint8_x16 combined = is_first_of_two ? (bits | shifted)
                     : bits;

  // Form a shuffle to pack the result, skipping the second bytes;
  // this is non-SIMD. Any tricks to make this quicker?
  uint8_x16 shuffle;
  int to = 0, from = 0;
  while (from < 16) {
    shuffle[to++] = from++;
    if (is_second_of_two[from]) ++from;
  }

// Apply the shuffle:
uint8_x16 packed = __builtin_shuffle(combined,shuffle);

  // Store result.
  for (int i = 0; i < 16; ++i) out_p[i] = packed[i];
  return to; // Number of characters in result.
}

That seems to work, and does generate vector instructions on ARM64.
I've no idea if it faster than the alternatives.

>> 3. The code used for bytes >= 0x80, and in all cases for non-x86,
>> is here:
>> https://github.com/tzlaine/text/blob/master/include/boost/text/transcode_iterator.hpp
>> around lines 400-560. It implements a state machine, which surprises
>> me; it takes much less code and gives better performance if you write
>> out the bit-testing and shifting etc. explicitly. This seems to be
>> about 50% slower than my existing UTF-8 decoding code.
>
> Could you point me to that code, and let me use your benchmarks to
> verify? I'm happy to do something faster!

Essentially you could just replace the body of the advance() function at
line 536 of transcode_iterator.hpp with something like this:

char8_t b0 = *(p++);
IF_LIKELY((b0&0x80)==0) return b0;
char8_t b1 = *(p++);
IF_LIKELY((b0&0xe0)==0xc0) return (b1&0x3f) | ((b0&0x1f)<<6);
char8_t b2 = *(p++);
IF_LIKELY((b0&0xf0)==0xe0) return (b2&0x3f) | ((b1&0x3f)<<6) | ((b0&0x0f)<<12);
char8_t b3 = *(p++);
IF_LIKELY((b0&0xf8)==0xf0) return (b3&0x3f) | ((b2&0x3f)<<6) | ((b1&0x3f)<<12) | ((b0&0x07)<<18);

That will be quick, but it does lack a few things; it doesn't check if
it has reached the end of the input and it doesn't do any error checking.

Regards, Phil.

Next message: Phil Endecott: "Re: [text] Copyright notices"
Previous message: pbristow_at_[hidden]: "Boost.Random"
In reply to: Zach Laine: "Re: [text] SIMD UTF-8 decoding"
Next in thread: Alexander Grund: "Re: [text] SIMD UTF-8 decoding"
Reply: Alexander Grund: "Re: [text] SIMD UTF-8 decoding"

Date view	Thread view	Subject view	Author view