Subject: Re: [boost] [Tokenizer]Usage and documentation
From: Yechezkel Mett (ymett.on.boost_at_[hidden])
Date: 2011-02-13 06:44:17


On Thu, Feb 10, 2011 at 3:46 PM, Max <more4less_at_[hidden]> wrote:
> I have 3 versions of the REs sitting side by side, and I'm trying to figure
> out the difference between them.
>
>> "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$ // (1)
>> "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,) // (2)
>> "([^"]*)"|([^\s,"]+) // (3) original version offered by Stephen
>
> But, unfortunately, I still cannot fully grasp the meaning of (1) and (2).

,\s*(),

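means: match a ',' followed by any amount of whitespace (possibly none)
followed by another ',', and capture an empty string.

The others are similar: ^\s*(), handles an empty field at the start of
the input, and ,\s*()$ an empty field at the end.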
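
To make that concrete, here is a minimal sketch using Python's re module as a stand-in for Boost.Regex. That substitution is my assumption; as far as I know the constructs used here behave the same in both. The sample string 'a, ,b' is made up for illustration.

```python
import re

# The alternative under discussion:
# ',' + optional whitespace + empty capture group + ','
empty_field = re.compile(r',\s*(),')

m = empty_field.search('a, ,b')   # 'a, ,b' is a made-up sample input
matched_text = m.group(0)         # the whole match, both commas included
captured = m.group(1)             # the () group always captures ''
span = m.span()                   # start and end index of the whole match

print(repr(matched_text), repr(captured), span)
```

The whole match spans both commas, while the capture group itself contributes only the empty token.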

>
> r: "([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$
>
> empty,,,fields, , , like this
> [empty][][fields][][like][this]
> ,,,
> [][]
>
> Each run of 3 contiguous ',' contains 2 empty tokens, but only one is
> detected for each run.

Yes, that's a mistake. When matching ,, as an empty field the second
',' is eaten and can no longer be used as the beginning of the next
field.

"([^"]*)"|([^\s,"]+)|,\s*()(?=,)|^\s*()(?=,)|,\s*()$

should work. (?=...) is a lookahead: it checks that the pattern (',' in
this case) matches at this point, but doesn't eat any input.
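
A rough way to compare the two patterns side by side, again using Python's re module in place of Boost.Regex (an assumption; the syntax involved is common to both). The helper tokens() is a hypothetical name introduced only for this sketch; it collects whichever capture group took part in each match.

```python
import re

def tokens(pattern, text):
    """Collect the capture group that participated in each successive match."""
    out = []
    for m in re.finditer(pattern, text):
        # Each alternative has exactly one group; the others are None.
        out.append(next(g for g in m.groups() if g is not None))
    return out

# Version that consumes the second ',' of an empty field.
consuming = r'"([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$'
# Version that only looks ahead at the second ','.
lookahead = r'"([^"]*)"|([^\s,"]+)|,\s*()(?=,)|^\s*()(?=,)|,\s*()$'

line = 'empty,,,fields, , , like this'
with_consuming = tokens(consuming, line)
with_lookahead = tokens(lookahead, line)

print(with_consuming)  # ['empty', '', 'fields', '', 'like', 'this']
print(with_lookahead)  # ['empty', '', '', 'fields', '', '', 'like', 'this']
```

With the lookahead, the second ',' of each empty field is still available to start the next match, so both empty fields in each run of three commas are reported.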

>
> Likewise, for (2), I get:
>
> r: "([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)
>
> empty,,,fields, , , like this
> [empty][fields][like][this]
>
> This time, the behavior is no different than the 'original' version.

With (2) I get the same results as with the first version. Perhaps the
pattern wasn't escaped properly?
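
For what it's worth, here is a small sketch to reproduce the comparison, with Python's re module standing in for Boost.Regex (an assumption on my part) and tokens() as a hypothetical helper. The patterns are written as raw strings so that no extra escaping is needed; in a C++ string literal the backslash in \s has to be doubled.

```python
import re

def tokens(pattern, text):
    """Collect the capture group that participated in each successive match."""
    return [next(g for g in m.groups() if g is not None)
            for m in re.finditer(pattern, text)]

# Raw strings (r'...') pass '\s' through to the regex engine unchanged.
version_1 = r'"([^"]*)"|([^\s,"]+)|,\s*(),|^\s*(),|,\s*()$'
version_2 = r'"([^"]*)"|([^\s,"]+)|(?:^|,)\s*()(?:$|,)'
version_3 = r'"([^"]*)"|([^\s,"]+)'

line = 'empty,,,fields, , , like this'
result_1 = tokens(version_1, line)
result_2 = tokens(version_2, line)
result_3 = tokens(version_3, line)

print(result_1)  # ['empty', '', 'fields', '', 'like', 'this']
print(result_2)  # ['empty', '', 'fields', '', 'like', 'this']
print(result_3)  # ['empty', 'fields', 'like', 'this']
```

In this setup (1) and (2) agree, and only (3) drops the empty tokens.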

Yechezkel Mett