$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
Subject: Re: [boost] [algorithm] Review Request: edit_distance
From: Erik Erlandson (eje_at_[hidden])
Date: 2014-02-03 17:36:50
----- Original Message -----
> >
> > I've been mulling this over, and I have an idea that is starting
> > to grow on me.
> >
> > What if we left insertion() and deletion() (and substitution) as-is:
> > they continue to receive individual sequence elements, and operation
> > scores.  But we *do* modify the equality() method to take two ranges,
> > in the manner you suggested above.   It would introduce an asymmetry
> > of equality() with respect to the other methods, but I think there
> > are some real arguments for it:
> 
> [snip]
> 
> > *) Runs of 'equal' dominate the script in most larger-scale applications
> > of interest.  Actual insertions, deletions, etc, are relatively small.
> 
> I can't agree with that; for me, a common case is where the two inputs
> are entirely different.
Something of a tangent, but I'm curious how that works for you.   When two strings are entirely different, the algorithm cost is O(NxM).  On larger data, say a million elements, that's terabytes of working data, and tera-ops.
> 
> Here's another possibility: don't pass ranges, but do pass iterators:
> 
> struct output {
> 
>     ITER pending;
>     enum ... last_action;
>     void insertion(ITER i, int) {
>       if (last_action != insert) {
>         do_something_with(pending,i);
>         pending = i;
>         last_action = insert;
>       }
>     }
>     // etc.
> 
> };
> 
> 
> The point here is that I just need to store the iterator for the start
> of the range until I get a different action; in contrast, if the value
> is passed I need to accumulate a copy of the range.
That seems promising.   It makes it straightforward to handle the script data either as it appears, or save up an iterator range.   I could use a similar variation on the 'diff' example, and not bother to copy the file data, just refer back to the original for output.