Subject: Re: [boost] [gsoc-2013]Approximate string matching
From: Erik Erlandson (eerlands_at_[hidden])
Date: 2013-04-23 11:16:32


----- Original Message -----

From: "Yu Jiang" <sunlightt07_at_[hidden]>
To: boost_at_[hidden]
Sent: Tuesday, April 23, 2013 1:26:35 AM
Subject: [boost] [gsoc-2013]Approximate string matching

I have seen in a previous discussion, another student proposed to implement
some basic matching functions such as "edit distance" for boost. He said by
using dynamic programming, the time complexity is O(m*n). However, if we
give a threshold T and only report the exact edit distance if it's no
larger than T, the complexity can be O(T * min(m,n)). I think sometimes we
just need to decide whether the edit distance is small enough.

....

So, is it compatible for gsoc 2013 and for boost? Looking forward to your
opinions.

I can't speak to gsoc 2013, but as a user of edit distance algorithms in previous lives, I know I'd be interested in a boost edit-distance library.

How one would design the interface for edit_distance<> is an interesting question. In the basic general case, edit_distance<> is a function of two sequences, and a cost matrix for insertion/deletion/substitution/equality. In boost/template world, the "equality" test would of course be a template parameter, with the expected default. These days, I'd expect to be able to provide sequences as either begin/end iterators, or ranges. The cost matrix could be either provided as numeric values or functionals, with defaults to: 1/1/1/0.

I would want to support a variation that limited the search to a diagonal band, either as a separate function or via some parameter.

I imagine that it would be nice to provide specialization for strings, and perhaps vectors, since those are common sequence inputs and it would be convenient to supply them directly.