$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
Subject: Re: [boost] [GSoC] Some new ideas for this year's GSoC
From: Stefan (mstefanro_at_[hidden])
Date: 2010-04-08 19:05:12
Phil Endecott wrote:
> Stefan wrote:
>> * Optimize Boost String Algorithm finders to use an efficient 
>> substring matching algorithm as opposed to the naive algorithm. Most 
>> likely because of the fact that boost has implemented the finders for 
>> strings as generic algorithms, they haven't used efficient algorithms 
>> such as KMP, Boyer-Moore, Rabin-Karp etc.
>>   \see boost/algorithm/string/detail/finder.hpp
>>   The idea is to implement efficient algorithms such as the ones 
>> mentioned above to replace the current naive implementation. Note: it 
>> may be difficult to find nth matching substring position using some of 
>> these algorithms.
>>   Difficulty: easy-medium
>>
>> * Implement a string algorithm for generating the suffix array. The 
>> suffix array, after being precomputed on a certain string, allows for 
>> very efficient substring searches in the future. This is useful when 
>> the string doesn't change too often but there are plenty of substring 
>> to search against.
> 
> Hi Stefan,
> 
> Have you seen this thread, from a couple of years ago? :
> 
>     http://thread.gmane.org/gmane.comp.lib.boost.devel/171098
> 
> I don't know what has changed in string_algo since then, but I think 
> there is probably plenty of opportunity for easy optimisation.
> 
> Do you have a motivating application for this?  A lot of this stuff is 
> rather data-dependant, so having a particular application that can 
> provide a benchmark is a good idea.  I guess that a suffix array library 
> would be reasonable for a GSoC project, if you're good!  You might also 
> like to consider variants that index at e.g. word boundaries; I have 
> used something like that and it works well for some kinds of search.
> 
> 
> Regards,  Phil.
> 
> 
> 
> _______________________________________________
> Unsubscribe & other changes: 
> http://listarchives.boost.org/mailman/listinfo.cgi/boost
> 
I've decided to implement Rabin-Karp today and after benchmarking I 
realized that on my pseudo-randomly generated data it performs worse 
than boost. My Rabin-Karp implementation will sure need some profiling 
before it can be used in benchmarking. Also some better data sets than 
what I came up with.
Here it is:
/** Rabin-karp
  *
  * \author mstefanro_at_[hidden]
  */
#include <boost/cstdint.hpp>
#include <vector>
#include <utility>
#include <algorithm>
#include <boost/utility.hpp>
#include <boost/tuple/tuple.hpp>
#include <cassert>
//Note: I've wasted a few hours debugging because I initially haven't 
used the uintmax_t for logpow().
//! Binary exponentiation in O(log exp). This should also be added to 
boost unless it already exists and I'm not aware of it
//! Very quickly computes (base^exp)%mod
template <class Uint>
Uint logpow (Uint base, Uint exp, Uint mod)
{
        boost::uintmax_t ret=1, exponential_base = base;
        for (; exp; exp >>= 1)
        {
                if (exp&1) ret = (ret*exponential_base)%mod;
                exponential_base = (exponential_base*exponential_base)%mod;
        }
        return static_cast<Uint>(ret);
}
//! @TODO: add support for Container=char[],char*,wchar_t[],wchar_t*
//! @TODO: accept generic matches container rather than just std::vector 
in find()
//! @TODO: choose smaller primes (so that 2*prime fits in uint32)
template <
        class Container, bool Heuristic = false, class 
Comparator=std::equal_to<Container::value_type>,
        class HashType = boost::uint32_t,
        HashType FirstBase=257u, HashType SecondBase=277u,
        HashType FirstModulo=7624679u, HashType SecondModulo=7116647u
 >
class rabin_karp
{
public:
        rabin_karp (Container const &substring)
                : substring_(substring), first_hash(0u), second_hash(0u), comparator()
        {
                if (!substring.empty())
                {
                        boost::tie(first_hash, second_hash) =
                                compute_hashes(substring.begin(), substring.end());
                        first_additive_inverse  =
                                FirstModulo  - ::logpow(FirstBase , 
static_cast<HashType>(substring.size()-1), FirstModulo );
                        second_additive_inverse =
                                SecondModulo - ::logpow(SecondBase, 
static_cast<HashType>(substring.size()-1), SecondModulo);
                }
        }
        template <class Container2>
        bool find(
                Container2 const &string,
                std::vector<typename Container2::const_iterator> &matches,
                unsigned int stopAt=0)
        {
                HashType current_first_hash, current_second_hash;
                bool found = false;
                
                matches.clear();
                if (string.size() < substring_.size() || substring_.empty()) return false;
                
                boost::tie(current_first_hash, current_second_hash) = 
compute_hashes(string.begin(), string.begin()+substring_.size());
                Container2::const_iterator iter = string.begin()+substring_.length()-1;
                do {
                        ++iter;
                        if (first_hash  == current_first_hash  &&
                                second_hash == current_second_hash &&
                                test_for_match<Heuristic>(iter-substring_.length(), iter)
                                )
                        {
                                // Got a match
                                matches.push_back(iter-substring_.length());
                                found = true;
                                if (stopAt) { --stopAt; if (!stopAt) break; }
                        }
                        if (iter == string.end()) break;
                        else
                        {
                                // we've got a new character, update the hashes
                                // remove *(iter-substring.length())
                                // add *iter
                                HashType remove = *(iter-substring_.length()), add = *iter;
                                current_first_hash = ( (current_first_hash + 
(remove*first_additive_inverse)%FirstModulo)*FirstBase ) % FirstModulo;
                                current_first_hash = (current_first_hash + add)%FirstModulo;
                                
                                current_second_hash = ( (current_second_hash + 
(remove*second_additive_inverse)%SecondModulo)*SecondBase ) % SecondModulo;
                                current_second_hash = (current_second_hash + add)%SecondModulo;
                        }
                } while (true);
                return found;
        }
private:
        Container substring_;
        Comparator comparator;
        HashType first_hash, second_hash;
        HashType first_additive_inverse, second_additive_inverse;
        template <class Iterator>
        std::pair<HashType, HashType> compute_hashes (Iterator begin, Iterator end)
        {
                HashType first=0u, second=0u;
                for (Iterator iter = begin; iter != end; ++iter)
                {
                        first  = (first  * FirstBase  + *iter) % FirstModulo ;
                        second = (second * SecondBase + *iter) % SecondModulo;
                }
                return std::make_pair(first, second);
        }
        //! Check if [begin,end) == [ substring_.begin(),substring_.end() )
        template <bool Heuristic_, class Iterator>
        typename boost::enable_if_c<Heuristic_ == false,bool>::type 
test_for_match(Iterator begin, Iterator end)
        {
                Iterator iter2 = begin;
                assert(end-begin == substring_.length());
                for (Container::const_iterator iter=substring_.begin();
                                iter != substring_.end(); ++iter, ++iter2)
                {
                        if (!comparator(*iter, *iter2)) return false;
                }
                return true;
        }
        //! Heuristic specialization, omit testing character by character.
        //! One should hope for no collisions on both hash functions at once 
(highly unlikely)
        //! If it's seen happening, add a third hash function?
        template <bool Heuristic_, class Iterator>
        typename boost::enable_if_c<Heuristic_ == true,bool>::type 
test_for_match(Iterator begin, Iterator end)
        {
                return true;
        }
};
There's both an heuristic and a deterministic algorithm, but the 
heuristic one has never failed me so far (tested on around 20 mil inputs 
the heuristic implementation has always yielded the correct results).
Typical usage is:
  std::vector<std::string::const_iterator> matches;
  rabin_karp<std::string> rk("my substring");
  if (rk.find(std::string("my string which contains my substring"), 
matches)) { /* ... */ }
Or for the heuristic version:
  ...
  rabin_karp<std::string, true> rk("my substring");
  ...
I doubt rabin-karp normally performs so badly, therefore my 
implementation definitely needs improvement.