Subject: Re: [boost] [GSoC] Request for Feedback on Boost.Bloom Filter Project
From: Phil Endecott (spam_from_boost_dev_at_[hidden])
Date: 2011-06-25 13:22:04
Alejandro Cabrera wrote:
>> Sometimes it would be useful to access the underlying data.  For
>> example, say I have a spell check dictionary or a URL block-list; I
>> might want to compile that list into a bloom filter and ship that as a
>> binary file.  For that I'd want to be able to use it as an adaptor on
>> top of an underlying container that I supply:
>>
>> // compile:
>> std::vector<uint32_t>  data(100000);
>> bloom_filter b(data);  // acts as a container adaptor
>> b.insert(...);
>> file f;
>> f.write(data.begin(),data.end());
>>
>> // use:
>> memory_mapped_file f("dictionary.bloom");
>> bloom_filter b(f);
>> if (b.contains(url)) ....
> I'm not sure how this would work. Could you elaborate further in this 
> regard?
>
> Currently, as I understand it, there are two components of interest. The 
> first is to be able to access the underlying Bloom filter data for 
> serialization/deserialization purposes. This should be easy enough to do 
> given a function that exposes the Bloom filter data. The other component 
> that I'm picking up on is to be able to construct a Bloom filter using 
> various sources. Range construction will be added that semantically 
> behaves as an insertion for each element in the range. However, I'm not 
> sure how to support construction from a memory_mapped_file.
OK, let me expand on the example I started above.  Say I have a long 
list of "banned" URLs and an interactive program that must check URLs 
entered by the user against that list.  I pre-compile the bloom filter 
and save its raw binary content to a file.  Then in the application, I 
map that data into memory.  The point about memory-mapping is that the 
data is not actually read from the file until it is used, so in this 
case the bloom filter's ctor can be very fast; when the user types in a 
URL, only those pages of the file that contain the bits corresponding 
to the URL's hashes are read.  This is both faster and more 
memory-efficient than loading all of the data.
In order for this to work, it's only necessary that I can make a bloom 
filter that is an adaptor over a range; the range that I provide will 
be a pair of pointers that refer to the memory region where the file is 
mapped.  I.e. something like this (pseudo-code):
int fd = open("banned_urls.dat", O_RDONLY);
const char* begin = static_cast<const char*>(
    mmap(0, size, PROT_READ, MAP_PRIVATE, fd, 0));
const char* end = begin + size;
bloom_filter_adaptor<const char*> banned_urls(begin, end);
Currently you're using std::bitset for the underlying data.  Since that 
doesn't (AFAIK) have a way to adapt an underlying implementation data 
type, I think you'll need to replace it with a wrapper that does do 
that.  Then you can have something like this:
template <typename ITER_T>
class bloom_filter_adaptor {
   typedef bitset_adaptor<ITER_T> impl_t;
   impl_t impl;
public:
   bloom_filter_adaptor(ITER_T begin, ITER_T end):
     impl(begin,end)
   {}
   ...
};
(An alternative to taking an iterator-pair is to take a container and 
store a reference to it.)
Then the non-adaptor version, for when that flexibility is not needed:
template <size_t SIZE>
struct bloom_filter_base {
   typedef std::vector<uint32_t> data_t;
   data_t data;
   bloom_filter_base():
     data(SIZE, 0)
   {}
};
template <size_t SIZE>
class bloom_filter:
   bloom_filter_base<SIZE>,
   bloom_filter_adaptor<typename bloom_filter_base<SIZE>::data_t::iterator>
{
   typedef bloom_filter_base<SIZE> base_t;
   typedef bloom_filter_adaptor<typename base_t::data_t::iterator> adaptor_t;
public:
   bloom_filter():
     adaptor_t(this->data.begin(), this->data.end())
   {}
};
...or something like that.
Regards,  Phil.