$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
From: Joaquin M López Muñoz (joaquinlopezmunoz_at_[hidden])
Date: 2025-05-29 17:37:43
El 29/05/2025 a las 18:42, Kostas Savvidis via Boost escribió:
>> On 29 May 2025, at 19:25, Vinnie Falco via Boost <boost_at_[hidden]> wrote:
>>
>> Thank you for your work in the role of review manager, and also I think
>> this sets a new bar for what we might like to see in terms of review
>> summaries.
> Thanks indeed go to Arnaud for managing the review. I am somewhat puzzled that no technical points whatsoever became part of the revew managers report.
> The main technical innovation of the design of this library was the idea to do it essentially without hashing. This idea is solid, given
> that a library like bloom is going to be dealing with millions or at most few billions of items.
> It is sufficient to use DETERMINISTIC random numbers, ie a good enough RNG.
> The library is choosing this RNG parametrically which is unworkable and specific details are such that the choice may turn out to be really bad
> according to well established theory.
> The suggestion in my review was to use a Knuth 64bit RNG, but other choices are possible.
> I do not imagine that there are people here who have not at least heard of the disaster awaiting the naive use of bad RNGs.
Hi Kostas,
Regardless of whether this point you raised is included in Arnaud's
report or
not, it's already in my backlog and I will address it properly. The reason
I'm not adopting immediately your Knuth mixer approach is because I'd
like to
understand (and hopefully reproduce) the conditions, if any, under which
a poor MCG may ruin the efficiency of the filter. They key point here
(in my mind
at least), is that, given two sequences of hash values hi and gi, we're
not interested
in internal correlation of either hi or gi, but cross correlation
between the two
sequences. Internally, I think it suffices to guarantee that hi (and gi)
values
won't repeat --but of course I may be wrong and this is what I would like to
study carefully.
I know this is asking too much, but given that you're an expert on random
numbers and such, it would be extremely helpful if you can assist in looking
for pathological cases, or lack thereof. Either way, I'll keep you posted on
my developments.
> On a separate note, maybe we, the authors of the Boost.Random library should have done a better job of providing
> RNGs which people might use in such a project instead of rolling their own.
The only reason for not adopting Boost.Random is speed: Boost.Bloom has
been implemented with the aim of being as fast as possible, and any
non-trivial mixer will have a measureable impact on performance.
Joaquin M Lopez Munoz