From: Kostas Savvidis (kotika98_at_[hidden])
Date: 2025-06-01 19:52:54


> On 30 May 2025, at 19:54, Joaquin M López Muñoz via Boost <boost_at_[hidden]> wrote:
>
> El 30/05/2025 a las 18:27, Peter Dimov via Boost escribió:
>> Kostas Savvidis wrote:
>>>> On 29 May 2025, at 20:37, Joaquin M López Muñoz via Boost
>>> <boost_at_[hidden]> wrote:
>>>> given two sequences of hash values hi and gi, we're not interested in
>>>> internal correlation of either hi or gi, but cross correlation between
>>>> the two sequences.
>>> I am actually not sure which is more important for a Bloom filter,
>>> autocorrelation or cross-correlation.
>>>
>>> What is known is that the cross-correlation between two sequences in any
>>> MCG or LCG ( x' = a * x mod k) is even worse than the autocorrelation.
>>> It goes like this: if in the first bucket h1 = 2*g1, then hi=2*gi for all i or
>>> buckets. This is independent of a and k. Same if you replace the "2" with "3"
>>> etc.
>>> Effectively there is 100% correletion between buckets. One cannot fix that
>>> even with a better multiplier.
>> It's not clear to me why this would be a problem, but if it is, it can be fixed
>> by using an LCG (a*x+b) instead of an MCG (a*x).
>
>
> Umm, I like the LCG idea as an additional sum is basically free. Kostas, would a
> smart choice of b improve the statistical properties of thhe procedure? Note that
> we can afford determining b as a function of a (here a=m where m is the capacity
> of the array).

The b parameter does not fundamentally do anything for an MCG/LCG, quality remains the same,
it just makes reasoning about this issue more difficult.
 std::random and boost do not have any additive constants in any RNG for this reason, no improvement whatsoever.

And why do we need good randomness at all?
The high correlation between the buckets may mean that if two items collide in the first bucket, then they will collide in all subsequent buckets.
Or the opposite, equally bad (!), that if they dont collide in the first bucket, then they wont collide at all.
In this library we have not seen this leading to outright failure (yet),
probably because in the typical use case only SOME (high) bits of the hash are used to get position in the bucket.
So its complicated:
someone could and should do detailed theoretical analysis of "Bloom with MCG instead of hashing", but...

... let's get back to earth. Definitely, if you want to use an MCG, you do need a decent multiplier and a~=m does not assure that.
You can afford a good multiplier at the cost of one extra machine cycle.
Modern CPU makes both addition and multiplication in one cycle and it is anyway masked by memory latency.

Regards,
Kostas