From: Hervé Brönnimann (hervebronnimann_at_[hidden])
Date: 2007-02-01 20:34:10
Martin:
On Feb 1, 2007, at 1:29 PM, Martin Knoblauch Revuelta wrote:
> IMHO, the best way to do it is using a tree that is:
>  - balanced
>  - sorted (for binary search)
>  - augmented with counters that are added bottom-up, allowing random
> access (rank tree or statistic order tree)
>
> I send attached a sample program that solves this problem using a Rank
> List (proposed for inclusion in Boost some months ago):
> http://boost.cvs.sourceforge.net/*checkout*/boost-sandbox/boost-sandbox/libs/rank_list/doc/index.html
If the purpose is calculating the median, forgive me for being the
academic professor, but that is overkill. You should maintain two
binary heaps of equal size (plus or minus one), the lower half as a
max-heap and the upper half as a min-heap. Upon adding an element,
you insert it into the appropriate heap, and if one heap grows larger
than the other by more than one, you rebalance by transferring the
min of the upper heap into the lower heap, or the max of the lower
heap into the upper. The runtime is still O(log N) per element and
the storage O(N), but all storage is now contiguous (two
std::vectors), all the algorithms are provided by std:: (you can even
use std::priority_queue out of the box), and because the
representation is so much more compact, the constants are much lower.
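For concreteness, here is a minimal, untested sketch of that two-heap
scheme using std::priority_queue out of the box (the class name and
interface are mine, purely for illustration):

    #include <functional>
    #include <queue>
    #include <vector>

    // Running median with two heaps: a max-heap for the lower half and a
    // min-heap for the upper half, rebalanced so their sizes differ by at
    // most one.
    class running_median
    {
        std::priority_queue<double> lower_;  // max-heap (lower half)
        std::priority_queue<double, std::vector<double>,
                            std::greater<double> > upper_;  // min-heap (upper half)
    public:
        void insert(double x)
        {
            if (lower_.empty() || x <= lower_.top()) lower_.push(x);
            else                                     upper_.push(x);

            // Rebalance by moving one element across if the sizes drift apart.
            if (lower_.size() > upper_.size() + 1) {
                upper_.push(lower_.top()); lower_.pop();
            } else if (upper_.size() > lower_.size() + 1) {
                lower_.push(upper_.top()); upper_.pop();
            }
        }

        double median() const  // assumes at least one sample
        {
            if (lower_.size() == upper_.size())
                return (lower_.top() + upper_.top()) / 2.0;
            return lower_.size() > upper_.size() ? lower_.top() : upper_.top();
        }
    };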
Note that there are also min-max heaps that support insertion,
extract-min, and extract-max in O(log N) time each, and I should
really provide one in my Boost.Minmax library. They too use
std::vector storage with an implicit binary tree. That makes it
possible to maintain exact quantiles in O(t+log N) time per
insertion. Maintaining t quantiles means t separate vectors, so the
overhead would be up to double the storage (assuming the doubling
rule for expanding vectors), and you'd have to compare that to an
r/b tree node. For elements such as integers, doubled storage still
compares favorably to binary tree nodes, which carry an overhead of
three pointers and a rank per integer.
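To illustrate the quantile case, here is a rough sketch that tracks a
single quantile with two ordinary heaps rather than a min-max heap
(names and interface are mine, just for illustration); the min-max
heaps above become necessary when a bucket sits between two tracked
quantiles and must be able to give up either its min or its max:

    #include <cmath>
    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <vector>

    // Exact p-quantile of a stream: the lower max-heap always holds
    // ceil(p * N) elements, so its top is the p-quantile.
    class running_quantile
    {
        double p_;
        std::priority_queue<double> lower_;  // max-heap
        std::priority_queue<double, std::vector<double>,
                            std::greater<double> > upper_;  // min-heap
    public:
        explicit running_quantile(double p) : p_(p) {}  // 0 < p <= 1

        void insert(double x)
        {
            if (lower_.empty() || x <= lower_.top()) lower_.push(x);
            else                                     upper_.push(x);

            std::size_t n = lower_.size() + upper_.size();
            std::size_t target =
                static_cast<std::size_t>(std::ceil(p_ * n));  // desired size of lower_
            while (lower_.size() > target) { upper_.push(lower_.top()); lower_.pop(); }
            while (lower_.size() < target) { lower_.push(upper_.top()); upper_.pop(); }
        }

        double quantile() const { return lower_.top(); }  // assumes at least one sample
    };

Running t of these independent trackers would cost O(t log N) per
insertion and duplicate the storage, which is exactly what the shared
bucket structure above avoids.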
> An accumulator storing a tree of this kind would be able to calculate
> not only the median, but any desired percentile in O(log N) time. The
> drawback is that:
>  - accumulating a sample would take O(log N) too
>  - O(N) extra storage
Agreed. If you don't want just the median, but want the rank
selection operation for any rank, a tree would also be my way to go.
However, in that case, I wouldn't use the accumulator library to
start with. Even for applying this to a tail of a distribution, I
humbly submit that a sorted vector would be better (assuming the tail
is small, i.e. of constant size) due to compactness of the code,
locality of reference, etc. Then there is the issue of distinct
numbers: you'd have to decide for your application whether N is much
smaller than the total number of multiplicities...
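For the tail case, something as simple as this is what I have in mind
(an untested sketch with a made-up interface): a small sorted
std::vector holding only the largest k samples:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // Bounded sorted buffer holding only the largest `capacity` samples seen
    // so far; fine for tail quantiles when the tail size is a small constant.
    class tail_buffer
    {
        std::size_t capacity_;
        std::vector<double> v_;  // kept sorted, ascending
    public:
        explicit tail_buffer(std::size_t capacity) : capacity_(capacity)
        { v_.reserve(capacity); }

        void insert(double x)
        {
            if (v_.size() == capacity_) {
                if (x <= v_.front()) return;  // below the tracked tail, ignore
                v_.erase(v_.begin());         // drop the current smallest
            }
            v_.insert(std::upper_bound(v_.begin(), v_.end(), x), x);
        }

        // k-th largest within the tracked tail (k == 0 is the maximum).
        double kth_largest(std::size_t k) const { return v_[v_.size() - 1 - k]; }
    };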
> Combining two accumulators of this kind (containing all unique values
> in a tree) would require O(N+M) time, where N and M are the numbers of
> unique values stored in the respective accumulators. Though, I guess
> that the median (or any percentile) might be calculated in
> O(square(log N)+square(log M)) without actually merging... I'm not
> sure...
You're correct. Given two sorted ranges (which your trees simulate,
with a log(N) or log(M) overhead per random access), binary search
lets you find the median of the union at the cost of an extra
multiplicative logarithmic factor.
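To make that concrete, here is a sketch of the selection by binary
search over plain sorted vectors (names are mine, just for
illustration); substituting the O(log)-time random access of the rank
trees for the vector indexing gives the extra multiplicative log
factor:

    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <limits>
    #include <vector>

    // k-th smallest element (1-based) of the union of two sorted vectors,
    // found by binary-searching how many of the first k elements come from a.
    double kth_of_union(const std::vector<double>& a,
                        const std::vector<double>& b,
                        std::size_t k)
    {
        if (a.size() > b.size()) return kth_of_union(b, a, k);
        assert(k >= 1 && k <= a.size() + b.size());

        const double NEG_INF = -std::numeric_limits<double>::infinity();
        const double POS_INF =  std::numeric_limits<double>::infinity();

        std::size_t lo = (k > b.size()) ? k - b.size() : 0;  // fewest taken from a
        std::size_t hi = std::min(a.size(), k);              // most taken from a
        for (;;) {
            std::size_t i = (lo + hi) / 2;  // elements taken from a
            std::size_t j = k - i;          // elements taken from b

            double a_left  = (i == 0)        ? NEG_INF : a[i - 1];
            double a_right = (i == a.size()) ? POS_INF : a[i];
            double b_left  = (j == 0)        ? NEG_INF : b[j - 1];
            double b_right = (j == b.size()) ? POS_INF : b[j];

            if      (a_left > b_right) hi = i - 1;  // took too many from a
            else if (b_left > a_right) lo = i + 1;  // took too few from a
            else return std::max(a_left, b_left);   // valid split found
        }
    }

For the lower median of the union, you would call
kth_of_union(a, b, (a.size() + b.size() + 1) / 2).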
> The extra storage is a very important drawback. If an algorithm like
> this one was ever included in the accumulators, it should be optional
> and disabled by default.
I agree with this. Once you provide it, people will use it. Better
make sure they darn well know what they're getting into...
> Though, storing the values in a Rank List
> would allow very precise calculations. Note that a sequential sum is
> prone to cumulative errors, since the first samples pass through O(N)
> operations. In a Rank List every value would pass through about O(log
> N) operations.
>
> Regards, Martin
On a side note, maintaining the elements in order allows recomputing
the sum (from smallest to largest in absolute value) with the same
precision as the underlying floating point, i.e. 2^-52 for IEEE
doubles. If precision is a real issue, that can be an incentive. On
the other hand, if N is small, I'd much rather maintain the sum in an
exact representation (using, for instance, the fact that IEEE doubles
can be added exactly using a trick due to Dekker, with seven IEEE
additions and the result maintained as two non-overlapping doubles).
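For reference, a rough sketch of those error-free additions (the
three-operation variant, which needs the larger operand first, is the
one usually credited to Dekker) together with a running sum kept as
two non-overlapping doubles; this is just an illustration, not code
from any Boost library, and it requires strict IEEE arithmetic with
no re-association by the compiler:

    #include <utility>

    // Knuth's branch-free two-sum: s = fl(a + b) and a + b == s + e exactly.
    std::pair<double, double> two_sum(double a, double b)
    {
        double s  = a + b;
        double bb = s - a;                        // part of b absorbed into s
        double e  = (a - (s - bb)) + (b - bb);    // the rounding error
        return std::make_pair(s, e);
    }

    // Dekker's version: same contract, three operations, but needs |a| >= |b|.
    std::pair<double, double> fast_two_sum(double a, double b)
    {
        double s = a + b;
        double e = b - (s - a);
        return std::make_pair(s, e);
    }

    // Running sum kept as two non-overlapping doubles.
    struct exact_sum
    {
        double hi, lo;
        exact_sum() : hi(0.0), lo(0.0) {}

        void add(double x)
        {
            std::pair<double, double> t = two_sum(hi, x);  // hi + x == t.first + t.second
            double e = t.second + lo;
            std::pair<double, double> r = fast_two_sum(t.first, e);  // renormalize
            hi = r.first;
            lo = r.second;
        }

        double value() const { return hi + lo; }
    };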
Cheers,
-- Hervé Brönnimann hervebronnimann_at_[hidden]