$include_dir="/home/hyper-archives/boost/include"; include("$include_dir/msg-header.inc") ?>
From: Calum Grant (calum_at_[hidden])
Date: 2005-10-23 10:14:06
Arkadiy Vertleyb wrote
> "Calum Grant" <calum_at_[hidden]> wrote
> 
> > > Calum, is it possible to see the latest code of your example?
> >
> > Here it is.
> 
> I see...
> 
> A couple of questions:
> 
> 1) You modify counter all the time -- does RML allow to do 
> this with the fields on which the table is sorted?
Yes - by using table.update_row() or table.update_where(), the indexes
are automatically updated.  That's the whole point of RML - it
uses/updates indexes implicitly.  If you decide to index a column, and
no code needs to change.
> 2) When you say :
> 
> select(m_locations,
>     m_locations.x>=loc.x-m_cutoff &&
>     m_locations.x<=loc.x+m_cutoff &&
>     m_locations.y>=loc.y-m_cutoff &&
>     m_locations.y<=loc.y+m_cutoff &&
>     m_locations.z>=loc.z-m_cutoff &&
>     m_locations.z<=loc.z+m_cutoff)
> 
> Do you use any kind of special index for this?
Yes - the TMP query analyser sees that m_locations.z is an indexed
column, and uses that index.  If on the other hand x or y were indexed,
that index would be used in preference to z.
If I had implemented user-defined queries, I could write
        select(m_locations,
                     m_locations.x>=loc.x-m_cutoff &&
                     m_locations.x<=loc.x+m_cutoff &&
                     m_locations.y>=loc.y-m_cutoff &&
                     m_locations.y<=loc.y+m_cutoff &&
                     m_locations.z>=loc.z-m_cutoff &&
                     m_locations.z<=loc.z+m_cutoff && 
                query(test_distance(loc)) )
where test_distance would be some user-defined functor to check the
distance more precisely.  Even in this scenario, the TMP query analyser
would use m_locations.z as an index.  It's not magic, it just uses the
first index is sees.
> In any case this seems to prove that RML does a good job when 
> searching on multiple indexes -- you do quite a bit of that.  
> Also, inserts/updates are probably very efficient.
Insertion carries the same overheads as Boost.MultiIndex.  Update is
possibly more efficient, since nodes can be relocated in a single index
- without needing to destroy/create a new indexed item.  This is one of
the inefficiencies of std::set or std::map - if you change the key you
need to allocate a new node.
> But, as I said before, the overall efficiency of this example 
> is based on the fact that this is a very smart, custom, 
> hand-coded solution (see how you maintain the counter).  No 
> library would be able to beat its performance using a 
> higher-level abstraction (If RTL had a count() operator in 
> addition to generic groupby(), we could be much smarter in 
> maintaining it during updates).
The RML solution is actually slower because of the incremental updates
to the "neighbours" column.  Had they been calculated in one go, then
the query would I suspect run faster.
I would really like to be able to write
  for_each(
    limit( 
      sort( (descending(col<2, 0>()), col<0, Location::name>() ),
        group_by( col<0, Location::id_column>(),
         select( 
          (locations, locations), 
          col<1, Location::x_column>() <= col<0, Location::x_column>() +
m_cutoff &&
          col<1, Location::x_column>() >= col<0, Location::x_column>() -
m_cutoff &&
          col<1, Location::y_column>() <= col<0, Location::y_column>() +
m_cutoff &&
          col<1, Location::y_column>() >= col<0, Location::y_column>() -
m_cutoff &&
          col<1, Location::z_column>() <= col<0, Location::z_column>() +
m_cutoff &&
          col<1, Location::z_column>() >= col<0, Location::z_column>() -
m_cutoff &&
          query(test_distance()) 
         ), 
        ),
       ), 50),
    display_result()
  );
In order to achieve this, I would need to implement sorting using
multiple columns, group_by(), and query().  And this query would I
suspect run as fast as the original.  (Though it would not support
dynamic update).
So the fact that I needed to split my query up is nothing to do with the
fundamental design of RML.  It was because some of the functions were
not implemented.  I will work towards getting these extra functions
implemented, but I have a long todo-list and it won't happen
immediately.
> But again, I don't believe this all proves RML is faster than 
> RTL, since the coding approaches were so different.  I don't 
> even know if it's possible to compare the libraries at this 
> point, since they have their strong points in different 
> areas.  RTL's strong point in the ability to build and run 
> SQL-like views.  This I tried to show in my example.  RML 
> seems to be very efficient at maintaining and using multiple 
> indexes, but, despite its more SQL-like interface, can 
> mimique only very simple SQL queries.
You're only saying that because I haven't got around to implementing
group_by and multi-column sorts.  These are less commonly used features
that were not a priority to implement.  But they would not be difficult.
When RML can implement the entire query in one statement, we can re-run
the tests if you'd like.
Indexed views can give performance improvements in real SQL databases -
but they can also reduce performance.  I believe the same phenomenon
might be occurring in your in-memory relational models.
I believe that you would gain a huge performance increase by using
Boost.MultiIndex as your underlying data structure (as opposed to
std::vector), and indexing the columns you require.  That is I believe
the fundamental architectural difference between RML and RTL.  By
storing the columns pre-indexed, you don't have any extra overhead.
Regards, Calum