C/C++ Users Journal March 2004
Although relational algebra is available to C++ programmers via relational database systems (RDBMS), the overhead of RDBMS prevents us from using relational algebra where it otherwise might have been attractive to do so. Moreover, the ODBC/SQL interface is such that C++ programmers cannot enforce compile-time type safety nor compile-time checks. Consequently, you only find out at runtime if (for example) a wrong column or table name is passed. Finally, the typical RDBMS only supports a limited number of types.
With this in mind, we developed RTL, a library that implements relational algebra while retaining the spirit of C++'s type safety and compile-time checks. (The complete source code for RTL is available at http://www.cuj.com/code/.) Our "relational table" looks similar to an STL container in that it provides an iterator to go through its items (tuples). A tuple can hold fields of any type, which satisfies the same requirements as for an STL container, and lets you set/access these fields in a type-safe manner. This is possible with the help of template meta-programming.
How do you represent a relational table in C++? The closest approximation would be a vector of structures:
struct employee{
std::string name;
int salary;
};
std::vector<employee> employees;
However, the problem with this representation is its inability to create new tuple types based on given ones. For example, projection might require returning only salaries. That means you would need to define a structure containing only salaries, create a vector of such structures, and copy salaries from one vector to another. This looks impossible without the client code doing the entire job.
To design a relational table that would not have this problem, consider this class template:
template<class T> struct column{
typedef T type;
T value_;
};
An instantiation of this template with any given type (either built-in or user defined) is a class whose instances are capable of storing values of this type. The client code can now define new columns like this:
struct name : column<std::string>{};
struct salary : column<int>
{};
The next thing to do is to combine columns in a tuple. You can use multiple inheritance for this:
struct employee : public name, public salary{};
Once the tuple is defined like this, you can use static_cast to set/access individual fields:
employee emp;static_cast<name>(emp).value_ = "Bill";
Note how a class name (name, salary) plays the role of a column name. In other words, to define a new column, you have to define a new class, then use this class to reference the column.
So far so good, but you still have no way to manipulate the tuple type. What you are trying to do is gain not only the ability to access fields by name (which is provided by a C++ structure), but also the ability to treat fields in a regular way, like array items access them by index, enumerate, concatenate tuples, and so on.
Template meta-programming helps. You can start by combining columns in a typelist:
typedeflist<name,
list<salary> > field_list;
then derive the tuple from every column in the typelist. We wrote a template meta-function, derive_from_all, which is a simplified version of GenScatterHierarchy, described by Andrei Alexandrescu in Modern C++ Design (Addison-Wesley, 2001), and defined our tuple class as in Listing 1. We also added access methods to avoid explicit use of static_cast in the client code. Now you can write something like this:
tuple<field_list> emp;emp[name()] = "Bill";
As a result, we kept readable notation and gained an important benefit the ability to manipulate types of tuples at compile time. In Listing 6, we concatenate arguments' field_lists to determine the type of the cross-product's tuple.
This approach does have a drawback. Every time a field is accessed, a temporary object has to be default-constructed to be passed as the parameter. This parameter is never used, but compilers do need it to properly deduce the template parameter of the member function. The cost of such construction may or may not be significant, depending on the field type. One option would be to pass a pointer rather than a reference, but this makes the notation much worse:
emp[(name*)0] = "Bill";
The solution we chose was to split the column class into two one containing the value itself (column), and another used to denote the column, without any data members (column_name). The cost of construction of an instance of such a class is negligible. After all these refinements, you can define columns like this:
struct name : column_name<column<name, std::string>
>{};
or, if you prefer using macros:
COLUMN(name, std::string);
It's often necessary to perform an operation on a given list of columns in a tuple. For example, projection requires you to copy explicitly listed columns from one relation to another. A column is identified by a type, so the task is to apply a given template function to every type in a given typelist.
Consider the std::for_each algorithm, which applies a given functor to each item in a container. In a similar fashion, our run_for_each (Listing 2) applies a functor to each type in a typelist. For this to make sense, the function call operator of the functor has to be templated:
template<class T>void operator()(T*);
When you copy fields, it makes sense to copy all of them. However, if you compare fields, you have to stop comparison once you encounter the first pair with different values. To address this difference, run_for_each distinguishes between two types of function call operators. If the return type is void, it unconditionally proceeds until all the types in the typelist are processed. If it is Boolean, run_for_each stops once False is returned.
RTL is all about relations. Relations behave like STL collections of tuples. It is possible to iterate over relations.
Relations are either tables (which hold data) or relational operators. Operators don't store their tuples; instead, they hold references to their arguments. Operators produce tuples on-the-fly, as iterators are dereferenced. Since the arguments are, in turn, relations, the structure presents an expression tree, whose leaves are tables.
Examples of unary operators include selection, projection, and groupby, which take one relation as the argument. Binary operators, such as intersection or cross_product, take two relational parameters. Relations form compile-time composite patterns; tables serve as leaves. Relational operators calculate the type of their tuple at compile time. This is the main idea of RTL, which trades the flexibility of the traditional approach, based on SQL queries that are interpreted at runtime, for efficiency and strict type safety of native C++ code.
All relations are coded as pairs of template classes: a collection class and associated with it a bidirectional const_iterator class. The collection class provides begin and end methods, which construct and return const_iterators. It also contains range methods: lower_bound, upper_bound, and equal_range, which allow fast queries when the relation is ordered in a proper way.
Each relation publishes the type of tuple (value_type) and associated const_iterator.
Unlike in pure relational theory, RTL relations are sorted. Sort order is defined by sort_list, which is a typelist of columns. For each element of sort_list, the less functor must be valid. We compare tuples using lexicographic order, and order of field comparisons is coded by sort_list. Two tuples are equivalent (with respect to a given sort_list) if each pair of the fields listed in this sort_list is equivalent. If the sort_list is empty, then the relation is unsorted. If the relation cannot contain two equivalent tuples, we say that it is fully sorted and call the sort_list a key.
To distinguish whether sort_list of a relation is a key, each relation provides is_distinct, a compile-time Boolean:
enum {is_distinct = ...};
Tables get the is_distinct value as a template parameter, while operators calculate it based on is_distinct values of their arguments. Its value is True if we can state that there are no equivalent tuples in the relation.
Tables are the only relations that actually store data (see Listing 3). The only mandatory template parameter is the tuple type. The second parameter is a typelist of columns of the table and defines the sort order. The third parameter lets users choose the table implementation. The fourth parameter says whether the sort_list of the table is a key. In applications, tables can have different implementations such as sorted vector, set, or B-tree. We deliberately keep table implementations out of the scope of the library, but do provide an in-stock, sorted vector-based implementation.
Unlike tables, relational operators do not take the tuple type as a template parameter they calculate this type themselves at compile time. Also unlike tables, relational operators do not store tuples. They keep only pointers to the relations they are based on. This way the memory overhead associated with relational operators is minimized.
Iterators store position in terms of iterators of the base relations. They also store pointer to the relation they belong. In this way, several iterators operating on a single relation can share common, position-independent, data.
We use selection as our first example. Selection is a relation, which selects those tuples from another relation that satisfy a given predicate. The predicate has to provide:
bool operator()(const T& t) const;
where T is the tuple type of the argument. We store the predicate inside the selection object to let all iterators share it (see Listing 4).
Compile-time computations for selection are trivial: The resulting relation has the same type of tuple and same sort_list as base relation. Runtime operations are also trivial, except increment and decrement operators of the selection_iterator. The selection template function is used to construct selection_t objects in a similar fashion, and for the same purpose that std::make_pair or std::bind1st are used in STL.
Listing 5 is a complete example. Assume you have a table with three columns ssn, name, and year and you need to print all tuples with the year <= 1965. The function print is a template function that works with all possible types of relations and is implemented with the help of run_for_each.
The library also provides a few predefined predicates, so in many cases users don't have to define one. The client code:
print(selection(staff,and(
ge<year>(1963),
le<year>(1965))));
prints the tuples where the year is between "1963" and "1965", without defining any predicate.
The cross_product of two relations contains all possible concatenations of tuples of the first relation with tuples that belong to the second relation. cross_product exploits advantages of compile-time type calculations (Listing 6). The tuple of the resulting relation contains all fields from each of the base relations. While it is easy to calculate field_list and value_type, a sort_list of the result is not so evident. It depends on whether the sort_list of the first argument is a key. If the first argument is fully sorted, then sort_list is concatenation of sort_lists of arguments; otherwise, it is simply sort_list of the first argument.
Our implementation does not require tuples to be distinct; however, we have two relations that handle this. The first relational operator that does this scans the base relation, skipping duplicates:
template<class Table>class distinct {
enum {is_distinct = IsDistinct};
...
};
The other relational operator is:
template<class Table,bool IsDistinct>
class declare_distinct
{
enum {is_distinct = IsDistinct};
...
};
This relation does nothing in runtime, instead forwarding all operations to its base relation. Why do we need it? In some situations, users know that a relation is fully sorted, but RTL can't deduce this from the relation's arguments. declare_distinct lets users make the hint.
Ordinary selection scans the whole base relation. However, if a predicate is a simple comparison based on a column and the table is sorted on this column, it is possible to figure out the result much faster. Sorted vectors (which are our model implementation of tables) let you find a range of tuples making O(ln(n)) comparisons, where n is the number of rows in the table. Join relations, based on such predicates, can also benefit from the sort order on tuples.
Simple comparisons based on a column or list of columns have one common property they produce a contiguous range of tuples. RTL's range_t relation describes this contiguous range. The range_t constructor takes two iterators pointing into the same relation the first marks the beginning of the range, and the second its end.
We name methods to resemble their counterparts in STL, but unlike in STL, equal_range does not return std::pair of iterators, but rather a new relation, range_t. All three methods are template methods that take as a parameter any subtuple that forms a prefix of the sort_list.
Listing 7 implements of equal_range for selection_t. First, we create the range on the base relation. Then we construct a new selection, which is based on this range rather than on the whole base relation. This selection contains the same tuples as the desired result. The last two lines form the correct iterators and return the result. This technique (with minor changes) was used for all relational operators. The implementation of lower_bound and upper_bound is similar. Sort order is heavily used by join relations and fast selection functions; see Listing 8.
The sort order plays a significant role in relational computations. However, each relation has only one sort order. To deal with this limitation, RTL provides function invert and two relational operators. The template function invert is defined as:
template<class SortList, class Strategy,
bool IsDistinct, class Table
>
table<typename Table::value_type,
SortList, Strategy, IsDistinct>
invert(const Table& t,
const SortList* s = 0,
const Strategy* p = 0) {...}
It takes a relation and materializes it with a given sort order. Inverted tables are good techniques when the number of columns is small and attributes do not require much memory. Otherwise, it may be better to use indexes. An index is a relation that mimics its base relation in all aspects except the sort order.
There are two index classes: interator_index_t and key_index_t.
template<class Table,class SortList,
class Strategy>
class iterator_index_t {...};
template<class Table,
class SortList,
class Strategy>
class key_index_t {...};
Both classes take three template parameters: base relation, a new SortList, and Strategy. The last parameter tells RTL how to implement internally used tables. These tables keep only fields from SortList, plus information that lets you find a whole tuple in the base relation. Basic difference between two indexes is the way they store this information. iterator_index stores iterators, which mark positions in the base relation. key_index stores a key from the base relation. It is possible to create key_index only when the base relation is fully sorted; for example, when Table::is_distinct is True.