SRDL: A Small Relational Database Language

Dr. Dobb's Sourcebook March/April 1997

A tool for prototyping data structures

By Sergei Savchenko

Sergei is a developer for Counselware in Montreal, Canada. He can be contacted at savs@CS.McGill.CA.

When developing software, we are often confronted with what appears to be a typical database problem, yet ends up requiring operations some database engines can't deliver. Obviously, what we need in such situations is a tool that lets us quickly create custom, specialized features. The tool I present in this article -- a small relational database language called SRDL -- is designed to provide applications with a considerable amount of flexibility in dealing with database problems. If you are familiar with SQL, you will find a lot of similarities in SRDL. There are differences between the two languages, of course, primarily in the definition of relations, structure of operators, and typing scheme.

SRDL is implemented in C++ (using template classes). The current implementation is about 2500 lines of code. It should compile under GCC and most other modern C++ compilers, and run on practically any platform. The implementation can either be used interactively or embedded into C++ applications that call member functions implementing the operators explicitly. The complete SRDL source code (distributed as freeware) is available electronically; see "Programmer's Services," page 3.

There are two distinguishable levels in the SRDL implementation -- the lower-storage level and the upper-manipulation level. There is a minimal interface between the two, and all algorithmic complexity of operators is in the upper level, with the lower level providing a way to read or write into some particular format on some storage. Multiple implementations of the lower level are possible, so that relations can be stored in different formats (or even in different storage areas; for example, placed into memory to speed up execution of consecutive operators).

I find SRDL helpful in prototyping data structures, especially possible operations on data structures. Of course, it can be extended for practical use as an underlying database engine for many kinds of applications.

Relations

Informally, a relation corresponds to a table with repeating rows of numbers or text. More formally, however, a relation is defined as a set of tuples of scalar data. The order of tuples doesn't matter and no tuple appears more than once in the set. Tuple in this notation corresponds to what is often called a "record."

SRDL extends this definition of a relation. Just as with tables on paper, there is often a need for fields that relate to the table as a whole, rather than to any individual record. Table description, totals, and minimum/maximum values are attributes of all tuples combined, rather than any individual record. Thus, SRDL distinguishes between horizontal and vertical fields. The former have instances in every record, whereas the latter are stored separately in a unique instance per relation.

In almost any programming language, you can differentiate between the objects you manipulate and the operations that let you perform manipulations. In this case, relations containing vertical/horizontal fields constitute the manipulated objects and set of relational operators; the relational algebra makes up the operations. Each relational operator takes a single relation or several relations as an argument, and produces a relation as a result. We can say that relations are "closed" under the relational operators.

For most purposes, only a few operators are needed in most database languages. SRDL defines three operators: join, select, and project.

The select operator is straightforward. If you specify a selection criteria, only those tuples of the argument that satisfy the criteria are extracted into the result. In Figure 1, a relation with two fields (ID and value) is applied by select with the criteria value>4. As you can see, the record with ID=B has (value=3.1)<4, thus fails the criteria and doesn't appear in the result.

The join operator combines fields specified in two tables. In Figure 2, two relations have fields that contain the same kind of information, notably the ID field. If two records -- one in each relation -- have the same contents in the join fields, you would combine information from both, placing the augmented record into the result relation. In Figure 2, the record with ID=A from the first table can be joined with two records of the table (numbers 1 and 2) since both have ID=A. This operation is often called a "natural" or "intersection" join. Other kinds of joins can be defined, including a "union" join, which produces somewhat different results.

The purpose of the project operator is to augment or simplify the internal structure of relations. In Figure 3, you want to project the argument relation into another relation continuing to carry the ID field, as well as containing the new tvalue field (computed as items*value). In the next field, you want to add a vertical field into the result. It is recursively defined to be total=0:total+tvalue, meaning that total is to be initialized to 0. For each record, you would readjust total as its previous value plus the contents of the horizontal field tvalue. Since computation proceeds in the order of field definition, tvalue is computed when you evaluate the total+tvalue expression.

In SRDL syntax, horizontal fields are enclosed in curly brackets, while vertical fields are enclosed in square brackets. The length of a field (in bytes) follows the field name.

Although it appears in a project operation that the destination should have exactly the same number of records as arguments, this is not necessarily the case. If you attempt to leave the value field in the result relation, you can potentially get duplicate tuples, since value equals 5 in both record A and record B, which is not allowed by the definition of a relation.

Also, during a project operation, expressions are evaluated in the scope of both source and result relations, with the source relation having priority. In other words, each field is first being looked for in the source relation; the evaluator will only try to locate the field in the destination relation if the first lookup fails.

The project operator is flexible. For example, computing the maximum of the value column for vertical fields is [max:10=0:(!(value>max))*max+(value> max)*value]. Computing the total number of records, on the other hand, is [num:10=0:num+1].

However, expressions with a legal syntax for a vertical field might well be ambiguous. For instance, in the expression total=0:tvalue-total, the subtraction binary function is not commutative (a-b b-a), which makes the expression depend on the order of records in the relation. But by the definition of a relation, there is no guarantee of any particular order. Opportunities for errors do exist, but fortunately, these errors happen when the expression doesn't have any logical meaning anyway. However, considering opportunities for parallel evaluation, you would have to restrict vertical expressions to some subset of the functions.

Upon further examination of the project operator, you can also see domain or field algebra at work. In the specification of fields, you apply operators to fields and get fields as a result. In SRDL, fields are typeless. The only field attribute that should be described is its length, which (in the current syntax) follows the field name in the field definition. However, there is no explicit information on whether data being stored is integer, real, or textual.

Being typeless has several benefits. For one thing, it simplifies notation. More importantly, it allows some polymorphism. Even though SRDL is typeless (as far as definition is concerned), at the time of expression evaluation, the type is based on the current contents of fields. A value is considered to be an integer if it contains only characters from the set {0123456789}. If it contains characters from the set {0123456789.}, it is considered real. The rest is considered to be textual.

When it comes to scalar data types, there is a hierarchy of integer -> real -> text. For example, a Boolean contains less information then an integer. In fact, a Boolean can always be represented as an integer, just as in C. An integer contains less information then a real (and, in fact, can be represented using a real). Finally, anything (including reals) can be represented as text.

Whenever SRDL evaluates a binary function on operands of different type, it uses the aforementioned hierarchy to upgrade an operand of a lesser type to a higher type, and evaluate the binary function in the space of a single type. Table 1 defines the relevant functions and operators. The @ operator in Table 1 provides for the "leaking" of relational algebra into the field algebra; see Figure 4.

Figure 4 implies that, for each record of a relation, you can compute a combined attribute of some other relation parameterizing this computation with the current context. Having this operator, you can avoid (in a somewhat expensive way) the need for a reduction operator. Reduction allows you to compute a vertical expression for groups of tuples, as opposed to all tuples in the relation. Tuples fall into different groups based on the contents of some horizontal field (with which the reduction operator is parameterized). Although SRDL doesn't have a true reduction operator, the same effect can be achieved by using the @ operator within expressions for horizontal fields.

Together, relational and field algebra provide a considerable amount of flexibility. This methodology gives an abstraction in which many problems that require some coding effort in imperative languages like C can be trivially solved.

An Example

Figure 5 illustrates a common database layout that contains three relations. Each item from PROD relation is composed of several items from COMP relation, with the CONT relation setting the correspondence. The typical task for a database engine is to produce a table summarizing the contents of some product. You can do this using the sequence of relational operators in Example 1(a). Note that the COMP relation contains expressions in the cost column. They are being evaluated using the ? operator.

Example 1(b), which is slightly more complex, finds the number of components per product. The total number of components is being fetched by means of the @ operator from a relation being created for each record of the argument as a result of another project operator. Note here that the inner project is parameterized by a variable from the outer project (prod). From a purely practical point of view, there is a danger of shadowing. When both inner and outer scopes have variables with the same name, the evaluator always takes the inner scope according to the scoping rule, which perhaps was not the intended meaning. Providing function calls with dummy variables can ease this complication.

SRDL can be extended both through adding other relational operators (union join, perhaps a true reduction operator), optimization (adding hidden record indexing), or by adding code for different data formats. The language itself requires function calls that would make coding much more practical.

DDJ