Dr. Dobb's Journal August 2001
In the late 1800s, Thomas Edison and George Westinghouse became embroiled in what has become known as "The War of the Currents." Edison had invested heavily in infrastructure, supporting the use of direct current for the distribution of electricity. Westinghouse, having bought patents on the inventions of Tesla, advocated alternating current. For almost a decade, early adopters of electricity were forced to endure a spectacle that included, among other things, public electrocutions of large animals (performed by Edison to show the danger of alternating current; he tried to popularize the slang "Westinghoused" as a synonym for "electrocuted"). For a short period of time, there were two sets of infrastructure that operated under different assumptions about how power should be transported and consumed. Fortunately, the technology was young and the infrastructure was immature, so the cost of competing standards was relatively low. Even so, consumers of electricity were forced to make an expensive choice, balancing the investment in existing infrastructure against superior technology.
Fast forward about 100 years to Christmas Day, 1981. Under the tree waiting for me was a new TI 99/4A home computer. With 16 glorious kilobytes of memory and a blazing 3-MHz processor (not to mention a built-in Basic interpreter), the 99/4A was a future techie's dream come true. Like most computing machinery, the components inside the 99/4A required direct current. Since my house was wired for alternating current at the time (at 60 Hz, no less), this resulted in a mismatch of assumptions. The good news was that through amazing foresight, the engineers at Texas Instruments had included with the computer a transformer, which was capable of adapting the power from the outlet for use with the 99/4A. A year or so later, TI discovered the possibility that if the transformer malfunctioned, a user touching the metal case of a 99/4A could be Westinghoused. You can only imagine Edison screaming, "I told you so!" as TI spent the cost of distributing replacement transformers.
Data is the power that runs a modern business. When consumers of that power have different assumptions, a transformation is required. Very bad things (including the loss of infrastructure investment) can happen if that transformation is not carefully planned. Currently, there are at least three major data management paradigms ISAM, SQL/Relational, and XML in use, with XML poised to explode. Each has different assumptions regarding how data models the organization's view of the real world. When different models try to operate on the same source of power, they must reconcile these differences. The process of reconciling these assumptions can result in data loss, performance degradation, system fragility, or feature unavailability. The ideal solution is for every consumer to enjoy native and natural access to the power source, without one model compromising another.
Data management can be thought of as having two principal ingredients a data model and a system for implementing that model. The data model is responsible for describing the principles by which the data will represent the outside world. This model is implemented through a database management system (DBMS). Over time, DBMS has come to mean a software package that handles database management requirements.
Because data is the power for business, DBMSs (and their associated processes) tend to evolve more slowly than other, less mission-critical segments of computing infrastructure. In fact, database management models have remained relatively unchanged through the emergence and explosion of the Internet. It usually requires a significant shift in business practice to engender any change in database management. Distributed computing (such as web services) appears to be one such shift; it is growing rapidly and many major vendors expect distributed computing to be the next major paradigm in application development. XML is establishing the system for distributed data management, and this system does not neatly fit with the assumptions made by existing infrastructure.
Much as there are some fundamental electromagnetic properties that are the same across different types of current, there are some fundamental data management issues that are addressed by all data models. Abstracting these common issues is the first step to understanding the costs of data model differences, as well as possible remedies. The next step is to determine the assumptions and rules each model places around the basic abstractions; those assumptions and rules are the things that don't always match. The points at which the assumptions differ not only help to describe the problem, but also serve as the first step for understanding and evaluating available solutions. Finally, the mismatched assumptions and the shortcomings of the current solutions serve to help define what a future solution might look like.
The first assumptions involve three relatively basic (and familiar) database concepts: entities, attributes, and relationships. (Although some consider the Entity Relationship model to be a data model, I don't treat it as such here. For one thing, there is no significant implementation of E-R as a model separate from the other models I discuss. The value of E-R for purposes here is that it provides useful terminology for abstracting the common elements of the models that will be discussed.) In addition, each model addresses collections, operational efficiency, and the relationship of these concepts to the rest of the computing environment. There are other issues involved with data management (concurrency, operation atomicity, relational integrity, and so on), but these issues are not necessarily differentiating factors between models.
The first basic concept is that of entity the thing that is being stored and is representative of something in the external world, such as a customer, invoice, or inventory item. It may be thought of as the most granular representation of data that retains context. The question of what exactly comprises an entity is more frequently resolved through data/business analysis than through application of the normalization formulas found in database textbooks.
The second concept is that of attribute a descriptor of an entity. If you have a customer entity, you can use attributes to keep track of important descriptors, such as a person's favorite games or whether he collects butterflies. Depending on your particular prejudices, you may think of attributes as fields or columns. Attributes rely on the entity for context. Is the attribute "Zip Code = 11111" meaningful by itself? Does it refer to a vendor or a customer? Which vendor or customer?
The third concept is that of relationships. A customer entity and three order entities are useless in a business process unless you have some way of making their relationship persistent; that is, some way of denoting that the order entities were made by the person represented by the customer entity. In relational theory, this can be represented by foreign key relationships.
Obviously, a database would not be of much use if it allowed you to have one customer entity, one order entity, and one inventory entity. It would be almost as useless if it did not let you access the collection of customer entities independently from the collection of order entities. Inherent in a database paradigm is the grouping of each of these three concepts: Who are all of my customers? What are the descriptors that I've stored for a particular customer? What other entities are important in relation to this customer and how do I access them?
Although a flat text file could act as the basis of these concepts, the practical requirement is that specific information can be quickly and efficiently retrieved. It is critical that the performance does not degrade as more entities are added. This performance requirement is most often met through the use of indexes or keys.
Finally, there is the relationship between the database system and the rest of the computing environment in particular, the operating system and application. Early on, the database paradigm was represented by a set of procedures and coding standards that dictated how a particular shop's application code interacted with operating-system code. The evolution of database systems has seen at least one consistent trend: the abstraction of the database paradigm from application and operating-system constraints and the encapsulation of that abstraction within a database infrastructure. In other words, relational database application developers typically no longer worry about the offset and length of a particular attribute or the particular OS file that contains the attribute data. Those issues are abstracted within a database infrastructure that generally is viewed as, if not a black box, then a really, really dirty one with very tiny windows. While this trend is common throughout computing and may seem self evident, it has implications for the future state of data management.
Database technology has evolved through several stages, including ISAM, SQL/Relational, and XML:
ISAM. Although ISAM has not been formally standardized as a data model, thanks to the dominance of Cobol and the effect of that dominance on database management, there is a common set of well-understood expectations for an ISAM DBMS. In the ISAM paradigm, entities are records. Attributes are understood to be data stored starting at a specific offset for a specific length. The application is responsible for maintaining relationships, usually performed in much the same way as the relational model, where entities are collected in OS files, and the application (and thus the developer) is responsible for knowing which set of records is in which file. The application can include multiple types of records in one file, but any differences in entity type within a file must be implemented, understood, and maintained by the application. The DBMS does not understand any distinction between different entity types within the same entity collection.
Efficiency is achieved through the use of indexes. Since the DBMS is responsible for maintaining index information and the DBMS does not make any distinction between entity types within an entity collection, an ISAM file indexes the same attributes for an entire collection. This can result in added responsibility for the application if multiple entity types are in the same collection. Furthermore, since the DBMS is unaware of any nonindexed attributes of the entity, the same entity can be viewed as having several different compositions, and there is no guarantee that the attribute indexed by the DBMS is an attribute that is meaningful to the application. (For instance, an index can contain pieces of several attributes, which then will act as a primary key, albeit one that has no true meaning in the real world.)
An ISAM application acts as if it is operating on the physical representation of the record (which it is, in most implementations). As a result, much of the database management of the ISAM paradigm is closely tied to both the operating system and the application.
SQL/Relational. From a theoretical standpoint, the SQL paradigm and relational model are not synonymous. In fact, SQL can be used to build result sets that do not meet relational requirements. However, the average computing professional is not interested in purely theoretical DBMSs, and when most people use a relational database, they are almost invariably using SQL to manipulate the data (whether directly or under the covers, as is often the case with ADO).
In the practical relational view, entities are represented by rows. Collections of entities are represented as tables (or R-tables, or relations, if you want to be exact). Unlike the ISAM practice, the application has no control, knowledge, or concern with the interaction between entity collections and the operating system. Attributes, represented by columns, are understood by the system, and all entities (rows) within the same collection (table) must share the same attribute composition (columns). The application has no knowledge about the physical structure of attributes within the storage of the entity. All of the application's knowledge of entity composition must conform to the DBMS's view of the data. Collections of entities and attributes may be arbitrarily defined at run time through SQL predicates (as well as through views). Relationships are persisted in much the same way as the ISAM model (the constraints, such as primary key uniqueness, are formalized in the relational model, but the ISAM model is similar in practice).
Indexes are also present in the relational paradigm, and apply to the entirety of the system-defined entity collection (not necessarily to derived relations). The system's understanding of indexes is coupled with the system's understanding of attributes, so unlike ISAM, an index is always composed of attributes that are meaningful to the application.
To summarize, the relational model abstracts the database from the operating system and to some extent from the application. There is no longer an exploitable interaction between the operating system and the DBMS. Furthermore, while the application may have foreknowledge of the database composition, it is incapable of using that knowledge in a manner that is not understood beforehand by the DBMS. An application can also be written that derives all of its information about the database at run time, which is certainly not the case with the ISAM paradigm. This abstraction frees the application and the database administrator from a number of concerns regarding the internals of data management, but it also demands that the application conform to the expectations of the model.
XML. XML is not a database paradigm. It is a portable persistence mechanism. However, in a distributed computing environment, XML fragments may be viewed as small, disconnected databases because the integration task required at the receiving end is equivalent to integrating data from two different database paradigms. Each of these fragments (or minidatabases) is self-containing and conforms to a set of rules. Additional rules have been added by the introduction of XML schema, but these are not necessary to the existence of the XML fragment it can be understood by an XML parser independently of the additional rules. The additional rules are used primarily to ease the impedance mismatch that occurs when a system of loosely structured data is incorporated into a system of more highly structured data.
From a data perspective, entities are represented as elements within XML, and attributes (surprisingly enough) as element attributes. Relationships are perhaps even more important in XML than in the other models. While there are additional methods to describe relationships, any given entity represented by XML is enlisted in one primary relationship its place in the hierarchy. An entity cannot be represented in a document without this relationship. Collections of entities are generally worked on as if they were files. Collections of collections of entities (that is, a database) are not defined by XML proper, as standard XML makes the assumption that at any given time there is only one collection of immediate concern. In a striking departure from both ISAM and relational paradigms, entity composition may differ across instances of identically typed entities; that is, two customer descriptions within the same document may look completely different and still be equally meaningful. The compositions and types of the entities that make up a particular collection are understood by the system.
This model is tied less tightly to the application than other models in that a consumer of the data does not need to rely on either application-specific knowledge or a DBMS to understand the entities, attributes, and primary relationships that are represented by XML data.
As Table 1 shows, ISAM, SQL/Relational, and XML each make different assumptions. Thus, if an application works on a collection of entities, it must apply a model to that collection and operate in conformance with that model. Incorporating a different model into an existing infrastructure requires that the differences in these assumptions be addressed.
It is not uncommon for businesses to have both ISAM and SQL applications running side by side; the expected adoption of distributed computing requires that data can be used across all three paradigms as seamlessly as possible. Much like the AC-DC conversion that takes place inside a transformer, the differences in assumptions often result in a series of transformations that adapt data from one paradigm to another. There are already several available mechanisms for accomplishing this task, although no single method is without drawbacks. Much of the data architecture in Microsoft's .NET initiative is based on the expectation that distributed computing must be integrated with existing data centers. It is essential to understand the trade-offs that are inherent in each approach.
There are any number of visions about the future of data storage. However, based on the history of database technology and practice, it would seem that a vision should assume at least the following trends:
The desired evolution of DBMSs, therefore, could take the form of a Postrelational DBMS that decouples the storage mechanism from the access mechanism. One way to think of this is that there is an underlying DBMS that operates only in terms of very generic database concepts, analogous to an abstract class in object-oriented programming. A DBMS implementation could be derived from this base DBMS that applied the constraints and assumptions of a particular model to the data. Additional DBMS implementations could also be placed on top of the base DBMS, all having native and natural access to the data. Another way to look at this is to think of a data model as a type of view it is a user's perception of the data. The underlying storage mechanism does not need to necessarily conform to that view internally, but only to ensure that the view remains consistent for the consumers of that view.
While it is true that each implementation would have the additional overhead of the opaque nature of the underlying DBMS interface (and would thus not be able to take advantage of potential performance efficiencies gained by tight coupling), each derived DBMS (or view) would be equally native and act as a peer rather than a predefined stop in a pipeline. Thus, a business could run applications that used different data models concurrently (even within the same application), all of which run with similar and deterministic performance, scalability, and reliability characteristics.
Will this happen? It's difficult to imagine it happening rapidly. The relational model has been dominant for several Internet generations, and many database implementations are inextricable from the SQL/relational model. Likewise, many of the XML solutions are suffering from the growing pains that the relational database solutions went through a long time ago. However, as the need for efficient distributed data management grows both in numbers and in importance, it is likely that there will be a competitive advantage in having a solution such as this, regardless of how sophisticated the pipeline approach becomes. If that is the case, then we can rely on the axiom that computing nature abhors a competitive vacuum.
Organizations have a great deal of investment in their infrastructure incorporating ISAM and relational models. There are a number of competitive advantages that can be gained b