Wednesday, August 31, 2005

Database Theory

This article appears in slightly different form in the July/August 2005 issue of IEEE Micro © 2005 IEEE.

Relational databases are everywhere, but many of the people who work with them know little of the underlying theory. One of the best known authorities in this field wants to change that state of affairs.

Database in Depth -- Relational Theory for Practitioners by C J Date (O'Reilly, Sebastopol CA, 2005, 228pp, ISBN 0-596-10012-4, www.oreilly.com, $29.95)

In 1969, as is true today, most commercial data processing applications relied on some sort of database management software. A great deal of thought had gone into their development, but they seem ad hoc and overly complex by today's standards. In that year an IBM mathematician, the late E F (Ted) Codd, developed a system, called the relational model, to allow him to attack the problems of database management rigorously. Over time, Codd's model became the basis for all serious database management. 

Chris Date, another mathematician, was an early adopter of the relational model. His An Introduction to Database Systems, now in its eighth edition from Addison-Wesley, first appeared in 1975. It is the standard college textbook on the entire database field. His Databases, Types, and the Relational Model: The Third Manifesto, co-authored with Hugh Darwen and soon to appear in its third edition from Addison-Wesley, is a graduate-level discussion of how to carry forward and extend Codd's principles as database systems evolve. In it he tries to lay a firm foundation for "object-relational" database management. 

The current book lies between these two. It is more advanced and more focused than the former, but chattier and more suitable for self-study than the latter. Date carefully explains how he has positioned these books, lest you be tempted to rely on one of them to do the job of the others. If you're serious about database management, you should have all three.

As relational database management systems became widespread, the various commercial vendors began to diverge from the original model. Much of this divergence became embodied in the SQL language. As an aside for those who care about such things, the name SQL originally stood for Structured Query Language, and many practitioners pronounced it Sequel. In its standardized versions, however, the name does not stand for anything. It is just the three letters S-Q-L, and that is the way Date pronounces it, though he acknowledges that the other pronunciation is still widespread. 

The problems with SQL have little to do with its name. They arise from its abandonment of fundamental principles. For example, SQL fails to keep the data model separate from its implementation. Its basic structures are tables, which differ from relations in that they have ordered rows and columns and can have duplicate rows. They differ in other ways too, but you'll have to read Date's book to find out. 

Date prefers to use the language Tutorial D, which he and Darwen define in the Third Manifesto book. Date believes that the original relational model does not need to -- in fact, should not -- change to accommodate object-oriented features. The original model deals with relations on domains. Domains are arbitrary sets, and a relation on domains T1, . . ., Tn is approximately a set of sets, {t1, . . .,tn}, plus a mechanism (based on a heading and annotations in the body) that allows elements to maintain their identities as members of their respective domain sets. A theory of types and inheritance adds structure to the set of domains. Date regards this structure as orthogonal to the original relational model, because the original model assumes nothing about the domains. The mainfesto uses the term D to represent any language that
  • preserves the original model
  • provides the orthogonal object-oriented extensions called for by the manifesto
Tutorial D is a specific example of D. That means that Tutorial D (unlike SQL) is relational and that it provides object oriented features in a way that is orthogonal to the relational model.

This book uses Tutorial D in places, but it does not go into detail about the theory of types. It focuses on providing a brief, but not terse, presentation of the original relational model, contrasting it with current commercial products and discussing contentious issues such as the wisdom of allowing null attributes. It contains a large number of thought provoking exercises. An O'Reilly website provides provides answers and further discussion. One drawback of the book is its poorly designed index. O'Reilly should invest in a good one for the next printing.

Date complains about the product-oriented nature of computer science instruction, which leads to a number of widespread misconceptions about the relational model. He hopes by his presentation, which relies on principles rather than the quirks of SQL or specific products, to correct these misconceptions and produce a greater understanding of and appreciation for the relational model. 

To help you decide whether you need to read this book, Date suggests that you try to answer the following questions:
  1. What exactly is first normal form?
  2. What's the connection between relations and predicates?
  3. What's semantic optimization?
  4. What's a join dependency?
  5. Why is semidifference important?
  6. Why doesn't deferred integrity checking make sense?
  7. What's a relation variable?
  8. What's nonloss decomposition?
  9. Can a relation have an attribute whose values are relations?
  10. What's the difference between SQL and the relational model?
  11. Why is The Information Principle important?
  12. How does XML fit with the relational model?
Date does not specify the number of questions you must answer correctly to be exempt from reading the book. I suspect that many experienced practitioners will fall far short of 100%. If you are in that category, or if you just want a clear account of the relational model and its underlying principles, you should read this book.