Lesson 7: Data Anomalies

Introduction

Proper table design is a fundamental part of any database implementation. Improperly designed tables and joins are a source of anomalies - incorrect or misleading data. The following are examples of some of these anomalies.

Insertion Anomaly

Given the following table, we can have problems when inserting new tuples. (Assume that the primary key is {studentId, course}).

student Table
studentId course dept surname forename instructor grade
999568440 CP363 Computing Snord Cranston D. Brown F
999568440 CP400 Computing Snord Cranston T. Yang A-
999568440 CP102 Computing Snord Cranston D. Brown C
987859400 PC466 Physics Zzap Zachary B. Pavlova D
987859400 HP202 History Zzap Zachary S. Zeller D
987859400 CP102 Computing Zzap Zachary D. Brown B+
005689250 CP102 Computing Snord Lillibelle D. Brown A+
tuple with incorrect data
999568440 PC100 NULL Snord Bob H. Bezner F
invalid tuples
985768920 NULL NULL Aardvark Anthony NULL NULL
NULL EN228 NULL NULL NULL G. Boire NULL

The tuple noted as having incorrect data (Bob Snord) shows the problem of keeping consistency: Mr. Snord's first name is wrong in the new tuple. Because we are using redundant data we increase the possibility of error creeping in. We are also not entering the department information, and all courses should be related to a department.

The two tuples noted as invalid (Anthony Aardvark and NULL NULL) show the difficulties that arise when we do not yet have all of the data to hand, but the table design requires it. In the first case we add a new student who is not yet taking a course, in the second a course that does not yet have any students. The empty fields must be left as NULL. The problem that arises is that both studentId and course are part of the primary key, and primary keys do not allow NULLs. Thus the last two tuples could not actually be inserted into the table.

The implication of this anomaly is that the table is poorly designed. Student and course data are associated but independent data. Students exist without courses, courses exist without students. Assigning them to the same table is poor design.

Deletion Anomaly

Given the following table, we can have problems when deleting new tuples. (Assume that the primary key is {studentId, course}).

student Table
studentId course dept surname forename instructor grade
999568440 CP363 Computing Snord Cranston D. Brown F
999568440 CP400 Computing Snord Cranston T. Yang A-
999568440 CP102 Computing Snord Cranston D. Brown C
987859400 PC466 Physics Zzap Zachary B. Pavlova D
987859400 HP202 History Zzap Zachary S. Zeller D
987859400 CP102 Computing Zzap Zachary D. Brown B+
005689250 CP102 Computing Snord Lillibelle D. Brown A+

If we delete the last student from a course, we lose the course information. For example, if we delete Zachary Zzap from HP202, the course disappears.

As noted in the Insertion Anomaly, this problem arises due to having associated but independent data in the same table.

Modification Anomaly

When data is repeated, we must implement proper design to avoid generating inconsistencies. For example: if the course instructor for CP102 changes, we must change this information in more than one tuple. If we miss a tuple then the data is no longer valid.

student Table
studentId course dept surname forename instructor grade
999568440 CP363 Computing Snord Cranston D. Brown F
999568440 CP400 Computing Snord Cranston T. Yang A-
999568440 CP102 Computing Snord Cranston X. Sharma C
987859400 PC466 Physics Zzap Zachary B. Pavlova D
987859400 CP102 Computing Zzap Zachary D. Brown B+
005689250 CP102 Computing Snord Lillibelle X. Sharma A+

As noted in the Insertion and Deletion Anomalies, this problem arises due to having associated but independent data in the same table.

Spurious Tuples

If a join loses information (not necessarily tuples - in fact, invalid tuples may be added through a join), it is an invalid join. Valid joins are called non-additive (lossless) joins.

The following is an example of a join that produces spurious tuples:

studentInstructor Table
studentId surname forename instructor grade
999568440 Snord Cranston D. Brown F
999568440 Snord Cranston T. Yang A-
999568440 Snord Cranston D. Brown C
987859400 Zzap Zachary B. Pavlova D
987859400 Zzap Zachary S. Zeller D
987859400 Zzap Zachary D. Brown B+

is joined to the following table by their common field instructor.

courseInstructor Table
course instructor
CP363 D. Brown
CP400 T. Yang
CP102 D. Brown
CP466 B. Pavlova
HP202 S. Zeller

by the SQL statement

Join studentInstructor and courseInstructor
              
SELECT * FROM studentInstructor AS si
  INNER JOIN courseInstructor AS ci
  ON si.instructor = ci.instructor

resulting in:

Result Set
studentId course surname forename instructor grade
999568440 CP363 Snord Cranston D. Brown F
999568440 CP102 Snord Cranston D. Brown F
999568440 CP400 Snord Cranston T. Yang A-
999568440 CP363 Snord Cranston D. Brown C
999568440 CP102 Snord Cranston D. Brown C
987859400 CP466 Zzap Zachary B. Pavlova D
987859400 HP202 Zzap Zachary S. Zeller D
987859400 CP102 Zzap Zachary D. Brown B+
987859400 CP363 Zzap Zachary D. Brown B+

The light-green tuples are spurious tuples, i.e. they contain invalid information that arises out of joining together tuples that do not belong together.

The spurious tuples arise out of this join because of the poor design of the source tables. Students take courses and should be associated with courses, not with instructors. Since instructors are also associated with courses - and potentially with multiple courses - the join between these tables produce tuples that have no existence in reality. (If each instructor only taught one course then this problem wouldn't arise, but that's not the reality.)

The fix is to redesign the tables to properly reflect the real-world association amongst the data. Students should be associated with courses, not instructors, and a studentCourse table joined with the courseInstructor table would not have these spurious tuples. Of course this fix would work only if each course has only one instructor. Courses that have multiple sections and instructors would require another redesign, this time with tables that relate courses, sections, students, and instructors. Keeping track of course sections offered in different terms would require yet another redesign that tracks the terms as well. Complex data requires complex database design.