Lesson 9: Normalization

Introduction

What is Normalization?


Why Normalize?

Remember that these three anomalies can corrupt the contents of a database. (We sill see some examples momentarily.) By reducing data duplication and eliminating these anomalies, we intend to minimize the possibility of data corruption and simplify the development, maintenance, and expandability of the database.


How Do We Normalize?

It is important to remember that normalization techniques are based not only on algorithms, but on the semantics of the data in the database. If we do not understand the data domains and relationships we cannot perform proper normalizations.

A normalized schema should have the properties:


Normal Forms

The following normal forms are listed from least to most normalized:

Normal Forms 1 through 3 are based upon primary keys. Every relation is assumed to have a primary key made up of one or more attributes.

Reminder: A superkey is a set of attributes that uniquely identify a tuple in a relation. A key is a minimal superkey, i.e. a superkey with superfluous attributes removed. Possible keys are called candidate keys. A key chosen from a set of candidates is a primary key.

First Normal Form (1NF)

Fixed by decomposition into a series of single-valued relations. Removes redundancy, and is general, placing no limit on number of values.

It is the nature of a relational database management system to allow only atomic values in an attribute - if an attribute is defined as integer, then only a single integer could be stored in such a field. Thus it is difficult, though possible, to violate 1NF with such a database system. Typically it can be accomplished by misusing strings, as in the following:

Table Not in 1NF
studentCourses Table
studentId courses
987859400 {CP466, HP202, CP102}
999568440 {CP363, CP400}

This data clearly demonstrates its semantics - one student may take many courses and each course may have many students.

You must deliberately set aside enough space to store multiple course names. Setting aside only 5 characters forces you to store only one course name at a time.

Imagine that you wish to determine the total enrolment in all computing courses, i.e. courses that start with 'CP':

          
SELECT COUNT(studentId) FROM ...
  WHERE courses LIKE '%CP%'

        

This returns a value of 2 rather than a correct value of 4 because each tuple is counted only once. A proper counting requires a much more complex SELECT statement.

The following is an interesting, but incorrect, attempt at a solution:

Table Still Not in 1NF
studentCourses Table
studentId course1 course2 course3
987859400 CP466 HP202 CP102
999568440 CP363 CP400 NULL
  • Although each attribute contains an atomic value, each attribute represents the same ‘fact’.
  • This is also very difficult to query or update.

The count from the previous example is now:

        
SELECT COUNT(studentId) FROM ...
  WHERE course1 LIKE 'CP%' OR course2 LIKE 'CP%' OR course3 LIKE 'CP%'

and this still does not give you the right answer. Furthermore, updating means finding the first free attribute of the three - which is non-trivial to put it mildly - and what happens when a student takes four courses?

We can fix this by defining the table to contain only atomic values, and repeating key data as necessary:

1NF Violation - Fixed
studentCourse Table
studentId «PK» course «PK»
987859400 CP466
987859400 HP202
987859400 CP102
999568440 CP363
999568440 CP400

This seems to have introduced some redundancy, but it now means that our enrolment count statement works properly since each tuple is counted once and only once.

        
SELECT COUNT(studentId) FROM ...
  WHERE course LIKE 'CP%'

We'll see how to reduce redundancy with other normal forms.

Note that the semantics of the data are preserved: one student may still take many courses and one course may have many students. If the data semantics are violated by a normalization procedure then we have done something wrong.

Second Normal Form (2NF)

Applies only to primary keys with multiple attributes.

Based upon full functional dependency. A dependency is full functional if when an attribute is removed from a key, the dependency is destroyed. If the dependency remains, this implies the key is not minimal.

i.e. a relation is 2NF if every non-key attribute of a relation is fully functionally dependent on the primary key of the relation. If an attribute is dependent only on a partial key, it is not 2NF. (You cannot have a partial key if the key consists of only a single attribute.)

A prime attribute is a member of a candidate key. An attribute is non-prime if it is not a member of an candidate key. (Non-prime attributes are also referred to as non-key attributes).

Table Not in 2NF
studentCourse Table
studentId «PK» course «PK» surname forename instructor grade
8245 CP102 Kumar Vijay N. Pavlova B-
8245 HP202 Kumar Vijay S. Zeller C+
9995 CP363 Soult Francine D. Brown A-
9995 CP400 Soult Francine T. Yang B+
9995 CP102 Soult Francine N. Pavlova A

This relation is ripe for generating anomalies. Whenever a student is added to a course the student's name as well as studentId must be added to the relation. If it is added incorrectly then we have one studentId with multiple names attached to it. Similar things can happen with the combination of instructor and course. When a tuple is deleted from the relation, if the last student for that course is removed then the course is entirely removed from the database unless we make a point of storing NULL data for the removed student. This is awkward and ugly. Common sense, and normalization techniques, tell us there must be a better way.

Note: NULL values for the grade before the final grade is assigned are perfectly valid.

We have defined {studentId,course} as the primary key for this table, and indeed as a key it uniquely identifies each tuple. However, the tuples above violate 2nd Normal Form. The non-key attributes surname and forename are only partially dependent on the primary key since the attribute course can be removed from the primary key, and the dependency still holds. i.e. surname and forename are fully functionally dependent only on studentId. Similarly, the non-key attribute instructor can be identified by the partial key {course}. Only the grade attribute is fully functionally dependent on the entire primary key, i.e. {studentId,course}→grade.

The relation’s functional dependencies are:

{studentId,course}→grade
studentId→surname,forename
course→instructor

surname, forename, and instructor are functionally dependent upon only partial keys and thus 2NF is violated.

These functional dependencies arise out of the semantics of the data in the relation. We know that students have names as well as numbers and that although a name may be repeated (David Brown is fairly popular, for example), student IDs are never repeated. If we do not understand the semantics of the data we cannot determine the function dependencies.

This violation of 2NF can be fixed by decomposition into relations where all attributes are fully functionally determined. In this particular example, this table could be decomposed into three tables:

2NF Violation - Fixed
student Table
studentId «PK» surname forename
8245 Kumar Vijay
9995 Soult Francine

courseInstructor Table
course «PK» instructor
CP102 N. Pavlova
HP202 S. Zeller
CP363 D. Brown
CP400 T. Yang

studentCourse Table
studentId «PK» course «PK» grade
8245 CP102 B-
8245 HP202 C+
9995 CP363 A-
9995 CP400 B+
9995 CP102 A

Note here the huge potential for reducing data redundancy. Now each combination of course and instructor is given once rather than once for every student in the course. Similarly a student's name appears only once.

The opportunity for data corruption is also greatly reduced. Now a student's name appears in one place only. An instructor's name can still appear against many courses, but that could be fixed by giving each instructor an ID number and storing their name separately, as with students.

The UML schema:

Clearly having courses with multiple sections in different terms requires a more complex design, but the approach of looking at the functional dependency violations and fixing them applies no matter how complex the real world context is.

Third Normal Form (3NF)

Non-key attributes should not be functionally determined by other non-key attributes. If \(Y→A\), and \(Y\) is not a key, then the relation is not in 3NF. (If \(Y\) is a partial key then 2NF is violated).

Restated, a relation \(R\) is in Third Normal Form if it is in 2NF and every non-key attribute of \(R\) is non-transitively dependent on each candidate key of \(R\).

A transitive functional dependency implies that when \(X→Y\) and \(Y→A\) then \(X→A\) where \(X\) is a key and \(Y\) is a non-key.

Fixed by decomposition into relations where the non-key attributes that determined the other non-key attributes become keys.

Table Not in 3NF

The following violates 3NF because at least one of the non-prime attributes is dependent upon another non-prime attribute:

studentCourse Table
studentId «PK» course «PK» grade gpa
8245 CP102 B- 7
8245 HP202 C+ 6
9995 CP363 A- 10
9995 CP400 B+ 9
9995 CP102 A 11

(Note that WLU uses a 12 point GPA system.)

The functional dependencies for grade and gpa are:
grade→gpa
gpa→grade

Both grade and gpa are non-prime attributes because they are not part of any candidate keys - they do not contribute to the uniqueness of a tuple. Thus 3NF is violated because a non-prime attribute (either grade and gpa) is dependent upon a non-prime attribute (either grade and gpa).

Not all violations of 3NF involve functional dependencies that are \(X→Y\) and \(Y→X\) as we have in this example. It is sufficient if only \(X→Y\) and both \(X\) and \(Y\) are non-prime attributes.

Because 3NF is violated this table is subject to update anomalies. What happens if a student receives a grade of A+ but a GPA of 0? Which is the correct mark? (If you are a student the answer is obvious, of course.)

The relation’s functional dependencies are:

{course,studentId}→grade
{course,studentId}→gpa
grade→gpa
gpa→grade

Thus these transitive dependencies are also true:

{course,studentId}→grade→gpa
{course,studentId}→gpa→grade

Having discovered what the functional dependencies are and the fact that we have at least one transitive dependency forces us to fix the table. We can use these functional dependencies to guide our decomposition. Note that we may have to make a choice as to how to decompose a table. Sometimes this choice is arbitrary.

In a general sense, a table or relation should contain information about only one thing, where that 'thing' is determined by the primary key. According to Kent (1983) all facts in the relation are related to the key:
[every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key.
A transitive dependency tells you that some of the attributes in a tuple are not about the key, but about something else, and the relation should be decomposed in such a way that each table has attributes only about the key.

3NF Violation - Fixed

The original relation can be decomposed into:

studentCourse Table
studentId «PK» course «PK» gpa
8245 CP102 7
8245 HP202 6
9995 CP363 10
9995 CP400 9
9995 CP102 11

gpaGrade Table
gpa «PK» grade
12 A+
11 A
10 A-
9 B+
8 B

The UML schema:

This decomposition removes the update anomaly problem. Now grades appear in one place only and a simple join is enough to connect them back to a student's course.

An equally valid decomposition involves a table where grade is the primary key and gpa is the dependent attribute. The choice in this case is arbitrary.

Boyce-Codd Normal Form (BCNF)

Raymond F. Boyce and Edgar F. Codd developed this form in 1974. Codd developed the first three normal forms in the 1960s and published his seminal paper in 1970.

BCNF is a more restrictive version of 3NF.

Every relation in BCNF is also in 3NF, but the reverse is not necessarily true. 3NF allows attributes to be part of a candidate key that is not the primary key - BCNF does not.

Most relations that are in 3NF are also in BCNF, but not all. The differences are subtle, but important. It might be easier to say what is not in BCNF: A relation is not in BCNF if

  1. the candidate keys in the relation are composite keys (that is, they are not single attribute keys)
  2. there is more than one candidate key in the relation
  3. the keys overlap, that is, some attributes in the keys are common
Table Not in BCNF

The following is in 3NF but not in BCNF:

studentTutor Table
studentId tutorId tutorSin
8245 1254 000 737 999
9995 1567 000 656 999
5643 1254 000 737 999
6179 1742 000 651 999

Note that in the semantics for this data each student may have only one tutor, but each tutor may have many students.

This table is subject to insertion anomalies as both the Tutor ID and SIN must be entered whenever a tutor-student pair is entered. Here again, common sense tells us that this table is not well designed, even if we don't immediately recognize that it violates BCNF.

  • Candidate keys are: {studentId,tutorId} and {studentId,tutorSin}
  • tutorId→tutorSin and tutorSin→tutorId, but because both tutorId and tutorSin are prime attributes (i.e. they are both potentially part of a primary key as in the first point) these FDs do not violate 3NF (which is concerned with non-key attributes).
  • Neither tutorId nor tutorSin alone are superkeys, and thus BCNF is violated.

Both tutorId and tutorSin are prime attributes because they are part of candidate keys. The use of both attributes in this table violates BCNF. The fix is to use only one of these two attributes. As noted before, BCNF is a stronger version of 3NF because it restricts both prime and non-prime attributes, while 3NF only restricts non-prime attributes.

BCNF Violation - Fixed

The original relation is decomposed into:

studentTutor Table
studentId «PK» tutorId «PK»
8245 1254
9995 1567
5643 1254
6179 1742

tutor Table
tutorId «PK» tutorSin
1254 000 737 999
1567 000 656 999
1742 000 651 999

The UML schema:

Why this decomposition over the other possible decomposition, where student IDs are matched against tutor SINs? Here outside considerations help in our decision. SINs are treated specially and should be provided extra protection and privacy. The best way to do this is to minimize their use. Given a choice, then, between the two decompositions, the one that minimizes the use of SINs is the best.

Again, update and insertion anomalies have been removed.

Normalization Algorithm

  1. Suppose that \(R\) is not in BCNF. Let \(X\) be a subset of \(R\), \(A\) a single attribute in \(R\), and \(X→A\) be a non-trivial FD that violates BCNF. Decompose \(R\) into \(R-A\) and \(XA\).

  2. If either \(R-A\) or \(XA\) are not in BCNF, repeat step 1 with the violating relation(s).

This simple recursive algorithm produces BCNF compliant relations every time. Of course, databases development doesn't normally start out with one huge relation of all the attributes in the database. The semantics of the data point the way to how relations should be organized. When semantics are unclear or a developer is simply unsure of where to go next, this algorithm can help the process.


Relation Decomposition

How do you determine which decomposition to use?

Kent is saying that a relation should be centred around one 'fact', and that all attributes in a relation should be connected to that fact. The key provides the foundation for this fact. That a student has a tutor is a fact (and a key). A tutor's SIN is not part of that 'fact'.

Fourth Normal Form

An attribute may multi-determine another (\(X↠Y\)). For example, a student (identified by studentId) may take many courses (identified by course). Thus studentId↠course. This is called a multi-valued dependency.

A table violates fourth normal form when it contains two or more independent multi-valued dependencies.

Table Not in 4NF

For example, assume that a student may take more than one course and have more than one major. Thus student has two multi-valued dependencies studentId↠course and studentId↠major. 4NF is violated by the following table:

studentCourseMajor Table
studentId «PK» course «PK» major «PK»
985235410 PC100 Physics
985235410 PC100 Computing
985235410 CP363 Physics
985235410 CP363 Computing
985235410 EN228 Physics
985235410 EN228 Computing

Note that BCNF is not violated because there are no functional dependencies in this table. In fact the primary key for this table is made up of all three attributes, and thus there are no non-prime attributes.

This table contains two independent multi-valued dependencies. course and major are each multi-determined by studentId, but are independent of each other. In such a case every possible combination of studentId, course, and major must be included in the table to make sure that no data is left out. (Imagine if a student was taking five courses and had two majors - matching each major to all courses is the only way to insure consistency of data). This table is just asking for data anomalies.

4NF Violation - Fixed
This relation can be solved by decomposing the table into two separate multi-determined relationships:

studentMajor Table
studentId «PK» major «PK»
985235410 Computing
985235410 Physics

and

studentCourse Table
studentId «PK» course «PK»
985235410 CP363
985235410 PC100
985235410 EN228

As in the original table, the primary keys for each of these tables consists of all the attributes in the tables.

We will not go beyond 4NF in our discussions.