Introduction

What is Normalization?

It is the process of putting relations in a database into a normal form.
More generally it is a design technique for structuring database relations.

Why Normalize?

To avoid update, insertion, and deletion anomalies.
To minimize redundancy.

Remember that these three anomalies can corrupt the contents of a database. (We sill see some examples momentarily.) By reducing data duplication and eliminating these anomalies, we intend to minimize the possibility of data corruption and simplify the development, maintenance, and expandability of the database.

How Do We Normalize?

Analyze relation schemas based upon their functional dependencies and primary keys.
Relations that do not meet normal form tests must be decomposed into multiple relations that individually satisfy these tests.

It is important to remember that normalization techniques are based not only on algorithms, but on the semantics of the data in the database. If we do not understand the data domains and relationships we cannot perform proper normalizations.

A normalized schema should have the properties:

lossless joins and non-additive joins: joins of decomposed relations reproduce those relations exactly. No loss of tuples, and no generation of spurious tuples.
dependency preservation: all functional dependencies are preserved (although denormalization may be allowed for performance reasons).

Normal Forms

The following normal forms are listed from least to most normalized:

First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)

Normal Forms 1 through 3 are based upon primary keys. Every relation is assumed to have a primary key made up of one or more attributes.

Reminder: A superkey is a set of attributes that uniquely identify a tuple in a relation. A key is a minimal superkey, i.e. a superkey with superfluous attributes removed. Possible keys are called candidate keys. A key chosen from a set of candidates is a primary key.

First Normal Form (1NF)

The domain of an attribute must include only atomic values.
The value of any attribute in a relation must be a single value from that attribute’s domain.

Fixed by decomposition into a series of single-valued relations. Removes redundancy, and is general, placing no limit on number of values.

It is the nature of a relational database management system to allow only atomic values in an attribute - if an attribute is defined as integer, then only a single integer could be stored in such a field. Thus it is difficult, though possible, to violate 1NF with such a database system. Typically it can be accomplished by misusing strings, as in the following:

Table Not in 1NF

**studentCourses** Table
studentId	courses
987859400	{CP466, HP202, CP102}
999568440	{CP363, CP400}

This data clearly demonstrates its semantics - one student may take many courses and each course may have many students.

You must deliberately set aside enough space to store multiple course names. Setting aside only 5 characters forces you to store only one course name at a time.

Imagine that you wish to determine the total enrolment in all computing courses, i.e. courses that start with 'CP':

          
SELECT COUNT(studentId) FROM ...
  WHERE courses LIKE '%CP%'

This returns a value of 2 rather than a correct value of 4 because each tuple is counted only once. A proper counting requires a much more complex SELECT statement.

The following is an interesting, but incorrect, attempt at a solution:

Table Still Not in 1NF

**studentCourses** Table
studentId	course1	course2	course3
987859400	CP466	HP202	CP102
999568440	CP363	CP400	NULL

Although each attribute contains an atomic value, each attribute represents the same ‘fact’.
This is also very difficult to query or update.

The count from the previous example is now:

        
SELECT COUNT(studentId) FROM ...
  WHERE course1 LIKE 'CP%' OR course2 LIKE 'CP%' OR course3 LIKE 'CP%'

and this still does not give you the right answer. Furthermore, updating means finding the first free attribute of the three - which is non-trivial to put it mildly - and what happens when a student takes four courses?

We can fix this by defining the table to contain only atomic values, and repeating key data as necessary:

1NF Violation - Fixed

**studentCourse** Table
studentId «PK»	course «PK»
987859400	CP466
987859400	HP202
987859400	CP102
999568440	CP363
999568440	CP400

This seems to have introduced some redundancy, but it now means that our enrolment count statement works properly since each tuple is counted once and only once.

        
SELECT COUNT(studentId) FROM ...
  WHERE course LIKE 'CP%'

We'll see how to reduce redundancy with other normal forms.

Note that the semantics of the data are preserved: one student may still take many courses and one course may have many students. If the data semantics are violated by a normalization procedure then we have done something wrong.

Second Normal Form (2NF)

The relation must be in 1NF.
All non-prime attributes of a relation are fully functionally dependent upon the candidate keys of the relation.
If all candidate keys of a relation consist of single attributes then the relation is automatically in 2NF.

Applies only to primary keys with multiple attributes.

Based upon full functional dependency. A dependency is full functional if when an attribute is removed from a key, the dependency is destroyed. If the dependency remains, this implies the key is not minimal.

i.e. a relation is 2NF if every non-key attribute of a relation is fully functionally dependent on the primary key of the relation. If an attribute is dependent only on a partial key, it is not 2NF. (You cannot have a partial key if the key consists of only a single attribute.)

A prime attribute is a member of a candidate key. An attribute is non-prime if it is not a member of an candidate key. (Non-prime attributes are also referred to as non-key attributes).

Table Not in 2NF

**studentCourse** Table
studentId «PK»	course «PK»	surname	forename	instructor	grade
8245	CP102	Kumar	Vijay	N. Pavlova	B-
8245	HP202	Kumar	Vijay	S. Zeller	C+
9995	CP363	Soult	Francine	D. Brown	A-
9995	CP400	Soult	Francine	T. Yang	B+
9995	CP102	Soult	Francine	N. Pavlova	A

This relation is ripe for generating anomalies. Whenever a student is added to a course the student's name as well as studentId must be added to the relation. If it is added incorrectly then we have one studentId with multiple names attached to it. Similar things can happen with the combination of instructor and course. When a tuple is deleted from the relation, if the last student for that course is removed then the course is entirely removed from the database unless we make a point of storing NULL data for the removed student. This is awkward and ugly. Common sense, and normalization techniques, tell us there must be a better way.

Note: NULL values for the grade before the final grade is assigned are perfectly valid.

We have defined {studentId,course} as the primary key for this table, and indeed as a key it uniquely identifies each tuple. However, the tuples above violate 2nd Normal Form. The non-key attributes surname and forename are only partially dependent on the primary key since the attribute course can be removed from the primary key, and the dependency still holds. i.e. surname and forename are fully functionally dependent only on studentId. Similarly, the non-key attribute instructor can be identified by the partial key {course}. Only the grade attribute is fully functionally dependent on the entire primary key, i.e. {studentId,course}→grade.

The relation’s functional dependencies are:

{studentId,course}→grade
studentId→surname,forename
course→instructor

surname, forename, and instructor are functionally dependent upon only partial keys and thus 2NF is violated.

These functional dependencies arise out of the semantics of the data in the relation. We know that students have names as well as numbers and that although a name may be repeated (David Brown is fairly popular, for example), student IDs are never repeated. If we do not understand the semantics of the data we cannot determine the function dependencies.

This violation of 2NF can be fixed by decomposition into relations where all attributes are fully functionally determined. In this particular example, this table could be decomposed into three tables:

2NF Violation - Fixed

**student** Table
studentId «PK»	surname	forename
8245	Kumar	Vijay
9995	Soult	Francine

**courseInstructor** Table
course «PK»	instructor
CP102	N. Pavlova
HP202	S. Zeller
CP363	D. Brown
CP400	T. Yang

**studentCourse** Table
studentId «PK»	course «PK»	grade
8245	CP102	B-
8245	HP202	C+
9995	CP363	A-
9995	CP400	B+
9995	CP102	A

Note here the huge potential for reducing data redundancy. Now each combination of course and instructor is given once rather than once for every student in the course. Similarly a student's name appears only once.

The opportunity for data corruption is also greatly reduced. Now a student's name appears in one place only. An instructor's name can still appear against many courses, but that could be fixed by giving each instructor an ID number and storing their name separately, as with students.

The UML schema:

Clearly having courses with multiple sections in different terms requires a more complex design, but the approach of looking at the functional dependency violations and fixing them applies no matter how complex the real world context is.

Third Normal Form (3NF)

The relation must be in 2NF.
Every non-prime attribute of the relation is non-transitively dependent on the candidate keys of the relation.

Non-key attributes should not be functionally determined by other non-key attributes. If \(Y→A\), and \(Y\) is not a key, then the relation is not in 3NF. (If \(Y\) is a partial key then 2NF is violated).

Restated, a relation \(R\) is in Third Normal Form if it is in 2NF and every non-key attribute of \(R\) is non-transitively dependent on each candidate key of \(R\).

A transitive functional dependency implies that when \(X→Y\) and \(Y→A\) then \(X→A\) where \(X\) is a key and \(Y\) is a non-key.

Fixed by decomposition into relations where the non-key attributes that determined the other non-key attributes become keys.

Table Not in 3NF

The following violates 3NF because at least one of the non-prime attributes is dependent upon another non-prime attribute:

**studentCourse** Table
studentId «PK»	course «PK»	grade	gpa
8245	CP102	B-	7
8245	HP202	C+	6
9995	CP363	A-	10
9995	CP400	B+	9
9995	CP102	A	11

(Note that WLU uses a 12 point GPA system.)

The functional dependencies for grade and gpa are:
grade→gpa
gpa→grade

Both grade and gpa are non-prime attributes because they are not part of any candidate keys - they do not contribute to the uniqueness of a tuple. Thus 3NF is violated because a non-prime attribute (either grade and gpa) is dependent upon a non-prime attribute (either grade and gpa).

Not all violations of 3NF involve functional dependencies that are \(X→Y\) and \(Y→X\) as we have in this example. It is sufficient if only \(X→Y\) and both \(X\) and \(Y\) are non-prime attributes.

Because 3NF is violated this table is subject to update anomalies. What happens if a student receives a grade of A+ but a GPA of 0? Which is the correct mark? (If you are a student the answer is obvious, of course.)

The relation’s functional dependencies are:

{course,studentId}→grade
{course,studentId}→gpa
grade→gpa
gpa→grade

Thus these transitive dependencies are also true:

{course,studentId}→grade→gpa
{course,studentId}→gpa→grade

Having discovered what the functional dependencies are and the fact that we have at least one transitive dependency forces us to fix the table. We can use these functional dependencies to guide our decomposition. Note that we may have to make a choice as to how to decompose a table. Sometimes this choice is arbitrary.

In a general sense, a table or relation should contain information about only one thing, where that 'thing' is determined by the primary key. According to Kent (1983) all facts in the relation are related to the key:
[every] non-key [attribute] must provide a fact about the key, the whole key, and nothing but the key.
A transitive dependency tells you that some of the attributes in a tuple are not about the key, but about something else, and the relation should be decomposed in such a way that each table has attributes only about the key.

3NF Violation - Fixed

The original relation can be decomposed into:

**studentCourse** Table
studentId «PK»	course «PK»	gpa
8245	CP102	7
8245	HP202	6
9995	CP363	10
9995	CP400	9
9995	CP102	11

**gpaGrade** Table
gpa «PK»	grade
12	A+
11	A
10	A-
9	B+
8	B

The UML schema:

This decomposition removes the update anomaly problem. Now grades appear in one place only and a simple join is enough to connect them back to a student's course.

An equally valid decomposition involves a table where grade is the primary key and gpa is the dependent attribute. The choice in this case is arbitrary.

Boyce-Codd Normal Form (BCNF)

Raymond F. Boyce and Edgar F. Codd developed this form in 1974. Codd developed the first three normal forms in the 1960s and published his seminal paper in 1970.

BCNF is a more restrictive version of 3NF.

The table must be in 3NF.
Every non-trivial functional dependency must be a dependency on a superkey. (A an example of a trivial functional is \(A→A\).)

Every relation in BCNF is also in 3NF, but the reverse is not necessarily true. 3NF allows attributes to be part of a candidate key that is not the primary key - BCNF does not.

Most relations that are in 3NF are also in BCNF, but not all. The differences are subtle, but important. It might be easier to say what is not in BCNF: A relation is not in BCNF if

the candidate keys in the relation are composite keys (that is, they are not single attribute keys)
there is more than one candidate key in the relation
the keys overlap, that is, some attributes in the keys are common

Table Not in BCNF

The following is in 3NF but not in BCNF:

**studentTutor** Table
studentId	tutorId	tutorSin
8245	1254	000 737 999
9995	1567	000 656 999
5643	1254	000 737 999
6179	1742	000 651 999

Note that in the semantics for this data each student may have only one tutor, but each tutor may have many students.

This table is subject to insertion anomalies as both the Tutor ID and SIN must be entered whenever a tutor-student pair is entered. Here again, common sense tells us that this table is not well designed, even if we don't immediately recognize that it violates BCNF.

Candidate keys are: {studentId,tutorId} and {studentId,tutorSin}
tutorId→tutorSin and tutorSin→tutorId, but because both tutorId and tutorSin are prime attributes (i.e. they are both potentially part of a primary key as in the first point) these FDs do not violate 3NF (which is concerned with non-key attributes).
Neither tutorId nor tutorSin alone are superkeys, and thus BCNF is violated.

Both tutorId and tutorSin are prime attributes because they are part of candidate keys. The use of both attributes in this table violates BCNF. The fix is to use only one of these two attributes. As noted before, BCNF is a stronger version of 3NF because it restricts both prime and non-prime attributes, while 3NF only restricts non-prime attributes.

BCNF Violation - Fixed

The original relation is decomposed into:

**studentTutor** Table
studentId «PK»	tutorId «PK»
8245	1254
9995	1567
5643	1254
6179	1742

**tutor** Table
tutorId «PK»	tutorSin
1254	000 737 999
1567	000 656 999
1742	000 651 999

The UML schema:

Why this decomposition over the other possible decomposition, where student IDs are matched against tutor SINs? Here outside considerations help in our decision. SINs are treated specially and should be provided extra protection and privacy. The best way to do this is to minimize their use. Given a choice, then, between the two decompositions, the one that minimizes the use of SINs is the best.

Again, update and insertion anomalies have been removed.

Normalization Algorithm

Suppose that \(R\) is not in BCNF. Let \(X\) be a subset of \(R\), \(A\) a single attribute in \(R\), and \(X→A\) be a non-trivial FD that violates BCNF. Decompose \(R\) into \(R-A\) and \(XA\).
If either \(R-A\) or \(XA\) are not in BCNF, repeat step 1 with the violating relation(s).

This simple recursive algorithm produces BCNF compliant relations every time. Of course, databases development doesn't normally start out with one huge relation of all the attributes in the database. The semantics of the data point the way to how relations should be organized. When semantics are unclear or a developer is simply unsure of where to go next, this algorithm can help the process.

Relation Decomposition

How do you determine which decomposition to use?

“Each attribute must describe the key, the whole key, and nothing but the key.” – Kent (1983) William Kent, A Simple Guide to Five Normal Forms in Relational Database Theory, Communications of the ACM 26(2), Feb. 1983.
Often common sense or external rules can provide a guide to decomposition.

Kent is saying that a relation should be centred around one 'fact', and that all attributes in a relation should be connected to that fact. The key provides the foundation for this fact. That a student has a tutor is a fact (and a key). A tutor's SIN is not part of that 'fact'.

Fourth Normal Form

An attribute may multi-determine another (\(X↠Y\)). For example, a student (identified by studentId) may take many courses (identified by course). Thus studentId↠course. This is called a multi-valued dependency.

A table violates fourth normal form when it contains two or more independent multi-valued dependencies.

Table Not in 4NF

For example, assume that a student may take more than one course and have more than one major. Thus student has two multi-valued dependencies studentId↠course and studentId↠major. 4NF is violated by the following table:

**studentCourseMajor** Table
studentId «PK»	course «PK»	major «PK»
985235410	PC100	Physics
985235410	PC100	Computing
985235410	CP363	Physics
985235410	CP363	Computing
985235410	EN228	Physics
985235410	EN228	Computing

Note that BCNF is not violated because there are no functional dependencies in this table. In fact the primary key for this table is made up of all three attributes, and thus there are no non-prime attributes.

This table contains two independent multi-valued dependencies. course and major are each multi-determined by studentId, but are independent of each other. In such a case every possible combination of studentId, course, and major must be included in the table to make sure that no data is left out. (Imagine if a student was taking five courses and had two majors - matching each major to all courses is the only way to insure consistency of data). This table is just asking for data anomalies.

4NF Violation - Fixed

This relation can be solved by decomposing the table into two separate multi-determined relationships:

**studentMajor** Table
studentId «PK»	major «PK»
985235410	Computing
985235410	Physics

and

**studentCourse** Table
studentId «PK»	course «PK»
985235410	CP363
985235410	PC100
985235410	EN228

As in the original table, the primary keys for each of these tables consists of all the attributes in the tables.

We will not go beyond 4NF in our discussions.

Lesson 9: Normalization

Introduction

What is Normalization?

Why Normalize?

How Do We Normalize?

Normal Forms

First Normal Form (1NF)

Second Normal Form (2NF)

Third Normal Form (3NF)

Boyce-Codd Normal Form (BCNF)

Normalization Algorithm

Relation Decomposition

Fourth Normal Form