It is the process of putting relations in a database into a normal form.
More generally it is a design technique for structuring database relations.
To avoid update, insertion, and deletion anomalies.
To minimize redundancy.
Remember that these three anomalies can corrupt the contents of a database. (We sill see some examples momentarily.) By reducing data duplication and eliminating these anomalies, we intend to minimize the possibility of data corruption and simplify the development, maintenance, and expandability of the database.
Analyze relation schemas based upon their functional dependencies and primary keys.
Relations that do not meet normal form tests must be decomposed into multiple relations that individually satisfy these tests.
It is important to remember that normalization techniques are based not only on algorithms, but on the semantics of the data in the database. If we do not understand the data domains and relationships we cannot perform proper normalizations.
A normalized schema should have the properties:
The following normal forms are listed from least to most normalized:
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Normal Forms 1 through 3 are based upon primary keys. Every relation is assumed to have a primary key made up of one or more attributes.
Reminder: A superkey is a set of attributes that uniquely identify a tuple in a relation. A key is a minimal superkey, i.e. a superkey with superfluous attributes removed. Possible keys are called candidate keys. A key chosen from a set of candidates is a primary key.
Fixed by decomposition into a series of single-valued relations. Removes redundancy, and is general, placing no limit on number of values.
It is the nature of a relational database management system to allow only atomic values in an attribute - if an attribute is defined as integer, then only a single integer could be stored in such a field. Thus it is difficult, though possible, to violate 1NF with such a database system. Typically it can be accomplished by misusing strings, as in the following:
studentId | courses |
---|---|
987859400 | {CP466, HP202, CP102} |
999568440 | {CP363, CP400} |
This data clearly demonstrates its semantics - one student may take many courses and each course may have many students.
You must deliberately set aside enough space to store multiple course names. Setting aside only 5 characters forces you to store only one course name at a time.
Imagine that you wish to determine the total enrolment in all computing courses, i.e. courses that start with 'CP':
SELECT COUNT(studentId) FROM ...
WHERE courses LIKE '%CP%'
This returns a value of 2 rather than a correct value of 4 because
each tuple is counted only once. A proper counting requires a much
more complex SELECT
statement.
The following is an interesting, but incorrect, attempt at a solution:
studentId | course1 | course2 | course3 |
---|---|---|---|
987859400 | CP466 | HP202 | CP102 |
999568440 | CP363 | CP400 | NULL |
The count from the previous example is now:
SELECT COUNT(studentId) FROM ...
WHERE course1 LIKE 'CP%' OR course2 LIKE 'CP%' OR course3 LIKE 'CP%'
and this still does not give you the right answer. Furthermore, updating means finding the first free attribute of the three - which is non-trivial to put it mildly - and what happens when a student takes four courses?
We can fix this by defining the table to contain only atomic values, and repeating key data as necessary:
studentId «PK» | course «PK» |
---|---|
987859400 | CP466 |
987859400 | HP202 |
987859400 | CP102 |
999568440 | CP363 |
999568440 | CP400 |
This seems to have introduced some redundancy, but it now means that our enrolment count statement works properly since each tuple is counted once and only once.
SELECT COUNT(studentId) FROM ...
WHERE course LIKE 'CP%'
We'll see how to reduce redundancy with other normal forms.
Note that the semantics of the data are preserved: one student may still take many courses and one course may have many students. If the data semantics are violated by a normalization procedure then we have done something wrong.
Applies only to primary keys with multiple attributes.
Based upon full functional dependency. A dependency is full functional if when an attribute is removed from a key, the dependency is destroyed. If the dependency remains, this implies the key is not minimal.
i.e. a relation is 2NF if every non-key attribute of a relation is fully functionally dependent on the primary key of the relation. If an attribute is dependent only on a partial key, it is not 2NF. (You cannot have a partial key if the key consists of only a single attribute.)
A prime attribute is a member of a candidate key. An attribute is non-prime if it is not a member of an candidate key. (Non-prime attributes are also referred to as non-key attributes).
studentId «PK» | course «PK» | surname | forename | instructor | grade |
---|---|---|---|---|---|
8245 | CP102 | Kumar | Vijay | N. Pavlova | B- |
8245 | HP202 | Kumar | Vijay | S. Zeller | C+ |
9995 | CP363 | Soult | Francine | D. Brown | A- |
9995 | CP400 | Soult | Francine | T. Yang | B+ |
9995 | CP102 | Soult | Francine | N. Pavlova | A |
This relation is ripe for generating anomalies. Whenever a student is
added to a course the student's name as well as studentId
must be added to the relation. If it is added incorrectly then we have
one studentId with multiple names attached to it.
Similar things can happen with the combination of instructor and
course. When a tuple is deleted from the relation, if the last student
for that course is removed then the course is entirely removed from
the database unless we make a point of storing NULL
data
for the removed student. This is awkward and ugly. Common sense, and
normalization techniques, tell us there must be a better way.
Note: NULL
values for the grade before the final grade is
assigned are perfectly valid.
We have defined {studentId,course} as the primary key for this table, and indeed as a key it uniquely identifies each tuple. However, the tuples above violate 2nd Normal Form. The non-key attributes surname and forename are only partially dependent on the primary key since the attribute course can be removed from the primary key, and the dependency still holds. i.e. surname and forename are fully functionally dependent only on studentId. Similarly, the non-key attribute instructor can be identified by the partial key {course}. Only the grade attribute is fully functionally dependent on the entire primary key, i.e. {studentId,course}→grade.
The relation’s functional dependencies are:
{studentId,course}→grade
studentId→surname,forename
course→instructor
surname, forename, and instructor are functionally dependent upon only partial keys and thus 2NF is violated.
These functional dependencies arise out of the semantics of the data in the relation. We know that students have names as well as numbers and that although a name may be repeated (David Brown is fairly popular, for example), student IDs are never repeated. If we do not understand the semantics of the data we cannot determine the function dependencies.
This violation of 2NF can be fixed by decomposition into relations where all attributes are fully functionally determined. In this particular example, this table could be decomposed into three tables:
studentId «PK» | surname | forename |
---|---|---|
8245 | Kumar | Vijay |
9995 | Soult | Francine |
course «PK» | instructor |
---|---|
CP102 | N. Pavlova |
HP202 | S. Zeller |
CP363 | D. Brown |
CP400 | T. Yang |
studentId «PK» | course «PK» | grade |
---|---|---|
8245 | CP102 | B- |
8245 | HP202 | C+ |
9995 | CP363 | A- |
9995 | CP400 | B+ |
9995 | CP102 | A |
Note here the huge potential for reducing data redundancy. Now each combination of course and instructor is given once rather than once for every student in the course. Similarly a student's name appears only once.
The opportunity for data corruption is also greatly reduced. Now a student's name appears in one place only. An instructor's name can still appear against many courses, but that could be fixed by giving each instructor an ID number and storing their name separately, as with students.
The UML schema:
Clearly having courses with multiple sections in different terms requires a more complex design, but the approach of looking at the functional dependency violations and fixing them applies no matter how complex the real world context is.
Non-key attributes should not be functionally determined by other non-key attributes. If \(Y→A\), and \(Y\) is not a key, then the relation is not in 3NF. (If \(Y\) is a partial key then 2NF is violated).
Restated, a relation \(R\) is in Third Normal Form if it is in 2NF and every non-key attribute of \(R\) is non-transitively dependent on each candidate key of \(R\).
A transitive functional dependency implies that when \(X→Y\) and \(Y→A\) then \(X→A\) where \(X\) is a key and \(Y\) is a non-key.
Fixed by decomposition into relations where the non-key attributes that determined the other non-key attributes become keys.
The following violates 3NF because at least one of the non-prime attributes is dependent upon another non-prime attribute:
studentId «PK» | course «PK» | grade | gpa |
---|---|---|---|
8245 | CP102 | B- | 7 |
8245 | HP202 | C+ | 6 |
9995 | CP363 | A- | 10 |
9995 | CP400 | B+ | 9 |
9995 | CP102 | A | 11 |
(Note that WLU uses a 12 point GPA system.)
The functional dependencies for grade and gpa
are:
grade→gpa
gpa→grade
Both grade and gpa are non-prime attributes because they are not part of any candidate keys - they do not contribute to the uniqueness of a tuple. Thus 3NF is violated because a non-prime attribute (either grade and gpa) is dependent upon a non-prime attribute (either grade and gpa).
Not all violations of 3NF involve functional dependencies that are \(X→Y\) and \(Y→X\) as we have in this example. It is sufficient if only \(X→Y\) and both \(X\) and \(Y\) are non-prime attributes.
Because 3NF is violated this table is subject to update anomalies. What happens if a student receives a grade of A+ but a GPA of 0? Which is the correct mark? (If you are a student the answer is obvious, of course.)
The relation’s functional dependencies are:
{course,studentId}→grade
{course,studentId}→gpa
grade→gpa
gpa→grade
Thus these transitive dependencies are also true:
{course,studentId}→grade→gpa
{course,studentId}→gpa→grade
Having discovered what the functional dependencies are and the fact that we have at least one transitive dependency forces us to fix the table. We can use these functional dependencies to guide our decomposition. Note that we may have to make a choice as to how to decompose a table. Sometimes this choice is arbitrary.
In a general sense, a table or relation should contain information about
only one thing, where that 'thing' is determined by the primary key.
According to Kent (1983) all facts in the relation are related to the
key:
[every] non-key [attribute] must provide a fact about the
key, the whole key, and nothing but the key
.
A transitive
dependency tells you that some of the attributes in a tuple are not
about the key, but about something else, and the relation should be
decomposed in such a way that each table has attributes only about the
key.
The original relation can be decomposed into:
studentId «PK» | course «PK» | gpa |
---|---|---|
8245 | CP102 | 7 |
8245 | HP202 | 6 |
9995 | CP363 | 10 |
9995 | CP400 | 9 |
9995 | CP102 | 11 |
gpa «PK» | grade |
---|---|
12 | A+ |
11 | A |
10 | A- |
9 | B+ |
8 | B |
The UML schema:
This decomposition removes the update anomaly problem. Now grades appear in one place only and a simple join is enough to connect them back to a student's course.
An equally valid decomposition involves a table where grade is the primary key and gpa is the dependent attribute. The choice in this case is arbitrary.
Raymond F. Boyce and Edgar F. Codd developed this form in 1974. Codd developed the first three normal forms in the 1960s and published his seminal paper in 1970.
BCNF is a more restrictive version of 3NF.
Every relation in BCNF is also in 3NF, but the reverse is not necessarily true. 3NF allows attributes to be part of a candidate key that is not the primary key - BCNF does not.
Most relations that are in 3NF are also in BCNF, but not all. The differences are subtle, but important. It might be easier to say what is not in BCNF: A relation is not in BCNF if
The following is in 3NF but not in BCNF:
studentId | tutorId | tutorSin |
---|---|---|
8245 | 1254 | 000 737 999 |
9995 | 1567 | 000 656 999 |
5643 | 1254 | 000 737 999 |
6179 | 1742 | 000 651 999 |
Note that in the semantics for this data each student may have only one tutor, but each tutor may have many students.
This table is subject to insertion anomalies as both the Tutor ID and SIN must be entered whenever a tutor-student pair is entered. Here again, common sense tells us that this table is not well designed, even if we don't immediately recognize that it violates BCNF.
Both tutorId and tutorSin are prime attributes because they are part of candidate keys. The use of both attributes in this table violates BCNF. The fix is to use only one of these two attributes. As noted before, BCNF is a stronger version of 3NF because it restricts both prime and non-prime attributes, while 3NF only restricts non-prime attributes.
The original relation is decomposed into:
studentId «PK» | tutorId «PK» |
---|---|
8245 | 1254 |
9995 | 1567 |
5643 | 1254 |
6179 | 1742 |
tutorId «PK» | tutorSin |
---|---|
1254 | 000 737 999 |
1567 | 000 656 999 |
1742 | 000 651 999 |
The UML schema:
Why this decomposition over the other possible decomposition, where student IDs are matched against tutor SINs? Here outside considerations help in our decision. SINs are treated specially and should be provided extra protection and privacy. The best way to do this is to minimize their use. Given a choice, then, between the two decompositions, the one that minimizes the use of SINs is the best.
Again, update and insertion anomalies have been removed.
Suppose that \(R\) is not in BCNF. Let \(X\) be a subset of \(R\), \(A\) a single attribute in \(R\), and \(X→A\) be a non-trivial FD that violates BCNF. Decompose \(R\) into \(R-A\) and \(XA\).
If either \(R-A\) or \(XA\) are not in BCNF, repeat step 1 with the violating relation(s).
This simple recursive algorithm produces BCNF compliant relations every time. Of course, databases development doesn't normally start out with one huge relation of all the attributes in the database. The semantics of the data point the way to how relations should be organized. When semantics are unclear or a developer is simply unsure of where to go next, this algorithm can help the process.
How do you determine which decomposition to use?
“Each attribute must describe the key, the whole key, and nothing but the key.” – Kent (1983) William Kent, A Simple Guide to Five Normal Forms in Relational Database Theory, Communications of the ACM 26(2), Feb. 1983.
Often common sense or external rules can provide a guide to decomposition.
Kent is saying that a relation should be centred around one 'fact', and that all attributes in a relation should be connected to that fact. The key provides the foundation for this fact. That a student has a tutor is a fact (and a key). A tutor's SIN is not part of that 'fact'.
An attribute may multi-determine another (\(X↠Y\)). For example, a student (identified by studentId) may take many courses (identified by course). Thus studentId↠course. This is called a multi-valued dependency.
A table violates fourth normal form when it contains two or more independent multi-valued dependencies.
For example, assume that a student may take more than one course and have more than one major. Thus student has two multi-valued dependencies studentId↠course and studentId↠major. 4NF is violated by the following table:
studentId «PK» | course «PK» | major «PK» |
---|---|---|
985235410 | PC100 | Physics |
985235410 | PC100 | Computing |
985235410 | CP363 | Physics |
985235410 | CP363 | Computing |
985235410 | EN228 | Physics |
985235410 | EN228 | Computing |
Note that BCNF is not violated because there are no functional dependencies in this table. In fact the primary key for this table is made up of all three attributes, and thus there are no non-prime attributes.
This table contains two independent multi-valued dependencies. course and major are each multi-determined by studentId, but are independent of each other. In such a case every possible combination of studentId, course, and major must be included in the table to make sure that no data is left out. (Imagine if a student was taking five courses and had two majors - matching each major to all courses is the only way to insure consistency of data). This table is just asking for data anomalies.
studentId «PK» | major «PK» |
---|---|
985235410 | Computing |
985235410 | Physics |
and
studentId «PK» | course «PK» |
---|---|
985235410 | CP363 |
985235410 | PC100 |
985235410 | EN228 |
As in the original table, the primary keys for each of these tables consists of all the attributes in the tables.
We will not go beyond 4NF in our discussions.