Joins create new virtual (or composite) tables that include
information from multiple tables. In SQL you specify which tables
contain the information you need using the FROM
clause. You
must combine the matching attributes of each table. You control how
attributes are matched in the composite table either by specifying a
particular type of join operation or by using the ON
clause.
In a relational database management system relationships among data values are left unstated in the database definition. They become explicit when you query the data, not when you create it. You can ask any question that comes to mind about the data stored in the database, regardless of what your intentions were when the database was set up. (The database organization may have some effect on how you frame such questions - certain questions may be easier or harder to state in SQL depending on the database organization). Joining different tables allows you to explore relationships among diverse data.
Joins normally appear within SELECT
statements. The desired
attributes appear after the SELECT
, the tables and aliases
appear in the FROM
clause, and the joining conditions can
be given in the ON
clause. If two tables share attribute
names, the attribute names must be qualified by the appropriate table
name or alias, as in this example:
SELECT memberSurname, memberForename, pubTitle
FROM member INNER JOIN pub
ON memberId = pubMemberId
Here, member and pub are joined by
their common attributes, memberId and pubMemberId.
This join results in the listing of all members' names and the titles of
their publications. Members without publications are not listed. (INNER
has a special meaning that will be dealt with later.)
This query could be written without the JOIN ON
clause by
using a WHERE
clause instead:
SELECT memberSurname, memberForename, pubTitle
FROM member, pub
WHERE memberId = pubMemberId
Although this statement produces the same result set as the previous one
using the JOIN ON
clause, the ON
keyword
isolates the join constraints and makes your joined statement easier to
read. For example, consider:
SELECT memberSurname, memberForename, pubTitle
FROM member INNER JOIN pub
ON memberId = pubMemberId
WHERE pubPubType = 'B'
versus:
SELECT memberSurname, memberForename, pubTitle
FROM member, pub
WHERE memberId = pubMemberId
AND pubPubType = 'B'
Although again, both statements produce the same result set, the first
explicitly indicates that two tables are being joined and that a
condition is being applied to the result of that join. The second throws
the join and the condition together as part of the WHERE
clause.
Attributes being joined must have the same or compatible data types. (In the previous example the member ID attributes of both tables are clearly of the same type because they have a primary key - foreign key relationship.)
Joins can be categorized according to how they relate tables. We will examine Key joins, Inner joins, Outer joins, Multiple table joins, and Self joins.
Key joins take advantage of the foreign key relationship between two tables. Such queries are optimized by the DBMS engine. (Although query optimization is not a topic of this course, it is useful to know as a rule of thumb that joins made between tables related by primary and foreign keys are more efficient, and thus execute faster, than joins made between tables that do not have such a relationship.) For example:
SELECT pubTypeDesc, pubTitle, pubPublisher
FROM pub INNER JOIN pubType
ON pubPubType = pubType
The tables pub and pubType have a foreign key relationship based upon the attributes pubPubType (the foreign key for its table) and pubType (the primary key for its table) respectively. This query executes faster than if the foreign key relationship had not been made explicit. In a good database design, all tables that are expected to be joined should have their foreign key relationships explicitly defined.
Note that pubType is both a table name and the name of an attribute of that table - this is allowed in SQL.
On an inner join tuples appear in the result set only if both tables contain at least one row that satisfies the join condition.
The following tables illustrate the results of an inner join:
studentId | surname | forename |
---|---|---|
815686230 | Brown | David |
999568440 | Snord | Cranston |
987859400 | Zzap | Zachary |
studentId | course |
---|---|
999568440 | CP363 |
999568440 | CP102 |
987859400 | CP363 |
987859400 | HP202 |
Assuming that these two tables have a foreign key relationship based
upon the studentId attributes, the following SELECT
statement joins the two tables:
SELECT surname, forename, s.studentId, course
FROM student AS s INNER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
produces the following:
surname | forename | studentId | course |
---|---|---|---|
Snord | Cranston | 999568440 | CP363 |
Snord | Cranston | 999568440 | CP102 |
Zzap | Zachary | 987859400 | CP363 |
Zzap | Zachary | 987859400 | HP202 |
Note that none of the data from the student table pertaining to 'David Brown' appears in this result set. The studentId attribute is common to both the student and studentCourse table, but 'David Brown' s ID 815686230 does not appear anywhere in the studentCourse table. Because on an inner join tuples appear in the result set only if both tables contain at least one row that satisfies the join condition, there can be no join based on the attribute value 815686230 .
The same principle applies across more than two tables. In a multi-table join, all tables must have at least one tuple that satisfies the result condition in order for a matching tuple to appear in the result set. If a third table is added to the previous join:
course | lab |
---|---|
CP102 | Tu, Th 4:00 - 5:00 |
with the join:
SELECT surname, forename, s.studentId, cl.course, lab
FROM student AS s
INNER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
INNER JOIN courseLab AS cl
ON sc.course = cl.course
The result is:
surname | forename | studentId | course | lab |
---|---|---|---|---|
Snord | Cranston | 999568440 | CP102 | Tu, Th 4:00 - 5:00 |
There is only one tuple in the final result set because only 'CP102' has tuples in all three tables, i.e. a student, a course, and a lab.
An outer join contains tuples whether or not a tuple exists in
the opposite table to satisfy the join condition. The LEFT
and RIGHT
keywords identify the table that is to appear in
its entirety. A LEFT OUTER JOIN
contains every tuple in the
left-hand table, while a RIGHT OUTER JOIN
contains
every tuple in the right-hand table. A FULL OUTER
JOIN
contains every tuple in both the left- and right-hand tables.
Attributes in the query result set that do not have values from tuples
in the left or right tables contain NULL
. Using the
previous examples, altered to be an outer join:
SELECT surname, forename, s.studentId, course
FROM student AS s LEFT OUTER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
produces the following:
surname | forename | studentId | course |
---|---|---|---|
Brown | David | 815686230 | NULL |
Snord | Cranston | 999568440 | CP363 |
Snord | Cranston | 999568440 | CP102 |
Zzap | Zachary | 987859400 | CP363 |
Zzap | Zachary | 987859400 | HP202 |
'David Brown'
now appears in the result set, but the matching course
attribute is NULL
, as there were no matching values in the
studentCourse table.
If the LEFT OUTER JOIN
phrase was replaced by a RIGHT
OUTER JOIN
phrase, the result set is identical to the result set for
the inner join example - the table studentCourse does
not contain any values in the studentId attribute that
do not have matching values in the student table. Thus,
none of the attributes in the result set have a NULL
value.
Outer joins can be especially useful when grouping and counting occurrences of values, but its use poses subtle dangers. Assume you wish to find out how many courses each student is taking. The inner join in this query:
SELECT surname, forename, COUNT(*) AS courseCount
FROM student AS s INNER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
GROUP BY surname, forename
ORDER BY surname, forename
produces the following:
surname | forename | courseCount |
---|---|---|
Snord | Cranston | 2 |
Zzap | Zachary | 2 |
Unfortunately, 'David Brown' is left out entirely from this result set. This is expected, because as described in the inner join example there is no matching tuple for 'David Brown' in the studentCourse table. In order to include 'David Brown' in the result set you must use an outer join. The query:
SELECT surname, forename, COUNT(*) AS courseCount
FROM student AS s LEFT OUTER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
GROUP BY surname, forename
ORDER BY surname, forename
produces the following:
surname | forename | courseCount |
---|---|---|
Brown | David | 1 |
Snord | Cranston | 2 |
Zzap | Zachary | 2 |
This is not quite what was desired, as it implies that
'David Brown'
is taking one course when he is actually taking none. The problem lies
not in the OUTER JOIN
phrase, but in the COUNT(*)
phrase. The COUNT()
function counts all non-NULL
values in a result set. Using '*
' as its parameter is
simply asking COUNT()
to return the number of tuples that
contain any non-NULL attributes. Since the surname
and forename attributes are non-NULL
, the
resulting single tuple for
'David Brown'
provides a count of 1.
In order to produce the desired result the first line of the query must be replaced by:
SELECT surname, forename, COUNT(course) as courseCount ...
producing the following:
surname | forename | courseCount |
---|---|---|
Brown | David | 0 |
Snord | Cranston | 2 |
Zzap | Zachary | 2 |
The phrase COUNT(course)
asks the COUNT()
function to return the number of non-NULL
occurrences of
the attribute course in its result set. The first outer
join example showed that the only resulting tuple for
'David Brown'
contains a NULL
value for course. Thus COUNT(course)
returns 0 for this tuple.
You may join more than two tables in a SELECT
statement and
mix different types of joins in that SELECT
statement.
These joined tables are normally listed sequentially, and can be thought
of as going from left to right in the order they are listed, in the same
way that outer join statements are considered to be either left or
right. (See the example of the courseLab table under Inner
Joins.)
In addition to the student and studentCourse tables of the previous examples, assume that a course table also exists:
course | name |
---|---|
CP363 | Database I |
CP102 | Intro to Microcomputers |
HP202 | Science in the Modern World |
BW101 | Intro to Basket Weaving |
In order to generate a list of student names and the names of the courses they are taking, the following query suffices:
SELECT surname, forename, name
FROM student AS s
INNER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
INNER JOIN course AS c
ON sc.course = c.course
ORDER BY surname, forename, name
which produces the following:
surname | forename | name |
---|---|---|
Snord | Cranston | Database I |
Snord | Cranston | Intro to Microcomputers |
Zzap | Zachary | Database I |
Zzap | Zachary | Science in the Modern World |
(In this example the table studentCourse acts as a look-up table between student and course.
You can mix two different outer joins, but the query design needs careful thought:
SELECT surname, forename, Name
FROM student AS s
LEFT OUTER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
RIGHT OUTER JOIN course AS c
ON sc.course = c.course
ORDER BY surname, forename, Name
produces the following:
surname | forename | Name |
---|---|---|
NULL | NULL | Intro to Basket Weaving |
Snord | Cranston | Database I |
Snord | Cranston | Intro to Microcomputers |
Zzap | Zachary | Database I |
Zzap | Zachary | Science in the Modern World |
Note that 'David Brown' is still missing. What is important here is the order of the joins. When student and studentCourse are related by a left outer join, 'David Brown' appears. When the right outer join between the results of the previous left outer join and course is performed, the tuple containing 'David Brown' is dropped, as it is part of the left 'table' (the left result set) and not the right table. To get the tuple containing 'David Brown' without any course information, it is necessary to rewrite the previous example (note the parentheses - they change the order in which the joins are executed):
SELECT surname, forename, Name
FROM student AS s
LEFT OUTER JOIN (studentCourse AS sc
ON s.studentId = sc.studentId
RIGHT OUTER JOIN course AS c)
ON sc.course = c.course
ORDER BY surname, forename, Name
produces the following:
surname | forename | Name |
---|---|---|
Brown | David | NULL |
Snord | Cranston | Database I |
Snord | Cranston | Intro to Microcomputers |
Zzap | Zachary | Database I |
Zzap | Zachary | Science in the Modern World |
The same result could have been obtained by starting with the course table and moving 'left' to the student table.
Note, however, that the tuple containing 'Intro to Basket Weaving' is now missing, for the same reasons (reversed) that 'David Brown' was missing in the previous result set. In order to get both 'David Brown' and 'Intro to Basket Weaving' into the final result set needs a full outer join:
SELECT surname, forename, Name
FROM student AS s
FULL OUTER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
FULL OUTER JOIN course AS c
ON sc.course = c.course
ORDER BY surname, forename, Name
This produces:
surname | forename | Name |
---|---|---|
Brown | David | NULL |
Snord | Cranston | Database I |
Snord | Cranston | Intro to Microcomputers |
Zzap | Zachary | Database I |
Zzap | Zachary | Science in the Modern World |
NULL | NULL | Intro to Basket Weaving |
One last note of caution about outer joins - the attributes you choose for projection are extremely important. The following query:
SELECT sc.studentId, sc.name
FROM student AS s
LEFT OUTER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
RIGHT OUTER JOIN course AS c
ON sc.name = c.name
produces the following:
studentId | name |
---|---|
NULL | NULL |
999568440 | CP363 |
999568440 | CP102 |
987859400 | CP363 |
987859400 | HP202 |
This differs from the query:
SELECT s.studentId, c.name
FROM student AS s
LEFT OUTER JOIN studentCourse AS sc
ON s.studentId = sc.studentId
RIGHT OUTER JOIN course AS c
ON sc.name = c.name
which produces the following:
studentId | name |
---|---|
NULL | BW101 |
999568440 | CP363 |
999568440 | CP102 |
987859400 | CP363 |
987859400 | HP202 |
The first query chooses its attributes from the look-up table studentCourse
(as sc), which has NULL
values for some of
the tuples in the outer joins. The second query uses the attributes of
the tables on the 'outside' of the joins, and at least one of these
attributes has values in the result set.
A more complex join involves a series of tables centered around a single table. Imagine the following table relationships:
The table studentCourseTerm is at the center of a relationship amongst the tables student, course, and Term. It is a look-up table for these three other tables, and contains information about the courses a given student takes during any term. Assume that there are appropriate foreign key relationships between studentCourseTerm and the other three tables.
In order to generate a list of courses taken by all students during all terms, the query looks like:
SELECT s.surname, s.forename, c.name AS courseName, t.name AS termName
FROM studentCourseTerm AS sct
INNER JOIN student AS s
ON sct.studentId = s.studentId
INNER JOIN course AS c
ON sct.name = c.name
INNER JOIN term AS t
ON sct.termCode = t.termCode
ORDER BY s.surname, s.forename, c.name, t.name
Note the use of parentheses to group the joins together and the fact that the middle table (studentCourseTerm) is referenced only once.
Note also the aliasing required for c.name and t.name. The term and course tables both contain a name attribute, but these attributes refer to totally different data - there is certainly no primary/foreign key relationship implied between them. Using common names like name, description, or price in many different tables is often a recipe for confusion. Aliasing helps, but is no replacement for a thoughtful naming standard.
Joins can be performed against virtual, or derived tables, as in the following example:
SELECT s.surname, s.forename, course, repeats
FROM student AS s
INNER JOIN
(SELECT studentId, course, COUNT(*)
FROM studentCourseTerm
GROUP BY studentId, course)
AS courseRepeats (studentId, course, repeats)
ON s.studentId = courseRepeats.studentId
WHERE repeats > 1
This query returns a list of all students who have taken a course more than once. The subquery (the query in parentheses) creates a virtual table courseRepeats from studentCourse that is joined against student based upon their common studentId values. The subquery attributes (in this case course and repeats) are displayed as part of the result set.
This is covered in more depth in SQL Subqueries
A table that has a reflexive association (i.e. is joined to itself) may be used in a self join. Assume an employee table contains both an employee ID and a manager ID, where a manager is also part of the employee table:
The managerId is a foreign key to the empId primary key.
The employee table can be joined to itself giving it distinctive aliases, as in this example:
SELECT report.empName AS employeeName, manager.empName AS managerName, manager.department
FROM employee AS report INNER JOIN employee AS manager
ON report.managerId = manager.empId
report and manager are simply two different aliases of the same table, employee. Given this data:
empId | name | managerId | department |
---|---|---|---|
5 | David Brown | 3 | Physics & Computer Science |
24 | Heider Ali | 5 | Physics & Computer Science |
The self-join gives us:
employeeName | managerName | department |
---|---|---|
Heider Ali | David Brown | Physics & Computer Science |
Describing the details of how various SQL engines implement joins is beyond the scope of this course. It is worth noting, however, that providing indices on the columns to be joined improves performance immensely. In the previous examples creating individual indices on the attributes of the various look-up tables helps performance. (ex: on studentId and course in studentCourse.) These attributes, as primary keys in their 'owning' tables, are already indexed).