Lesson 6: Joins

Introduction

Joins create new virtual (or composite) tables that include information from multiple tables. In SQL you specify which tables contain the information you need using the FROM clause. You must combine the matching attributes of each table. You control how attributes are matched in the composite table either by specifying a particular type of join operation or by using the ON clause.

In a relational database management system relationships among data values are left unstated in the database definition. They become explicit when you query the data, not when you create it. You can ask any question that comes to mind about the data stored in the database, regardless of what your intentions were when the database was set up. (The database organization may have some effect on how you frame such questions - certain questions may be easier or harder to state in SQL depending on the database organization). Joining different tables allows you to explore relationships among diverse data.

Joins normally appear within SELECT statements. The desired attributes appear after the SELECT, the tables and aliases appear in the FROM clause, and the joining conditions can be given in the ON clause. If two tables share attribute names, the attribute names must be qualified by the appropriate table name or alias, as in this example:

Members and Their Publications
              
SELECT memberSurname, memberForename, pubTitle
  FROM member INNER JOIN pub
  ON memberId = pubMemberId

Here, member and pub are joined by their common attributes, memberId and pubMemberId. This join results in the listing of all members' names and the titles of their publications. Members without publications are not listed. (INNER has a special meaning that will be dealt with later.)

This query could be written without the JOIN ON clause by using a WHERE clause instead:

Members and Their Publications - No JOIN
              
SELECT memberSurname, memberForename, pubTitle
  FROM member, pub
  WHERE memberId = pubMemberId
  
  

Although this statement produces the same result set as the previous one using the JOIN ON clause, the ON keyword isolates the join constraints and makes your joined statement easier to read. For example, consider:

Members and Their Books
              
SELECT memberSurname, memberForename, pubTitle
  FROM member INNER JOIN pub
  ON memberId = pubMemberId
  WHERE pubPubType = 'B'

versus:

Members and Their Books - No JOIN
              
SELECT memberSurname, memberForename, pubTitle
  FROM member, pub
  WHERE memberId = pubMemberId
  AND pubPubType = 'B'

Although again, both statements produce the same result set, the first explicitly indicates that two tables are being joined and that a condition is being applied to the result of that join. The second throws the join and the condition together as part of the WHERE clause.

Attributes being joined must have the same or compatible data types. (In the previous example the member ID attributes of both tables are clearly of the same type because they have a primary key - foreign key relationship.)

Types of Joins

Joins can be categorized according to how they relate tables. We will examine Key joins, Inner joins, Outer joins, Multiple table joins, and Self joins.

Key Joins

Key joins take advantage of the foreign key relationship between two tables. Such queries are optimized by the DBMS engine. (Although query optimization is not a topic of this course, it is useful to know as a rule of thumb that joins made between tables related by primary and foreign keys are more efficient, and thus execute faster, than joins made between tables that do not have such a relationship.) For example:

Selecting Members and their Publications
              
SELECT pubTypeDesc, pubTitle, pubPublisher
  FROM pub INNER JOIN pubType
  ON pubPubType = pubType
        

The tables pub and pubType have a foreign key relationship based upon the attributes pubPubType (the foreign key for its table) and pubType (the primary key for its table) respectively. This query executes faster than if the foreign key relationship had not been made explicit. In a good database design, all tables that are expected to be joined should have their foreign key relationships explicitly defined.

Note that pubType is both a table name and the name of an attribute of that table - this is allowed in SQL.

Inner Joins

On an inner join tuples appear in the result set only if both tables contain at least one row that satisfies the join condition.

The following tables illustrate the results of an inner join:

student Table
studentId surname forename
815686230 Brown David
999568440 Snord Cranston
987859400 Zzap Zachary

studentCourse Table
studentId course
999568440 CP363
999568440 CP102
987859400 CP363
987859400 HP202

Assuming that these two tables have a foreign key relationship based upon the studentId attributes, the following SELECT statement joins the two tables:

Students and Courses
              
SELECT surname, forename, s.studentId, course
  FROM student AS s INNER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
        

produces the following:

Result Set
surname forename studentId course
Snord Cranston 999568440 CP363
Snord Cranston 999568440 CP102
Zzap Zachary 987859400 CP363
Zzap Zachary 987859400 HP202

Note that none of the data from the student table pertaining to 'David Brown' appears in this result set. The studentId attribute is common to both the student and studentCourse table, but 'David Brown' s ID 815686230 does not appear anywhere in the studentCourse table. Because on an inner join tuples appear in the result set only if both tables contain at least one row that satisfies the join condition, there can be no join based on the attribute value 815686230 .

The same principle applies across more than two tables. In a multi-table join, all tables must have at least one tuple that satisfies the result condition in order for a matching tuple to appear in the result set. If a third table is added to the previous join:

courseLab Table
course lab
CP102 Tu, Th 4:00 - 5:00

with the join:

Students and Labs
              
SELECT surname, forename, s.studentId, cl.course, lab
  FROM student AS s
  INNER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  INNER JOIN courseLab AS cl
  ON sc.course = cl.course
        

The result is:

Result Set
surname forename studentId course lab
Snord Cranston 999568440 CP102 Tu, Th 4:00 - 5:00

There is only one tuple in the final result set because only 'CP102' has tuples in all three tables, i.e. a student, a course, and a lab.

Outer Joins

An outer join contains tuples whether or not a tuple exists in the opposite table to satisfy the join condition. The LEFT and RIGHT keywords identify the table that is to appear in its entirety. A LEFT OUTER JOIN contains every tuple in the left-hand table, while a RIGHT OUTER JOIN contains every tuple in the right-hand table. A FULL OUTER JOIN contains every tuple in both the left- and right-hand tables. Attributes in the query result set that do not have values from tuples in the left or right tables contain NULL. Using the previous examples, altered to be an outer join:

Left Outer Join
              
SELECT surname, forename, s.studentId, course
  FROM student AS s LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
        

produces the following:

Result Set
surname forename studentId course
Brown David 815686230 NULL
Snord Cranston 999568440 CP363
Snord Cranston 999568440 CP102
Zzap Zachary 987859400 CP363
Zzap Zachary 987859400 HP202

'David Brown' now appears in the result set, but the matching course attribute is NULL, as there were no matching values in the studentCourse table.

If the LEFT OUTER JOIN phrase was replaced by a RIGHT OUTER JOIN phrase, the result set is identical to the result set for the inner join example - the table studentCourse does not contain any values in the studentId attribute that do not have matching values in the student table. Thus, none of the attributes in the result set have a NULL value.

Outer joins can be especially useful when grouping and counting occurrences of values, but its use poses subtle dangers. Assume you wish to find out how many courses each student is taking. The inner join in this query:

Join with Group By
              
SELECT surname, forename, COUNT(*) AS courseCount
  FROM student AS s INNER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  GROUP BY surname, forename
  ORDER BY surname, forename
        

produces the following:

Result Set
surname forename courseCount
Snord Cranston 2
Zzap Zachary 2

Unfortunately, 'David Brown' is left out entirely from this result set. This is expected, because as described in the inner join example there is no matching tuple for 'David Brown' in the studentCourse table. In order to include 'David Brown' in the result set you must use an outer join. The query:

Outer Join with Group By
              
SELECT surname, forename, COUNT(*) AS courseCount
  FROM student AS s LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  GROUP BY surname, forename
  ORDER BY surname, forename
        

produces the following:

Result Set
surname forename courseCount
Brown David 1
Snord Cranston 2
Zzap Zachary 2

This is not quite what was desired, as it implies that 'David Brown' is taking one course when he is actually taking none. The problem lies not in the OUTER JOIN phrase, but in the COUNT(*) phrase. The COUNT() function counts all non-NULL values in a result set. Using '*' as its parameter is simply asking COUNT() to return the number of tuples that contain any non-NULL attributes. Since the surname and forename attributes are non-NULL, the resulting single tuple for 'David Brown' provides a count of 1.

In order to produce the desired result the first line of the query must be replaced by:

SELECT surname, forename, COUNT(course) as courseCount ...

producing the following:

Result Set
surname forename courseCount
Brown David 0
Snord Cranston 2
Zzap Zachary 2

The phrase COUNT(course) asks the COUNT() function to return the number of non-NULL occurrences of the attribute course in its result set. The first outer join example showed that the only resulting tuple for 'David Brown' contains a NULL value for course. Thus COUNT(course) returns 0 for this tuple.

Multi-Table Joins

You may join more than two tables in a SELECT statement and mix different types of joins in that SELECT statement. These joined tables are normally listed sequentially, and can be thought of as going from left to right in the order they are listed, in the same way that outer join statements are considered to be either left or right. (See the example of the courseLab table under Inner Joins.)

In addition to the student and studentCourse tables of the previous examples, assume that a course table also exists:

course Table
course name
CP363 Database I
CP102 Intro to Microcomputers
HP202 Science in the Modern World
BW101 Intro to Basket Weaving

In order to generate a list of student names and the names of the courses they are taking, the following query suffices:

Students and Courses
              
SELECT surname, forename, name
  FROM student AS s
  INNER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  INNER JOIN course AS c
  ON sc.course = c.course
  ORDER BY surname, forename, name
        

which produces the following:

Result Set
surname forename name
Snord Cranston Database I
Snord Cranston Intro to Microcomputers
Zzap Zachary Database I
Zzap Zachary Science in the Modern World

(In this example the table studentCourse acts as a look-up table between student and course.

You can mix two different outer joins, but the query design needs careful thought:

Left and Right Outer Joins
              
SELECT surname, forename, Name
  FROM student AS s
  LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  RIGHT OUTER JOIN course AS c
  ON sc.course = c.course
  ORDER BY surname, forename, Name
        

produces the following:

Result Set
surname forename Name
NULL NULL Intro to Basket Weaving
Snord Cranston Database I
Snord Cranston Intro to Microcomputers
Zzap Zachary Database I
Zzap Zachary Science in the Modern World

Note that 'David Brown' is still missing. What is important here is the order of the joins. When student and studentCourse are related by a left outer join, 'David Brown' appears. When the right outer join between the results of the previous left outer join and course is performed, the tuple containing 'David Brown' is dropped, as it is part of the left 'table' (the left result set) and not the right table. To get the tuple containing 'David Brown' without any course information, it is necessary to rewrite the previous example (note the parentheses - they change the order in which the joins are executed):

Left and Right Outer Joins
              
SELECT surname, forename, Name
  FROM student AS s
  LEFT OUTER JOIN (studentCourse AS sc
  ON s.studentId = sc.studentId
  RIGHT OUTER JOIN course AS c)
  ON sc.course = c.course
  ORDER BY surname, forename, Name
        

produces the following:

Result Set
surname forename Name
Brown David NULL
Snord Cranston Database I
Snord Cranston Intro to Microcomputers
Zzap Zachary Database I
Zzap Zachary Science in the Modern World

The same result could have been obtained by starting with the course table and moving 'left' to the student table.

Note, however, that the tuple containing 'Intro to Basket Weaving' is now missing, for the same reasons (reversed) that 'David Brown' was missing in the previous result set. In order to get both 'David Brown' and 'Intro to Basket Weaving' into the final result set needs a full outer join:

Full Outer Joins
              
SELECT surname, forename, Name
  FROM student AS s
  FULL OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  FULL OUTER JOIN course AS c
  ON sc.course = c.course
  ORDER BY surname, forename, Name
        

This produces:

Result Set
surname forename Name
Brown David NULL
Snord Cranston Database I
Snord Cranston Intro to Microcomputers
Zzap Zachary Database I
Zzap Zachary Science in the Modern World
NULL NULL Intro to Basket Weaving

One last note of caution about outer joins - the attributes you choose for projection are extremely important. The following query:

Outer Join Issues
              
SELECT sc.studentId, sc.name
  FROM student AS s
  LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  RIGHT OUTER JOIN course AS c
  ON sc.name = c.name
        

produces the following:

Result Set
studentId name
NULL NULL
999568440 CP363
999568440 CP102
987859400 CP363
987859400 HP202

This differs from the query:

More Outer Joins
              
SELECT s.studentId, c.name
  FROM student AS s
  LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  RIGHT OUTER JOIN course AS c
  ON sc.name = c.name
        

which produces the following:

Result Set
studentId name
NULL BW101
999568440 CP363
999568440 CP102
987859400 CP363
987859400 HP202

The first query chooses its attributes from the look-up table studentCourse (as sc), which has NULL values for some of the tuples in the outer joins. The second query uses the attributes of the tables on the 'outside' of the joins, and at least one of these attributes has values in the result set.

A more complex join involves a series of tables centered around a single table. Imagine the following table relationships:

Complex Student Join

The table studentCourseTerm is at the center of a relationship amongst the tables student, course, and Term. It is a look-up table for these three other tables, and contains information about the courses a given student takes during any term. Assume that there are appropriate foreign key relationships between studentCourseTerm and the other three tables.

In order to generate a list of courses taken by all students during all terms, the query looks like:

Complex Student Join
              
SELECT s.surname, s.forename, c.name AS courseName, t.name AS termName
  FROM studentCourseTerm AS sct
  INNER JOIN student AS s
  ON sct.studentId = s.studentId
  INNER JOIN course AS c
  ON sct.name = c.name
  INNER JOIN term AS t
  ON sct.termCode = t.termCode
  ORDER BY s.surname, s.forename, c.name, t.name
        

Note the use of parentheses to group the joins together and the fact that the middle table (studentCourseTerm) is referenced only once.

Note also the aliasing required for c.name and t.name. The term and course tables both contain a name attribute, but these attributes refer to totally different data - there is certainly no primary/foreign key relationship implied between them. Using common names like name, description, or price in many different tables is often a recipe for confusion. Aliasing helps, but is no replacement for a thoughtful naming standard.

Virtual Table Joins

Joins can be performed against virtual, or derived tables, as in the following example:

Join on Subqueries
              
SELECT s.surname, s.forename, course, repeats
  FROM student AS s
  INNER JOIN
  (SELECT studentId, course, COUNT(*)
  FROM studentCourseTerm
  GROUP BY studentId, course)
  AS courseRepeats (studentId, course, repeats)
  ON s.studentId = courseRepeats.studentId
  WHERE repeats > 1
        

This query returns a list of all students who have taken a course more than once. The subquery (the query in parentheses) creates a virtual table courseRepeats from studentCourse that is joined against student based upon their common studentId values. The subquery attributes (in this case course and repeats) are displayed as part of the result set.

This is covered in more depth in SQL Subqueries

Self Joins

A table that has a reflexive association (i.e. is joined to itself) may be used in a self join. Assume an employee table contains both an employee ID and a manager ID, where a manager is also part of the employee table:

Employee Table UML

The managerId is a foreign key to the empId primary key.

The employee table can be joined to itself giving it distinctive aliases, as in this example:

Selecting Employees and their Managers
              
SELECT report.empName AS employeeName, manager.empName AS managerName, manager.department
  FROM employee AS report INNER JOIN employee AS manager
  ON report.managerId = manager.empId
        

report and manager are simply two different aliases of the same table, employee. Given this data:

employee Table
empId name managerId department
5 David Brown 3 Physics & Computer Science
24 Heider Ali 5 Physics & Computer Science

The self-join gives us:

Result Set
employeeName managerName department
Heider Ali David Brown Physics & Computer Science

Join Performance

Describing the details of how various SQL engines implement joins is beyond the scope of this course. It is worth noting, however, that providing indices on the columns to be joined improves performance immensely. In the previous examples creating individual indices on the attributes of the various look-up tables helps performance. (ex: on studentId and course in studentCourse.) These attributes, as primary keys in their 'owning' tables, are already indexed).