Introduction

Joins create new virtual (or composite) tables that include information from multiple tables. In SQL you specify which tables contain the information you need using the FROM clause. You must combine the matching attributes of each table. You control how attributes are matched in the composite table either by specifying a particular type of join operation or by using the ON clause.

In a relational database management system relationships among data values are left unstated in the database definition. They become explicit when you query the data, not when you create it. You can ask any question that comes to mind about the data stored in the database, regardless of what your intentions were when the database was set up. (The database organization may have some effect on how you frame such questions - certain questions may be easier or harder to state in SQL depending on the database organization). Joining different tables allows you to explore relationships among diverse data.

Joins normally appear within SELECT statements. The desired attributes appear after the SELECT, the tables and aliases appear in the FROM clause, and the joining conditions can be given in the ON clause. If two tables share attribute names, the attribute names must be qualified by the appropriate table name or alias, as in this example:

Members and Their Publications

              
SELECT memberSurname, memberForename, pubTitle
  FROM member INNER JOIN pub
  ON memberId = pubMemberId

Here, member and pub are joined by their common attributes, memberId and pubMemberId. This join results in the listing of all members' names and the titles of their publications. Members without publications are not listed. (INNER has a special meaning that will be dealt with later.)

This query could be written without the JOIN ON clause by using a WHERE clause instead:

Members and Their Publications - No JOIN

              
SELECT memberSurname, memberForename, pubTitle
  FROM member, pub
  WHERE memberId = pubMemberId

Although this statement produces the same result set as the previous one using the JOIN ON clause, the ON keyword isolates the join constraints and makes your joined statement easier to read. For example, consider:

Members and Their Books

              
SELECT memberSurname, memberForename, pubTitle
  FROM member INNER JOIN pub
  ON memberId = pubMemberId
  WHERE pubPubType = 'B'

versus:

Members and Their Books - No JOIN

              
SELECT memberSurname, memberForename, pubTitle
  FROM member, pub
  WHERE memberId = pubMemberId
  AND pubPubType = 'B'

Although again, both statements produce the same result set, the first explicitly indicates that two tables are being joined and that a condition is being applied to the result of that join. The second throws the join and the condition together as part of the WHERE clause.

Attributes being joined must have the same or compatible data types. (In the previous example the member ID attributes of both tables are clearly of the same type because they have a primary key - foreign key relationship.)

Types of Joins

Joins can be categorized according to how they relate tables. We will examine Key joins, Inner joins, Outer joins, Multiple table joins, and Self joins.

Key Joins

Key joins take advantage of the foreign key relationship between two tables. Such queries are optimized by the DBMS engine. (Although query optimization is not a topic of this course, it is useful to know as a rule of thumb that joins made between tables related by primary and foreign keys are more efficient, and thus execute faster, than joins made between tables that do not have such a relationship.) For example:

Selecting Members and their Publications

              
SELECT pubTypeDesc, pubTitle, pubPublisher
  FROM pub INNER JOIN pubType
  ON pubPubType = pubType

The tables pub and pubType have a foreign key relationship based upon the attributes pubPubType (the foreign key for its table) and pubType (the primary key for its table) respectively. This query executes faster than if the foreign key relationship had not been made explicit. In a good database design, all tables that are expected to be joined should have their foreign key relationships explicitly defined.

Note that pubType is both a table name and the name of an attribute of that table - this is allowed in SQL.

Inner Joins

On an inner join tuples appear in the result set only if both tables contain at least one row that satisfies the join condition.

The following tables illustrate the results of an inner join:

**student** Table
studentId	surname	forename
815686230	Brown	David
999568440	Snord	Cranston
987859400	Zzap	Zachary

**studentCourse** Table
studentId	course
999568440	CP363
999568440	CP102
987859400	CP363
987859400	HP202

Assuming that these two tables have a foreign key relationship based upon the studentId attributes, the following SELECT statement joins the two tables:

Students and Courses

              
SELECT surname, forename, s.studentId, course
  FROM student AS s INNER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId

produces the following:

*Result Set*
surname	forename	studentId	course
Snord	Cranston	999568440	CP363
Snord	Cranston	999568440	CP102
Zzap	Zachary	987859400	CP363
Zzap	Zachary	987859400	HP202

Note that none of the data from the student table pertaining to 'David Brown' appears in this result set. The studentId attribute is common to both the student and studentCourse table, but 'David Brown' s ID 815686230 does not appear anywhere in the studentCourse table. Because on an inner join tuples appear in the result set only if both tables contain at least one row that satisfies the join condition, there can be no join based on the attribute value 815686230 .

A Short Note on Naming

In the DCRIS database all attribute names are unique within the database. With some minor exceptions, all attributes start with the name of the table followed by the rest of the attribute name. Thus the primary key for the member table is memberId, and the matching foreign key in the pub table is pubMemberId. (This can make for some awkward names, such as memberBroadBroadId, the foreign key in the memberBroad table relating to the primary key broadId in the broad table.)

The advantage of this approach is that explicitly using table names or aliasing is not necessary when joining two table on their primary/foreign keys since their attribute names are guaranteed to be unique. Thus the ON clause in:
FROM member INNER JOIN pub ON memberId = pubMemberId
simply refers to the required attributes with no other information necessary.

In the student and studentCourse example given above, both tables share an attribute name, studentId. Thus when joining them together, we have to explicitly indicate which table each attribute comes from:
FROM student INNER JOIN studentCourse ON student.studentId = studentCourse.studentId
or use aliases:
FROM student AS s INNER JOIN studentCourse AS sc ON s.studentId = sc.studentId
otherwise SQL won't know which studentId is being referred to.

Which approach should you use? There is no universal naming standard, and clearly SQL supports both approaches. When designing your own database, use whichever approach you prefer, just be consistent with the database. We are showing both approaches simply because you see both in actual database designs.

In a related note, whether you use CamelCase (ex: studentId), PascalCase (ex: StudentId), or SnakeCase (ex: student_id) for attribute and table naming is up to you. Note that SQL is case sensitive for table and attribute names.

The same principle applies across more than two tables. In a multi-table join, all tables must have at least one tuple that satisfies the result condition in order for a matching tuple to appear in the result set. If a third table is added to the previous join:

**courseLab** Table
course	lab
CP102	Tu, Th 4:00 - 5:00

with the join:

Students and Labs

              
SELECT surname, forename, s.studentId, cl.course, lab
  FROM student AS s
  INNER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  INNER JOIN courseLab AS cl
  ON sc.course = cl.course

The result is:

*Result Set*
surname	forename	studentId	course	lab
Snord	Cranston	999568440	CP102	Tu, Th 4:00 - 5:00

There is only one tuple in the final result set because only 'CP102' has tuples in all three tables, i.e. a student, a course, and a lab.

Outer Joins

An outer join contains tuples whether or not a tuple exists in the opposite table to satisfy the join condition. The LEFT and RIGHT keywords identify the table that is to appear in its entirety. A LEFT OUTER JOIN contains every tuple in the left-hand table, while a RIGHT OUTER JOIN contains every tuple in the right-hand table. A FULL OUTER JOIN contains every tuple in both the left- and right-hand tables. Attributes in the query result set that do not have values from tuples in the left or right tables contain NULL. Using the previous examples, altered to be an outer join:

Left Outer Join

              
SELECT surname, forename, s.studentId, course
  FROM student AS s LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId

produces the following:

*Result Set*
surname	forename	studentId	course
Brown	David	815686230	NULL
Snord	Cranston	999568440	CP363
Snord	Cranston	999568440	CP102
Zzap	Zachary	987859400	CP363
Zzap	Zachary	987859400	HP202

'David Brown' now appears in the result set, but the matching course attribute is NULL, as there were no matching values in the studentCourse table.

If the LEFT OUTER JOIN phrase was replaced by a RIGHT OUTER JOIN phrase, the result set is identical to the result set for the inner join example - the table studentCourse does not contain any values in the studentId attribute that do not have matching values in the student table. Thus, none of the attributes in the result set have a NULL value.

Outer joins can be especially useful when grouping and counting occurrences of values, but its use poses subtle dangers. Assume you wish to find out how many courses each student is taking. The inner join in this query:

Join with Group By

              
SELECT surname, forename, COUNT(*) AS courseCount
  FROM student AS s INNER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  GROUP BY surname, forename
  ORDER BY surname, forename

produces the following:

*Result Set*
surname	forename	courseCount
Snord	Cranston	2
Zzap	Zachary	2

Unfortunately, 'David Brown' is left out entirely from this result set. This is expected, because as described in the inner join example there is no matching tuple for 'David Brown' in the studentCourse table. In order to include 'David Brown' in the result set you must use an outer join. The query:

Outer Join with Group By

              
SELECT surname, forename, COUNT(*) AS courseCount
  FROM student AS s LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  GROUP BY surname, forename
  ORDER BY surname, forename

produces the following:

*Result Set*
surname	forename	courseCount
Brown	David	1
Snord	Cranston	2
Zzap	Zachary	2

This is not quite what was desired, as it implies that 'David Brown' is taking one course when he is actually taking none. The problem lies not in the OUTER JOIN phrase, but in the COUNT(*) phrase. The COUNT() function counts all non-NULL values in a result set. Using '*' as its parameter is simply asking COUNT() to return the number of tuples that contain any non-NULL attributes. Since the surname and forename attributes are non-NULL, the resulting single tuple for 'David Brown' provides a count of 1.

In order to produce the desired result the first line of the query must be replaced by:

SELECT surname, forename, COUNT(course) as courseCount ...

producing the following:

*Result Set*
surname	forename	courseCount
Brown	David	0
Snord	Cranston	2
Zzap	Zachary	2

The phrase COUNT(course) asks the COUNT() function to return the number of non-NULL occurrences of the attribute course in its result set. The first outer join example showed that the only resulting tuple for 'David Brown' contains a NULL value for course. Thus COUNT(course) returns 0 for this tuple.

Multi-Table Joins

You may join more than two tables in a SELECT statement and mix different types of joins in that SELECT statement. These joined tables are normally listed sequentially, and can be thought of as going from left to right in the order they are listed, in the same way that outer join statements are considered to be either left or right. (See the example of the courseLab table under Inner Joins.)

In addition to the student and studentCourse tables of the previous examples, assume that a course table also exists:

**course** Table
course	name
CP363	Database I
CP102	Intro to Microcomputers
HP202	Science in the Modern World
BW101	Intro to Basket Weaving

In order to generate a list of student names and the names of the courses they are taking, the following query suffices:

Students and Courses

              
SELECT surname, forename, name
  FROM student AS s
  INNER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  INNER JOIN course AS c
  ON sc.course = c.course
  ORDER BY surname, forename, name

which produces the following:

*Result Set*
surname	forename	name
Snord	Cranston	Database I
Snord	Cranston	Intro to Microcomputers
Zzap	Zachary	Database I
Zzap	Zachary	Science in the Modern World

(In this example the table studentCourse acts as a look-up table between student and course.

You can mix two different outer joins, but the query design needs careful thought:

Left and Right Outer Joins

              
SELECT surname, forename, Name
  FROM student AS s
  LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  RIGHT OUTER JOIN course AS c
  ON sc.course = c.course
  ORDER BY surname, forename, Name

produces the following:

*Result Set*
surname	forename	Name
NULL	NULL	Intro to Basket Weaving
Snord	Cranston	Database I
Snord	Cranston	Intro to Microcomputers
Zzap	Zachary	Database I
Zzap	Zachary	Science in the Modern World

Note that 'David Brown' is still missing. What is important here is the order of the joins. When student and studentCourse are related by a left outer join, 'David Brown' appears. When the right outer join between the results of the previous left outer join and course is performed, the tuple containing 'David Brown' is dropped, as it is part of the left 'table' (the left result set) and not the right table. To get the tuple containing 'David Brown' without any course information, it is necessary to rewrite the previous example (note the parentheses - they change the order in which the joins are executed):

Left and Right Outer Joins

              
SELECT surname, forename, Name
  FROM student AS s
  LEFT OUTER JOIN (studentCourse AS sc
  ON s.studentId = sc.studentId
  RIGHT OUTER JOIN course AS c)
  ON sc.course = c.course
  ORDER BY surname, forename, Name

produces the following:

*Result Set*
surname	forename	Name
Brown	David	NULL
Snord	Cranston	Database I
Snord	Cranston	Intro to Microcomputers
Zzap	Zachary	Database I
Zzap	Zachary	Science in the Modern World

The same result could have been obtained by starting with the course table and moving 'left' to the student table.

Note, however, that the tuple containing 'Intro to Basket Weaving' is now missing, for the same reasons (reversed) that 'David Brown' was missing in the previous result set. In order to get both 'David Brown' and 'Intro to Basket Weaving' into the final result set needs a full outer join:

Full Outer Joins

              
SELECT surname, forename, Name
  FROM student AS s
  FULL OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  FULL OUTER JOIN course AS c
  ON sc.course = c.course
  ORDER BY surname, forename, Name

This produces:

*Result Set*
surname	forename	Name
Brown	David	NULL
Snord	Cranston	Database I
Snord	Cranston	Intro to Microcomputers
Zzap	Zachary	Database I
Zzap	Zachary	Science in the Modern World
NULL	NULL	Intro to Basket Weaving

One last note of caution about outer joins - the attributes you choose for projection are extremely important. The following query:

Outer Join Issues

              
SELECT sc.studentId, sc.name
  FROM student AS s
  LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  RIGHT OUTER JOIN course AS c
  ON sc.name = c.name

produces the following:

*Result Set*
studentId	name
NULL	NULL
999568440	CP363
999568440	CP102
987859400	CP363
987859400	HP202

This differs from the query:

More Outer Joins

              
SELECT s.studentId, c.name
  FROM student AS s
  LEFT OUTER JOIN studentCourse AS sc
  ON s.studentId = sc.studentId
  RIGHT OUTER JOIN course AS c
  ON sc.name = c.name

which produces the following:

*Result Set*
studentId	name
NULL	BW101
999568440	CP363
999568440	CP102
987859400	CP363
987859400	HP202

The first query chooses its attributes from the look-up table studentCourse (as sc), which has NULL values for some of the tuples in the outer joins. The second query uses the attributes of the tables on the 'outside' of the joins, and at least one of these attributes has values in the result set.

A more complex join involves a series of tables centered around a single table. Imagine the following table relationships:

Complex Student Join

The table studentCourseTerm is at the center of a relationship amongst the tables student, course, and Term. It is a look-up table for these three other tables, and contains information about the courses a given student takes during any term. Assume that there are appropriate foreign key relationships between studentCourseTerm and the other three tables.

In order to generate a list of courses taken by all students during all terms, the query looks like:

Complex Student Join

              
SELECT s.surname, s.forename, c.name AS courseName, t.name AS termName
  FROM studentCourseTerm AS sct
  INNER JOIN student AS s
  ON sct.studentId = s.studentId
  INNER JOIN course AS c
  ON sct.name = c.name
  INNER JOIN term AS t
  ON sct.termCode = t.termCode
  ORDER BY s.surname, s.forename, c.name, t.name

Note the use of parentheses to group the joins together and the fact that the middle table (studentCourseTerm) is referenced only once.

Note also the aliasing required for c.name and t.name. The term and course tables both contain a name attribute, but these attributes refer to totally different data - there is certainly no primary/foreign key relationship implied between them. Using common names like name, description, or price in many different tables is often a recipe for confusion. Aliasing helps, but is no replacement for a thoughtful naming standard.

Virtual Table Joins

Joins can be performed against virtual, or derived tables, as in the following example:

Join on Subqueries

              
SELECT s.surname, s.forename, course, repeats
  FROM student AS s
  INNER JOIN
  (SELECT studentId, course, COUNT(*)
  FROM studentCourseTerm
  GROUP BY studentId, course)
  AS courseRepeats (studentId, course, repeats)
  ON s.studentId = courseRepeats.studentId
  WHERE repeats > 1

This query returns a list of all students who have taken a course more than once. The subquery (the query in parentheses) creates a virtual table courseRepeats from studentCourse that is joined against student based upon their common studentId values. The subquery attributes (in this case course and repeats) are displayed as part of the result set.

This is covered in more depth in SQL Subqueries

Self Joins

A table that has a reflexive association (i.e. is joined to itself) may be used in a self join. Assume an employee table contains both an employee ID and a manager ID, where a manager is also part of the employee table:

Employee Table UML

The managerId is a foreign key to the empId primary key.

The employee table can be joined to itself giving it distinctive aliases, as in this example:

Selecting Employees and their Managers

              
SELECT report.empName AS employeeName, manager.empName AS managerName, manager.department
  FROM employee AS report INNER JOIN employee AS manager
  ON report.managerId = manager.empId

report and manager are simply two different aliases of the same table, employee. Given this data:

**employee** Table
empId	name	managerId	department
5	David Brown	3	Physics & Computer Science
24	Heider Ali	5	Physics & Computer Science

The self-join gives us:

*Result Set*
employeeName	managerName	department
Heider Ali	David Brown	Physics & Computer Science

Join Performance

Describing the details of how various SQL engines implement joins is beyond the scope of this course. It is worth noting, however, that providing indices on the columns to be joined improves performance immensely. In the previous examples creating individual indices on the attributes of the various look-up tables helps performance. (ex: on studentId and course in studentCourse.) These attributes, as primary keys in their 'owning' tables, are already indexed).

Lesson 6: Joins

Introduction

Types of Joins

Key Joins

Inner Joins

Outer Joins

Multi-Table Joins

Virtual Table Joins

Self Joins

Join Performance