Schemas

The Three Schema Architecture

A schema is a description of data contents, structure, and possibly other aspects of a database

external schema: describes data as seen by an application program or by an end user
conceptual schema: describes the base logical structure of all data
internal schema: describes how the database is physically encoded, including selection of files, indexes, etc.

The separation of external schema from conceptual schema enables logical data independence. The separation of conceptual schema from internal schema enables physical data independence.

Data Independence

Isolates applications from data description and changes to data. Data structure description is stored separately from the access programs. In a database, if a record structure is changed by adding a field, then the new field shows up on the next access of the database. The DBMS itself does not need to be changed in order for it to access the data. The details of storage implementation are hidden from the user.

data about data is called meta-data (catalog and description of data)
- independent of the data itself, and allows program independence
- separates interface from implementation
- as long as you know that the data exists, and what it looks like, you can access it without worrying about how it is accessed
design pattern abstractions or concerns - two concerns
- basic concern: functionality -what is the basic purpose of software or a database
  - specialization (is-a) - involved in inheritance. ex: a TA is a special version of a student. They have all the attributes of a student, and perhaps more
  - aggregation (has a) - an element is made up of sub elements, and the sub elements have no existence outside of the upper level element. ex: a Mark is an element of a student. Without the student, there can be no Mark
  - association (uses a) - an element may use another element, but both exist independently. ex: Course and student each have independent existences, but they are associated
- special purpose concern: user interface, control, or timing
  - views-a - the relationship between modules, i.e. the way they look at each other. ex: a student and an Instructor have different views of a Course. A student may examine portions of a Course, whereas an Instructor may alter Course data. This view is not inherent in the data, but rather is imposed from outside independent of the data. (This is a relatively 'new' software primitive).
  - can have multiple user interfaces (analog and digital clock show same data using different interface)
- Separating these concerns makes software easier to design, implement, and reuse. (Why?)
Databases may contain virtual data and/or computed data - a different view of the data
- In the DCRIS database, a member is associated with a narrow expertise by the memberNarrow table
- The memberNarrow table contains only the memberNarrowMemberId and the memberNarrowNarrowId, which is not particularly informative when read by a human. User can define a view that gives the member name and the narrow expertise description, narrowDesc. These are never actually stored together in the same table, but exist together virtually in the view.
- student assignment marks are stored as raw marks and assignment value - percent result is a computed field that is never stored, but derived from two other fields
Concurrency insures that simultaneous updates are not committed by separate users, and data integrity is maintained
Redundancy is reduced by efficient data storage
- (see sample member, List, and narrow data)
- members each appear once, and therefore occupy a full record
- Each member may have many areas of expertise, and are related through memberNarrow. Use of ID fields minimizes (but does not eliminate) redundancy
- Redundancy reduction improves data integrity:
  - allows limiting of choice for data fields
  - by storing information in only one place, eliminates possibility that two or more pieces of data that should match do not
- reduces storage space necessary
- reduces duplication of effort - if the data is already there, why re-enter it?
Rules can be designed into databases
- Moves procedures (code) from the application program into the DBMS
- Application programs do not need to worry about applying the rules, and thus the rules do not have to be programmed into every application
- Triggers can be applied whenever data is inserted, updated, or deleted. A trigger is an event that is 'triggered' by the insertion, update, or deletion.
- Constraints or checks can be set upon data so that insert or update attempts that violate the checks can be rejected

Relational Model

A domain is a set of atomic values. An atomic value is a value that is indivisible; i.e. it cannot be broken down into smaller values. This depends partly on how the value is to be used. A Name attribute (ex: "David B. Brown") may be considered atomic if only the entire name value is used. However a Name could be broken down into Surname , Forename , and Initial should there be a use for the individual attributes. On the other hand a student ID would be irreducibly atomic in that it cannot be broken down further.

A domain has a name, data type, and format. (INTEGER, CHAR[n], VARCHAR[n], SMALLINT, amongst others.)
A column header or field is an attribute. The content of an attribute is a value, where the value is an element of the domain of the attribute. (i.e. if the attribute domain is INTEGER, then the attribute value must be INTEGER). A NULL is a special value, and is not necessarily allowed in all domains.
A row or record is a tuple. A tuple is a set of <attribute, value> pairs. At the abstract level, the order of <attribute, value> pairs is irrelevant. (It is important to a particular view of the data).
A table is a relation. A relation is a set of tuples. Sets are logically unordered, though they may have a physical order when stored

A simple sample relation:

student	id	surname	forename
relation name	attributes
tuples	165465	Brown	David
	458965	Zzap	Zachary
	532595	Snord	Cranston

A key is an attribute or set of attributes that can uniquely identify a tuple. In the above example ID is the key attribute because although names may be repeated, a student ID is unique.

If an attribute is multivalued it should be represented by separate relations. Thus if a supervisor has many employees this information cannot be stored in a single table because employees is multivalued. The relationship between the two must be broken up into at least two tables.

Keys

Generally all tuples in a relation should be distinct

this distinctiveness can be based upon some subset of the relation's attributes: this set of attributes is a superkey
A key is a minimal superkey: it is the smallest set of attributes that can uniquely identify a tuple in a relation.
A relation may have more than one key - each of these possible keys is a candidate key.
One of a relation's keys can be designated as the primary key.
Ex: a student table may contain the attributes studentId, OHIPNumber and SIN. Taken together these three attributes form a superkey. As well any two of these attributes taken together form a superkey. Individually all three are candidate keys because each can be used to uniquely identify a tuple in student. Only one can be chosen to be the primary key.
Ex: In the case of a lookup table, like that between member and narrow the primary key is a combination of the attributes memberId and narrowId because only the combination of these two attributes uniquely identifies a tuple in the memberNarrow table.
No primary key value should be null. (This is the entity integrity constraint).

Key constraints between tables are called referential integrity constraints.

A tuple in one relation may refer to an existing tuple in another relation.
An attribute in one table is called a foreign key if its values refer to the values of a primary key in another table.
A foreign key may or may not be allowed to have a null value.
These key constraints must hold when table data is inserted, deleted, or updated.

DDL / DML

DDL: Data Definition Language

specifies schemas
may be different DDLs for external, conceptual and internal schemas
meta-data is stored in a data dictionary

DML: Data Manipulation Language

specifies data queries and updates
two general ways of querying and updating a database
- through "stand alone" DML facilities
- from within application programs
two kinds of DMLs: result sets may be
- navigational (one record at a time)
- non-navigational

SQL (Structured Query Language) is used as both a DDL and a DML.

Constraints

A referential key constraint can protect data in one table from deletions or updates in another table.

If a tuple in a parent table has children (i.e. tuples related by a foreign key) then attempts to delete the parent before the children are deleted fail. This is because the children are linked to the parent's primary key by their foreign key value, and if the primary key is deleted, their foreign key values become invalid. Thus the deletion is rejected. Similarly this can prevent the primary key of a parent from being updated. Once all the children are deleted, however, the parent tuple may also be deleted.

A referential key constraint can also make deletions and updates easier.

Rather than rejecting the deletion of a parent, the deletion can be cascaded. In this case when a parent tuple is deleted all child tuples with matching foreign keys are deleted as well. Thus a single command can be used to delete large amounts of data. Similarly an update to a particular primary key value can be cascaded to all child foreign keys.

The database designer must choose whether a particular relation rejects or cascades deletions and updates.

Lesson 3: Data and Relations