CP363 : XML

While formatting pages is useful, many users are starting to realize that web sites are only marginally more useful than printed or faxed material. Although it's possible to cut and paste information out of a web browser, XML opens up the prospect of reusable page content. With appropriate supporting applications, a user could extract the XML data from a document and keep it in their own private data store, making it easy to manipulate the information later. This information could include site maps, price lists, product information, or nearly any kind of data that can be represented as text. Content-based XML markup enhances searchability as well, making it possible for agents and search engines to categorize data instead of wasting processing power on context-based full-text searches.

At the same time, XML is useful for much more than Web pages. XML's potential as a universal transfer format, allowing even applications of different types to exchange data smoothly, holds as much promise as its role as a document system. XML browsers are a key opening for XML, allowing users to read XML documents freely, but browsers are only the beginning. XML provides a gateway for communication between applications, even applications on wildly different systems. As long as applications can share data (through HTTP, file sharing, or another mechanism), and have an XML parser, they can share structured information that is easily processed. Databases can trade tables, business applications can trade updates, and document systems can share information.

XML provides a core set of standards developers can use to create their own standards. While some have declared that XML will 'balkanize' the Web, the effects are likely to be much more complex - and less destructive - than that suggestion implies. By hammering down a common document syntax, but allowing developers to go their own ways on markup elements, XML makes it possible to create new systems for data management and organization without many of the incompatibilities and complexities that plagued older systems.


XML Family


Structure

Like HTML (an application of SGML), XML uses elements and attributes, which are indicated in a document using tags. Tags begin with a < and close with a >. End tags include a / before the name of the element; empty-element tags include a / before the closing >. For example, the following bit of a document includes three elements: two elements with content, and one empty element.

<FIGURE DESCRIPTION="Harvey">
  <IMAGE/>
  <CAPTION>This is a picture of my invisible friend!</CAPTION>
</FIGURE>

The first start tag opens the FIGURE element, which has the attribute DESCRIPTION set to "Harvey", and contains an empty IMAGE element and the CAPTION element with its content. End tags close both the CAPTION and the FIGURE elements, producing a nested structure. These nested structures are fairly good at representing typical document and data structures, and are very easy for computer programs to store and manipulate.

XML enforces its rules harshly. Unlike HTML browsers, which have been extremely forgiving of bad markup, XML parsers are supposed to produce error messages for illegal or malformed markup. Forcing authors to clean up their markup allows the parsers on the receiving end to do much less work. It also gives authors confidence that their work will be interpreted consistently, without having to wonder how multiple browsers would interpret the same document.
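As a rough illustration of both points (the nested structure and the strict error handling), the sketch below loads the FIGURE fragment with Python's standard xml.etree.ElementTree module. The helper name describe and the deliberately broken copy of the fragment are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET

FIGURE_XML = """
<FIGURE DESCRIPTION="Harvey">
  <IMAGE/>
  <CAPTION>This is a picture of my invisible friend!</CAPTION>
</FIGURE>
"""

def describe(element, depth=0):
    """Return one indented line per element, mirroring the nesting."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(describe(child, depth + 1))
    return lines

root = ET.fromstring(FIGURE_XML)
outline = describe(root)
print("\n".join(outline))
print(root.get("DESCRIPTION"))  # attribute lookup on the FIGURE element

# Same fragment with the CAPTION end tag removed: no longer well-formed.
BROKEN_XML = FIGURE_XML.replace("</CAPTION>", "")
try:
    ET.fromstring(BROKEN_XML)
    rejected = False
except ET.ParseError:
    rejected = True
print("rejected:", rejected)
```

The parser refuses the broken copy outright rather than guessing where CAPTION should have ended, which is the behaviour the paragraph above describes.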

In addition to providing syntax for document markup, XML provides syntax for specifying document structure. The Document Type Definition (DTD) provides XML parsers a set of rules with which they can validate the document. Validation doesn't imply that the contents of the document are correct, or that certain data fields are numbers or text; rather, it means that all the elements of the document fit into the structure specified by the DTD. For example, the fragment below specifies the structure used in the example above.

<!ELEMENT FIGURE (IMAGE, CAPTION)>
<!ATTLIST FIGURE 
DESCRIPTION CDATA #IMPLIED>
<!ELEMENT IMAGE EMPTY>
<!ELEMENT CAPTION (#PCDATA)>

The FIGURE element must contain an IMAGE element followed by a CAPTION element, and the FIGURE element may have a DESCRIPTION attribute. The IMAGE element must be empty, and would probably include a set of attributes providing information about the image if this example were more complicated than an 'invisible friend'. The CAPTION element may contain text, entities, processing instructions, and any other valid XML text except other elements. The full syntax for DTDs provides many more options for declaring elements and attributes and their location in the document structure, as well as entities, which allow the developer to define a chunk of XML content or DTD information and use it by reference.
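Python's standard library does not include a validating parser, so the sketch below hand-codes a few of the checks that a validating parser would derive from the four declarations above. The function name check_figure and the two test fragments are hypothetical, and the checks are deliberately incomplete (for example, stray text inside FIGURE is not detected).

```python
import xml.etree.ElementTree as ET

def check_figure(xml_text):
    """Return a list of rule violations; an empty list means the fragment fits."""
    figure = ET.fromstring(xml_text)
    problems = []
    if figure.tag != "FIGURE":
        return ["root element must be FIGURE"]
    # <!ELEMENT FIGURE (IMAGE, CAPTION)>: exactly these children, in this order.
    if [child.tag for child in figure] != ["IMAGE", "CAPTION"]:
        problems.append("FIGURE must contain IMAGE then CAPTION")
    # <!ATTLIST FIGURE DESCRIPTION CDATA #IMPLIED>: the only declared attribute.
    if set(figure.attrib) - {"DESCRIPTION"}:
        problems.append("FIGURE has an undeclared attribute")
    for child in figure:
        # <!ELEMENT IMAGE EMPTY>: no text and no child elements.
        if child.tag == "IMAGE" and (len(child) or child.text):
            problems.append("IMAGE must be empty")
        # <!ELEMENT CAPTION (#PCDATA)>: text only, no child elements.
        if child.tag == "CAPTION" and len(child):
            problems.append("CAPTION must not contain elements")
    return problems

good = '<FIGURE DESCRIPTION="Harvey"><IMAGE/><CAPTION>Hi</CAPTION></FIGURE>'
bad = '<FIGURE><CAPTION>Hi</CAPTION><IMAGE/></FIGURE>'
print(check_figure(good))
print(check_figure(bad))
```

Note that the second fragment is perfectly well-formed; it fails only because the children appear in the wrong order for the declared content model.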

XML permits the use of documents, called 'well-formed documents', that follow only its rules for document syntax, without specifying a DTD. Documents that contain (and/or refer to) a properly written DTD, and meet the requirements it sets, are referred to as 'valid'. Validation can be an important step in the authoring process, and may also be performed at any step in processing. Developers can choose how often, and when, to screen a document to check its structure. Applications which need to process lots of information quickly, or which can't afford the additional processing requirements imposed by validation, can stick to well-formed documents. Well-formed documents also provide an easy bottom rung on the XML learning ladder - by sticking to the basic syntax, developers can create parseable documents with any structure they choose, moving up to more formal DTDs when the need arises.

In practice, though, XML itself should disappear into the background, hiding behind tools for most users. Most people won't need to create their own DTDs - once a standard DTD is created, users can simply apply it, making modifications when it becomes clear the structure needs improvement. As XML becomes ubiquitous (and tools improve), it should become invisible to all but a few, buried underneath authoring tools and plug-in parsers.

<!ELEMENT html (head, body)>
<!ELEMENT ul (li)+>
<!ELEMENT form %form.content;>   <!-- forms shouldn't be nested -->
<!ATTLIST form
  %attrs;
  action      %URI;          #REQUIRED
  method      (get|post)     "get"
  enctype     %ContentType;  "application/x-www-form-urlencoded"
  onsubmit    %Script;       #IMPLIED
  onreset     %Script;       #IMPLIED
  accept      %ContentTypes; #IMPLIED
  accept-charset %Charsets;  #IMPLIED
  >

In effect, a DTD provides applications with advance notice of what names and structures can be used in a particular document type. Using a DTD when editing files means you can be certain that all documents which belong to a particular type will be constructed and named in a consistent and conformant manner.

Unfortunately, a DTD is not an XML document and therefore cannot be parsed and validated as an XML document. An alternative way of defining the structure of an XML document uses XSD (XML Schema Definition). An XSD schema is an XML document itself, and can be parsed and validated in the same way any XML document can. It is also, arguably, easier to understand than a DTD definition. The following example shows the html and ul elements from the DTD above translated into XSD.

   
<schema
  xmlns='http://www.w3.org/2001/XMLSchema'
  targetNamespace='http://www.w3.org/namespace/'
  xmlns:t='http://www.w3.org/namespace/'>

 <element name='html'>
  <complexType>
   <sequence>
    <element ref='t:head'/>
    <element ref='t:body'/>
   </sequence>
  </complexType>
 </element>

 <element name='ul'>
  <complexType>
   <sequence maxOccurs='unbounded'>
    <element ref='t:li'/>
   </sequence>
  </complexType>
 </element>
</schema>
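Because a schema is itself XML, an ordinary XML parser can read it. The sketch below, which assumes the namespace of the final W3C Recommendation (http://www.w3.org/2001/XMLSchema), loads a copy of the schema with Python's xml.etree.ElementTree and lists which elements each declaration refers to. The variable names are illustrative only, and this reads the schema; it does not validate documents against it.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"  # assumed: final W3C namespace

SCHEMA = """
<schema xmlns='http://www.w3.org/2001/XMLSchema'
        targetNamespace='http://www.w3.org/namespace/'
        xmlns:t='http://www.w3.org/namespace/'>
 <element name='html'>
  <complexType>
   <sequence>
    <element ref='t:head'/>
    <element ref='t:body'/>
   </sequence>
  </complexType>
 </element>
 <element name='ul'>
  <complexType>
   <sequence maxOccurs='unbounded'>
    <element ref='t:li'/>
   </sequence>
  </complexType>
 </element>
</schema>
"""

schema = ET.fromstring(SCHEMA)
element_tag = "{%s}element" % XSD_NS  # ElementTree spells namespaced tags this way

# Map each top-level element declaration to the elements it references.
content_models = {}
for decl in schema.findall(element_tag):
    refs = [e.get("ref") for e in decl.iter(element_tag) if e.get("ref")]
    content_models[decl.get("name")] = refs

print(content_models)
```

Nothing comparable is possible with a DTD without a dedicated DTD parser, since DTD declarations use their own syntax.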

There are thousands of SGML DTDs already in existence in all kinds of areas (see the SGML Web pages for examples). Many of them can be downloaded and used freely, or you can write your own (see the question on creating your own DTD). Existing SGML DTDs need to be converted to XML for use with XML systems: read the question on converting SGML DTDs to XML, and expect to see announcements of popular DTDs becoming available in XML format.

Full SGML uses a Document Type Definition (DTD) to describe the markup (elements) available in any specific type of document. However, the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD. DTDless operation means you can invent markup without having to define it formally, at the penalty of losing automated control over the structuring of additional documents of the same type.

To make this work, a DTDless file in effect defines its own markup informally, by the simple existence and location of elements where you create them. But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure while it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules.


Well-Formed XML


Valid XML

Valid XML files are those which have a Document Type Definition (DTD) like other SGML applications, and which adhere to it. They must already be well-formed.

A valid file begins like any other SGML file with a Document Type Declaration, but may have an optional XML Declaration prepended:

<?xml version="1.0"?>
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">
<advert>
<headline>...<pic/>...</headline>
<text>...</text>
</advert>

The XML Specification defines an SGML Declaration for XML which is fixed for all instances (the declaration has been removed from the text of the Specification and is now in a separate document). An XML version of the specified DTD must be accessible to the XML processor, either by being available locally (i.e. the user already has a copy on disk), or by being retrievable via the network. You can specify this by supplying the URL for the DTD in a System Identifier (as in the example above). It is possible (many people would say preferable) to supply a Formal Public Identifier, but if used, this must precede the System Identifier, which must still be given (only the PUBLIC keyword is used; SYSTEM is not repeated):

<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
"http://www.foo.org/ad.dtd">

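If the XML Declaration is omitted, the document is treated as version="1.0", and the encoding is taken to be UTF-8 unless a byte order mark or external information (such as an HTTP header) indicates otherwise.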
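To see how a processor picks the identifiers out of a Document Type Declaration, the sketch below uses Python's xml.parsers.expat module, whose StartDoctypeDeclHandler callback receives the root element name, the System Identifier, and the Public Identifier. The function name read_doctype is an assumption for this example. Expat is a non-validating parser and, as configured here, does not fetch the DTD from the network.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
"http://www.foo.org/ad.dtd">
<advert>
<headline>...<pic/>...</headline>
<text>...</text>
</advert>
"""

def read_doctype(xml_text):
    """Return (root name, public identifier, system identifier), or None."""
    found = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Expat passes the System Identifier before the Public Identifier.
        found["doctype"] = (name, public_id, system_id)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return found.get("doctype")

doctype = read_doctype(DOCUMENT)
print(doctype)
```

For a declaration that uses only the SYSTEM keyword, the Public Identifier slot comes back as None.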


XQuery

XQuery is a language designed to query collections of XML documents. It can be used on its own or through an API from other languages, such as Java. It extracts data from XML using expressions. Here is a simple example of an XQuery that extracts all authors whose last name is "Brown" from 'books.xml':

doc("books.xml")/library/book/author[last="Brown"]

"library" and "book" are XML tags within the "books.xml" document.

XQuery can also be used to construct new XML documents from existing ones by wrapping the XQuery expressions in XML tags:

<titles>
{
doc("books.xml")//title
}
</titles>

which creates:

<titles>
  <title>TCP/IP Illustrated</title>
  <title>Advanced Programming in the Unix Environment</title>
  <title>Data on the Web</title>
  <title>The Economics of Technology and Content for Digital TV</title>
</titles>
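For comparison, both XQuery examples in this section can be approximated in Python with xml.etree.ElementTree, which supports a limited subset of XPath rather than XQuery itself. The contents of books.xml are not shown in these notes, so the BOOKS_XML string below is a hypothetical stand-in with invented titles and authors.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for books.xml; the real file is not shown in these notes.
BOOKS_XML = """
<library>
  <book>
    <title>First Example Book</title>
    <author><first>Ann</first><last>Brown</last></author>
  </book>
  <book>
    <title>Second Example Book</title>
    <author><first>Bob</first><last>Green</last></author>
    <author><first>Cy</first><last>Brown</last></author>
  </book>
</library>
"""

library = ET.fromstring(BOOKS_XML)

# Counterpart of /library/book/author[last="Brown"]; the path is relative
# to the library root element, so it starts at ./book.
browns = library.findall("./book/author[last='Brown']")
first_names = [author.findtext("first") for author in browns]
print(first_names)

# Counterpart of <titles>{ doc("books.xml")//title }</titles>:
# wrap every title found anywhere in the document in a new titles element.
titles = ET.Element("titles")
for title in library.iter("title"):
    ET.SubElement(titles, "title").text = title.text
titles_xml = ET.tostring(titles, encoding="unicode")
print(titles_xml)
```

The XQuery versions are noticeably shorter, which is much of the argument for using a dedicated query language once queries become more involved.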

See http://www.datadirect.com/developer/xml/xquery/docs/katz_c01.pdf for a good introductory tutorial.