2020-12-16 07:12

CP363 : XML

XML (eXtensible Markup Language)
it's a metalanguage -- a language for describing other languages -- which lets you design your own markup
puts structured data in a text file
XML is a set of rules, guidelines, conventions, for designing text formats for such data
- produces files that are easy to generate and read (by a computer)
- are unambiguous
- are extensible: First, it allows developers to create their own DTDs, effectively creating 'extensible' tag sets that can be used for multiple applications. Second, XML itself is being extended with several additional standards that add styles, linking, and referencing ability to the core XML set of capabilities. As a core standard, XML provides a solid foundation around which other standards may grow.
- support for internationalization/localization
- avoid platform-dependency
a subset or restricted form of SGML (Standard Generalized Markup Language) (ISO 8879) the international standard for defining descriptions of the structure and content of different types of electronic document
is not HTML
- both use tags (words bracketed by '<' and '>') and attributes (of the form name="value")
- HTML specifies what each tag & attribute means (and often how the text between them will look in a browser)
- but XML uses the tags only to delimit pieces of data, and leaves the interpretation of the data completely to the application that reads it
- HTML markup is fixed
- XML markup is user defined, custom markup can be created for any need
- rules for XML files are much stricter than for HTML: markup that an HTML browser tolerates is not necessarily correct XML
XML can be thought of as SGML-- rather than HTML++
Valid XML files are kosher SGML, so they can be used outside the Web as well, in an SGML environment
XML is license-free, platform-independent and well-supported
Users from a database or computer science background should be aware that SGML systems -- and that includes XML -- are not database management systems: they are text markup systems. While there are many similarities some of the concepts of one are simply non-existent in the other: XML does not possess some database-like features in the same way that DBMSs do not possess markup-like ones. It is a common error to believe that XML is some kind of database like Oracle or Access.

While formatting pages is useful, many users are starting to realize that web sites are only marginally more useful than printed or faxed material. Although it's possible to cut and paste information out of a web browser, XML opens up the prospect of reusable page content. With appropriate supporting applications, a user could extract the XML data from a document and keep it in their own private data store, making it easy to manipulate the information later. This information could include site maps, price lists, product information, or nearly any kind of data that can be represented as text. Content-based XML markup enhances searchability as well, making it possible for agents and search engines to categorize data instead of wasting processing power on context-based full-text searches.

At the same time, XML is useful for much more than Web pages. XML's potential as a universal transfer format, allowing even applications of different types to exchange data smoothly, holds as much promise as its role as a document system. XML browsers are a key opening for XML, allowing users to read XML documents freely, but browsers are only the beginning. XML provides a gateway for communication between applications, even applications on wildly different systems. As long as applications can share data (through HTTP, file sharing, or another mechanism), and have an XML parser, they can share structured information that is easily processed. Databases can trade tables, business applications can trade updates, and document systems can share information.

XML provides a core set of standards developers can use to create their own standards. While some have declared that XML will 'balkanize' the Web, the effects are likely to be much more complex - and less destructive - than that suggestion implies. By hammering down a common document syntax, but allowing developers to go their own ways on markup elements, XML makes it possible to create new systems for data management and organization without many of the incompatibilities and complexities that plagued older systems.

XML Family

XLink describes a standard way to add hyperlinks to an XML file.
XPointer & XFragments are syntaxes for pointing to parts of an XML document. (An XPointer is a bit like a URL, but instead of pointing to documents on the Web, it points to pieces of data inside an XML file.)
CSS (Cascading Style Sheets), the style sheet language, is applicable to XML as it is to HTML.
XSL (Extensible Stylesheet Language) is the advanced language for expressing style sheets. It is based on XSLT, a transformation language that is often useful outside XSL as well, for rearranging, adding or deleting tags & attributes.
The DOM (Document Object Model) is a standard set of function calls for manipulating XML (and HTML) files from a programming language.
XML Namespaces is a specification that describes how you can associate a URL with every single tag and attribute in an XML document. What that URL is used for is up to the application that reads the URL, though. (RDF, W3C's standard for metadata, uses it to link every piece of metadata to a file defining the type of that data.)
XML Schema Parts 1 and 2 (Structures and Datatypes) help developers to precisely define their own XML-based formats. There are several more modules and tools available or under development.
SVG (Scalable Vector Graphics) is a language for describing two-dimensional graphics in XML.
XHTML is a reformulation of HTML 4 as an XML 1.0 application.
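
As a rough illustration of the DOM entry in the list above, the sketch below uses Python's standard `xml.dom.minidom` module to load a small document and read and change it through DOM calls. The markup mirrors the FIGURE example in the Structure section; the `SRC` attribute and its value are made up for the demonstration.

```python
from xml.dom import minidom

# Sample document; the markup mirrors the FIGURE example in the Structure section.
SOURCE = (
    '<FIGURE DESCRIPTION="Harvey">'
    '<IMAGE/>'
    '<CAPTION>My invisible friend</CAPTION>'
    '</FIGURE>'
)

doc = minidom.parseString(SOURCE)
figure = doc.documentElement

# Read data through standard DOM calls.
description = figure.getAttribute("DESCRIPTION")
caption_text = figure.getElementsByTagName("CAPTION")[0].firstChild.data

# Modify the tree: add an attribute (hypothetical name and value) to IMAGE.
image = figure.getElementsByTagName("IMAGE")[0]
image.setAttribute("SRC", "harvey.png")

result = figure.toxml()
print(description)
print(caption_text)
print(result)
```

The same method names (`getAttribute`, `getElementsByTagName`, `setAttribute`) are defined by the DOM itself, which is why similar code can be written in other languages.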

Structure

Like HTML (an application of SGML), XML uses elements and attributes, which are indicated in a document using tags. Tags begin with a < and close with a >. End tags include a / before the name of the element; empty tags include a / before the closing >. For example, the following bit of a document includes three elements: two elements with content, and one empty tag.

<FIGURE DESCRIPTION="Harvey">
  <IMAGE/>
  <CAPTION>This is a picture of my invisible friend!</CAPTION>
</FIGURE>

The first start tag opens the FIGURE element, which has the attribute DESCRIPTION set to "Harvey", and contains an empty IMAGE element and the CAPTION element with its content. End tags close both the CAPTION and the FIGURE elements, producing a nested structure. These nested structures are fairly good at representing typical document and data structures, and are very easy for computer programs to store and manipulate. XML enforces its rules harshly. Unlike HTML browsers, which have been extremely forgiving of bad markup, XML parsers are supposed to produce error messages for illegal or malformed markup. Forcing the author to clean up their markup allows the parsers on the receiving end to do much less work. It also provides authors with confidence that their work will be interpreted consistently, without having to wonder how multiple browsers would interpret the same document.
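
The strictness described above can be observed with any conforming parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the three broken samples are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


GOOD = (
    '<FIGURE DESCRIPTION="Harvey">'
    '<IMAGE/>'
    '<CAPTION>This is a picture of my invisible friend!</CAPTION>'
    '</FIGURE>'
)
# HTML-style habits that an XML parser has to reject:
UNCLOSED = '<FIGURE><IMAGE><CAPTION>text</CAPTION></FIGURE>'  # IMAGE never closed
UNQUOTED = '<FIGURE DESCRIPTION=Harvey></FIGURE>'             # unquoted attribute value
OVERLAP = '<FIGURE><CAPTION>text</FIGURE></CAPTION>'          # overlapping elements

for sample in (GOOD, UNCLOSED, UNQUOTED, OVERLAP):
    print(is_well_formed(sample))
```

Only the first sample parses; the other three raise `ParseError` instead of being silently repaired.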

In addition to providing syntax for document markup, XML provides syntax for specifying document structure. The Document Type Definition (DTD) provides XML parsers a set of rules with which they can validate the document. Validation doesn't imply that the contents of the document are correct, or that certain data fields are numbers or text; rather, it means that all the elements of the document fit into the structure specified by the DTD. For example, the fragment below specifies the structure used in the example above.

<!ELEMENT FIGURE (IMAGE, CAPTION)>
<!ATTLIST FIGURE 
DESCRIPTION CDATA #IMPLIED>
<!ELEMENT IMAGE EMPTY>
<!ELEMENT CAPTION (#PCDATA)>

The FIGURE element must contain an IMAGE and a CAPTION element, and the FIGURE element may have a DESCRIPTION attribute. The IMAGE element must be empty, and would probably include a set of attributes providing information about the image if this example was more complicated than an 'invisible friend'. The CAPTION element may contain text, entities, processing instructions, and any other valid XML text except other elements. The full syntax for DTDs provides many more options for declaring elements and attributes and their location in the document structure, as well as entities, which allow the developer to define a chunk of XML content or DTD information and use it by reference.
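
To make the rules in this DTD fragment concrete, here is a hand-written sketch that checks them one by one in Python. It is not a DTD validator: the parsers in Python's standard library are non-validating, so the function `conforms_to_figure_dtd` (a hypothetical name) simply hard-codes the four declarations shown above.

```python
import xml.etree.ElementTree as ET


def conforms_to_figure_dtd(text):
    """Hand-coded check of the FIGURE rules; not a general DTD validator."""
    try:
        figure = ET.fromstring(text)
    except ET.ParseError:
        return False  # not even well-formed
    if figure.tag != "FIGURE":
        return False
    # <!ELEMENT FIGURE (IMAGE, CAPTION)>: exactly these children, in this order.
    if [child.tag for child in figure] != ["IMAGE", "CAPTION"]:
        return False
    # <!ATTLIST FIGURE DESCRIPTION CDATA #IMPLIED>: optional, nothing else declared.
    if set(figure.attrib) - {"DESCRIPTION"}:
        return False
    image, caption = figure
    # Element content: only whitespace is allowed around the children.
    stray = [figure.text, image.tail, caption.tail]
    if any(s and s.strip() for s in stray):
        return False
    # <!ELEMENT IMAGE EMPTY>: no text, no children, no declared attributes.
    if len(image) or image.text or image.attrib:
        return False
    # <!ELEMENT CAPTION (#PCDATA)>: text only, no child elements or attributes.
    if len(caption) or caption.attrib:
        return False
    return True


VALID = """<FIGURE DESCRIPTION="Harvey">
  <IMAGE/>
  <CAPTION>This is a picture of my invisible friend!</CAPTION>
</FIGURE>"""
MISSING_CAPTION = "<FIGURE><IMAGE/></FIGURE>"          # well-formed, but not valid
NESTED_CAPTION = "<FIGURE><IMAGE/><CAPTION><B>x</B></CAPTION></FIGURE>"

for sample in (VALID, MISSING_CAPTION, NESTED_CAPTION):
    print(conforms_to_figure_dtd(sample))
```

The last two samples are well-formed XML, which shows the difference between passing the syntax rules and fitting the structure a DTD describes.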

XML permits the use of documents, called 'well-formed documents', that use only its rules for document syntax, without specifying a DTD. Documents that contain (and/or refer to) a properly written DTD, and meet the requirements it sets, are referred to as 'valid'. Validation can be an important step in the authoring process, and may also be performed at any step in processing. Developers can choose how often, and when, to screen a document to check its structure. Applications which need to process lots of information quickly, or which can't afford the additional processing requirements imposed by validation, can stick to well-formed documents. Well-formed documents also provide an easy bottom rung on the XML learning ladder - by sticking to the basic syntax, developers can create parseable documents with any structure they choose, moving up to more formal DTDs when the need arises.

In practice, though, XML itself should disappear into the background, hiding behind tools for most users. Most people won't need to create their own DTDs - once a standard DTD is created, users can simply apply it, making modifications when it becomes clear the structure needs improvement. As XML becomes ubiquitous (and tools improve), it should become invisible to all but a few, buried underneath authoring tools and plug-in parsers.

<!ELEMENT html (head, body)>
<!ELEMENT ul (li)+>
<!ELEMENT form %form.content;>   <!-- forms shouldn't be nested -->
<!ATTLIST form
  %attrs;
  action      %URI;          #REQUIRED
  method      (get|post)     "get"
  enctype     %ContentType;  "application/x-www-form-urlencoded"
  onsubmit    %Script;       #IMPLIED
  onreset     %Script;       #IMPLIED
  accept      %ContentTypes; #IMPLIED
  accept-charset %Charsets;  #IMPLIED
  >

In effect, a DTD provides applications with advance notice of what names and structures can be used in a particular document type. Using a DTD when editing files means you can be certain that all documents which belong to a particular type will be constructed and named in a consistent and conformant manner.

Unfortunately, a DTD is not an XML document and therefore cannot be parsed and validated as an XML document. An alternative way of defining the structure of an XML document uses XSD (XML Schema Definition). An XSD schema is an XML document itself, and can be parsed and validated in the same way any XML document can. It is also, arguably, easier to understand than a DTD definition. The following example shows the html and ul elements from the DTD above translated into XSD.

   
<schema
  xmlns='http://www.w3.org/2001/XMLSchema'
  targetNamespace='http://www.w3.org/namespace/'
  xmlns:t='http://www.w3.org/namespace/'>

 <element name='html'>
  <complexType>
   <sequence>
    <element ref='t:head'/>
    <element ref='t:body'/>
   </sequence>
  </complexType>
 </element>

 <element name='ul'>
  <complexType>
   <sequence maxOccurs='unbounded'>
    <element ref='t:li'/>
   </sequence>
  </complexType>
 </element>
</schema>
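
Because a schema is itself XML, an ordinary XML parser can read it. The sketch below loads a condensed copy of the schema above with Python's `xml.etree.ElementTree` and lists each declared element with the elements its content model refers to. The sample uses the namespace URI of the final XML Schema Recommendation; the helper `local_name` is a hypothetical convenience that ignores namespaces altogether, so the code does not depend on that choice.

```python
import xml.etree.ElementTree as ET

SCHEMA = """
<schema xmlns='http://www.w3.org/2001/XMLSchema'
        targetNamespace='http://www.w3.org/namespace/'
        xmlns:t='http://www.w3.org/namespace/'>
 <element name='html'>
  <complexType><sequence>
    <element ref='t:head'/>
    <element ref='t:body'/>
  </sequence></complexType>
 </element>
 <element name='ul'>
  <complexType><sequence maxOccurs='unbounded'>
    <element ref='t:li'/>
  </sequence></complexType>
 </element>
</schema>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(SCHEMA)

# Map each top-level element declaration to the elements it references.
declared = {}
for decl in root:
    if local_name(decl.tag) == "element":
        refs = [e.get("ref") for e in decl.iter() if e.get("ref")]
        declared[decl.get("name")] = refs

print(declared)
```

Nothing comparable is possible with a DTD using an XML parser alone, since DTD declarations use their own syntax.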

There are thousands of SGML DTDs already in existence in all kinds of areas (see the SGML Web pages for examples). Many of them can be downloaded and used freely, or you can write your own. Existing SGML DTDs need to be converted to XML for use with XML systems, so expect to see announcements of popular DTDs becoming available in XML format.

Full SGML uses a Document Type Definition (DTD) to describe the markup (elements) available in any specific type of document. However, the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD. DTDless operation means you can invent markup without having to define it formally, at the penalty of losing automated control over the structuring of additional documents of the same type.

To make this work, a DTDless file in effect defines its own markup informally, by the simple existence and location of elements where you create them. But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure while it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules.

Well Formed:

- all tags must be balanced: that is, all elements which may contain character data must have both start- and end-tags present (omission is not allowed except for empty elements, see below);
- all attribute values must be in quotes (the single-quote character [the apostrophe] may be used if the value contains a double-quote character, and vice versa): if you need both, use &apos; or &quot; (do not under any circumstances use the typographic (curly) 'inverted commas' for quoting attribute values);
- any EMPTY element tags (eg those with no end-tag like HTML's <IMG>, <HR>, and others) must either end with '/>' or you have to make them appear non-EMPTY by adding a real end-tag. Example: <HR> would become either <HR/> or <HR></HR>;
- there must not be any isolated markup-start characters (< or &) in your text data (ie they must be given as &lt; and &amp;), and the sequence ]]> must be given as ]]&gt; unless you really are using it as the end of a CDATA marked section;
- elements must nest inside each other properly (no overlapping markup, same rule as for all SGML, including HTML).
Well-formed documents with no DTD may use attributes on any element, but the attributes are assumed to be all of type CDATA. You cannot use ID/IDREF attributes in DTDless documents.
XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use even without a DTD. Valid XML files should declare them explicitly if they use them. If you want to use more than these five default character entities, but you want to avoid having to write a full DTD, it is possible to declare just character entities on their own in the internal subset of an otherwise standalone XML file (thanks to Richard Lander for this) without the document having a full DTD, and without the need to reference anything other than the root element type:
```
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE example [
<!ENTITY nbsp "&#160;">
]>
<example>Three&nbsp;&nbsp;&nbsp;blanks.</example>
```
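
The quoting and escaping rules above, and the internal-subset entity declaration, can be tried out with Python's standard library. This is a sketch under two assumptions: the sample strings are invented, and the `nbsp` entity is declared as the no-break space U+00A0 (`&#160;`).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# Character data: isolated '<' and '&' must be written as entity references.
text = escape("if a < b && c > d")

# Attribute values: quoteattr picks a quote style that suits the value.
plain = quoteattr("Harvey")
with_double = quoteattr('the "invisible" friend')

# A DTDless document that declares one extra entity in its internal subset.
DOCUMENT = (
    '<?xml version="1.0" standalone="yes"?>'
    '<!DOCTYPE example [<!ENTITY nbsp "&#160;">]>'
    '<example>Three&nbsp;&nbsp;&nbsp;blanks &amp; one ampersand.</example>'
)
parsed = ET.fromstring(DOCUMENT).text

print(text)
print(plain)
print(with_double)
print(repr(parsed))
```

The parser expands both the predefined `&amp;` and the locally declared `&nbsp;`, so the application sees plain characters rather than entity references.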

Valid XML

Valid XML files are those which have a Document Type Definition (DTD) like other SGML applications, and which adhere to it. They must already be well-formed.

A valid file begins like any other SGML file with a Document Type Declaration, but may have an optional XML Declaration prepended:

<?xml version="1.0"?>
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">
<advert>
<headline>...<pic/>...</headline>
<text>...</text>
</advert>

The XML Specification defines an SGML Declaration for XML which is fixed for all instances (the declaration has been removed from the text of the Specification and is now in a separate document). An XML version of the specified DTD must be accessible to the XML processor, either by being available locally (ie the user already has a copy on disk), or by being retrievable via the network. You can specify this by supplying the URL for the DTD in a System Identifier (as in the example above). It is possible (many people would say preferable) to supply a Formal Public Identifier, but if used, this must precede the System Identifier, which must still be given (and only the PUBLIC keyword is used):

<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
"http://www.foo.org/ad.dtd">

The defaults assumed when the XML Declaration is omitted are version="1.0" and encoding="UTF-8".
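
The identifiers in a Document Type Declaration are available to applications after parsing. A minimal sketch with Python's `xml.dom.minidom`, reusing the advert example with its PUBLIC form:

```python
from xml.dom import minidom

DOCUMENT = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN" '
    '"http://www.foo.org/ad.dtd">'
    '<advert><headline>...<pic/>...</headline><text>...</text></advert>'
)

# minidom uses a non-validating parser: it records the declaration
# but does not download or apply the DTD.
doctype = minidom.parseString(DOCUMENT).doctype

print(doctype.name)      # root element type named in the declaration
print(doctype.publicId)  # Formal Public Identifier
print(doctype.systemId)  # System Identifier (the URL of the DTD)
```

A validating processor would go one step further and fetch the DTD from the System Identifier, or resolve the Public Identifier to a local copy.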

XQuery

XQuery is a programming language designed to query collections of XML documents. It can be used standalone or through an API from other languages, such as Java. It can extract data from XML using expressions. Here is a simple example of an XQuery that extracts all authors whose last name is "Brown" from 'books.xml':

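doc("books.xml")/library/book/author[last="Brown"]

"library", "book", "author" and "last" are element names within the "books.xml" document; the part in square brackets is a condition that keeps only the authors whose "last" child has the text "Brown".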

XQuery can also be used to construct new XML documents from existing ones by wrapping the XQuery expressions in XML tags:

<titles>
{
doc("books.xml")//title
}
</titles>

which creates:

<titles>
  <title>TCP/IP Illustrated</title>
  <title>Advanced Programming in the Unix Environment</title>
  <title>Data on the Web</title>
  <title>The Economics of Technology and Content for Digital TV</title>
</titles>
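
For readers without an XQuery processor, the two queries in this section can be approximated in Python. This is not XQuery: it uses the limited XPath subset in the standard `xml.etree.ElementTree` module. The content of `books.xml` is a hypothetical stand-in; the four titles come from the result above, while the author names and the `first` element are invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Hypothetical books.xml; author names and <first> are invented.
BOOKS = """
<library>
  <book>
    <title>TCP/IP Illustrated</title>
    <author><first>Ann</first><last>Brown</last></author>
  </book>
  <book>
    <title>Advanced Programming in the Unix Environment</title>
    <author><first>Bob</first><last>Green</last></author>
  </book>
  <book>
    <title>Data on the Web</title>
    <author><first>Cy</first><last>Brown</last></author>
  </book>
  <book>
    <title>The Economics of Technology and Content for Digital TV</title>
    <author><first>Dee</first><last>White</last></author>
  </book>
</library>
"""

library = ET.fromstring(BOOKS)

# Counterpart of: doc("books.xml")/library/book/author[last="Brown"]
browns = library.findall("./book/author[last='Brown']")
brown_first_names = [author.findtext("first") for author in browns]

# Counterpart of: <titles>{ doc("books.xml")//title }</titles>
titles = ET.Element("titles")
for found in library.iter("title"):
    ET.SubElement(titles, "title").text = found.text
result = ET.tostring(titles, encoding="unicode")

print(brown_first_names)
print(result)
```

The serialized result has the same elements as the XQuery output above, without the indentation.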

See http://www.datadirect.com/developer/xml/xquery/docs/katz_c01.pdf for a good introductory tutorial.