XML

   

XML (eXtensible Markup Language) is a W3C recommendation for creating special-purpose markup languages. It is a simplified subset of SGML, capable of describing many different kinds of data. Its primary purpose is to facilitate the sharing of structured text and information across the Internet. Languages based on XML (for example, RDF, RSS, MathML, XSIL and SVG) are themselves described in a formal way, allowing programs to modify and validate documents in these languages without prior knowledge of their form.

Strengths and weaknesses

The features of XML that make it particularly appropriate for data transfer are:

XML is also heavily used for document storage and processing, both online and offline, and offers several benefits:

  • robust, logically-verifiable format based on international standards
  • hierarchical structure suitable for most (but not all) types of document
  • plain text files, unencumbered by licenses or restrictions
  • platform-independent, and so relatively immune to changes in technology
  • has already been in use (as SGML) for well over a decade, and is very popular in its own right, so extensive experience and software are available.

For certain applications, the format also has the following weaknesses:

  • XML syntax is fairly verbose and partially redundant. This can hurt human readability and application efficiency, and yields higher storage costs. It can also make XML difficult to apply in cases where bandwidth is limited, though compression can reduce the problem in some cases.
  • XML syntax contains a number of obscure features due to its legacy of SGML compatibility.
  • XML still often requires further parsing to extract individual values.
  • Modelling overlapping (non-hierarchical) data structures requires extra effort.
  • Mapping XML to the object oriented or relational paradigms may be cumbersome.
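The point about verbosity and compression in the list above can be illustrated with a minimal Python sketch. The sample document, the ROW constant and the compression_ratio function are hypothetical, and the ratio obtained depends entirely on how repetitive the markup is.

```python
import zlib

# Hypothetical sample: a document made of 200 near-identical elements,
# which exaggerates the redundancy typical of XML markup.
ROW = '<ingredient amount="1" unit="cups">Flour</ingredient>\n'
document = "<recipe>\n" + ROW * 200 + "</recipe>\n"

raw = document.encode("utf-8")
packed = zlib.compress(raw)


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size."""
    return len(zlib.compress(data)) / len(data)


print(len(raw), len(packed), round(compression_ratio(raw), 3))
```

Real documents with more varied content compress less well than this artificial sample.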

Syntax rules in XML

An XML document is text, usually a particular encoding of Unicode such as UTF-8 or UTF-16, although other encodings may be used.

Unlike HTML, for example, XML depends heavily on structure, content and integrity for its efficacy. For a document to be considered "well-formed" (see W3C Recommendation XML 1.0 (Third Edition), http://www.w3.org/TR/REC-xml#sec-well-formed), it must conform, at the very least, to the following:

  • It must have one (and only one) root element.
  • Non-empty elements must be delimited by a start-tag and an end-tag. Empty elements may be marked with an empty-element tag.
  • All attribute values must be quoted, with either single (') or double (") quotes. A value opened with a single quote must be closed with a single quote, and likewise for double quotes; the other kind of quote can then be used inside the value.
  • Tags may be nested but may not overlap; that is, each non-root element must be completely contained in another element.

Element names in XML are case-sensitive: for example <Example> and </Example> are a well-formed matching pair whereas <Example> and </example> are not.
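The rules above can be checked mechanically with any conforming parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the is_well_formed function and the sample strings are hypothetical and only illustrate the rules listed here.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One sample per rule discussed in the text.
samples = {
    "matching pair": "<Example></Example>",
    "case mismatch": "<Example></example>",
    "two root elements": "<a/><b/>",
    "unquoted attribute": "<a id=1/>",
    "overlapping tags": "<a><b></a></b>",
    "double quotes inside single quotes": "<a title='say \"hi\"'/>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the first and last samples are accepted; the others each break one of the rules and are rejected by the parser.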

Also, clever choice of XML element names allows the meaning of the data to be retained as part of the markup. This makes it more easily interpreted by humans while also consumable by software programs.

As a concrete example, a simple recipe expressed in an XML representation might be:

       <?xml version="1.0" encoding="UTF-8"?>
       <Recipe name="bread" prep_time="5 mins" cook_time="3 hours">
          <title>Basic bread</title>
          <ingredient amount="3" unit="cups">Flour</ingredient>
          <ingredient amount="0.25" unit="ounce">Yeast</ingredient>
          <ingredient amount="1.5" unit="cups">Warm Water</ingredient>
          <ingredient amount="1" unit="teaspoon">Salt</ingredient>
          <Instructions>
             <step>Mix all ingredients together, and knead thoroughly.</step>
             <step>Cover with a cloth, and leave for one hour in warm room.</step>
             <step>Knead again, place in a tin, and then bake in the oven.</step>
          </Instructions>
       </Recipe>

Giving logical names to elements and attributes allows an author unfamiliar with a particular document type to quickly grasp their meaning without having to refer to documentation or spend several minutes studying a document to understand its structure. This can, however, also lead to excess verbosity (which can complicate authoring) and greatly increase file size (which decreases efficiency when content is transferred over a network).
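As a sketch of the further parsing that is often needed to extract individual values, the recipe above could be read with Python's standard xml.etree.ElementTree module. The Ingredient class and the read_ingredients and read_steps functions are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# The recipe document from the text, as UTF-8 bytes.
RECIPE_XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<Recipe name="bread" prep_time="5 mins" cook_time="3 hours">
   <title>Basic bread</title>
   <ingredient amount="3" unit="cups">Flour</ingredient>
   <ingredient amount="0.25" unit="ounce">Yeast</ingredient>
   <ingredient amount="1.5" unit="cups">Warm Water</ingredient>
   <ingredient amount="1" unit="teaspoon">Salt</ingredient>
   <Instructions>
      <step>Mix all ingredients together, and knead thoroughly.</step>
      <step>Cover with a cloth, and leave for one hour in warm room.</step>
      <step>Knead again, place in a tin, and then bake in the oven.</step>
   </Instructions>
</Recipe>
"""


@dataclass
class Ingredient:
    name: str
    amount: float
    unit: str


def read_ingredients(xml_bytes: bytes) -> List[Ingredient]:
    """Convert each <ingredient> element into a typed Ingredient."""
    root = ET.fromstring(xml_bytes)
    return [
        # Attribute values arrive as strings, so the amount is converted here.
        Ingredient((el.text or "").strip(), float(el.get("amount")), el.get("unit"))
        for el in root.findall("ingredient")
    ]


def read_steps(xml_bytes: bytes) -> List[str]:
    """Return the text of each <step> in document order."""
    root = ET.fromstring(xml_bytes)
    return [(el.text or "").strip() for el in root.findall("Instructions/step")]


for item in read_ingredients(RECIPE_XML):
    print(item.amount, item.unit, item.name)
```

Note that the element names must be matched with their exact case ("Instructions", not "instructions"), and that values such as "5 mins" would need yet another parsing step before they could be used as numbers.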

An XML document that complies with an associated schema (such as a DTD) in addition to being well-formed is said to be "valid".

XML schema languages

Before the advent of generalised data description languages such as SGML and XML, software designers had to define special file formats or small languages to share data between programs. This required writing detailed specifications and special-purpose parsers and writers.

XML's regular structure and strict parsing rules allow software designers to leave parsing to standard tools. Since XML provides a general, data model-oriented framework for the development of application-specific languages, software designers need only concentrate on the development of schemas for their data, at relatively high levels of abstraction.

An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic constraints imposed by XML itself. A number of standard and proprietary XML schema languages have emerged for the purpose of formally expressing such schemas, and some of these languages are themselves XML-based.

Well-tested tools exist to validate XML files against a schema in order to automatically verify whether the document conforms to constraints expressed in the schema. Other usages of schemas exist: XML editors, for instance, can use schemas to support the editing process.

The oldest XML schema format is the DTD, which is inherited from SGML. While DTD support is ubiquitous due to its inclusion in the XML 1.0 standard, it is seen as limited for the following reasons:

  • No support for newer features of XML, most importantly namespaces.
  • Lack of expressivity. Certain formal aspects of an XML document cannot be captured in a DTD.
  • Custom non-XML syntax to describe the schema, inherited from SGML.

A newer XML schema language, described by the W3C as the successor of DTDs, is simply called XML Schema, also referred to as XML Schema Definition (XSD). XSD schemas are far more powerful than DTDs in describing XML languages. Additionally, XSD uses an XML-based format, which makes it possible to use the XML toolset to help process XML schemas. It also becomes possible to write a schema for the schema language itself. Criticisms of XSD are:

  • Standard is very large, which makes it difficult to understand and implement.
  • XML-based syntax leads to verbosity in schema description, which makes XSDs harder to read and write.

Another popular XML schema language is RELAX NG. Initially standardized by OASIS and now also an ISO international standard (as part of DSDL), RELAX NG comes in two formats: an XML-based syntax and a non-XML compact syntax. The compact syntax aims to increase readability and writability, but since there is a well-defined way to translate the compact syntax to the XML syntax and back again, the advantage of using standard XML tools is not lost. RELAX NG has a more compact definition, which makes it easier to implement than XSD.

Some schema languages not only describe the structure of a particular XML format but also offer limited facilities to influence processing of individual XML files that conform to this format. DTDs and XSDs both have this ability; they can for instance provide attribute defaults. RELAX NG intentionally does not provide these facilities.

Displaying XML on the web

Extensible stylesheet language (XSL) is a further adjunct to XML that allows users to describe visual properties and transformations of XML data without embedding those instructions into the data itself. The resulting document can then be displayed by a browser, much as an HTML document is rendered using CSS. One way to achieve this is to include the following line in the XML document:

<?xml-stylesheet type="text/xsl" href="transform.xsl"?>

which declares that the named XSLT style sheet should be used to transform the XML into HTML. This transformation may take place on the server side as well as in the browser.

An XML document may also be rendered directly with the stylesheet language CSS in some browsers, such as Internet Explorer 5 or Mozilla. As of March 2004 this process is not yet stable in those browsers; in other browsers, such as Opera, it works very well. In order to allow CSS styling, the XML document must include a special reference to a style sheet:

<?xml-stylesheet type="text/css" href="myStyleSheet.css"?>

Note that this differs greatly from the standard HTML way to call a stylesheet, where it is usually done by the <link /> element.
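Such a stylesheet reference is an ordinary processing instruction and can be added programmatically. The following is a minimal sketch using Python's standard xml.dom.minidom module; the with_stylesheet function is a hypothetical helper, and the file name is the one used in the example above.

```python
from xml.dom import minidom


def with_stylesheet(xml_text: str, href: str, mime_type: str) -> str:
    """Return `xml_text` with an xml-stylesheet instruction before the root element."""
    doc = minidom.parseString(xml_text)
    instruction = doc.createProcessingInstruction(
        "xml-stylesheet", f'type="{mime_type}" href="{href}"'
    )
    # The instruction must appear in the prolog, before the root element.
    doc.insertBefore(instruction, doc.documentElement)
    return doc.toxml()


styled = with_stylesheet("<Recipe/>", "myStyleSheet.css", "text/css")
print(styled)
```

The same helper could be called with "text/xsl" and the name of an XSLT style sheet to produce the XSL variant of the reference.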

While browser-based XML rendering develops, the alternative is conversion into HTML or PDF or other formats on the server. Programs like Cocoon process an XML file against a stylesheet (and can perform other processing as well) and send the output back to the user's browser without the user needing to be aware of what has been going on in the background.

XML extensions

  • XPath makes it possible to refer to individual components of an XML document. This allows stylesheets in (for example) XSL and XSLT to dynamically "cherry-pick" pieces of a document in any sequence needed in order to compose the required output.
  • XQuery is to XML what SQL is to relational databases.
  • XML namespaces enable the same document to contain XML elements and attributes taken from different vocabularies, without any naming collisions occurring.
  • XML Signature defines the syntax and processing rules for creating digital signatures on XML content.
  • XML Encryption defines the syntax and processing rules for encrypting XML content.
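As an illustration of XPath and namespaces working together, the following Python sketch selects elements from a document that mixes two vocabularies. It relies on the standard xml.etree.ElementTree module, which supports only a limited subset of XPath; the namespace URIs, the prefixes and the calories attribute are invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a recipe vocabulary (default namespace) combined
# with attributes from a separate nutrition vocabulary (prefix "n").
DOC = """
<recipe xmlns="http://example.org/recipe"
        xmlns:n="http://example.org/nutrition">
  <ingredient unit="cups" n:calories="455">Flour</ingredient>
  <ingredient unit="ounce" n:calories="23">Yeast</ingredient>
  <ingredient unit="cups" n:calories="0">Warm Water</ingredient>
</recipe>
"""

# Prefixes used in the path expressions below; they need not match the
# prefixes used in the document, only the namespace URIs must agree.
NS = {
    "r": "http://example.org/recipe",
    "n": "http://example.org/nutrition",
}

root = ET.fromstring(DOC)

# Path expression with an attribute predicate: ingredients measured in cups.
in_cups = [el.text for el in root.findall("r:ingredient[@unit='cups']", NS)]

# Namespaced attributes are addressed by their expanded {uri}name form.
CALORIES = "{http://example.org/nutrition}calories"
total_calories = sum(int(el.get(CALORIES)) for el in root.findall("r:ingredient", NS))

print(in_cups, total_calories)
```

Because every name is qualified by its namespace URI, an unrelated vocabulary could define its own "ingredient" element or "calories" attribute without any collision.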

Processing XML files

The APIs most widely used for processing XML data from programming languages are SAX and DOM. SAX is used for serial processing whereas DOM is used for random-access processing. Another form of XML processing API is data binding, where XML data is made available as a strongly typed programming language data structure, in contrast to the DOM. Example data binding systems are the Java Architecture for XML Binding (JAXB) [1] (http://java.sun.com/xml/jaxb/) and the Strathclyde Novel Architecture for Querying XML (SNAQue) [2] (http://www.cis.strath.ac.uk/research/snaque/).
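The difference between the two styles can be sketched with Python's standard xml.sax and xml.dom.minidom modules. The StepCounter class and the sample document are hypothetical and serve only to contrast serial and random-access processing.

```python
import xml.sax
from xml.dom import minidom

DOC = (
    "<Instructions>"
    "<step>Mix</step><step>Rest</step><step>Bake</step>"
    "</Instructions>"
)


class StepCounter(xml.sax.ContentHandler):
    """SAX: the parser calls back for each event as it streams through the input."""

    def __init__(self):
        super().__init__()
        self.steps = 0

    def startElement(self, name, attrs):
        if name == "step":
            self.steps += 1


# Serial processing: nothing is kept in memory except the running count.
handler = StepCounter()
xml.sax.parseString(DOC.encode("utf-8"), handler)

# Random access: the whole tree is built first, then any node can be visited.
dom = minidom.parseString(DOC)
steps = dom.getElementsByTagName("step")
last_step = steps[len(steps) - 1].firstChild.data

print(handler.steps, last_step)
```

The SAX handler never holds more than the current event, which suits very large inputs, while the DOM tree allows jumping straight to the last step at the cost of loading the entire document.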

An extensible stylesheet language (XSL) processor may be used to render an XML file for displaying or printing. XSL formatting objects (XSL-FO) are intended for paginated output such as PDF files. XSLT is for transforming to other formats, including HTML, other vocabularies of XML, and any other plain-text format. XQuery [3] (http://www.w3.org/TR/xquery/) is a W3C language for querying, constructing and transforming XML data. XPath [4] (http://www.w3.org/TR/xpath) is a path expression language for selecting data within an XML file. XPath is a sublanguage of both XQuery and XSLT.

The native file format of OpenOffice.org and AbiWord is XML. Some parts of Microsoft Office 11 will also be able to edit XML files with a user-supplied schema (but not a DTD). There are dozens of other XML editors available.

Versions of XML

The current version of XML is 1.1 (as of February 4, 2004). The first version, XML 1.0, currently exists in its third edition. XML 1.0 and XML 1.1 differ in the characters they permit in element names, attribute names and the like: XML 1.0 only allows characters which are defined in Unicode 2.0, which includes most world scripts but excludes scripts which only entered in a later Unicode version, such as Mongolian, Cambodian, Amharic and Burmese. XML 1.1 only disallows certain control characters, which means that any other character can be used, however much the Unicode standard grows.

The restriction present in XML 1.0 applies only to element and attribute names: both XML 1.0 and XML 1.1 allow the use of full Unicode in the content itself. Thus XML 1.1 is only needed if, in addition to using a script added after Unicode 2.0, you also wish to write the element and attribute names in that script.

Other minor changes between XML 1.0 and XML 1.1 are that control characters are now allowed to be included, but only when escaped, and that two additional line-end characters (NEL, U+0085, and the Unicode line separator, U+2028) are recognised, which must be treated as whitespace.

All XML 1.0 documents will be valid XML 1.1 documents, with one exception: XML documents declaring themselves as being ISO-8859-1 encoded which are actually CP1252 encoded may now be invalid: this is because CP1252 uses the control characters block of ISO-8859-1 for special glyphs like €, Œ, and ™. XML 1.0 documents which declare CP1252 encoding will remain valid.
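The encoding mismatch described above can be reproduced with a few lines of Python. This is only a sketch of the character-level effect, not of an XML parser's behaviour; the three byte values are the CP1252 codes for the glyphs mentioned in the text.

```python
import unicodedata

# Bytes as they would appear in a CP1252-encoded file.
raw = bytes([0x80, 0x8C, 0x99])

# Decoded with the encoding actually used: three printable glyphs.
as_cp1252 = raw.decode("cp1252")

# Decoded as declared (ISO-8859-1): three C1 control characters,
# which XML 1.1 does not allow to appear unescaped.
as_latin1 = raw.decode("iso-8859-1")

for glyph, control in zip(as_cp1252, as_latin1):
    print(glyph, hex(ord(control)), unicodedata.category(control))
```

Each control character falls in the Unicode category Cc, which is why a mislabelled CP1252 document can be rejected by an XML 1.1 processor even though the same bytes were accepted under XML 1.0.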

There are also discussions on an XML 2.0, although it remains to be seen if such will ever come about. XML-SW (SW for skunk works), written by one of the original developers of XML, contains some proposals for what an XML 2.0 might look like: elimination of DTDs from syntax, integration of namespaces, XML Base and XML Information Set into the base standard.

Retrieved from "http://www.mywiseowl.com/articles/XML"

This page was last modified 00:39, 25 Nov 2004. All text is available under the terms of the GNU Free Documentation License.