
XML Syntax

The syntax rules of XML are simple and strict. They are easy to learn and easy to use.

Because of this, it is relatively easy to create software that can read and manipulate XML.

An example XML document

XML documents use a self-describing and simple syntax.

<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

The first line in the document - the XML declaration - defines the XML version of the document. In this case the document conforms to the 1.0 specification of XML.

The next line describes the root element of the document (as if it were saying: "this document is a note"):

<note>

The next 4 lines describe 4 child elements of the root (to, from, heading, and body):

<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>

And finally, the last line defines the end of the root element:

</note>

Can you detect from this example that the XML document contains a Note to Tove from Jani?
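
To see how software reads this document, here is a minimal sketch that uses Python's standard xml.etree.ElementTree module. The choice of Python and the variable names (NOTE_XML, child_tags) are assumptions made for this illustration; any XML parser would report the same structure.

```python
import xml.etree.ElementTree as ET

NOTE_XML = """<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""

# fromstring parses the text and returns the root element.
root = ET.fromstring(NOTE_XML)

# Tags of the root's direct children, in document order.
child_tags = [child.tag for child in root]

print(root.tag)                                   # note
print(child_tags)                                 # ['to', 'from', 'heading', 'body']
print(root.findtext("to"), root.findtext("from")) # Tove Jani
```

The program answers the question above without knowing anything about notes in advance: the element names alone tell it who the note is to and who it is from.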

All XML elements must have a closing tag

With XML, it is illegal to omit the closing tag.

In HTML some elements do not have to have a closing tag. The following code is legal in HTML:

<p>This is a paragraph
<p>This is another paragraph

In XML all elements must have a closing tag like this:

<p>This is a paragraph</p>
<p>This is another paragraph</p>
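
A parser enforces this rule for you. The sketch below assumes Python's xml.etree.ElementTree; the helper name is_well_formed and the <doc> wrapper element are hypothetical, added only so that the two strings differ in nothing but the closing tags.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# <doc> is a hypothetical root element wrapped around the paragraphs.
html_style = "<doc><p>This is a paragraph<p>This is another paragraph</doc>"
xml_style = "<doc><p>This is a paragraph</p><p>This is another paragraph</p></doc>"

print(is_well_formed(html_style))  # False
print(is_well_formed(xml_style))   # True
```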


XML tags are case sensitive

Unlike HTML, XML tags are case sensitive.

With XML, the tag <Letter> is different from the tag <letter>.

Opening and closing tags must therefore be written with the same case:

<Message>This is incorrect</message>
<message>This is correct</message>
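
The following sketch shows both consequences of case sensitivity, assuming Python's xml.etree.ElementTree. The <mail> wrapper and the text values are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# A start tag and an end tag that differ only in case do not match.
try:
    ET.fromstring("<Message>This is incorrect</message>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False

# <Letter> and <letter> are two distinct elements.
# <mail> is a hypothetical root element added for this example.
mail = ET.fromstring("<mail><Letter>formal</Letter><letter>casual</letter></mail>")
upper = mail.findtext("Letter")
lower = mail.findtext("letter")

print(mismatch_accepted)  # False
print(upper, lower)       # formal casual
```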

All XML elements must be properly nested

Improperly nested tags make no sense to XML.

In HTML some elements can be improperly nested within each other like this:

<b><i>This text is bold and italic</b></i>

In XML all elements must be properly nested within each other like this:

<b><i>This text is bold and italic</i></b>
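
Proper nesting is what lets a parser turn the text into a tree. A small sketch, again assuming Python's xml.etree.ElementTree (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

improper = "<b><i>This text is bold and italic</b></i>"
proper = "<b><i>This text is bold and italic</i></b>"

# The overlapping tags cannot be turned into a tree, so parsing fails.
try:
    ET.fromstring(improper)
    improper_accepted = True
except ET.ParseError as err:
    improper_accepted = False
    print("rejected:", err)

# The properly nested version becomes <b> containing <i> containing text.
root = ET.fromstring(proper)
inner = root[0]
print(root.tag, ">", inner.tag, ":", inner.text)
```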

All XML documents must have a root tag

The first tag in an XML document is the root tag.

All XML documents must contain a single tag pair to define the root element. All other elements must be nested within the root element. All elements can have sub (children) elements. Sub elements must be correctly nested within their parent element:

<root>
<child>
<subchild>.....</subchild>
</child>
</root>
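
As a rough illustration of the single-root rule, the sketch below (Python's xml.etree.ElementTree is assumed, and the two_roots string is invented for the example) walks the nested document and then tries a document with two top-level elements.

```python
import xml.etree.ElementTree as ET

nested = "<root><child><subchild>.....</subchild></child></root>"
two_roots = "<root>first</root><root>second</root>"

# iter() visits the root first and then every element nested inside it.
tags = [element.tag for element in ET.fromstring(nested).iter()]
print(tags)  # ['root', 'child', 'subchild']

# A second top-level element is not allowed.
try:
    ET.fromstring(two_roots)
    two_roots_accepted = True
except ET.ParseError:
    two_roots_accepted = False
print(two_roots_accepted)  # False
```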

Attribute values must always be quoted

With XML, it is illegal to omit quotation marks around attribute values.

XML elements can have attributes in name/value pairs just like in HTML. In XML the attribute value must always be quoted. Study the two XML documents below. The first one is incorrect, the second is correct:

<?xml version="1.0"?>
<note date=12/11/99>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

<?xml version="1.0"?>
<note date="12/11/99">
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

The error in the first document is that the date attribute in the note element is not quoted.

This is correct: date="12/11/99". This is incorrect: date=12/11/99.
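
The difference is easy to check with a parser. This sketch assumes Python's xml.etree.ElementTree and uses shortened versions of the two documents; parse_or_none is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

unquoted = "<note date=12/11/99><to>Tove</to></note>"
quoted = '<note date="12/11/99"><to>Tove</to></note>'
single_quoted = "<note date='12/11/99'><to>Tove</to></note>"

def parse_or_none(text):
    """Return the root element, or None if text is not well-formed."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

print(parse_or_none(unquoted))                   # None
print(parse_or_none(quoted).get("date"))         # 12/11/99
print(parse_or_none(single_quoted).get("date"))  # 12/11/99
```

Note that single quotes are accepted as well as double quotes; what matters is that the value is quoted.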

With XML, White Space is Preserved

With XML, the white space in your document is preserved, not collapsed.

This is unlike HTML. With HTML, a sentence like this: "Hello              my name is Tove" will be displayed like this: "Hello my name is Tove", because HTML collapses each run of white space into a single space.
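
A quick way to confirm this is to parse some text with extra spaces and look at what the application receives. The sketch assumes Python's xml.etree.ElementTree; the <greeting> element is invented for the example.

```python
import xml.etree.ElementTree as ET

source_text = "Hello              my name is Tove"

# <greeting> is a hypothetical element used only to carry the text.
greeting = ET.fromstring("<greeting>" + source_text + "</greeting>")

# The parser hands the text over exactly as written, spaces included.
print(repr(greeting.text))
print(greeting.text == source_text)  # True
```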

With XML, CR / LF is converted to LF

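With XML, a new line is always stored as LF: an XML parser converts each CR LF pair, and each CR on its own, to a single LF before it hands the text to the application.

Have you ever heard of a typewriter? Well, a typewriter is a type of mechanical device that was used in the previous century :-)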

After you have typed one line of text on a typewriter, you have to manually return the printing carriage to the left margin position and manually feed the paper up one line.

In Windows applications, a new line in the text is normally stored as a pair of CR LF (carriage return, line feed) characters. In Unix applications, a new line is normally stored as a LF character. Some applications use only a CR character to store a new line.
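
The three conventions can be compared directly. In this sketch (Python's xml.etree.ElementTree is assumed; the <text> element and variable names are illustrative) each document uses a different line ending, yet the application sees the same text.

```python
import xml.etree.ElementTree as ET

windows_style = "<text>line one\r\nline two</text>"  # CR LF
cr_only_style = "<text>line one\rline two</text>"    # CR
unix_style = "<text>line one\nline two</text>"       # LF

# Parse each document and collect the text the application receives.
texts = [
    ET.fromstring(doc).text
    for doc in (windows_style, cr_only_style, unix_style)
]

for text in texts:
    print(repr(text))  # 'line one\nline two' each time
```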

XML Elements

XML Elements are extensible and they have relationships.

XML Elements have simple naming rules.

XML Elements are Extensible

XML documents can be extended to carry more information.

Look at the following XML NOTE example:

<note>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>

Let's imagine that we created an application that extracted the <to>, <from>, and <body> elements from the XML document to produce this output:

MESSAGE

To: Tove
From: Jani

Don't forget me this weekend!

Imagine that the author of the XML document added some extra information to it:

<note>
<date>1999-08-01</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

Should the application break or crash?

No. The application should still be able to find the <to>, <from>, and <body> elements in the XML document and produce the same output.

XML documents are Extensible.
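
Here is one way the imagined application could be written, as a sketch in Python using xml.etree.ElementTree. The function name render_message and the exact output layout are assumptions. Because it looks elements up by name instead of by position, the extra <date> and <heading> elements do not disturb it.

```python
import xml.etree.ElementTree as ET

ORIGINAL = """<note>
<to>Tove</to>
<from>Jani</from>
<body>Don't forget me this weekend!</body>
</note>"""

EXTENDED = """<note>
<date>1999-08-01</date>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""

def render_message(xml_text):
    """Build the MESSAGE output from the to, from, and body elements."""
    note = ET.fromstring(xml_text)
    # findtext looks each element up by name, so unknown elements
    # elsewhere in the note are simply ignored.
    return "MESSAGE\n\nTo: {}\nFrom: {}\n\n{}".format(
        note.findtext("to"), note.findtext("from"), note.findtext("body")
    )

print(render_message(EXTENDED))
print(render_message(ORIGINAL) == render_message(EXTENDED))  # True
```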

XML Elements have Relationships

Elements are related as parents and children.

To understand XML terminology, you have to know how relationships between XML elements are named, and how element content is described.

Imagine that this is a description of a book:

Book Title: My First XML

Chapter 1: Introduction to XML
    What is HTML
    What is XML

Chapter 2: XML Syntax
    Elements must have a closing tag
    Elements must be properly nested

Imagine that this XML document describes the book:

<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>
<chapter>XML Syntax
<para>Elements must have a closing tag</para>
<para>Elements must be properly nested</para>
</chapter>
</book>

Book is the root element. Title and chapter are child elements of book. Book is the parent element of both title and chapter. Title and chapter are siblings (or sister elements) because they have the same parent.
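
These relationships can be explored in code. The sketch below assumes Python's xml.etree.ElementTree, which gives each element a list of children but no link back to its parent, so a parent lookup table (parent_of, a name chosen for this example) is built first.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>
<chapter>XML Syntax
<para>Elements must have a closing tag</para>
<para>Elements must be properly nested</para>
</chapter>
</book>"""

book = ET.fromstring(BOOK_XML)

# Map every element to its parent by walking the whole tree once.
parent_of = {child: parent for parent in book.iter() for child in parent}

title = book.find("title")
children = [child.tag for child in book]
siblings = [el.tag for el in parent_of[title] if el is not title]

print(children)             # ['title', 'prod', 'chapter', 'chapter']
print(parent_of[title].tag) # book
print(siblings)             # ['prod', 'chapter', 'chapter']
```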

Elements have Content

Elements can have different content types.

An XML element is everything from (including) the element's start tag to (including) the element's end tag.

An element can have element content, mixed content, simple content, or empty content. An element can also have attributes.

In the example above, book has element content, because it contains other elements. Chapter has mixed content because it contains both text and other elements. Para has simple content (or text content) because it contains only text. Prod has empty content, because there is nothing between its start tag and its end tag.

In the example above only the prod element has attributes. The attribute named id has the value "33-657". The attribute named media has the value "paper".
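
The four content types can be told apart mechanically. The following is a sketch only, assuming Python's xml.etree.ElementTree; the function content_type is hypothetical, and it makes one simplifying assumption: text that consists solely of white space (such as the line breaks between the child elements of book) does not count as text content.

```python
import xml.etree.ElementTree as ET

BOOK_XML = """<book>
<title>My First XML</title>
<prod id="33-657" media="paper"></prod>
<chapter>Introduction to XML
<para>What is HTML</para>
<para>What is XML</para>
</chapter>
</book>"""

def content_type(element):
    """Classify an element as element, mixed, simple, or empty content."""
    has_children = len(element) > 0
    # Text can sit before the first child (.text) or after any child (.tail).
    pieces = [element.text] + [child.tail for child in element]
    # Assumption: white-space-only text is ignored.
    has_text = any(piece and piece.strip() for piece in pieces)
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "simple"
    return "empty"

book = ET.fromstring(BOOK_XML)
prod = book.find("prod")

print(content_type(book))                       # element
print(content_type(book.find("chapter")))       # mixed
print(content_type(book.find("chapter/para")))  # simple
print(content_type(prod))                       # empty
print(prod.attrib)  # {'id': '33-657', 'media': 'paper'}
```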

Element Naming

XML elements must follow these naming rules:

Names can contain letters, numbers, and other characters
Names must not start with a number or a punctuation character such as "-" or "." (an underscore is allowed)
Names must not start with the letters xml (or XML or Xml ..)
Names cannot contain spaces

Take care when you "invent" element names and follow these simple rules:

Any name can be used and, apart from the xml prefix mentioned above, no words are reserved, but the idea is to make names descriptive. Names with an underscore separator are nice.

Examples: <first_name>, <last_name>.

Avoid "-" and "." in names. It could be a mess if your software tried to subtract name from first (first-name) or think that "name" is a property of the object "first" (first.name).

Element names can be as long as you like, but don't exaggerate. Names should be short and simple, like this: <book_title> not like this: <the_title_of_the_book>.

XML documents often have a parallel database, where field names correspond to element names. A good rule is to use the naming rules of your databases.

Non-English letters like éòá are perfectly legal in XML element names, but watch out for problems if your software vendor doesn't support them.

The ":" should not be used in element names because it is reserved to be used for something called namespaces (more later).
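
If you are unsure whether a name is acceptable, one practical check is to let a parser try it. The sketch below assumes Python's xml.etree.ElementTree, and the helper is_valid_element_name is hypothetical. It only tests what the parser enforces; a parser may well accept names beginning with xml, and it cannot judge the style advice above.

```python
import xml.etree.ElementTree as ET

def is_valid_element_name(name):
    """Return True if the parser accepts name as an element name."""
    try:
        root = ET.fromstring("<{0}></{0}>".format(name))
    except ET.ParseError:
        return False
    # Guard against strings that parse but produce a different tag.
    return root.tag == name

for candidate in ["first_name", "book_title", "first-name", "1st_name", "first name"]:
    print(candidate, is_valid_element_name(candidate))
```

Note that first-name passes the check: a hyphen inside a name is legal XML, and the advice to avoid it is only about preventing confusion in other software.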