DTD/CDATA

XML DTD

A DTD defines the legal elements of an XML document.

The purpose of a DTD is to define the legal building blocks of an XML document. It defines the document structure with a list of legal elements.

XML Schema

XML Schema is an XML-based alternative to DTD.

W3C supports an alternative to DTD called XML Schema.

Errors will Stop you

Errors in XML documents will stop the XML program.

The W3C XML specification states that a program must not continue normal processing of an XML document once it finds a well-formedness error (such as a missing end tag). The reason is that XML software should be easy to write, and that all XML documents should be compatible.

With HTML it was possible to create documents with lots of errors (like when you forget an end tag). One of the main reasons that HTML browsers are so big and incompatible is that they have their own ways to figure out what a document should look like.

With XML this should not be possible.
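
As a rough illustration of this rule, the sketch below uses Python's standard xml.etree.ElementTree module (chosen here for illustration only; the text above does not prescribe any particular parser) to show a parser refusing a document with a missing end tag. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: str) -> bool:
    """Return True if the parser accepts the document, False if it stops on an error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        # The parser gives up at the first error instead of guessing a repair.
        return False


good = "<note><to>Tove</to><from>Jani</from></note>"
bad = "<note><to>Tove<from>Jani</from></note>"  # the </to> end tag is missing

print(is_well_formed(good))
print(is_well_formed(bad))
```

An HTML browser would typically try to repair the second document; an XML parser reports the error and stops.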

XML in Netscape Navigator

Netscape has promised full XML support in its next browser.

We hope that Netscape will include standard support for W3C XML in the next version of its browser.

Based on previous experience we can only hope that Navigator and Explorer will be compatible in the future XML field.

Your option at the moment - if you want to work with cross browser XML - is to work with XML on your server and transform your XML to HTML before it is sent to the browser.

XML in Internet Explorer 5.0

Internet Explorer 5.0 supports the XML 1.0 standard.

Internet Explorer 5.0 supports most of the international standards for XML 1.0 and the XML DOM (Document Object Model). These standards are set by the World Wide Web Consortium (W3C).

Internet Explorer 5.0 has the following XML support:

Viewing of XML documents
Full support for W3C DTD standards
XML embedded in HTML as Data Islands
Binding XML data to HTML elements
Transforming and displaying XML with XSL
Displaying XML with CSS
Access to the XML DOM

Internet Explorer 5.0 also has support for Behaviors:

Behaviors are a Microsoft-only technology.
Behaviors can separate scripts from an HTML page.
Behaviors can store XML data on the client's disk.

Viewing XML files with IE 5.0

Raw XML files can be viewed in Internet Explorer 5.0, but to make it display like a home page you have to add extra display information.

You must have Internet Explorer 5.0 or later to view the example XML files.

Viewing XML with Internet Explorer 5.0

You can use IE 5.0 to view any XML document.

To view an XML document, you can click on a link, type the URL in the address bar, or double-click on the name of an XML file in a files folder.

If you open an XML document in IE, it will display the document with color coded root and child elements. A plus (+) or minus sign (-) to the left of the elements can be clicked to expand or collapse the element structure.

If you want to view the raw XML source, you must select "View Source" from the browser menu.

Note: Do not expect the XML file to be formatted like an HTML document!

Viewing an invalid XML file

If an erroneous XML file is opened with IE, IE will report the error.

Why does XML display like this?

XML documents do not carry information about how to display the data.

Since XML tags are "made up" or "invented" by the author of the XML document, a browser cannot know whether a tag like <table> describes an HTML table or a wooden kitchen table.

Without any information about how to display the data, most browsers will just display the XML document as it is.

Displaying XML with CSS

With CSS you can add display information to an XML document.

XML PCDATA and CDATA

Parsed Character Data (PCDATA) is text that is parsed by the parser.

Character Data (CDATA) is text that is not parsed by the parser.

PCDATA

XML parsers normally treat all the text in an XML document as parsed character data (PCDATA).

When an XML element is parsed, the text between the XML tags is also parsed:

<message>This text is also parsed</message>

The parser does this because XML elements can contain other elements, like in this example, where the <name> element contains two other elements (first and last):

<name><first>Bill</first><last>Gates</last></name>

and the parser will break it up into sub-elements like this:

<name>
<first>Bill</first>
<last>Gates</last>
</name>
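
The same breakdown can be observed from a program. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<name><first>Bill</first><last>Gates</last></name>")

# Each sub-element becomes a separate child node of <name>,
# carrying its own tag name and text content.
children = [(child.tag, child.text) for child in root]

print(root.tag)
print(children)
```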

Escape Characters

Illegal XML characters have to be escaped by entity references.

If you place a character like "<" inside an XML element, it will generate an error because the parser interprets it as the start of a new element. You cannot write something like this:

<message>if salary < 1000 then</message>

To avoid this, you have to escape the "<" character with an entity reference, like this:

<message>if salary &lt; 1000 then</message>

There are 5 predefined entity references in XML:

&lt;    <    less than
&gt;    >    greater than
&amp;   &    ampersand
&apos;  '    apostrophe
&quot;  "    quotation mark

Entity references always start with the "&" character and end with the ";" character.

Note: Only the characters "<" and "&" are strictly illegal in XML. Apostrophes, quotation marks and greater than signs are legal, but it is a good habit to escape them.
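
Escaping is usually left to a library rather than done by hand. The sketch below assumes Python's standard library: xml.sax.saxutils.escape replaces "&", "<" and ">" with entity references, and the parser turns them back into the original characters. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "if salary < 1000 then"

# escape() replaces "&", "<" and ">" with their entity references.
escaped = escape(raw_text)
document = "<message>" + escaped + "</message>"

# When the document is parsed, the entity reference becomes "<" again.
parsed_text = ET.fromstring(document).text

# The unescaped version is rejected by the parser.
try:
    ET.fromstring("<message>" + raw_text + "</message>")
    unescaped_ok = True
except ET.ParseError:
    unescaped_ok = False

print(escaped)
print(parsed_text)
print(unescaped_ok)
```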

CDATA

The parser does not interpret the text inside a CDATA section as markup.

If your text contains a lot of "<" or "&" characters - like program code often does - the text can be placed inside a CDATA section.

A CDATA section starts with "<![CDATA[" and ends with "]]>":
<script>
<![CDATA[
function matchwo(a,b)
{
if (a < b && a < 0)
{
return 1
}
else
{
return 0
}
}
]]>
</script>

In the previous example, the parser treats everything inside the CDATA section as plain text, so the "<" and "&" characters do not have to be escaped.
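
A quick way to check this behavior is to parse a CDATA section and compare the result with the original text. This is a sketch using Python's standard xml.etree.ElementTree; the one-line code string is a shortened, made-up stand-in for the script above.

```python
import xml.etree.ElementTree as ET

code = "if (a < b && a < 0) { return 1 }"

# Inside a CDATA section the "<" and "&" characters are passed through as text.
document = "<script><![CDATA[" + code + "]]></script>"
script_text = ET.fromstring(document).text

# Without the CDATA section the same text is not well-formed XML.
try:
    ET.fromstring("<script>" + code + "</script>")
    plain_ok = True
except ET.ParseError:
    plain_ok = False

print(script_text == code)
print(plain_ok)
```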

XML Encoding

XML documents can contain foreign characters like Norwegian æøå, or French êèé.

To let your XML parser understand these characters, you should save your XML documents as Unicode.

Windows 95/98 Notepad

Windows 95/98 Notepad cannot save files in Unicode format.

You can use Notepad to edit and save XML documents that contain foreign characters (like Norwegian or French æøå and êèé):

<?xml version="1.0"?>
<note>
<from>Jani</from>
<to>Tove</to>
<message>Norwegian: æøå. French: êèé</message>
</note>

But if you save the file and open it with IE 5.0, you will get an ERROR MESSAGE.

Windows 95/98 Notepad with Encoding

Windows 95/98 Notepad files must be saved with an encoding attribute.

To avoid this error you can add an encoding attribute to your XML declaration, but you cannot use Unicode.

This encoding (open the file with IE 5.0) will NOT give an error message:

<?xml version="1.0" encoding="windows-1252"?>

This encoding (open the file with IE 5.0) will NOT give an error message:

<?xml version="1.0" encoding="ISO-8859-1"?>

This encoding (open the file with IE 5.0) WILL give an error message:

<?xml version="1.0" encoding="UTF-8"?>

This encoding (open the file with IE 5.0) WILL give an error message:

<?xml version="1.0" encoding="UTF-16"?>

Windows 2000 Notepad

Windows 2000 Notepad can save files as Unicode.

The Notepad editor in Windows 2000 supports Unicode. If you select to save this XML file as Unicode (note that the document does not contain any encoding attribute):

<?xml version="1.0"?>
<note>
<from>Jani</from>
<to>Tove</to>
<message>Norwegian: æøå. French: êèé</message>
</note>

you can open it with IE 5.0, WITHOUT getting an error message.

Error Messages

If you try to load an XML document into Internet Explorer 5, you can get two different errors indicating encoding problems:

An invalid character was found in text content.

You will get this error message if a character in the XML document does not match the encoding attribute. Normally you will get this error message if your XML document contains "foreign" characters, and the file was saved with a single-byte encoding editor like Notepad, and no encoding attribute was specified.

Switch from current encoding to specified encoding not supported.

You will get this error message if your file was saved as Unicode/UTF-16 but the encoding attribute specified an 8-bit encoding like Windows-1252, ISO-8859-1 or UTF-8. You can also get this error message if your document was saved with an 8-bit encoding, but the encoding attribute specified a 16-bit encoding like UTF-16.
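
The first kind of error (a character that does not match the declared or assumed encoding) can be reproduced outside the browser. The sketch below is an assumption-laden illustration using Python's standard xml.etree.ElementTree parser rather than IE 5.0, so the error text differs, but the cause is the same. The helper name try_parse is hypothetical.

```python
import xml.etree.ElementTree as ET

# "Norwegian: æøå. French: êèé", written with escapes so this source file stays ASCII.
message = "Norwegian: \u00e6\u00f8\u00e5. French: \u00ea\u00e8\u00e9"
body = "<note><message>" + message + "</message></note>"


def try_parse(data: bytes):
    """Return the message text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(data).findtext("message")
    except ET.ParseError:
        return None


# Saved as ISO-8859-1 with no encoding attribute: the parser assumes UTF-8 and fails.
undeclared = ('<?xml version="1.0"?>' + body).encode("iso-8859-1")

# Saved as ISO-8859-1 and declared as ISO-8859-1: the bytes match the declaration.
declared = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")

# Saved as UTF-8 with no encoding attribute: UTF-8 is the default, so this works.
unicode_saved = ('<?xml version="1.0"?>' + body).encode("utf-8")

print(try_parse(undeclared) is None)
print(try_parse(declared) == message)
print(try_parse(unicode_saved) == message)
```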

Conclusion

The conclusion is that the encoding attribute has to specify the encoding used when the document was saved. My best advice to avoid errors is this:

Always save XML files as Unicode, without any encoding information.

Use an editor that supports Unicode (Windows 2000 Notepad does) and always skip the encoding attribute.

CDATA sections

A CDATA section is a block of text in which you can freely insert any characters except the sequence ]]>, which ends the section:

<TITLE>The Adventures of Huckleberry Finn
Author: Mike
<![CDATA[
Document Name: “How to enter the < and & character”
]]>
</TITLE>

Processing Instructions

Comments

Empty elements

An empty element is one with no content.

You can write it as <HR></HR> or as <HR/>.

When do we use empty elements?

To tell XML to perform an action or display an object

To store information through attributes (like the IMG empty element)
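
To a parser the two ways of writing an empty element are equivalent. The following sketch, assuming Python's standard xml.etree.ElementTree, checks this and reads an attribute from an empty element; the helper name is_empty is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_empty(element) -> bool:
    """An empty element has no text content and no child elements."""
    return element.text is None and len(element) == 0


long_form = ET.fromstring("<HR></HR>")
short_form = ET.fromstring("<HR/>")

# An empty element can still carry information in its attributes.
cover = ET.fromstring('<COVER_IMAGE Source="Huck.gif" />')
source = cover.get("Source")

print(is_empty(long_form), is_empty(short_form))
print(source)
```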

Create Different Types of Elements

<?xml version="1.0"?>

<?xml-stylesheet type="text/css" href="Inventory02.css"?>
<INVENTORY> 
<BOOK>
<COVER_IMAGE Source="Huck.gif" />
<TITLE>The Adventures of Huckleberry Finn</TITLE>
<AUTHOR>Mark Twain</AUTHOR>
<BINDING>mass market paperback</BINDING>
<PAGES>298</PAGES>
<PRICE>$5.49</PRICE>
</BOOK>
<BOOK>
<COVER_IMAGE Source="Moby.gif" />
<TITLE>
Moby-Dick
<SUBTITLE>Or, the Whale</SUBTITLE>
</TITLE>
<AUTHOR>Herman Melville</AUTHOR>
<BINDING>hardcover</BINDING>
<PAGES>724</PAGES>
<PRICE>$9.95</PRICE>
</BOOK>
</INVENTORY>

Can you identify the types of elements it uses?

Why is the image not displayed? To display such an element, you need to open the document from an HTML page or use an XSL stylesheet.

Adding Attributes to Elements

In the start-tag of an element you can include one or more attribute specifications.

<BOOK Category="Fiction" Display="emphasize">
<COVER_IMAGE Source="Moby.gif" />
<TITLE>
Moby-Dick
<SUBTITLE>Or, the Whale</SUBTITLE>
</TITLE>
<AUTHOR>Herman Melville</AUTHOR>
<BINDING>hardcover</BINDING>
<PAGES>724</PAGES>
<PRICE Type="retail">$9.95</PRICE>
</BOOK>

We have the attributes:

Category
Display
Source
Type

A particular attribute name can appear only once in the same start-tag or empty-element tag.

<ANIMATION FileName="Waldo1.ani" FileName="Waldo2.ani"> //duplicate attribute name
<LIST 2stPlace="Sam"> //digit not allowed as first character
<ITEM A:Category="cookware"> //only allowed if A is declared as a namespace prefix

Namespaces are used to differentiate attributes that have the same name.
The value you assign to an attribute is a series of characters delimited with quotes.

<ANIMATION FileName=""Waldo1.ani""> //cannot use the delimiting quote character within the value
<LIST 2stPlace="<Sam>"> //cannot use < within the value
<ITEM A:Category="cookware & hardware"> //cannot use & except to start a reference
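
These rules can be checked by handing each start-tag to a parser. The sketch below assumes Python's standard xml.etree.ElementTree; the tags are rewritten as empty elements so that each string is a complete document, and the namespace URI urn:example:a and the attribute name Place are made up for illustration.

```python
import xml.etree.ElementTree as ET


def parses(document: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


examples = {
    "duplicate name": '<ANIMATION FileName="Waldo1.ani" FileName="Waldo2.ani"/>',
    "digit first": '<LIST 2stPlace="Sam"/>',
    "undeclared prefix": '<ITEM A:Category="cookware"/>',
    # urn:example:a is a hypothetical namespace URI used only for this sketch.
    "declared prefix": '<ITEM xmlns:A="urn:example:a" A:Category="cookware"/>',
    "quote inside value": '<ANIMATION FileName=""Waldo1.ani""/>',
    "< inside value": '<LIST Place="<Sam>"/>',
    "bare &": '<ITEM Category="cookware & hardware"/>',
    "escaped &": '<ITEM Category="cookware &amp; hardware"/>',
}

results = {label: parses(document) for label, document in examples.items()}

for label, accepted in results.items():
    print(label, "accepted" if accepted else "rejected")
```

Only the version with a declared prefix and the version with an escaped ampersand are accepted by this parser.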

Remember that one advantage of storing data in elements rather than in attributes is that you have far more control over elements than over attribute values.

<?xml version="1.0"?>

<?xml-stylesheet type="text/css" href="Inventory02.css"?>
<INVENTORY>
<BOOK Binding="mass market paperback">
<TITLE>The Adventures of Huckleberry Finn</TITLE>
<AUTHOR Born="1835">Mark Twain</AUTHOR>
<PAGES>298</PAGES>
<PRICE>$5.49</PRICE>
</BOOK>
<BOOK Binding="hardcover">
<TITLE>Leaves of Grass</TITLE>
<AUTHOR Born="1819">Walt Whitman</AUTHOR>
<PAGES>462</PAGES>
<PRICE>$7.75</PRICE>
</BOOK>
<BOOK Binding="trade paperback">
<TITLE>The Marble Faun</TITLE>
<AUTHOR Born="1804">Nathaniel Hawthorne</AUTHOR>
<PAGES>473</PAGES>
<PRICE>$10.95</PRICE>
</BOOK>
<BOOK Binding="hardcover">
<TITLE>Moby-Dick</TITLE>
<AUTHOR Born="1819">Herman Melville</AUTHOR>
<PAGES>724</PAGES>
<PRICE>$9.95</PRICE>
</BOOK>
</INVENTORY>

Here the BINDING element has been converted to a Binding attribute. You would do this if you wanted to store the binding information but did not want to display it.
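
Data stored in an attribute is still available to a program that reads the document. As a minimal sketch, assuming Python's standard xml.etree.ElementTree and a shortened copy of the inventory above, element content and attribute values can be read side by side:

```python
import xml.etree.ElementTree as ET

inventory = ET.fromstring("""
<INVENTORY>
  <BOOK Binding="mass market paperback">
    <TITLE>The Adventures of Huckleberry Finn</TITLE>
    <AUTHOR Born="1835">Mark Twain</AUTHOR>
  </BOOK>
  <BOOK Binding="hardcover">
    <TITLE>Leaves of Grass</TITLE>
    <AUTHOR Born="1819">Walt Whitman</AUTHOR>
  </BOOK>
</INVENTORY>
""")

rows = []
for book in inventory.findall("BOOK"):
    title = book.findtext("TITLE")          # element content
    binding = book.get("Binding")           # attribute on <BOOK>
    born = book.find("AUTHOR").get("Born")  # attribute on <AUTHOR>
    rows.append((title, binding, born))

for row in rows:
    print(row)
```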