XML PCDATA and CDATA 

Parsed Character Data (PCDATA) is text that is parsed by the parser.

Character Data (CDATA) is text that is not parsed by the parser.

 

PCDATA

XML parsers normally treat all text in an XML document as Parsed Character Data (PCDATA).

When an XML element is parsed, the text between the XML tags is also parsed:

<message>This text is also parsed</message>

The parser does this because XML elements can contain other elements, like in this example, where the <name> element contains two other elements (first and last):

<name><first>Bill</first><last>Gates</last></name>

and the parser will break it up into sub-elements like this:

<name>
   <first>Bill</first>
   <last>Gates</last>
</name>
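As a rough illustration of the break-up described above, the following sketch parses the single-line <name> example with Python's standard xml.etree.ElementTree module and lists the sub-elements the parser finds. The choice of Python and of this particular parser is an assumption made for the example; the text itself does not depend on any specific parser.

```python
import xml.etree.ElementTree as ET

# The nested example from the text, written on a single line.
root = ET.fromstring("<name><first>Bill</first><last>Gates</last></name>")

# The parser exposes <first> and <last> as child elements of <name>.
children = [(child.tag, child.text) for child in root]

print(root.tag)   # name
print(children)   # [('first', 'Bill'), ('last', 'Gates')]
```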

Escape Characters

Illegal XML characters have to be escaped by entity references.

If you place a character like "<" inside an XML element, it will generate an error because the parser interprets it as the start of a new element. You cannot write something like this:

<message>if salary < 1000 then</message>

To avoid this, you have to escape the "<" character with an entity reference, like this:

<message>if salary &lt; 1000 then</message>
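To see the difference in practice, the sketch below feeds both versions of the <message> element to Python's xml.etree.ElementTree parser. The helper name try_parse is hypothetical and exists only for this example; it returns None when the parser rejects the document.

```python
import xml.etree.ElementTree as ET

def try_parse(xml_text):
    """Return the element text, or None if the document is not well-formed."""
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError:
        return None

# The raw "<" is read as the start of a new tag, so parsing fails.
unescaped = try_parse("<message>if salary < 1000 then</message>")

# The entity reference is accepted and turned back into "<" in the text.
escaped = try_parse("<message>if salary &lt; 1000 then</message>")

print(unescaped)  # None
print(escaped)    # if salary < 1000 then
```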

There are 5 predefined entity references in XML:

&lt;      <    less than
&gt;      >    greater than
&amp;     &    ampersand
&apos;    '    apostrophe
&quot;    "    quotation mark


Entity references always start with the "&" character and end with the ";" character.

Note: Only the characters "<" and "&" are strictly illegal in XML text. Apostrophes, quotation marks and greater-than signs are legal in element content, but it is a good habit to escape them.
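When XML is generated by a program, the escaping is usually left to a library rather than done by hand. The sketch below uses escape and unescape from Python's standard xml.sax.saxutils module. By default escape only handles "&", "<" and ">"; the quotes mapping, the sample string and the variable names are assumptions made for this example.

```python
from xml.sax.saxutils import escape, unescape

raw = 'if a < b && name == "O\'Neil"'

# escape() always replaces &, < and >; the extra mapping adds both quote characters.
quotes = {'"': "&quot;", "'": "&apos;"}
encoded = escape(raw, quotes)

# unescape() reverses the process when given the inverse mapping.
decoded = unescape(encoded, {"&quot;": '"', "&apos;": "'"})

print(encoded)  # if a &lt; b &amp;&amp; name == &quot;O&apos;Neil&quot;
print(decoded == raw)  # True
```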

 

CDATA

Text inside a CDATA section is not parsed as markup; the parser passes it through as plain character data.

If your text contains a lot of "<" or "&" characters, as program code often does, the text can be placed in a CDATA section.

A CDATA section starts with "<![CDATA[" and ends with "]]>":

<script>
<![CDATA[
function matchwo(a,b)
{
if (a < b && a < 0)
   {
   return 1;
   }
else
   {
   return 0;
   }
}
]]>
</script>

In the previous example, the parser does not interpret anything inside the CDATA section as markup. A CDATA section cannot contain the string "]]>", so CDATA sections cannot be nested.
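The effect of a CDATA section can be checked with a parser. In this sketch, which assumes Python's xml.etree.ElementTree and a shortened one-line version of the script body, the "<" and "&" characters come back unchanged as the text of the element, and no child elements are created.

```python
import xml.etree.ElementTree as ET

doc = """<script><![CDATA[
if (a < b && a < 0) { return 1 } else { return 0 }
]]></script>"""

script = ET.fromstring(doc)

# The CDATA content is delivered as ordinary text, markup characters included.
print(script.text.strip())  # if (a < b && a < 0) { return 1 } else { return 0 }
print(len(script))          # 0
```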

 

XML Encoding

XML documents can contain non-ASCII characters like Norwegian æøå, or French êèé.

To let your XML parser understand these characters, you should save your XML documents as Unicode.

 

Windows 95/98 Notepad

Windows 95/98 Notepad cannot save files in Unicode format.

You can use Notepad to edit and save XML documents that contain foreign characters (like Norwegian æøå or French êèé):

<?xml version="1.0"?>
<note>
  <from>Jani</from>
  <to>Tove</to>
  <message>Norwegian: æøå. French: êèé</message>
</note>

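But if you save the file and open it with IE 5.0, you will get an error message, because the file contains no encoding attribute and the parser therefore assumes a Unicode encoding that the saved bytes do not match.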


Windows 95/98 Notepad with Encoding

XML files saved with Windows 95/98 Notepad need an encoding attribute in the XML declaration.

To avoid this error you can add an encoding attribute to your XML declaration, but you cannot specify a Unicode encoding, because the file is not saved as Unicode.

With this encoding, opening the file in IE 5.0 will NOT give an error message:

<?xml version="1.0" encoding="windows-1252"?>

With this encoding, opening the file in IE 5.0 will NOT give an error message:

<?xml version="1.0" encoding="ISO-8859-1"?>

With this encoding, opening the file in IE 5.0 WILL give an error message:

<?xml version="1.0" encoding="UTF-8"?>

With this encoding, opening the file in IE 5.0 WILL give an error message:

<?xml version="1.0" encoding="UTF-16"?>
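The same behaviour can be reproduced outside IE 5.0. The sketch below is an approximation that assumes Python's xml.etree.ElementTree (built on the expat parser) and uses ISO-8859-1 to stand in for the single-byte encoding written by the editor. The helper name parse_text is hypothetical. The parser accepts the bytes only when the encoding attribute matches the way they were saved.

```python
import xml.etree.ElementTree as ET

body = "<message>Norwegian: æøå. French: êèé</message>"

def parse_text(declaration):
    """Save declaration + body as ISO-8859-1 bytes and try to parse them."""
    data = (declaration + body).encode("latin-1")
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

# Declared encoding matches the saved bytes: parsing succeeds.
matching = parse_text('<?xml version="1.0" encoding="ISO-8859-1"?>')

# No encoding attribute: the parser assumes UTF-8 and rejects the bytes.
undeclared = parse_text('<?xml version="1.0"?>')

# UTF-8 declared, but the bytes are not valid UTF-8: parsing fails.
wrong_utf8 = parse_text('<?xml version="1.0" encoding="UTF-8"?>')

print(matching is not None)  # True
print(undeclared is None)    # True
print(wrong_utf8 is None)    # True
```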

Windows 2000 Notepad

Windows 2000 Notepad can save files as Unicode.

The Notepad editor in Windows 2000 supports Unicode. If you choose to save this XML file as Unicode (note that the document does not contain any encoding attribute):

<?xml version="1.0"?>
<note>
  <from>Jani</from>
  <to>Tove</to>
  <message>Norwegian: æøå. French: êèé</message>
</note>

you can open it with IE 5.0, WITHOUT getting an error message.
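A file saved as Unicode in this way is normally UTF-16 with a byte order mark at the start, which lets a parser detect the encoding without any encoding attribute. The sketch below imitates that situation in Python. It assumes that the "utf-16" codec is a fair stand-in for the editor's Unicode format and that xml.etree.ElementTree stands in for the browser's parser.

```python
import xml.etree.ElementTree as ET

doc = '<?xml version="1.0"?><message>Norwegian: æøå. French: êèé</message>'

# Python's "utf-16" codec prepends a byte order mark to the encoded bytes.
data = doc.encode("utf-16")
has_bom = data[:2] in (b"\xff\xfe", b"\xfe\xff")

# The parser detects UTF-16 from the byte order mark alone.
root = ET.fromstring(data)

print(has_bom)                                     # True
print(root.text == "Norwegian: æøå. French: êèé")  # True
```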

 

Error Messages

If you try to load an XML document into Internet Explorer 5, you can get two different errors indicating encoding problems:

An invalid character was found in text content.

You will get this error message if a character in the XML document does not match the encoding attribute. Normally you will get this error message if your XML document contains "foreign" characters, the file was saved with an editor that uses a single-byte encoding (like Windows 95/98 Notepad), and no encoding attribute was specified.


Switch from current encoding to specified encoding not supported.

You will get this error message if your file was saved as Unicode/UTF-16 but the encoding attribute specified an 8-bit encoding like Windows-1252, ISO-8859-1 or UTF-8 (UTF-8 is a variable-length encoding built on 8-bit units, not a single-byte encoding). You can also get this error message if your document was saved with an 8-bit encoding, but the encoding attribute specified a 16-bit encoding like UTF-16.
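Other parsers reject the same mismatches, although the wording of their error messages differs from the IE 5 messages quoted above. The following sketch assumes Python's xml.etree.ElementTree (expat) as the parser; the helper name parses is hypothetical. It tries both mismatch directions described above, plus one matching case for comparison.

```python
import xml.etree.ElementTree as ET

def parses(data):
    """Return True if the bytes form a document the parser accepts."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

body = "<message>æøå</message>"

# Saved as UTF-16, but the declaration names a single-byte encoding.
utf16_declared_latin1 = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>' + body
).encode("utf-16")

# Saved with a single-byte encoding, but the declaration names UTF-16.
latin1_declared_utf16 = (
    '<?xml version="1.0" encoding="UTF-16"?>' + body
).encode("latin-1")

# Saved as UTF-16 and declared as UTF-16: declaration and bytes agree.
utf16_declared_utf16 = (
    '<?xml version="1.0" encoding="UTF-16"?>' + body
).encode("utf-16")

print(parses(utf16_declared_latin1))  # False
print(parses(latin1_declared_utf16))  # False
print(parses(utf16_declared_utf16))   # True
```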

Conclusion

The conclusion is that the encoding attribute has to specify the encoding used when the document was saved. My best advice to avoid errors is this:

Always save XML files as Unicode, without any encoding information.

Use an editor that supports Unicode (Windows 2000 Notepad does) and always skip the encoding attribute.