XML PCDATA and CDATA
Parsed Character Data (PCDATA) is text that is parsed by the parser.
Character Data (CDATA) is text that is not parsed by the parser.
PCDATA
XML parsers normally treat all text as parsed character data (PCDATA).
When an XML element is parsed, the text between the XML tags is also parsed:
<message>This text is also parsed</message>
The parser does this because XML elements can contain other elements, as in this example, where the <name> element contains two other elements (<first> and <last>):
<name><first>Bill</first><last>Gates</last></name>
The parser will break it up into sub-elements like this:
<name>
  <first>Bill</first>
  <last>Gates</last>
</name>
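As a rough illustration of how a parser splits the <name> element into sub-elements, here is a minimal sketch in Python. The use of Python and its standard xml.etree.ElementTree module is an assumption of this example; the tutorial itself does not depend on any particular parser.

```python
import xml.etree.ElementTree as ET

# The <name> document from the text, written on a single line.
doc = "<name><first>Bill</first><last>Gates</last></name>"

# Parsing the string gives the root element (<name>).
root = ET.fromstring(doc)

# Collect a (tag, text) pair for each sub-element the parser found.
children = [(child.tag, child.text) for child in root]

print(root.tag)
print(children)
```

Iterating over the root element visits its direct children in document order, which is the tree structure shown above.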
Escape Characters
Illegal XML characters have to be replaced by entity references.
If you place a character like "<" inside an XML element, it will generate an error because the parser interprets it as the start of a new element. You cannot write something like this:
<message>if salary < 1000 then</message>
To avoid this, you have to replace the "<" character with an entity reference, like this:
<message>if salary &lt; 1000 then</message>
There are 5 predefined entity references in XML:
&lt;   | <  | less than
&gt;   | >  | greater than
&amp;  | &  | ampersand
&apos; | '  | apostrophe
&quot; | "  | quotation mark
An entity reference always starts with the "&" character and ends with the ";" character.
Note: Only the characters "<" and "&" are strictly illegal in XML text. Apostrophes, quotation marks and greater-than signs are legal, but it is a good habit to escape them. (Inside an attribute value, the quote character that delimits the value must also be escaped.)
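Escaping by hand is error-prone, so it is usually left to a library. The sketch below assumes Python and its standard xml.sax.saxutils module; the sample strings and variable names are made up for illustration.

```python
from xml.sax.saxutils import escape, unescape

raw = "if salary < 1000 then"

# escape() replaces &, < and > with their entity references.
escaped = escape(raw)
message = "<message>" + escaped + "</message>"
print(message)

# Apostrophes and quotation marks are not escaped by default;
# they can be added through the optional entities mapping.
extra = {"'": "&apos;", '"': "&quot;"}
quoted = escape('He said "it\'s < 5 & > 3"', extra)
print(quoted)

# unescape() reverses the default replacements.
restored = unescape(escaped)
```

Here escape() and unescape() are the library's real functions; only the input text is invented.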
CDATA
Everything inside a CDATA section is treated as plain text; the parser does not interpret it as markup.
If your text contains a lot of "<" or "&" characters, as program code often does, the text can be placed in a CDATA section.
A CDATA section starts with "<![CDATA[" and ends with "]]>":
<script>
<![CDATA[
// Returns 1 when a is less than b and a is negative, otherwise 0.
function matchwo(a, b)
{
  if (a < b && a < 0)
  {
    return 1;
  }
  else
  {
    return 0;
  }
}
]]>
</script>
In the previous example, the parser does not interpret anything inside the CDATA section as markup. A CDATA section cannot contain the string "]]>", so CDATA sections cannot be nested.
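To see the effect of a CDATA section, the following sketch parses the same code text with and without one. It assumes Python's standard xml.etree.ElementTree module; the one-line code string is a shortened, hypothetical version of the script above.

```python
import xml.etree.ElementTree as ET

# A shortened version of the script text, full of "<" and "&".
code = "if (a < b && a < 0) { return 1; } else { return 0; }"

# Wrapped in a CDATA section, the text reaches the application unchanged.
doc = "<script><![CDATA[" + code + "]]></script>"
root = ET.fromstring(doc)
print(root.text)

# Without the CDATA section, the same text is not well-formed XML.
try:
    ET.fromstring("<script>" + code + "</script>")
    parsed_without_cdata = True
except ET.ParseError:
    parsed_without_cdata = False

print(parsed_without_cdata)
```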
XML Encoding
XML documents can contain non-ASCII characters like Norwegian æøå or French êèé.
To let your XML parser understand these characters, you should save your XML documents as Unicode.
Windows 95/98 Notepad
Windows 95/98 Notepad cannot save files in Unicode format.
You can use Notepad to edit and save XML documents that contain non-ASCII characters (like Norwegian æøå and French êèé):
<?xml version="1.0"?>
<note>
<from>Jani</from>
<to>Tove</to>
<message>Norwegian: æøå. French: êèé</message>
</note>
But if you save the file and open it with IE 5.0, you will get an error message.
Windows 95/98 Notepad with Encoding
Windows 95/98 Notepad files must be saved with an encoding attribute.
To avoid this error you can add an encoding attribute to your XML declaration, but you cannot use a Unicode encoding, because the file itself is not saved as Unicode.
This encoding will NOT give an error message when the file is opened with IE 5.0:
<?xml version="1.0" encoding="windows-1252"?>
This encoding will NOT give an error message either:
<?xml version="1.0" encoding="ISO-8859-1"?>
This encoding WILL give an error message:
<?xml version="1.0" encoding="UTF-8"?>
This encoding WILL give an error message as well:
<?xml version="1.0" encoding="UTF-16"?>
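The same mismatch can be reproduced outside IE 5.0. The sketch below is only an approximation under stated assumptions: it uses Python's standard xml.etree.ElementTree parser rather than Internet Explorer, simulates a single-byte editor by encoding the text as ISO-8859-1, and the helper name try_parse is hypothetical.

```python
import xml.etree.ElementTree as ET

text = "Norwegian: æøå. French: êèé"
body = "<note><message>" + text + "</message></note>"

# Simulate an editor that can only save single-byte (ISO-8859-1) text.
saved_bytes = body.encode("iso-8859-1")


def try_parse(declaration: bytes) -> bool:
    """Return True if the saved bytes parse and the text survives intact."""
    try:
        root = ET.fromstring(declaration + saved_bytes)
    except ET.ParseError:
        return False
    return root.findtext("message") == text


results = {
    # No encoding attribute: the parser assumes UTF-8 and rejects the bytes.
    "none": try_parse(b'<?xml version="1.0"?>'),
    # The declaration matches the bytes on disk, so parsing succeeds.
    "ISO-8859-1": try_parse(b'<?xml version="1.0" encoding="ISO-8859-1"?>'),
    # The declaration claims UTF-8, but the bytes are not valid UTF-8.
    "UTF-8": try_parse(b'<?xml version="1.0" encoding="UTF-8"?>'),
}
print(results)
```

The error text differs from the IE 5.0 messages, but the cause is the same: the declared encoding has to match the bytes that were actually saved.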
Windows 2000 Notepad
Windows 2000 Notepad can save files as Unicode.
The Notepad editor in Windows 2000 supports Unicode. If you choose to save this XML file as Unicode (note that the document does not contain any encoding attribute):
<?xml version="1.0"?>
<note>
<from>Jani</from>
<to>Tove</to>
<message>Norwegian: æøå. French: êèé</message>
</note>
you can open it with IE 5.0 without getting an error message.
Error Messages
If you try to load an XML document into Internet Explorer 5, you can get two different errors indicating encoding problems:
An invalid character was found in text content.
You will get this error message if a character in the XML document does not match the encoding attribute. Normally you will get this error message if your XML document contains non-ASCII characters, the file was saved with a single-byte encoding editor like Notepad, and no encoding attribute was specified.
Switch from current encoding to specified encoding not supported.
You will get this error message if your file was saved as Unicode/UTF-16 but the encoding attribute specified an 8-bit encoding like Windows-1252, ISO-8859-1 or UTF-8. You can also get this error message if your document was saved with an 8-bit encoding, but the encoding attribute specified a 16-bit encoding like UTF-16.
Conclusion
The conclusion is that the encoding attribute has to specify the encoding used when the document was saved. My best advice to avoid errors is this:
Always save XML files as Unicode, without any encoding information.
Use an editor that supports Unicode (Windows 2000 Notepad does) and always skip the encoding attribute.
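When a program rather than an editor writes the file, one way to keep the declaration and the saved bytes in agreement is to let the serializer produce both. This is a minimal sketch assuming Python's standard xml.etree.ElementTree module; the element names come from the note example above, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

note = ET.Element("note")
message = ET.SubElement(note, "message")
message.text = "Norwegian: æøå. French: êèé"

# Serializing to ISO-8859-1 writes a declaration naming that encoding.
latin1_bytes = ET.tostring(note, encoding="ISO-8859-1")
first_line = latin1_bytes.split(b"\n", 1)[0]
print(first_line)

# Serializing to UTF-8, the encoding XML parsers assume when no
# declaration or byte order mark says otherwise.
utf8_bytes = ET.tostring(note, encoding="UTF-8")

# Both byte strings parse back to the original text.
roundtrip_latin1 = ET.fromstring(latin1_bytes).findtext("message")
roundtrip_utf8 = ET.fromstring(utf8_bytes).findtext("message")
```

Either output is safe to load because the encoding the parser assumes or is told about matches the bytes that were written.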