Special Characters, Unicode and XML

XMetal 2 and 3 provide Special Characters and Symbol toolbars which feature many of the most commonly required non-Latin characters. If you need to produce a character in your XML not found in these toolbars, construct a “character reference” for the character as explained below.

A special character can be very broadly defined as any non-Latin character that you want to include in an XML file. These characters cannot be inserted as-is into an XML file without breaking attempts to display the file or any HTML translation of it. They need to be represented in XML in a standard way known as a “character reference”: a numeric value that makes it possible to preserve the character information across platforms, languages and document iterations. Character references are separate from, but related to, “entity references”, which are easier to remember since they approximate words but are of limited use since Netscape Navigator ignores them.

XMetal 2 and 3 will insert characters and character references for many of the most commonly required special characters via a set of toolbars within the software. There may be times you need to represent a character not included in these toolbars. You can type character references into your XML documents by hand via the directions below.

TARO participants decided early in the planning stages to use Unicode character references when necessary. Character reference tables that list Unicode numeric values in “decimal” and “hexadecimal” notation are accessible on the Internet. The hexadecimal value should be used for special characters in XML finding aids prepared for submission to the TARO Archive. The most complete charts, listing “hex” values, are available here in PDF format:

The four-character alphanumeric value you find for a character in these tables is pasted into your document using this notation, where XXXX stands for the hexadecimal value:

&#xXXXX;

example   hexadecimal notation character reference   entity reference (for comparison and completeness; DO NOT USE THIS)   description
…         &#x2026;                                   &hellip;                                                              horizontal ellipsis
&         &#x0026;                                   &amp;                                                                 ampersand
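If you would rather not look up each value by hand, the hexadecimal notation can be generated from the character itself. The following is only a minimal sketch in Python; the function name hex_char_ref is an illustrative assumption and is not part of XMetal or any TARO tool.

```python
def hex_char_ref(ch: str) -> str:
    """Return the hexadecimal character reference for a single character.

    Hypothetical helper for illustration only.
    """
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # ord() gives the Unicode code point; pad to at least four hex digits.
    return f"&#x{ord(ch):04X};"


print(hex_char_ref("\u2026"))  # horizontal ellipsis -> &#x2026;
print(hex_char_ref("&"))       # ampersand -> &#x0026;
```

The result should still be checked against the Unicode charts, since the sketch only reports the code point of whatever character it is given.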

It needs to be stressed that this coding in no way ensures proper display of the special character in any given environment. Keep this in mind as you decide how to represent any given character in your XML document. It is still the case that beyond the extended Latin character set, many characters will fail to display on most systems.

Error Checking Special Character Displays

What browser are you using?

When we receive emails or calls regarding characters not displaying correctly, this is the first question we now ask. Netscape Navigator has a particular problem displaying well-formed, standard entity references (i.e., things like &ldquo; and &rdquo; and &hellip;) which are rendered without incident in Opera or Internet Explorer. We STRONGLY suggest that any worthwhile error checking be done with something other than Navigator.

In reference to the TARO project, Apex inserted proper Unicode values (in decimal notation, such as &#165; to represent yen) for most, but not all, special characters. We will leave it to the repositories’ discretion as to whether they wish to replace the Apex decimal notation with the preferred hexadecimal.
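For repositories that do decide to replace the decimal notation, the conversion is mechanical. The sketch below shows one possible approach in Python; the names DECIMAL_REF and decimal_refs_to_hex are assumptions made for illustration, and the code should be tried on a copy of a finding aid rather than on the original file.

```python
import re

# Matches decimal character references such as &#165;
DECIMAL_REF = re.compile(r"&#(\d+);")


def decimal_refs_to_hex(text: str) -> str:
    """Rewrite decimal character references as hexadecimal ones.

    Hypothetical helper for illustration only.
    """
    def to_hex(match):
        # Convert the decimal digits to at least four uppercase hex digits.
        return f"&#x{int(match.group(1)):04X};"

    return DECIMAL_REF.sub(to_hex, text)


print(decimal_refs_to_hex("Price: &#165;500"))  # Price: &#x00A5;500
```

References already written in hexadecimal are left untouched, because the pattern only matches digits directly after the pound sign.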

Repositories need to examine their XML files for two distinct types of errors:

  1. Special characters left uncoded by Apex
    These may be difficult to detect without comparison to the paper originals.
  2. Special characters miscoded by Apex
    Here we are referring to instances where the Unicode data for a special character is included in the XML file but it is either misapplied or misspelled. One example which Apex consistently miscoded is the ampersand in Texas A&M. Unfortunately, this type of error is harder to detect. A misapplied Unicode representation will not break the raw XML. A typographical error in the Unicode data will not render as the special character; it will simply appear as text within the XML document. One possible solution is to load the XML file into Internet Explorer 5 or higher and use the browser’s “FIND” function to scan for the pound sign (#) within the document.
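The pound sign scan described above can also be roughed out in a script. This is a sketch under stated assumptions, not a validated checker: the function names find_suspect_hashes and find_uncoded and the sample text are hypothetical, and the second function only finds non-ASCII characters that are still present in the file, so it cannot replace comparison with the paper originals for characters that were dropped or substituted.

```python
import re

# A well-formed numeric character reference: &#165; or &#x00A5;
WELL_FORMED = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")


def find_suspect_hashes(text: str) -> list:
    """Return (line number, line) for each line with a '#' outside a valid reference."""
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Remove valid references; any '#' left over deserves a look.
        if "#" in WELL_FORMED.sub("", line):
            suspects.append((number, line))
    return suspects


def find_uncoded(text: str) -> list:
    """Return (line number, character) for each non-ASCII character typed directly."""
    return [
        (number, ch)
        for number, line in enumerate(text.splitlines(), start=1)
        for ch in line
        if ord(ch) > 127
    ]


# Made-up sample text, not taken from a real finding aid.
sample = "Texas A&#M University\nCost: &#x00A5;500\nCaf\u00e9 records"
print(find_suspect_hashes(sample))  # [(1, 'Texas A&#M University')]
print(find_uncoded(sample))         # [(3, 'é')]
```

A flagged line is not necessarily an error, since a pound sign can legitimately appear in ordinary text; the output is only a list of places worth checking by eye.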
