Scholarly Societies Project

Editorial, 2001, December 20:

Implementing the Unicode Standard for Encoding Character Sets

The Problem

Although the Scholarly Societies Project is presented in English, a large number of the societies covered have names that require either diacritics or non-Latin scripts if they are to be properly displayed.

The problem is most acute with non-Latin scripts. Given a word written in a non-Latin script, it is indeed possible to use a transliteration scheme to provide a unique representation, called a transliteration, for that word in the Latin script. This transliteration, incidentally, will also capture some of the phonetic value of the word in the original language.

For users who are familiar with the non-Latin script in question, however, giving the word in the original script provides a more familiar and direct representation than a transliteration does.

Graphical vs Textual Solutions

Graphical solution [creating an image file that represents the words in the non-Latin script]

Advantages:
  • no special fonts are required

Disadvantages:
  • no change in font size when user changes Latin font size
  • no choice of font face by user is possible
  • no copying & pasting of characters by user is possible

Textual solution [using a textual encoding standard to create HTML code that represents the words in the non-Latin script]

Advantages:
  • change in font size when user changes Latin font size
  • choice of font face by user is possible
  • copying & pasting of characters by user is possible

Disadvantages:
  • special fonts must be installed
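
As an illustration of the textual solution, the sketch below shows one way that words in a non-Latin script could be turned into HTML code: each non-ASCII character is replaced by a decimal numeric character reference based on its Unicode code point. This is only an assumption about how such HTML might be produced, not a description of the procedure actually followed in the Project; the function name and the sample name are hypothetical.

```python
import html


def to_numeric_references(text: str) -> str:
    """Return text with every non-ASCII character replaced by an HTML
    decimal numeric character reference such as &#1040;."""
    # "xmlcharrefreplace" is a standard Python codec error handler.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# Hypothetical sample name (Russian for "Academy of Sciences").
sample = "Академия наук"
encoded = to_numeric_references(sample)
print(encoded)

# A browser decodes the references back to the original characters;
# html.unescape performs the same decoding here.
assert html.unescape(encoded) == sample
```

Because the result is plain ASCII text, it can be pasted into an HTML page regardless of the encoding declared for that page, while the user still needs a font that covers the characters in order to see them.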

Existing Standards for Textual Encoding of Non-Latin Scripts
Over the years, numerous incompatible Internet encodings have been developed for the major character sets in the world. A good overview of existing encodings (other than the Unicode Standard) for non-Latin scripts is given at the Non-Roman Scripts page at the Computing with Accents, Symbols & Foreign Scripts site maintained at Pennsylvania State University by Elizabeth J. Pyatt for the Center for Education Technology Services.

Until the emergence of the Unicode Standard, any character encoding would, at best, allow the representation of one non-Latin script and also the Latin script on a single webpage. With the Unicode Standard, it is now possible to represent numerous different scripts on a single webpage. [See the Unicode page at the Computing with Accents, Symbols & Foreign Scripts site maintained at Pennsylvania State University by Elizabeth J. Pyatt for the Center for Education Technology Services.]

It is this capability of representing numerous different scripts on a single webpage that makes the Unicode Standard the ideal choice for encoding non-Latin scripts in the Scholarly Societies Project.
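
The limitation of the older encodings can be demonstrated with a short sketch. It assumes two legacy encodings as examples, ISO 8859-7 (Greek) and ISO 8859-5 (Cyrillic), and uses a hypothetical helper, can_encode, with illustrative sample words; none of these choices come from the Project itself.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


greek = "Ακαδημία"      # illustrative Greek word ("Academy")
russian = "Академия"    # illustrative Russian word ("Academy")
mixed = f"{greek} / {russian}"

# Each legacy encoding handles its own script plus basic Latin...
print(can_encode(greek, "iso8859_7"))    # True
print(can_encode(russian, "iso8859_5"))  # True

# ...but neither can hold both scripts on the same page.
print(can_encode(mixed, "iso8859_7"))    # False
print(can_encode(mixed, "iso8859_5"))    # False

# A Unicode encoding such as UTF-8 represents all of them together.
print(can_encode(mixed, "utf-8"))        # True
```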

Implementation Procedures Followed in the Project
Work in encoding non-Latin scripts in the Scholarly Societies Project began near the end of June, 2001. Much of the encoding was completed by the end of that summer.

Detailed information on the procedures followed in implementing the Unicode Standard in the Scholarly Societies Project may be found in the Character Encoding section of the Linguistic Considerations page.

Proper Viewing of the Encodings
Detailed information on the proper way to view the Unicode encodings in the Scholarly Societies Project may be found in the Proper Viewing of the Encodings section of the Linguistic Considerations page.

Future Enhancements

Encoding of a More Comprehensive Character Set

As noted in the Character Encoding section of the Linguistic Considerations page, the current encodings have been limited to those that may be viewed with the Arial Unicode font, which is currently the single most comprehensive Unicode font. At the present time, the Arial Unicode font does not cover additional characters that were added to the Unicode Standard 2.1 to create the Unicode Standard 3.0, much less later versions.

In future, encodings will be expanded to include characters that go beyond the Unicode Standard 2.1. In particular, one area that would be enhanced by access to the full Unicode Standard 3.0 is the encoding of Arabic; the Unicode Standard 2.1 lacks certain special Arabic characters that the Editor has had to replace with related, more common, characters.


Search Engine Handling of Diacritical Marks and Non-Latin Characters

The Scholarly Societies Project currently does a reasonable job of encoding Latin characters with diacritical marks, or non-Latin characters, on web pages. The Project does not yet allow the user:

  • to copy a search string with Latin characters that have diacritical marks into the Search Engine search box, and have the Search Engine interpret the request correctly, nor
  • to copy a search string of non-Latin characters into the Search Engine search box, and have the Search Engine interpret the request correctly.

In general these are more difficult problems to solve than the encoding problem. The problem of allowing Latin characters with diacritical marks in a search string is likely to be the more tractable of the two problems; it is hoped that this will be solved within the next several months, as we move to a more powerful production system. The problem of allowing non-Latin characters in a search string is likely to take rather longer.
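
One common approach to the first of these problems is to fold diacritical marks before comparing the search string with the stored names, so that a query typed with or without accents matches either form. The sketch below is a minimal illustration under that assumption; it does not describe the Project's Search Engine, and the function names and sample names are hypothetical.

```python
import unicodedata


def fold_diacritics(text: str) -> str:
    """Return a lower-cased copy of text with combining marks removed."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, name: str) -> bool:
    """Report whether query occurs in name once diacritics are ignored."""
    return fold_diacritics(query) in fold_diacritics(name)


print(fold_diacritics("Société"))                      # societe
print(matches("academie", "Académie des sciences"))    # True
print(matches("Académie", "Academie des sciences"))    # True
```

Letters that Unicode does not decompose into a base letter and a combining mark, such as "ø", pass through unchanged, so a production system would need additional mapping rules for them. Non-Latin search strings raise further questions that this sketch does not address.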

Published 2001, December 20
Jim Parrott, Editor
Scholarly Societies Project.
